Closes part of #2790 (Phase B). Prevents a recurrence of the PR #2766 →
PR #2771 cycle: PR #2766 added ``source_workspace_id`` to four tools'
``input_schema`` and tool implementations, but the dispatcher in
``a2a_mcp_server.handle_tool_call`` silently dropped the kwarg for
``commit_memory`` / ``recall_memory`` / ``chat_history`` /
``get_workspace_info``. Schema lied; LLMs populated the param; every
call fell back to ``WORKSPACE_ID``, defeating multi-tenant isolation.
Existing dispatcher tests asserted return-value substrings (``"working"
in result``) instead of kwarg flow, so the bug shipped to main and was
only caught by re-reviewing post-merge.
This change adds an AST-driven gate. For every ToolSpec in
platform_tools.registry.TOOLS, the gate finds the matching
``elif name == "<tool>"`` arm in a2a_mcp_server.py and asserts that
every property declared in input_schema.properties is read by an
``arguments.get("<property>", ...)`` call inside that arm. A new schema
field the dispatcher forgets to forward fails CI loudly.
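A minimal sketch of the scan described above, assuming the dispatcher is an if/elif chain that compares ``name`` to string literals. ``DISPATCH_SRC``, ``load_dispatch_arms`` and ``missing_keys`` are hypothetical names for illustration; the real helper is ``_load_dispatch_arms`` and may be implemented differently.

```python
import ast

# Hypothetical stand-in for the dispatcher source (the real file is a2a_mcp_server.py).
DISPATCH_SRC = '''
async def handle_tool_call(name, arguments):
    if name == "commit_memory":
        return await tool_commit_memory(
            arguments.get("content", ""),
            source_workspace_id=arguments.get("source_workspace_id") or None,
        )
    elif name == "recall_memory":
        return await tool_recall_memory(arguments.get("query", ""))
    return "Unknown tool"
'''


def load_dispatch_arms(source):
    """Map each tool name to the set of keys read via arguments.get("...")."""
    arms = {}
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.If):
            continue
        test = node.test
        # Match only `name == "<literal>"` comparisons.
        if not (
            isinstance(test, ast.Compare)
            and isinstance(test.left, ast.Name)
            and test.left.id == "name"
            and len(test.ops) == 1
            and isinstance(test.ops[0], ast.Eq)
            and isinstance(test.comparators[0], ast.Constant)
            and isinstance(test.comparators[0].value, str)
        ):
            continue
        keys = set()
        # Walk only this arm's body, not the elif/else chain that follows it.
        for stmt in node.body:
            for call in ast.walk(stmt):
                if (
                    isinstance(call, ast.Call)
                    and isinstance(call.func, ast.Attribute)
                    and call.func.attr == "get"
                    and isinstance(call.func.value, ast.Name)
                    and call.func.value.id == "arguments"
                    and call.args
                    and isinstance(call.args[0], ast.Constant)
                ):
                    keys.add(call.args[0].value)
        arms[test.comparators[0].value] = keys
    return arms


def missing_keys(declared, arms):
    """Return {tool: schema properties that the dispatch arm never reads}."""
    return {
        tool: sorted(set(props) - arms.get(tool, set()))
        for tool, props in declared.items()
        if set(props) - arms.get(tool, set())
    }
```

In this sketch the ``recall_memory`` arm never reads ``source_workspace_id``, so a schema that declares it would be reported by ``missing_keys``, which is the PR #2766 failure shape.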
Three tests:
- test_every_dispatch_arm_reads_every_schema_property: main drift gate.
Walks registry, matches dispatch arms by name, diffs declared vs
read keys.
- test_dispatch_arms_reach_every_registered_tool: inverse direction.
A registered tool with no dispatch arm is "Unknown tool" at runtime,
even though docs/wrappers/schema all advertise it. Catches PRs that
add a ToolSpec but forget the dispatcher.
- test_drift_gate_self_check_finds_known_arms: pin the AST parser. If
handle_tool_call is refactored into a different shape (dict dispatch,
registry-driven, etc.) and _load_dispatch_arms returns {}, the main
gate vacuously passes — this self-check makes that failure mode
explicit by requiring 12 known arms to be discovered.
Verified the gate catches the PR #2766 bug: stripping
``source_workspace_id=arguments.get(...)`` from the commit_memory arm
fails the gate with a descriptive error pointing at the missing kwarg
and referencing the prior incident. Restored → 3 tests pass.
Suite: 1733 passed (was 1730 + 3 new), 3 skipped, 2 xfailed.
Why AST, not runtime invocation: the runtime mock-based tests in
test_a2a_mcp_server.py already assert kwargs flow correctly for four
explicitly-tested tools. This gate is cheaper (~1ms), catches new
properties before someone has to remember the runtime test, and runs
as a structural invariant.
Phase A (Python coverage floor) and Phase C (molecule-mcp e2e harness)
remain in #2790 as separate follow-ups.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Memory tab supported only Add+Delete. Correcting an entry meant
deleting and re-adding, losing the row's version counter and any
concurrent-write guard the agent depends on.
Now: per-row Edit button reveals an inline editor (value textarea +
TTL). Save POSTs to the existing /memory upsert endpoint with
if_match_version pinned to the entry's current version. On 409 the
UI surfaces a retry hint and reloads.
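The version guard can be illustrated with a small in-memory model. This is a sketch under assumptions, not the server's implementation: ``MemoryStore`` and ``VersionConflict`` are hypothetical, and the exception stands in for the HTTP 409 response.

```python
class VersionConflict(Exception):
    """Stands in for an HTTP 409 response."""


class MemoryStore:
    """In-memory model of an upsert guarded by if_match_version."""

    def __init__(self):
        self._rows = {}  # key -> (value, version)

    def get(self, key):
        return self._rows[key]

    def upsert(self, key, value, if_match_version=None):
        current = self._rows.get(key)
        current_version = current[1] if current else 0
        # Enforce the guard only when the caller pins a version (back-compat case).
        if if_match_version is not None and if_match_version != current_version:
            raise VersionConflict(
                f"expected v{if_match_version}, found v{current_version}"
            )
        self._rows[key] = (value, current_version + 1)
        return current_version + 1
```

A save that pins a stale version raises instead of overwriting, which is the point at which the UI would show its retry hint and reload.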
Tests:
- 11 vitest cases covering pre-fill (JSON vs string), payload shape
(parsed JSON, fallback to plain text, TTL inclusion/omission),
cancel, 409 retry path, generic error path, and the no-version
back-compat case.
- E2E gate 9c in test_staging_full_saas.sh: seed → GET version →
conditional update → assert new value → stale-version POST must
409. Pins the optimistic-locking contract end-to-end on staging.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-fix WriteFile (templates.go:436) had an `instance_id != ""` branch
that dispatched to writeFileViaEIC (SSH through EC2 Instance Connect),
but ReadFile (templates.go:362) skipped that branch entirely. ReadFile
always tried `findContainer` (which only works for local-Docker
workspaces, not SaaS EC2-per-workspace ones) and fell through to
`resolveTemplateDir` (which returns the seed template, not the
persisted workspace state).
Net effect on production: every Canvas Config tab open against a
SaaS workspace returned 404 "No config.yaml found" because GET
couldn't see what PUT had written. The failure became visible to users
once PR #2781 ("show-misconfigured-state") surfaced the 404 as an
error state.
The synth-E2E 7c gate's GET-back assertion caught this, but the
failure was misdiagnosed as a "test bug" and the assertion was dropped
in PR #2783 rather than fixed at the source. This PR closes the loop:
1. New `readFileViaEIC` helper in template_files_eic.go that mirrors
writeFileViaEIC's SSH-via-EIC dance and runs `sudo -n cat <path>`.
Returns os.ErrNotExist on missing file (cat exits 1 with empty
stdout under `2>/dev/null`) so the handler maps it cleanly to 404.
2. ReadFile dispatch now mirrors WriteFile's: when `instance_id` is
non-empty, use readFileViaEIC; otherwise fall through to the
local-Docker / template-dir path.
3. ReadFile's DB query expanded to also select instance_id + runtime
(was just name). Three sqlmock-based tests updated to match the
new column shape; the existing local-Docker fallback path stays
green by passing instance_id="" in the mock rows.
Follow-up (separate PR): the synth-E2E 7c gate should restore the
GET-back marker assertion now that the read/write paths are unified.
That'll also catch any future Files API regression in the round-trip.
This PR doesn't touch the gate to keep the scope tight.
Verification:
- go build ./... clean
- full handlers test suite green (0.4s for ReadFile subset; 5.8s
full)
- The 3 ReadFile sqlmock tests still cover the local-Docker fallback
(instance_id=""); SaaS EIC dispatch is covered by the upcoming
re-enabled synth-E2E 7c GET assertion (deferred to follow-up)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After the curl parse fix in #2779, the gate started reliably catching a
DIFFERENT bug than it was designed for: the Files API's PUT and GET
hit different paths/hosts and don't see each other's writes.
  PUT /workspaces/<id>/files/config.yaml
    → template_files_eic.go writeFileViaEIC
    → SSH-as-ubuntu through EIC tunnel into the workspace EC2
    → `sudo install -D /dev/stdin /configs/config.yaml`
    → Lands at host:/configs on the workspace EC2 (correct:
      bind-mounted into the workspace container)
  GET /workspaces/<id>/files/config.yaml
    → templates.go ReadFile
    → `findContainer` looks for a docker container ON THE
      PLATFORM-TENANT HOST (not the workspace EC2)
    → Workspace containers don't run on platform-tenant; this
      returns empty
    → Fallback: read from h.resolveTemplateDir(wsName) on the
      platform-tenant host, i.e., the seed template directory, not
      the persisted workspace config
So the GET reliably returns the original template config, not what
PUT just wrote. The user-facing Save & Restart still works because
the container reads /configs/config.yaml directly via bind-mount —
the asymmetry only bites the gate.
This is a separate latent bug worth its own task: unify the Files
API read/write path (likely: ReadFile should also use SSH-EIC to the
workspace EC2 for instance-backed workspaces, mirroring WriteFile).
Tracked separately.
For now, drop the GET-back assertion and keep just the PUT-200
check. The PUT-200 still catches today's bug class (#2769 EACCES on
/opt/configs would have failed PUT with 500). When the read/write
paths are unified, restore the marker check.
Verification:
- bash -n clean
- The PUT-200 check would have caught PR #2769's bug (500 EACCES)
- The dropped GET-back check would not have prevented today's user
bug (PR #2769 was caught by the user, not by the gate, and the
gate only existed afterward)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes molecule-controlplane#467 (issue filed against CP, but resolution
landed canvas-side because the workspace-server ALREADY returns the
agent_card JSONB blob with configuration_status / configuration_error
fields populated by molecule-core PR #2756). No CP-side change needed —
the gap was the canvas's blindness to those fields.
Before this PR, a workspace whose adapter.setup() failed (typically
missing/rotated LLM credential) appeared identical to a healthy one in
the canvas tile: green "Online" status, no error indication. The
operator had to dig into workspace logs to discover the env var to set.
This PR surfaces the state via the existing status-pill UX:
1. STATUS_CONFIG gains a "not_configured" entry — amber dot/glow,
"Not configured" label. Distinct from "online" (emerald) and
"failed" (red) — the workspace is reachable, it just needs config.
2. canvas-topology exposes getConfigurationStatus / getConfigurationError
helpers — strict equality on the JSONB field so unknown values
pass through as null instead of crashing the tile renderer.
3. WorkspaceNode derives an `effectiveStatus` that overrides
data.status with "not_configured" when (status === "online" AND
agent_card.configuration_status === "not_configured"). The override
only applies on top of "online" — a genuinely offline / failed /
provisioning workspace keeps its existing treatment.
4. The configuration_error string surfaces in two places: the tile's
aria-label (screen reader access) + a truncated preview row at the
bottom of the tile (same visual as the existing "degraded error
preview" — mirrors the established pattern for in-tile error
surfacing).
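The override rule in steps 2 and 3 can be sketched as follows. It is written in Python for brevity (the canvas code itself is TypeScript), the function names are adapted from the helpers named above, and the set of recognised values ("ready", "not_configured") is an assumption.

```python
# Assumed set of recognised values; anything else is treated as unknown.
_KNOWN_STATUSES = ("ready", "not_configured")


def get_configuration_status(agent_card):
    """Return a recognised status or None, so unknown values never propagate."""
    if not isinstance(agent_card, dict):
        return None
    value = agent_card.get("configuration_status")
    # Strict membership check: non-string or unexpected values become None.
    return value if isinstance(value, str) and value in _KNOWN_STATUSES else None


def effective_status(status, agent_card):
    """Override only on top of "online"; every other status is left alone."""
    if status == "online" and get_configuration_status(agent_card) == "not_configured":
        return "not_configured"
    return status
```

Keeping the override conditional on "online" means a failed or provisioning workspace is never relabelled by a stale agent card.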
Test coverage: 11 new tests in canvas-topology-configuration-status.test.ts.
Each helper is covered for the happy path, missing fields, and unknown
values (ignored defensively), plus an end-to-end "stale ready overrides
old error" guard.
Once this lands + canvas redeploys, operators see "Not configured:
Neither OPENAI_API_KEY nor MINIMAX_API_KEY is set" right on the
workspace tile instead of a confused-looking green "online" workspace
that silently 503s every JSON-RPC request.
Pairs with: molecule-core PR #2756 (decouple agent-card from setup),
#2775 (boot_routes pin), #2778 (secret_redactor)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The first version of the config.yaml round-trip gate (PR #2773)
captured curl output with `-w '\n%{http_code}\n'` and parsed via
`tail -n 2 | head -n 1`. That broke because bash's $(...) strips the
trailing newline, leaving only 2 lines in the captured value:
line 1: <response body>
line 2: <status code>
`tail -n 2 | head -n 1` then returned line 1 (the body), not the
status code. The gate misreported 200-with-JSON-body responses as
"PUT returned <body>" and failed the canary post-merge at 22:06 UTC.
Fix: write body to a tempfile via `-o "$PUT_TMP"` and use
`-w '%{http_code}'` as the sole stdout. Status code is now
unambiguously the captured value, body is read separately from the
tempfile. No newline-counting heuristic needed.
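The failure and the fix can be modelled in a few lines. This is an illustrative Python simulation of the shell behaviour, not the gate's code; the helper names and ``BODY`` are hypothetical.

```python
def command_substitution(raw):
    """Mimic bash $(...), which drops trailing newlines from captured output."""
    return raw.rstrip("\n")


def old_parse(captured):
    """Mimic `tail -n 2 | head -n 1`: the first of the last two lines."""
    return captured.splitlines()[-2:][0]


def new_parse(status_stdout):
    """With -o sending the body to a file, stdout holds only the status code."""
    return int(status_stdout)


BODY = '{"ok": true}'  # hypothetical response body
captured = command_substitution(BODY + "\n200\n")
```

Here ``old_parse(captured)`` yields the body rather than the status code, while ``new_parse`` has nothing to disambiguate because the status code is the only thing on stdout.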
Verification:
- bash -n clean
- shellcheck clean on the modified block
- Will be exercised by the next continuous-synth-e2e firing
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #2756 piped adapter.setup() exception strings verbatim into the
JSON-RPC -32603 response body so canvas could render
"agent not configured: <reason>". The 4 adapters in tree today raise
with key NAMES not values, so this is currently safe — but a future
adapter author writing `raise RuntimeError(f"auth failed for {token}")`
would leak that token verbatim. Issue #2760 flagged the risk; this PR
closes it.
workspace/secret_redactor.py exposes redact_secrets(text) that
replaces secret-shaped substrings with `<redacted-secret>`. Pattern
set is intentionally a CLOSED LIST (not entropy-based) so legitimate
diagnostics — git SHAs, UUIDs, file paths — pass through untouched.
Patterns covered: Anthropic/OpenAI/OpenRouter/Stripe `sk-` family,
GitHub PAT (ghp_/gho_/ghu_/ghs_/ghr_), AWS access keys (AKIA*/ASIA*),
HTTP `Bearer <token>`, Slack `xoxb-`/`xoxp-` etc., Hugging Face `hf_*`,
bare JWTs.
Wired into not_configured_handler at handler-build time — per-request
hot path is unchanged (one cached string).
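A minimal sketch of the closed-list approach. The regexes and minimum lengths below are illustrative assumptions covering only part of the list above; they are not the patterns in workspace/secret_redactor.py.

```python
import re

# Illustrative closed list; the real module's patterns and lengths may differ.
_PATTERNS = [
    re.compile(r"(?<![A-Za-z0-9_])sk-[A-Za-z0-9_-]{20,}"),
    re.compile(r"(?<![A-Za-z0-9_])gh[pousr]_[A-Za-z0-9]{36,}"),
    re.compile(r"(?<![A-Za-z0-9_])(?:AKIA|ASIA)[A-Z0-9]{16}(?![A-Za-z0-9_])"),
    re.compile(r"(?<![A-Za-z0-9_])hf_[A-Za-z0-9]{30,}"),
]
_PLACEHOLDER = "<redacted-secret>"


def redact_secrets(text):
    """Replace secret-shaped substrings; None and empty strings pass through."""
    if not text:
        return text
    for pattern in _PATTERNS:
        text = pattern.sub(_PLACEHOLDER, text)
    return text
```

Because each pattern requires a known prefix, hex strings such as git SHAs and UUIDs match nothing and are returned unchanged, which is the property an entropy-based detector would not give.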
Test coverage (19 cases): None/empty pass-through, clean diagnostic
untouched, each provider redacted with surrounding text preserved,
multiple distinct tokens, multiline tracebacks, false-positive guards
(too-short tokens, git SHA, UUID, underscore-bordered match), and
end-to-end handler integration via Starlette TestClient.
Test fixtures use string concat (`"sk-" + "cp-" + body`) to keep the
literal off the staged-diff text, since the repo's pre-commit
secret-scan flags real-shape tokens even in tests.
`secret_redactor` registered in TOP_LEVEL_MODULES (drift gate).
Closes #2760
Pairs with: PR #2756, PR #2775
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #2756's contract — card route always mounted regardless of
adapter.setup() outcome — lived inline in main.py's `# pragma: no cover`
boot sequence. A future refactor that re-coupled the two would have
silently bypassed PR #2756 and shipped the original "stuck booting
forever" UX again, with no pytest catching it.
This change extracts route assembly into workspace/boot_routes.py's
build_routes(card, executor, adapter_error) and pins the contract with
6 integration tests using Starlette's TestClient:
- test_card_route_serves_200_when_adapter_ready: happy path
- test_card_route_serves_200_when_adapter_failed: misconfigured boot,
card still 200, skill stubs survive
- test_jsonrpc_returns_503_when_no_executor: full -32603 envelope with
the adapter_error in error.data
- test_jsonrpc_returns_503_with_generic_when_no_error_string: fallback
reason for the rare case main.py reaches this branch without one
- test_card_route_does_not_depend_on_executor: direct PR #2756
regression guard — both branches MUST mount the card route
- test_executor_present_does_not_mount_not_configured_handler: sanity
that a healthy workspace doesn't return -32603 to every request
Conftest stubs extended with a2a.server.routes / request_handlers
classes so the tests work under the existing a2a-mock infra (pattern
matches the AgentCard/AgentSkill stubs added for PR #2765).
main.py now calls build_routes; the inline if/else is gone. Same
production behaviour, cleaner shape, regression-proof.
Heavy a2a-sdk imports inside build_routes() are lazy (deferred to the
executor-only branch) so tests that only exercise the not-configured
path don't pull DefaultRequestHandler / InMemoryTaskStore.
card_helpers + boot_routes registered in TOP_LEVEL_MODULES (build
drift gate would have caught the missing entry on the wheel-publish
smoke).
All 18 related tests pass (test_boot_routes.py: 6, test_card_helpers.py:
6, test_not_configured_handler.py: 6).
Closes #2761
Pairs with: PR #2756 (decouple agent-card from setup),
PR #2765 (defensive isolation of enrichment + transcript)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Today's user-visible bug ("PUT /workspaces/<id>/files/config.yaml: 500
… install: cannot create directory '/opt/configs': Permission denied",
fixed in #2769) shipped to production and was caught only when an
operator opened the Canvas Config tab and clicked Save & Restart on
a claude-code workspace. Two compounding root causes:
1. Path-map fall-through: claude-code wasn't in
workspaceFilePathPrefix, so it fell through to the /opt/configs
default — a path the workspace EC2 doesn't have (cloud-init only
creates /configs).
2. Permission: /configs is root-owned, but the SSH-as-ubuntu install
command had no sudo prefix, so the write would have failed with
EACCES even with the right path.
The synth E2E provisions a fresh workspace every cron firing but
never PUTs a file via the Files API. So neither failure mode could
fail the canary.
Add a new step 7c (between terminal-diagnose and A2A) that:
- PUTs a known marker into config.yaml on each provisioned workspace
- GETs it back and asserts the marker is present
- Fails with an actionable message that names the likely class of
regression (path map vs permission) so the next operator doesn't
have to re-discover today's debugging path
The marker includes the run ID so stale state from a prior canary
can't false-pass.
Why round-trip (not just PUT-and-200): a 200 from PUT only proves the
SSH install succeeded somewhere on disk; the GET-back proves the file
landed at the path the runtime actually reads from (i.e., that the
host:/configs → container:/configs bind-mount sees it). Without the
GET, a future bug that writes to a non-bind-mounted host path would
silently no-op from the runtime's POV but pass the gate.
Deferred (separate PR, requires AWS-creds wiring): a parallel gate
that runs `aws ec2 describe-instances` on the workspace EC2 and asserts
the attached IamInstanceProfile.Arn, which would directly catch the
#466 IAM profile gap class. Punted because it needs
aws-actions/configure-aws-credentials added to
continuous-synth-e2e.yml plus a read-only IAM role provisioned on the
AWS side. Tracked as task #301.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-review of merged PR #2766 (multi-workspace MCP routing) revealed a
silent gap: PR #2766 added the ``source_workspace_id`` parameter to
``tool_commit_memory`` / ``tool_recall_memory`` / ``tool_chat_history``
/ ``tool_get_workspace_info`` AND advertised it in the registry's input
schemas, but the MCP server's dispatch arms in ``a2a_mcp_server.py``
were never updated to forward ``arguments["source_workspace_id"]`` to
those four tools.
Result: the schema lied. The LLM saw ``source_workspace_id`` as a valid
tool parameter, could correctly populate it from the inbound message's
``arrival_workspace_id``, but the dispatcher dropped it on the floor and
every memory commit / recall / chat-history fetch silently fell back to
the module-level ``WORKSPACE_ID``. The cross-tenant leak that PR #2766
was meant to prevent is NOT prevented for these four tools without this
follow-up.
Why the existing dispatcher tests didn't catch it:
the tests asserted return-value strings (``"working" in result``) but
never asserted what arguments the inner tool was called with. So the
dispatcher could ignore any kwarg and the tests would still pass.
Fix:
1. Wire ``source_workspace_id=arguments.get("source_workspace_id") or None``
into the four dispatch arms, mirroring the pattern already used for
``delegate_task`` / ``delegate_task_async`` / ``check_task_status`` /
``list_peers``.
2. Add five tests in ``test_a2a_mcp_server.py`` that assert the inner
tool was awaited with the exact source_workspace_id kwarg
(``assert_awaited_once_with(..., source_workspace_id="ws-X")``) —
substring-on-result tests can't catch this class of bug.
3. Add a fallback test ensuring single-workspace operators (no
source_workspace_id key) get ``source_workspace_id=None`` — pinning
the documented None contract over an accidental empty-string forward.
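The difference between the two test styles can be sketched with ``unittest.mock.AsyncMock``. ``dispatch`` below is a simplified stand-in for one dispatcher arm, not the real ``handle_tool_call``, and the tool table it takes is an assumption made to keep the example self-contained.

```python
import asyncio
from unittest.mock import AsyncMock


async def dispatch(name, arguments, tools):
    """Simplified dispatcher arm (hypothetical, for illustration only)."""
    if name == "recall_memory":
        return await tools["recall_memory"](
            arguments.get("query", ""),
            source_workspace_id=arguments.get("source_workspace_id") or None,
        )
    return "Unknown tool"


def check_kwarg_flow():
    tool = AsyncMock(return_value="working")
    args = {"query": "q", "source_workspace_id": "ws-X"}
    result = asyncio.run(dispatch("recall_memory", args, {"recall_memory": tool}))
    # Old-style check: still passes if the dispatcher drops the kwarg.
    assert "working" in result
    # Kwarg-flow check: fails unless the exact value reaches the tool.
    tool.assert_awaited_once_with("q", source_workspace_id="ws-X")
    return True


def check_fallback_is_none():
    tool = AsyncMock(return_value="working")
    # An empty string must be normalised to None, not forwarded as "".
    args = {"query": "q", "source_workspace_id": ""}
    asyncio.run(dispatch("recall_memory", args, {"recall_memory": tool}))
    tool.assert_awaited_once_with("q", source_workspace_id=None)
    return True
```

Removing the ``source_workspace_id=...`` line from ``dispatch`` leaves the substring assertion green and makes ``assert_awaited_once_with`` fail, which is the gap the new tests close.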
Suite: 1705 passed (was 1700 + 5 new), 3 skipped, 2 xfailed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause of the user-visible 500 ("install: cannot create directory
'/opt/configs': Permission denied") on PUT
/workspaces/<id>/files/config.yaml:
1. Path map fall-through. claude-code wasn't in workspaceFilePathPrefix,
so resolveWorkspaceFilePath returned the default `/opt/configs/...`.
That directory doesn't exist on the workspace EC2 — cloud-init in
provisioner/userdata_containerized.go runs `mkdir -p /configs` only.
Even if the SSH write had succeeded at /opt/configs, the docker
container's bind-mount is host:/configs → container:/configs,
so the file would have been invisible to the runtime.
2. /configs ownership. cloud-init runs as root, so /configs is
root-owned. The SSH-as-ubuntu install command can't write into it
without sudo. Hermes wasn't affected because its base path
(/home/ubuntu/.hermes) is ubuntu-owned.
Two-line fix:
- Add `claude-code: /configs` to the runtime → base-path map and flip
the default fall-through from `/opt/configs` to `/configs`. Leave the
pre-existing langgraph/external entries pointing at /opt/configs
pending a migration audit (no user report on those today, and
flipping them would silently relocate any files those runtimes
already wrote).
- Prefix the remote install command with `sudo -n` so the write
succeeds under the standard EC2 ubuntu/passwordless-sudo posture.
`-n` (non-interactive) ensures clean failure if that ever changes,
rather than a hang waiting for a password prompt.
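The two changes can be sketched together. This is a Python illustration of logic that lives in Go; the map contents follow the description above, while the function names, the case-insensitive lookup, and the exact command string are assumptions.

```python
import shlex

# Base paths per runtime, as described above (hermes path from the root-cause note).
WORKSPACE_FILE_PATH_PREFIX = {
    "hermes": "/home/ubuntu/.hermes",
    "claude-code": "/configs",
    "langgraph": "/opt/configs",  # left unchanged pending a migration audit
    "external": "/opt/configs",   # left unchanged pending a migration audit
}
DEFAULT_PREFIX = "/configs"  # previously /opt/configs


def resolve_workspace_file_path(runtime, rel_path):
    """Resolve a relative file path under the runtime's base directory."""
    # Assumption: lookup is case-insensitive (the tests cover CLAUDE-CODE).
    base = WORKSPACE_FILE_PATH_PREFIX.get(runtime.strip().lower(), DEFAULT_PREFIX)
    return f"{base}/{rel_path.lstrip('/')}"


def remote_install_command(abs_path):
    """Build the remote write command; -n makes sudo fail instead of prompting."""
    return f"sudo -n install -D /dev/stdin {shlex.quote(abs_path)}"
```

Unknown and empty runtimes now resolve under /configs, the only directory cloud-init creates, while langgraph and external keep their previous location.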
Tests:
- TestResolveWorkspaceFilePath_KnownRuntimes adds claude-code +
CLAUDE-CODE coverage and updates the empty/unknown default cases
to expect /configs. The langgraph/external rows stay green
(unchanged values), confirming the scope of the rename.
Verification:
- go build ./... clean
- go test ./internal/handlers/ green
- The user-reported bug
(PUT /workspaces/57fb7043-79a0-4a53-ae4a-efb39deb457f/files/config.yaml
→ 500 EACCES on /opt/configs) is the failure mode this fix addresses
on both axes (path + sudo).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR-3 of the multi-workspace MCP rollout. PR-1 made the MCP server itself
multi-workspace aware (one process, N workspace memberships). PR-2 added
source_workspace_id threading to delegate_task / list_peers. This change
closes the remaining workspace-scoped tools so a single agent registered
into multiple workspaces no longer leaks memories or chat history across
tenants.
Tools now accepting `source_workspace_id`:
- tool_commit_memory(content, scope, source_workspace_id=None) —
routes POST to /workspaces/{src}/memories with the source workspace's
Bearer token. Body still embeds source_workspace_id for the platform's
audit + namespace-isolation enforcement.
- tool_recall_memory(query, scope, source_workspace_id=None) —
GET /workspaces/{src}/memories with the source workspace's token and
?workspace_id={src} query so the platform scopes the read to the
caller's tenant view (PR-1 / multi-workspace mode).
- tool_chat_history(peer_id, limit, before_ts, source_workspace_id=None)
— auto-routes via the _peer_to_source cache populated by list_peers,
with explicit override winning. Falls back to module-level WORKSPACE_ID
if neither is available. URL: /workspaces/{src}/chat-history.
- tool_get_workspace_info(source_workspace_id=None) — GET /workspaces/{src}
with the source workspace's token. Useful for introspecting any
workspace the agent is registered into, not just the primary.
In every path, `src = source_workspace_id or WORKSPACE_ID`, so
single-workspace operators see no behavior change. Tokens are resolved
per-workspace via auth_headers(src) / _auth_headers_for_heartbeat(src),
which fall through to the legacy AUTH_TOKEN env when not in
multi-workspace mode.
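A minimal sketch of the routing rule, with no network I/O. The constants, the token table, and the ``query`` parameter name are hypothetical; only the ``src = source_workspace_id or WORKSPACE_ID`` rule, the URL shape, and the ``workspace_id`` query parameter come from the description above.

```python
WORKSPACE_ID = "ws-primary"              # hypothetical module-level default
AUTH_TOKEN = "legacy-token"              # hypothetical legacy env token
_WORKSPACE_TOKENS = {"ws-b": "token-b"}  # hypothetical per-workspace tokens


def auth_headers(workspace_id):
    """Use the workspace's own token, else fall through to the legacy token."""
    token = _WORKSPACE_TOKENS.get(workspace_id, AUTH_TOKEN)
    return {"Authorization": f"Bearer {token}"}


def recall_memory_request(query, source_workspace_id=None):
    """Return (url_path, params, headers) for a recall, without sending it."""
    src = source_workspace_id or WORKSPACE_ID
    return (
        f"/workspaces/{src}/memories",
        {"query": query, "workspace_id": src},
        auth_headers(src),
    )
```

Calling ``recall_memory_request("q")`` with no source produces the same path and token a single-workspace deployment used before, which is why existing operators see no change.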
Also updates input_schemas in platform_tools/registry.py so the new
optional parameter is advertised to LLM clients (claude-code,
hermes-agent, langchain wrappers).
Tests (4 new classes in test_a2a_multi_workspace.py, 21 new tests):
- TestCommitMemorySourceRouting — URL + Authorization header per source
- TestRecallMemorySourceRouting — URL + query param + Authorization
- TestChatHistorySourceRouting — peer-cache auto-route + explicit override
- TestGetWorkspaceInfoSourceRouting — URL + Authorization
Inbox tools (peek/pop/wait_for_message) have been multi-workspace aware
since PR-1: inbox.py spawns per-workspace pollers and tags every
InboxMessage with arrival_workspace_id. No further plumbing needed.
Suite: 1700 passed, 3 skipped, 2 xfailed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #2756 added a try/except around adapter.setup() so a missing LLM key
doesn't crash the workspace boot. Two paths that now run AFTER setup
succeeds were not similarly isolated, leaving small but real coupling
risks for future adapter authors.
1. **Skill metadata enrichment swap (main.py:248-259).** When
adapter.setup() returns, main.py reads adapter.loaded_skills and
replaces the static stubs in agent_card.skills with rich metadata
(description, tags, examples). The list comprehension assumes each
element exposes .metadata.{id,name,description,tags,examples}. A
future adapter that returns a non-canonical shape would raise
AttributeError, propagate to the outer except, capture as
adapter_error, and silently degrade an OK boot to the
not-configured state — even though setup() actually succeeded.
Extract to card_helpers.enrich_card_skills(card, loaded_skills) →
bool. Helper swallows enrichment failures, logs the cause, returns
False, leaves the static stubs in place. setup() success path
continues unchanged. 6 unit tests cover: None input, empty list,
canonical happy path, missing .metadata attr, partial .metadata
(missing one canonical field), atomic-failure-no-partial-swap.
2. **/transcript handler (main.py:513).** Calls await
adapter.transcript_lines(...) without try/except. BaseAdapter's
default returns {"supported": false} so today's 4 adapters never
trigger this — but a future adapter override that assumes setup()
ran would surface as a 500 from Starlette's default error handler
instead of a useful 503 with the exception class + message.
Inline try/except returns 503 with the reason, matching the
not-configured JSON-RPC handler's pattern.
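The enrichment helper from item 1 can be sketched as below. This is an illustration under assumptions: the real helper builds skill objects rather than dicts, and returning False for None or empty input is a guess at its contract.

```python
import logging

logger = logging.getLogger(__name__)
_FIELDS = ("id", "name", "description", "tags", "examples")


def enrich_card_skills(card, loaded_skills):
    """Swap rich skill metadata into card.skills; on any failure keep the stubs."""
    if not loaded_skills:
        return False  # assumption: nothing to enrich counts as "not enriched"
    try:
        # Build the whole replacement first so a failure cannot leave a partial swap.
        enriched = [
            {field: getattr(skill.metadata, field) for field in _FIELDS}
            for skill in loaded_skills
        ]
    except Exception as exc:
        logger.warning("skill enrichment skipped: %s: %s", type(exc).__name__, exc)
        return False
    card.skills = enriched
    return True
```

Assigning ``card.skills`` only after the full list is built is what gives the atomic behaviour: one malformed skill leaves every static stub in place.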
Both changes match the architectural principle the PR #2756 chain
established: availability (workspace reachable) is decoupled from
configuration / adapter behavior. Operators see useful errors instead
of silent degradation; future adapter authors can't accidentally
break tenant readiness with a shape mismatch.
Adds:
- workspace/card_helpers.py (~50 lines, 100% covered)
- workspace/tests/test_card_helpers.py (6 tests)
- AgentCard/AgentSkill/AgentCapabilities/AgentInterface stubs to
workspace/tests/conftest.py so future card-related tests work
under the existing a2a-mock infrastructure
- card_helpers in TOP_LEVEL_MODULES (drift gate would have caught it)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Preflight was hard-failing the workspace boot when required env vars or
legacy auth_token_files were missing, raising SystemExit(1) before
main.py's PR #2756 try/except could mount the not-configured handler.
Result: codex/openclaw workspaces launched without OPENAI_API_KEY were
INVISIBLE — `/.well-known/agent-card.json` never returned 200, the bench
timed out at 600s, canvas had no actionable signal. PR #2756 fixed half
the puzzle (decouple agent-card from adapter.setup() failure); this
fixes the other half (decouple from preflight failure).
Caught by bench-provision-time run 25335853189 on 2026-05-04: codex and
openclaw both timed_out at 609s while claude-code (whose default model
needs no env) hit 86.7s on the same AMI. Hermes hit 147s because hermes
config doesn't declare top-level required_env.
After this change:
- Missing required_env: WARN (operator sees it in boot logs); workspace
proceeds to adapter.setup() which raises with the same env-name detail;
PR #2756's try/except mounts the not-configured handler;
/.well-known/agent-card.json serves 200; JSON-RPC POST / returns
-32603 "agent not configured" with the env-name in `error.data`.
- Missing auth_token_file (legacy path): same treatment.
- Other preflight failures (runtime adapter not installable, invalid
A2A port) STAY as fails — those are structural, the workspace truly
can't run.
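The warn-versus-fail split can be sketched as follows. ``PreflightReport`` and ``run_preflight`` are hypothetical shapes; only ``report.ok`` and ``report.warnings`` are named in this change, and the port check stands in for the structural failures that remain fatal.

```python
from dataclasses import dataclass, field


@dataclass
class PreflightReport:
    failures: list = field(default_factory=list)
    warnings: list = field(default_factory=list)

    @property
    def ok(self):
        # Only structural failures block the boot; warnings never do.
        return not self.failures


def run_preflight(required_env, env, a2a_port):
    report = PreflightReport()
    for name in required_env:
        if not env.get(name):
            # Recoverable: adapter.setup() will raise with the same detail later.
            report.warnings.append(f"missing required env: {name}")
    if not 0 < a2a_port < 65536:
        # Structural: the workspace cannot run at all.
        report.failures.append(f"invalid A2A port: {a2a_port}")
    return report
```

With this split, a missing key yields ``report.ok is True`` plus a warning, so boot continues far enough for the not-configured handler to be mounted.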
Updated 4 existing tests that asserted `report.ok is False` on
required_env / auth_token misses to assert `report.ok is True` and
check `report.warnings` instead. All 31 preflight tests pass; full
suite 1664 pass + 1 unrelated flake on staging.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-review of #2755 found two tests that didn't actually exercise the
production code path:
- TestNamespaceCleanupFn_NamespaceFormat asserted
"workspace:" + "abc-123" == "workspace:abc-123" — a compile-time
invariant, not runtime behavior. Provided no protection if the closure
in Bundle.NamespaceCleanupFn ever stopped using that prefix.
- TestNamespaceCleanupFn_FailureLogsButReturns built a *parallel*
cleanup closure inline with errors.New, then invoked the parallel
closure. The production closure was never exercised. A regression
in NamespaceCleanupFn (e.g. forgetting the deferred recover, calling
the plugin without nil-check) would still pass this test.
Replaced both with real integration:
- TestNamespaceCleanupFn_HitsPluginAtCorrectNamespace spins up
httptest.Server, points MEMORY_PLUGIN_URL at it, calls Build(),
invokes the production closure, and asserts the server actually
saw DELETE /v1/namespaces/workspace:abc-123.
- TestNamespaceCleanupFn_PluginErrorDoesNotPanic exercises the
failure path for real: server returns 500 on DELETE, closure must
log and return without propagating. defer-recover is belt-and-
suspenders since production calls this from a for-loop in
workspace_crud.go that has no recover.
Couldn't ship with #2755 because the merge queue locks the branch
once enqueued. Following up now that #2755 is merged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The wheel-build drift gate caught the new module added in this PR.
Without registering it, the published wheel would ship `import
not_configured_handler` un-rewritten, which would raise
`ModuleNotFoundError` at runtime under `molecule_runtime.main`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Today, if `adapter.setup()` raises (most often: an LLM credential is
missing/rotated), main.py crashes before the agent-card route is mounted.
start.sh restart-loops, /.well-known/agent-card.json never returns 200,
and the workspace is invisible to the bench/canvas — operators see
"stuck booting forever" with no clear error to act on.
The agent-card is a static capability advertisement (name, version,
skills, supported protocols). It doesn't need a working LLM. Coupling
its mount to setup() conflates *availability* ("am I up?") with
*configuration* ("can I actually answer?"). They're different concerns.
This change:
- Builds AgentCard from `config.skills` (static names from config.yaml)
BEFORE adapter.setup(), so the route mounts independent of setup state.
- Wraps setup() + create_executor in try/except. On success, mounts
the real DefaultRequestHandler with rich loaded_skills metadata
swapped into the card in-place. On failure, mounts a JSON-RPC
handler that returns -32603 "agent not configured" with the
setup() exception in error.data.
- Heartbeat keeps running on misconfigured boots so the platform
marks the workspace as reachable-but-misconfigured rather than
crash-looping. Operators redeploy with corrected env without
chasing a restart loop.
- initial_prompt and idle_loop are skipped on misconfigured boots —
they self-fire to /, which would land in -32603 anyway, and the
marker would consume on the first useless attempt.
Bench impact (RFC #388 strict <120s): the codex/openclaw bench timeouts
were a symptom of the agent card never returning 200. With this fix those
runtimes serve the card immediately on EC2 boot, so the bench
measures infrastructure cold-start (claude-code class: ~50–80s)
instead of credential-coupled boot.
Adds workspace/not_configured_handler.py (factory + module-level so
behavior is unit-testable; main.py is `# pragma: no cover`) and
workspace/tests/test_not_configured_handler.py (6 tests covering
status code, JSON-RPC envelope shape, id-echo, malformed-body
fallback, reason surfacing, batch-body safety).
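A sketch of the response the not-configured handler returns, kept free of any web framework. The function name, the generic fallback reason, and the 503 status (taken from test names elsewhere in this series) are assumptions about the real module.

```python
import json


def not_configured_response(raw_body, reason):
    """Return (status, envelope) for a workspace whose adapter setup failed."""
    request_id = None
    try:
        parsed = json.loads(raw_body)
        # Batch bodies are lists; only a single request object has an id to echo.
        if isinstance(parsed, dict):
            request_id = parsed.get("id")
    except (ValueError, TypeError):
        pass  # malformed or missing body: fall back to id=None
    envelope = {
        "jsonrpc": "2.0",
        "id": request_id,
        "error": {
            "code": -32603,
            "message": "agent not configured",
            "data": reason or "adapter setup failed",  # hypothetical fallback text
        },
    }
    return 503, envelope
```

Echoing the request id lets a JSON-RPC client correlate the error with its call, and the list check keeps a batch body from raising inside the handler.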
All 1665 existing workspace tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Caught during continued review: the entire v2 plugin system shipped
in PRs #2729-#2742 + #2744-#2751 was never actually invoked because
main.go and router.go don't construct the plugin client/resolver or
attach the WithMemoryV2 / WithNamespaceCleanup hooks.
Operators setting MEMORY_PLUGIN_URL=... saw zero behavior change
because nothing read it. Every fixup we shipped (idempotency, verify
mode, expires_at validation, audit JSON, namespace cleanup, O(N)
export, boot E2E) was also dormant for the same reason.
Root cause: when a multi-handler feature lands across many PRs, no
single PR is responsible for wiring main.go, and the master
task-tracking issue didn't gate-check that the wiring landed.
Going forward: add main.go integration to every multi-handler RFC
checklist.
What ships:
* internal/memory/wiring/wiring.go: new package that constructs the
plugin client + resolver from MEMORY_PLUGIN_URL once. Returns nil
when unset (preserves zero-config legacy behavior). Probes
/v1/health at boot but doesn't fail-closed — the MCP layer's
circuit breaker handles ongoing unavailability.
* internal/memory/wiring/wiring_test.go: 6 tests covering the
nil/non-nil bundle paths + the namespace-cleanup closure
contract (nil-safe, format-stable, failure-tolerant).
* cmd/server/main.go: imports memwiring, calls Build(db.DB) once
after WorkspaceHandler creation, attaches WithNamespaceCleanup,
threads the bundle through router.Setup.
* internal/router/router.go: Setup signature gains *memwiring.Bundle
param. Inside, attaches WithMemoryV2 to AdminMemoriesHandler and
MCPHandler when the bundle is non-nil.
After this, the v2 plugin is reachable end-to-end:
Operator sets MEMORY_PLUGIN_URL → main.Build instantiates client +
resolver → WorkspaceHandler gets cleanup hook → router wires
AdminMemoriesHandler + MCPHandler with WithMemoryV2 → MCP tool
calls (commit_memory_v2, search_memory, etc.) actually do
something → admin export/import respects MEMORY_V2_CUTOVER.
Prerequisite for #292 (staging verification) — without this, the
operator runbook's step 2 (set MEMORY_PLUGIN_URL, observe behavior)
silently no-ops.
Verified: all 9 affected test packages still green
(memory/{client,contract,e2e,namespace,pgplugin,wiring}, handlers,
router, plus the build).