Same root cause as the workspace/molecule_ai_status.py docstring fix
in this PR: this doc claimed `molecule-monorepo-status` was a usable
shell alias and `from molecule_ai_status import set_status` was a
usable Python import. Both worked under the pre-#87 monolithic-template
layout (where workspace/Dockerfile created the symlink and COPY'd the
modules into /app/) but neither works in current standalone template
images that install the runtime as a wheel:
- `which molecule-monorepo-status` errors — only `a2a-db` and
`molecule-runtime` are registered console scripts.
- `from molecule_ai_status` raises ImportError — modules are under the
`molecule_runtime` package now.
Switched both examples to the canonical `python3 -m
molecule_runtime.molecule_ai_status` form (CLI) and `from
molecule_runtime.molecule_ai_status import set_status` (Python). This is
the same form the runtime prints in its own usage banner, so anyone
discovering this doc gets a runnable example.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-existing test_set_status_exception_prints_to_stderr asserted on the
legacy "molecule-monorepo-status: failed to update" prefix string. The
prior commit renamed it to "molecule_ai_status: failed to update" so
the printed label matches the canonical module-form invocation
(`python3 -m molecule_runtime.molecule_ai_status`) instead of a shell
alias that only ever existed in the dev-only base image. Updating the
expected substring in lockstep.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comprehensive sweep follow-up to the MCP server path fix. Audited every
/app/ reference in the runtime source against the live claude-code
template image and confirmed the actual /app/ contents post-#87 are
ONLY: __init__.py, adapter.py, claude_sdk_executor.py, requirements.txt
— every other workspace module ships in the wheel under
site-packages/molecule_runtime/. Two more leaks found:
1. executor_helpers.py:_A2A_INSTRUCTIONS_CLI — inter-agent system prompt
for non-MCP runtimes (Ollama, custom) had 5 lines telling the model
`python3 /app/a2a_cli.py X`. Models copy these examples verbatim, so
every CLI-runtime delegation would fail at the shell layer (no such
file). Replaced with `python3 -m molecule_runtime.a2a_cli` form,
which works regardless of where the wheel is installed.
2. molecule_ai_status.py docstring — usage examples invoked
`python3 /app/molecule_ai_status.py` and claimed a
`molecule-monorepo-status` shell alias. Both broken in current
templates: the file's at site-packages, and `which
molecule-monorepo-status` errors (the legacy symlink only existed
in the dev-only workspace/Dockerfile base image, not in the
standalone template Dockerfiles that ship to production).
Updated docstring + the __main__ usage banner + the stderr error
prefix to use the same `python3 -m molecule_runtime.X` form.
Plugins audited and clean: WORKSPACE_PLUGINS_DIR=/configs/plugins,
SHARED_PLUGINS_DIR=$PLUGINS_DIR fallback /plugins. No /app/
assumptions.
Regression test: `test_a2a_cli_instructions_use_module_invocation_not_legacy_app_path`
asserts the legacy /app/a2a_cli.py path can't drift back into the CLI
system prompt and that the canonical module form is present.
The legacy workspace/Dockerfile + workspace/entrypoint.sh + workspace/scripts/
still contain /app/-shaped paths but are dev-only base-image scaffolding
(per workspace/build-all.sh's own header comment) — not shipped to the
standalone template images. Out of scope here; can be cleaned up in a
separate dead-code pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DEFAULT_MCP_SERVER_PATH was hardcoded to /app/a2a_mcp_server.py, which
was correct under the pre-#87 monolithic-template Docker layout where
the workspace/ tree was COPY'd into /app/. After the universal-runtime
refactor (#87, #117), workspace modules ship inside the
molecule-ai-workspace-runtime wheel under
site-packages/molecule_runtime/, while /app/ now holds only
template-specific files (adapter.py + the runtime-native executor for
that template).
Net effect: in every workspace built since the wheel cutover, Claude
Code SDK's mcp_servers={"a2a": {"command": python, "args":
["/app/a2a_mcp_server.py"]}} pointed at a missing file. The subprocess
launch failed silently, the SDK registered zero MCP tools, and the
agent's list_peers / delegate_task / a2a_send_message / a2a_send_signal
all disappeared. Symptom observed today: Design Director said
"I tried to reach the perf auditor via the inter-agent MCP tools
(list_peers, delegate_task) but those tools didn't resolve in this
environment" and fell back to running the audit itself with WebFetch.
Why this slipped through E2E: the priority-runtimes harness sends a
single message and verifies a reply — it does not exercise inter-agent
delegation, so the missing MCP tools are invisible at that layer.
Fix: resolve the path relative to executor_helpers.py via __file__,
which tracks wherever the wheel is installed (site-packages today,
anywhere else tomorrow). The A2A_MCP_SERVER_PATH env override is
preserved for tests / non-default layouts.
Regression test: assert os.path.exists(DEFAULT_MCP_SERVER_PATH) so
any future move of a2a_mcp_server.py out of the package directory
fails at unit-test time instead of silently disabling delegation in
production.
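The fix shape, as a minimal sketch (function name assumed; not the
actual executor_helpers.py code):

```python
import os

def default_mcp_server_path() -> str:
    # Resolve relative to this module's own file, so the path tracks
    # wherever the wheel lands (site-packages today, anywhere tomorrow);
    # the A2A_MCP_SERVER_PATH env var still wins for overrides.
    here = os.path.dirname(os.path.abspath(__file__))
    return os.environ.get(
        "A2A_MCP_SERVER_PATH",
        os.path.join(here, "a2a_mcp_server.py"),
    )
```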
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Audited every a2a-sdk surface in workspace/ against the installed
1.0.2 wheel. Found and fixed:
main.py (the live workspace startup path):
• create_jsonrpc_routes(rpc_url='/', enable_v0_3_compat=True) —
rpc_url required in 1.x; v0.3 compat enables inbound legacy
clients (`"role": "user"` lowercase) without forcing them to
upgrade. Pairs with the outbound rename below.
a2a_executor.py:
• TextPart/FilePart/FileWithUri removed in 1.x. Part is now a
flat proto message: Part(text=…) / Part(url=…, filename=…,
media_type=…). Updated the file-attachment branch (only
reachable when an agent emits files; the harness's PONG path
didn't exercise this, but it's a latent crash).
• Message field names: messageId/taskId/contextId →
message_id/task_id/context_id (proto3 snake_case).
• Role enum: Role.agent → Role.ROLE_AGENT (proto enum).
Outbound JSON-RPC payloads (8 files):
• "role": "user" → "role": "ROLE_USER" — proto3 JSON serialization
is strict about enum values. Sites: a2a_client, a2a_cli, main
(initial+idle prompts), heartbeat, builtin_tools/a2a_tools,
builtin_tools/delegation. Wire JSON keys stay camelCase
(proto3 default), only the role enum value changed.
google-adk/adapter.py:
• new_agent_text_message → new_text_message (4 sites). This
adapter's directory has a hyphen, so it can't be imported as a
Python module — effectively dead code, but the wheel ships the
file and a future fix should keep it correct against 1.x.
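The outbound role rename above, sketched as a payload literal (field set
and values are illustrative, not the exact wire payload any one of these
files emits):

```python
import json

# Wire keys stay lowerCamelCase (proto3 JSON default); only the role
# enum VALUE changed from "user" to "ROLE_USER" in the 1.x migration.
payload = {
    "jsonrpc": "2.0",
    "method": "message/send",
    "params": {
        "message": {
            "messageId": "msg-123",
            "role": "ROLE_USER",  # was "role": "user" pre-1.x
            "parts": [{"text": "PING"}],
        }
    },
    "id": 1,
}
wire = json.dumps(payload)
```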
Why one PR instead of seven: every previous a2a-sdk migration find
landed as its own publish → cascade → harness → next-bug cycle.
Today's audit ran every a2a-sdk symbol/type/method in workspace/
against the installed 1.0.2 wheel in a single sweep + tested the
critical paths (Message construction, Part construction, Role enum
parsing) against the actual SDK. Should be the last migration PR.
Verified locally:
python3 scripts/build_runtime_package.py --version 0.1.99 \
--out /tmp/build-final
pip install /tmp/build-final
python -c "import molecule_runtime.main; \
from molecule_runtime.a2a_executor import LangGraphA2AExecutor"
→ ✓ all imports clean against a2a-sdk 1.0.2
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
7th a2a-sdk migration find from the v0 → v1 transition.
create_jsonrpc_routes() now requires rpc_url as a positional arg
(was implicit at root in 0.x). Pass '/' to match
a2a.utils.constants.DEFAULT_RPC_URL — that's also what
workspace-server's a2a_proxy.go forwards to (POSTs to workspace URL
without appending a path).
Symptom before fix: every workspace startup crashed with
TypeError: create_jsonrpc_routes() missing 1 required positional
argument: 'rpc_url'
Caught by harness 9 phase 4 (claude-code + langgraph both on
0.1.24). The user's "use langgraph for fast iteration" call cut
the diagnose cycle from 15min to ~30s — without that, this would
have taken another hermes round-trip to surface.
Updated reference_a2a_sdk_v0_to_v1_migration.md memory with this
entry alongside the previous 6 finds.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
a2a-sdk 1.x added agent_card as a required argument to
DefaultRequestHandler.__init__. main.py constructed it with only
agent_executor + task_store, so every workspace startup that reached
the handler init step crashed with:
TypeError: DefaultRequestHandlerV2.__init__() missing 1 required
positional argument: 'agent_card'
This is the 6th a2a-sdk migration find from the v0 → v1 transition
(see reference_a2a_sdk_v0_to_v1_migration memory). Pattern is the
same: SDK exposes a new required arg, our call site needs to pass
the existing object we already construct upstream.
Why the import-only smoke gates didn't catch this: it's a call-time
constructor error inside `async def main()`, not a module load
error. The runtime-pin-compat smoke imports main_sync but doesn't
invoke main() against a real config. Worth filing a follow-up to
extend the smoke to a "construct + dispose" cycle.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CRITICAL: every workspace boot since the a2a-sdk 1.0 migration (#1974)
has been crashing at AgentCard construction with:
ValueError: Protocol message AgentCard has no "supported_protocols" field
The protobuf field is `supported_interfaces` (plural, interfaces — see
a2a-sdk types/a2a_pb2.pyi:189). The 0.3→1.0 migration left the kwarg
as `supported_protocols`, which doesn't exist in the 1.0 schema, so
the constructor raises before any subsequent line of main runs.
Why this hid for so long:
- publish-runtime.yml's smoke step only IMPORTED molecule_runtime.main;
importing the module is fine, only CONSTRUCTING the AgentCard fails
- The user-visible symptom is "Workspace failed: " with empty
last_sample_error, indistinguishable from generic boot timeouts
- The state_transition_history=True bug (fixed in #2179) was a
sibling of this — same migration, same class, just caught first
Fix is symmetric with #2179:
1. workspace/main.py: rename the kwarg + comment explaining why
2. .github/workflows/publish-runtime.yml: extend the smoke block to
instantiate AgentCard with the exact production call shape, so
the next field-rename of this class fails at publish time
instead of breaking every workspace startup
Verification:
- Constructed AgentCard against fresh a2a-sdk 1.0.2 in a clean
venv with the corrected kwarg → succeeds
- Constructed it with the original `supported_protocols` kwarg →
fails immediately with the exact error production sees
- Smoke test pinned to mirror main.py's exact call shape; main.py
+ smoke must stay in lockstep going forward
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two structural fixes for the cascade race conditions that bit us
five times today:
1. **PyPI propagation wait** (cascade job): poll PyPI for the
just-published version with a 60s budget BEFORE firing
repository_dispatch. PyPI accepts the upload but takes a few
seconds to make it available via the package index. Cascade was
firing too fast — downstream template builds ran `pip install`
against a stale index, resolved to the previous version, and
docker layer cache locked that in for subsequent rebuilds.
   Pairs with the build-arg cache invalidation in the molecule-ci PR
   (separate change). Wait without invalidation = the next build still
   reuses the stale cached pip layer. Invalidation without wait = the
   first cascade build may still race PyPI propagation. Together: no
   race, no stale cache.
2. **Path filter expansion**: scripts/build_runtime_package.py is
the build script and changes to it (e.g. import-rewrite fixes,
manifest emit, lib/ subpackage move) directly affect what ships
in the wheel. Was missing from the path filter, so PRs touching
only scripts/ (like #2174's lib/ fix) didn't auto-publish — the
operator had to remember a manual dispatch. Add it to the closed
list of files that trigger auto-publish.
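The propagation wait in point 1 can be sketched as a poll-with-budget
loop (a minimal sketch; the real cascade job is a workflow step, and
`check` stands in for a query against the PyPI JSON index for the
just-published version):

```python
import time

def wait_for_version(check, budget_s=60.0, interval_s=2.0, sleep=time.sleep):
    # Poll until the version is visible or the budget expires; only then
    # should repository_dispatch fire downstream builds.
    deadline = time.monotonic() + budget_s
    while time.monotonic() < deadline:
        if check():
            return True
        sleep(interval_s)
    return False
```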
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the molecule-core-side ask of controlplane #285. CP #289 already
landed migration 022 + the handler change exposing `last_error` in
/cp/admin/orgs responses. This makes the canary harness actually USE
that field — pre-fix the harness exited with just "Tenant provisioning
failed for <slug>" and forced operators to scrape CP server logs to
learn WHY.
The diagnostic burst dumps the matched org row from the LIST_JSON
already in scope (no extra HTTP call), pretty-printed and prefixed,
right before `fail`. Mirrors the TLS-readiness burst pattern from
PR #2107 at step 4. Includes a not-found fallback for DB-drift cases.
No redaction needed — adminOrgSummary is already ops-safe (id, slug,
name, plan, member_count, instance_status, last_error, timestamps;
no tokens, no encrypted fields).
Verification: smoke-tested both branches (org found with last_error +
slug-not-found fallback) with synthetic JSON; bash syntax OK; the only
shellcheck warning is pre-existing on line 93.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a comment block citing a2a-sdk's own
a2a/compat/v0_3/conversions.py, which says verbatim:
state_transition_history=None, # No longer supported in v1.0
So a future reader who notices the missing kwarg won't try to add it
back. The capability is now universal: every v1.x Task carries a
history list and tasks/get supports historyLength via the
apply_history_length helper. No flag because nothing's optional.
Confirmed by reading the SDK source directly:
- a2a/types.py AgentCapabilities exposes only: streaming,
push_notifications, extensions, extended_agent_card.
- a2a/compat/v0_3/conversions.py explicitly maps None when
down-converting v1 → v0.3 (deliberate removal, not rename).
- a2a/server/request_handlers/default_request_handler_v2.py uses
apply_history_length(task, params) — agent doesn't opt in.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
a2a-sdk 1.x's AgentCapabilities only exposes 4 fields:
streaming, push_notifications, extensions, extended_agent_card.
The state_transition_history field was removed in the v1 protobuf
schema. main.py still passed it as a kwarg, so every workspace
that reached the AgentCard construction step (line 188) crashed:
ValueError: Protocol message AgentCapabilities has no
"state_transition_history" field
Symptom: every claude-code + hermes workspace stuck in `provisioning`
forever — caught when the user provisioned a Design Director crew
manually via the canvas while harness 5 was running.
Why every prior smoke gate missed it:
- runtime-pin-compat.yml smokes `from molecule_runtime.main import
main_sync` — only imports the module. AgentCapabilities() runs
inside `async def main()`, not at module load.
- Template image boot smoke imports every /app/*.py module — same
  story. main.py imports fine; the field error only fires at call time.
The fix is one line — drop the kwarg. Fields we actually need
(streaming + push_notifications) are still passed.
Follow-up worth filing: smoke step that instantiates Adapter() +
calls a no-op setup() against a stub config. That would have
caught this before publish.
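Why an import-only smoke misses this class of bug, in miniature
(stand-in class, not the real a2a-sdk type — the stand-in raises
TypeError where protobuf raises ValueError, but the import-time vs
call-time distinction is the same):

```python
# Stand-in for a2a-sdk's AgentCapabilities after the v1 field removal.
class AgentCapabilities:
    def __init__(self, streaming=False, push_notifications=False):
        self.streaming = streaming
        self.push_notifications = push_notifications

# Import-level smoke: merely defining/importing the class always passes.
# Call-level smoke: the removed kwarg fails only when constructed.
try:
    AgentCapabilities(streaming=True, state_transition_history=True)
    removed_kwarg_failed = False
except TypeError:
    removed_kwarg_failed = True
```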
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the LOW-severity dependabot alert on workspace-server's go-redis
pin. Upstream advisory GHSA-92cp-5422-2mw7: "go-redis allows potential
out-of-order responses when CLIENT SETINFO times out" — fixed in 9.7.3.
Patch bump within the v9.7 line; semver guarantees no API change.
Full workspace-server test suite passes (18/18 packages clean).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a section to CONTRIBUTING.md → "Pull Requests" explaining the two
system-level guards that protect against the "I enabled auto-merge then
pushed more commits" race:
1. Repo-wide setting: "Automatically delete head branches" (catches
pushes to a merged-and-deleted branch — the post-merge orphan case).
2. CI workflow `pr-guards` calling molecule-ci's
disable-auto-merge-on-push (catches pushes during queue
processing — disables auto-merge, posts a comment, requires
explicit re-engage).
Why doc-not-just-memory: my agent-side memory is local. Other
contributors on other machines need this in the repo where they
read it. Cites the 2026-04-27 PR #2174 incident with the
specific commit SHAs that got orphaned.
Companion: molecule-ci README updated separately to document the
reusable workflow under "What each workflow validates" so devs
who land in the molecule-ci repo first can find the contract.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Thin caller for molecule-ci's reusable disable-auto-merge-on-push
workflow. Forces operator re-engagement when a commit is pushed to
an open PR with auto-merge already enabled.
Pairs with the org-wide "Automatically delete head branches" repo
setting (also enabled today). Defense in depth:
1. Repo setting blocks pushes to a merged-and-deleted branch
(post-merge orphan case — what bit #2174 today: my second
commit landed on an already-merged-and-deleted branch).
2. This workflow catches in-queue races (push lands while the
merge queue is processing) by disabling auto-merge so the
operator must explicitly re-engage.
Together they cover the full lifecycle of "auto-merge enabled →
new commits arrive" without relying on operator discipline.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the medium-severity dependabot alert #7 on workspace-server's
docker pin: "Moby firewalld reload makes published container ports
accessible from remote hosts" — fixed in v28.3.3, pulling v28.5.2
(latest in the v28 line).
Patch+minor bump within the v28 train; no client-API breaks
(workspace-server only uses docker.Client for container exec /
inspect, all stable since v20+).
Verification: full workspace-server test suite passes (18/18 packages
clean). Build clean.
Out of scope:
- Alerts #10 and #11 (the AuthZ bypass + plugin-priv off-by-one)
require v29.3.1, which is not yet published to the Go module
proxy (latest published is v28.5.2). They'll close in a follow-up
PR once v29 lands as a Go module.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two compounding bugs that bit hermes (and any other workspace that
reaches main.py:142):
1. workspace/lib/ was in EXCLUDE_DIRS so the published wheel didn't
contain the directory at all. main.py imports `from lib.pre_stop
import read_snapshot` (and `build_snapshot`, `write_snapshot`) so
every workspace startup that reaches the snapshot path crashed
with `ModuleNotFoundError: No module named 'lib'`.
2. Even if lib/ had shipped, `lib` wasn't in SUBPACKAGES so the
import-rewriter would have left the bare `from lib.pre_stop`
unqualified — it would still fail because the package would only
be reachable as `molecule_runtime.lib`.
Fix: move `lib` from EXCLUDE_DIRS to SUBPACKAGES (one entry each).
Drift gate extension: the existing gate I added in #2163 only
asserted TOP_LEVEL_MODULES against workspace/*.py. This change adds
the symmetric assertion for SUBPACKAGES against workspace/<dir>/
(filtered by EXCLUDE_DIRS + presence of __init__.py). Catches:
- Subpackage added to workspace/ but missed in SUBPACKAGES
- Subpackage missing from workspace/ but lingering in SUBPACKAGES
- Subpackage wrongly in EXCLUDE_DIRS while also referenced by
rewritten imports (the lib case)
Tested locally: build of 0.1.99 now ships lib/ and main.py contains
`from molecule_runtime.lib.pre_stop import ...` correctly rewritten.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tonight's wire-real E2E sweep exposed 12+ root causes across the post-
#87 template extraction. Most would have been caught by an actual
provision-and-online test running on each template — but the test only
covered claude-code + hermes. Extending it to cover all 8 ensures any
future regression in any template fails the test, not production.
What's added:
- run_openai_runtime(runtime, label): generic provisioner for the 5
OpenAI-backed templates (langgraph, crewai, autogen, deepagents,
openclaw). Same shape as run_hermes minus the HERMES_* config block
that hermes-agent needs.
- run_gemini_cli: separate function — gemini-cli wants a Google AI
key (E2E_GEMINI_API_KEY), not OpenAI.
- Each new runtime registered in the dispatch loop. New `all` keyword
for E2E_RUNTIMES runs every covered runtime.
claude-code + hermes keep their dedicated functions; both have unique
provisioning quirks (claude-code OAuth + claude-code-specific volume
mounts; hermes 15-min cold-boot) that don't generalize cleanly.
Skip-if-no-key pattern matches the existing one — partially-keyed CI
gets clean skips, not false-fails.
Usage:
E2E_OPENAI_API_KEY=... E2E_RUNTIMES=langgraph ./test_priority_runtimes_e2e.sh
E2E_OPENAI_API_KEY=... E2E_RUNTIMES=all ./test_priority_runtimes_e2e.sh
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The conftest mock only exposed `new_agent_text_message`, the pre-v1
name. After fixing a2a_executor.py to use the v1 name
`new_text_message`, the mock didn't satisfy the import → CI red.
Mock both names (aliased to the same lambda) so any in-flight test
that still references the old name keeps working until the next
sweep removes those references.
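The dual-name shim amounts to this (a sketch; `mock_helpers` stands in
for the conftest's mocked a2a helpers module):

```python
from types import SimpleNamespace

def _fake_text_message(text, *args, **kwargs):
    # Minimal stand-in return shape; tests mostly just need the symbol.
    return {"parts": [{"text": text}]}

mock_helpers = SimpleNamespace()
mock_helpers.new_text_message = _fake_text_message        # v1 name
mock_helpers.new_agent_text_message = _fake_text_message  # legacy alias
```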
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the HIGH-severity dependabot alert on workspace-server's jwt-go
pin. Upstream advisory GHSA-mh63-6h87-95cp / CVE-2025-30204:
"jwt-go allows excessive memory allocation during header parsing" —
fixed in v5.2.2.
Patch bump within the v5.x line; semver guarantees no API change. Full
workspace-server test suite passes (`go test ./...` clean across all
18 packages).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
a2a-sdk v1 renamed `new_agent_text_message` → `new_text_message`
(role=Role.agent is now the default). Same fix landed in the hermes
template earlier today; this is the runtime-side equivalent.
NOT dead code: a2a_executor.py is the LangGraph A2A executor, used by
the langgraph + deepagents templates. Both templates currently import
it via bare `from a2a_executor import LangGraphA2AExecutor` — which is
a separate bug in those templates, filed/fixed separately.
Symptom in a2a_executor.py form: any langgraph or deepagents workspace
that calls create_executor crashes with `ImportError: cannot import
name 'new_agent_text_message' from 'a2a.helpers'`. Doesn't surface for
claude-code or hermes (their templates use their own executors and
don't load a2a_executor).
Five call sites updated, one import line, one comment. Test suite
already passes against the new symbol — `python -c "from
molecule_runtime.a2a_executor import LangGraphA2AExecutor"` resolves
cleanly after this change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#2000 fixed one symptom — TENANT_IMAGE pinned to `staging-a14cf86`
(10 days stale) silently no-op'd four upstream fixes on 2026-04-24.
This adds the audit pattern as a re-runnable script so the broader
class is observable on demand without new CI infrastructure.
Audit results today (2026-04-27):
controlplane / production: 54 vars audited, 0 drift-prone pins
controlplane / staging: 52 vars audited, 0 drift-prone pins
So the immediate audit deliverable is clean — TENANT_IMAGE is the only
known violation and #2000 already fixed it. The script makes the
ongoing audit a 5-second command instead of a manual one.
Detection regex catches:
* branch-SHA suffixes (`staging|main|prod|production-<6+ hex>`)
— the exact 2026-04-24 incident shape
* version pins after `:` or `=` (`:v1.2.3`, `=v0.1.16`)
— same drift class, just rendered differently
Anchoring on `:` or `=` keeps prose like "version 1.2.3 of the api"
out of the false-positive set. UUIDs, ARNs, AMI IDs, secrets, and
floating tags (`:staging-latest`, `:main`) pass through untouched.
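An approximation of that detection logic in Python (the real audit is a
shell script and its exact regex may differ; this mirrors the behavior
described above):

```python
import re

PIN_RE = re.compile(
    r"(?:staging|main|prod|production)-[0-9a-f]{6,}"  # branch-SHA suffix
    r"|[:=]v\d+\.\d+\.\d+"                            # version pin after : or =
)

def is_drift_prone(value):
    return bool(PIN_RE.search(value))
```

Floating tags and prose pass through because the version branch is
anchored on `:`/`=` and the SHA branch requires 6+ hex chars after the
branch prefix.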
Regression test (tests/ops/test_audit_railway_sha_pins.sh) pins 20
representative cases — 9 should-flag (covering all four branch
prefixes + semver variants + middle-of-value matches) and 11
should-pass (the false-positive guards). Same regex inlined in both
files so a future tweak that weakens detection fails the test in
lockstep with weakening the audit.
Both files shellcheck clean.
CI gate (acceptance criterion's "regression: add a CI check") is
deliberately scoped out — querying Railway from CI requires plumbing
RAILWAY_TOKEN as a repo secret, which is multi-step setup. The
re-runnable script + test cover the same surface today; the CI
workflow is a small follow-up once the token is provisioned.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Quota gates are resource-state conflicts, not payment failures —
RFC 9110 reserves 402 for billing/payment failures specifically. The
canonical Molecule-AI/docs PR #82 already shipped the corrected text;
this brings the molecule-core copy of the tutorial in line.
The inline parenthetical "(not 402 Payment Required — quota gates are
resource-state conflicts, not payment failures, per RFC 9110)" doubles
as a regression anchor: a future edit that flips 409 back to 402 would
have to also reword that explanation, making the change a deliberate
two-step act rather than a casual oversight.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the second of two skipped tests in workspace_provision_test.go
that were blocked on interface refactors. The Broadcaster + CP
provisioner halves landed in earlier #1814 cycles; this is the
plugin-source-registry half.
Refactor:
- Add handlers.pluginSources interface with the 3 methods handler
code actually calls (Register, Resolve, Schemes)
- Compile-time assertion `var _ pluginSources = (*plugins.Registry)(nil)`
catches future method-signature drift at build time
- PluginsHandler.sources narrowed from *plugins.Registry to the
interface; production wiring (NewPluginsHandler, WithSourceResolver)
still passes *plugins.Registry — satisfies the interface
Production fix (#1206 leak):
- resolveAndStage's Fetch-failure path was interpolating err.Error()
into the HTTP response body via `failed to fetch plugin from %s: %v`.
Resolver errors routinely contain rate-limit text, github request
IDs, raw HTTP body fragments, and (for local resolvers) file system
paths — none has any business landing in a user's browser.
- Body now carries just `failed to fetch plugin from <scheme>`; the
status code already differentiates the failure shape (404 not
found, 504 timeout, 502 generic). Full err detail stays in the
server-side log line one statement above.
Test:
- 6 sub-tests covering every error path inside resolveAndStage:
empty source, invalid format, unknown scheme, local
path-traversal, unpinned github (PLUGIN_ALLOW_UNPINNED unset),
Fetch failure with a leaky synthetic error
- The Fetch-failure case plants 5 realistic leak markers in the
resolver's error string (rate limit text, x-github-request-id,
auth_token, ghp_-prefixed token, /etc/passwd path); the assertion
fails if ANY appears in the response body
- Table-driven so a future error path added to resolveAndStage gets
one new row, not a copy-paste of the assertion logic
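The leak-marker assertion, rendered in Python for illustration (the real
test is Go; markers lifted from the list above):

```python
# Representative leak markers a resolver error string might carry.
LEAK_MARKERS = (
    "rate limit",
    "x-github-request-id",
    "auth_token",
    "ghp_",
    "/etc/passwd",
)

def assert_no_leak(body):
    # Fail if ANY marker made it into the HTTP response body.
    hits = [m for m in LEAK_MARKERS if m in body]
    assert not hits, f"response body leaked resolver detail: {hits}"
```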
Verification:
- 6/6 sub-tests pass
- Full workspace-server test suite passes (interface refactor is
non-breaking; production caller paths unchanged)
- go build ./... clean
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The wheel's pyproject.toml has declared
`molecule-runtime = "molecule_runtime.main:main_sync"` since the
publish pipeline was created on 2026-04-26, but the function
itself was never present in workspace/main.py — it lived in the
pre-monorepo molecule-ai-workspace-runtime repo and was lost
during the consolidation that made workspace/ the source of truth.
The 0.1.15 wheel still had main_sync from a leftover snapshot,
so the regression went unnoticed until 0.1.16 (the first wheel
built from the new source-of-truth) shipped. Symptom: every
workspace container restart-loops with
ImportError: cannot import name 'main_sync' from 'molecule_runtime.main'
— the molecule-runtime CLI script's first line tries to import
the missing symbol. Workspaces stay in `provisioning` until the
10-min sweep marks them failed.
Caught by .github/workflows/runtime-pin-compat.yml, which already
imports the symbol by name as its smoke test. (That check kept
failing red on every recent merge_group run; this PR fixes the
underlying symbol-not-found instead of the smoke step.)
Also strengthens publish-runtime.yml's wheel smoke from
`import molecule_runtime.main` (loads the module — passes even
when entry-point target is missing) to `from molecule_runtime.main
import main_sync` (the actual contract the CLI script needs).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The skipped test exists to assert that provisionWorkspaceCP never
leaks err.Error() in WORKSPACE_PROVISION_FAILED broadcasts (regression
guard for #1206). Writing the test body required substituting a
failing CPProvisioner — but the handler's `cpProv` field was the
concrete *CPProvisioner type, so a mock had nowhere to plug in.
Refactor:
- Add provisioner.CPProvisionerAPI interface with the 3 methods
handlers actually call (Start, Stop, GetConsoleOutput)
- Compile-time assertion `var _ CPProvisionerAPI = (*CPProvisioner)(nil)`
catches future method-signature drift at build time
- WorkspaceHandler.cpProv narrowed to the interface; SetCPProvisioner
accepts the interface (production caller passes *CPProvisioner
from NewCPProvisioner unchanged)
Test:
- stubFailingCPProv whose Start returns a deliberately leaky error
(machine_type=t3.large, ami=…, vpc=…, raw HTTP body fragment)
- Drive provisionWorkspaceCP via the cpProv.Start failure path
- Assert broadcast["error"] == "provisioning failed" (canned)
- Assert no leak markers (machine type, AMI, VPC, subnet, HTTP
body, raw error head) in any broadcast string value
- Stop/GetConsoleOutput on the stub panic — flags a future
regression that reaches into them on this path
Verification:
- Full workspace-server test suite passes (interface refactor
is non-breaking; production caller path unchanged)
- go build ./... clean
- The other skipped test in this file (TestResolveAndStage_…)
is a separate plugins.Registry refactor and remains skipped
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two compounding bugs surfaced when 0.1.16 hit production today:
1. scripts/build_runtime_package.py had a hand-curated TOP_LEVEL_MODULES
set listing every workspace/*.py that should get its bare imports
rewritten to `molecule_runtime.X`. The set silently went stale:
- Missing: transcript_auth (added since #87 phase 1c), runtime_wedge,
watcher → unrewritten imports shipped, every workspace startup
died with ModuleNotFoundError.
- Stale: claude_sdk_executor, cli_executor (both removed in #87),
hermes_executor (never existed) → harmless but misleading.
2. publish-runtime.yml's wheel-smoke step asserted on stable invariants
(BaseAdapter, AdapterConfig, a2a_client error sentinel) but never
imported main. So even though main.py held the broken bare
`from transcript_auth import ...`, the smoke check passed.
Fixes:
- Build script now derives the on-disk module set from workspace/*.py
and asserts it matches TOP_LEVEL_MODULES exactly. Drift in either
direction fails the build with a specific diff message instead of
shipping a broken wheel. Closed-list typo guard preserved (we still
edit the set explicitly when a module is added/removed) — the gate
just makes drift impossible to ignore.
- TOP_LEVEL_MODULES updated to current reality: drop the 3 stale,
add the 3 missing.
- publish-runtime.yml wheel-smoke now `import molecule_runtime.main`
before the invariant asserts. main is the entry point and
transitively imports every module — any bare-import bug surfaces
as ModuleNotFoundError before PyPI accepts the upload.
Tested locally: `python3 scripts/build_runtime_package.py
--version 0.1.99 --out /tmp/build-test` succeeds, and
/tmp/build-test/molecule_runtime/main.py contains the rewritten
`from molecule_runtime.transcript_auth import ...`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When E2E_INTENTIONAL_FAILURE=1 poisons the tenant token, step 5/11's
`tenant_call POST /workspaces` curl exits 22 (HTTP error under
--fail-with-body). `set -e` propagates rc=22 directly, but the
script's documented contract emits only {0,1,2,3,4}, and the sanity
workflow's case statement only matches those. rc=22 falls through
to "Unexpected rc — investigate harness" and opens a false-positive
priority-high "safety net broken" issue (#2159, weekly run on
2026-04-27).
The trap now captures $? at entry (must be the first statement
before any command clobbers it) and at the end normalizes any
non-contract code to 1 (generic failure). Leak detection continues
to exit 4 directly, so its semantics are preserved.
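The normalization contract reduces to this (a Python rendering of the
trap's final step; the contracted code set comes from the text above):

```python
CONTRACT_CODES = {0, 1, 2, 3, 4}

def normalize_rc(rc):
    # Codes in the documented contract pass through untouched (leak
    # detection's exit 4 included); anything else — curl's 22, signal
    # codes like 139 — collapses to the generic failure code 1.
    return rc if rc in CONTRACT_CODES else 1
```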
Adds tests/e2e/test_harness_rc_normalization.sh — a self-contained
regression test that builds a stub harness with the same trap
pattern, triggers controlled exit codes, and asserts the
normalization. Covers the 5 contracted codes + curl-22 (the bug) +
3 representative network-failure codes + sigsegv-139.
Verification:
- 10/10 regression tests pass
- shellcheck clean on both modified files
- production teardown path unchanged for legitimate {1,2,3,4}
failures and the leak-detection exit 4
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>