The iter-3 split created mcp_heartbeat / mcp_inbox_pollers /
mcp_workspace_resolver but the wheel build's drift-gate check at
scripts/build_runtime_package.py:TOP_LEVEL_MODULES wasn't updated.
Without this fix the wheel ships those modules un-rewritten, so
their imports of platform_auth / configs_dir / etc. break at
runtime. Caught by the 'PR-built wheel + import smoke' check.
Refs RFC #2873 iter 3.
PR #2756 piped adapter.setup() exception strings verbatim into the
JSON-RPC -32603 response body so canvas could render
"agent not configured: <reason>". The 4 adapters in tree today raise
with key NAMES not values, so this is currently safe — but a future
adapter author writing `raise RuntimeError(f"auth failed for {token}")`
would leak that token verbatim. Issue #2760 flagged the risk; this PR
closes it.
workspace/secret_redactor.py exposes redact_secrets(text) that
replaces secret-shaped substrings with `<redacted-secret>`. Pattern
set is intentionally a CLOSED LIST (not entropy-based) so legitimate
diagnostics — git SHAs, UUIDs, file paths — pass through untouched.
Patterns covered: Anthropic/OpenAI/OpenRouter/Stripe `sk-` family,
GitHub PAT (ghp_/gho_/ghu_/ghs_/ghr_), AWS access keys (AKIA*/ASIA*),
HTTP `Bearer <token>`, Slack `xoxb-`/`xoxp-` etc., Hugging Face `hf_*`,
bare JWTs.
Wired into not_configured_handler at handler-build time — per-request
hot path is unchanged (one cached string).
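A minimal sketch of the closed-list shape described above — the `redact_secrets(text)` signature and `<redacted-secret>` placeholder come from the text; the exact regexes here are illustrative assumptions, not the shipped pattern set:

```python
import re

# Closed list of secret-shaped patterns (intentionally NOT entropy-based,
# so git SHAs, UUIDs, and file paths pass through untouched).
_PATTERNS = [
    re.compile(r"\bsk-[A-Za-z0-9_-]{20,}"),            # sk- family (Anthropic/OpenAI/...)
    re.compile(r"\bgh[poushr]_[A-Za-z0-9]{20,}"),      # GitHub PATs (ghp_/gho_/ghu_/ghs_/ghr_)
    re.compile(r"\b(?:AKIA|ASIA)[A-Z0-9]{16}\b"),      # AWS access key ids
    re.compile(r"\bBearer\s+[A-Za-z0-9._-]{16,}"),     # HTTP Bearer tokens
    re.compile(r"\bxox[bp]-[A-Za-z0-9-]{10,}"),        # Slack bot/user tokens
    re.compile(r"\bhf_[A-Za-z0-9]{20,}"),              # Hugging Face tokens
    re.compile(r"\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\b"),  # bare JWTs
]


def redact_secrets(text):
    """Replace secret-shaped substrings; None/empty pass through unchanged."""
    if not text:
        return text
    for pat in _PATTERNS:
        text = pat.sub("<redacted-secret>", text)
    return text
```

Because the list is closed, a new provider's token shape is invisible until a pattern is added — the trade accepted here in exchange for never mangling legitimate diagnostics.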
Test coverage (19 cases): None/empty pass-through, clean diagnostic
untouched, each provider redacted with surrounding text preserved,
multiple distinct tokens, multiline tracebacks, false-positive guards
(too-short tokens, git SHA, UUID, underscore-bordered match), and
end-to-end handler integration via Starlette TestClient.
Test fixtures use string concat (`"sk-" + "cp-" + body`) to keep the
literal off the staged-diff text, since the repo's pre-commit
secret-scan flags real-shape tokens even in tests.
`secret_redactor` registered in TOP_LEVEL_MODULES (drift gate).
Closes #2760
Pairs with: PR #2756, PR #2775
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #2756's contract — card route always mounted regardless of
adapter.setup() outcome — lived inline in main.py's `# pragma: no cover`
boot sequence. A future refactor that re-coupled the two would have
silently bypassed PR #2756 and shipped the original "stuck booting
forever" UX again, with no pytest catching it.
This change extracts route assembly into workspace/boot_routes.py's
build_routes(card, executor, adapter_error) and pins the contract with
6 integration tests using Starlette's TestClient:
- test_card_route_serves_200_when_adapter_ready: happy path
- test_card_route_serves_200_when_adapter_failed: misconfigured boot,
card still 200, skill stubs survive
- test_jsonrpc_returns_503_when_no_executor: full -32603 envelope with
the adapter_error in error.data
- test_jsonrpc_returns_503_with_generic_when_no_error_string: fallback
reason for the rare case main.py reaches this branch without one
- test_card_route_does_not_depend_on_executor: direct PR #2756
regression guard — both branches MUST mount the card route
- test_executor_present_does_not_mount_not_configured_handler: sanity
that a healthy workspace doesn't return -32603 to every request
Conftest stubs extended with a2a.server.routes / request_handlers
classes so the tests work under the existing a2a-mock infra (pattern
matches the AgentCard/AgentSkill stubs added for PR #2765).
main.py now calls build_routes; the inline if/else is gone. Same
production behaviour, cleaner shape, regression-proof.
Heavy a2a-sdk imports inside build_routes() are lazy (deferred to the
executor-only branch) so tests that only exercise the not-configured
path don't pull DefaultRequestHandler / InMemoryTaskStore.
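A hypothetical, heavily simplified sketch of that branch shape — route tuples stand in for Starlette `Route` objects, and a cheap `import json` stands in for the heavy a2a-sdk imports; the real code lives in workspace/boot_routes.py:

```python
def build_routes(card, executor, adapter_error):
    # Card route mounts in BOTH branches — the PR #2756 contract.
    routes = [("GET /.well-known/agent-card", card)]
    if executor is None:
        # Not-configured path: no SDK import needed at all.
        routes.append(("POST /", {"jsonrpc_error": -32603, "reason": adapter_error}))
    else:
        # Lazy import: tests that only exercise the branch above never pay
        # this cost (stand-in for DefaultRequestHandler / InMemoryTaskStore).
        import json  # noqa: F401
        routes.append(("POST /", {"executor": executor}))
    return routes
```

The point of the shape is that the import sits inside the executor-only branch, so a test process that never builds an executor never triggers it.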
card_helpers + boot_routes registered in TOP_LEVEL_MODULES (build
drift gate would have caught the missing entry on the wheel-publish
smoke).
All 18 related tests pass (test_boot_routes.py: 6, test_card_helpers.py:
6, test_not_configured_handler.py: 6).
Closes #2761
Pairs with: PR #2756 (decouple agent-card from setup),
PR #2765 (defensive isolation of enrichment + transcript)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #2756 added a try/except around adapter.setup() so a missing LLM key
doesn't crash the workspace boot. Two paths that now run AFTER setup
succeeds were not similarly isolated, leaving small but real coupling
risks for future adapter authors.
1. **Skill metadata enrichment swap (main.py:248-259).** When
adapter.setup() returns, main.py reads adapter.loaded_skills and
replaces the static stubs in agent_card.skills with rich metadata
(description, tags, examples). The list comprehension assumes each
element exposes .metadata.{id,name,description,tags,examples}. A
future adapter that returns a non-canonical shape would raise
AttributeError, propagate to the outer except, capture as
adapter_error, and silently degrade an OK boot to the
not-configured state — even though setup() actually succeeded.
Extract to card_helpers.enrich_card_skills(card, loaded_skills) →
bool. Helper swallows enrichment failures, logs the cause, returns
False, leaves the static stubs in place. setup() success path
continues unchanged. 6 unit tests cover: None input, empty list,
canonical happy path, missing .metadata attr, partial .metadata
(missing one canonical field), atomic-failure-no-partial-swap.
2. **/transcript handler (main.py:513).** Calls await
adapter.transcript_lines(...) without try/except. BaseAdapter's
default returns {"supported": false} so today's 4 adapters never
trigger this — but a future adapter override that assumes setup()
ran would surface as a 500 from Starlette's default error handler
instead of a useful 503 with the exception class + message.
Inline try/except returns 503 with the reason, matching the
not-configured JSON-RPC handler's pattern.
Both changes match the architectural principle the PR #2756 chain
established: availability (workspace reachable) is decoupled from
configuration / adapter behavior. Operators see useful errors instead
of silent degradation; future adapter authors can't accidentally
break tenant readiness with a shape mismatch.
Adds:
- workspace/card_helpers.py (~50 lines, 100% covered)
- workspace/tests/test_card_helpers.py (6 tests)
- AgentCard/AgentSkill/AgentCapabilities/AgentInterface stubs to
workspace/tests/conftest.py so future card-related tests work
under the existing a2a-mock infrastructure
- card_helpers in TOP_LEVEL_MODULES (drift gate would have caught it)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The wheel-build drift gate caught the new module added in this PR —
without registering it, the published wheel would ship `import
not_configured_handler` un-rewritten, which would `ModuleNotFoundError`
at runtime under `molecule_runtime.main`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the recurrence path of PR #2556. The data fix realigned 8→4
templates in publish-runtime.yml's TEMPLATES variable, but the
underlying drift hazard was unguarded — the next manifest change
could silently leave cascade out of sync again.
This gate fails any PR that changes manifest.json or
publish-runtime.yml in a way that makes the cascade list diverge
from manifest workspace_templates (suffix-stripped). Either
direction is caught:
- missing-from-cascade: templates that won't auto-rebuild on a new
  wheel publish (the codex-stuck-on-stale-runtime bug class — PR #2512
  added codex to manifest, cascade wasn't updated, and codex stayed
  pinned to its last-built runtime version for weeks).
- extra-in-cascade: cascade dispatches to deprecated templates (the
  wasted-API-calls + dead-CI-noise class — PR #2536 pruned 5 templates
  from manifest; cascade kept dispatching to all 8 until PR #2556).
Triggers narrowly: only on PRs that touch manifest.json,
publish-runtime.yml, or the script itself. Fast (single grep+sed+comm
pipeline, no Go build).
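The shipped gate is a grep+sed+comm shell pipeline; the same both-directions check reduces to a set difference. A sketch in Python — the "-template" suffix being stripped is an assumption for illustration:

```python
def diff_cascade(manifest_templates, cascade_entries):
    """Compare manifest workspace_templates (suffix-stripped) against the
    publish-runtime.yml cascade list.

    Returns (missing_from_cascade, extra_in_cascade); either list being
    non-empty is a gate failure.
    """
    # Assumed suffix convention — the real script strips whatever suffix
    # the manifest actually uses.
    manifest = {t.removesuffix("-template") for t in manifest_templates}
    cascade = set(cascade_entries)
    return sorted(manifest - cascade), sorted(cascade - manifest)
```

Both self-tested failure modes below map directly onto the two return slots.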
Surfaced during the RFC #388 prior-art audit; folded in as the
structural follow-up that the data fix (#2556) promised.
Self-tested both failure modes locally before commit:
- Drop codex from cascade → script fails with "MISSING: codex"
- Add langgraph to cascade → script fails with "EXTRA: langgraph"
Refs: https://github.com/Molecule-AI/molecule-controlplane/issues/388
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CP's deprovision flow calls Secrets.DeleteSecret() (provisioner/ec2.go:806)
but only when the deprovision runs to completion. Crashed provisions and
incomplete teardowns leak the per-tenant `molecule/tenant/<org_id>/bootstrap`
secret. At ~$0.40/secret/month, ~45 leaked secrets surfaced as ~$19/month
on the AWS cost dashboard.
The tenant_resources audit table (mig 024) tracks four kinds today —
CloudflareTunnel, CloudflareDNS, EC2Instance, SecurityGroup — and the
existing reconciler doesn't catch Secrets Manager orphans. The proper fix
(KindSecretsManagerSecret + recorder hook + reconciler enumerator) is filed
as a follow-up controlplane issue. This sweeper is the immediate stopgap.
Parallel-shape to sweep-cf-tunnels.sh:
- Hourly schedule offset (:30, between sweep-cf-orphans :15 and
sweep-cf-tunnels :45) so the three janitors don't burst CP admin
at the same minute.
- 24h grace window — never deletes a secret younger than the
  provisioning roundtrip, so an in-flight provision can't be raced
  and have its secret deleted from under it.
- MAX_DELETE_PCT=50 default (mirrors sweep-cf-orphans for durable
resources; tenant secrets should track 1:1 with live tenants).
- Same schedule-vs-dispatch hardening as the other janitors:
schedule → hard-fail on missing secrets, dispatch → soft-skip.
- 8-way xargs parallelism, dry-run by default, --execute to delete.
Requires a dedicated AWS_JANITOR_* IAM principal — the prod molecule-cp
principal lacks secretsmanager:ListSecrets (it only has scoped
Get/Create/Update/Delete). The workflow's verify-secrets step will hard-fail
on the first scheduled run until those secrets are configured, surfacing
the missing setup loudly rather than silently no-op'ing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The wheel-build drift gate caught it correctly: any new top-level
module under workspace/ must be listed in TOP_LEVEL_MODULES so its
`from event_log import …` statements get rewritten to
`from molecule_runtime.event_log import …` at package time.
Without this entry, the published wheel ships event_log.py un-rewritten
and crashes at runtime with ModuleNotFoundError on first heartbeat.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The harness needs Authorization + X-Molecule-Org-Id (per-tenant, NOT
CP_ADMIN_API_TOKEN) when targeting *.moleculesai.app subdomains.
The existing single-Origin-header form silently failed with a 404
against staging tenants, because the SaaS edge WAF rewrites
unauthenticated /workspaces calls to Next.js (per
Switch to a headers array so multiple -H flags compose cleanly with
curl arg-quoting, and document the env var contract at the top of
the script.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds two scripts:
scripts/test-all-runtimes-a2a-e2e.sh
Provisions one workspace per runtime (claude-code, hermes, codex,
openclaw), sets provider keys, waits online, sends two A2A messages
per workspace. First message validates round-trip; second message
validates session continuity. Cleans up via trap on EXIT.
scripts/test-hermes-plugin-e2e.sh
Hermes-only variant focused on the plugin /a2a/inbound path.
Proof-point: session continuity between turns (the plugin path's
deliverable; old chat-completions path lost context per turn).
Both honor SKIP_<runtime> env vars for incremental testing and tolerate
the SaaS edge WAF Origin header requirement (per
reference_saas_waf_origin_header.md).
Run:
  PLATFORM=https://demo-tenant.staging.moleculesai.app \
./scripts/test-all-runtimes-a2a-e2e.sh
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The hourly Sweep stale Cloudflare Tunnels job got cancelled mid-cleanup
on 2026-05-02 (run 25248788312, killed at 5min after deleting 424/672
stale tunnels). A second manual dispatch finished the remaining 254
fine, so the immediate backlog cleared, but two underlying bugs would
re-trip on the next big cleanup.
Bug 1: serial delete loop. The execute branch was a `while read; do
curl -X DELETE; done` pipeline at ~0.7s/tunnel — fine for the
steady-state cleanup of a handful, but a 600+ backlog needs ~7-8min.
This commit fans out to $SWEEP_CONCURRENCY (default 8) workers via
`xargs -P 8 -L 1 -I {} bash -c '...' _ {} < "$DELETE_PLAN"`. With 8x
parallelism the same 600+ list drains in ~60s. Notes:
- We use stdin (`<`) not GNU's `xargs -a FILE` so the script stays
portable to BSD xargs (matters for local-runner testing on macOS).
- We pass ONLY the tunnel id on argv. xargs tokenizes on whitespace
by default; tab-separating id+name on argv risks mangling. The
name is kept in a side-channel id->name map ($NAME_MAP) and looked
up by the worker only on failure, for FAIL_LOG readability.
- Workers print exactly `OK` or `FAIL` on stdout; tally with
`grep -c '^OK$' / '^FAIL$'`.
- On non-zero FAILED, log the first 20 lines of $FAIL_LOG as
"Failure detail (first 20):" — same diagnostic surface as before
but consolidated so we don't spam logs on a flaky CF API.
Bug 2: workflow's 5-min cap was set as a hangs-detector but turned out
to be a real-job-too-slow detector. Raised to 30 min — generous
headroom for the ~60s steady-state run while still surfacing genuine
hangs (and in line with the sweep-cf-orphans companion job).
Bug 3 (drive-by): the existing trap was `trap 'rm -rf "$PAGES_DIR"'
EXIT`, which would have been silently overwritten by any later trap
registration. Replaced with a single `cleanup()` function that wipes
PAGES_DIR + all four new tempfiles (DELETE_PLAN, NAME_MAP, FAIL_LOG,
RESULT_LOG), called once via `trap cleanup EXIT`.
Verification:
- bash -n scripts/ops/sweep-cf-tunnels.sh: clean
- shellcheck -S warning scripts/ops/sweep-cf-tunnels.sh: clean
- python3 yaml.safe_load on the workflow: clean
- Synthetic 30-line delete plan with every 7th id sentinel'd to
return {"success":false}: TEST PASS, DELETED=26 FAILED=4, FAIL_LOG
side-channel name lookup verified.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The page-merge loop passed the entire accumulating tunnel JSON to
python3 -c via argv on every iteration. On a busy account (verified
2026-05-02: 672 tunnels, 14 pages on Hongmingwangrabbit account) this
exceeds the GH Ubuntu runner's per-argument size limit (Linux
MAX_ARG_STRLEN, ~128 KB) and
dies with `python3: Argument list too long` at exit 126 — the workflow
has been silently failing this way since the very first run that hit a
real account, masked earlier by a missing-CF_ACCOUNT_ID secret check.
Buffer each page response to a file under a temp dir, merge from disk
at the end. Also bumps the page cap from 20 to 40 (1000 → 2000 tunnel
ceiling) so the existing soft-cap warning has headroom; the disk-merge
shape is O(n) in tunnel count rather than the previous O(n^2) so the
larger ceiling is cheap.
Verified locally against the live account (672 tunnels): script now
runs cleanly to the existing MAX_DELETE_PCT safety gate, which trips
at 99% > 90% as designed and surfaces the actual orphan backlog for
operator-driven cleanup.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wheel-build smoke gate detected `configs_dir` missing from
scripts/build_runtime_package.py:TOP_LEVEL_MODULES. Without it the
build would ship `import configs_dir` un-rewritten and every
external-runtime install would die on `ModuleNotFoundError` at first
import.
Two callers used `import configs_dir as _configs_dir` to belt-and-
suspenders against an imagined name collision, but the rewriter
rejects `import X as Y` because the rewrite would produce
`import molecule_runtime.X as X as Y` (invalid syntax). No actual
collision exists (only docstring/comment references). Switched to
plain `import configs_dir`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Demo-day preparation bundle for the funding demo (~2026-05-06). Adds:
- scripts/demo-freeze.sh — captures current ghcr.io
workspace-template-* :latest digests for all 8 runtimes, then
disables both cascade vectors that could re-tag :latest mid-demo:
publish-runtime.yml in molecule-core (PATH 1 — staging push to
workspace/** auto-bumps the wheel and fans out to 8 templates) and
publish-image.yml in each of the 8 template repos (PATH 2 — direct
template repo merge re-tags :latest). Defaults to dry-run; requires
--execute to apply. Writes both digest + workflow receipts to
scripts/demo-freeze-snapshots/.
- scripts/demo-thaw.sh — re-enables every workflow demo-freeze.sh
disabled, keyed off the receipt timestamp. Defaults to executing
(the inverse safety polarity from freeze, where the destructive
default is dry-run). --dry-run prints without applying.
- scripts/demo-day-runbook.md — operator runbook indexing the six
rollback levers (platform image rollback, template image rollback,
tenant redeploy, workspace delete, Railway rollback, Vercel
rollback) plus pre-warm timing and post-demo cleanup. Also covers
read-only diagnostics for "is this working?" moments and the
CP_ADMIN_API_TOKEN rotation step that must follow demo (the token
gets copy-pasted into shells during incident response).
- scripts/demo-freeze-snapshots/.gitignore — generated freeze
receipts are operational state, not source. Tracked .gitkeep so
the directory exists when the script writes to it.
Both scripts dry-run-tested locally. Did not exercise --execute since
that would actually disable production workflows mid-development.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The existing wheel-publish smoke (`wheel_smoke.py`) only IMPORTS
`molecule_runtime.main` at module scope. Lazy imports buried inside
`async def execute(...)` bodies (e.g. `from a2a.types import FilePart`)
NEVER evaluate at static-import time — they crash at first message
delivery in production.
The 2026-04-2x v0→v1 a2a-sdk migration shipped 5 such regressions in
templates that all looked fine at module-load smoke. This change adds
`smoke_mode.py` plus a `MOLECULE_SMOKE_MODE=1` short-circuit in
`main.py`: after `adapter.create_executor(...)`, the boot path invokes
`executor.execute(stub_ctx, stub_queue)` once with a 5s timeout
(`MOLECULE_SMOKE_TIMEOUT_SECS`). Healthy import tree → execution
proceeds far enough to hit a network boundary and times out (exit 0).
Broken lazy import → `ImportError` / `ModuleNotFoundError` from inside
the executor body (exit 1). Other downstream errors (auth, validation)
pass — those are caught by adapter-level tests, not this gate.
Stub `(RequestContext, EventQueue)` is built from the real a2a-sdk so
SendMessageRequest/RequestContext constructor changes also surface as
import-tree failures (the regression class also includes "SDK
refactored mid-publish"). The stub-build itself is wrapped — if it
raises, that's a smoke fail too.
Phase 2 (separate PR, molecule-ci) wires this into
publish-template-image.yml so the publish gate runs the boot smoke
against every template image before pushing the tag.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three small follow-ups to the PR #2433 → #2436 → #2439 incident chain.
1) `import inbox # noqa: F401` in workspace/a2a_mcp_server.py was
misleading — `inbox` IS used (at the bridge wiring inside main()).
F401 means "imported but unused", which would mask a real future
F401 if the usage is removed. Drop the noqa, keep the explanatory
block comment about the rewriter's `import X` → `import mr.X as X`
expansion (and the `import X as Y` → `import mr.X as X as Y` trap
the comment exists to prevent re-introducing).
2) scripts/test_build_runtime_package.py — 17 unit tests covering
`rewrite_imports()` and `build_import_rewriter()` in
scripts/build_runtime_package.py. Until now the function had zero
coverage despite the entire wheel build depending on it. Tests
pin: bare-import aliasing, dotted-import preservation, indented
imports, from-imports (simple + dotted + multi-symbol + block),
the `import X as Y` rejection added in PR #2436 (with comment-
stripping + indented + comma-not-alias edge cases), allowlist
anchoring (`a2a` ≠ `a2a_tools`), and end-to-end reproduction
of the PR #2433 failing pattern + the #2436 fix pattern.
3) Wire scripts/test_*.py into CI by adding a second discover pass
to test-ops-scripts.yml. Top-level scripts/ tests live alongside
their target file (parallels the scripts/ops/ test layout); the
existing scripts/ops/ pass keeps running because scripts/ops/
has no __init__.py so a single discover from scripts/ root
doesn't recurse. Two passes is simpler than retrofitting
namespace packages. Path filter widened from `scripts/ops/**`
to `scripts/**` so PRs touching the build script trigger the
new tests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #2433 (notifications/claude/channel) shipped 'import inbox as
_inbox_module' inside a2a_mcp_server.py:main(). The build script's
import rewriter expands plain 'import inbox' to
'import molecule_runtime.inbox as inbox', so the original source
became 'import molecule_runtime.inbox as inbox as _inbox_module',
which is invalid Python.
Caught at the publish-runtime + PR-built-wheel-smoke gate (the
SyntaxError trace is in run 25200422679). The wheel didn't ship to
PyPI because publish-runtime's smoke-import step refused to install
it, but staging is currently sitting on a broken-build commit until
this fix-forward lands.
Changes:
- a2a_mcp_server.py: lift `import inbox` to top of file (rewriter
produces clean `import molecule_runtime.inbox as inbox`), call
inbox.set_notification_callback directly in main()
- build_runtime_package.py: rewrite_imports() now raises ValueError
when it sees 'import X as Y' for any X in the workspace allowlist,
instead of silently producing a syntax-error wheel. Operator gets
a clear actionable error at build time pointing at the offending
line + suggested rewrites ('from X import …' or plain 'import X').
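A single-line sketch of the rewrite-or-reject rule — the allowlist contents and regex here are illustrative assumptions; the real logic lives in build_runtime_package.py's rewrite_imports():

```python
import re

_ALLOWLIST = {"inbox", "configs_dir", "event_log"}  # illustrative subset


def rewrite_import_line(line):
    """Rewrite `import X` → `import molecule_runtime.X as X` for allowlisted X.

    `import X as Y` is rejected outright: the naive rewrite would emit
    `import molecule_runtime.X as X as Y`, which is not valid Python.
    """
    m = re.match(r"^(\s*)import\s+(\w+)(\s+as\s+\w+)?\s*(#.*)?$", line)
    if not m or m.group(2) not in _ALLOWLIST:
        return line
    indent, name, alias = m.group(1), m.group(2), m.group(3)
    if alias:
        raise ValueError(
            f"'import {name} as …' cannot be rewritten; use plain "
            f"'import {name}' or 'from {name} import …'"
        )
    comment = f"  {m.group(4)}" if m.group(4) else ""
    return f"{indent}import molecule_runtime.{name} as {name}{comment}"
```

Raising at build time is the whole fix: the PR #2433 wheel shipped the doubled-`as` form silently and only failed at the smoke gate.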
The build-time gate (this PR's rewriter check) catches the regression
class earlier than the smoke-time gate (PR #2433's failure). Adding
'PR-built wheel + import smoke' to staging branch protection's
required checks is filed separately so this class doesn't merge again.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a notification seam to the universal molecule-mcp wheel so push-
notification-capable MCP hosts (Claude Code today; any compliant
client tomorrow) get inbound A2A messages as conversation interrupts
instead of having to poll wait_for_message / inbox_peek.
Wire-up:
- inbox.py: module-level _NOTIFICATION_CALLBACK + set_notification_callback()
Fires from InboxState.record() AFTER lock release, with same dict
shape inbox_peek returns. Best-effort — a raising callback never
prevents the message from landing in the queue.
- a2a_mcp_server.py: _build_channel_notification() pure helper +
bridge wiring in main() that schedules notifications via
asyncio.run_coroutine_threadsafe (poller is a daemon thread, MCP
loop is asyncio).
- Method name 'notifications/claude/channel' matches the contract
documented in molecule-mcp-claude-channel/server.ts:509.
- wheel_smoke.py: pin set_notification_callback as a published name,
same regression class as the 0.1.16 main_sync incident.
Pollers (wait_for_message / inbox_peek) keep working unchanged for
runtimes without notification support.
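A condensed sketch of the callback seam — `record()`'s return value and the internal names here are simplifications for illustration; the real module keeps richer state:

```python
import threading

_NOTIFICATION_CALLBACK = None
_LOCK = threading.Lock()


def set_notification_callback(cb):
    """Install (or clear, with None) the push-notification callback."""
    global _NOTIFICATION_CALLBACK
    _NOTIFICATION_CALLBACK = cb


class InboxState:
    def __init__(self):
        self.queue = []
        self.seen = set()

    def record(self, message):
        with _LOCK:
            if message["activity_id"] in self.seen:
                return False  # dedupe short-circuits BEFORE the callback fires
            self.seen.add(message["activity_id"])
            self.queue.append(message)
        # Fire AFTER lock release; best-effort — a raising callback must
        # never prevent the message from landing in the queue.
        cb = _NOTIFICATION_CALLBACK
        if cb is not None:
            try:
                cb(message)
            except Exception:
                pass
        return True
```

Firing outside the lock is what keeps a slow or raising callback from ever blocking the poller thread's enqueue path.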
Tests: 6 new in test_inbox.py (callback fires once on record, dedupe
short-circuits before fire, raising cb doesn't break inbox, set/clear
semantics), 5 new in test_a2a_mcp_server.py (method name pin, content
mapping, meta routing, no-id JSON-RPC notification spec, missing-
field tolerance). All 59 combined tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The universal MCP server (a2a_mcp_server.py) was outbound-only — agents
in standalone runtimes (Claude Code, hermes, codex, etc.) could
delegate, list peers, and write memories, but never observed the
canvas-user or peer-agent messages addressed to them. This blocked
"constantly responding" loops without forcing operators back onto a
runtime-specific channel plugin.
This PR closes the inbound gap with a poller-fed in-memory queue and
three new MCP tools:
- wait_for_message(timeout_secs?) — block until next message arrives
- inbox_peek(limit?) — list pending messages (non-destructive)
- inbox_pop(activity_id) — drop a handled message
A daemon thread polls /workspaces/:id/activity?type=a2a_receive every
5s, fills the queue from the cursor (since_id), and persists the cursor
to ${CONFIGS_DIR}/.mcp_inbox_cursor so a restart doesn't replay backlog.
On 410 (cursor pruned) we fall back to since_secs=600 for a bounded
recovery window. Activity-row → InboxMessage extraction mirrors the
molecule-mcp-claude-channel plugin's extractText (envelope shapes #1-3
+ summary fallback).
mcp_cli.main starts the poller alongside the existing register +
heartbeat threads. In-container runtimes (which have push delivery via
canvas WebSocket) skip activation, so inbox tools return an
informational "(inbox not enabled)" message instead of double-delivery.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ship the baseline universal MCP path that any external runtime (Claude
Code, hermes, codex, anything that speaks MCP stdio) can use, before
optimizing per-runtime channels. Today the workspace MCP server only
spins up inside the container; external operators have no way to call
the 8 platform tools (delegate_task, list_peers, send_message_to_user,
commit_memory, etc.) from outside.
Three additive changes:
1. **`platform_auth.get_token()` env-var fallback** — adds
`MOLECULE_WORKSPACE_TOKEN` as a fallback when no
`${CONFIGS_DIR}/.auth_token` file exists. File-first preserves
in-container behavior unchanged. External operators (no /configs
volume) now have a way to supply the token without faking the
filesystem layout.
2. **`molecule-mcp` console script** — adds a new entry point in the
published `molecule-ai-workspace-runtime` PyPI wheel. Operators run
`pip install molecule-ai-workspace-runtime`, set 3 env vars
(WORKSPACE_ID, PLATFORM_URL, MOLECULE_WORKSPACE_TOKEN), and register
the binary in their agent's MCP config. `mcp_cli.main` is a thin
validator wrapper — it checks env BEFORE importing the heavy
`a2a_mcp_server` module so a misconfigured first-run gets a friendly
3-line error instead of a 20-line module-level RuntimeError
traceback.
3. **Wheel smoke gate** — extends `scripts/wheel_smoke.py` to assert
`cli_main` and `mcp_cli.main` are importable. Same regression class
as the 0.1.16 main_sync incident: a silent rename or unrewritten
import here would break every external operator on the next wheel
publish (memory: feedback_runtime_publish_pipeline_gates.md).
Test coverage:
- `tests/test_platform_auth.py` — 8 new tests for the env-var fallback:
file-priority, env-fallback, whitespace handling, cache, header
construction, empty-env-as-unset.
- `tests/test_mcp_cli.py` — 8 new tests for the validator: each
required var separately, file-or-env satisfies token requirement,
whitespace-only env treated as missing, help mentions canvas Tokens
tab.
- Full `workspace/tests/` suite green: 1346 passed, 1 skipped.
- Local end-to-end: built wheel, installed in venv, ran `molecule-mcp`
with no env → friendly error; with env → MCP server starts.
Why now / why this shape: the user redirected us to "support the
baseline first so all runtimes can use it, then optimize". A claude-only MCP
channel leaves hermes/codex/third-party operators broken on
runtime=external. This PR ships the runtime-agnostic baseline; per-
runtime polish (claude-channel push delivery, hermes-native
bindings) is a follow-up PR. PR #2412 fixed the partner bug where
canvas Restart silently revoked the operator's token — the two
together unblock the external-runtime story end-to-end.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Iterates a list of tenant slugs (default canary set on production,
operator-supplied on staging), curls each tenant's /buildinfo plus
canvas's /api/buildinfo, compares to origin/main's HEAD SHA, prints a
table with one of {current, stale, unreachable} per surface. Returns
non-zero if any surface is stale, so it can be wired into a periodic
alert later.
Why this exists: every "is the fix live?" question used to be
answered with a one-off curl + git rev-parse + manual diff. This
script does that uniformly across every public surface (workspace
tenants + canvas) and is parseable. The redeploy verifier (#2398)
covers the deploy moment; this covers any-time-after.
Reads EXPECTED_SHA from `gh api repos/Molecule-AI/molecule-core/
commits/main` so it always reflects the actual upstream tip, not
local working-copy state. Falls back to local origin/main with a
WARN if `gh` isn't logged in — debugging is still useful even if
the comparison may lag.
Depends on:
- #2409 (TenantGuard /buildinfo allowlist) — without it every
tenant looks "unreachable" because the route 404s before the
handler. Already merged on staging; will hit production after
the next staging→main fast-forward + redeploy.
- #2407 (canvas /api/buildinfo) — already on main + Vercel.
Usage:
./scripts/ops/check-prod-versions.sh # production canary set
TENANT_SLUGS="a b c" ./scripts/ops/check-prod-versions.sh # custom set
ENV=staging TENANT_SLUGS="..." ./scripts/ops/check-prod-versions.sh
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
publish-runtime.yml had a broad smoke (AgentCard call-shape, well-known
mount alignment, new_text_message) inline as a heredoc. runtime-prbuild-
compat.yml had a narrow inline smoke (just `from main import main_sync`).
Result: a PR could introduce SDK shape regressions that pass at PR time
and only fail at publish time, post-merge.
Extract the broad smoke into scripts/wheel_smoke.py and invoke it from
both workflows. PR-time gate now matches publish-time gate — same script,
same assertions. Eliminates the drift hazard of two heredocs that have
to be kept in lockstep manually.
Verified locally:
* Built wheel from workspace/ source, installed in venv, ran smoke → pass
* Simulated AgentCard kwarg-rename regression → smoke catches it as
`ValueError: Protocol message AgentCard has no "supported_interfaces"
field` (the exact failure mode of #2179 / supported_protocols incident)
Path filter for runtime-prbuild-compat extended to include
scripts/wheel_smoke.py so smoke-only edits get PR-validated. publish-
runtime path filter intentionally NOT extended — smoke-only edits should
not auto-trigger a PyPI version bump.
Subset of #131 (the broader "invoke main() against stub config" goal
remains pending — main() needs a config dir + stub platform server).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI failure: the Ops scripts (unittest) job runs `python -m unittest
discover` which doesn't have pytest installed. test_check_migration_
collisions.py imported pytest unconditionally, failing module import:
ImportError: Failed to import test module: test_check_migration_collisions
Traceback (most recent call last):
File ".../test_check_migration_collisions.py", line 12, in <module>
import pytest
ModuleNotFoundError: No module named 'pytest'
The tests use no pytest-specific features (just bare assert + plain
class). Sibling test_sweep_cf_decide.py in the same dir already uses
unittest.TestCase. Convert this one to match: drop the pytest import,
make TestMigrationFileRe inherit from unittest.TestCase.
unittest.TestLoader.discover() requires TestCase subclasses for
auto-discovery, so the fix is two lines (drop import, add base).
Bare assert statements work fine inside TestCase methods.
Verified: `python3 -m unittest scripts.ops.test_check_migration_collisions -v`
runs all 9 tests, all pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two PRs targeting staging can each add a migration with the same
numeric prefix (e.g. 044_*.up.sql). Each passes CI independently.
They collide at merge time. Worst case: second migration silently
doesn't apply and prod schema drifts from what the code expects.
Caught manually 2026-04-30 during PR #2276 rebase: 044_runtime_image_pins
collided with 044_platform_inbound_secret from RFC #2312. This workflow
makes that detection automatic at PR-open time.
How it works:
scripts/ops/check_migration_collisions.py runs on every PR that
touches workspace-server/migrations/**. For each new/modified
migration filename, extracts the numeric prefix and checks:
1. Does the base branch already have a DIFFERENT migration file with
the same prefix? (PR branched off an old base, base advanced and
another PR landed the same number — needs rebase.)
2. Is another OPEN PR (not this one) also adding a migration with
the same prefix? (Race-window collision — both pass CI separately,
would collide at merge time.)
Either case → exit 1 with a clear ::error:: message naming the
conflicting PR(s) so the author knows what to renumber.
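Sketch of the classification core (function names and regex are
illustrative; the real logic lives in
scripts/ops/check_migration_collisions.py):

```python
import re
from collections import defaultdict

# Hypothetical prefix classifier: extract the numeric prefix, then
# flag any base-branch file sharing a prefix with a DIFFERENT PR file.
_MIGRATION_RE = re.compile(r"^(\d+)_.+\.(up|down)\.sql$")

def migration_prefix(filename):
    m = _MIGRATION_RE.match(filename)
    return m.group(1) if m else None

def base_collisions(pr_files, base_files):
    by_prefix = defaultdict(set)
    for f in base_files:
        p = migration_prefix(f)
        if p:
            by_prefix[p].add(f)
    hits = []
    for f in pr_files:
        p = migration_prefix(f)
        if p:
            # same filename in both = in-place edit, not a collision
            hits.extend(sorted(by_prefix[p] - {f}))
    return hits
```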
Implementation notes:
- Uses git ls-tree (not working-tree walk) so it works against any
base ref without checkout.
- Uses gh pr diff --name-only per open PR, bounded by `gh pr list
--limit 100`. ~30s worst case for a busy repo, <5s normally.
- --diff-filter=AM picks up Added or Modified — renaming a migration
in place is also flagged (intentional; renaming migrations isn't
safe).
- Same filename in both PR and base = no collision (PR is editing
in-place, fine).
Tests:
scripts/ops/test_check_migration_collisions.py — 9 cases on the
regex classifier (the load-bearing piece). End-to-end git/gh path
is exercised by running the workflow against real PRs.
Hard-gates Tier 1 item 1 (#2341). Cheapest, cleanest gate. Catches
one specific class of merge-time foot-gun automatically.
Refs hard-gates discussion 2026-04-30. Tier 1 of 4 (others tracked
in #2342, #2343, #2344).
Issue: scripts/dev-start.sh assumed `go` was on PATH; on a fresh dev
box without Go installed, line 111 (`go run ./cmd/server`) failed
with `go: not found` and the script bailed before printing the
readiness banner. The script's own prerequisite list (lines 13-21)
readiness banner. The script's own prerequisite list (lines 13-21)
said "Go 1.25+" but there was no signpost between "open the doc" and
"command not found."
Fix: detect `go` via `command -v`. If present, keep the existing
`go run` path (fast iteration, attaches to local log). If not,
fall back to `docker compose up -d --build platform` which uses the
published platform container — slower first run but the script
still works without forcing the dev to install Go just to read logs.
Either path leaves /health on :8080 so the rest of the script's
wait loop is unchanged.
If both paths fail, the error message names the install URL
(https://go.dev/dl/) and the fallback diagnostic (`/tmp/molecule-platform.log`)
so the dev has a single, actionable next step.
Verified: `sh -n` syntax check passes.
Closes #2329 item 2.
CP's tenant-delete cascade removes the DNS record (with sweep-cf-orphans
as a backstop) but does NOT delete the underlying Cloudflare Tunnel.
Each E2E provision creates one Tunnel named `tenant-<slug>`; without
cleanup these accumulate indefinitely on the account, consuming the
tunnel quota and cluttering the dashboard.
Observed 2026-04-30: dozens of `tenant-e2e-canvas-*` tunnels in Down
state with zero replicas, weeks past their tenant's deletion. Same
class of bug as the DNS-records leak that drove sweep-cf-orphans
(controlplane#239).
Parallel-shape to sweep-cf-orphans:
- Same dry-run-by-default + --execute pattern
- Same MAX_DELETE_PCT safety gate (default 90% — higher than DNS
sweep's 50% because tenant-shaped tunnels are orphans by design)
- Same schedule/dispatch hardening (hard-fail on missing secrets
when scheduled, soft-skip when dispatched)
- Cron offset to :45 to avoid CF API bursts colliding with the DNS
sweep at :15
Decision rules (in order):
1. Name doesn't match `tenant-<slug>` → keep (unknown — never sweep
tunnels that might belong to platform infra).
2. Tunnel has active connections (status=healthy or non-empty
connections array) → keep (defense-in-depth: don't kill a live
tunnel even if CP forgot the org).
3. Slug ∈ {prod_slugs ∪ staging_slugs} → keep.
4. Otherwise → delete (orphan).
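The four rules, sketched in order (names and regex are illustrative;
the real decide logic lives in the sweep script):

```python
import re

# Hypothetical mirror of the decision rules above. First match wins.
_TENANT_RE = re.compile(r"^tenant-([a-z0-9-]+)$")

def decide(name, has_active_connections, live_slugs):
    m = _TENANT_RE.match(name)
    if not m:
        return "keep"   # rule 1: unknown shape, never sweep infra
    if has_active_connections:
        return "keep"   # rule 2: live tunnel, defense-in-depth
    if m.group(1) in live_slugs:
        return "keep"   # rule 3: slug still provisioned
    return "delete"     # rule 4: orphan
```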
Verified by:
- shell syntax check (bash -n)
- YAML lint
- Decide-logic offline smoke (7 cases, all pass)
- End-to-end dry-run smoke with stubbed CP + CF APIs
Required secrets (added to existing org-secrets):
CF_API_TOKEN must include account:cloudflare_tunnel:edit
scope (separate from zone:dns:edit used by
sweep-cf-orphans — same token if scope is
broad, or a new token if narrowly scoped).
CF_ACCOUNT_ID account that owns the tunnels (visible in
dash.cloudflare.com URL path).
CP_PROD_ADMIN_TOKEN reused from sweep-cf-orphans.
CP_STAGING_ADMIN_TOKEN reused from sweep-cf-orphans.
Note: CP-side root cause (tenant-delete should cascade to tunnel
delete) is in molecule-controlplane and worth fixing separately. This
janitor is the operational backstop in the meantime — same pattern
applied to DNS records when the same root cause was unaddressed.
Mirrors PR-C's Upload migration: replaces the docker-cp tar-stream
extraction with a streaming HTTP GET to the workspace's own
/internal/file/read endpoint. Closes the SaaS gap for downloads —
without this PR, GET /workspaces/:id/chat/download still returns 503
on Railway-hosted SaaS even after A+B+C+F land.
Stacks: PR-A #2313 → PR-B #2314 → PR-C #2315 → PR-F #2319 → this PR.
Why a single broad /internal/file/read instead of /internal/chat/download:
Today's chat_files.go::Download already accepts paths under any of the
four allowed roots {/configs, /workspace, /home, /plugins} — it's not
strictly chat. Future PRs (template export, etc.) will reuse this
endpoint via the same forward pattern; reusing avoids three near-
identical handlers (one per domain) with duplicated path-safety logic.
Path safety is duplicated on platform + workspace sides — defence in
depth via two parallel checks, not "trust the workspace."
Changes:
* workspace/internal_file_read.py — Starlette handler. Validates path
(must be absolute, under allowed roots, no traversal, canonicalises
cleanly). lstat (not stat) so a symlink at the path doesn't redirect
the read. Streams via FileResponse (no buffering). Mirrors Go's
contentDispositionAttachment for Content-Disposition header.
* workspace/main.py — registers GET /internal/file/read alongside the
POST /internal/chat/uploads/ingest from PR-B.
* scripts/build_runtime_package.py — adds internal_file_read to
TOP_LEVEL_MODULES so the publish-runtime cascade rewrites its
imports correctly. Also includes the PR-B additions
(internal_chat_uploads, platform_inbound_auth) since this branch
was rooted before PR-B's drift-gate fix; merge-clean alphabetic
additions.
* workspace-server/internal/handlers/chat_files.go — Download
rewritten as streaming HTTP GET forward. Resolves workspace URL +
platform_inbound_secret (same shape as Upload), builds GET request
with path query param, propagates response headers (Content-Type /
Content-Length / Content-Disposition) + body. Drops archive/tar
+ mime imports (no longer needed). Drops Docker-exec branch entirely
— Download is now uniform across self-hosted Docker and SaaS EC2.
* workspace-server/internal/handlers/chat_files_test.go — replaces
TestChatDownload_DockerUnavailable (stale post-rewrite) with 4
new tests:
- TestChatDownload_WorkspaceNotInDB → 404 on missing row
- TestChatDownload_NoInboundSecret → 503 on NULL column
(with RFC #2312 detail in body)
- TestChatDownload_ForwardsToWorkspace_HappyPath → forward shape
(auth header, GET method, /internal/file/read path) + headers
propagated + body byte-for-byte
- TestChatDownload_404FromWorkspacePropagated → 404 from
workspace propagates (NOT remapped to 500)
Existing TestChatDownload_InvalidPath path-safety tests preserved.
* workspace/tests/test_internal_file_read.py — 21 tests covering
_validate_path matrix (absolute, allowed roots, traversal, double-
slash, exact-match-on-root), 401 on missing/wrong/no-secret-file
bearer, 400 on missing path/outside-root/traversal, 404 on missing
file, happy-path streaming with correct Content-Type +
Content-Disposition, special-char escaping in Content-Disposition,
symlink-redirect-rejection (lstat-not-stat protection).
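A minimal sketch of the canonicalise-and-root portion of the path check
(names illustrative; the real `_validate_path` in
workspace/internal_file_read.py additionally does the lstat symlink
check):

```python
import posixpath

ALLOWED_ROOTS = ("/configs", "/workspace", "/home", "/plugins")

# Hypothetical validator: absolute path, canonicalises cleanly, must
# land strictly under an allowed root (the bare root itself is not a
# readable file). Returns the canonical path or None.
def validate_path(path):
    if not path.startswith("/"):
        return None
    canonical = posixpath.normpath(path)
    for root in ALLOWED_ROOTS:
        if canonical.startswith(root + "/"):
            return canonical
    return None
```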
Test results:
* go test ./internal/handlers/ ./internal/wsauth/ — green
* pytest workspace/tests/ — 1292 passed (was 1272 before PR-D)
Refs #2312 (parent RFC), #2308 (chat upload+download 503 incident).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #2265 renamed the harness trace endpoint and event name; sync the
cross-repo scripts/README.md to match.
Closes #2270
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The runner was speculatively calling `/workspaces/:id/heartbeat-history` —
that endpoint doesn't exist on workspace-server. On local dev it 404'd;
on tenant builds the platform's :8080 canvas-proxy fallback intercepted
it and returned 28KB of Next.js HTML which then landed in the JSON event
log. Neither outcome was useful trace data.
`GET /workspaces/:id/activity` is the existing endpoint that reads
activity_logs. That table already records the events the RFC §V1.0
step 6 'platform-side transition' check needs (a2a_send / a2a_receive /
task_update / agent_log / error, plus duration_ms + status). Rename
the runner's fetch + emitted event accordingly.
Verified: GET /workspaces/<uuid>/activity?since_secs=60 returns 200
with `[]` against the local platform; no SaaS skip needed since the
endpoint exists in both environments.
Refs: molecule-core#2256 (V1.0 gate #1 measurement comment).
Three review-driven fixes to the runner before #2261 merges:
1. `WAIT_ONLINE_SECS / 3` truncated; an operator passing 200 actually
waited 198s. Round up so 200 → 67 polls × 3s = 201s ≥ requested.
2. The heartbeat-history endpoint isn't on tenant workspace-servers —
the platform's :8080 fallback proxies unmatched paths to the
canvas Next.js, so the SaaS run captured 28KB of HTML in the
`heartbeat_trace` event log. Skip the fetch in MODE=saas; emit an
explicit `<skipped: ...>` placeholder. Local mode behaviour
unchanged.
3. ORG_ID and ORG_SLUG had no client-side format check, so a typo'd
value got swallowed by TenantGuard's intentionally-opaque 404
(which doesn't tell the operator whether slug, UUID, or auth was
wrong). Validate UUID and slug shape up front; matching errors
are actionable.
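Fix 1 is ceil division, sketched (function name illustrative):

```python
# Round the requested wait UP to a whole number of 3-second polls,
# so the total wait meets or exceeds what the operator asked for:
# 200 -> 67 polls x 3s = 201s (truncating division gave 66 -> 198s).
def poll_count(wait_online_secs, interval_secs=3):
    return -(-wait_online_secs // interval_secs)  # ceil division
```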
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two docs covering load-bearing patterns from today's work that
weren't previously discoverable:
1. workspace/platform_tools/README.md — explains the ToolSpec
single-source-of-truth pattern (#2240), the CLI-block alignment
gap that hand-maintained generation can't close (#2258), the
snapshot golden files + LF-pinning (#2260), and the add/rename/
remove playbook. The next reader who lands in
workspace/platform_tools/ now has the design rationale + the
safe-edit procedure colocated with the code.
2. scripts/README.md — disambiguates the three measure-coordinator-
task-bounds.sh files that now exist across two repos:
- scripts/measure-coordinator-task-bounds.sh (canonical OSS, this repo)
- scripts/measure-coordinator-task-bounds-runner.sh (Hermes/MiniMax variant, this repo)
- scripts/measure-coordinator-task-bounds.sh (production-shape, in molecule-controlplane)
Cross-references reference_harness_pair_pattern (auto-memory) for
the cross-repo design rationale. Documents the common safety
pattern (cleanup trap, DRY_RUN, non-target guard,
cleanup_*_failed events) and the heartbeat-trace caveat.
Refs: #2240, #2254, #2257, #2258, #2259, #2260; molecule-controlplane#321.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The original measure-coordinator-task-bounds.sh was hardcoded for
local-dev (workspace-server on :8080) with claude-code/langgraph
templates and OPENROUTER_API_KEY. Running it against staging requires
both auth-chain plumbing (per-tenant ADMIN_TOKEN + X-Molecule-Org-Id
TenantGuard header + tenant subdomain routing) and template/secret
flexibility (e.g. Hermes/MiniMax for Token Plan keys).
This adds:
* `measure-coordinator-task-bounds-runner.sh` — separate runner that
wraps the same workspace-server API calls but takes everything as
env-var inputs. Two MODE values:
- `local` → direct workspace-server (no auth/tenant scoping)
- `saas` → tenant subdomain + per-tenant ADMIN_TOKEN bearer +
X-Molecule-Org-Id TenantGuard header. Auto-fetches
tenant token via CP /cp/admin/orgs/<slug>/admin-token
given ORG_SLUG + CP_ADMIN_API_TOKEN, OR accepts a
pre-resolved TENANT_ADMIN_TOKEN.
* Configurable PM_TEMPLATE / CHILD_TEMPLATE / MODEL / SECRET_NAME /
SECRET_VALUE — defaults match the original (claude-code-default +
langgraph + OpenRouter). Hermes/MiniMax example documented in the
header.
* Per-poll status_change events during wait_online, so a workspace
that never reaches online surfaces its last status (provisioning,
failed, etc.) instead of a bare timeout.
* WAIT_ONLINE_SECS knob (default 180s; SaaS cold-start needs ~420s
for first hermes-image pull on a freshly-provisioned EC2 tenant).
* `${args[@]+...}` guard on the api() helper — avoids `set -u`
exploding on an empty header array (the local-dev hot-path).
The original script also gained a SECRET_VALUE block earlier in the
session — that change (separately staged) makes the secret-name
configurable without forcing every operator through the new runner.
V1.0 gate #1 (RFC #2251, Issue 4 repro) measurement results posted
as a separate comment on molecule-core#2256.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-review follow-ups on #2257:
- Drop `local exit_code=$?` from cleanup(). `trap`-handler return values
are ignored, so capturing $? only misled a future reader into thinking
exit-code preservation was happening.
- Replace silenced `>/dev/null 2>&1` DELETE with `-w '%{http_code}'`
capture. ADMIN_TOKEN expiring mid-run was the realistic failure mode
here — previously we swallowed it under the silenced redirect, leaving
workspaces leaked with no signal. Now a 401/403/5xx surfaces as a
`cleanup_failed` JSON event with a remediation hint pointing at
cleanup-rogue-workspaces.sh; 404 is treated as success (the
post-condition — workspace absent — holds).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three follow-ups from #2254 code review before the harness is safe to
run against staging:
1. Cleanup trap. Workspaces are now auto-deleted on EXIT/INT/TERM. A
Ctrl-C mid-run no longer leaks the PM + Researcher pair against
shared infra. KEEP_WORKSPACES=1 opts out for post-run inspection.
2. Tenant scoping + admin auth. Non-localhost PLATFORM values now
require both ADMIN_TOKEN and TENANT_ID; the script refuses to run
without them. The previous version sent unauthenticated POSTs that,
on staging, would either 401 every request or — worse — provision
into the wrong tenant. Memory `feedback_never_run_cluster_cleanup_
tests_on_live_platform` calls out the same hazard class.
3. DRY_RUN=1 mode. Prints platform target, tenant id, auth fingerprint,
and the planned actions, then exits before any state mutation. The
intended pre-flight before running against staging.
Also tightened OR_KEY check (the chained default silently accepted an
empty OPENROUTER_API_KEY) and added a heartbeat-trace caveat to the
interpretation guide explaining what `<endpoint_unavailable>` means
for the bound question.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a reproduction harness for Issue 4 of the 2026-04-28 CP review,
referenced in RFC molecule-core#2251. The RFC review (issue #2251
comment) flagged that Issue 4 was hypothesized but not reproduced
before V1.0 implementation begins — this script closes that gap.
What it does:
- Provisions a coordinator (PM, claude-code-default) + 1 child
(Researcher, langgraph) via the platform API.
- Sends an A2A kickoff with a synthesis-heavy task that requires
SYNTHESIS_DEPTH (default 3) sequential delegations followed by a
600-word post-delegation synthesis.
- Times the coordinator's full A2A round-trip with millisecond
precision and emits one JSON event per phase (machine-readable).
- Pulls the coordinator's heartbeat trace post-run so the team can
see whether any platform-side state transition fired during the
long synthesis (the V1.0 RFC's MAX_TASK_EXECUTION_SECS would
surface as such a transition; absence of one in this trace
confirms the RFC's premise).
Why a measurement harness, not a pass/fail test:
Issue 4's claim is "absence of platform-side bound", which is hard
to assert in a single CI run. Outputting structured measurement
data lets the team interpret across multiple runs / staging vs
prod / different SYNTHESIS_DEPTH values rather than relying on one
reproduction snapshot.
The script's header has the full interpretation guide:
- ELAPSED < 60s → not informative (LLM was just fast)
- 60–300s → within DELEGATION_TIMEOUT, ambiguous
- >= 300s without trace transitions → BUG CONFIRMED
- curl_failed → coordinator hung past A2A_TIMEOUT or genuinely
slow (disambiguate by querying status separately)
Doesn't run in CI by default — invoked manually against staging or a
local platform with PLATFORM=... and OPENROUTER_API_KEY=... env vars.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The PR-built wheel + import smoke gate refused the platform_tools
package because it's a new subdirectory under workspace/ that wasn't
in scripts/build_runtime_package.py:SUBPACKAGES. The drift gate (which
exists for exactly this reason) caught it cleanly:
error: SUBPACKAGES drifted from workspace/ subdirectories:
in workspace/ but NOT in SUBPACKAGES (will ship un-rewritten or
be excluded): ['platform_tools']
Adding platform_tools to SUBPACKAGES wires the package into the
runtime wheel + applies the canonical
from platform_tools.<x> -> from molecule_runtime.platform_tools.<x>
import-rewrite step that every other subpackage uses.
Verified locally: scripts/build_runtime_package.py succeeds, the
rewritten a2a_mcp_server.py reads
from molecule_runtime.platform_tools.registry import TOOLS
which matches the package layout in the wheel.
Manual fresh-user clean-slate test surfaced three friction points in
the existing dev-start.sh:
1. The script ran docker compose -f docker-compose.infra.yml
directly, bypassing infra/scripts/setup.sh — so the workspace
template registry was never populated and the canvas template
palette came up empty (the "Template palette is empty"
troubleshooting hit).
2. ADMIN_TOKEN was not handled at all. Without it, the AdminAuth
fail-open gate worked initially but slammed shut the moment the
first workspace registered a token — at which point the canvas
could no longer call /workspaces or /templates. New users hit
401s with no obvious next step.
3. The script wasn't mentioned in docs/quickstart.md. New users
followed the documented 4-step manual flow and never discovered
the single command existed.
Fixes:
- dev-start.sh now calls infra/scripts/setup.sh, which brings up
full infra (postgres + redis + langfuse + clickhouse + temporal)
AND populates the template/plugin registry from manifest.json.
- On first run, dev-start.sh writes MOLECULE_ENV=development to
.env. This activates middleware.isDevModeFailOpen() which lets
the canvas keep calling admin endpoints without a bearer (the
intended local-dev escape hatch). The .env is preserved on
re-runs and sourced before the platform launches.
- The script intentionally does NOT auto-generate an ADMIN_TOKEN.
A first attempt did, and broke the canvas because isDevModeFailOpen
requires ADMIN_TOKEN empty AND MOLECULE_ENV=development together.
Setting ADMIN_TOKEN in dev would close the hatch and the canvas
has no way to read that token in a dev build (no
NEXT_PUBLIC_ADMIN_TOKEN bake step here). The .env comment block
explicitly warns future contributors not to add it.
- Both processes' logs go to /tmp/molecule-{platform,canvas}.log
instead of stdout-mixed so the readiness banner stays clean.
- Health-poll loops cap at 30s with a clear timeout error pointing
to the log file, instead of hanging forever.
- The readiness banner now lists the log paths AND tells the user
the next step is "open localhost:3000 → add API key in Config →
Secrets & API Keys → Global", instead of just listing service
URLs.
Quickstart doc rewrite leads with:
git clone ...
cd molecule-monorepo
./scripts/dev-start.sh
The 4-step manual flow is preserved as "Manual setup (advanced)"
for contributors who want per-component logs.
Verified end-to-end from clean Docker (no containers, no volumes,
no .env) three times: total wall-clock ~12s for a re-run with
cached npm/docker layers. Platform's HTTP 200 on /workspaces
without a bearer confirms the dev-mode auth hatch is active.
Two compounding bugs that bit hermes (and any other workspace that
reaches main.py:142):
1. workspace/lib/ was in EXCLUDE_DIRS so the published wheel didn't
contain the directory at all. main.py imports `from lib.pre_stop
import read_snapshot` (and `build_snapshot`, `write_snapshot`) so
every workspace startup that reaches the snapshot path crashed
with `ModuleNotFoundError: No module named 'lib'`.
2. Even if lib/ had shipped, `lib` wasn't in SUBPACKAGES so the
import-rewriter would have left the bare `from lib.pre_stop`
unqualified — it would still fail because the package would only
be reachable as `molecule_runtime.lib`.
Fix: move `lib` from EXCLUDE_DIRS to SUBPACKAGES (one entry each).
Drift gate extension: the existing gate I added in #2163 only
asserted TOP_LEVEL_MODULES against workspace/*.py. This change adds
the symmetric assertion for SUBPACKAGES against workspace/<dir>/
(filtered by EXCLUDE_DIRS + presence of __init__.py). Catches both:
- Subpackage added to workspace/ but missed in SUBPACKAGES
- Subpackage missing from workspace/ but lingering in SUBPACKAGES
- Subpackage wrongly in EXCLUDE_DIRS while also referenced by
rewritten imports (the lib case)
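Hedged sketch of the symmetric SUBPACKAGES assertion (set names from
this commit; the body is illustrative, not the script's actual code):

```python
from pathlib import Path

# Derive the on-disk subpackage set (dirs with __init__.py, minus
# EXCLUDE_DIRS) and diff it against the hand-curated SUBPACKAGES set
# in both directions, so drift either way fails the build loudly.
def subpackage_drift(workspace_dir, subpackages, exclude_dirs):
    on_disk = {
        p.name for p in Path(workspace_dir).iterdir()
        if p.is_dir()
        and (p / "__init__.py").exists()
        and p.name not in exclude_dirs
    }
    return {
        "missing_from_SUBPACKAGES": sorted(on_disk - subpackages),
        "stale_in_SUBPACKAGES": sorted(subpackages - on_disk),
    }
```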
Tested locally: build of 0.1.99 now ships lib/ and main.py contains
`from molecule_runtime.lib.pre_stop import ...` correctly rewritten.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#2000 fixed one symptom — TENANT_IMAGE pinned to `staging-a14cf86`
(10 days stale) silently no-op'd four upstream fixes on 2026-04-24.
This adds the audit pattern as a re-runnable script so the broader
class is observable on demand without new CI infrastructure.
Audit results today (2026-04-27):
controlplane / production: 54 vars audited, 0 drift-prone pins
controlplane / staging: 52 vars audited, 0 drift-prone pins
So the immediate audit deliverable is clean — TENANT_IMAGE is the only
known violation and #2000 already fixed it. The script makes the
ongoing audit a 5-second command instead of a manual one.
Detection regex catches:
* branch-SHA suffixes (`staging|main|prod|production-<6+ hex>`)
— the exact 2026-04-24 incident shape
* version pins after `:` or `=` (`:v1.2.3`, `=v0.1.16`)
— same drift class, just rendered differently
Anchoring on `:` or `=` keeps prose like "version 1.2.3 of the api"
out of the false-positive set. UUIDs, ARNs, AMI IDs, secrets, and
floating tags (`:staging-latest`, `:main`) pass through untouched.
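The two pattern families, reconstructed as a hedged sketch (the real
regex is inlined in the audit script and its test; names here are
illustrative):

```python
import re

# Pattern 1: branch-SHA suffixes, the 2026-04-24 incident shape.
# Pattern 2: version pins, anchored on ':' or '=' to keep prose
# like "version 1.2.3 of the api" out of the flagged set.
BRANCH_SHA = re.compile(r"(?:staging|main|prod|production)-[0-9a-f]{6,}")
VERSION_PIN = re.compile(r"[:=]v?\d+\.\d+\.\d+")

def is_drift_prone(value):
    return bool(BRANCH_SHA.search(value) or VERSION_PIN.search(value))
```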
Regression test (tests/ops/test_audit_railway_sha_pins.sh) pins 20
representative cases — 9 should-flag (covering all four branch
prefixes + semver variants + middle-of-value matches) and 11
should-pass (the false-positive guards). Same regex inlined in both
files so a future tweak that weakens detection fails the test in
lockstep with weakening the audit.
Both files shellcheck clean.
CI gate (acceptance criterion's "regression: add a CI check") is
deliberately scoped out — querying Railway from CI requires plumbing
RAILWAY_TOKEN as a repo secret, which is multi-step setup. The
re-runnable script + test cover the same surface today; the CI
workflow is a small follow-up once the token is provisioned.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two compounding bugs surfaced when 0.1.16 hit production today:
1. scripts/build_runtime_package.py had a hand-curated TOP_LEVEL_MODULES
set listing every workspace/*.py that should get its bare imports
rewritten to `molecule_runtime.X`. The set silently went stale:
- Missing: transcript_auth (added since #87 phase 1c), runtime_wedge,
watcher → unrewritten imports shipped, every workspace startup
died with ModuleNotFoundError.
- Stale: claude_sdk_executor, cli_executor (both removed in #87),
hermes_executor (never existed) → harmless but misleading.
2. publish-runtime.yml's wheel-smoke step asserted on stable invariants
(BaseAdapter, AdapterConfig, a2a_client error sentinel) but never
imported main. So even though main.py held the broken bare
`from transcript_auth import ...`, the smoke check passed.
Fixes:
- Build script now derives the on-disk module set from workspace/*.py
and asserts it matches TOP_LEVEL_MODULES exactly. Drift in either
direction fails the build with a specific diff message instead of
shipping a broken wheel. Closed-list typo guard preserved (we still
edit the set explicitly when a module is added/removed) — the gate
just makes drift impossible to ignore.
- TOP_LEVEL_MODULES updated to current reality: drop the 3 stale,
add the 3 missing.
- publish-runtime.yml wheel-smoke now `import molecule_runtime.main`
before the invariant asserts. main is the entry point and
transitively imports every module — any bare-import bug surfaces
as ModuleNotFoundError before PyPI accepts the upload.
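Minimal sketch of the drift assertion (set name from this commit; the
body is illustrative, not build_runtime_package.py's actual code):

```python
from pathlib import Path

# Derive the on-disk module set from workspace/*.py and diff it
# against the hand-curated TOP_LEVEL_MODULES in both directions.
def module_drift(workspace_dir, top_level_modules):
    on_disk = {p.stem for p in Path(workspace_dir).glob("*.py")}
    missing = sorted(on_disk - top_level_modules)  # would ship un-rewritten
    stale = sorted(top_level_modules - on_disk)    # listed but deleted
    return missing, stale
```

Failing the build on either diff makes drift impossible to ignore while
keeping the closed-list typo guard.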
Tested locally: `python3 scripts/build_runtime_package.py
--version 0.1.99 --out /tmp/build-test` succeeds, and
/tmp/build-test/molecule_runtime/main.py contains the rewritten
`from molecule_runtime.transcript_auth import ...`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two paper cuts the fix addresses:
1. nuke-and-rebuild.sh wipes the compose stack but never re-populates
workspace-configs-templates/, org-templates/, or plugins/. Those dirs
are .gitignored — the curated set lives in manifest.json as external
repos cloned via clone-manifest.sh (idempotent). Without that step,
a fresh checkout or a post-deletion run leaves the dirs empty, which
silently hides the entire template palette in Canvas + falls back to
bare default workspace provisioning. Symptom: "Deploy your first
agent" shows zero templates.
2. The existing ws-* container reap was already in the script (good),
but it only fires when this script runs. Folks running `docker compose
down -v` directly leave orphan ws-* containers behind. Documented
that explicitly in the script comment so future readers understand
why those lines are critical.
The fix is just `bash clone-manifest.sh` added to the script. clone-
manifest.sh is idempotent — populated dirs short-circuit, so a re-nuke
on a healthy machine pays only a few stat calls.
scripts/test-nuke-and-rebuild.sh exercises the canonical workflow end-
to-end:
- plants a fake orphan ws-* container, then asserts it gets reaped
- renames the manifest dirs to simulate a fresh checkout, then
asserts they get repopulated
- waits for /health and asserts the platform sees the same template
count on disk as via /configs in the container (catches bind-mount
drift)
- asserts the image-auto-refresh watcher (PR #2114) starts, since
that's load-bearing for the CD chain users now rely on
The test pre-flights ports 5432/6379/8080 and exits 0 with a SKIP
message if a non-target compose project is holding them — common when
parallel monorepo checkouts coexist on one Docker daemon.
scripts/ is intentionally outside CI shellcheck per ci.yml comment, but
both files pass `shellcheck --severity=warning` anyway.
Defers but does not solve the runtime root-cause for orphan ws-* after
plain `docker compose down -v`: the orphan-sweeper in the platform only
reaps containers whose workspace row says status='removed', so a wiped
DB → no row → sweeper ignores them. Proper fix needs container labels
keyed to a per-platform-instance UUID so the sweeper can confidently
reap "containers I provisioned that aren't in my DB anymore" without
nuking a sibling platform's containers on a shared daemon. Tracked as
task #109's follow-up; out of scope for this PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The production-side end of the runtime CD chain. Operators (or the post-
publish CI workflow) hit this after a runtime release to pull the latest
workspace-template-* images from GHCR and recreate any running ws-* containers
so they adopt the new image. Without this, freshly-published runtime sat in
the registry but containers kept the old image until naturally cycled.
Implementation notes:
- Uses Docker SDK ImagePull rather than shelling out to docker CLI — the
alpine platform container has no docker CLI installed.
- ghcrAuthHeader() reads GHCR_USER + GHCR_TOKEN env, builds the base64-
encoded JSON payload Docker engine expects in PullOptions.RegistryAuth.
Both empty → public/cached images only; both set → private GHCR pulls.
- Container matching uses ContainerInspect (NOT ContainerList) because
ContainerList returns the resolved digest in .Image, not the human tag.
Inspect surfaces .Config.Image which is what we need.
- Provisioner.DefaultImagePlatform() exported so admin handler picks the
same Apple-Silicon-needs-amd64 platform as the provisioner — single
source of truth for the multi-arch override.
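A Python sketch of the auth payload shape described above (the real
code is Go; field names follow the Docker Engine AuthConfig shape,
values illustrative):

```python
import base64
import json

# Build the base64url-encoded JSON the engine expects in
# PullOptions.RegistryAuth. Both env values empty -> empty string,
# i.e. public/cached images only.
def ghcr_auth_header(user, token):
    if not user and not token:
        return ""
    payload = json.dumps({
        "username": user,
        "password": token,
        "serveraddress": "ghcr.io",
    })
    return base64.urlsafe_b64encode(payload.encode()).decode()
```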
Local-dev companion: scripts/refresh-workspace-images.sh runs on the
host and inherits the host's docker keychain auth — alternate path for
when GHCR_USER/TOKEN aren't set in the platform env.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Code-quality + efficiency review of PR #2079:
- Hoist all_slugs = prod_slugs | staging_slugs out of decide() into the
caller (was rebuilt on every record — 1k records × ~50-slug union per
call). decide() signature now (r, all_slugs, ec2_names).
- Compile regexes at module scope (_WS_RE, _E2E_RE, _TENANT_RE) +
hoist platform-core literal set (_PLATFORM_CORE_NAMES). Same change
mirrored in the bash heredoc.
- Drop decorative # Rule N: comments (numbering was out of order, 3
before 2 — actively confusing).
- Move the "edits must mirror" reminder OUTSIDE the CANONICAL DECIDE
block in the .sh file, eliminating the .replace() comment-skip hack
in TestParityWithBashScript.
- Drop per-line .strip() in _slice_canonical (would mask a real
indentation bug; both blocks already at column 0).
- subTest() in TestPlatformCore loops so a single failure no longer
short-circuits the rest of the items.
- merge_group + concurrency on test-ops-scripts.yml (parity with
ci.yml gate behaviour).
- Fix the apostrophe in a `don't` inline comment that closed the python
heredoc's single-quote and broke bash -n.
All 25 tests still pass. bash -n clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #2027.
The CF orphan sweep deletes DNS records — a misclassification could nuke
a live workspace's tunnel. The decision function had MAX_DELETE_PCT
percentage gating but no automated test of category → action mapping.
Approach: extract the decide() function to scripts/ops/sweep_cf_decide.py
as a verbatim copy bracketed by `# CANONICAL DECIDE BEGIN/END` markers.
The shell script keeps its inline heredoc (so the operational path is
untouched) but bracketed by the same markers. A parity test
(TestParityWithBashScript) reads both files and asserts the bracketed
blocks match line-for-line — drift fails CI loudly.
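The parity slice, sketched (marker strings are from this commit;
function name illustrative):

```python
# Extract the lines between the CANONICAL DECIDE markers from either
# file's text; the parity test asserts the two slices are identical
# line-for-line, so editing one copy without the other fails CI.
BEGIN = "# CANONICAL DECIDE BEGIN"
END = "# CANONICAL DECIDE END"

def slice_canonical(text):
    lines = text.splitlines()
    return lines[lines.index(BEGIN) + 1 : lines.index(END)]
```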
Coverage (25 tests, 1 file, stdlib unittest only):
- Rule 1 platform-core: apex, _vercel, _domainkey, www/api/app/doc/send/status/staging-api
- Rule 3 ws-*: live (matches EC2 prefix) on prod + staging; orphan on prod + staging
- Rule 4 e2e-*: live + orphan on staging; orphan on prod
- Rule 2 generic tenant: live prod + staging; unknown subdomain kept-for-safety
- Rule 5 fallthrough: external domain + unrelated apex
- Rule priority: api.moleculesai.app stays platform-core (not tenant); _vercel stays verification
- Safety gate: under/at/over default 50% threshold; zero-total no-divide; custom threshold
- Empty live-sets: documents that decide() alone classifies everything as
orphan when the live sets are empty; the safety gate is the defense
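The gate semantics the under/at/over cases pin down can be sketched like this (names and the zero-total behavior are assumptions for illustration; the commits only state ">50% refuses" and "zero-total no-divide"):

```python
MAX_DELETE_PCT_DEFAULT = 50

def gate_allows(to_delete: int, total: int,
                max_pct: int = MAX_DELETE_PCT_DEFAULT) -> bool:
    """Refuse when the delete fraction exceeds max_pct; at-threshold passes."""
    if total == 0:
        # Zero-total: avoid the divide; refusing is the assumed safe default.
        return False
    # Integer comparison avoids float rounding at the boundary.
    return to_delete * 100 <= max_pct * total
```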
CI: new .github/workflows/test-ops-scripts.yml runs `python -m unittest
discover` against scripts/ops/ on every PR/push that touches the
directory. Lightweight — no requirements file, stdlib only.
Local: `cd scripts/ops && python -m unittest test_sweep_cf_decide -v` →
25 tests, all OK.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The script's own help text documents `MAX_DELETE_PCT=62 ./sweep-cf-orphans.sh`
as the way to relax the safety gate, but the in-script assignment on line 35
was unconditional and overwrote any env value — so the override never worked.
During today's staging tenant-provision recovery (CP #255 context), hit the
57%-delete threshold and needed the documented override to clear 64 orphan
records. The change to `${MAX_DELETE_PCT:-50}` honors the env
while keeping the 50% default when no caller overrides it.
Ran with MAX_DELETE_PCT=62 after the fix — deleted 64 records, CF zone 111→47.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the "panic-button at >65 records" manual sweep that nukes
every pattern-match unconditionally (would delete live workspaces
along with orphans).
This version:
- Queries CP prod + staging /admin/orgs for live tenant slugs
- Queries AWS EC2 describe-instances for live workspace Name tags
- Only deletes CF records whose slug/ws-id has no live counterpart
- Dry-run by default (--execute to actually delete)
- Safety gate refuses to delete >50% of records (configurable via
MAX_DELETE_PCT env var) — catches the "API returned zero orgs, every
tenant looks orphan" failure mode before it nukes production
- Per-category accounting: orphan-ws / orphan-e2e-tenant / etc.
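The "no live counterpart" rule above, as a hedged sketch (label extraction and the ws- prefix check are assumptions based on the commit's description, not the script's exact code):

```python
# Illustrative only: a record is deletable iff its leading label has no
# live counterpart among CP org slugs or EC2 workspace Name tags.
def is_orphan(record_name: str, live_slugs: set[str], live_ws: set[str]) -> bool:
    label = record_name.split(".")[0]
    if label.startswith("ws-"):
        return label not in live_ws
    return label not in live_slugs
```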
Usage:
CF_API_TOKEN=... CF_ZONE_ID=... \
CP_PROD_ADMIN_TOKEN=... CP_STAGING_ADMIN_TOKEN=... \
bash scripts/ops/sweep-cf-orphans.sh # dry-run
bash scripts/ops/sweep-cf-orphans.sh --execute # actually delete
Ref: #1976 (root-cause: tenant.Delete + workspace.Delete don't clean
their CF records — until that's fixed, this script is the maintenance
path)
Co-authored-by: Hongming Wang <hongmingwang.rabbit@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: molecule-ai[bot] <276602405+molecule-ai[bot]@users.noreply.github.com>
The Canvas template palette was empty on a fresh clone because
`workspace-configs-templates/`, `org-templates/`, and `plugins/` are
gitignored and nothing populated them. The registry already exists —
`manifest.json` at repo root lists every curated
`workspace-template-*`, `org-template-*`, and `plugin-*` repo, and
`scripts/clone-manifest.sh` clones them — but the step was absent
from the README and setup.sh, so new users never ran it.
### What this commit does
**1. `setup.sh` runs `clone-manifest.sh` automatically** (once).
After starting the Docker network but before booting infra, iterate
`manifest.json` and clone any workspace_templates / org_templates /
plugins that aren't already populated. Idempotent — subsequent
runs skip dirs that have content. Requires `jq`; when jq is missing
the step prints a clear install hint and skips (doesn't fail).
**2. `clone-manifest.sh` is idempotent.** Before running `git clone`,
check whether the target directory already exists and is non-empty —
skip if so. Lets `setup.sh` rerun safely without forcing the operator
to delete already-cloned template repos.
**3. `ListTemplates` logs the reason it skips a template.** The
handler previously swallowed `resolveYAMLIncludes` errors with
`continue`, so a broken template showed up as an empty palette with
no log trail. Now the include-expansion and yaml.Unmarshal failure
paths both emit a descriptive `log.Printf` — the exact message that
made the stale `org-templates/molecule-dev/` snapshot debuggable:
ListTemplates: skipping molecule-dev — !include expansion failed:
!include "core-platform.yaml" at line 25: open .../teams/
core-platform.yaml: no such file or directory
**4. Remove the in-tree `org-templates/molecule-dev/` snapshot** (170
files). Matches the explicit intent of prior commit
`bfec9e53` — "remove org-templates/molecule-dev/ — standalone repo
is source of truth". A later "full staging snapshot" re-added a
partial copy that had `!include` references to 7 role files that
never existed in the snapshot (`core-platform.yaml`,
`controlplane.yaml`, `app-docs.yaml`, `infra.yaml`, `sdk.yaml`,
`release-manager/workspace.yaml`, `integration-tester/workspace.yaml`).
`clone-manifest.sh` repopulates it fresh from
`Molecule-AI/molecule-ai-org-template-molecule-dev`.
.gitignore exception for `molecule-dev/` is dropped accordingly
— the whole `/org-templates/*` tree is now gitignored, symmetric
with `/plugins/` and `/workspace-configs-templates/`.
**5. Doc updates** (README, README.zh-CN, CONTRIBUTING) mention `jq`
as a prerequisite and describe what setup.sh now does.
### Verification
On a fresh-nuked DB with the updated branch:
1. `bash infra/scripts/setup.sh` — cleanly clones 33/33 manifest
repos (20 plugins, 8 workspace_templates, 5 org_templates), then
boots infra. Second run skips all 33 (idempotent).
2. `go run ./cmd/server` — "Applied 41 migrations", :8080 healthy.
3. `curl http://localhost:8080/org/templates` returns 4 templates
(was `[]`):
- Free Beats All
- MeDo Smoke Test
- Molecule AI Worker Team (Gemini)
- Reno Stars Agent Team
4. `bash tests/e2e/test_api.sh` — 61/61 pass.
5. `npx vitest run` in canvas — 902/902 pass.
6. `shellcheck infra/scripts/setup.sh` — clean.
### SaaS parity
All changes are local-dev surface. `setup.sh`, `clone-manifest.sh`,
and the local `org-templates/` directory aren't part of the CP
provisioner path — SaaS tenant machines get their templates via
Dockerfile layers or CP-side provisioning, not `clone-manifest.sh`.
The `ListTemplates` log addition is harmless either way (replaces a
silent `continue` with a `log.Printf + continue`).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>