auto-sync-main-to-staging.yml hasn't fired since 2026-04-29 despite
multiple staging→main promotes since. The promote PR #2442 (Phase 2)
has been wedged on `mergeStateStatus: BEHIND` for hours because
staging is missing the merge commit from PR #2437.
Three compounding bugs, all fixed here:
1. **GitHub no-recursion suppresses the `on: push` trigger.**
When the merge queue lands a staging→main promote, the resulting
push to main is "by GITHUB_TOKEN", and per
https://docs.github.com/en/actions/using-workflows/triggering-a-workflow#triggering-a-workflow-from-a-workflow
that push event does NOT fire any downstream workflows. Verified
empirically against SHA 76c604fb (PR #2437): exactly ONE workflow
fired on that push — `publish-workspace-server-image`, dispatched
explicitly by auto-promote-staging.yml's polling tail with an App
token (the documented #2357 workaround). Every other `on: push`
workflow on main, including auto-sync, was silently suppressed.
Same fix extended here: auto-promote-staging.yml's polling tail
now ALSO dispatches `auto-sync-main-to-staging.yml --ref main`
via the App token after the merge lands. App-initiated dispatch
propagates `workflow_run` cascades, which is what the publish
tail relies on too. Failure path: emits `::error::` with the
recovery command — operator runs it once and the next promote
self-heals.
auto-sync.yml gains `workflow_dispatch:` so it can be invoked
from the dispatch above + manually if a future promote also
misses (defense in depth).
2. **`runs-on: [self-hosted, macos, arm64]` was wrong for this repo.**
Comment claimed "matches the rest of this repo's workflows" — false:
this is the ONLY workflow in molecule-core/.github/workflows/ with
a non-ubuntu runs-on. Copy-paste artefact from molecule-controlplane
(which IS private and has a Mac runner). molecule-core has no Mac
runner registered, so even when the trigger DID fire (the 3 historic
manual-UI merges), the job would have sat unassigned if the runner
were offline. Switched to `ubuntu-latest` to match every other
workflow in this repo.
3. **The `on: push` trigger remains** as a defense-in-depth path for
the rare case of a manual UI merge by a real user (which uses
their PAT and DOES fire downstream workflows — confirmed via the
2026-04-29 d35a2420 run with `triggering_actor=HongmingWang-Rabbit`
that fired 16 workflows including auto-sync). Belt-and-suspenders.
Long-term: switching auto-promote's `gh pr merge --auto` call to use
the App token (instead of GITHUB_TOKEN) would let `on: push` triggers
fire naturally and obviate the need for the explicit dispatches in
the polling tail. Tracked in #2357 — out of scope here.
Operator recovery for the current Phase 2 wedge: after this lands on
staging, dispatch auto-sync once via
`gh workflow run auto-sync-main-to-staging.yml --ref main` to
backfill the missed sync from 76c604fb. PR #2442 will go from
BEHIND → CLEAN and auto-merge.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two follow-ups from PR #2494's review:
1. Two new sweep tests exercise the lookup path through
sweepStuckProvisioning end-to-end:
- ManifestOverrideSparesRow: claude-code 11min old, manifest=20min
→ no UPDATE, no broadcast (sparing works through the sweeper)
- ManifestOverrideStillFlipsPastDeadline: claude-code 21min old,
manifest=20min → flipped + payload.timeout_secs=1200
Closes the gap that the unit test on provisioningTimeoutFor alone
left open: a future refactor could drop the lookup arg from the
sweeper's call and the unit test alone would never notice. Verified by
regression-injecting `lookup→nil` in sweepStuckProvisioning — both
new tests fail, the old ones still pass.
2. addProvisionTimeoutMs now goes through ProvisionTimeoutSecondsForRuntime
instead of calling provisionTimeouts.get directly. Single accessor
path for the same data — the canvas response and the sweeper now
resolve identically by construction.
No production behavior change; tests + accessor cleanup only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two review nits from PR #2493 that don't affect correctness but matter
for honesty in the harness's own self-documentation:
1. tenant-isolation.sh F3/F4 used assert_status for non-HTTP values.
LEAKED_INTO_ALPHA/BETA are jq-derived counts, not HTTP codes — but
the assertion ran through assert_status, which formats the result
as "(HTTP 0)". Anyone reading the test output would believe these
assertions involved an HTTP call. Adds a plain `assert` helper
matching per-tenant-independence.sh's pattern, and uses it on the
two count comparisons.
2. per-tenant-independence.sh Phase F over-claimed coverage.
The comment said the concurrent-INSERT race catches "shared-pool
corruption" + "lib/pq prepared-statement cache collision". Both
are real failure modes — but neither can fire across tenants in
THIS topology, because each tenant owns its own DATABASE_URL and
its own postgres-{alpha,beta} container. The comment now lists
only what the test actually catches (redis cross-keyspace bleed,
shared cp-stub state corruption, cf-proxy buffer mixup) and notes
that a future shared-Postgres variant is the right place for the
lib/pq cache assertion.
No behavioural change — both replays still pass 13/13 + 12/12, all six
replays pass on a clean run-all-replays.sh boot.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Real wiring gap discovered while investigating the issue #2486 cluster of
prod claude-code workspaces that failed at exactly 10 min. The
runtimeProvisionTimeoutsCache (#2054 phase 2) reads
runtime_config.provision_timeout_seconds from each template's
config.yaml so the **canvas** spinner respects per-template timeouts —
but the **sweeper** in registry/provisiontimeout.go hardcoded 10 min
(claude-code) / 30 min (hermes) and never consulted the manifest. So a
template that declared a longer window had a UI that waited correctly
but a sweeper that killed the row at the hardcoded floor anyway.
Resolution order pinned by new TestProvisioningTimeout_ManifestOverride:
1. PROVISION_TIMEOUT_SECONDS env (ops-debug global override)
2. Template manifest lookup (per-runtime, beats hermes default too)
3. Hermes default (30 min — CP bootstrap-watcher 25 min + 5 min slack)
4. DefaultProvisioningTimeout (10 min)
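A minimal sketch of that resolution order (names mirror the description above — RuntimeTimeoutLookup, DefaultProvisioningTimeout — but the exact signatures and the runtime-name check in registry/provisiontimeout.go may differ):

```go
package registry

import (
	"os"
	"strconv"
	"time"
)

// RuntimeTimeoutLookup resolves a per-runtime timeout from the template
// manifest cache; ok=false means the runtime has no manifest entry.
type RuntimeTimeoutLookup func(runtime string) (time.Duration, bool)

const (
	DefaultProvisioningTimeout = 10 * time.Minute
	hermesProvisioningTimeout  = 30 * time.Minute
)

// provisioningTimeoutFor applies the four-step resolution order above:
// env override, manifest lookup, hermes default, global floor.
func provisioningTimeoutFor(runtime string, lookup RuntimeTimeoutLookup) time.Duration {
	// 1. PROVISION_TIMEOUT_SECONDS env (ops-debug global override).
	if raw := os.Getenv("PROVISION_TIMEOUT_SECONDS"); raw != "" {
		if secs, err := strconv.Atoi(raw); err == nil && secs > 0 {
			return time.Duration(secs) * time.Second
		}
	}
	// 2. Template manifest lookup (per-runtime, beats the hermes default too).
	if lookup != nil {
		if d, ok := lookup(runtime); ok && d > 0 {
			return d
		}
	}
	// 3. Hermes default (CP bootstrap-watcher 25 min + 5 min slack).
	if runtime == "hermes" {
		return hermesProvisioningTimeout
	}
	// 4. Global floor.
	return DefaultProvisioningTimeout
}
```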
Wiring:
- registry: new RuntimeTimeoutLookup function type, threaded through
StartProvisioningTimeoutSweep + sweepStuckProvisioning + the
pre-existing provisioningTimeoutFor.
- handlers: ProvisionTimeoutSecondsForRuntime exposes the cache's
lookup as a method so main.go can pass it without breaking the
handlers→registry import direction.
- cmd/server/main.go: wire wh.ProvisionTimeoutSecondsForRuntime into
the sweep boot.
Verified:
- go test -race ./... passes (every workspace-server package).
- Regression-injected the lookup arm: 3 manifest-override subcases
fail with the actual-vs-expected gap, confirming the new test is
load-bearing.
- The original two timeout tests (env-override, hermes default) keep
passing — `lookup=nil` argument preserves their semantics.
Operator action enabled: a template wanting a 15-min window can now
just set `runtime_config.provision_timeout_seconds: 900` in its
config.yaml and the sweeper honours it on the next workspace-server
restart.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Brings the local harness from "single tenant covering the request path"
to "two tenants covering both the request path AND the per-tenant
isolation boundary" — the same shape production runs (one EC2 + one
Postgres + one MOLECULE_ORG_ID per tenant).
Why this matters: the four prior replays exercise the SaaS request
path against one tenant. They cannot prove that TenantGuard rejects
a misrouted request (production CF tunnel + AWS LB are the failure
surface), nor that two tenants doing legitimate work in parallel
keep their `activity_logs` / `workspaces` / connection-pool state
partitioned. Both are real bug classes — TenantGuard allowlist drift
shipped in #2398, and lib/pq prepared-statement cache collision is
documented as an org-wide hazard.
What changed:
1. compose.yml — split into two tenants.
tenant-alpha + postgres-alpha + tenant-beta + postgres-beta + the
shared cp-stub, redis, cf-proxy. Each tenant gets a distinct
ADMIN_TOKEN + MOLECULE_ORG_ID and its own Postgres database. cf-proxy
depends on both tenants becoming healthy.
2. cf-proxy/nginx.conf — Host-header → tenant routing.
`map $host $tenant_upstream` resolves the right backend per request.
Required `resolver 127.0.0.11 valid=30s ipv6=off;` because nginx
needs an explicit DNS resolver to use a variable in `proxy_pass`
(literal hostnames resolve once at startup; variables resolve per
request — without the resolver nginx fails closed with 502).
`server_name` lists both tenants + the legacy alias so unknown Host
headers don't silently route to a default and mask routing bugs.
3. _curl.sh — per-tenant + cross-tenant-negative helpers.
`curl_alpha_admin` / `curl_beta_admin` set the right
Host + Authorization + X-Molecule-Org-Id triple.
`curl_alpha_creds_at_beta` / `curl_beta_creds_at_alpha` exist
precisely to make WRONG requests (replays use them to assert
TenantGuard rejects). `psql_exec_alpha` / `psql_exec_beta` shell out
per-tenant Postgres exec. Legacy aliases (`curl_admin`, `psql_exec`)
keep the four pre-Phase-2 replays working without edits.
4. seed.sh — registers parent+child workspaces in BOTH tenants.
Captures server-generated IDs via `jq -r '.id'` (POST /workspaces
ignores body.id, so the older client-side mint silently desynced
from the workspaces table and broke FK-dependent replays). Stashes
`ALPHA_PARENT_ID` / `ALPHA_CHILD_ID` / `BETA_PARENT_ID` /
`BETA_CHILD_ID` to .seed.env, plus legacy `ALPHA_ID` / `BETA_ID`
aliases for backwards compat with chat-history / channel-envelope.
5. New replays.
tenant-isolation.sh (13 assertions) — TenantGuard 404s any request
whose X-Molecule-Org-Id doesn't match the container's
MOLECULE_ORG_ID. Asserts the 404 body has zero
tenant/org/forbidden/denied keywords (the existence of a tenant must
not be inferable from the outside). Covers cross-tenant routing
misconfiguration + allowlist drift + missing-org-header.
per-tenant-independence.sh (12 assertions) — both tenants seed
activity_logs in parallel with distinct row counts (3 vs 5) and
confirm each tenant's history endpoint returns exactly its own
counts. Then a concurrent INSERT race (10 rows per tenant in
parallel via `&` + wait) catches shared-pool corruption +
prepared-statement cache poisoning + redis cross-keyspace bleed.
6. Bug fix: down.sh + dump-logs SECRETS_ENCRYPTION_KEY validation.
`docker compose down -v` validates the entire compose file even
though it doesn't read the env. up.sh generates a per-run key into
its own shell — down.sh runs in a fresh shell that wouldn't see it,
so without a placeholder `compose down` exited non-zero before
removing volumes. Workspaces silently leaked into the next
./up.sh + seed.sh boot. Caught when tenant-isolation.sh F1/F2 saw
3× duplicate alpha-parent rows accumulated across three prior runs.
Same fix applied to the workflow's dump-logs step.
7. requirements.txt — pin molecule-ai-workspace-runtime>=0.1.78.
channel-envelope-trust-boundary.sh imports from `molecule_runtime.*`
(the wheel-rewritten path) so it catches the failure mode where
the wheel build silently strips a fix while unit tests against the
local source keep passing. CI was failing this replay because the wheel
wasn't installed — caught in the staging push run from #2492.
8. .github/workflows/harness-replays.yml — Phase 2 plumbing.
* Removed /etc/hosts step (Host-header path eliminated the need;
scripts already source _curl.sh).
* Updated dump-logs to reference the new service names
(tenant-alpha + tenant-beta + postgres-alpha + postgres-beta).
* Added SECRETS_ENCRYPTION_KEY placeholder env on the dump step.
Verified: ./run-all-replays.sh from a clean state — 6/6 passed
(buildinfo-stale-image, channel-envelope-trust-boundary, chat-history,
peer-discovery-404, per-tenant-independence, tenant-isolation).
Roadmap section updated: Phase 2 marked shipped. Phase 3 promoted to
"replace cp-stub with real molecule-controlplane Docker build + env
coherence lint."
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per review nit on PR #2491: the previous message ("a goroutine reached
cpProv.Start but never broadcast its failure") could mislead an
operator if Assertions 2 and 4 both fire — Assertion 4 also catches
"goroutine exited via an earlier path before reaching Start." Spell
both modes out and cross-reference Assertion 2.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three changes that bring the local harness from "covers what staging
covers minus the SaaS topology" to "exercises every surface we shipped
this session against the prod-shape Dockerfile.tenant image."
1. Drop the /etc/hosts requirement.
Replays previously needed `127.0.0.1 harness-tenant.localhost` in
/etc/hosts to resolve the cf-proxy. That gated the harness behind a
sudo step on every fresh dev box and CI runner. The cf-proxy nginx
already routes by Host header (matches production CF tunnel: URL is
public, Host carries tenant identity), so the no-sudo path is to
target loopback :8080 with `Host: harness-tenant.localhost` set as
a header.
New `tests/harness/_curl.sh` centralises this — curl_anon /
curl_admin / curl_workspace / psql_exec wrappers all set the Host
+ auth headers automatically. seed.sh, peer-discovery-404.sh,
buildinfo-stale-image.sh updated to source it. Legacy /etc/hosts
users still work via env-var override.
2. Fix the seed.sh FK regression that blocked DB-side replays.
POST /workspaces ignores any `id` in the request body and generates
one server-side. seed.sh was minting client-side UUIDs that never
reached the workspaces table, so any replay that INSERTed into
activity_logs (FK-constrained on workspace_id) failed with the
workspace-not-found error. Capture the returned id from the
response instead.
3. Two new replays cover the surfaces shipped this session.
chat-history.sh — exercises the full SaaS-shape wire that PR #2472
(peer_id filter), #2474 (chat_history client tool), and #2476
(before_ts paging) ride on. 8 phases / 16 assertions: peer_id filter,
limit cap, before_ts paging, OR-clause covering both source_id and
target_id, malformed peer_id 400, malformed before_ts 400, URL-encoded
SQLi-shape rejection. Verified PASS against the live harness.
channel-envelope-trust-boundary.sh — exercises PR #2471 + #2481 by
importing from `molecule_runtime.*` (the wheel-rewritten path) so
it catches "wheel build dropped a fix that unit tests still pass."
5 phases / 11 assertions: malicious peer_id scrubbed from envelope,
agent_card_url omitted on validation failure, XML-injection bytes
scrubbed, valid UUID preserved, _agent_card_url_for direct gate.
Verified PASS against published wheel 0.1.79.
run-all-replays.sh auto-discovers — no registration needed. Full
lifecycle (boot → seed → 4 replays → teardown) runs clean.
Roadmap section updated to reflect Phase 1 (this PR) → Phase 2
(multi-tenant + CI gate) → Phase 3 (real CP) → Phase 4 (Miniflare +
LocalStack + traffic replay).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Post-merge follow-up to PR #2487 review feedback:
1. guardAgainstReraise(fn) helper around every panic-test exercise. The
original RecoversAndMarksFailed had its own outer recover() to detect
re-raise; NoOpWhenNoPanic and PersistFailureLogged didn't. If a future
regression makes logProvisionPanic re-raise, those two would have
crashed the test process (taking sibling tests down) instead of
reporting a clean failure. Now all three use the shared guard.
2. Concurrent repro now asserts bcast.count == 7 — the new
concurrentSafeBroadcaster's count field was added in the race fix
but not actually consumed. Cross-checks the existing recorder-set
assertion from a different angle: a goroutine could in principle
reach cpProv.Start (recorder hits) but then lose its
WORKSPACE_PROVISION_FAILED broadcast on the failure path. Pinning
both rules out that silent-drop variant for the canvas-broadcast
contract specifically.
3. Comment on captureLog noting log.SetOutput is process-global and
incompatible with t.Parallel() — preempts a future footgun if
someone parallelizes the panic suite.
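For item 1, a sketch of the shared guard, assuming it wraps each panic-test exercise as a closure (signature illustrative):

```go
package handlers

import "testing"

// guardAgainstReraise runs a panic-test exercise and converts any panic that
// escapes logProvisionPanic's recover into a clean test failure, instead of
// letting it crash the test process and take sibling tests down with it.
func guardAgainstReraise(t *testing.T, fn func()) {
	t.Helper()
	defer func() {
		if r := recover(); r != nil {
			t.Fatalf("logProvisionPanic re-raised the panic: %v", r)
		}
	}()
	fn()
}
```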
Verified: all four tests pass under -race; full handlers + db packages
green under -race.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- workspace-runtime-package.md: add explicit "Where to make changes"
section documenting the mirror-only policy on
Molecule-AI/molecule-ai-workspace-runtime — direct PRs are auto-rejected
by mirror-guard CI; staging push regenerates both the mirror and the
PyPI wheel via .github/workflows/publish-runtime.yml.
- infra/workspace-terminal.md: replace dead molecule-core#1528 reference
(repo renamed to molecule-monorepo, no longer accepting issues at the
old name) with a forward-pointer to monorepo + molecule-controlplane
issue trackers.
- architecture/backends.md: bump audit date to 2026-05-02 and add rows
for channel envelope enrichment (#2471), chat_history MCP tool
(#2474), /activity before_ts paging (#2476), /activity peer_id filter
(#2472), runtime_wedge smoke gate (#2473 + #2475), and the canvas-E2E
state-file requirement (#2327).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI Platform (Go) ran with -race and the concurrent test tripped the
detector: captureBroadcaster (sequential-test stub) writes lastData
unguarded, and the 7 fan-out goroutines calling markProvisionFailed hit
that stub concurrently. The local non-race run had hidden it.
Introduce concurrentSafeBroadcaster (mutex-counted) for this single
fan-out test. Sequential tests keep using captureBroadcaster — the
fix is local to the test that creates the goroutines.
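A sketch of the mutex-counted stub, assuming the broadcaster interface takes an event name plus a payload map (the real interface in the handlers package may differ):

```go
package handlers

import "sync"

// concurrentSafeBroadcaster is a test stub safe for the 7-goroutine fan-out:
// every field access is guarded, and count lets the test assert that all
// broadcasts actually arrived.
type concurrentSafeBroadcaster struct {
	mu       sync.Mutex
	count    int
	lastData map[string]any
}

// RecordAndBroadcast only records what markProvisionFailed would have
// broadcast; it never touches the network.
func (b *concurrentSafeBroadcaster) RecordAndBroadcast(event string, data map[string]any) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.count++
	b.lastData = data
}
```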
Verified ./internal/handlers passes with -race.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three fixes addressing review of the issue #2486 observability PR:
1. CI failure: original inline UPDATE in logProvisionPanic used a hard-coded
`status='failed'` literal, which trips workspace_status_enum_drift_test
(the post-PR-#2396 gate that requires every status write to flow through
models.Status* via parameterized $N). Refactor to call
h.markProvisionFailed which uses StatusFailed parameterized.
2. Canvas-broadcast gap (review finding): inline UPDATE skipped
RecordAndBroadcast, so panic recovery marked the row failed in DB but
the canvas spinner stayed on "provisioning" until the next poll.
markProvisionFailed fires WORKSPACE_PROVISION_FAILED, so canvas now
flips to a failure card immediately.
3. Critical test bug (review finding): `defer log.SetOutput(log.Writer())`
in three test sites evaluated log.Writer() at the defer statement, which
ran AFTER the SetOutput swap — restoring the buffer to itself, never
restoring os.Stderr. Subsequent tests in the package were running with the panic
tests' captured buffer as their writer. Extracted captureLog(t) helper
that captures `prev` BEFORE the swap and uses t.Cleanup.
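A sketch of the extracted helper — the essential ordering is that prev is read BEFORE the swap and restored via t.Cleanup:

```go
package handlers

import (
	"bytes"
	"log"
	"testing"
)

// captureLog redirects the standard logger into a buffer for one test.
// prev is read BEFORE SetOutput, so Cleanup restores the real writer
// (os.Stderr by default) instead of re-installing the capture buffer.
func captureLog(t *testing.T) *bytes.Buffer {
	t.Helper()
	prev := log.Writer() // capture the original writer first
	buf := &bytes.Buffer{}
	log.SetOutput(buf)
	t.Cleanup(func() { log.SetOutput(prev) })
	return buf
}
```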
Plus: softened the "goroutine never started" comment in the concurrent
repro harness — the harness atomic-counts BEFORE the entry log fires, so
"never started" was misleading; the real failure mode is "entry log
renamed/removed or writer hijacked."
Verified: full handlers suite passes; drift gate passes (Platform Go CI
failure root-caused). Regression-injected the recover body again — both
panic tests still fail as expected, confirming the contract is gated.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Goal: a deterministic, in-process reproduction of the prod incident
where 7 simultaneous claude-code provisions on the hongming tenant
produced ZERO log lines from any of the four documented exit paths.
Approach: stub CPProvisioner that records every Start() call,
sqlmock for the prepare flow, fire 7 goroutines concurrently against
provisionWorkspaceCP, then assert:
1. Entry log fired exactly 7 times (one per goroutine).
2. Stub Start() recorded all 7 distinct workspace IDs.
3. Each goroutine's entry log names its own workspace ID.
Result on staging head as of 2026-05-02: PASSES — meaning the
silent-drop class isn't reproducible against current head with stub
CP. Tenant hongming runs sha 76c604fb (725 commits behind staging),
so the bug is most likely already fixed upstream — hongming needs
a redeploy.
The test stays as a regression gate: any future refactor that
re-introduces silent goroutine swallow in the CP provision path
(rate-limit drop, channel-send-without-receiver, panic without
recover, etc.) trips it.
A safeWriter wraps the captured log buffer because raw
bytes.Buffer.Write isn't safe for concurrent goroutines — without
serialization the 7 entry-log lines interleave at byte boundaries
and the strings.Count assertion gets unreliable.
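A sketch of the wrapper, assuming plain io.Writer semantics over the shared buffer:

```go
package handlers

import (
	"io"
	"sync"
)

// safeWriter serializes concurrent writes from the 7 provisioning goroutines
// so each entry-log line lands whole and strings.Count sees one line per
// goroutine instead of interleaved byte fragments.
type safeWriter struct {
	mu sync.Mutex
	w  io.Writer
}

func (s *safeWriter) Write(p []byte) (int, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.w.Write(p)
}
```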
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
github-code-quality bot flagged 4 instances of `import a2a_mcp_server` in
the new TestStdioPipeAssertion class — every other test in the file uses
the `from a2a_mcp_server import ...` per-test pattern, so this is a real
inconsistency.
Switching the new tests to match. No behavior change; resolves the
4 unresolved review threads blocking the merge queue.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Issue #2486: 7 claude-code workspaces stuck in provisioning produced
NONE of the four documented exit-path log lines in
provisionWorkspaceCP — neither prepare-failed, nor start-failed, nor
persist-instance-id-failed, nor success. Operators couldn't tell
whether the goroutine ran at all.
Add an entry log at the top of provisionWorkspaceOpts +
provisionWorkspaceCP so a missing entry distinguishes "goroutine
never started" from "started but exited via an unlogged path."
Add logProvisionPanic at the same defer site so a panic inside
either provisioner doesn't (a) crash the whole workspace-server
process, taking every other tenant workspace with it, or (b)
silently leave the row in `provisioning` until the 10-min sweeper
fires. The recover persists status='failed' with a sanitized
panic-class message via a fresh 10s context (the goroutine's own
ctx may have been the one panicking).
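A sketch of that recover path, with markFailed standing in for the handler's failed-status write (the real code hangs this off the handler and its signature may differ):

```go
package handlers

import (
	"context"
	"fmt"
	"log"
	"runtime/debug"
	"time"
)

// logProvisionPanic is installed with defer at the top of both provisioners,
// e.g. `defer logProvisionPanic(ws.ID, h.markProvisionFailed)`. recover()
// works here because this IS the deferred function.
func logProvisionPanic(workspaceID string, markFailed func(ctx context.Context, id, reason string) error) {
	r := recover()
	if r == nil {
		return // no panic: stay silent so successful provisions add no noise
	}
	log.Printf("provision panic workspace=%s panic=%v\n%s", workspaceID, r, debug.Stack())

	// The goroutine's own ctx may be implicated in the panic (or already
	// cancelled), so persist the failure under a fresh bounded context.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	reason := fmt.Sprintf("provisioner panicked: %T", r) // sanitized: panic class only
	if err := markFailed(ctx, workspaceID, reason); err != nil {
		// Defense in depth: never leave the operator with a recovered-panic
		// log but no failed row and no explanation why.
		log.Printf("provision panic persist failed workspace=%s err=%v", workspaceID, err)
	}
}
```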
Tests pin three contracts:
- no-op when no panic (otherwise every successful provision
emits a spurious log line)
- recovers + persists failed status on panic, with stack trace
- defense-in-depth: if the persist itself fails, log it instead
of leaving the operator with a recovered-panic log but no row
Regression-injected by neutering the recover() body — all three
tests fail until the recover + UPDATE path is restored.
This is observability + resilience only, not a root-cause fix
for #2486. The actual silent-drop class still needs reproduction
once the tenant is on a build that includes this entry log.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two trust-boundary leaks surfaced in code review of the channel-envelope
enrichment work:
1. _agent_card_url_for(peer_id) interpolated raw input into
${PLATFORM_URL}/registry/discover/<peer_id> with no UUID guard. An
upstream row with peer_id=`../../foo` produced an agent-visible URL
pointing at a sibling registry path. This is the same trust-boundary
rationale discover_peer's docstring already calls out: "never interpolate
path-traversal characters into the URL". Now gated by _validate_peer_id;
returns "" on validation failure.
2. _build_channel_notification echoed raw peer_id back into
meta["peer_id"], which on the push path renders inside the agent's
<channel peer_id="..." kind="..."> XML-attribute context. Attacker
bytes (control chars, embedded quotes) would land in agent-rendered
text wired into the next conversation turn. Now canonicalised through
_validate_peer_id before any meta write; on validation failure we
set "" rather than reflecting the raw bytes.
Defense-in-depth — both layers gate independently. Mutation-verified by
stashing the prod-side changes to both files and confirming both regression tests fail.
Tests:
- test_envelope_enrichment_invalid_peer_id_skips_lookup: updated to
pin the safe behavior (peer_id="" + agent_card_url absent), not the
prior leak shape.
- test_envelope_enrichment_strips_path_traversal_peer_id: NEW. Hard
regression for peer_id="../../foo" — pins both the URL-builder and
the meta echo against this specific exploit shape.
- Two existing tests updated to use UUID-shape placeholders instead
of "ws-peer-uuid" / "peer-ws-uuid" since those non-UUIDs now correctly
get stripped by the validator.
Resolves the Required-grade finding from the multi-axis review on PR #2471.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #2475 promoted runtime_wedge reset to an autouse conftest fixture in
workspace/tests/conftest.py covering every test in this directory. The
local @pytest.fixture(autouse=True) _reset in test_runtime_wedge.py
became dead-but-harmless (the reset is idempotent — both fixtures
ran on every test, double-resetting). Remove the local copy so future
maintainers don't have to keep two definitions in sync.
Caught during a deeper /code-review-and-quality pass on the #2475
follow-ups — the original PR landed the conftest fixture but missed
the dedup of the now-redundant in-file fixture.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
github-code-quality bot flagged the constant as an unused module-level global —
correctly. The earlier draft of the negative-cache test was going to
exercise two distinct peer IDs hitting the registry concurrently, but
the test was simplified to a single-peer flow before merge and the
constant lost its consumer.
Resolves the only blocking review thread on PR #2471.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When molecule-mcp is launched with stdin or stdout redirected to a
regular file (molecule-mcp > out.txt, ad-hoc CI smoke-tests, local
debugging), asyncio.connect_read_pipe / connect_write_pipe later raise
ValueError: Pipe transport is only for pipes, sockets and character
devices — surfaced to the operator as a confusing traceback with no
hint about what to do.
Add _assert_stdio_is_pipe_compatible() to detect the same constraint
synchronously before the event loop starts, exit cleanly with code 2,
and print a stderr message that names:
- which stream failed (stdin vs stdout)
- the asyncio transport requirement
- the two common causes (>file, <file) and a working alternative
(molecule-mcp 2>&1 | tee out.txt)
Wired into cli_main() (the synchronous wrapper around asyncio.run(main()))
so wheel-smoke + the production launch path both go through the guard
without changing the async stdio loop body. Closed/stale-fd case also
handled — os.fstat OSError exits 2 with the same guidance instead of
escaping.
Tests: 4 new in TestStdioPipeAssertion — pipe-pair happy path,
regular-file stdout (the bug condition), regular-file stdin (symmetric
case), and closed-fd. Mutation-verified — all 4 fail without the prod
helper. 37/37 in test_a2a_mcp_server.py.
Closes Molecule-AI/molecule-ai-workspace-runtime#61.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-review on PR #2471: failure outcomes (4xx/5xx/non-JSON/network
exception) weren't writing to _peer_metadata, so a peer with a flaky
or missing registry record re-fired the 2s-bounded GET on EVERY
push. The cache became a no-op for the exact failure scenarios it
most needs to defend against, and the poller thread stalled 2s per
push for that peer until the registry came back.
Cache the failure outcome as `(now, None)` so the TTL window
suppresses re-fetch. Two new tests pin the behaviour for both
HTTP failures (5xx) and transport exceptions (httpx.ConnectError).
Type signature widens to `dict | None` on the value tuple's second
slot to match the new sentinel; readers already handle `None` as
"no enrichment available" — that's the documented graceful-degrade
contract — so no caller change needed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-review on PR #2474 + #2476: the comment said we don't forward
before_ts, but the code below does. Misleading after #2476 added
the server-side filter. Replace with a one-liner that just states
the forward-and-validate contract.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The wheel-side chat_history MCP tool advertises a `before_ts`
parameter for backward paging through long histories, and the docs
describe it as the canonical pagination knob — but the server
silently ignored it until now. Without this fix, an agent passing
before_ts to chat_history would always get the most-recent N rows
and pagination would be broken end-to-end.
Add `before_ts` query param parsed as RFC3339 at the trust boundary
and translated into a `created_at < $X` clause on the existing
builder. Mirrors the strict-inequality shape since_id uses for
forward paging (`created_at > cursorTime`) so paging across both
directions has consistent semantics.
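A sketch of the trust-boundary parse and the clause it feeds, assuming the existing builder accumulates conditions with positional args (parseBeforeTS is an illustrative name):

```go
package handlers

import (
	"fmt"
	"time"
)

// parseBeforeTS validates the before_ts query param at the trust boundary.
// Empty means "no backward-paging cursor"; anything non-RFC3339 becomes a 400.
func parseBeforeTS(raw string) (time.Time, bool, error) {
	if raw == "" {
		return time.Time{}, false, nil
	}
	ts, err := time.Parse(time.RFC3339, raw)
	if err != nil {
		return time.Time{}, false, fmt.Errorf("before_ts must be RFC3339: %w", err)
	}
	return ts, true, nil
}

// The parsed cursor then feeds a strict-inequality clause on the existing
// builder, mirroring since_id's forward-paging shape:
//
//	conditions = append(conditions, fmt.Sprintf("created_at < $%d", nextArgPos))
//	args = append(args, beforeTS)
```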
Tests: 3 new branches (positive filter, composition with peer_id
into the canonical chat_history paging shape, RFC3339 rejection
across 4 malformed inputs including URL-encoded SQL injection).
Mutation-verified pre-commit; existing 9 activity tests still pass.
Reported by self-review on PR #2474.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three review nits from PR #2473:
1. Narrow `_check_runtime_wedge` import catch to (ImportError,
ModuleNotFoundError). The bare `except Exception:` would have
masked an `AttributeError`/`TypeError` from a runtime_wedge API
rename — silently degrading the smoke gate to "no wedge info" with
no log line. The `runtime_wedge_signature.json` snapshot test
(task #169) carries the API-drift load instead.
2. Drop the unreachable `or "<unspecified>"` fallback. `wedge_reason()`
only returns "" when not wedged, but the call is guarded by
`is_wedged()` being True and `mark_wedged` requires a non-None
reason. The defensive arm couldn't fire.
3. Promote `reset_runtime_wedge` from a per-file fixture in
test_smoke_mode.py to an autouse fixture in
workspace/tests/conftest.py. Heartbeat tests or future adapter
tests that call `mark_wedged` without cleanup would otherwise leak
a sticky wedge into smoke tests later in the same pytest process —
smoke tests would fail-via-leak instead of asserting their actual
contract. Two-sided reset survives early test failures.
Also: `test_check_runtime_wedge_returns_none_when_module_missing`
now `monkeypatch.delitem(sys.modules, "runtime_wedge")` before
patching `__import__`, so the test re-exercises the import path
instead of resolving from the module cache (the test was passing
today by luck — it would still pass even if the catch arm were
deleted, because the cached module's `is_wedged` returned False).
Tests: 28 still pass in test_smoke_mode.py, 57 across smoke + wedge +
heartbeat. Regression-injection-checked: catch tightening doesn't
regress the existing wedge tests.
When a peer_agent push lands and the agent needs context from prior
turns with that workspace ("what task did this peer assign me last
hour?", "what did I tell them?"), the only options today are
re-deriving from memory (lossy) or scrolling activity_logs in the
canvas (no agent-facing tool). Surface the platform's existing
audit log directly via a new MCP tool so agents can read both sides
of an A2A conversation in chronological order.
Implementation:
- a2a_tools.py: new tool_chat_history(peer_id, limit=20, before_ts="")
hits /workspaces/<self>/activity?peer_id=X&limit=N (the new server
filter from molecule-core#2472). Reverses the DESC response into
chronological order so the agent reads top-down. Graceful error
envelope on validation/network/non-200 — never crashes the MCP
server, agent can branch on Error: prefix.
- platform_tools/registry.py: ToolSpec wired into the A2A section so
the rendered system-prompt block automatically includes it. Same
pattern as the existing inbox_peek/inbox_pop/wait_for_message.
- a2a_mcp_server.py: dispatch in handle_tool_call.
- executor_helpers.py: _CLI_A2A_COMMAND_KEYWORDS gets a None entry
(CLI runtimes don't expose chat history today; flip to a keyword
when a2a_cli grows a `history` subcommand).
- snapshots/a2a_instructions_mcp.txt regenerated.
Tests: 10 new branches in TestChatHistory (validation / param
forwarding / limit cap / before_ts pass-through / DESC→chronological
reorder / 400 verbatim / 500 generic / network exc / non-list resp).
Mutation-verified: reverting a2a_tools.py fails 10/10. Full test
suite remains green at 1516 passed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The agent learns about <channel> tag attributes ONLY from the
instructions string returned by initialize. Without this update the
wheel ships peer_name / peer_role / agent_card_url on the wire but
no agent ever uses them — they get printed inline in the push tag,
the agent doesn't know they're there, and the UX gain from the
enrichment is lost.
Update _build_channel_instructions to:
- List the new attrs in the <channel> tag template under PUSH PATH
- Add per-attribute semantics (when present, what to do with them,
what \"absent\" means — graceful-degrade vs bug)
- Point at the discover endpoint for agent_card_url so the agent
treats it as a follow-on URL not the body of the message
Tests: structural pin asserting all three attr names appear in the
instructions AND the per-field semantics phrases ("registry
resolved", "discover endpoint") so a future copy-edit that
shortens the prose can't silently drop the agent guidance.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Timeout-as-PASS in run_executor_smoke missed the PR-25-class
regression: claude-agent-sdk takes 60s to time out on a malformed
argv, while our outer wait_for fires at its 5s default and reports "imports
healthy, hit a network boundary." A broken image then ships to GHCR.
Universal fix uses the existing runtime_wedge module (already
documented as the cross-cutting wedge holder, already read by
heartbeat). Adapters opt-in by calling runtime_wedge.mark_wedged()
from their executor's wedge catch arm; the smoke now consults
runtime_wedge.is_wedged() at the end of every result path and
upgrades a provisional PASS to FAIL when the flag is set. Non-opt-in
adapters keep working as before — the check is additive.
CI uses MOLECULE_SMOKE_TIMEOUT_SECS=90 to outlast the SDK's 60s
initialize() handshake so the wedge marks before our outer wait_for
fires. Module + helper docstrings call out the calibration so a
future contributor doesn't lower it without thinking through what
that wins back vs. what it loses.
Tests: 7 new cases pinning the wedge-aware paths — mark+raise (PR-25
shape), mark+block (still-running execute that wait_for cuts short),
clean+clean (additive contract), import-resilience (fail-open when
runtime_wedge unimportable). Regression-injection-checked: silencing
the new check fails both wedge-shape tests at unit-test time.
Surfaces the conversation history with one specific peer for the
wheel-side chat_history MCP tool. The filter joins
(source_id = $X OR target_id = $X) so both inbound (peer was sender)
and outbound (peer was recipient) turns appear in the same view,
ordered by created_at, and composes with existing type/source/
since_secs/since_id/limit filters.
Validates peer_id as a UUID at the trust boundary so a malformed
caller can't smuggle SQL fragments via the parameter — the args are
bound but the explicit rejection gives the wheel a cleaner 400
signal than an empty list, and defends against any future code path
that might interpolate the value into a URL or another query.
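A sketch of the validation plus the OR clause under the same assumed positional-args builder (appendPeerFilter and the regexp validator are illustrative; the bound value is referenced twice by one placeholder):

```go
package handlers

import (
	"fmt"
	"regexp"
)

// uuidShape gates peer_id at the trust boundary; the real validator may differ.
var uuidShape = regexp.MustCompile(
	`^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$`)

// appendPeerFilter adds the two-sided clause so inbound (peer was sender) and
// outbound (peer was recipient) turns land in the same chronological view.
func appendPeerFilter(conditions []string, args []any, peerID string) ([]string, []any, error) {
	if peerID == "" {
		return conditions, args, nil
	}
	if !uuidShape.MatchString(peerID) {
		return nil, nil, fmt.Errorf("peer_id must be a UUID") // handler maps this to a 400
	}
	pos := len(args) + 1 // next positional placeholder
	conditions = append(conditions,
		fmt.Sprintf("(source_id = $%d OR target_id = $%d)", pos, pos))
	args = append(args, peerID) // one bound arg, referenced twice
	return conditions, args, nil
}
```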
Tests: 3 new branches (positive filter, composition with
type+source, UUID-shape rejection across 5 malformed inputs).
Mutation-verified: reverting activity.go fails all peer_id tests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Setting fetched_at = 0.0 assumed wall-clock semantics, but
time.monotonic() counts from an unspecified origin (typically system
boot), not the Unix epoch — when this test ran early in the pytest
run, current was <300s and the entry was
treated as fresh, silently skipping the re-fetch the assertion
expects. Anchor to time.monotonic() - TTL - 60 so the entry is
unambiguously past the freshness window regardless of when
in the run the test fires.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The bare envelope only carried `peer_id` for peer_agent inbound, so a
receiving agent had to round-trip to /registry to find out who's
talking. Surface the sender's display name, role, and an agent-card
URL alongside the routing fields so the agent can render
"ops-agent (sre): ping" in one shot without an extra lookup.
a2a_client.py:
- Add _peer_metadata cache `dict[peer_id → (fetched_at, record)]`
- Add enrich_peer_metadata(peer_id) — sync, hits cache or registry
with a tight 2s timeout, returns None on validation/network/non-200
so callers can degrade gracefully
- TTL = 5 min so a busy multi-peer chat doesn't hit registry on every
push, but role/name renames propagate within a session
- Add _agent_card_url_for(peer_id) — deterministic from peer_id alone
a2a_mcp_server.py:
- _build_channel_notification calls enrich_peer_metadata when peer_id
is non-empty; meta carries peer_name + peer_role + agent_card_url
alongside the existing routing fields
- agent_card_url surfaces unconditionally (constructable from peer_id);
peer_name/role only when registry lookup succeeds — never blocks the
push on a registry stall
Tests: 6 new branches (canvas_user no enrichment / cache hit no GET /
cache miss fetches once / registry-fail graceful degrade / TTL expiry
re-fetches / invalid peer_id skips lookup). Mutation-verified: 6/6
fail without prod code, 39/39 pass with.
Tracks the broader RFC at #2469 (workspace-server activity_type rename
to break the echo loop). Independent of PR #2470 — this is the
metadata-enrichment half of the same UX improvement.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The workspace-server's `/notify` handler writes the agent's own
send_message_to_user POSTs to activity_logs as activity_type=
'a2a_receive', method='notify', source_id=NULL so the canvas
chat-history loader can restore those bubbles after a page reload.
The activity API exposes the row to /workspaces/:id/activity?
type=a2a_receive, so the inbox poller picks it up and pushes the
agent's own outbound back as an inbound `← molecule: Agent
message: ...` — confirmed live 2026-05-01.
Add `_is_self_notify_row` predicate matched on (method='notify' AND
no source_id) and call it from `_poll_once` before enqueue. The
predicate combines BOTH discriminators so a future caller using
method='notify' with a real peer_id still passes through. Cursor
advances past skipped rows so we don't re-poll the same self-notify
on every iteration.
Belt-and-braces: long-term fix lives in workspace-server (rename
the misclassified activity_type to 'agent_outbound' — RFC at
#2469). This guard stays regardless because it only excludes rows
we never want.
Tests: 7 new — predicate true/false matrix + integrated _poll_once
behavior (skip, cursor advance, notification suppression).
Mutation-verified: reverting inbox.py to the prior shape fails 7/7;
applied state passes 48/48.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>