Three small fixes from the self-review of #2209:
1. **Required: concurrency group.** Two pushes to main in quick
succession (manual UI merge then auto-promote-staging's ff-push,
or any back-to-back main pushes) would race two auto-sync runs
against the same staging branch — second `git push origin staging`
fails non-fast-forward, surfacing as a red CI alert for what should
be a no-op. Add `concurrency: { group: auto-sync-main-to-staging,
cancel-in-progress: false }` so the second run waits for the first
and sees its result.
2. **Hygiene: `git merge --abort` on conflict.** The conflict-error
path exits 1 with the work tree in a half-merged state. Doesn't
affect future runs (each gets a fresh checkout) but is an
unpleasant artifact for anyone who shells into the runner. Abort
first, then exit.
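Sketched, the hygiene fix looks like this; the remote/branch names and the error text are illustrative stand-ins, not the workflow's exact script.

```shell
# Hedged sketch of the conflict path. Remote/branch names and the
# error message are illustrative, not copied from the workflow.
merge_main_or_abort() {
  git merge --ff-only origin/main 2>/dev/null && return 0
  git merge --no-ff origin/main -m "chore: sync main -> staging" && return 0
  # Conflict: restore a clean work tree BEFORE exiting nonzero, so
  # anyone shelling into the runner doesn't find a half-merged checkout.
  git merge --abort
  echo "merge conflict syncing main into staging; resolve manually" >&2
  return 1
}
```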
3. **Doc accuracy: "Loop safety" comment.** The original said the
chain terminates because "main is either a no-op or advances
further." That's true but understates the actual safety: GitHub
Actions explicitly does NOT trigger downstream workflow runs from
`GITHUB_TOKEN`-authored pushes. So the loop is impossible by
construction, not just by happy coincidence of ref state. Updated
the comment to reflect the actual mechanism.
Plus a step-name nit: "Fast-forward staging → main" reads as if main
is the target. Renamed to "Fast-forward staging to main" for
consistency with the workflow's name (main → staging).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Background
`auto-promote-staging.yml` advances main via `git merge --ff-only`
+ `git push origin main` — clean fast-forward, no merge commit. But
manual `staging → main` merges via the GitHub UI / API create a merge
commit on main that staging doesn't have. The next `staging → main`
PR then evaluates as "BEHIND" because staging is missing that merge
commit, requiring a manual `gh pr update-branch` round-trip.
This pattern bit us twice on 2026-04-28 (PRs #2202 and #2205, both
manual bridges to land pipeline fixes themselves). Each needed
update-branch + re-CI before they could merge. Annoying and
avoidable.
What this workflow does
Triggered on every push to main (regardless of source: auto-promote,
UI merge, API merge, direct push):
1. Check whether main is already in staging's ancestry. If yes,
no-op — auto-promote-staging keeps them aligned via ff push,
and the no-op case is the steady state.
2. If not (manual merge commit on main, or direct main hotfix):
try `git merge --ff-only origin/main` first. Works when staging
hasn't diverged with its own commits.
3. If ff fails (staging has its own in-flight feature work):
`git merge --no-ff origin/main -m "chore: sync main → staging"`.
Absorbs main's tip while keeping staging's own history.
4. Push staging.
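The four steps above can be sketched as a single function; this is a hedged sketch assuming a checkout with `origin` configured, and the merge-commit message is a stand-in for the workflow's actual one.

```shell
# Hedged sketch of steps 1-4; not the workflow's literal script.
sync_main_into_staging() {
  git fetch -q origin main staging
  git checkout -q staging
  # Step 1: main already in staging's ancestry -> steady-state no-op.
  if git merge-base --is-ancestor origin/main HEAD; then
    echo "no-op: main is already contained in staging"
    return 0
  fi
  # Step 2: plain fast-forward when staging hasn't diverged.
  if ! git merge --ff-only origin/main 2>/dev/null; then
    # Step 3: staging has in-flight work -> absorb main via merge commit.
    git merge --no-ff origin/main -m "chore: sync main -> staging" || return 1
  fi
  # Step 4: publish the synced staging branch.
  git push -q origin staging
}
```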
Loop safety
Pushing the synced staging triggers auto-promote-staging.yml, which
checks gates on staging's new tip and, if green, ff-pushes staging
to main. Since staging now ⊇ main, the resulting push to main is
either a no-op (no ref change → no push event fires → auto-sync
doesn't re-trigger) or advances main further. In the latter case
auto-sync fires once more, sees main already in staging's ancestry,
no-ops. Bounded.
Conflict handling
If the merge step hits conflicts (staging and main diverged with
incompatible changes), the workflow fails with a clear summary
pointing to manual resolution. This shouldn't happen in practice —
staging is the integration branch; conflicts indicate a direct main
hotfix touching the same code as in-flight staging work.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous assertion `'Silent Agent' not in result` was pinning
the buggy behavior — peers without an agent_card were silently
dropped from the prompt. With the fallback to DB name+role those
peers are correctly visible. Flip the assertion so the test pins
the new (correct) rendering and would catch a regression to the
silent-drop behavior.
Bug: a Design Director coordinator with 6 freshly-created worker peers
rendered an empty `## Your Peers` section in its system prompt — the
hosting registry endpoint correctly returned all 6 peers, but
`summarize_peer_cards()` silently dropped every entry whose
`agent_card` column was null (the default until A2A discovery has
run end-to-end against the worker). The coordinator then refused to
delegate any task because "no peers exist".
Fix: fall back to the registry row's `name` and `role` columns when
`agent_card` is missing, malformed, or wrong-typed, instead of
skipping the peer. The registry endpoint
(`workspace-server/internal/handlers/discovery.go:queryPeerMaps`) has
always returned both fields — they were just being thrown away on
the consumer side. `build_peer_section()` now renders `Role: …` when
the agent_card-derived skill list is empty so the coordinator's
prompt still has something concrete to delegate against.
Also hoists `import json` out of the per-peer loop body to module
level (was previously imported once per iteration).
Tests: new `test_shared_runtime_peer_summary.py` pins all four
fallback cases (null / malformed string / wrong type / null + no
DB name) plus the agent-card-present happy path and the mixed-list
case the coordinator actually consumes. This is the first test
coverage the peer summary in `shared_runtime.py` has ever had; no
prior tests existed.
Refs: 2026-04-27 Design Director discovery report from infra team.
Two latent bash bugs in the canonical secret-scan workflow caught
during the post-merge review of molecule-controlplane #301 (a
private consumer that inlined this workflow's logic and got both
fixes there). Same bugs apply here; fixing in canonical means every
public consumer (gh-identity, github-app-auth, the 8 workspace
template repos) inherits the fix on their next workflow_call.
Bug 1: `printf "$OFFENDING"` is a format-string sink.
OFFENDING is built from filenames: `${f} (matched: ${pattern})\n`.
When passed to printf as the first argument, `%` characters in a
filename are interpreted as conversion specifiers — corrupting the
error message or printing `%(missing)` artifacts. No filename in
the current tree triggers it, but a future test fixture, build
artifact, or contributor-supplied path could.
Fix: `printf '%b' "$OFFENDING"` interprets the literal `\n` we
appended without treating OFFENDING as a format string.
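Minimal repro of the difference, using a hypothetical `%s`-bearing filename:

```shell
# A '%s' embedded in a filename demonstrates the sink: the first printf
# consumes it as a conversion specifier; the '%b' form leaves it intact
# while still expanding the trailing \n we appended.
OFFENDING='weird%sname.txt (matched: aws_secret)\n'

unsafe=$(printf "$OFFENDING")      # format-string sink: %s silently eaten
safe=$(printf '%b' "$OFFENDING")   # literal % preserved, \n expanded

echo "unsafe: $unsafe"             # weirdname.txt (matched: aws_secret)
echo "safe:   $safe"               # weird%sname.txt (matched: aws_secret)
```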
Bug 2: `for f in $CHANGED` word-splits on whitespace.
Filenames containing spaces would split into multiple tokens. The
self-exclude check (`[ "$f" = "$SELF" ] && continue`) and the diff
lookup would both operate on partial-path tokens. No filename in
the current tree has whitespace, but the failure would be silent
if one ever did.
Fix: `while IFS= read -r f; do ... done <<< "$CHANGED"` reads
whole lines as filenames. Added `[ -z "$f" ] && continue` to
match the original `for` loop's implicit empty-input skip.
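The two loop shapes side by side, over a hypothetical file list with one space-bearing path:

```shell
# CHANGED mimics `git diff --name-only` output: one path per line,
# the second containing spaces (hypothetical fixture name).
CHANGED='docs/readme.md
test fixtures/sample data.txt'

for_count=0
for f in $CHANGED; do               # word-splits: 4 tokens, 2 of them bogus
  for_count=$((for_count + 1))
done

read_count=0
while IFS= read -r f; do            # one whole line per iteration
  [ -z "$f" ] && continue           # keep the for loop's empty-input skip
  read_count=$((read_count + 1))
done <<< "$CHANGED"

echo "for: $for_count, while-read: $read_count"   # for: 4, while-read: 2
```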
Both fixes are mechanically straightforward (~16 lines net diff,
mostly comments documenting the why). No behavior change for
filenames in the current tree; strictly better for the edge cases.
The same fixes already shipped in molecule-controlplane via #301
which inlined a copy of this workflow. The runtime's bundled
pre-commit hook (molecule-ai-workspace-runtime:
molecule_runtime/scripts/pre-commit-checks.sh) likely has the same
bugs — flagged as a follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the final gap in the SaaS pipeline. After auto-promote-staging
fast-forwards main, publish-workspace-server-image builds new
`:staging-<sha>` images, but `:latest` (what prod tenants pull) only
moves on either a manual `promote-latest.yml` dispatch or a canary-
verify retag (gated on Phase 2 fleet that doesn't exist).
This workflow closes that gap by retagging
`platform:staging-<sha>` + `platform-tenant:staging-<sha>` → `:latest`
whenever E2E Staging SaaS passes for a `main` push. Uses crane
(no Docker daemon needed). Verifies both images exist before retagging
either, so a half-published state is impossible.
Why trigger only on `main` (not staging):
- `:latest` is what prod tenants pull. Only SHAs that have reached
`main` (via auto-promote-staging) should advance `:latest`.
- Triggering on staging would let a staging-only revert advance
`:latest` to a SHA that never reaches `main`, breaking the
invariant "production runs what's on `main`".
Why a separate workflow rather than folding into e2e-staging-saas.yml:
- Keeps test concerns and release concerns separate.
- Disabling promote during an incident is one workflow toggle, not
an edit to the long E2E file.
- When Phase 2 canary work eventually lands, the canary path can
replace this trigger without touching the E2E workflow.
Doc-aligned: per molecule-controlplane/docs/canary-tenants.md,
"green staging E2E → :latest" is the recommended approach for the
current scale (≤20 paying tenants); canary fleet is deferred until
blast radius grows.
Pipeline after this lands is fully self-healing:
staging push → 4 gates green → auto-promote fast-forwards main
→ publish-workspace-server-image → E2E Staging SaaS
→ THIS WORKFLOW retags :latest → tenant fleet auto-pulls in 5 min
(or redeploy-tenants-on-main fans out faster)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Observed 2026-04-28: auto-promote ran for staging head 96955f7b with
all gates actually green (verified via /commits/<sha>/check-runs API)
yet `check-all-gates-green` reported `CodeQL → missing/none` and
aborted. Same SHA was promotable; auto-promote couldn't see it.
Cause: `gh run list --workflow="CodeQL"` matched two workflows in
this repo:
- codeql.yml (explicit, scans both staging and main)
- codeql (GitHub UI-configured Code-quality default setup,
internal, scans default branch only)
gh CLI rejects ambiguous `--workflow=<name>` lookups and returns no
result → the gate fell through to `missing/none` and ALL_GREEN was
set false. Every staging push since both names existed has been
silently deadlocked.
Fix: switch GATES from display-name strings to workflow file paths.
File paths are the unique identifier for a workflow file in
.github/workflows/; display names are decoration and can collide.
The same `gh run list --workflow=<file.yml>` query that fails on
"CodeQL" succeeds on "codeql.yml" because the file path resolves
unambiguously.
No behavior change for the other three gates (CI, E2E Canvas, E2E
API Smoke) since their names didn't collide — they keep working,
they just identify by ci.yml / e2e-staging-canvas.yml / e2e-api.yml
now. The log line shape changes from `CI → completed/success` to
`ci.yml → completed/success` which is fine for ops grep.
When adding/removing a gate going forward: file paths only. Keep
branch-protection required-checks (check-run display names) in
sync as a separate manual step.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The auto-promote-staging.yml gate-check (line 99) treats "workflow
didn't run" as failure. Path-filtered triggers on E2E API Smoke Test
and E2E Staging Canvas meant a platform-only or test-only push to
staging — say, the prior PR #2201 which only touched
tests/e2e/test_staging_full_saas.sh — never triggered the canvas
workflow, and auto-promote saw `missing/none`, marked all_green=false,
and aborted. The same class of failure applies to any push that
doesn't touch the gate's watched paths: a deadlock by design, never
noticed because the gate was new.
Fix per Design B (always-run + fast-skip):
- Drop `paths:` from the push/pull_request triggers on both gate
workflows. The workflow now always fires on every staging+main
push/PR.
- Add a `detect-changes` job using `dorny/paths-filter@v3` that
decides whether to do real work, scoped to the same paths the
trigger filter used to watch.
- Real work job (e2e-api / playwright) gates on
`needs: detect-changes; if: needs.detect-changes.outputs.X == 'true'`.
- Add a sibling `no-op` job that runs when the filter output is
false, emitting `::notice::… no-op pass`. The workflow run's
conclusion is `success` either way — auto-promote sees green and
proceeds.
Manual `workflow_dispatch` and the weekly canvas `schedule`
short-circuit detect-changes to always-run; those triggers exist
precisely to exercise the suite and shouldn't be silently no-op'd.
Why this approach over making auto-promote-staging smarter:
The alternative (Design A, considered + rejected) was to teach
auto-promote-staging to read each gate's `paths:` filter and treat
"no run because filter excluded the commit" as conditional pass.
That couples auto-promote to other workflows' YAML schema and breaks
silently if a gate is renamed or its filter changes. Design B keeps
the auto-promote contract simple ("each gate emits success") and
makes each gate self-describing — adding a new gate doesn't require
touching auto-promote.
Cost: ~10-30s of runner overhead per gate per push for the no-op when
paths don't match. Negligible vs the alternative of deadlocked
auto-promote chains.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
E2E Staging SaaS has been failing on every cron + push run since
2026-04-27 with `LEAK: org … still present post-teardown (count=1)`,
exit 4. Root cause: the curl timeout on the teardown DELETE was 30s
and the post-DELETE leak check was a single 10s sleep — but the
DELETE handler runs the full GDPR Art. 17 cascade synchronously,
including EC2 termination which AWS reports in 30–60s. Real-world
wall time on a prod-shaped run was 57s on 2026-04-27 (hongmingwang
DELETE); the 30s curl timeout aborted the request mid-cascade and
the 10s post-sleep check found the row still present (status not
yet 'purged').
Two-part fix to match real cascade timing:
1. DELETE curl gets its own --max-time 120 (was 30) so the
synchronous cascade has room to complete in-band.
2. The leak check polls up to 60s for status='purged' instead of
one rigid 10s sleep. Covers two cases:
- DELETE returns 5xx mid-cascade but the cascade finishes anyway
(we still observe a clean state).
- DELETE legitimately exceeds 120s — eventual-consistency catches
the eventual purge instead of false-flagging a leak.
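The poll shape, sketched with a hypothetical `get_org_status` helper standing in for the real curl against the tenant API; the budget and interval defaults mirror the numbers above.

```shell
# Hedged sketch of the leak-check poll. get_org_status is a hypothetical
# stand-in for the real curl; defaults reflect the 60s window described.
wait_for_purge() {
  budget=${1:-60}; interval=${2:-2}; elapsed=0
  while [ "$elapsed" -lt "$budget" ]; do
    if [ "$(get_org_status)" = "purged" ]; then
      echo "clean: org reached status=purged after ${elapsed}s"
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "LEAK: org still present after ${budget}s" >&2
  return 4   # same exit code the workflow uses for a leak
}
```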
The 5–15s estimate in `molecule-controlplane/internal/handlers/
purge.go`'s comment is the API-call cost only, not the AWS-side
time-to-termination it waits on. The async-purge refactor noted in
that comment would let us drop these timeouts back to ~15s — file
that under future work.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #132. Extends the cascade propagation probe (added in #2197
and clarified in #2198) with a content-integrity check.
The previous probe verified pip can RESOLVE the version we just
published (catches surface 1+2 propagation lag — metadata + simple
index). It did NOT verify pip can DOWNLOAD bytes that match what we
uploaded — leaving a window where a Fastly stale-content scenario
(rare but PyPI has had it: e.g. 2026-04-01 incident where a CDN node
served a previous version's wheel under the new version's URL for
~90s after upload) would pass the probe and ship corrupt builds to
all 8 receiver templates.
Two-stage check, both must pass before the cascade fans out:
(a) `pip install --no-cache-dir PACKAGE==VERSION` succeeds —
version is resolvable. (Existing, unchanged.)
(b) `pip download` of the same wheel + `sha256sum` matches the
hash captured pre-upload from `dist/*.whl`. (New.)
Captured BEFORE upload via a new `wheel_hash` step that exposes
`steps.wheel_hash.outputs.wheel_sha256`, bubbled up as
`needs.publish.outputs.wheel_sha256`, and consumed by the cascade
probe via the EXPECTED_SHA256 env var.
`pip download` is the right primitive: it writes the actual .whl
file (vs `pip install` which unpacks and discards), so we can
sha256sum it directly. Combined with --no-cache-dir + a wiped
/tmp/probe-dl per poll, every poll re-fetches from the live Fastly
edge — no local-cache mask.
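Stage (b)'s comparison, sketched; in the real workflow the wheel path comes from `pip download` and the expected hash from the pre-upload `wheel_hash` step output, both stand-ins here.

```shell
# Hedged sketch of stage (b): compare a downloaded wheel against the
# hash captured pre-upload. Paths and plumbing are illustrative.
verify_wheel_hash() {
  wheel=$1; expected=$2
  actual=$(sha256sum "$wheel" | awk '{print $1}')
  if [ "$actual" = "$expected" ]; then
    echo "content OK: $actual"
  else
    echo "HASH MISMATCH: expected $expected got $actual" >&2
    return 1   # fail the poll; do not let the cascade fan out
  fi
}
```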
Per-poll cost: ~3-5s pip install + ~3s pip download + 4s sleep.
30-poll budget = ~5-6 min wall on a slow runner (vs the previous
~4-5 min for resolve-only). Well within the cascade's tolerance for
a known-rare CDN issue, and the overwhelmingly common case (Fastly
serves matching bytes immediately) exits on the first poll.
Verified locally: pip download of the current PyPI-latest
(molecule-ai-workspace-runtime 0.1.29) produced
sha256=7e782b2d50812257…, exactly matching PyPI's own metadata
endpoint. The mismatch path is exercised inline (different builds
of the same version produce different hashes by definition — the
build_runtime_package.py output is timestamp-deterministic only
within a single CI invocation).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #134. The post-merge review of #2196 flagged that the combined
workflow's `paths:` filter (the union of both jobs' needs:
`workspace/**` + `scripts/build_runtime_package.py` + the workflow
itself) caused the `pypi-latest-install` job to fire on every
doc-only / adapter-only / unrelated workspace/ edit. The PyPI artifact
that job tests against can't change based on our workspace/ source —
only on actual PyPI publishes — so those runs add noise without
information.
Splits the previously-merged combined workflow:
runtime-pin-compat.yml (kept):
- PyPI-latest install + import smoke (was: pypi-latest-install)
- Narrow `paths:` filter — only fires when workspace/requirements.txt
or this workflow file changes
- Cron-driven daily for upstream-yank detection (unchanged)
runtime-prbuild-compat.yml (new):
- PR-built wheel + import smoke (was: local-build-install)
- Broad `paths:` filter — fires on any workspace/ source change,
scripts/build_runtime_package.py, or this workflow file
- No cron (workspace/ doesn't change between firings)
Job behavior is identical to before; only the trigger surface is
narrower per job. Each workflow's name is its own status check, so
branch protection (which currently lists neither as required) can
gate them independently in future.
The prior comment in the combined file explicitly acknowledged the
asymmetry and proposed this split as a follow-up; this is that
follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cascade's PyPI-propagation gate polled `/pypi/<pkg>/<ver>/json`,
which is one of THREE surfaces pip touches when resolving an install:
1. /pypi/<pkg>/<ver>/json — metadata endpoint (the old check)
2. /simple/<pkg>/ — pip's primary download index
3. files.pythonhosted.org — CDN-fronted wheel binary
Each has its own cache. Any one of them can lag behind the others,
and the previous gate would let the cascade fire while (2) or (3)
still served the previous version. Downstream `pip install` in the
template repos then resolved to the OLD wheel, the docker layer
cache locked that stale resolution in, and subsequent rebuilds kept
shipping the old runtime — the "five times in one night" cache trap
referenced in the prior comment.
Replace the metadata-only poll with an actual `pip install
--no-cache-dir --force-reinstall --no-deps PACKAGE==VERSION` from
a fresh venv. If pip can resolve and install the exact version we
just published, every receiver template will too — pip itself is
the ground truth for what the receivers will see, no proxy guessing
about which surface is lagging.
- Venv created once outside the loop; only `pip install` runs in
the poll body.
- --no-cache-dir + --force-reinstall ensures every poll hits the
live PyPI surfaces (no local-cache mask).
- --no-deps keeps each poll fast — we only care about resolving
THIS package, not its dep tree.
- Loop budget: 30 attempts × 4s ≈ 2 min (vs prior 30 × 2s = 60s).
Generous vs typical PyPI propagation, surfaces real upstream
issues past the budget.
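The loop shape, with `probe_install` as a hypothetical stand-in for the pip invocation inside the pre-created venv:

```shell
# Hedged sketch of the propagation poll. probe_install stands in for
# `pip install --no-cache-dir --force-reinstall --no-deps pkg==ver`
# run in the venv; defaults mirror the 30 x 4s budget above.
poll_pypi_propagation() {
  attempts=${1:-30}; delay=${2:-4}; i=1
  while [ "$i" -le "$attempts" ]; do
    if probe_install; then
      echo "installable after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "version not installable within budget; real upstream issue" >&2
  return 1
}
```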
Verified locally:
- Probing a non-existent version (0.1.999999) → pip exits 1, loop
retries.
- Probing the current PyPI-latest → pip exits 0, `pip show`
returns the version, loop succeeds.
Closes #130.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #128's chicken-and-egg. The original gate installed the
CURRENTLY-PUBLISHED molecule-ai-workspace-runtime from PyPI, then
overlaid workspace/requirements.txt, then smoke-imported. That
catches problems with the already-shipped artifact (the daily-cron
upstream-yank case), but it cannot catch problems introduced by the
PR itself: the imports it exercises are from the OLD wheel, not the
PR's source. A PR that adds `from a2a.utils.foo import bar` (where
`bar` is added in a2a-sdk 1.5 and the runtime currently pins 1.3)
slips through:
1. Pip resolves the existing PyPI wheel + a2a-sdk 1.3.
2. Smoke imports the OLD main.py — no reference to `bar` → green.
3. Merge → publish-runtime.yml ships a wheel WITH the new import.
4. Tenant images redeploy → all crash on first boot with
ImportError: cannot import name 'bar' from 'a2a.utils.foo'.
Splits the workflow into two jobs:
- pypi-latest-install (renamed from default-install): unchanged
behavior. Runs on the daily cron and on requirements.txt /
workflow edits. Catches upstream PyPI yanks + the
already-shipped artifact going stale.
- local-build-install (new): runs scripts/build_runtime_package.py
on the PR's workspace/, builds the wheel with python -m build
(mirroring publish-runtime.yml byte-for-byte), installs that
wheel, then runs the same smoke import. Tests the artifact
that WOULD be published if this PR merges.
Path filter widened to workspace/** so any runtime-source change
triggers the local-build job. The pypi-latest job's filter is the
same union; its internal logic is unchanged so the daily-cron and
upstream-detection use cases continue to work.
Verified locally: built the wheel from current workspace/ source via
the same script + python -m build invocation, installed into a fresh
venv, imported from molecule_runtime.main import main_sync
successfully.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The existing wheel-smoke catches AgentCard kwarg-shape regressions
(state_transition_history, supported_protocols) but doesn't catch the
SDK-contract drift class that #2193 just fixed in production: the
a2a-sdk 1.x rename of /.well-known/agent.json →
/.well-known/agent-card.json, plus AGENT_CARD_WELL_KNOWN_PATH moving
to a2a.utils.constants. main.py's readiness probe hardcoded the old
literal and 404'd every attempt, silently dropping every workspace's
initial_prompt for ~weeks before a user reported it.
Two additions to the smoke block:
1. Mount alignment: build an AgentCard, call create_agent_card_routes(),
and assert AGENT_CARD_WELL_KNOWN_PATH is among the mounted paths.
Catches a future SDK release that decouples the constant value
from the route factory's mount path. The source-tree test
(workspace/tests/test_agent_card_well_known_path.py) catches the
main.py side; this catches the SDK side BEFORE PyPI upload.
2. Message helper smoke: import a2a.helpers.new_text_message and
instantiate one. The v0→v1 cheat sheet (memory:
reference_a2a_sdk_v0_to_v1_migration.md) flagged this as a real
migration find — main.py and a2a_executor.py call it in hot
paths, so an import break errors every reply before the message
even leaves the workspace.
Verified by running the equivalent Python inside
ghcr.io/molecule-ai/workspace-template-langgraph:latest:
✓ well-known mount alignment OK (/.well-known/agent-card.json)
✓ message helper import + call OK
Closes the structural-fix half of the #2193 finding from the code-
review-and-quality pass: "the wheel publish smoke didn't catch this.
This is the 7th a2a-sdk migration find of this kind. Task #131 is the
right root-cause fix."
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Independent code review caught a real bug in the previous commit's
stale-token revoke pass. The platform's restart endpoint
(workspace_restart.go:104) stops the workspace container synchronously,
then dispatches re-provisioning to a goroutine (line 173). For a
workspace that's been idle past the 5-minute grace window — extremely
common: user comes back to a long-idle workspace and clicks Restart —
this opens a race window:
1. Container stopped → ListWorkspaceContainerIDPrefixes returns no
entry → workspace becomes a stale-token candidate.
2. issueAndInjectToken runs in the goroutine: revokes old tokens,
issues a fresh one, writes it to /configs/.auth_token.
3. If the sweeper's predicate-only UPDATE
`WHERE workspace_id = $1 AND revoked_at IS NULL` runs AFTER
IssueToken commits but is racing the SELECT-then-UPDATE window,
it revokes the freshly-issued token alongside the old ones.
4. Container starts with a now-revoked token → 401 forever.
The fix carries the SAME staleness predicate from the SELECT into the
per-workspace UPDATE: a token created within the grace window can't
match `< now() - grace` and is automatically excluded. The operation
is now idempotent against fresh inserts.
Also addresses other findings from the same review:
- Add `status NOT IN ('removed', 'provisioning')` to the SELECT
(R2 + first-line C1 defence). 'provisioning' is set synchronously
in workspace_restart.go before the async re-provision begins, so
it's a reliable in-flight signal that narrows the candidate set.
- Stop calling wsauth.RevokeAllForWorkspace from the sweeper —
that helper revokes EVERY live token unconditionally; the sweeper
needs "every STALE live token" which is a different (safer)
operation. Inline the UPDATE so we own the predicate end-to-end.
Drop the wsauth import (no longer needed in this package).
- Tighten expectStaleTokenSweepNoOp regex to anchor at start and
require the status filter, so a future query whose first line
coincidentally starts with "SELECT DISTINCT t.workspace_id" can't
silently absorb the helper's expectation (R3).
- Defensive `if reaper == nil { return }` at top of
sweepStaleTokensWithoutContainer — even though StartOrphanSweeper
already short-circuits on nil, a future refactor that wires this
pass directly without checking would otherwise mass-revoke in
CP/SaaS mode (F2).
- Comment in the function explaining why empty likes is intentionally
NOT a short-circuit (asymmetry with the first two passes is the
whole point — "no containers running" is the load-bearing case).
- Add TestSweepOnce_StaleTokenRevokeUsesStalenessPredicate that
asserts the UPDATE shape (predicate present, grace bound). A
real-Postgres integration test would prove the race resolution
end-to-end; this catches the regression where someone simplifies
the UPDATE back to predicate-only.
- Add TestSweepStaleTokens_NilReaperEarlyExit pinning the F2 guard.
Existing tests updated to match the new query/UPDATE shape with tight
regexes that pin all the safety guards (status filter, staleness
predicate in both SELECT and UPDATE).
Full Go suite green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Heals the user-reported "auth token conflict after volume wipe" failure
mode. When an operator nukes a workspace's /configs volume outside the
platform's restart endpoint (common via `docker compose down -v` or
manual cleanup scripts), the DB still holds live workspace_auth_tokens
for that workspace while the recreated container has an empty
/configs/.auth_token. Subsequent /registry/register calls 401 forever:
requireWorkspaceToken sees live tokens, container has no token to
present, and the workspace is permanently wedged until an operator
manually revokes via SQL.
The platform's restart endpoint already handles this correctly via
wsauth.RevokeAllForWorkspace inside issueAndInjectToken. This change
adds a third orphan-sweeper pass — sweepStaleTokensWithoutContainer —
as the safety net for the equivalent action taken outside the API.
Detection criterion: workspace has at least one live (non-revoked)
token whose most-recent activity (COALESCE(last_used_at, created_at))
is older than staleTokenGrace (5 minutes), AND no live Docker
container's name prefix matches the workspace ID.
Safety filters that bound the revoke radius:
1. Only runs in single-tenant Docker mode. The orphan sweeper is
wired only when prov != nil in cmd/server/main.go — CP/SaaS mode
never gets here, so an empty container list cannot be confused
with "no Docker at all" (which would otherwise revoke every
workspace's tokens in production SaaS).
2. staleTokenGrace = 5min skips tokens issued/used in the last
5 minutes. Bounds the race with mid-provisioning (token issued
moments before docker run completes) and brief restart windows
— a healthy workspace touches last_used_at on every 30s heartbeat,
so 5min is 10× the heartbeat interval.
3. The query joins workspaces.status != 'removed' so deleted
workspaces are not revoked here (handled at delete time by the
explicit RevokeAllForWorkspace call).
4. make_interval(secs => $2) avoids a time.Duration.String() →
"5m0s" mismatch with Postgres interval grammar that I caught
during implementation.
5. Each revocation logs the workspace ID so operators can correlate
"workspace just lost auth" with this sweeper, not blame a
network blip.
Failure mode: revoke fails (transient DB error). Loop bails to avoid
log spam; next 60s cycle retries. Worst case a workspace stays
401-blocked an extra minute.
Tests: 5 new tests covering the headline scenario, the safety gate
(workspace with container is NOT revoked), revoke-failure-bails-loop,
query-error-non-fatal, and Docker-list-failure-skips-cycle. All 11
existing sweepOnce tests updated to register the new third-pass query
expectation via a small `expectStaleTokenSweepNoOp` helper that keeps
their existing assertions readable.
Full Go test suite green: registry, wsauth, handlers, and all other
packages.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The initial-prompt readiness probe in workspace/main.py hardcoded the
pre-1.x well-known path. After the a2a-sdk 1.x bump the SDK started
mounting the agent card at the new canonical path (the value of
`a2a.utils.constants.AGENT_CARD_WELL_KNOWN_PATH`), so the probe
returned 404 every attempt and silently fell through to "server not
ready after 30s, skipping". Net effect: every workspace silently
dropped its `initial_prompt` from config.yaml — the agent never sent
the kickoff self-message, and users hit a fresh chat with no context.
Reported by an external user as "/.well-known/agent.json 404 — the
a2a-sdk agent card route was not being mounted at the expected path".
The route IS mounted; the probe was looking at the wrong place.
Fix imports `AGENT_CARD_WELL_KNOWN_PATH` from `a2a.utils.constants`
and uses it directly in the probe URL — the SDK constant is now the
single source of truth, so any future rename travels through
automatically.
Adds two static regression tests pinning the invariant:
1. No hardcoded `/.well-known/agent.json` literal anywhere in
main.py.
2. The probe URL fstring interpolates AGENT_CARD_WELL_KNOWN_PATH
(catches a "fix" that imports the constant for show but reverts
to a literal in the actual GET).
Verified manually inside ghcr.io/molecule-ai/workspace-template-langgraph
that AGENT_CARD_WELL_KNOWN_PATH == '/.well-known/agent-card.json' and
that `create_agent_card_routes(card)` mounts at exactly that path —
constant + mount are aligned in the runtime image, so the probe will
now find the server.
Full workspace test suite: 1209 passed, 2 xfailed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Manual fresh-user clean-slate test surfaced three friction points in
the existing dev-start.sh:
1. The script ran docker compose -f docker-compose.infra.yml
directly, bypassing infra/scripts/setup.sh — so the workspace
template registry was never populated and the canvas template
palette came up empty (the "Template palette is empty"
troubleshooting hit).
2. ADMIN_TOKEN was not handled at all. Without it, the AdminAuth
fail-open gate worked initially but slammed shut the moment the
first workspace registered a token — at which point the canvas
could no longer call /workspaces or /templates. New users hit
401s with no obvious next step.
3. The script wasn't mentioned in docs/quickstart.md. New users
followed the documented 4-step manual flow and never discovered
the single command existed.
Fixes:
- dev-start.sh now calls infra/scripts/setup.sh, which brings up
full infra (postgres + redis + langfuse + clickhouse + temporal)
AND populates the template/plugin registry from manifest.json.
- On first run, dev-start.sh writes MOLECULE_ENV=development to
.env. This activates middleware.isDevModeFailOpen() which lets
the canvas keep calling admin endpoints without a bearer (the
intended local-dev escape hatch). The .env is preserved on
re-runs and sourced before the platform launches.
- The script intentionally does NOT auto-generate an ADMIN_TOKEN.
A first attempt did, and broke the canvas because isDevModeFailOpen
requires ADMIN_TOKEN empty AND MOLECULE_ENV=development together.
Setting ADMIN_TOKEN in dev would close the hatch and the canvas
has no way to read that token in a dev build (no
NEXT_PUBLIC_ADMIN_TOKEN bake step here). The .env comment block
explicitly warns future contributors not to add it.
- Both processes' logs go to /tmp/molecule-{platform,canvas}.log
instead of stdout-mixed so the readiness banner stays clean.
- Health-poll loops cap at 30s with a clear timeout error pointing
to the log file, instead of hanging forever.
- The readiness banner now lists the log paths AND tells the user
the next step is "open localhost:3000 → add API key in Config →
Secrets & API Keys → Global", instead of just listing service
URLs.
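The fail-open contract the second and third bullets depend on can be sketched as follows; the real check is the platform's isDevModeFailOpen middleware, and this Python stand-in is our reading of it:

```python
def is_dev_mode_fail_open(env: dict) -> bool:
    """Admin endpoints accept bearer-less calls only when BOTH hold:
    ADMIN_TOKEN is unset/empty AND MOLECULE_ENV == 'development'."""
    return not env.get("ADMIN_TOKEN") and env.get("MOLECULE_ENV") == "development"
```

This is why dev-start.sh must not auto-generate an ADMIN_TOKEN: setting one flips the first condition and closes the hatch for the canvas.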
Quickstart doc rewrite leads with:
git clone ...
cd molecule-monorepo
./scripts/dev-start.sh
The 4-step manual flow is preserved as "Manual setup (advanced)"
for contributors who want per-component logs.
Verified end-to-end from clean Docker (no containers, no volumes,
no .env) three times: total wall-clock ~12s for a re-run with
cached npm/docker layers. The platform's HTTP 200 on /workspaces
without a bearer confirms the dev-mode auth hatch is active.
Three different intermittent failures observed during a single
manual-test session — RemoteProtocolError, ReadTimeout, ConnectError —
each surfaced as a "Failed to deliver to <peer>" error chip in the
canvas Agent Comms panel even though the next attempt would have
succeeded (verified by direct probes from the same source workspace
to the same peer). The error chip even said "Usually a transient
network blip — retry once," but it left the retry to a human.
Fix: send_a2a_message now retries internally, up to 5 attempts (1
initial + 4 retries) with exponential backoff (1s, 2s, 4s, 8s,
16s-capped), each backoff jittered ±25% to break sync across
siblings. Cumulative wall-clock capped at 600s by
_DELEGATE_TOTAL_BUDGET_S so a string of 5×300s ReadTimeouts can't
make the caller wait 25 minutes — once the deadline elapses, retries
stop even if attempts remain.
Retry only on transport-layer transients:
- ConnectError / ConnectTimeout (peer's listening socket not ready)
- RemoteProtocolError (peer closed TCP without writing — observed
when a peer's prior in-flight Claude SDK session aborted)
- ReadError / WriteError (network blip on Docker bridge)
- ReadTimeout (peer wrote no response in 300s)
Application-level errors are NOT retried — they're deterministic and
retrying just wastes wall-clock:
- HTTP 4xx (peer rejected the request format)
- JSON parse failures (peer returned garbage)
- JSON-RPC error in response body (peer's runtime errored cleanly)
- Programmer-bug exceptions (ValueError, etc.)
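The retry/no-retry split above amounts to a predicate. The real code matches httpx exception types; this sketch matches on type names so it runs without httpx installed:

```python
# Transport-layer transients, matched by exception type name here
# (the real code matches the httpx exception classes directly).
_TRANSIENT = {
    "ConnectError", "ConnectTimeout",  # peer's listening socket not ready
    "RemoteProtocolError",             # peer closed TCP without writing
    "ReadError", "WriteError",         # network blip on the bridge
    "ReadTimeout",                     # peer wrote no response in time
}

def is_transient(exc: Exception) -> bool:
    """True only for transport transients worth retrying; application-level
    failures (4xx, JSON errors, programmer bugs) are deterministic and fail fast."""
    return type(exc).__name__ in _TRANSIENT
```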
8 new tests pin the contract:
- retry succeeds after 2 RemoteProtocolErrors
- retry succeeds after 1 ConnectError
- all 5 attempts fail → returns formatted last-error
- capped at exactly _DELEGATE_MAX_ATTEMPTS (regression cover for
"did someone bump the constant accidentally?")
- JSON-RPC error response NOT retried (1 attempt only)
- non-httpx exception NOT retried (programmer bugs stay loud)
- total budget caps the loop even if attempts remain
- backoff schedule grows exponentially with ±25% jitter
Refactor: extracted _format_a2a_error() so the success and exhausted
paths share one error-formatting routine. _delegate_backoff_seconds()
is a pure function so the schedule is unit-testable without monkey-
patching asyncio.sleep.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Canvas Agent Comms bubbles for outbound delegation showed only
"Delegating to <peer>" boilerplate during the live update window —
the actual task text only surfaced after a refresh re-fetched the row
from /workspaces/:id/activity. Symptom flagged today during a fresh
delegation manual test where the bubble said "Delegating to Perf
Auditor" instead of the user's "audit moleculesai.app for
performance" prompt.
Root cause: LogActivity's broadcast payload at activity.go:510-518
deliberately omitted request_body and response_body, so the canvas's
live-update path (AgentCommsPanel.tsx:271-289) saw `p.request_body =
undefined` and toCommMessage fell back to the
`Delegating to ${peerName}` template string. The DB row stored the
real task / reply, which is why GET-on-mount worked.
Fix: include both bodies in the broadcast as json.RawMessage values
(no re-marshal cost — they were already encoded for the DB insert
above). Same pattern as tool_trace, which has been included since #1814.
Each side is bounded by the workspace-side caller's own caps: the
runtime's report_activity helper caps error_detail at 4096 chars and
summary at 256; request/response are constrained by the runtime's
own limits — typical delegate_task payload is hundreds of chars to a
few KB. If a much-larger broadcast becomes a concern later, a soft
cap can be added at this site without breaking the contract.
Three regression tests pin the broadcast shape:
- request_body present → canvas renders the actual task text
- response_body present → canvas renders the actual reply text
- response_body nil → omitted from payload (no empty-bubble flicker)
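The fixed broadcast shape, rendered in Python for illustration (the real site is Go and splices the pre-encoded bodies as json.RawMessage; the field names follow the commit, the builder itself is hypothetical):

```python
def build_broadcast(event, request_body, response_body):
    """Attach the already-encoded bodies verbatim (no re-marshal),
    omitting response_body entirely when it is nil/None so the canvas
    never renders an empty bubble."""
    payload = dict(event)
    payload["request_body"] = request_body      # pass-through, pre-encoded
    if response_body is not None:
        payload["response_body"] = response_body
    return payload
```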
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cascade-deleting a 7-workspace org returned 500 with
"workspace marked removed, but 2 stop call(s) failed — please retry:
stop eeb99b5d-...: force-remove ws-eeb99b5d-607: Error response
from daemon: removal of container ws-eeb99b5d-607 is already in
progress"
even though the DB-side post-condition succeeded (removed_count=7) and
the containers WERE removed shortly after. The fanout fired Stop() on
every workspace concurrently and the orphan sweeper happened to reap
two of them at the same instant, so Docker rejected the second
ContainerRemove with "removal already in progress" — a race-condition
ack, not a real failure. Retrying just races the same in-flight
removal.
The post-condition we care about (the container WILL be gone) is
identical to a successful removal, so Stop() should treat it the
same way it already treats "No such container" — a no-op return nil
that lets the caller proceed with volume cleanup. Real daemon
failures (timeout, EOF, ctx cancel) still surface as errors.
Two pieces:
- New isRemovalInProgress() predicate using the same string-match
approach as isContainerNotFound (docker/docker has no typed
errdef for this; the CLI itself relies on the message).
- Stop() now treats the predicate as success, with a log line
distinct from the not-found path so debugging can tell which
race fired.
Both substrings ("removal of container" + "already in progress") must
match — "already in progress" alone would false-positive on unrelated
operations like image pulls. Truth table pinned in 7 new test cases.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>