Two artifacts that unblock the parked follow-ups from #59:
1. scripts/edge-429-probe.sh (closes the "operator-blocked" status of
#62). An operator without CF/Vercel dashboard access can reproduce
a canvas-sized burst against a tenant subdomain and read each 429's
response shape — workspace-server bucket overflow (JSON body +
X-RateLimit-* headers) is distinguishable from CF (cf-ray) and
Vercel (x-vercel-id) by inspection of the report. Read-only,
parallel via background subshells (no GNU parallel dependency),
no credential use. Smoke-tested against example.com end-to-end.
2. docs/engineering/ratelimit-observability.md (closes the
"metric-blocked" status of #64). The existing
molecule_http_requests_total{path,status} counter + X-RateLimit-*
response headers already cover #64's acceptance criterion ("watch
metrics for two weeks"). The runbook collects the PromQL queries,
a decision tree for the re-tune (keep / per-tenant override /
change default), an alert rule template, and a hard "do not roll
ad-hoc per-bucket-key exposure" note (in-memory map includes
SHA-256 of bearer tokens — exposing it is a security review
surface, file a follow-up if needed).
Neither artifact changes runtime behaviour. Pure operational tooling.
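The probe's 429-source classification (item 1) can be sketched as below. The function name is illustrative; the real script reads each request's full response dump, but the header-based discrimination is the part the report relies on:

```shell
# Hypothetical sketch of 429-source classification. Order matters: a
# workspace-server 429 still traverses Cloudflare (so cf-ray may be
# present too); the X-RateLimit-* check therefore comes first.
classify_429() {  # classify_429 <file-with-response-headers>
  if grep -qi '^x-ratelimit-' "$1"; then
    echo "workspace-server"   # bucket overflow: JSON body + X-RateLimit-*
  elif grep -qi '^cf-ray:' "$1"; then
    echo "cloudflare"
  elif grep -qi '^x-vercel-id:' "$1"; then
    echo "vercel"
  else
    echo "unknown"
  fi
}
```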
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Refs Task #165 (Class D AUTO_SYNC_TOKEN plumbing).
main and staging diverged after the 2026-05-06 GitHub-org suspension
because Class D / Class G / feature work landed on staging while
unrelated CI fixes (#34-47, ECR auth-inline, buildx→docker, pre-clone
manifest deps) landed straight on main. Both branches edited the
same workflow files, so every push to main triggered an Auto-sync
run that aborted at `git merge --no-ff origin/main` with 7 content
conflicts:
- .github/workflows/canary-verify.yml (URL: github.com → Gitea)
- .github/workflows/ci.yml (3 URL refs)
- .github/workflows/publish-runtime.yml (cascade: HTTP repo-dispatch →
  Gitea push)
- .github/workflows/publish-workspace-server-image.yml (drop AWS-action
  steps; ECR auth is inline)
- .github/workflows/retarget-main-to-staging.yml (URL)
- manifest.json (lowercase org slug + add mock-bigorg from main)
- scripts/clone-manifest.sh (keep main's MOLECULE_GITEA_TOKEN auth path +
  drop awk-tolower since manifest is now lowercase)
Resolution: union — staging's post-suspension Gitea/ECR migrations win
on URL/policy edits; main's additive work (mock-bigorg manifest entry,
inline ECR auth, MOLECULE_GITEA_TOKEN basic-auth) is preserved on top.
After this lands, staging is a strict superset of main, so the next
auto-sync run on a push to main will be a clean fast-forward / no-op.
The auto-sync workflow on main also picks up staging's AUTO_SYNC_TOKEN
swap (Class D #26) for free, fixing the latent layer-2 push-auth issue.
Verified locally:
- bash -n scripts/clone-manifest.sh
- python -c 'yaml.safe_load(...)' on each touched workflow
- python -c 'json.load(open(manifest.json))' (21 plugins, 9 templates,
7 org_templates)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Class G #168 PR (#40) caught explicit `github.com/Molecule-AI/<repo>`
URL literals in 23 files but missed two indirect forms:
- `scripts/clone-manifest.sh` lines 50,52 had
`https://github.com/${repo}.git` (the org/repo path is a variable, so the
Class-G regex `github\.com/Molecule-AI/` didn't match).
- `manifest.json` had `"Molecule-AI/<repo>"` (no `github.com` prefix; the
prefix gets prepended by the script).
Together these are the URLs that `Dockerfile.tenant`'s stage-3 templates
RUN actually fetches. After PR #40 the harness-replays workflow against
staging still fails with `fatal: could not read Username for
'https://github.com'` because the in-image build still runs the unfixed
shell loop.
This PR:
- scripts/clone-manifest.sh: replaces both clone URLs with
`https://git.moleculesai.app/${repo}.git`. Anonymous public clones
work for these repos (verified manually).
- manifest.json: lowercases `Molecule-AI/` to `molecule-ai/` to match
Gitea's canonical org slug. Gitea is case-insensitive so both work,
but the lowercase form matches every other URL in the org and is
what main's clone-manifest.sh (PR #38) already standardises on.
This is the minimum-diff staging fix. The sister PR (#173) already
shipped a more sophisticated version on main (with optional
MOLECULE_GITEA_TOKEN auth + per-build pre-clone). When auto-sync
resolves the staging-vs-main conflict, this minimal version is
superseded by main's version naturally.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The GitHub org Molecule-AI was suspended on 2026-05-06; canonical SCM
is now Gitea at https://git.moleculesai.app/molecule-ai/. Stale
github.com/Molecule-AI/... URLs return 404 and break tooling that
clones / pip-installs / curls them.
This bundles all non-Go-module URL fixes for this repo into a single PR.
Go module path references (in *.go, go.mod, go.sum) are out of scope
here -- tracked separately under Task #140.
Token-auth clone URLs also flip ${GITHUB_TOKEN} -> ${GITEA_TOKEN} since
the GitHub token does not auth against Gitea.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
publish-workspace-server-image.yml could not run on Gitea Actions because
Dockerfile.tenant's stage 3 ran `git clone` against private Gitea repos
from inside the Docker build context, where no auth path exists. Every
workspace-server rebuild required a manual operator-host push.
Move cloning to the trusted CI context (where AUTO_SYNC_TOKEN — the
devops-engineer persona PAT — is naturally available). Dockerfile.tenant
now COPYs from .tenant-bundle-deps/, populated by the workflow's new
"Pre-clone manifest deps" step. The Gitea token never enters the image.
- scripts/clone-manifest.sh: optional MOLECULE_GITEA_TOKEN env embeds
basic-auth in the clone URL; redacted in log output. Anonymous fallback
preserved for future public-repo path.
- .github/workflows/publish-workspace-server-image.yml: new pre-clone
step before docker build; injects AUTO_SYNC_TOKEN. Fail-fast if the
secret is empty.
- workspace-server/Dockerfile.tenant: drop stage 3 (templates), COPY
from .tenant-bundle-deps/ instead. Header documents the prereq.
- .gitignore: ignore /.tenant-bundle-deps/ so a local build can't
accidentally commit cloned repos.
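The token-embedding + log-redaction shape in clone-manifest.sh can be sketched as below. Function names and the `oauth2:` basic-auth user are assumptions; the script itself is the source of truth:

```shell
# Illustrative sketch: embed MOLECULE_GITEA_TOKEN in the clone URL when
# present, but strip credentials before anything reaches log output.
clone_url_for() {  # clone_url_for <org/repo>
  if [ -n "${MOLECULE_GITEA_TOKEN:-}" ]; then
    printf 'https://oauth2:%s@git.moleculesai.app/%s.git\n' \
      "$MOLECULE_GITEA_TOKEN" "$1"
  else
    # anonymous fallback for the future public-repo path
    printf 'https://git.moleculesai.app/%s.git\n' "$1"
  fi
}

log_clone() {  # log the URL with any userinfo section redacted
  printf 'cloning %s\n' \
    "$(printf '%s' "$1" | sed -E 's#//[^@/]+@#//<redacted>@#')"
}
```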
Verified locally: clone-manifest.sh with the devops-engineer persona
token cloned all 37 repos (9 ws + 7 org + 21 plugins, 4.9MB after
.git strip).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Post-2026-05-06 GitHub-org suspension: scripts/clone-manifest.sh
was still pointing at https://github.com/${repo}.git, so the
Docker build for workspace-server's platform image fails at:
fatal: could not read Username for 'https://github.com':
No such device or address
with no credentials available in the build container.
Fix: clone from https://git.moleculesai.app/${repo}.git instead.
manifest.json's repo paths still read 'Molecule-AI/...' (the
historic GitHub slug, mixed-case); Gitea lowercases the org
component to 'molecule-ai/...'. Lowercase the org segment on
the fly with awk so we don't need to rewrite every manifest
entry.
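The on-the-fly transform described above, as a minimal sketch (the helper name is illustrative):

```shell
# Lowercase only the org segment of a manifest path:
# "Molecule-AI/some-plugin" -> "molecule-ai/some-plugin".
# The repo segment is left untouched.
to_gitea_path() {
  printf '%s\n' "$1" | awk -F/ '{ printf "%s/%s\n", tolower($1), $2 }'
}
```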
Local verify: bash -n passes, lowercase transform produces correct
Gitea paths, anonymous git clone of one of the manifest plugins
over HTTPS to git.moleculesai.app succeeds.
Class G in the prod-ship CI sweep — same shape as the github.com ref
that Harness Replays hits; this is the second instance found.
Per documentation-specialist's grep agent (2026-05-07T07:30, see
internal#46): runtime-breaking ghcr.io references in shell scripts,
docker-compose, and lint_secret_pattern_drift.py all need migration.
All of these slipped past security-auditor's workflow-only audit.
Files (6):
- .github/scripts/lint_secret_pattern_drift.py:40 — workspace-runtime
pre-commit-checks.sh consumer URL: raw.githubusercontent.com →
Gitea raw URL (https://git.moleculesai.app/molecule-ai/.../raw/
branch/main/...). The lint job runs in CI and would 404 today.
- scripts/refresh-workspace-images.sh:54 — workspace-template image
pull URL: ghcr.io → ECR (153263036946.dkr.ecr.us-east-2.amazonaws.com).
- scripts/rollback-latest.sh — full rewrite of header + auth flow:
* ghcr.io/molecule-ai/{platform,platform-tenant} → ECR
* GITHUB_TOKEN with write:packages → AWS ECR auth
(aws ecr get-login-password). Per saved memory
reference_post_suspension_pipeline, prod cutover is to ECR.
* Updated header docs to match new auth flow + prereqs.
- scripts/demo-freeze.sh:13,17 — comment-only ghcr → ECR
(the script doesn't currently exec these URLs, but the comments
describe the cascade and need to match reality).
- docker-compose.yml:215-216 — canvas image: ghcr.io → ECR + updated
the auth comment to describe `aws ecr get-login-password` flow.
- tools/check-template-parity.sh:21 — inline curl install instructions:
raw.githubusercontent.com → Gitea raw URL.
Hostile self-review:
1. rollback-latest.sh's GITHUB_TOKEN→aws-cli auth swap is a behavior
change. Operators using this script now need aws CLI
authenticated for region us-east-2 with ECR pull/push perms.
   Documented in updated header. Operators who don't have the aws CLI
   will get 'aws: command not found', which is a clear failure mode
   (not silent).
2. The Gitea raw URL shape (/raw/branch/main/) differs from GitHub's
raw.githubusercontent.com structure. Verified pattern by
inspecting other Gitea raw URLs in the codebase. If Gitea's URL
changes (1.23+), update via the same one-line edit.
3. Doesn't touch packer/scripts/install-base.sh which has a similar
ghcr.io ref per the grep agent's findings — that's bigger-scope
(packer build pipeline) and lives in molecule-controlplane-ish
territory; filing as parked follow-up under #46 if not already.
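The raw-URL path-shape difference from self-review item 2 can be sketched as a mapping helper (function name is illustrative; the `/raw/branch/` segment follows the pattern observed in the codebase's other Gitea raw URLs):

```shell
# GitHub: https://raw.githubusercontent.com/<org>/<repo>/<ref>/<path>
# Gitea:  https://git.moleculesai.app/<org>/<repo>/raw/branch/<ref>/<path>
gitea_raw_url() {  # gitea_raw_url <org> <repo> <ref> <path>
  printf 'https://git.moleculesai.app/%s/%s/raw/branch/%s/%s\n' \
    "$1" "$2" "$3" "$4"
}
```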
Refs: molecule-ai/internal#46, molecule-ai/internal#37,
molecule-ai/internal#38, saved memory reference_post_suspension_pipeline
The molecule-ai-workspace-runtime mirror is regenerated on every
runtime-v* tag from this monorepo's workspace/. Per saved memory
reference_runtime_repo_is_mirror_only, mirror-guard rejects direct
PRs to the mirror; edit at source.
Source-side files that propagate to the mirror's published README or
are read by users of the in-monorepo workspace-runtime docs:
- scripts/build_runtime_package.py (the README generator):
* line 281 README_TEMPLATE: 'Shared workspace runtime for Molecule
AI' link → Gitea
* line 399 doc-link to workspace-runtime-package.md → Gitea path
(with /src/branch/main/ shape)
LEFT AS-IS (per Q3 audit-trail decision):
* lines 379, 392 historical issue cross-refs (#2936, #2937)
- workspace/build-all.sh:5 — comment block linking to template-*
repos. Migrated to Gitea path-shape.
- docs/workspace-runtime-package.md:
* lines 101-108 adapter→repo table (8 templates, all PUBLIC on
Gitea) — Gitea URLs
* line 247 starter-repo link — substituted host + added inline
note that starter doesn't survive the suspension migration
(recreation pending; cross-link to this issue)
* line 259 generic git clone command for new templates → Gitea
* line 289 second starter mention — same handling as 247
Files NOT touched in this PR:
- workspace/ Python source code (.py files) — those use github
paths in docstrings + a few log strings; fix bundled with the
cross-repo Go-module-style migration (per #37 Q5 + parked
follow-ups).
- 'Writing a new adapter' section's `gh repo create` command (line
254-256) — gh CLI doesn't talk to Gitea (per #45 parked follow-up).
- 'Writing a new adapter' section's ghcr.io image ref (line 276) —
per #46 ghcr→ECR migration (separate concern).
After this PR merges to staging + a runtime-v* tag is pushed, the
mirror's published README will inherit the Gitea link. Until then
the mirror's README continues to reference github.com/Molecule-AI
(stale but historical-marker-correct since the mirror existed
pre-suspension).
Refs: molecule-ai/internal#41, molecule-ai/internal#37,
molecule-ai/internal#38, molecule-ai/internal#42,
molecule-ai/internal#45, molecule-ai/internal#46
Closes the silent-block failure mode that left 25 commits — including
the Memory v2 redesign and the reno-stars data-loss fix — wedged on
staging for 12+ hours behind a single missing review. The auto-promote
workflow opened the PR + armed auto-merge, but main's branch protection
required a human review and nobody noticed until a user reported
"still seeing old memory tab".
## Detection logic — `scripts/check-stale-promote-pr.sh`
Reads open PRs (`base=main head=staging`) and alarms when all of the
following hold:
- `mergeStateStatus == BLOCKED`
- `reviewDecision == REVIEW_REQUIRED`
- createdAt older than `STALE_HOURS` (default 4h)
Other BLOCKED reasons (DIRTY, BEHIND, failed checks) are NOT alarmed —
those are the author's signal-to-fix. This script targets the specific
"no human reviewed yet" wedge.
Output:
- `::warning` per stale PR (visible in workflow summary + Actions UI)
- PR comment (idempotent via marker-string detection; one alarm
per PR, never re-spammed)
- Exit code = count of stale PRs (capped at 125)
Logic in a script (not inline workflow YAML) so it's:
- **Unit-testable** — tests/test-check-stale-promote-pr.sh exercises
every branch with stubbed fixture JSON + frozen clock. 23 tests
covering: empty list, single stale, just-under-threshold, wrong
reviewDecision, wrong mergeStateStatus, mixed list (only matching
PRs alarm), custom threshold via --stale-hours, exit-code-counts-
matching-PRs, --help, unknown arg → 64, missing repo → 2.
- **Operator-runnable ad-hoc** — `scripts/check-stale-promote-pr.sh`
works from any shell with `gh` + `jq`.
- **SSOT** — one detector, the workflow YAML is just schedule +
invocation surface. Future sibling workflows that need the same
check call the same script.
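The alarm predicate can be sketched in pure shell. The real script drives it from `gh pr list --json` output via jq; the function name below is illustrative:

```shell
# True (exit 0) only for the specific wedge: BLOCKED because a human
# review is still required, and the PR has sat past the threshold.
is_stale_promote_pr() {  # <mergeStateStatus> <reviewDecision> <createdAt-ISO> [stale-hours]
  local stale_hours="${4:-4}" now created_s
  [ "$1" = "BLOCKED" ] || return 1          # DIRTY/BEHIND: author's signal-to-fix
  [ "$2" = "REVIEW_REQUIRED" ] || return 1  # failed checks etc. not alarmed
  now=$(date -u +%s)
  created_s=$(date -u -d "$3" +%s)          # GNU date; ISO-8601 input
  [ $(( (now - created_s) / 3600 )) -ge "$stale_hours" ]
}
```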
## Workflow — `.github/workflows/auto-promote-stale-alarm.yml`
Triggers:
- cron `27 * * * *` (hourly, off-the-hour to dodge cron herd)
- workflow_dispatch with `stale_hours` + `post_comment` overrides
Concurrency: `auto-promote-stale-alarm` group, cancel-in-progress=false
(idempotent script; no benefit to cancelling a running scan).
Permissions: `contents: read` + `pull-requests: write` (post comments).
Sparse checkout — only fetches `scripts/check-stale-promote-pr.sh`.
No node_modules, no go modules, no slow setup steps. Workflow runs
in <30s on a clean repo.
## Why "alarm + comment" not "auto-approve"
Considered options in issue #2975:
1. Slack/email alert — picked.
2. Bot-account auto-approve via molecule-ops — circumvents the
human-review gate that branch protection encodes.
3. Trusted-promote bypass via CODEOWNERS — needs Org Admin config
change; out of scope for a workflow PR.
The comment-on-PR pattern picks (1) without external dependencies
(no Slack token, no email config). Subscribers get notified via
GitHub's existing PR notification delivery; the warning shows up in
the Actions feed.
## Why this won't false-positive on legitimate slow reviews
Threshold is 4h. Most legitimate gates clear in <1h, so 4× headroom
is plenty for slow CI. The comment is idempotent (one alarm per PR,
never re-posted), so the noise stops at one comment regardless of how
long the PR sits.
## Test plan
- [x] `bash scripts/test-check-stale-promote-pr.sh` — 23/23 pass
- [x] `python3 -c 'yaml.safe_load(...)'` clean
- [x] `bash -n` clean on both scripts
- [ ] Live verification: dispatch the workflow once main has caught up,
confirm it correctly reports zero stale PRs
The drift gate caught the new SSOT parser module — without registration
the wheel ships it un-rewritten and runtime imports fail. Same pattern
as inbox_uploads, a2a_tools_delegation, a2a_tools_rbac registrations.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #2934 item 6 — the deferred follow-up from Ryan's
onboarding-friction report. Quote: "this single command would have
saved me 30 of the 45 minutes."
When push delivery fails or the install half-works, the operator
today has no signal — they hand-grep the Claude Code binary or
chase the `from versions: none` red herring. Doctor renders six
checks in one screen with concrete next-step suggestions:
1. Python version: >=3.11? (the wheel's pin)
2. Wheel install: molecule-ai-workspace-runtime importable + version
   surfaced
3. PATH: binary `molecule-mcp` resolves on PATH; if not, prints the
   resolved user-site bin dir to add (or recommends pipx)
4. Env vars: PLATFORM_URL + WORKSPACE_ID + token (env or *_FILE or
   .auth_token)
5. Platform reach: GET ${PLATFORM_URL}/healthz returns 2xx
6. Registry register: POST /registry/register with the resolved token
   returns 2xx (end-to-end auth check)
Each line: `[OK|WARN|FAIL] <label>: <status>` plus a `next:` hint
when not OK. ANSI colors auto-disable on non-TTY / NO_COLOR.
Exit code: 0 on all-OK or only-WARN, 1 on any FAIL — scriptable
from CI install-checks.
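Doctor itself is Python; this shell sketch only illustrates the output line format and the exit-code contract (names are illustrative):

```shell
overall=0
report() {  # report <OK|WARN|FAIL> <label> <status> [next-hint]
  printf '[%s] %s: %s\n' "$1" "$2" "$3"
  if [ -n "${4:-}" ]; then printf '      next: %s\n' "$4"; fi
  if [ "$1" = FAIL ]; then overall=1; fi  # WARN alone keeps exit 0
}
```

e.g. `report FAIL 'Platform reach' 'connection refused' 'check PLATFORM_URL'` for each check, then `exit $overall`.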
## Files
`workspace/mcp_doctor.py` (new): the six check functions + a `run()`
entry point. Uses urllib (stdlib) so doctor works even on a partial
install where `requests` is missing.
`workspace/mcp_cli.py`: subcommand dispatch. `molecule-mcp doctor` →
mcp_doctor.run(); `molecule-mcp --help` → usage banner; `molecule-mcp`
→ server (unchanged).
`workspace/tests/test_mcp_doctor.py` (new): 10 tests covering each
check's pass/fail/skip path plus the end-to-end exit-code contract on
a stripped env.
`scripts/build_runtime_package.py`: adds `mcp_doctor` to
TOP_LEVEL_MODULES so the wheel ships the new module.
## Out of scope (deferred follow-ups)
- Claude Code-specific checks (parse ~/.claude.json, verify each
MCP entry is plugin-sourced + dev-channels flag set). That's a
separate Claude-Code-shaped doctor; lives in the channel plugin.
- Automated remediation. Doctor is diagnostic — tells the operator
what's wrong + how to fix it, doesn't apply changes.
## Verification
- python -m pytest tests/test_mcp_doctor.py -v → 10/10 PASS
- python -m pytest tests/test_mcp_cli*.py → 67/67 PASS
(existing CLI suite still green; subcommand dispatch added
before env-validation, doesn't disturb the server-boot path)
- manual: `molecule-mcp doctor` on a stripped env renders 4 FAIL
+ 2 WARN + exit code 1, with each `next:` hint actionable
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-review of #2935 turned up two real defects:
1. Stale README issue references — the build_runtime_package.py
README template said "(issue #2934 follow-up)" twice, but the
marketplace-plugin and `doctor` items now have dedicated tracking
issues. Updated to point at #2936 and #2937 respectively.
2. Silent fallthrough on broken MOLECULE_WORKSPACE_TOKEN_FILE — when
an operator EXPLICITLY pointed TOKEN_FILE at a path that didn't
exist / wasn't readable / was blank / contained internal whitespace,
the resolver silently returned the generic "set one of these three
vars" error. That's exactly the silent failure mode #2934 flagged
("a new user has no chance"). Refactor `_read_token_from_file_env`
to return `(token, error)`; surface the SPECIFIC failure when the
operator's intent was clearly the file path. Skip the CONFIGS_DIR
fallback in that case so the operator's config bug isn't masked
by a different source happening to work.
Adds 2 renames + 2 new tests in test_mcp_cli_split.py:
- test_missing_file_returns_specific_error (asserts "does not exist")
- test_empty_file_returns_specific_error (asserts "is empty")
- test_multi_line_file_rejected (asserts "internal whitespace")
- test_token_file_error_skips_configs_dir_fallback (asserts a valid
CONFIGS_DIR/.auth_token does NOT silently rescue a broken
TOKEN_FILE)
All 81 mcp_cli + mcp_cli_multi_workspace + mcp_cli_split tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Continues the OSS-shape refactor. After iters 4a-4d (rbac, delegation,
memory, messaging) the only behavior left in ``a2a_tools.py`` was
``report_activity`` plus three thin inbox-tool wrappers and the
``_enrich_inbound_for_agent`` helper. This iter extracts the inbox
slice to ``a2a_tools_inbox.py`` so the kitchen-sink module shrinks
from 280 LOC to ~165 LOC of imports + report_activity + back-compat
re-export blocks.
Extracted symbols:
- ``_INBOX_NOT_ENABLED_MSG`` (sentinel)
- ``_enrich_inbound_for_agent`` (poll-path peer enrichment helper)
- ``tool_inbox_peek``
- ``tool_inbox_pop``
- ``tool_wait_for_message``
Re-exports (`from a2a_tools_inbox import …`) preserve the public
``a2a_tools.tool_inbox_*`` surface so existing tests + call sites
continue to resolve unchanged.
New tests in test_a2a_tools_inbox_split.py:
1. **Drift gate (5)** — every previously-public symbol on a2a_tools
is the EXACT same object as a2a_tools_inbox.foo (`is`, not `==`),
catches a future "wrap with logging" refactor that silently loses
existing test coverage.
2. **Import contract (1)** — a2a_tools_inbox does NOT eagerly import
a2a_tools at module load. Pins the layered architecture: the
extracted slice depends on ``inbox`` + a lazy ``a2a_client``
import, never on the kitchen-sink that re-exports it.
3. **_enrich_inbound_for_agent branches (5)** — peer_id-empty
(canvas_user) returns dict unchanged; missing peer_id key same;
a2a_client unavailable (test harness, partial install) degrades
gracefully with a bare envelope; registry hit populates
peer_name + peer_role + agent_card_url; registry miss still
surfaces agent_card_url (constructable from peer_id alone).
The full timeout-clamp / validation / JSON-shape behavior matrix for
the three wrappers stays in test_a2a_tools_inbox_wrappers.py — those
tests pass identically against both the alias and the underlying impl.
Wiring updates:
- ``scripts/build_runtime_package.py``: add ``a2a_tools_inbox`` to
``TOP_LEVEL_MODULES`` so it ships in the runtime wheel and the
drift gate doesn't fail the next publish.
- ``.github/workflows/ci.yml``: add ``a2a_tools_inbox.py`` to
``CRITICAL_FILES`` so the 75% MCP/inbox/auth per-file floor
applies — this is now where the inbox-delivery code actually
lives.
Ryan's bug report (#2934) walked through ~45 min of debugging a stock
external-runtime install. This PR fixes the four items he flagged that
have a small surface, and stubs out the larger ones for follow-up.
Fixed in this PR
================
#1 — Python floor disclosure (README in publish bundle)
Add an explicit "Requires Python ≥3.11" section that calls out the
cryptic "Could not find a version that satisfies the requirement"
failure mode; recommend `pipx install` over `pip install` so the
binary lands on PATH automatically; show the explicit `pip install
--user` alternative with the PATH caveat.
#3 — MOLECULE_WORKSPACE_TOKEN_FILE support (mcp_workspace_resolver.py)
Add a third resolution step between the inline env var and the
in-container CONFIGS_DIR fallback. Operators can write the bearer to
a 0600 file (e.g. ~/.config/molecule/token) and point
MOLECULE_WORKSPACE_TOKEN_FILE at it, keeping the secret out of
~/.zsh_history and out of plaintext in MCP-host configs like
~/.claude.json. Inline TOKEN still wins on conflict so rotation flows
are predictable. README documents the safer option as the
recommended path. 6 new tests pin every leg (file resolves, inline
wins, missing/empty file falls through, blank env unset-equivalent,
help text advertises it).
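The three-step resolution order can be sketched as below. The real resolver is Python (mcp_workspace_resolver.py); the whitespace-trim and the CONFIGS_DIR path shape are assumptions for illustration:

```shell
# Sketch of the resolution order added by #3: inline env var wins,
# then MOLECULE_WORKSPACE_TOKEN_FILE, then the in-container
# CONFIGS_DIR fallback. A missing or empty file falls through.
resolve_token() {
  if [ -n "${MOLECULE_WORKSPACE_TOKEN:-}" ]; then
    printf '%s' "$MOLECULE_WORKSPACE_TOKEN"; return 0
  fi
  if [ -n "${MOLECULE_WORKSPACE_TOKEN_FILE:-}" ] \
     && [ -r "$MOLECULE_WORKSPACE_TOKEN_FILE" ]; then
    tok=$(tr -d ' \t\r\n' < "$MOLECULE_WORKSPACE_TOKEN_FILE")
    if [ -n "$tok" ]; then printf '%s' "$tok"; return 0; fi
  fi
  if [ -r "${CONFIGS_DIR:-/configs}/.auth_token" ]; then
    tr -d ' \t\r\n' < "${CONFIGS_DIR:-/configs}/.auth_token"; return 0
  fi
  return 1
}
```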
#4 — Push delivery 3-condition gating (README in publish bundle)
Document that real-time push on Claude Code requires (a) the server
to declare experimental.claude/channel (we do), (b) the server to be
marketplace-plugin-sourced (operators must scaffold their own until
the official marketplace lands — see #2934 follow-up), and (c) the
--dangerously-load-development-channels flag on the claude
invocation. Until any of the three is in place, delivery silently
falls back to poll mode with no diagnostic. The README now says all
of this explicitly so a new operator doesn't grep the binary for
channel_enable to figure it out.
#8 — serverInfo.name mismatch (a2a_mcp_server.py)
The server reported `serverInfo.name = "a2a-delegation"` while
operators register it as `molecule` (the name in `claude mcp add
molecule …`). Harmless on tool routing today but matters for any
future Claude Code allowlist that gates push by hardcoded server
name. Renamed to "molecule" with an inline comment explaining the
invariant.
Deferred (separate issues to track)
===================================
#2 — covered transitively by #1's pipx recommendation; no separate fix.
#5 — `moleculesai/claude-code-plugin` marketplace repo (substantial new
repo work; the README references it as a documented follow-up).
#6 — `molecule-mcp doctor` subcommand (substantial new CLI surface;
mentioned in the README's push-vs-poll section as the planned
diagnostic for silent push fallback).
#7 — `--dangerously-load-development-channels` rename — not in our
control; that's Claude Code's flag.
Tests
=====
164/164 mcp_cli + a2a_mcp_server tests pass locally
(WORKSPACE_ID=00000000-0000-0000-0000-000000000001 pytest …) including
6 new TestTokenFileEnv cases. Wheel builds successfully via
scripts/build_runtime_package.py with the new README markers verified
in the output.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fourth slice of the a2a_tools.py split (stacked on iter 4c). Owns the
four human-and-peer messaging MCP tools + the chat-upload helper:
* _upload_chat_files — stage local paths to /chat/uploads
* tool_send_message_to_user — push canvas-chat via /notify
* tool_list_peers — discover peers across registered workspaces
* tool_get_workspace_info — JSON-encode workspace info
* tool_chat_history — fetch prior conversation rows with a peer
a2a_tools.py shrinks from 508 → 213 LOC (−295). The remaining 213
is just report_activity + back-compat re-exports. Inbox tools
(tool_inbox_peek/pop/wait_for_message) deferred to iter 4e.
Layered architecture: messaging depends on a2a_tools_rbac (iter 4a),
a2a_client, platform_auth — NOT on kitchen-sink a2a_tools. An
import-contract test pins this so future refactors that add
`from a2a_tools import …` fail in CI.
Tests:
* 28 patch sites in TestToolSendMessageToUser + TestToolListPeers +
TestToolGetWorkspaceInfo + TestChatHistory retargeted from
`a2a_tools.{httpx, get_peers_*, get_workspace_info,
_upload_chat_files, _peer_*, list_registered_workspaces}` to
`a2a_tools_messaging.…` because the call sites moved.
* test_a2a_tools_messaging.py adds 7 new tests:
- 5 alias drift gates
- 2 import-contract tests (no top-level a2a_tools dep + a2a_tools
surfaces every messaging symbol)
137 tests total in the a2a_tools suite, all green.
Refs RFC #2873.
Third slice of the a2a_tools.py split (stacked on iter 4b). Owns the
two persistent-memory MCP tools:
* tool_commit_memory — write to /workspaces/:id/memories with RBAC
+ GLOBAL-scope tier-zero enforcement
* tool_recall_memory — search /workspaces/:id/memories with RBAC
a2a_tools.py shrinks from 609 → 508 LOC (−101). Both handlers depend
ONLY on a2a_tools_rbac (iter 4a), a2a_client, and the platform's
/memories endpoint — no entanglement with delegation or messaging.
Side-effects of the layered architecture: a2a_tools_memory's import
contract is "depends on a2a_tools_rbac, never on a2a_tools" — the
kitchen-sink module is for back-compat re-exports only. A test pins
this so a future refactor that re-introduces `from a2a_tools import …`
fails in CI.
Tests:
* 49 patch sites in TestToolCommitMemory + TestToolRecallMemory
retargeted from `a2a_tools.{_check_memory_*, _is_root_workspace,
httpx.AsyncClient}` to `a2a_tools_memory.…` because the call sites
moved.
* test_a2a_tools_memory.py adds 4 new tests (alias drift gate +
import-contract + a2a_tools-side re-export).
117 tests total (77 impl + 28 rbac + 8 delegation + 4 memory), all green.
Refs RFC #2873.
Iter 4a's new module needs to be in the rewrite list so the wheel
ships its imports prefixed correctly. Caught by 'PR-built wheel +
import smoke'.
Refs RFC #2873 iter 4a.
The iter-3 split created mcp_heartbeat / mcp_inbox_pollers /
mcp_workspace_resolver but the wheel build's drift-gate check at
scripts/build_runtime_package.py:TOP_LEVEL_MODULES wasn't updated.
Without this fix the wheel ships those modules un-rewritten, so
their imports of platform_auth / configs_dir / etc. break at
runtime. Caught by the 'PR-built wheel + import smoke' check.
Refs RFC #2873 iter 3.
The drift gate in build_runtime_package.py rejects any workspace/*.py
module not listed in TOP_LEVEL_MODULES — it would ship un-rewritten
and break wheel imports. Add inbox_uploads (introduced in this PR)
to the list.
PR #2756 piped adapter.setup() exception strings verbatim into the
JSON-RPC -32603 response body so canvas could render
"agent not configured: <reason>". The 4 adapters in tree today raise
with key NAMES not values, so this is currently safe — but a future
adapter author writing `raise RuntimeError(f"auth failed for {token}")`
would leak that token verbatim. Issue #2760 flagged the risk; this PR
closes it.
workspace/secret_redactor.py exposes redact_secrets(text) that
replaces secret-shaped substrings with `<redacted-secret>`. Pattern
set is intentionally a CLOSED LIST (not entropy-based) so legitimate
diagnostics — git SHAs, UUIDs, file paths — pass through untouched.
Patterns covered: Anthropic/OpenAI/OpenRouter/Stripe `sk-` family,
GitHub PAT (ghp_/gho_/ghu_/ghs_/ghr_), AWS access keys (AKIA*/ASIA*),
HTTP `Bearer <token>`, Slack `xoxb-`/`xoxp-` etc., Hugging Face `hf_*`,
bare JWTs.
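The module is Python; the closed-list idea can be illustrated with a sed-based sketch (patterns here are abbreviated and approximate, not the real pattern set):

```shell
# Closed-list redaction: only known secret shapes are replaced, so
# git SHAs, UUIDs, and file paths pass through untouched.
redact_secrets() {
  sed -E \
    -e 's/sk-[A-Za-z0-9_-]{16,}/<redacted-secret>/g' \
    -e 's/gh[pousr]_[A-Za-z0-9]{16,}/<redacted-secret>/g' \
    -e 's/(AKIA|ASIA)[A-Z0-9]{16}/<redacted-secret>/g' \
    -e 's|Bearer [A-Za-z0-9._~+/-]{16,}|Bearer <redacted-secret>|g'
}
```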
Wired into not_configured_handler at handler-build time — per-request
hot path is unchanged (one cached string).
Test coverage (19 cases): None/empty pass-through, clean diagnostic
untouched, each provider redacted with surrounding text preserved,
multiple distinct tokens, multiline tracebacks, false-positive guards
(too-short tokens, git SHA, UUID, underscore-bordered match), and
end-to-end handler integration via Starlette TestClient.
Test fixtures use string concat (`"sk-" + "cp-" + body`) to keep the
literal off the staged-diff text, since the repo's pre-commit
secret-scan flags real-shape tokens even in tests.
`secret_redactor` registered in TOP_LEVEL_MODULES (drift gate).
Closes #2760
Pairs with: PR #2756, PR #2775
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #2756's contract — card route always mounted regardless of
adapter.setup() outcome — lived inline in main.py's `# pragma: no cover`
boot sequence. A future refactor that re-coupled the two would have
silently bypassed PR #2756 and shipped the original "stuck booting
forever" UX again, with no pytest catching it.
This change extracts route assembly into workspace/boot_routes.py's
build_routes(card, executor, adapter_error) and pins the contract with
6 integration tests using Starlette's TestClient:
- test_card_route_serves_200_when_adapter_ready: happy path
- test_card_route_serves_200_when_adapter_failed: misconfigured boot,
card still 200, skill stubs survive
- test_jsonrpc_returns_503_when_no_executor: full -32603 envelope with
the adapter_error in error.data
- test_jsonrpc_returns_503_with_generic_when_no_error_string: fallback
reason for the rare case main.py reaches this branch without one
- test_card_route_does_not_depend_on_executor: direct PR #2756
regression guard — both branches MUST mount the card route
- test_executor_present_does_not_mount_not_configured_handler: sanity
that a healthy workspace doesn't return -32603 to every request
Conftest stubs extended with a2a.server.routes / request_handlers
classes so the tests work under the existing a2a-mock infra (pattern
matches the AgentCard/AgentSkill stubs added for PR #2765).
main.py now calls build_routes; the inline if/else is gone. Same
production behaviour, cleaner shape, regression-proof.
Heavy a2a-sdk imports inside build_routes() are lazy (deferred to the
executor-only branch) so tests that only exercise the not-configured
path don't pull DefaultRequestHandler / InMemoryTaskStore.
card_helpers + boot_routes registered in TOP_LEVEL_MODULES (build
drift gate would have caught the missing entry on the wheel-publish
smoke).
All 18 related tests pass (test_boot_routes.py: 6, test_card_helpers.py:
6, test_not_configured_handler.py: 6).
Closes #2761
Pairs with: PR #2756 (decouple agent-card from setup),
PR #2765 (defensive isolation of enrichment + transcript)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #2756 added a try/except around adapter.setup() so a missing LLM key
doesn't crash the workspace boot. Two paths that now run AFTER setup
succeeds were not similarly isolated, leaving small but real coupling
risks for future adapter authors.
1. **Skill metadata enrichment swap (main.py:248-259).** When
adapter.setup() returns, main.py reads adapter.loaded_skills and
replaces the static stubs in agent_card.skills with rich metadata
(description, tags, examples). The list comprehension assumes each
element exposes .metadata.{id,name,description,tags,examples}. A
future adapter that returns a non-canonical shape would raise
AttributeError, propagate to the outer except, capture as
adapter_error, and silently degrade an OK boot to the
not-configured state — even though setup() actually succeeded.
Extract to card_helpers.enrich_card_skills(card, loaded_skills) →
bool. Helper swallows enrichment failures, logs the cause, returns
False, leaves the static stubs in place. setup() success path
continues unchanged. 6 unit tests cover: None input, empty list,
canonical happy path, missing .metadata attr, partial .metadata
(missing one canonical field), atomic-failure-no-partial-swap.
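A sketch of what the helper's contract implies. The name `enrich_card_skills` and the bool return are from this commit; the body below is a hypothetical reconstruction, showing the atomic build-then-swap that makes a partial .metadata fail whole rather than half-replace the card:

```python
import logging

logger = logging.getLogger(__name__)

def enrich_card_skills(card, loaded_skills):
    """Swap static skill stubs for rich metadata; never raise.

    Returns True on a successful swap, False (stubs left in place) on
    None/empty input or any shape mismatch. The replacement list is
    built in full before assignment, so a bad element midway through
    can't leave the card half-swapped.
    """
    if not loaded_skills:
        return False
    try:
        enriched = [
            {
                "id": s.metadata.id,
                "name": s.metadata.name,
                "description": s.metadata.description,
                "tags": s.metadata.tags,
                "examples": s.metadata.examples,
            }
            for s in loaded_skills
        ]
    except AttributeError as exc:
        logger.warning("skill enrichment failed, keeping static stubs: %s", exc)
        return False
    card.skills = enriched  # atomic swap, only after every element validated
    return True
```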
2. **/transcript handler (main.py:513).** Calls await
adapter.transcript_lines(...) without try/except. BaseAdapter's
default returns {"supported": false} so today's 4 adapters never
trigger this — but a future adapter override that assumes setup()
ran would surface as a 500 from Starlette's default error handler
instead of a useful 503 with the exception class + message.
Inline try/except returns 503 with the reason, matching the
not-configured JSON-RPC handler's pattern.
Both changes match the architectural principle the PR #2756 chain
established: availability (workspace reachable) is decoupled from
configuration / adapter behavior. Operators see useful errors instead
of silent degradation; future adapter authors can't accidentally
break tenant readiness with a shape mismatch.
Adds:
- workspace/card_helpers.py (~50 lines, 100% covered)
- workspace/tests/test_card_helpers.py (6 tests)
- AgentCard/AgentSkill/AgentCapabilities/AgentInterface stubs to
workspace/tests/conftest.py so future card-related tests work
under the existing a2a-mock infrastructure
- card_helpers in TOP_LEVEL_MODULES (drift gate would have caught it)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The wheel-build drift gate caught the new module added in this PR —
without registering it, the published wheel would ship `import
not_configured_handler` un-rewritten, which would raise
`ModuleNotFoundError` at runtime under `molecule_runtime.main`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the recurrence path of PR #2556. The data fix realigned 8→4
templates in publish-runtime.yml's TEMPLATES variable, but the
underlying drift hazard was unguarded — the next manifest change
could silently leave cascade out of sync again.
This gate fails any PR that changes manifest.json or
publish-runtime.yml in a way that makes the cascade list diverge
from manifest workspace_templates (suffix-stripped). Either
direction is caught:
missing-from-cascade templates that won't auto-rebuild on a new
wheel publish (the codex-stuck-on-stale-runtime
bug class — PR #2512 added codex to manifest,
cascade wasn't updated, codex stayed pinned to
its last-built runtime version for weeks).
extra-in-cascade cascade dispatches to deprecated templates
(the wasted-API-calls + dead-CI-noise class —
PR #2536 pruned 5 templates from manifest;
cascade kept dispatching to all 8 until
PR #2556).
Triggers narrowly: only on PRs that touch manifest.json,
publish-runtime.yml, or the script itself. Fast (single grep+sed+comm
pipeline, no Go build).
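The suffix-strip-and-diff at the heart of the gate, sketched in Python rather than the actual grep+sed+comm pipeline (function and parameter names are illustrative):

```python
def diff_cascade(manifest_templates, cascade_entries, suffix="-template"):
    """Compare suffix-stripped manifest template names against the
    cascade list. Returns (missing_from_cascade, extra_in_cascade);
    either being non-empty is a gate failure. The "-template" suffix
    is an assumed naming convention for this sketch.
    """
    want = {t.removesuffix(suffix) for t in manifest_templates}
    have = set(cascade_entries)
    return sorted(want - have), sorted(have - want)
```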
Surfaced during the RFC #388 prior-art audit; folded in as the
structural follow-up that data fix #2556 promised.
Self-tested both failure modes locally before commit:
- Drop codex from cascade → script fails with "MISSING: codex"
- Add langgraph to cascade → script fails with "EXTRA: langgraph"
Refs: https://github.com/Molecule-AI/molecule-controlplane/issues/388
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CP's deprovision flow calls Secrets.DeleteSecret() (provisioner/ec2.go:806)
but only when the deprovision runs to completion. Crashed provisions and
incomplete teardowns leak the per-tenant `molecule/tenant/<org_id>/bootstrap`
secret. At ~$0.40/secret/month, ~45 leaked secrets surfaced as ~$19/month
on the AWS cost dashboard.
The tenant_resources audit table (mig 024) tracks four kinds today —
CloudflareTunnel, CloudflareDNS, EC2Instance, SecurityGroup — and the
existing reconciler doesn't catch Secrets Manager orphans. The proper fix
(KindSecretsManagerSecret + recorder hook + reconciler enumerator) is filed
as a follow-up controlplane issue. This sweeper is the immediate stopgap.
Parallel-shape to sweep-cf-tunnels.sh:
- Hourly schedule offset (:30, between sweep-cf-orphans :15 and
sweep-cf-tunnels :45) so the three janitors don't burst CP admin
at the same minute.
- 24h grace window — never deletes a secret younger than the
  provisioning roundtrip, so an in-flight provision can't have its
  secret raced and deleted out from under it.
- MAX_DELETE_PCT=50 default (mirrors sweep-cf-orphans for durable
resources; tenant secrets should track 1:1 with live tenants).
- Same schedule-vs-dispatch hardening as the other janitors:
schedule → hard-fail on missing secrets, dispatch → soft-skip.
- 8-way xargs parallelism, dry-run by default, --execute to delete.
Requires a dedicated AWS_JANITOR_* IAM principal — the prod molecule-cp
principal lacks secretsmanager:ListSecrets (it only has scoped
Get/Create/Update/Delete). The workflow's verify-secrets step will hard-fail
on the first scheduled run until those secrets are configured, surfacing
the missing setup loudly rather than silently no-op'ing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The wheel-build drift gate caught it correctly: any new top-level
module under workspace/ must be listed in TOP_LEVEL_MODULES so its
`from event_log import …` statements get rewritten to
`from molecule_runtime.event_log import …` at package time.
Without this entry, the published wheel ships event_log.py un-rewritten
and crashes at runtime with ModuleNotFoundError on first heartbeat.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The harness needs Authorization + X-Molecule-Org-Id (per-tenant, NOT
CP_ADMIN_API_TOKEN) when targeting *.moleculesai.app subdomains.
The existing single-Origin-header form silently failed with a 404
against staging tenants, since the SaaS edge WAF rewrites
unauthenticated /workspaces calls to Next.js (per
reference_saas_waf_origin_header.md).
Switch to a headers array so multiple -H flags compose cleanly with
curl arg-quoting, and document the env var contract at the top of
the script.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds two scripts:
scripts/test-all-runtimes-a2a-e2e.sh
Provisions one workspace per runtime (claude-code, hermes, codex,
openclaw), sets provider keys, waits online, sends two A2A messages
per workspace. First message validates round-trip; second message
validates session continuity. Cleans up via trap on EXIT.
scripts/test-hermes-plugin-e2e.sh
Hermes-only variant focused on the plugin /a2a/inbound path.
Proof-point: session continuity between turns (the plugin path's
deliverable; old chat-completions path lost context per turn).
Both honor SKIP_<runtime> env vars for incremental testing and tolerate
the SaaS edge WAF Origin header requirement (per
reference_saas_waf_origin_header.md).
Run:
PLATFORM=https://demo-tenant.staging.moleculesai.app \
./scripts/test-all-runtimes-a2a-e2e.sh
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The hourly Sweep stale Cloudflare Tunnels job got cancelled mid-cleanup
on 2026-05-02 (run 25248788312, killed at 5min after deleting 424/672
stale tunnels). A second manual dispatch finished the remaining 254
fine, so the immediate backlog cleared, but two underlying bugs would
re-trip on the next big cleanup.
Bug 1: serial delete loop. The execute branch was a `while read; do
curl -X DELETE; done` pipeline at ~0.7s/tunnel — fine for the
steady-state cleanup of a handful, but a 600+ backlog needs ~7-8min.
This commit fans out to $SWEEP_CONCURRENCY (default 8) workers via
`xargs -P 8 -L 1 -I {} bash -c '...' _ {} < "$DELETE_PLAN"`. With 8x
parallelism the same 600+ list drains in ~60s. Notes:
- We use stdin (`<`) not GNU's `xargs -a FILE` so the script stays
portable to BSD xargs (matters for local-runner testing on macOS).
- We pass ONLY the tunnel id on argv. xargs tokenizes on whitespace
by default; tab-separating id+name on argv risks mangling. The
name is kept in a side-channel id->name map ($NAME_MAP) and looked
up by the worker only on failure, for FAIL_LOG readability.
- Workers print exactly `OK` or `FAIL` on stdout; tally with
`grep -c '^OK$' / '^FAIL$'`.
- On non-zero FAILED, log the first 20 lines of $FAIL_LOG as
"Failure detail (first 20):" — same diagnostic surface as before
but consolidated so we don't spam logs on a flaky CF API.
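The same fan-out shape, sketched in Python with a thread pool instead of `xargs -P` (names are illustrative; the real worker is a bash snippet that curls the CF API and prints OK/FAIL):

```python
from concurrent.futures import ThreadPoolExecutor

def drain(delete_plan, delete_fn, concurrency=8):
    """Fan a list of tunnel ids out to N workers.

    delete_fn(tunnel_id) -> bool stands in for the curl DELETE call.
    Returns (ok, failed) id lists, the analogue of tallying OK/FAIL
    lines with grep -c.
    """
    ok, failed = [], []
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # pool.map preserves input order, so ids zip back up cleanly
        for tid, success in zip(delete_plan, pool.map(delete_fn, delete_plan)):
            (ok if success else failed).append(tid)
    return ok, failed
```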
Bug 2: workflow's 5-min cap was set as a hangs-detector but turned out
to be a real-job-too-slow detector. Raised to 30 min — generous
headroom for the ~60s steady-state run while still surfacing genuine
hangs (and in line with the sweep-cf-orphans companion job).
Bug 3 (drive-by): the existing trap was `trap 'rm -rf "$PAGES_DIR"'
EXIT`, which would have been silently overwritten by any later trap
registration. Replaced with a single `cleanup()` function that wipes
PAGES_DIR + all four new tempfiles (DELETE_PLAN, NAME_MAP, FAIL_LOG,
RESULT_LOG), called once via `trap cleanup EXIT`.
Verification:
- bash -n scripts/ops/sweep-cf-tunnels.sh: clean
- shellcheck -S warning scripts/ops/sweep-cf-tunnels.sh: clean
- python3 yaml.safe_load on the workflow: clean
- Synthetic 30-line delete plan with every 7th id sentinel'd to
return {"success":false}: TEST PASS, DELETED=26 FAILED=4, FAIL_LOG
side-channel name lookup verified.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The page-merge loop passed the entire accumulating tunnel JSON to
python3 -c via argv on every iteration. On a busy account (verified
2026-05-02: 672 tunnels, 14 pages on Hongmingwangrabbit account) this
exceeds the GH Ubuntu runner's combined argv+envp limit (~128 KB) and
dies with `python3: Argument list too long` at exit 126 — the workflow
has been silently failing this way since the very first run that hit a
real account, masked earlier by a missing-CF_ACCOUNT_ID secret check.
Buffer each page response to a file under a temp dir, merge from disk
at the end. Also bumps the page cap from 20 to 40 (1000 → 2000 tunnel
ceiling) so the existing soft-cap warning has headroom; the disk-merge
shape is O(n) in tunnel count rather than the previous O(n^2) so the
larger ceiling is cheap.
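The disk-merge shape, sketched in Python (the page-file naming and `result` key below are assumptions for the sketch, modeled on the CF list-tunnels response):

```python
import json
import pathlib

def merge_pages(pages_dir):
    """Merge per-page JSON responses from disk instead of accumulating
    the growing blob on argv. Each page-NN.json holds one API response;
    extending a list per page keeps the merge O(n) in tunnel count."""
    tunnels = []
    for page in sorted(pathlib.Path(pages_dir).glob("page-*.json")):
        tunnels.extend(json.loads(page.read_text())["result"])
    return tunnels
```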
Verified locally against the live account (672 tunnels): script now
runs cleanly to the existing MAX_DELETE_PCT safety gate, which trips
at 99% > 90% as designed and surfaces the actual orphan backlog for
operator-driven cleanup.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wheel-build smoke gate detected `configs_dir` missing from
scripts/build_runtime_package.py:TOP_LEVEL_MODULES. Without it the
build would ship `import configs_dir` un-rewritten and every
external-runtime install would die on `ModuleNotFoundError` at first
import.
Two callers used `import configs_dir as _configs_dir` as a
belt-and-suspenders guard against an imagined name collision, but the rewriter
rejects `import X as Y` because the rewrite would produce
`import molecule_runtime.X as X as Y` (invalid syntax). No actual
collision exists (only docstring/comment references). Switched to
plain `import configs_dir`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Demo-day preparation bundle for the funding demo (~2026-05-06). Adds:
- scripts/demo-freeze.sh — captures current ghcr.io
workspace-template-* :latest digests for all 8 runtimes, then
disables both cascade vectors that could re-tag :latest mid-demo:
publish-runtime.yml in molecule-core (PATH 1 — staging push to
workspace/** auto-bumps the wheel and fans out to 8 templates) and
publish-image.yml in each of the 8 template repos (PATH 2 — direct
template repo merge re-tags :latest). Defaults to dry-run; requires
--execute to apply. Writes both digest + workflow receipts to
scripts/demo-freeze-snapshots/.
- scripts/demo-thaw.sh — re-enables every workflow demo-freeze.sh
  disabled, keyed off the receipt timestamp. Defaults to executing
  (the inverse safety polarity from freeze, whose destructive action
  defaults to dry-run). --dry-run prints without applying.
- scripts/demo-day-runbook.md — operator runbook indexing the six
rollback levers (platform image rollback, template image rollback,
tenant redeploy, workspace delete, Railway rollback, Vercel
rollback) plus pre-warm timing and post-demo cleanup. Also covers
read-only diagnostics for "is this working?" moments and the
CP_ADMIN_API_TOKEN rotation step that must follow demo (the token
gets copy-pasted into shells during incident response).
- scripts/demo-freeze-snapshots/.gitignore — generated freeze
receipts are operational state, not source. Tracked .gitkeep so
the directory exists when the script writes to it.
Both scripts dry-run-tested locally. Did not exercise --execute since
that would actually disable production workflows mid-development.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The existing wheel-publish smoke (`wheel_smoke.py`) only IMPORTS
`molecule_runtime.main` at module scope. Lazy imports buried inside
`async def execute(...)` bodies (e.g. `from a2a.types import FilePart`)
NEVER evaluate at static-import time — they crash at first message
delivery in production.
The 2026-04-2x v0→v1 a2a-sdk migration shipped 5 such regressions in
templates that all looked fine at module-load smoke. This change adds
`smoke_mode.py` plus a `MOLECULE_SMOKE_MODE=1` short-circuit in
`main.py`: after `adapter.create_executor(...)`, the boot path invokes
`executor.execute(stub_ctx, stub_queue)` once with a 5s timeout
(`MOLECULE_SMOKE_TIMEOUT_SECS`). Healthy import tree → execution
proceeds far enough to hit a network boundary and times out (exit 0).
Broken lazy import → `ImportError` / `ModuleNotFoundError` from inside
the executor body (exit 1). Other downstream errors (auth, validation)
pass — those are caught by adapter-level tests, not this gate.
Stub `(RequestContext, EventQueue)` is built from the real a2a-sdk so
SendMessageRequest/RequestContext constructor changes also surface as
import-tree failures (the regression class also includes "SDK
refactored mid-publish"). The stub-build itself is wrapped — if it
raises, that's a smoke fail too.
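The exit-code classification described above, sketched as a coroutine (stub names are illustrative; the real short-circuit lives in main.py behind MOLECULE_SMOKE_MODE=1):

```python
import asyncio

async def run_boot_smoke(executor, stub_ctx, stub_queue, timeout=5.0):
    """One smoke invocation of executor.execute with a bounded wait.

    Timeout -> healthy import tree reached a network boundary: pass (0).
    ImportError/ModuleNotFoundError -> broken lazy import: fail (1).
    Other downstream errors (auth, validation) -> pass (0); those
    belong to adapter-level tests, not this gate.
    """
    try:
        await asyncio.wait_for(executor.execute(stub_ctx, stub_queue), timeout)
    except (ImportError, ModuleNotFoundError):
        return 1
    except asyncio.TimeoutError:
        return 0
    except Exception:
        return 0
    return 0
```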
Phase 2 (separate PR, molecule-ci) wires this into
publish-template-image.yml so the publish gate runs the boot smoke
against every template image before pushing the tag.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two small follow-ups to the PR #2433 → #2436 → #2439 incident chain.
1) `import inbox # noqa: F401` in workspace/a2a_mcp_server.py was
misleading — `inbox` IS used (at the bridge wiring inside main()).
F401 means "imported but unused", which would mask a real future
F401 if the usage is removed. Drop the noqa, keep the explanatory
block comment about the rewriter's `import X` → `import mr.X as X`
expansion (and the `import X as Y` → `import mr.X as X as Y` trap
the comment exists to prevent re-introducing).
2) scripts/test_build_runtime_package.py — 17 unit tests covering
`rewrite_imports()` and `build_import_rewriter()` in
scripts/build_runtime_package.py. Until now the function had zero
coverage despite the entire wheel build depending on it. Tests
pin: bare-import aliasing, dotted-import preservation, indented
imports, from-imports (simple + dotted + multi-symbol + block),
the `import X as Y` rejection added in PR #2436 (with comment-
stripping + indented + comma-not-alias edge cases), allowlist
anchoring (`a2a` ≠ `a2a_tools`), and end-to-end reproduction
of the PR #2433 failing pattern + the #2436 fix pattern.
3) Wire scripts/test_*.py into CI by adding a second discover pass
to test-ops-scripts.yml. Top-level scripts/ tests live alongside
their target file (parallels the scripts/ops/ test layout); the
existing scripts/ops/ pass keeps running because scripts/ops/
has no __init__.py so a single discover from scripts/ root
doesn't recurse. Two passes is simpler than retrofitting
namespace packages. Path filter widened from `scripts/ops/**`
to `scripts/**` so PRs touching the build script trigger the
new tests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #2433 (notifications/claude/channel) shipped 'import inbox as
_inbox_module' inside a2a_mcp_server.py:main(). The build script's
import rewriter expands plain 'import inbox' to
'import molecule_runtime.inbox as inbox', so the original source
became 'import molecule_runtime.inbox as inbox as _inbox_module',
which is invalid Python.
Caught at the publish-runtime + PR-built-wheel-smoke gate (the
SyntaxError trace is in run 25200422679). The wheel didn't ship to
PyPI because publish-runtime's smoke-import step refused to install
it, but staging is currently sitting on a broken-build commit until
this fix-forward lands.
Changes:
- a2a_mcp_server.py: lift `import inbox` to top of file (rewriter
produces clean `import molecule_runtime.inbox as inbox`), call
inbox.set_notification_callback directly in main()
- build_runtime_package.py: rewrite_imports() now raises ValueError
when it sees 'import X as Y' for any X in the workspace allowlist,
instead of silently producing a syntax-error wheel. Operator gets
a clear actionable error at build time pointing at the offending
line + suggested rewrites ('from X import …' or plain 'import X').
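A minimal sketch of the rewrite rule plus the new ValueError guard. The regex and the allowlist subset below are illustrative, not the actual build_runtime_package.py implementation:

```python
import re

WORKSPACE_ALLOWLIST = {"inbox", "event_log", "configs_dir"}  # illustrative subset

def rewrite_import_line(line):
    """Rewrite a bare 'import X' for allowlisted workspace modules.

    'import X' -> 'import molecule_runtime.X as X'
    'import X as Y' -> ValueError, because the mechanical expansion
    would produce 'import molecule_runtime.X as X as Y' (invalid).
    The \\w+ full-token match gives allowlist anchoring (a2a != a2a_tools).
    """
    m = re.match(r"^(\s*)import (\w+)(\s+as\s+\w+)?\s*$", line)
    if not m or m.group(2) not in WORKSPACE_ALLOWLIST:
        return line
    indent, mod, alias = m.groups()
    if alias:
        raise ValueError(
            f"cannot rewrite 'import {mod}{alias}'; use plain "
            f"'import {mod}' or 'from {mod} import ...'"
        )
    return f"{indent}import molecule_runtime.{mod} as {mod}"
```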
The build-time gate (this PR's rewriter check) catches the regression
class earlier than the smoke-time gate (PR #2433's failure). Adding
'PR-built wheel + import smoke' to staging branch protection's
required checks is filed separately so this class doesn't merge again.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>