molecule-core

Author	SHA1	Message	Date
Hongming Wang	5920fc856d	Merge pull request #2182 from Molecule-AI/ci/agentcard-smoke-followup-2179 fix(workspace): rename supported_protocols → supported_interfaces (CRITICAL — every boot crashes)	2026-04-27 14:58:28 +00:00
Hongming Wang	851fd21fb1	fix(workspace): rename supported_protocols → supported_interfaces (a2a-sdk 1.0) CRITICAL: every workspace boot since the a2a-sdk 1.0 migration (#1974) has been crashing at AgentCard construction with: ValueError: Protocol message AgentCard has no "supported_protocols" field The protobuf field is `supported_interfaces` (plural, interfaces — see a2a-sdk types/a2a_pb2.pyi:189). The 0.3→1.0 migration left the kwarg as `supported_protocols`, which doesn't exist in the 1.0 schema, so the constructor raises before any subsequent line of main runs. Why this hid for so long: - publish-runtime.yml's smoke step only IMPORTED molecule_runtime.main; importing the module is fine, only CONSTRUCTING the AgentCard fails - The user-visible symptom is "Workspace failed: " with empty last_sample_error, indistinguishable from generic boot timeouts - The state_transition_history=True bug (fixed in #2179) was a sibling of this — same migration, same class, just caught first Fix is symmetric with #2179: 1. workspace/main.py: rename the kwarg + comment explaining why 2. .github/workflows/publish-runtime.yml: extend the smoke block to instantiate AgentCard with the exact production call shape, so the next field-rename of this class fails at publish time instead of breaking every workspace startup Verification: - Constructed AgentCard against fresh a2a-sdk 1.0.2 in a clean venv with the corrected kwarg → succeeds - Constructed it with the original `supported_protocols` kwarg → fails immediately with the exact error production sees - Smoke test pinned to mirror main.py's exact call shape; main.py + smoke must stay in lockstep going forward Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 07:54:23 -07:00
Hongming Wang	1a703f5687	fix(publish-runtime): wait for PyPI propagation + expand path filter Two structural fixes for the cascade race conditions that bit us five times today: 1. PyPI propagation wait (cascade job): poll PyPI for the just-published version with a 60s budget BEFORE firing repository_dispatch. PyPI accepts the upload but takes a few seconds to make it available via the package index. Cascade was firing too fast — downstream template builds ran `pip install` against a stale index, resolved to the previous version, and docker layer cache locked that in for subsequent rebuilds. Pairs with the build-arg cache invalidation in molecule-ci PR (separate change). Wait without invalidation = next build still pip-resolves correctly. Invalidation without wait = first cascade build may still race PyPI propagation. Together: no race, no stale cache. 2. Path filter expansion: scripts/build_runtime_package.py is the build script and changes to it (e.g. import-rewrite fixes, manifest emit, lib/ subpackage move) directly affect what ships in the wheel. Was missing from the path filter, so PRs touching only scripts/ (like #2174's lib/ fix) didn't auto-publish — the operator had to remember a manual dispatch. Add it to the closed list of files that trigger auto-publish. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 07:42:37 -07:00
Hongming Wang	3df5867b56	fix: restore main_sync entry point in workspace/main.py The wheel's pyproject.toml has declared `molecule-runtime = "molecule_runtime.main:main_sync"` since the publish pipeline was created on 2026-04-26, but the function itself was never present in workspace/main.py — it lived in the pre-monorepo molecule-ai-workspace-runtime repo and was lost during the consolidation that made workspace/ the source of truth. The 0.1.15 wheel still had main_sync from a leftover snapshot, so the regression went unnoticed until 0.1.16 (the first wheel built from the new source-of-truth) shipped. Symptom: every workspace container restart loops with ImportError: cannot import name 'main_sync' from 'molecule_runtime.main' — the molecule-runtime CLI script's first line tries to import the missing symbol. Workspaces stay in `provisioning` until the 10-min sweep marks them failed. Caught by .github/workflows/runtime-pin-compat.yml, which already imports the symbol by name as its smoke test. (That check kept failing red on every recent merge_group run; this PR fixes the underlying symbol-not-found instead of the smoke step.) Also strengthens publish-runtime.yml's wheel smoke from `import molecule_runtime.main` (loads the module — passes even when entry-point target is missing) to `from molecule_runtime.main import main_sync` (the actual contract the CLI script needs). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 03:35:49 -07:00
Hongming Wang	c68dc1877f	fix(release): drift-gate TOP_LEVEL_MODULES + smoke-import main in publish Two compounding bugs surfaced when 0.1.16 hit production today: 1. scripts/build_runtime_package.py had a hand-curated TOP_LEVEL_MODULES set listing every workspace/.py that should get its bare imports rewritten to `molecule_runtime.X`. The set silently went stale: - Missing: transcript_auth (added since #87 phase 1c), runtime_wedge, watcher → unrewritten imports shipped, every workspace startup died with ModuleNotFoundError. - Stale: claude_sdk_executor, cli_executor (both removed in #87), hermes_executor (never existed) → harmless but misleading. 2. publish-runtime.yml's wheel-smoke step asserted on stable invariants (BaseAdapter, AdapterConfig, a2a_client error sentinel) but never imported main. So even though main.py held the broken bare `from transcript_auth import ...`, the smoke check passed. Fixes: - Build script now derives the on-disk module set from workspace/.py and asserts it matches TOP_LEVEL_MODULES exactly. Drift in either direction fails the build with a specific diff message instead of shipping a broken wheel. Closed-list typo guard preserved (we still edit the set explicitly when a module is added/removed) — the gate just makes drift impossible to ignore. - TOP_LEVEL_MODULES updated to current reality: drop the 3 stale, add the 3 missing. - publish-runtime.yml wheel-smoke now `import molecule_runtime.main` before the invariant asserts. main is the entry point and transitively imports every module — any bare-import bug surfaces as ModuleNotFoundError before PyPI accepts the upload. Tested locally: `python3 scripts/build_runtime_package.py --version 0.1.99 --out /tmp/build-test` succeeds, and /tmp/build-test/molecule_runtime/main.py contains the rewritten `from molecule_runtime.transcript_auth import ...`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 03:19:17 -07:00
Hongming Wang	0a455b7d71	feat(publish-runtime): auto-publish to PyPI on staging pushes that touch workspace/ Adds a third trigger so any merge to staging that changes workspace/ auto-publishes a new molecule-ai-workspace-runtime patch release. Closes the human-in-loop gap that caused tonight's RuntimeCapabilities ImportError outage. Tonight: #117 added RuntimeCapabilities to molecule_runtime.adapters.base. The merge landed at 02:37 UTC. Templates rebuilt their images at 07:37 UTC (4 hours later) and started importing the new symbol. PyPI was still serving 0.1.15 (pre-#117) because nobody remembered to push a runtime-vX.Y.Z tag or workflow_dispatch the publish. Result: every template image shipped tonight runs `from molecule_runtime.adapters.base import RuntimeCapabilities` against an installed runtime that doesn't export it -> ImportError -> workspace never registers -> stuck in provisioning until 10-min sweep. Mechanism: - New trigger: push to staging filtered to paths: ['workspace/']. Path filter applies only to branch pushes; the existing tag trigger still fires unconditionally. - Version derivation for the auto case: query PyPI's JSON API for current latest, bump the patch component. PyPI is the source of truth so concurrent runs don't double-publish (HTTP 400 on collision). - concurrency: group serializes parallel staging merges so they don't race on the bump computation. cancel-in-progress: false because each workspace/** change deserves its own release. - publish job now exposes its derived version as a job-level output so the cascade reads it cleanly. Fixes a latent bug: cascade tried to read steps.version.outputs.version, which is from a different job's scope and silently resolved to empty -- then re-derived from GITHUB_REF_NAME, which would have been "staging" under the new trigger and produced an invalid version. Tag-driven and manual-dispatch paths are unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 02:11:45 -07:00
Hongming Wang	f3a204347c	fix(publish-runtime): use PyPI Trusted Publisher (OIDC) instead of PYPI_TOKEN (#2113 ) Drops the static PYPI_TOKEN secret in favor of OIDC trusted publishing. PyPI now mints a short-lived upload credential after verifying the workflow's OIDC claim against the trusted-publisher config registered for molecule-ai-workspace-runtime (Molecule-AI/molecule-core, publish-runtime.yml, environment pypi-publish). Why: - A leaked PYPI_TOKEN would let any holder publish arbitrary versions of molecule-ai-workspace-runtime to PyPI from anywhere — bypassing the monorepo's review and CI gates entirely. The 8 template repos pull this package; a malicious publish poisons all of them. - Trusted Publisher (OIDC) makes that exfil path moot: no long-lived credential exists to leak. Only this exact workflow, on this repo, in the pypi-publish environment, can upload. After this lands and the first OIDC publish succeeds, the PYPI_TOKEN repo secret should be deleted (it becomes dead weight + a leak surface with no purpose). Belt-and-suspenders companion to PR #56 in molecule-ai-workspace-runtime (sibling repo lockdown). Without OIDC, the sibling lockdown alone doesn't prevent local `python -m build && twine upload` from a laptop with a personal PyPI maintainer credential. Co-authored-by: Hongming Wang <hongmingwangalt@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 13:14:47 -07:00
Hongming Wang	199630908d	fix(publish-runtime): smoke test asserts stable invariants, not feature flags (#2112 ) The original smoke step had `assert a2a_client._A2A_QUEUED_PREFIX` which is a feature-flag-style check — it fires false-positive every time staging is mid-release of that specific feature. Caught when the dry-run publish (run 24965411618) failed because _A2A_QUEUED_PREFIX hadn't landed on staging yet (it lives in PR #2061's series, separate from the PR #2103 chain that shipped this workflow). Replaced with checks for stable invariants of the package contract: - a2a_client._A2A_ERROR_PREFIX exists (always has, since the [A2A_ERROR] sentinel is the foundational error-tagging primitive) - adapters.get_adapter is callable - BaseAdapter has the .name() static method (interface anchor) - AdapterConfig has __init__ (dataclass present) These four cover the cases the smoke test actually needs to catch: import-path rewrites broken by build_runtime_package.py, missing modules, dataclass shape regressions. They don't fire when a specific feature is mid-merge. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Hongming Wang <hongmingwangalt@gmail.com>	2026-04-26 13:14:15 -07:00
Hongming Wang	0de67cd379	feat(platform/admin): /admin/workspace-images/refresh + Docker SDK + GHCR auth The production-side end of the runtime CD chain. Operators (or the post- publish CI workflow) hit this after a runtime release to pull the latest workspace-template-* images from GHCR and recreate any running ws-* containers so they adopt the new image. Without this, freshly-published runtime sat in the registry but containers kept the old image until naturally cycled. Implementation notes: - Uses Docker SDK ImagePull rather than shelling out to docker CLI — the alpine platform container has no docker CLI installed. - ghcrAuthHeader() reads GHCR_USER + GHCR_TOKEN env, builds the base64- encoded JSON payload Docker engine expects in PullOptions.RegistryAuth. Both empty → public/cached images only; both set → private GHCR pulls. - Container matching uses ContainerInspect (NOT ContainerList) because ContainerList returns the resolved digest in .Image, not the human tag. Inspect surfaces .Config.Image which is what we need. - Provisioner.DefaultImagePlatform() exported so admin handler picks the same Apple-Silicon-needs-amd64 platform as the provisioner — single source of truth for the multi-arch override. Local-dev companion: scripts/refresh-workspace-images.sh runs on the host and inherits the host's docker keychain auth — alternate path for when GHCR_USER/TOKEN aren't set in the platform env. 🤖 Generated with [Claude Code](https://claude.com/claude-code)	2026-04-26 10:17:21 -07:00

9 Commits