Commit Graph

9 Commits

Author SHA1 Message Date
Hongming Wang
5920fc856d
Merge pull request #2182 from Molecule-AI/ci/agentcard-smoke-followup-2179
fix(workspace): rename supported_protocols → supported_interfaces (CRITICAL — every boot crashes)
2026-04-27 14:58:28 +00:00
Hongming Wang
851fd21fb1 fix(workspace): rename supported_protocols → supported_interfaces (a2a-sdk 1.0)
CRITICAL: every workspace boot since the a2a-sdk 1.0 migration (#1974)
has been crashing at AgentCard construction with:
  ValueError: Protocol message AgentCard has no "supported_protocols" field

The protobuf field is `supported_interfaces` (plural, interfaces — see
a2a-sdk types/a2a_pb2.pyi:189). The 0.3→1.0 migration left the kwarg
as `supported_protocols`, which doesn't exist in the 1.0 schema, so
the constructor raises before any subsequent line of main runs.

Why this hid for so long:
  - publish-runtime.yml's smoke step only IMPORTED molecule_runtime.main;
    importing the module is fine, only CONSTRUCTING the AgentCard fails
  - The user-visible symptom is "Workspace failed: " with empty
    last_sample_error, indistinguishable from generic boot timeouts
  - The state_transition_history=True bug (fixed in #2179) was a
    sibling of this — same migration, same class, just caught first

Fix is symmetric with #2179:
  1. workspace/main.py: rename the kwarg + comment explaining why
  2. .github/workflows/publish-runtime.yml: extend the smoke block to
     instantiate AgentCard with the exact production call shape, so
     the next field-rename of this class fails at publish time
     instead of breaking every workspace startup

Verification:
  - Constructed AgentCard against fresh a2a-sdk 1.0.2 in a clean
    venv with the corrected kwarg → succeeds
  - Constructed it with the original `supported_protocols` kwarg →
    fails immediately with the exact error production sees
  - Smoke test pinned to mirror main.py's exact call shape; main.py
    + smoke must stay in lockstep going forward

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 07:54:23 -07:00
Hongming Wang
1a703f5687 fix(publish-runtime): wait for PyPI propagation + expand path filter
Two structural fixes for the cascade race conditions that bit us
five times today:

1. **PyPI propagation wait** (cascade job): poll PyPI for the
   just-published version with a 60s budget BEFORE firing
   repository_dispatch. PyPI accepts the upload but takes a few
   seconds to make it available via the package index. Cascade was
   firing too fast — downstream template builds ran `pip install`
   against a stale index, resolved to the previous version, and
   docker layer cache locked that in for subsequent rebuilds.
   Pairs with the build-arg cache invalidation in molecule-ci PR
   (separate change). Wait without invalidation = next build still
   pip-resolves correctly. Invalidation without wait = first cascade
   build may still race PyPI propagation. Together: no race, no
   stale cache.

2. **Path filter expansion**: scripts/build_runtime_package.py is
   the build script and changes to it (e.g. import-rewrite fixes,
   manifest emit, lib/ subpackage move) directly affect what ships
   in the wheel. Was missing from the path filter, so PRs touching
   only scripts/ (like #2174's lib/ fix) didn't auto-publish — the
   operator had to remember a manual dispatch. Add it to the closed
   list of files that trigger auto-publish.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 07:42:37 -07:00
Hongming Wang
3df5867b56 fix: restore main_sync entry point in workspace/main.py
The wheel's pyproject.toml has declared
`molecule-runtime = "molecule_runtime.main:main_sync"` since the
publish pipeline was created on 2026-04-26, but the function
itself was never present in workspace/main.py — it lived in the
pre-monorepo molecule-ai-workspace-runtime repo and was lost
during the consolidation that made workspace/ the source of truth.

The 0.1.15 wheel still had main_sync from a leftover snapshot,
so the regression went unnoticed until 0.1.16 (the first wheel
built from the new source-of-truth) shipped. Symptom: every
workspace container restart loops with

  ImportError: cannot import name 'main_sync' from 'molecule_runtime.main'

— the molecule-runtime CLI script's first line tries to import
the missing symbol. Workspaces stay in `provisioning` until the
10-min sweep marks them failed.

Caught by .github/workflows/runtime-pin-compat.yml, which already
imports the symbol by name as its smoke test. (That check kept
failing red on every recent merge_group run; this PR fixes the
underlying symbol-not-found instead of the smoke step.)

Also strengthens publish-runtime.yml's wheel smoke from
`import molecule_runtime.main` (loads the module — passes even
when entry-point target is missing) to `from molecule_runtime.main
import main_sync` (the actual contract the CLI script needs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 03:35:49 -07:00
Hongming Wang
c68dc1877f fix(release): drift-gate TOP_LEVEL_MODULES + smoke-import main in publish
Two compounding bugs surfaced when 0.1.16 hit production today:

1. scripts/build_runtime_package.py had a hand-curated TOP_LEVEL_MODULES
   set listing every workspace/*.py that should get its bare imports
   rewritten to `molecule_runtime.X`. The set silently went stale:
   - Missing: transcript_auth (added since #87 phase 1c), runtime_wedge,
     watcher → unrewritten imports shipped, every workspace startup
     died with ModuleNotFoundError.
   - Stale: claude_sdk_executor, cli_executor (both removed in #87),
     hermes_executor (never existed) → harmless but misleading.

2. publish-runtime.yml's wheel-smoke step asserted on stable invariants
   (BaseAdapter, AdapterConfig, a2a_client error sentinel) but never
   imported main. So even though main.py held the broken bare
   `from transcript_auth import ...`, the smoke check passed.

Fixes:

- Build script now derives the on-disk module set from workspace/*.py
  and asserts it matches TOP_LEVEL_MODULES exactly. Drift in either
  direction fails the build with a specific diff message instead of
  shipping a broken wheel. Closed-list typo guard preserved (we still
  edit the set explicitly when a module is added/removed) — the gate
  just makes drift impossible to ignore.

- TOP_LEVEL_MODULES updated to current reality: drop the 3 stale,
  add the 3 missing.

- publish-runtime.yml wheel-smoke now `import molecule_runtime.main`
  before the invariant asserts. main is the entry point and
  transitively imports every module — any bare-import bug surfaces
  as ModuleNotFoundError before PyPI accepts the upload.

Tested locally: `python3 scripts/build_runtime_package.py
--version 0.1.99 --out /tmp/build-test` succeeds, and
/tmp/build-test/molecule_runtime/main.py contains the rewritten
`from molecule_runtime.transcript_auth import ...`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 03:19:17 -07:00
Hongming Wang
0a455b7d71 feat(publish-runtime): auto-publish to PyPI on staging pushes that touch workspace/
Adds a third trigger so any merge to staging that changes workspace/**
auto-publishes a new molecule-ai-workspace-runtime patch release. Closes
the human-in-loop gap that caused tonight's RuntimeCapabilities
ImportError outage.

Tonight: #117 added RuntimeCapabilities to molecule_runtime.adapters.base.
The merge landed at 02:37 UTC. Templates rebuilt their images at 07:37
UTC (4 hours later) and started importing the new symbol. PyPI was
still serving 0.1.15 (pre-#117) because nobody remembered to push a
runtime-vX.Y.Z tag or workflow_dispatch the publish. Result: every
template image shipped tonight runs `from molecule_runtime.adapters.base
import RuntimeCapabilities` against an installed runtime that doesn't
export it -> ImportError -> workspace never registers -> stuck in
provisioning until 10-min sweep.

Mechanism:
- New trigger: push to staging filtered to paths: ['workspace/**'].
  Path filter applies only to branch pushes; the existing tag trigger
  still fires unconditionally.
- Version derivation for the auto case: query PyPI's JSON API for
  current latest, bump the patch component. PyPI is the source of
  truth so concurrent runs don't double-publish (HTTP 400 on collision).
- concurrency: group serializes parallel staging merges so they don't
  race on the bump computation. cancel-in-progress: false because each
  workspace/** change deserves its own release.
- publish job now exposes its derived version as a job-level output so
  the cascade reads it cleanly. Fixes a latent bug: cascade tried to
  read steps.version.outputs.version, which is from a different job's
  scope and silently resolved to empty -- then re-derived from
  GITHUB_REF_NAME, which would have been "staging" under the new
  trigger and produced an invalid version.

Tag-driven and manual-dispatch paths are unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 02:11:45 -07:00
Hongming Wang
f3a204347c
fix(publish-runtime): use PyPI Trusted Publisher (OIDC) instead of PYPI_TOKEN (#2113)
Drops the static PYPI_TOKEN secret in favor of OIDC trusted publishing.
PyPI now mints a short-lived upload credential after verifying the
workflow's OIDC claim against the trusted-publisher config registered
for molecule-ai-workspace-runtime (Molecule-AI/molecule-core,
publish-runtime.yml, environment pypi-publish).

Why:
- A leaked PYPI_TOKEN would let any holder publish arbitrary versions of
  molecule-ai-workspace-runtime to PyPI from anywhere — bypassing the
  monorepo's review and CI gates entirely. The 8 template repos pull
  this package; a malicious publish poisons all of them.
- Trusted Publisher (OIDC) makes that exfil path moot: no long-lived
  credential exists to leak. Only this exact workflow, on this repo,
  in the pypi-publish environment, can upload.

After this lands and the first OIDC publish succeeds, the PYPI_TOKEN
repo secret should be deleted (it becomes dead weight + a leak surface
with no purpose).

Belt-and-suspenders companion to PR #56 in molecule-ai-workspace-runtime
(sibling repo lockdown). Without OIDC, the sibling lockdown alone
doesn't prevent local `python -m build && twine upload` from a laptop
with a personal PyPI maintainer credential.

Co-authored-by: Hongming Wang <hongmingwangalt@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 13:14:47 -07:00
Hongming Wang
199630908d
fix(publish-runtime): smoke test asserts stable invariants, not feature flags (#2112)
The original smoke step had `assert a2a_client._A2A_QUEUED_PREFIX`
which is a feature-flag-style check — it fires false-positive every time
staging is mid-release of that specific feature. Caught when the dry-run
publish (run 24965411618) failed because _A2A_QUEUED_PREFIX hadn't
landed on staging yet (it lives in PR #2061's series, separate from the
PR #2103 chain that shipped this workflow).

Replaced with checks for stable invariants of the package contract:

  - a2a_client._A2A_ERROR_PREFIX exists (always has, since the
    [A2A_ERROR] sentinel is the foundational error-tagging primitive)
  - adapters.get_adapter is callable
  - BaseAdapter has the .name() static method (interface anchor)
  - AdapterConfig has __init__ (dataclass present)

These four cover the cases the smoke test actually needs to catch:
import-path rewrites broken by build_runtime_package.py, missing
modules, dataclass shape regressions. They don't fire when a specific
feature is mid-merge.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Hongming Wang <hongmingwangalt@gmail.com>
2026-04-26 13:14:15 -07:00
Hongming Wang
0de67cd379 feat(platform/admin): /admin/workspace-images/refresh + Docker SDK + GHCR auth
The production-side end of the runtime CD chain. Operators (or the post-
publish CI workflow) hit this after a runtime release to pull the latest
workspace-template-* images from GHCR and recreate any running ws-* containers
so they adopt the new image. Without this, freshly-published runtime sat in
the registry but containers kept the old image until naturally cycled.

Implementation notes:
- Uses Docker SDK ImagePull rather than shelling out to docker CLI — the
  alpine platform container has no docker CLI installed.
- ghcrAuthHeader() reads GHCR_USER + GHCR_TOKEN env, builds the base64-
  encoded JSON payload Docker engine expects in PullOptions.RegistryAuth.
  Both empty → public/cached images only; both set → private GHCR pulls.
- Container matching uses ContainerInspect (NOT ContainerList) because
  ContainerList returns the resolved digest in .Image, not the human tag.
  Inspect surfaces .Config.Image which is what we need.
- Provisioner.DefaultImagePlatform() exported so admin handler picks the
  same Apple-Silicon-needs-amd64 platform as the provisioner — single
  source of truth for the multi-arch override.

Local-dev companion: scripts/refresh-workspace-images.sh runs on the
host and inherits the host's docker keychain auth — alternate path for
when GHCR_USER/TOKEN aren't set in the platform env.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-04-26 10:17:21 -07:00