PR #2558 enqueued a Task at the start of new requests so the v1 SDK
would accept TaskUpdater.start_work() — fix #1 of the v0→v1 migration
gap (PR #2170). But after the Task is enqueued, the executor enters
"task mode" and the SDK rejects raw Message enqueues at the terminal
step:
{"code":-32603,"message":"Received Message object in task mode.
Use TaskStatusUpdateEvent or TaskArtifactUpdateEvent instead."}
Synth-E2E 2026-05-03T11:00:34Z surfaced this on the very first run
after the prior fix cascaded. The validation site is the same
a2a/server/agent_execution/active_task.py — the framework's job is
to enforce the v1 invariant; we're catching up to it.
The fix routes both terminal events through TaskUpdater helpers:
- success: updater.complete(message=msg) wraps it in
  TaskStatusUpdateEvent(state=COMPLETED, final=True)
- error: updater.failed(message=...) wraps it in
  TaskStatusUpdateEvent(state=FAILED, final=True)
Both helpers exist in a2a-sdk ≥ 1.0; verified via
TaskUpdater.complete signature.
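Sketch of the terminal step after the change (run_agent and make_error_message
are illustrative stand-ins; only the two helper calls are the point):

    async def finish(updater, context):
        try:
            result_msg = await run_agent(context)       # stand-in for the real agent call
            await updater.complete(message=result_msg)  # TaskStatusUpdateEvent(COMPLETED, final=True)
        except Exception as exc:
            # Any a2a Message works here; previously this path enqueued the raw
            # Message and tripped the task-mode rejection above.
            await updater.failed(message=make_error_message(exc))  # TaskStatusUpdateEvent(FAILED, final=True)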
Tests:
- conftest TaskUpdater stub now records complete/failed calls AND
routes the message back through event_queue.enqueue_event so the
~20 legacy tests asserting on enqueue_event keep working
- 2 new regression tests pin the contract:
* test_terminal_success_routes_via_updater_complete
* test_terminal_error_routes_via_updater_failed
- Both NEW tests verified to FAIL on staging-baseline (without this
fix) and PASS with it — they'd catch the regression before staging
if the wheel-smoke gate covered task-mode terminal events too
(separate yak-shave for #131 follow-up)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the recurrence path of PR #2556. The data fix realigned 8→4
templates in publish-runtime.yml's TEMPLATES variable, but the
underlying drift hazard was unguarded — the next manifest change
could silently leave the cascade out of sync again.
This gate fails any PR that changes manifest.json or
publish-runtime.yml in a way that makes the cascade list diverge
from manifest workspace_templates (suffix-stripped). Either
direction is caught:
- missing-from-cascade: templates that won't auto-rebuild on a new wheel
  publish (the codex-stuck-on-stale-runtime bug class — PR #2512 added
  codex to manifest, cascade wasn't updated, codex stayed pinned to its
  last-built runtime version for weeks).
- extra-in-cascade: cascade dispatches to deprecated templates (the
  wasted-API-calls + dead-CI-noise class — PR #2536 pruned 5 templates
  from manifest; cascade kept dispatching to all 8 until PR #2556).
Triggers narrowly: only on PRs that touch manifest.json,
publish-runtime.yml, or the script itself. Fast (single grep+sed+comm
pipeline, no Go build).
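The shipped gate is that shell pipeline; for illustration only, the same set
comparison in Python (the paths, TEMPLATES layout, and suffix rule here are
assumptions):

    import json
    import re
    import sys

    manifest = set(json.load(open("manifest.json"))["workspace_templates"])
    workflow = open("publish-runtime.yml").read()
    raw = re.search(r"TEMPLATES:\s*(.+)", workflow).group(1)
    cascade = {t.strip().removesuffix("-template")          # suffix rule assumed
               for t in raw.replace(",", " ").split() if t.strip()}

    missing = sorted(manifest - cascade)   # won't auto-rebuild on the next wheel publish
    extra = sorted(cascade - manifest)     # dispatches to deprecated templates
    for name in missing:
        print(f"MISSING: {name}")
    for name in extra:
        print(f"EXTRA: {name}")
    sys.exit(1 if (missing or extra) else 0)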
Surfaced during the RFC #388 prior-art audit; folded in as the
structural follow-up that the #2556 data fix promised.
Self-tested both failure modes locally before commit:
- Drop codex from cascade → script fails with "MISSING: codex"
- Add langgraph to cascade → script fails with "EXTRA: langgraph"
Refs: https://github.com/Molecule-AI/molecule-controlplane/issues/388
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Boot smoke (#2275) exercises executor.execute() against stub deps
and never hits the real provider, so missing auth env is not a real
blocker. Without this bypass, every new auth env var an adapter
introduces must be mirrored into molecule-ci's fake-env list — a
maintenance treadmill that just bit hermes-template:
- 2026-05-03 09:47 UTC: hermes publish-image smoke fails on
HERMES_API_KEY preflight (workflow injects CLAUDE_CODE_OAUTH_TOKEN,
ANTHROPIC_API_KEY, GEMINI_API_KEY, OPENAI_API_KEY but not
HERMES_API_KEY or OPENROUTER_API_KEY). Failed for two cycles
before being noticed.
The bypass demotes Required-env failures to warnings when
MOLECULE_SMOKE_MODE is truthy, so the unset env stays visible in
the boot log without blocking. Production paths are unchanged
(env unset → fail).
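Rough shape of the demotion (function and variable names other than
MOLECULE_SMOKE_MODE are illustrative):

    import logging
    import os

    def check_required_env(names: list[str]) -> None:
        smoke = os.environ.get("MOLECULE_SMOKE_MODE", "").lower() in ("1", "true", "yes")
        missing = [n for n in names if not os.environ.get(n)]
        if not missing:
            return
        if smoke:
            # Stays visible in the boot log without blocking the smoke run.
            logging.warning("smoke mode: required env unset (would fail in production): %s", missing)
            return
        raise RuntimeError(f"required env unset: {missing}")  # production path unchanged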
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
a2a-sdk ≥ 1.0 raises InvalidAgentResponseError when an executor publishes a
TaskStatusUpdateEvent (e.g. via TaskUpdater.start_work) on a fresh request
before any Task event. The framework only auto-creates the Task on
continuation messages (existing task_id resolves via task_manager.get_task);
new requests leave _task_created unset and the SDK validation at
a2a/server/agent_execution/active_task.py rejects the first status update.
PR #2170 migrated the executor surface to v1 but missed this contract. The
synthetic E2E gate caught it on every staging run since (~1 week of
silent failures) with:
{"jsonrpc":"2.0","id":"e2e-msg-1","error":{"code":-32603,
"message":"Agent should enqueue Task before TaskStatusUpdateEvent
event","data":null}}
The fix enqueues a Task(state=SUBMITTED) before the TaskUpdater is
constructed, gated on `context.current_task is None` so continuation
messages don't double-enqueue (which the SDK logs about but doesn't reject).
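Sketch of the guard inside execute() (exact a2a.types constructor fields and
the TaskUpdater signature may differ by SDK version):

    from a2a.types import Task, TaskState, TaskStatus  # stubbed in conftest for unit tests

    if context.current_task is None:  # fresh request: the SDK will not auto-create the Task
        await event_queue.enqueue_event(
            Task(
                id=context.task_id,
                context_id=context.context_id,
                status=TaskStatus(state=TaskState.submitted),
            )
        )
    updater = TaskUpdater(event_queue, context.task_id, context.context_id)
    await updater.start_work()  # now accepted: a Task event precedes the status update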
Tests:
- test_first_event_is_task_for_new_request — pins the new-request path:
first enqueue must be a Task with the expected id/context_id
- test_no_task_enqueue_on_continuation — pins the continuation path: when
context.current_task is set, the executor must NOT re-enqueue Task
- conftest: stub Task / TaskStatus / TaskState in the mocked a2a.types
module so the import inside the executor resolves under unit tests
The google-adk adapter does not have this bug — its execute() only emits
Message events, not TaskStatusUpdateEvent. Its cancel() does emit one,
but cancel is rarely invoked and out of scope for this fix.
Live verification path: this PR's merge → publish-runtime cascade → next
synth-E2E firing should go green at step "8/11 Sending A2A message to
parent — expecting agent response".
CP's deprovision flow calls Secrets.DeleteSecret() (provisioner/ec2.go:806)
but only when the deprovision runs to completion. Crashed provisions and
incomplete teardowns leak the per-tenant `molecule/tenant/<org_id>/bootstrap`
secret. At ~$0.40/secret/month, ~45 leaked secrets surfaced as ~$19/month
on the AWS cost dashboard.
The tenant_resources audit table (mig 024) tracks four kinds today —
CloudflareTunnel, CloudflareDNS, EC2Instance, SecurityGroup — and the
existing reconciler doesn't catch Secrets Manager orphans. The proper fix
(KindSecretsManagerSecret + recorder hook + reconciler enumerator) is filed
as a follow-up controlplane issue. This sweeper is the immediate stopgap.
Parallel-shape to sweep-cf-tunnels.sh:
- Hourly schedule offset (:30, between sweep-cf-orphans :15 and
sweep-cf-tunnels :45) so the three janitors don't burst CP admin
at the same minute.
- 24h grace window — never deletes a secret younger than the
  provisioning roundtrip, so the sweep can't race an in-flight
  provision and delete its secret out from under it.
- MAX_DELETE_PCT=50 default (mirrors sweep-cf-orphans for durable
resources; tenant secrets should track 1:1 with live tenants).
- Same schedule-vs-dispatch hardening as the other janitors:
schedule → hard-fail on missing secrets, dispatch → soft-skip.
- 8-way xargs parallelism, dry-run by default, --execute to delete.
Requires a dedicated AWS_JANITOR_* IAM principal — the prod molecule-cp
principal lacks secretsmanager:ListSecrets (it only has scoped
Get/Create/Update/Delete). The workflow's verify-secrets step will hard-fail
on the first scheduled run until those secrets are configured, surfacing
the missing setup loudly rather than silently no-op'ing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Layer 1 of the runtime-rollout plan. Decouples publish from promotion by
giving operators a `runtime_image_pins` table the provisioner consults at
container-create time. No row = legacy `:latest` behavior; row present =
provisioner pulls `<base>@sha256:<digest>`. One bad publish no longer
breaks every workspace simultaneously.
Mechanics:
- Migration 047: `runtime_image_pins` (template_name PK + sha256 digest +
audit columns) and `workspaces.runtime_image_digest` (nullable, with
partial index) for "show me workspaces still on the old digest" queries.
- `resolveRuntimeImage` (handlers/runtime_image_pin.go): looks up the
pin, returns `<base>@sha256:<digest>` on hit, "" on miss/error so the
provisioner falls through to the legacy tag map. Availability over
pinning — any DB error logs and returns "" rather than blocking the
provision. `WORKSPACE_IMAGE_LOCAL_OVERRIDE=1` short-circuits the
lookup so devs rebuilding template images locally see their fresh
build.
- `WorkspaceConfig.Image` carries the resolved value into the
provisioner. `selectImage` honors it ahead of the runtime→tag map and
falls back to DefaultImage on unknown runtime.
- The existing `imageTagIsMoving` predicate (#215) already returns false
on `@sha256:` form, so digest pins skip the force-pull path naturally.
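The shipped code is Go; purely for illustration, the resolution order above in
Python (identifiers mirror the Go names, details assumed):

    import os

    def resolve_runtime_image(template_name, lookup_pin):
        # Pin hit -> digest ref; miss, DB error, or local override -> "" (legacy path).
        if os.environ.get("WORKSPACE_IMAGE_LOCAL_OVERRIDE") == "1":
            return ""                            # dev rebuilding locally sees their fresh build
        try:
            pin = lookup_pin(template_name)      # stands in for the runtime_image_pins SELECT
        except Exception:
            return ""                            # availability over pinning: never block the provision
        return f"{pin.base}@sha256:{pin.digest}" if pin else ""

    def select_image(config_image, runtime, runtime_tag_map, default_image):
        # Explicit image wins, then the runtime->tag map, then DefaultImage.
        return config_image or runtime_tag_map.get(runtime, default_image)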
Tests:
- Handler-side (sqlmock): no-pin/db-error/with-pin/empty/unknown/local-
override paths cover every branch of `resolveRuntimeImage`.
- Provisioner-side: `selectImage` table covers explicit-image preference,
runtime-map fallback, unknown-runtime → default, empty-config →
default. Plus a struct-literal compile-time pin on `Image` so a future
refactor can't silently drop the field.
Layer 2 (per-ring routing via `workspaces.runtime_image_digest`) and the
admin promote/rollback endpoint ride on top of this and ship separately.
The cascade `TEMPLATES` list in publish-runtime.yml had drifted from
manifest.json:
Currently dispatches to: claude-code, langgraph, crewai, autogen,
deepagents, hermes, gemini-cli, openclaw
manifest.json supports: claude-code, hermes, openclaw, codex (after
PR #2536 pruned to 4 actively-supported)
Two consequences of the drift:
1. `codex` (added in PR #2512, supported in manifest) was never in the
cascade — fresh runtime publishes did NOT trigger a codex template
rebuild. Codex stayed pinned to whatever runtime version it last saw
at its own image-build time.
2. langgraph/crewai/autogen/deepagents/gemini-cli — deprecated, no
   shipping images, no working A2A — were still receiving cascade
   dispatches. Wasted API calls and (worse) green CI on dead repos,
   which masks the "this template is dead, stop maintaining it" signal.
Now matches manifest.json workspace_templates exactly. Surfaced during
RFC #388 (fast workspace provision) prior-art audit.
Long-term fix is to derive TEMPLATES from manifest.json so this can't
drift again — captured as a Phase-1 invariant in RFC #388. This commit
is the data fix only; structural fix lands with the bake pipeline.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Independent code review of #2555 caught two contrast regressions left
by the bulk perl pass:
1. text-white → text-ink mass-substitution silently broke destructive
and primary buttons. text-ink resolves to #15181c (warm-paper
near-black) in light mode — dark text on bg-red-600 / bg-amber-600
/ bg-emerald-600 / bg-blue-600 / bg-accent / bg-accent-strong /
/ bg-good / bg-bad fails WCAG contrast and looks broken. The per-line
pass flips text-ink → text-white only when a saturated bg utility
is present; tinted-state pills (bg-red-950/50 etc.) keep their
intentionally-retained text-* literals.
2. The original mapping table was missing bg-zinc-600 (the most-used
hover-state literal for cancel buttons — it made them jump from their
warm-cream resting state to dark zinc on hover in light mode) and
text-zinc-700/800/900 (separator dots and decorative dim text
invisible on warm-paper light bg). Extended mapping fills these
gaps with bg-surface-card / text-ink-soft.
Also: drop stale tailwind.config.ts reference from components.json
(file deleted by the v3→v4 migration); switch baseColor zinc →
neutral and enable cssVariables since v4 uses CSS-driven tokens.
Future shadcn-cli invocations would have failed or written malformed
components without this.
27 sites in 27 files affected by #1, ~20 sites in 20 files by #2.
1214/1214 unit tests still pass; build still clean.
Findings courtesy of multi-model review per code-review-and-quality
skill — different blind spots catch different bugs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI's `npm ci` failed because the previous lock was generated on macOS
arm64, which omits the Linux-specific optional deps that
@tailwindcss/postcss → lightningcss-linux-x64-gnu transitively need
(@emnapi/runtime, @emnapi/core).
Re-ran `npm install --include=optional` so the lock includes every
platform variant of lightningcss + the @emnapi packages they pull in.
The runner (Linux x64) now has what it needs; local macOS installs are
still fine (npm picks the matching binary at install time).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-review of PR #2553 caught an unreachable defensive block at
test_load_skills_call_sites.py:99-103: the inner check guarded
`call.func.__class__.__name__ == "Name"` from a FunctionDef, but
`_find_load_skills_calls` already filters its return type to
`ast.Call` — `FunctionDef` cannot reach that loop body. The block
was a no-op `pass` with a misleading comment.
Removing it keeps the gate behaviorally identical; tests still pass.
The same five-axis review pass that turned this up also approved the
substantive logic of #2553, so no behavior change here.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the documentation + audit gap for declarative skill-compat. The
plumbing has been live since PR #117 (RuntimeCapabilities) and
skill_loader's `_normalize_runtime_field` has been emitting filter
decisions for weeks, but:
- No public doc explained the `runtime` frontmatter field, so skill
authors didn't know how to opt in / opt out.
- No structural gate ensured every load_skills() call site threads
current_runtime — a future caller forgetting the kwarg silently
force-loads runtime-incompatible skills (no AttributeError, just a
delayed crash on first tool invocation).
Two changes:
1. docs/agent-runtime/skills.md
- Adds `runtime`, `tags`, `examples` to the Frontmatter Fields table.
- Adds a Runtime Compatibility section with example, accepted shapes
(universal default, list, string sugar), and the "logged + omitted,
not crashed" failure mode. Notes that match values come from each
adapter's name() (the same string in config.yaml's runtime: field).
2. workspace/tests/test_load_skills_call_sites.py
- Static AST gate: walks every workspace/*.py (excluding tests),
finds load_skills(...) Call nodes, fails if any lacks
current_runtime= as a keyword.
- Defense-in-depth `test_known_call_sites_present` — pins that the
scan actually sees the two known callers (adapter_base,
skill_loader.watcher) so a refactor that moves them is loud.
- Sanity-checked the matcher against a synthetic violating module.
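Condensed sketch of the gate's core check (path exclusions and the
known-call-sites pin omitted; only bare-name calls matched here for brevity):

    import ast
    from pathlib import Path

    def find_offending_call_sites(root: Path) -> list[str]:
        offenders = []
        for path in root.glob("*.py"):           # the real gate walks workspace/*.py, excluding tests
            tree = ast.parse(path.read_text(), filename=str(path))
            for node in ast.walk(tree):
                if not (isinstance(node, ast.Call)
                        and isinstance(node.func, ast.Name)
                        and node.func.id == "load_skills"):
                    continue
                if not any(kw.arg == "current_runtime" for kw in node.keywords):
                    offenders.append(f"{path}:{node.lineno}")
        return offenders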
Same-shape pattern as PR #2358 (tenant_resources audit-coverage AST
gate, #150) — pin the contract structurally, not just behaviorally.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds adapter.event_log property+setter on BaseAdapter so adapters can
emit structured events (tool dispatch, skill load, executor errors)
without coupling to the chosen backend. Default is a shared no-op
DisabledEventLog; main.py overrides at boot from the
observability.event_log config block (PR-2 schema).
The shape is intentionally additive:
- Property is invisible to the BaseAdapter signature snapshot drift
gate (the helper walks vars(cls) for callables only — properties
are not callable). Verified with a regression test in the new
test_adapter_base_event_log.py.
- Existing adapters continue to work unchanged. Template repos that
never call self.event_log get the no-op for free.
- Setter accepts any EventLogBackend, so swapping memory↔disabled
at runtime (or to a future Redis backend) requires no adapter
code change.
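Shape of the additive surface, sketched (import path per workspace/event_log.py;
the wheel build rewrites it to molecule_runtime.event_log):

    from event_log import DisabledEventLog

    _NOOP = DisabledEventLog()  # shared no-op default

    class BaseAdapter:
        _event_log = None

        @property
        def event_log(self):
            # Adapters that never set a backend get the shared no-op for free; the
            # property object is not callable, so the vars(cls) snapshot gate ignores it.
            return self._event_log if self._event_log is not None else _NOOP

        @event_log.setter
        def event_log(self, backend):
            # Accepts any EventLogBackend: memory, disabled, or a future Redis backend.
            self._event_log = backend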
Sequels:
- PR-3c: emit events from claude-code/hermes adapters at the
natural points (tool dispatch, skill load).
- PR-4: skill-compat audit + SKILL.md frontmatter docs.
- Platform-side /workspaces/:id/activity endpoint reads the buffer.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the hard-coded HEARTBEAT_INTERVAL=30 in heartbeat.py and
log_level="info" in main.py with values from
ObservabilityConfig (#119 PR-1, schema landed in PR #2538).
Concrete plumbing:
- heartbeat.HeartbeatLoop accepts an `interval_seconds=` keyword
arg. Defaults to the legacy module constant so 2-arg callers
(existing tests, any downstream code that hasn't been updated)
keep their existing 30s behavior.
- main.py constructs HeartbeatLoop with
config.observability.heartbeat_interval_seconds — the value the
config parser already clamped to [5, 300].
- main.py's uvicorn.Config takes log_level from
config.observability.log_level (lowercased — uvicorn's convention
differs from Python logging's) with LOG_LEVEL env still winning
as an ops-side debugging override.
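Sketch of the two call sites (constructor arguments other than interval_seconds
are illustrative):

    # heartbeat.py
    HEARTBEAT_INTERVAL = 30  # legacy module constant, still the default

    class HeartbeatLoop:
        def __init__(self, client, workspace_id, interval_seconds: int = HEARTBEAT_INTERVAL):
            # No re-clamping here: the config parser already bounded the value to [5, 300].
            self.interval_seconds = interval_seconds

    # main.py
    import os
    import uvicorn

    loop = HeartbeatLoop(client, workspace_id,
                         interval_seconds=config.observability.heartbeat_interval_seconds)
    log_level = (os.environ.get("LOG_LEVEL") or config.observability.log_level).lower()
    server = uvicorn.Server(uvicorn.Config(app, log_level=log_level))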
Adapter EventLog wiring deferred to PR-3b (#208 follow-up) — touches
adapter_base interface + needs careful design, kept separate to keep
this PR small + reviewable.
Tests:
- test_heartbeat.py: 3 new tests pin default interval, explicit
override, and the [5, 300] band that the constructor accepts
without re-clamping (clamping is the parser's job).
- All 88 tests in test_heartbeat.py + test_config.py pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Docker-mode orphan sweeper was incorrectly targeting external runtime
workspaces, revoking their auth tokens ~6 minutes after creation (one
sweep cycle past the 5-min grace).
External workspaces have NO local container by design — their agent runs
off-host. The "no live container" predicate the sweep uses to detect
wiped-volume orphans matches every external workspace unconditionally,
which was killing the only auth credential the off-host agent has.
Reproducer: create runtime=external workspace, paste the auth token into
molecule-mcp / curl, wait 5 minutes. Next request returns
`HTTP 401 — token may be revoked`. Platform log shows
`Orphan sweeper: revoking stale tokens for workspace <id> (no live
container; volume likely wiped)`.
Fix: add `AND w.runtime != 'external'` to the sweep's SELECT. The
existing test regexes (third-pass query expectations + the shared
expectStaleTokenSweepNoOp helper) are tightened to require the new
predicate, so a regression that drops it fails CI immediately.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The wheel-build drift gate caught it correctly: any new top-level
module under workspace/ must be listed in TOP_LEVEL_MODULES so that
imports of it (`from event_log import …`) get rewritten to
`from molecule_runtime.event_log import …` at package time.
Without this entry, the published wheel ships event_log.py un-rewritten
and crashes at runtime with ModuleNotFoundError on first heartbeat.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds workspace/event_log.py with an in-memory EventLog backend and a
disabled no-op variant, plus EventLogConfig nested in
ObservabilityConfig (backend / ttl_seconds / max_entries).
The event log is the append-and-query buffer that the canvas Activity
tab and platform `/activity` endpoint will read in PR-3 of the #119
stack. Two backends ship in this PR:
- InMemoryEventLog: bounded ring buffer with TTL eviction, monotonic
ids that survive eviction so cursors don't break, thread-safe for
concurrent appends from heartbeat + main loop + A2A executor.
- DisabledEventLog: no-op for `backend: disabled` — opts the
workspace out without crashing callers that propagate event ids.
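Compressed sketch of the in-memory backend's core invariants (method names
illustrative; the real class adds the query surface and config plumbing):

    import itertools
    import threading
    import time
    from collections import deque

    class InMemoryEventLog:
        def __init__(self, max_entries=1000, ttl_seconds=3600, clock=time.monotonic):
            self._entries = deque()              # (id, timestamp, event)
            self._ids = itertools.count(1)       # monotonic ids that survive eviction
            self._lock = threading.Lock()        # safe for heartbeat + main loop + A2A executor
            self._max, self._ttl, self._clock = max_entries, ttl_seconds, clock

        def append(self, event):
            with self._lock:
                event_id, now = next(self._ids), self._clock()
                self._entries.append((event_id, now, event))
                while self._entries and (len(self._entries) > self._max
                                         or now - self._entries[0][1] > self._ttl):
                    self._entries.popleft()      # size + TTL eviction from the old end
                return event_id

        def since(self, cursor=0):
            with self._lock:
                return [(i, e) for (i, ts, e) in self._entries if i > cursor]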
Schema-only PR — no consumers wired yet. Wiring lands in PR-3.
Test coverage:
- 34 new test_event_log.py tests (100% line coverage on event_log.py)
- 9 new test_config.py tests for EventLogConfig parsing
- Concurrency stress with 8 threads × 200 appends — verifies unique
monotonic ids under contention
- TTL + max_entries eviction with injected clock (no time.sleep)
- Disabled backend contract pinned
Closes #207.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Heartbeats fire every 60s per workspace and were the dominant caller
of ReadPlatformInboundSecret — one DB SELECT each, purely to redeliver
the same value. For an N-workspace fleet that's N SELECTs/minute of
pure overhead, growing linearly with the fleet (#189).
This adds a sync.Map cache keyed by workspaceID with a 5-minute TTL:
- **Read-through**: cache miss → DB SELECT → populate → return.
- **Write-through**: every IssuePlatformInboundSecret call refreshes
the cache with the new value before returning, so the lazy-heal mint
path (readOrLazyHealInboundSecret) doesn't see a stale read of the
value it just wrote.
- **TTL eviction**: 5 minutes — generous enough that the heartbeat
hot path hits cache for ~5 reads in a row before re-validating, short
enough that an out-of-band rotation (operator running
`UPDATE workspaces SET platform_inbound_secret=...` directly)
propagates within minutes without requiring a redeploy.
- **Absence not cached**: ErrNoInboundSecret skips the cache write so
the lazy-heal recovery contract for the column-NULL case
(readOrLazyHealInboundSecret in workspace_provision_shared.go) keeps
working.
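The shipped cache is a Go sync.Map; for illustration, the read-through,
write-through, and absence-not-cached rules sketched in Python:

    import threading
    import time

    TTL_SECONDS = 5 * 60
    _cache = {}                # workspace_id -> (secret, cached_at)
    _lock = threading.Lock()

    def read_inbound_secret(workspace_id, db_read):
        # Read-through; db_read stands in for the ReadPlatformInboundSecret SELECT.
        now = time.monotonic()
        with _lock:
            hit = _cache.get(workspace_id)
            if hit and now - hit[1] < TTL_SECONDS:
                return hit[0]
        secret = db_read(workspace_id)              # no-secret errors propagate here,
        with _lock:                                 # so absence is never cached
            _cache[workspace_id] = (secret, now)
        return secret

    def issue_inbound_secret(workspace_id, db_write):
        # Write-through: refresh before returning so lazy-heal never reads stale.
        secret = db_write(workspace_id)
        with _lock:
            _cache[workspace_id] = (secret, time.monotonic())
        return secret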
Memory footprint is bounded by the active workspace fleet (~200 bytes
per entry); deleted workspaces leave dead entries until process restart,
which is acceptable given that workspace deletion is operator-rare.
Why in-process instead of Redis: workspace-server runs as a single
Railway service today (per memory project_controlplane_ownership);
adding Redis for this single column read would be over-engineering.
The cache is a self-contained, Redis-free upgrade that keeps the same
semantic surface (read returns the latest secret) while collapsing
the heartbeat read storm. If the deployment ever fans out across
replicas, an operator-side rotation still propagates per replica within
the TTL bound, without needing a shared write log.
Tests: 5 new cases covering cache hit within TTL, refresh after TTL
(simulating an operator rotation via SQL), write-through on Issue,
absence-not-cached, and Reset clearing all entries. The setupMock
helper in wsauth and setupTestDB helper in handlers both call
ResetInboundSecretCacheForTesting() at start + cleanup so write-through
state from one test doesn't shadow SELECT expectations in the next.
SetInboundSecretCacheNowForTesting() exposes a deterministic clock
override so the TTL test doesn't sleep.
Task: #189.
Previously Start() only pulled when the image was missing locally
(imgErr != nil). Once a tenant's Docker daemon had `:latest` cached,
it stuck on that snapshot forever even after publish-runtime pushed
a newer image with the same tag — the same image-cache class that
sibling task #232 closed on the controlplane redeploy path.
Now Start() additionally re-pulls when the tag is "moving"
(`:latest`, no tag, `:staging`, `:main`, `:dev`, `:edge`, `:nightly`,
`:rolling`). Pinned tags (semver, sha-prefixed, date-stamped, build-id)
and digest-pinned references (`@sha256:...`) skip the pull because
their contents are by definition immutable.
The classifier (imageTagIsMoving) is deliberately conservative on the
"moving" side — only the well-known moving tags trip it. Misclassifying
a pinned tag as moving wastes bandwidth on every provision; misclassifying
moving as pinned silently bricks the fleet on stale snapshots, which
is exactly the bug class this fix closes.
Edge cases handled:
- Registry hostname with port (`localhost:5000/foo`) — the `:5000` is
not mistaken for a tag.
- Digest pinning (`image@sha256:...`) — never re-pulled even if a
moving-looking tag is also present.
- Legacy local-build tags (`workspace-template:hermes`) — treated as
pinned (no registry to move from).
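The classifier itself is Go; the decision rules (including the edge cases above)
sketched in Python for illustration:

    MOVING_TAGS = {"latest", "staging", "main", "dev", "edge", "nightly", "rolling"}

    def image_tag_is_moving(ref: str) -> bool:
        if "@sha256:" in ref:
            return False                   # digest-pinned: immutable, never re-pull
        name = ref.rsplit("/", 1)[-1]      # a ':' before the last '/' is a registry port, not a tag
        if ":" not in name:
            return True                    # no tag at all -> implicit :latest
        tag = name.rsplit(":", 1)[1]
        return tag in MOVING_TAGS          # semver / sha / date / build-id tags stay pinned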
Test coverage: 22 cases across all classifier shapes. No changes to
the pull-failure path (still best-effort, ContainerCreate still
surfaces the actionable "image not found" error if the pull failed
and the cache is also empty).
Task: #215. Companion to #232.
The drift gate's monorepoRoot walk-up looked for workspace-configs-templates/
which is gitignored locally and doesn't exist in this repo at all (the
canonical script lives in molecule-ai-workspace-template-hermes). The
test failed on CI from day one with "could not find monorepo root".
Two layered fixes in one PR:
1. Vendor upstream derive-provider.sh as testdata/ + drop monorepoRoot.
The vendored copy has a header pointing operators at the upstream
source and a one-line cp command for refresh. Test now reads two
files (vendored shell + workspace_provision.go) via package-relative
paths — Go test sets cwd to the package dir, so this is hermetic
without any walk-up gymnastics.
2. Update the case-statement regex to match upstream's renamed variable
(${_HERMES_MODEL} since v0.12.0, the resolved value of
HERMES_INFERENCE_MODEL with a HERMES_DEFAULT_MODEL legacy fallback).
Regex now accepts either spelling so a future rename fails loudly
on the parser-sanity check rather than silently returning empty.
Vendoring upstream surfaced real drift the gate was designed to catch:
upstream v0.12.0 added 12 provider prefixes that deriveProviderFromModelSlug
didn't handle (xai/grok, bedrock/aws, tencent/tencent-tokenhub, gmi,
qwen-oauth, lmstudio/lm-studio, minimax-oauth, alibaba-coding-plan,
google-gemini-cli, openai-codex, copilot-acp, copilot). Without these,
Save+Restart on a workspace using one of those prefixes would persist
LLM_PROVIDER="" and the next boot would fall back to derive-provider.sh's
runtime *=auto branch — losing the user's explicit choice on every restart.
Added all 12 case clauses + 16 new table-driven test cases (covering
both canonical and aliased forms). Drift gate now passes; future
upstream additions will fail loudly with a "DRIFT: ..." message
pointing the engineer at the missing case.
Task: #242
PR #2535 added a Go port of derive-provider.sh
(deriveProviderFromModelSlug) so workspace-server can persist
LLM_PROVIDER into workspace_secrets at provision time. This created
two sources of truth — if a future PR adds a provider prefix to one
without the other, the platform's persisted LLM_PROVIDER silently
disagrees with what the container's derive-provider.sh produces at
boot, with no test going red.
This adds a hermetic drift gate that:
1. Parses workspace-configs-templates/hermes/scripts/derive-provider.sh
with regex (handling both single-line `pat/*) PROVIDER="x" ;;`
clauses and multi-line conditional clauses) to build a
map[prefix]provider.
2. Walks workspace_provision.go's AST with go/ast, finds
deriveProviderFromModelSlug, and extracts every case-clause
prefix → return-string-literal pair.
3. Cross-checks both directions and accepts only the two documented
divergences (nousresearch/* and openai/* both → "openrouter" at
provision time because derive-provider.sh's runtime-env checks
aren't loaded yet) via a hardcoded acceptedDivergences map.
4. Fails with an actionable message that names both files and
suggests the exact fix (add the case OR add to divergence list
with a comment).
Pattern: behavior-based AST gate from PR #2367 / memory feedback —
pin the invariant by what the function maps, not by what it's named.
Stdlib-only (go/ast, go/parser, go/token, regexp); no network, no DB,
no docker — reads two monorepo files in-process.
A second sanity-check test pins anchor prefixes the regex must find,
so a future shell-syntax change can't silently produce an empty map
and trivially pass the main gate.
Closes task #242.
PR #2545 self-review findings.
(1) originalModel was set from wsMetadataModel alone. On a hermes/pre-#240
workspace where MODEL_PROVIDER was never written but YAML has
runtime_config.model: "something", originalModel="" while the form
rendered "something" — handleSave's diff fired /model PUT on every
unrelated save (tier change → workspace auto-restart). Now snapshots
originalModel from the actual rendered model in BOTH loadConfig branches
so the diff stays scoped to user-initiated changes.
(2) The store-flush test asserted the call happened but didn't pin
success-gating. A future refactor wrapping the PATCH in try/catch and
unconditionally calling updateNodeData would have shipped green and
left the badge lying about server-rejected writes. New test pins the
PATCH-rejects-no-flush invariant.
(3) Hermes-edge regression test for (1).
All 1214 canvas tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three drift bugs in ConfigTab + ProviderModelSelector. Same root pattern:
the form's display, the diff baseline, and the canvas store all read or
write from different copies of the same data, so what the user sees and
what the runtime actually uses can diverge silently.
(1) currentModelId read runtime_config.model first; loadConfig overrode
only top-level config.model. With template YAML `runtime_config.model:
sonnet` and live MODEL_PROVIDER=`MiniMax-M2`, the form rendered
"Claude Code subscription / Claude Sonnet (OAuth)" while the container
env (and chat) used MiniMax-M2. Fix: loadConfig propagates
wsMetadataModel into BOTH places.
(2) handleSave's nextModel-vs-oldModel diff compared the form value to
the YAML default. With the fix for (1) mirroring wsMetadataModel into
the form's runtime_config.model for display, that diff would be non-zero
even on no-op saves and would fire /model PUT — which auto-restarts. New
originalModel state tracks the loaded MODEL_PROVIDER and is the diff
baseline.
(3) handleSave PATCHed the workspace row but never pushed the same
fields into useCanvasStore.updateNodeData. User picked T3, hit Save &
Restart, DB updated to tier=3, header pill kept showing T2 until full
hydrate. Fix: mirror dbPatch into the store.
Bonus: ProviderModelSelector.handleProviderChange used to auto-default
the model to next.models[0] (alphabetically first) when switching
providers. A user picked the MiniMax provider intending MiniMax-M2.7;
the form silently set MiniMax-M2 (first in the bucket) and the
workspace deployed with the wrong model. Now the model defaults to empty
for multi-model providers, forcing an explicit pick — Save/Deploy already
gate on model.trim() === "".
Three new tests in ConfigTab.provider.test.tsx pin (1)/(2)/(3); two
existing ProviderModelSelector tests updated to reflect the no-silent-
default behaviour, with a new single-model-auto-pick test for the
0-vs-many boundary. 1212/1212 canvas tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two follow-ups from PR #2543's multi-model code review (audit #253).
1. **Log silent yaml.Unmarshal errors (#256).** When a malformed
config.yaml made `yaml.Unmarshal(data, &raw)` fail, the affected
template silently disappeared from /templates with no trace —
operator could not distinguish "excluded due to parse error" from
"never existed." That widened a real foot-gun once PR #2543 added
structured top-level `providers:` (a string-shaped top-level
`providers:` decoded into `[]providerRegistryEntry` would fail and
drop the whole entry). Now logs `templates list: skip <id>:
yaml.Unmarshal: <err>` and continues with the rest.
2. **Coexistence test (#257 part 1).** PR #2543 covered the structured
registry and slug list in isolation. claude-code-default in
production ships BOTH: top-level `providers:` (structured registry,
2 entries) AND `runtime_config.providers:` (slug list, 3 entries).
New `TestTemplatesList_BothProviderShapesCoexist` mirrors that
layout, asserts both shapes surface independently with no
cross-talk (e.g. a slug-only entry like `anthropic-api` does NOT
synthesize a stub in the structured registry), and pins the JSON
wire-shape for both fields side-by-side.
3. **`base_url: null` decoding assertion (#257 part 3).** Adds an
explicit `got[0].BaseURL == ""` check in the existing
`TestTemplatesList_SurfacesProviderRegistry` test, locking in the
`string` (not `*string`) type. A future change to `*string` would
surface as JSON `null` and break canvas's "no base_url = use
provider defaults" branch — caught loudly by this assertion.
Tests: 11 TestTemplatesList_* now green, including the new
MalformedYAMLLogsAndSkips and BothProviderShapesCoexist.
The remaining piece of #257 — renaming `Providers []string` JSON tag
to `provider_slugs` — requires coordinated canvas updates across 4
files and is intentionally deferred to a separate PR (no canvas
churn while the user is mid-test).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the contract drift caught by audit #253. Task #235 ("Server:
enrich /templates payload with structured providers") was marked
completed, but `templates.go` only ever emitted the
`runtime_config.providers []string` slug list — the structured
ProviderEntry shape (auth_env, model_prefixes, model_aliases, base_url)
the description promised was never plumbed.
Templates ship the structured registry under a TOP-LEVEL `providers:`
block (claude-code carries 6+ entries today; hermes still uses the
slug list). Both shapes coexist and are independent — surface them as
two separate fields:
- `providers` → existing []string slug list (unchanged)
- `provider_registry` → new []providerRegistryEntry (structured)
The canvas's ProviderModelSelector comment block already anticipates
this ("Templates that ship explicit vendor metadata (future) should
override the heuristic."). With this field in place, the canvas can
optionally drop its prefix-inference fallback for templates that ship
an explicit registry — separate PR. Today's change is purely additive
on the server side; no canvas change required.
Tests:
- TestTemplatesList_SurfacesProviderRegistry: order preservation +
field plumbing on a claude-code-shaped fixture (oauth + minimax)
+ JSON wire-shape gate to catch struct-tag renames.
- TestTemplatesList_OmitsProviderRegistryWhenAbsent: omitempty so
legacy templates (hermes, langgraph) don't emit `null` and break
Array.isArray on the canvas side.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bug B fix, server-side complement to molecule-runtime PR #2538.
The runtime PR taught `workspace/config.py` to honour
`MODEL_PROVIDER` over `runtime_config.model` from the template's
verbatim YAML. This PR is the upstream half: workspace-server's
`applyRuntimeModelEnv` now sets `MODEL=<picked>` for **every**
runtime, not just hermes (which got `HERMES_DEFAULT_MODEL` already).
Pre-fix: applyRuntimeModelEnv's per-runtime switch only emitted
HERMES_DEFAULT_MODEL for hermes; every other runtime got nothing,
so the adapter read its template's default model from
/configs/config.yaml. Surfaced 2026-05-02 — picking MiniMax-M2 in
canvas → workspace booted with model=sonnet (claude-code template
default) and demanded CLAUDE_CODE_OAUTH_TOKEN.
Post-fix: MODEL is set unconditionally before the per-runtime switch.
HERMES_DEFAULT_MODEL stays for backwards compat. Adapters opt in by
reading os.environ["MODEL"] in their executor (claude-code adapter
already does this since the same Bug B fix; see
workspace-configs-templates/claude-code-default/adapter.py).
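Adapter-side opt-in is a one-liner, sketched (template_default stands in for the
model read from /configs/config.yaml):

    import os

    model = os.environ.get("MODEL") or template_default  # canvas-picked model wins over the template default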
Tests
=====
- `TestApplyRuntimeModelEnv_SetsUniversalMODELForAllRuntimes`:
table-driven across claude-code/hermes/langgraph/crewai + empty-model
fallback + MODEL_PROVIDER-secret-fallback path. Adding a new
runtime = adding a row, not writing a new test.
- All 6 sub-cases pass + existing
`TestWorkspaceCreate_FirstDeploy_UnknownModel_OnlyMintModelProvider`
pin still green.
Why now
=======
This was authored alongside the runtime PR but stashed (not committed)
during a session-handoff cleanup. The molecule-runtime side shipped at
SHA 16ac895a and is live on PyPI as molecule-ai-workspace-runtime
0.1.84, but until the workspace-server side ships, the canvas-picked
MODEL env never reaches non-hermes adapters.
Caught by the systematic stash audit triggered by the user's
discovery that ProviderModelSelector had been similarly stashed.
Closes the workspace-server side of #246. Builds on merged #2538.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>