molecule-core

Author	SHA1	Message	Date
Hongming Wang	aefb44aff2	fix(workspace-runtime): route delegate_task through platform A2A proxy tool_delegate_task was POSTing directly to peer["url"], which is the Docker-internal hostname (e.g. http://ws-X-Y:8000) for in-container peers. External callers (the standalone molecule-mcp wrapper running on an operator's laptop) get [Errno 8] nodename nor servname every single delegation, breaking the universal-MCP path's last "ride the same code as in-container" claim. The platform's /workspaces/:peer-id/a2a proxy endpoint already handles internal forwarding for in-container peers AND is the only path external runtimes can use. Unify on it: in-container callers pay one extra HTTP hop on the same Docker bridge (microseconds); external callers get a working delegation path for the first time. discover_peer is still called for access-control + online-status detection; only the routing target changes. Verified live on 2026-04-30 against workspace 8dad3e29 (external mac runtime) → 97ac32e9 (Claude Code Agent in-container): direct POST returned ConnectError, proxy POST returned "acknowledged from claude code agent" as requested. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 17:13:50 -07:00
Hongming Wang	cc58e87393	Merge pull request #2415 from Molecule-AI/feat/molecule-mcp-inbox-polling feat(workspace-runtime): inbox polling for standalone molecule-mcp	2026-04-30 23:41:47 +00:00
Hongming Wang	d061642cfc	test(inbox): bind side-effecting pop() before assert CodeQL flagged the bare `assert state.pop(...) is None` — under `python -O` asserts are stripped, which would skip the call entirely and the test would silently pass without exercising the code. Bind the result first so the call always runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 16:39:45 -07:00
Hongming Wang	b47d4ceb00	feat(workspace-runtime): add inbox polling for standalone molecule-mcp path The universal MCP server (a2a_mcp_server.py) was outbound-only — agents in standalone runtimes (Claude Code, hermes, codex, etc.) could delegate, list peers, and write memories, but never observed the canvas-user or peer-agent messages addressed to them. This blocked "constantly responding" loops without forcing operators back onto a runtime-specific channel plugin. This PR closes the inbound gap with a poller-fed in-memory queue and three new MCP tools: - wait_for_message(timeout_secs?) — block until next message arrives - inbox_peek(limit?) — list pending messages (non-destructive) - inbox_pop(activity_id) — drop a handled message A daemon thread polls /workspaces/:id/activity?type=a2a_receive every 5s, fills the queue from the cursor (since_id), and persists the cursor to ${CONFIGS_DIR}/.mcp_inbox_cursor so a restart doesn't replay backlog. On 410 (cursor pruned) we fall back to since_secs=600 for a bounded recovery window. Activity-row → InboxMessage extraction mirrors the molecule-mcp-claude-channel plugin's extractText (envelope shapes #1-3 + summary fallback). mcp_cli.main starts the poller alongside the existing register + heartbeat threads. In-container runtimes (which have push delivery via canvas WebSocket) skip activation, so inbox tools return an informational "(inbox not enabled)" message instead of double-delivery. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 16:32:48 -07:00
Hongming Wang	d00c8be8c9	Merge pull request #2413 from Molecule-AI/fix/external-runtime-universal-mcp feat(workspace-runtime): expose universal MCP server to runtime=external operators	2026-04-30 23:21:25 +00:00
Hongming Wang	b54ceb799f	fix: address 5-axis review findings on PR #2413 Critical: - ExternalConnectModal.tsx: filledUniversalMcp substitution searched for WORKSPACE_AUTH_TOKEN but the snippet's placeholder is now MOLECULE_WORKSPACE_TOKEN (changed in the previous polish commit `876c0bfc`). Operators copy-pasting the MCP tab would have gotten a literal "<paste from create response>" instead of the token. Fix the substitution to match the new placeholder name. Important: - mcp_cli._platform_register: 401/403 from initial register now hard-exits with code 3 + an actionable stderr message pointing the operator at the canvas Tokens tab. Pre-fix: warning log + continue, which made a bad-token startup silently fail (heartbeat 401's forever, every tool call also 401's, no clear surfacing in the operator's MCP client). 500/503 still log + continue (transient platform blips shouldn't abort the MCP loop). - a2a_mcp_server.cli_main docstring: removed stale claim that this is the wheel's console-script entry-point target. The actual target is mcp_cli.main since 2026-04-30. Wheel-smoke pins both names so the functionality was correct, but the doc was lying. Test coverage: 3 new mcp_cli tests: - register 401 exits code=3 + stderr mentions canvas Tokens tab - register 403 (C18 hijack rejection) takes same path - register 500/503 does NOT exit; only auth errors hard-fail Findings deferred to follow-up (acceptable per review rubric): - Code dedup across mcp_cli / heartbeat.py / molecule_agent SDK - Pooled httpx.Client for connection reuse - Heartbeat exponential backoff - Token-resolution ordering parity (env-first vs file-first) between mcp_cli.main and platform_auth.get_token Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 16:06:59 -07:00
Hongming Wang	876c0bfcd4	docs(canvas): update Universal MCP snippet — molecule-mcp now standalone The canvas tab snippet for the Universal MCP path was written before this PR added the built-in register + heartbeat thread. Earlier wording described it as "outbound-only — pair with the Claude Code or Python SDK tab for heartbeat + inbound messages" — that's stale. molecule-mcp now handles register + heartbeat itself; the only thing it doesn't yet do is inbound A2A delivery. Updated: - externalUniversalMcpTemplate header comment + body — describes standalone behavior, points operators at SDK/channel only when they need INBOUND (not heartbeat). - Drops the now-redundant curl-register step from the snippet — the binary registers itself on startup. - Canvas modal label likewise updated. No runtime / behavior change; pure docs polish so a copy-pasting operator's mental model matches what the binary actually does. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 15:52:15 -07:00
Hongming Wang	427300f3a4	feat: make molecule-mcp standalone (built-in register + heartbeat) + recover awaiting_agent on heartbeat Two paired fixes that together let an external operator run a single process (molecule-mcp) and see their workspace come up online in the canvas. The bug surfaced live when status stuck at "awaiting_agent / OFFLINE" despite an active MCP server. Platform side (workspace-server/internal/handlers/registry.go): Heartbeat handler already auto-recovers offline → online and provisioning → online, but NOT awaiting_agent → online. Healthsweep flips stale-heartbeat external workspaces TO awaiting_agent, and with no recovery path the workspace stays "OFFLINE — Restart" in the canvas forever. Add the symmetric branch: if currentStatus == "awaiting_agent" and a heartbeat arrives, flip to online + broadcast WORKSPACE_ONLINE. Mirrors the existing offline/provisioning patterns exactly. Test: TestHeartbeatHandler_AwaitingAgentToOnline asserts the SQL UPDATE fires with the awaiting_agent guard clause. Wheel side (workspace/mcp_cli.py): molecule-mcp was outbound-only: operators had to run a separate SDK process to register + heartbeat. Now mcp_cli.main(): 1. Calls /registry/register at startup (idempotent upsert flips status awaiting_agent → online via the existing register path). 2. Spawns a daemon thread that POSTs /registry/heartbeat every 20s. 20s is comfortably under the healthsweep stale window so a single missed beat doesn't cause status churn. 3. Runs the MCP stdio loop in the foreground. Both calls set Origin: ${PLATFORM_URL} so the SaaS edge WAF accepts them. Threaded heartbeat (not asyncio) chosen because it doesn't need to share an event loop with the MCP stdio server; daemon=True cleanly dies when the operator's runtime exits. MOLECULE_MCP_DISABLE_HEARTBEAT=1 escape hatch lets in-container callers (which have heartbeat.py running already) reuse the entry point without double-heartbeating. Default is enabled. End-to-end verification (live, against hongmingwang.moleculesai.app, workspace 8dad3e29-...): pre-fix: status=awaiting_agent → canvas shows OFFLINE forever; post-fix: ran `molecule-mcp` for 5s standalone → canvas state: status=online runtime=external agent=molecule-mcp-8dad3e29 Test coverage: 7 new mcp_cli tests (register-at-startup, heartbeat-thread-spawned, disable-env-skips-both, env-and-file token resolution, register payload shape, heartbeat endpoint + headers); 1 new platform test (awaiting_agent → online recovery). Full workspace + handlers suites green: 1355 Python, full Go handlers passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 15:42:44 -07:00
Hongming Wang	716589742c	feat(canvas): add Universal MCP tab to external-agent connect modal The "Connect your external agent" dialog already covered Claude Code, Python SDK, curl, and raw fields. This adds a Universal MCP tab that documents the new `molecule-mcp` console script, the runtime-agnostic baseline shipped by PR #2413's workspace-runtime changes. Surface area: - New `externalUniversalMcpTemplate` constant in workspace-server. Three-step snippet: pip install runtime → one-shot register via curl → wire molecule-mcp into agent's MCP config (Claude Code example, notes that hermes/codex/etc. take the same env-var contract). - Workspace create response now includes `universal_mcp_snippet` alongside the existing curl/python/channel snippets. - Canvas modal renders the tab when `universal_mcp_snippet` is present; backward-compatible with older platform builds (tab hides when empty). Origin/WAF coverage (the user explicitly asked for this): - The runtime wheel handles Origin automatically (this PR's earlier commit on platform_auth.auth_headers). - The curl tab now sets `Origin: {{PLATFORM_URL}}` preemptively with an explanatory comment; `/registry/register` is currently WAF-allowed without it but adding now keeps the snippet working if WAF rules expand. The comment also explains why `/workspaces/*` paths return empty 404 without Origin, the exact failure mode I hit while smoke-testing this PR live. - The MCP snippet's footer notes that the wheel auto-handles Origin so operators don't think about it. End-to-end verification (against live tenant hongmingwang.moleculesai.app, freshly registered workspace): - get_workspace_info → full JSON - list_peers → "Claude Code Agent (ID: 97ac32e9..., status: online)" - recall_memory → "No memories found." all returned by the molecule-mcp binary speaking MCP stdio to this Claude Code session. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 15:34:27 -07:00
Hongming Wang	74c5e0d7a8	fix(workspace-runtime): add Origin header so SaaS edge WAF accepts MCP tool calls Discovered while smoke-testing the molecule-mcp external-runtime path against a live tenant (hongmingwang.moleculesai.app). Every tool call that hit /workspaces/* or /registry/*/peers returned 404 — but /registry/register and /registry/heartbeat returned 200. Diagnosis: the tenant's edge WAF requires a same-origin header. Without it, unhandled paths get silently rewritten to the canvas Next.js app, which has no /workspaces or /registry/:id/peers route and returns an empty 404. The molecule-mcp-claude-channel plugin already sets this header (server.ts:271-276); the workspace runtime never did because in-container PLATFORM_URLs (Docker network) aren't behind the WAF. Fix: extend platform_auth.auth_headers() to include Origin: ${PLATFORM_URL} whenever PLATFORM_URL is set. Inside-container behavior is unchanged (the WAF is path-irrelevant for the internal hostnames). External-runtime calls now thread the WAF correctly. Verification (live, against a freshly-registered external workspace): pre-fix: get_workspace_info → "not found", list_peers → 404 post-fix: get_workspace_info → full workspace JSON, list_peers → "Claude Code Agent (ID: 97ac32e9..., status: online)" This is the kind of bug unit tests can never catch — caught only by running the wheel against the real tenant. Memory: feedback_always_run_e2e.md. Test coverage: 4 new tests in test_platform_auth.py — Origin alone when no token + Origin + Authorization both, no-PLATFORM_URL falls through to original empty-dict behavior, env-token path with Origin. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 15:30:15 -07:00
Hongming Wang	169e284d57	feat(workspace-runtime): expose universal MCP server to runtime=external operators Ship the baseline universal MCP path that any external runtime (Claude Code, hermes, codex, anything that speaks MCP stdio) can use, before optimizing per-runtime channels. Today the workspace MCP server only spins up inside the container; external operators have no way to call the 8 platform tools (delegate_task, list_peers, send_message_to_user, commit_memory, etc.) from outside. Three additive changes: 1. `platform_auth.get_token()` env-var fallback: adds `MOLECULE_WORKSPACE_TOKEN` as a fallback when no `${CONFIGS_DIR}/.auth_token` file exists. File-first preserves in-container behavior unchanged. External operators (no /configs volume) now have a way to supply the token without faking the filesystem layout. 2. `molecule-mcp` console script: adds a new entry point in the published `molecule-ai-workspace-runtime` PyPI wheel. Operators run `pip install molecule-ai-workspace-runtime`, set 3 env vars (WORKSPACE_ID, PLATFORM_URL, MOLECULE_WORKSPACE_TOKEN), and register the binary in their agent's MCP config. `mcp_cli.main` is a thin validator wrapper; it checks env BEFORE importing the heavy `a2a_mcp_server` module so a misconfigured first-run gets a friendly 3-line error instead of a 20-line module-level RuntimeError traceback. 3. Wheel smoke gate: extends `scripts/wheel_smoke.py` to assert `cli_main` and `mcp_cli.main` are importable. Same regression class as the 0.1.16 main_sync incident: a silent rename or unrewritten import here would break every external operator on the next wheel publish (memory: feedback_runtime_publish_pipeline_gates.md). Test coverage: - `tests/test_platform_auth.py`: 8 new tests for the env-var fallback: file-priority, env-fallback, whitespace handling, cache, header construction, empty-env-as-unset. - `tests/test_mcp_cli.py`: 8 new tests for the validator: each required var separately, file-or-env satisfies token requirement, whitespace-only env treated as missing, help mentions canvas Tokens tab. - Full `workspace/tests/` suite green: 1346 passed, 1 skipped. - Local end-to-end: built wheel, installed in venv, ran `molecule-mcp` with no env → friendly error; with env → MCP server starts. Why now / why this shape: user redirect was "support the baseline first so all runtimes can use, then optimize". A claude-only MCP channel leaves hermes/codex/third-party operators broken on runtime=external. This PR ships the runtime-agnostic baseline; per-runtime polish (claude-channel push delivery, hermes-native bindings) is a follow-up PR. PR #2412 fixed the partner bug where canvas Restart silently revoked the operator's token; the two together unblock the external-runtime story end-to-end. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 15:20:19 -07:00
Hongming Wang	d2046c374d	Merge pull request #2412 from Molecule-AI/fix/restart-external-no-revoke fix(workspace-server): skip provision pipeline on Restart for runtime=external	2026-04-30 22:11:25 +00:00
Hongming Wang	36e263a07d	fix(workspace-server): skip provision pipeline on Restart for runtime=external POST /workspaces/:id/restart on a runtime=external workspace ran the full re-provision pipeline (Stop → provisionWorkspace), which calls issueAndInjectToken → RevokeAllForWorkspace. For external workspaces (operator-driven, no container/EC2) that silently destroyed the operator's local bearer token on every "Restart" click in the canvas; the local poller would then 401-spam against /activity until the operator manually regenerated from the Tokens tab. The auto-restart path (runRestartCycle, line 436) already short-circuits runtime=external. This patch mirrors that for the manual handler so the two paths agree, and surfaces a 200 OK with a clear message so the canvas can tell the operator the fix is on their side rather than silently no-op'ing. Test coverage: TestRestartHandler_ExternalRuntimeNoOps asserts the short-circuit fires before any DB write or provision call. sqlmock's "unexpected query" failure mode would catch a regression that re-introduced the token revoke or the status=provisioning UPDATE. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 15:08:48 -07:00
Hongming Wang	c68ec23d3c	Merge pull request #2410 from Molecule-AI/auto/harness-replays-ci-gate ci: gate PRs on tests/harness/run-all-replays.sh	2026-04-30 20:35:30 +00:00
Hongming Wang	0f0df576f5	Merge pull request #2392 from Molecule-AI/auto/e2e-staging-external-runtime test(e2e): live staging regression for external-runtime awaiting_agent transitions	2026-04-30 20:32:23 +00:00
Hongming Wang	c8b17ea1ad	fix(harness): install httpx for replay Python evals peer-discovery-404 imports workspace/a2a_client.py which depends on httpx; the runner's stock Python doesn't have it, so the replay's PARSE assertion (b) fails with ModuleNotFoundError on every run. The WIRE assertion (a) — pure curl — passes, so the failure was masking just enough to make the replay LOOK partially-broken when the tenant side is fine. Adding tests/harness/requirements.txt with only httpx instead of sourcing workspace/requirements.txt: that file pulls a2a-sdk, langchain-core, opentelemetry, sqlalchemy, temporalio, etc. — ~30s of install for one replay's PARSE step. The harness's deps surface should grow when a new replay introduces a new import, not by default. Workflow gains one step (`pip install -r tests/harness/requirements.txt`) between the /etc/hosts setup and run-all-replays. No other changes.	2026-04-30 13:32:00 -07:00
Hongming Wang	9dae0503ee	fix(harness): generate SECRETS_ENCRYPTION_KEY per-run instead of hardcoding Replaces the hardcoded base64 sentinel (`630dd0da`) with a per-run generation in up.sh, exported into compose's interpolation environment. Why: - Hardcoding a 32-byte base64 string in the repo, even one labelled "test-only", sets a bad muscle-memory pattern. The next agent or contributor copies the shape into another harness — or worse, into a staging .env — and the test-only sentinel turns into something someone treats as a real key. - Secret scanners flag key-shaped values regardless of the surrounding comment claiming intent. Avoiding the literal entirely sidesteps the false-positive. - A fresh key per harness lifetime more closely mimics prod's per-tenant isolation, exercising the same code paths without any pretense of stable encrypted-data fixtures (which the harness wipes on every ./down.sh anyway). Implementation: - up.sh: `openssl rand -base64 32` if SECRETS_ENCRYPTION_KEY isn't already set in the caller's env. Honoring a pre-set value lets a debug session pin a key for reproducibility (e.g. when investigating encrypted-row corruption). - compose.yml: `${SECRETS_ENCRYPTION_KEY:?…}` makes a misuse loud — running `docker compose up` directly bypassing up.sh fails fast with a clear error pointing at the right entry point, rather than a 100s unhealthy-tenant timeout. Both paths verified via `docker compose config`: - with key exported: value interpolates cleanly - without it: "required variable SECRETS_ENCRYPTION_KEY is missing a value: must be set — run via tests/harness/up.sh, which generates one per run"	2026-04-30 13:30:14 -07:00
Hongming Wang	630dd0dae7	fix(harness): seed SECRETS_ENCRYPTION_KEY so MOLECULE_ENV=production tenant boots Found via the first run of the harness-replays-required-check workflow (#2410): the tenant container failed its healthcheck after 100s with "refusing to boot without encryption in production". This is the deferred CRITICAL flagged on PR #2401: `crypto.InitStrict()` requires SECRETS_ENCRYPTION_KEY when MOLECULE_ENV=production, and the harness sets prod-mode but never seeded a key. Fix: add a clearly-test 32-byte base64 value (encoding the literal string "harness-test-only-not-for-prod!!") inline. Keeping MOLECULE_ENV=production preserves the harness's value as a production-shape replay surface; it now exercises the full encryption boot path including the strict check, rather than skirting it via dev-mode. Why inline rather than .env: - The harness compose file is meant to be self-contained and reproducible from a clean clone. An external .env would split the config across two files for one synthetic value. - The value is intentionally a sentinel; there's no operator decision here to gate behind a per-deployment file. After this lands the harness boots clean and `run-all-replays.sh` can exercise the buildinfo + peer-discovery replays as designed. The required-check workflow itself (#2410) needs no change.	2026-04-30 13:25:52 -07:00
Hongming Wang	be44e54b77	Merge pull request #2411 from Molecule-AI/auto/prod-version-check-script ops: check-prod-versions.sh — one-line tenant version status	2026-04-30 20:16:43 +00:00
Hongming Wang	24cb2a286f	ci(harness-replays): KEEP_UP=1 so dump-logs step has containers to read First run on PR #2410 failed with 'container harness-tenant-1 is unhealthy' but the dump-compose-logs step printed empty tenant logs because run-all-replays.sh's trap-on-EXIT had already torn down the harness. Setting KEEP_UP=1 leaves containers in place; the always-run Force teardown step at the end owns cleanup explicitly. Now we'll actually see why the tenant didn't become healthy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 13:15:46 -07:00
Hongming Wang	41d5f9558f	ops: scripts/ops/check-prod-versions.sh, one-line "is each tenant on latest?" Iterates a list of tenant slugs (default canary set on production, operator-supplied on staging), curls each tenant's /buildinfo plus canvas's /api/buildinfo, compares to origin/main's HEAD SHA, prints a table with one of {current, stale, unreachable} per surface. Returns non-zero if any surface is stale, so it can be wired into a periodic alert later. Why this exists: every "is the fix live?" question used to be answered with a one-off curl + git rev-parse + manual diff. This script does that uniformly across every public surface (workspace tenants + canvas) and is parseable. The redeploy verifier (#2398) covers the deploy moment; this covers any-time-after. Reads EXPECTED_SHA from `gh api repos/Molecule-AI/molecule-core/commits/main` so it always reflects the actual upstream tip, not local working-copy state. Falls back to local origin/main with a WARN if `gh` isn't logged in; debugging is still useful even if the comparison may lag. Depends on: - #2409 (TenantGuard /buildinfo allowlist): without it every tenant looks "unreachable" because the route 404s before the handler. Already merged on staging; will hit production after the next staging→main fast-forward + redeploy. - #2407 (canvas /api/buildinfo): already on main + Vercel. Usage: `./scripts/ops/check-prod-versions.sh` (production canary set); `TENANT_SLUGS="a b c" ./scripts/ops/check-prod-versions.sh` (custom set); `ENV=staging TENANT_SLUGS="..." ./scripts/ops/check-prod-versions.sh`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 13:13:47 -07:00
Hongming Wang	b5c7b349d8	Merge pull request #2408 from Molecule-AI/auto/fix-healthsweep-gate-saas fix(boot): always start health-sweep goroutine — SaaS tenants need it for external-runtime liveness	2026-04-30 20:12:28 +00:00
Hongming Wang	3105e87cf7	ci: gate PRs on tests/harness/run-all-replays.sh Closes the gap between "the harness exists" and "the harness blocks bugs." Phase 2 of the harness roadmap (per tests/harness/README.md): make harness-based E2E a required CI check on every PR touching the tenant binary or the harness itself. Trigger: push + pull_request to staging+main, paths-filtered to workspace-server/, canvas/, tests/harness/**, and this workflow. merge_group support included so this becomes branch-protectable. Single-job-with-conditional-steps pattern (matches e2e-api.yml). One check run regardless of paths-filter outcome; satisfies branch protection cleanly per the PR #2264 SKIPPED-in-set finding. Why this exists: 2026-04-30 we shipped a TenantGuard allowlist gap (/buildinfo added to router.go in #2398, never added to the allowlist) that the existing buildinfo-stale-image.sh replay would have caught. The harness was wired correctly; nobody ran it. Replays as a discipline beat replays as a memory item. The CI pipeline: detect-changes (paths filter) └ harness-replays (always) ├ no-op pass when paths-filter says no relevant change └ otherwise: checkout + sibling plugin checkout + /etc/hosts entry + run-all-replays.sh + compose-logs-on-failure + force-teardown Compose logs from tenant/cp-stub/cf-proxy/postgres are dumped on failure so a CI red is debuggable without re-reproducing locally. The trap in run-all-replays.sh handles teardown; the always-run down.sh step is a belt-and-suspenders against trap-bypass kills. Follow-ups (not in this PR): - Add this check to staging branch protection once it's been green for a few PRs (the new-workflow-instability hedge that other gates followed). - Eventually wire the buildx GHA cache to speed up tenant image builds; currently every PR rebuilds the full Dockerfile.tenant (Go + Next.js + template clones) from scratch. Acceptable for now; optimize when the timeout-minutes:30 ceiling becomes painful. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 13:04:53 -07:00
Hongming Wang	9c7b14923d	Merge pull request #2409 from Molecule-AI/auto/tenant-guard-allowlist-buildinfo fix(tenant-guard): allowlist /buildinfo so redeploy verifier can reach it	2026-04-30 19:58:00 +00:00
Hongming Wang	8516a8f9c6	fix(tenant-guard): allowlist /buildinfo so redeploy verifier can reach it The /buildinfo route added in #2398 to verify each tenant runs the published SHA was 404'd by TenantGuard on every production tenant — the allowlist had /health, /metrics, /registry/register, /registry/heartbeat, but not /buildinfo. The redeploy workflows curl /buildinfo from a CI runner with no X-Molecule-Org-Id header, TenantGuard 404'd them, gin's NoRoute proxied to canvas, canvas returned its HTML 404 page, jq read empty git_sha, and the verifier silently soft-warned every tenant as "unreachable" — which the workflow doesn't fail on. Confirmed externally: curl https://hongmingwang.moleculesai.app/buildinfo → HTTP 404 + Content-Type: text/html (Next.js "404: This page could not be found.") even though /health on the same host returns {"status":"ok"} from gin. The buildinfo package's own doc already declares /buildinfo public by design ("Public is intentional: it's a build identifier, not operational state. The same string is already published as org.opencontainers.image.revision on the container image, so no new info is exposed.") — the allowlist just missed it. Pin the alignment in tenant_guard_test.go: TestTenantGuard_AllowlistBypassesCheck now asserts /buildinfo returns 200 without an org header alongside /health and /metrics, so a future allowlist edit can't silently regress the verifier again. Closes the silent-success failure mode: stale tenants will now show up as STALE (hard-fail) rather than UNREACHABLE (soft-warn). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 12:54:51 -07:00
Hongming Wang	235aca9908	fix(boot): always start health-sweep goroutine — SaaS tenants need it for external-runtime liveness Pre-fix, cmd/server/main.go gated the entire health-sweep goroutine on `prov != nil`. On SaaS tenants (`MOLECULE_ORG_ID` set) the local Docker provisioner is never initialized — only `cpProv`. So the goroutine never started, and `sweepStaleRemoteWorkspaces` (which transitions runtime='external' workspaces from 'online' to 'awaiting_agent' when their last_heartbeat_at goes stale) never ran. Net effect on production: every external-runtime workspace on SaaS that lost its agent stayed 'online' indefinitely instead of falling back to 'awaiting_agent' (re-registrable). The drift gate (#2388) caught the migration side and #2382 fixed the SQL writes, but this orchestration-side gate slipped through both because there was no SaaS-mode E2E coverage on the heartbeat-loss → awaiting_agent transition. Caught by #2392 (live staging external-runtime regression E2E) failing at step 6 — 180s with no heartbeat, expected status=awaiting_agent, got online. Fix: drop the `if prov != nil` gate. `StartHealthSweep` already handles nil checker correctly (healthsweep.go:50-71): the Docker sweep is gated inside the loop, the remote sweep always runs. Test coverage already exists at TestStartHealthSweep_NilCheckerRunsRemoteSweep. After this lands and tenants redeploy, #2392 step 6 passes and the regression coverage closes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 12:05:40 -07:00
Hongming Wang	c03074424e	Merge pull request #2407 from Molecule-AI/auto/canvas-buildinfo-endpoint feat(canvas): add /api/buildinfo for version parity with tenant	2026-04-30 19:04:22 +00:00
Hongming Wang	fc3b5fd385	feat(canvas): add /api/buildinfo for version-display parity with tenant Workspace-server has GET /buildinfo (PR #2398): `curl https://<slug>.moleculesai.app/buildinfo` returns the live git SHA. Canvas had no parallel: debugging "is this the deployed code?" required reading Vercel's UI or response headers (deployment ID, not git SHA). Add canvas /api/buildinfo returning {git_sha, git_ref, vercel_env} sourced from VERCEL_GIT_COMMIT_SHA / _REF / VERCEL_ENV; Vercel injects these at build time from the deploying commit. Outside Vercel (local `next dev`, harness) all three are unset and the endpoint returns `git_sha: "dev"`, the same sentinel workspace-server uses pre-ldflags-injection. Now both surfaces speak the same vocabulary: `curl https://<slug>.moleculesai.app/buildinfo` and `curl https://canvas.moleculesai.app/api/buildinfo`. 3 tests cover dev-fallback, Vercel-injected SHA pass-through, and JSON content type. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 12:00:35 -07:00
Hongming Wang	6e0a0dba55	Merge pull request #2406 from Molecule-AI/auto/harness-run-all-replays feat(tests): add run-all-replays.sh harness runner	2026-04-30 19:00:07 +00:00
Hongming Wang	0af4012f79	feat(tests): add run-all-replays.sh harness runner Boots the harness, runs every script under replays/, tracks pass/fail, and tears down on exit. Closes the README's TODO for the harness runner that the per-replay-registration comment referenced. Usage: `./run-all-replays.sh` (boot, run, teardown); `KEEP_UP=1 ./run-all-replays.sh` (leave harness running on exit); `REBUILD=1 ./run-all-replays.sh` (rebuild images before booting). Trap-on-EXIT teardown ensures partial-failure runs don't leak Docker resources. Returns non-zero if any replay failed; CI can adopt this as a single command without per-replay registration. Phase 2 picks this up to wire harness-based E2E as a required check. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 11:57:27 -07:00
Hongming Wang	46ca5aa6d3	Merge pull request #2405 from Molecule-AI/auto/wheel-smoke-shared-script refactor(ci): extract wheel smoke into shared script (close PR-time vs publish-time gap)	2026-04-30 18:55:35 +00:00
Hongming Wang	ef206b5be6	refactor(ci): extract wheel smoke into shared script publish-runtime.yml had a broad smoke (AgentCard call-shape, well-known mount alignment, new_text_message) inline as a heredoc. runtime-prbuild-compat.yml had a narrow inline smoke (just `from main import main_sync`). Result: a PR could introduce SDK shape regressions that pass at PR time and only fail at publish time, post-merge. Extract the broad smoke into scripts/wheel_smoke.py and invoke it from both workflows. PR-time gate now matches publish-time gate: same script, same assertions. Eliminates the drift hazard of two heredocs that have to be kept in lockstep manually. Verified locally: * Built wheel from workspace/ source, installed in venv, ran smoke → pass * Simulated AgentCard kwarg-rename regression → smoke catches it as `ValueError: Protocol message AgentCard has no "supported_interfaces" field` (the exact failure mode of #2179 / supported_protocols incident) Path filter for runtime-prbuild-compat extended to include scripts/wheel_smoke.py so smoke-only edits get PR-validated. publish-runtime path filter intentionally NOT extended; smoke-only edits should not auto-trigger a PyPI version bump. Subset of #131 (the broader "invoke main() against stub config" goal remains pending; main() needs a config dir + stub platform server). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 11:52:07 -07:00
Hongming Wang	3689fd9d82	Merge pull request #2403 from Molecule-AI/auto/buildinfo-verify-distinguish-unreachable fix(ci): hard-fail when >50% of fleet unreachable post-redeploy	2026-04-30 18:43:08 +00:00
Hongming Wang	9b909c4459	fix(ci): gate 50%-floor on TOTAL_VERIFIED >= 4 Self-review of #2403 caught a regression: with a 1-tenant fleet (the exact case the original #2402 fix targeted), the new floor would re-introduce the flake. Trace: TOTAL=1, UNREACHABLE=1, $((1/2))=0 if 1 -gt 0 → TRUE → exit 1 The 50%-rule only meaningfully distinguishes "real outage" from "teardown race" when the fleet is large enough that "half down" is statistically meaningful. With 1-3 tenants, canary-verify is the actual gate (it runs against the canary first and aborts the rollout if the canary fails to come up). Gate the floor on TOTAL_VERIFIED >= 4. Truth table: TOTAL UNREACHABLE RESULT 1 1 soft-warn (original e2e flake case) 4 2 soft-warn (exactly half) 4 3 hard-fail (75% — real outage) 10 6 hard-fail (60% — real outage) Mirrored across staging.yml + main.yml. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 11:40:31 -07:00
Hongming Wang	6159429634	Merge pull request #2401 from Molecule-AI/auto/local-production-shape-harness feat(tests): add production-shape local harness (Phase 1)	2026-04-30 18:36:44 +00:00
Hongming Wang	c5aaca2bbe	Merge pull request #2399 from Molecule-AI/auto/peer-discovery-diagnostic-2397 feat(workspace): surface peer-discovery failure reason instead of "may be isolated"	2026-04-30 18:36:42 +00:00
Hongming Wang	ec39fecda2	fix(ci): hard-fail when >50% of fleet unreachable post-redeploy Belt-and-suspenders sanity floor on top of the unreachable-soft-warn introduced earlier in this PR. Addresses the residual gap noted in review: if a new image crashes on startup, every tenant ends up unreachable, and the soft-warn alone would let that ship as a green deploy. Canary-verify catches it on the canary tenant first, but this guard is a fallback for canary-skip dispatches and same-batch races. Threshold is 50% of healthz_ok-snapshotted tenants — comfortably above the typical e2e-* teardown rate (5-10/hour, ~1 ephemeral tenant per batch) but below any plausible real-outage scenario. Mirrored across staging.yml + main.yml for shape parity. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 11:35:56 -07:00
Hongming Wang	046eccbb7c	fix(harness): five-axis self-review fixes before merge Three findings from re-reviewing PR #2401 with fresh eyes: 1. Critical — port binding to 0.0.0.0 compose.yml's cf-proxy bound 8080:8080 (default 0.0.0.0). The harness uses a hardcoded ADMIN_TOKEN so anyone on the local network or VPN could hit /workspaces with admin privileges. Switch to 127.0.0.1:8080 so admin access is loopback-only — safe for E2E and prevents the known-token leak. 2. Required — dead code in cp-stub peersFailureMode + __stub/mode + __stub/peers were declared with atomic.Value setters but no handler ever READ from them. CP doesn't host /registry/peers (the tenant does), so the toggles couldn't drive responses. Removed the dead vars + handlers; kept redeployFleetCalls counter and __stub/state since those have a real consumer in the buildinfo replay. 3. Required — replay's auth-context dependency peer-discovery-404.sh's Python eval ran a2a_client.get_peers_with_diagnostic() against the live tenant. Without a workspace token file, auth_headers() yields empty headers — so the helper might exercise a 401 branch instead of the 404 branch the replay claims to test. Split the assertion into (a) WIRE — direct curl proves the platform returns 404 from /registry/<unregistered>/peers — and (b) PARSE — feed the helper a mocked 404 via httpx patches, no network/auth. Each branch tests exactly what it claims. Also added a graceful skip when the workspace runtime in the current checkout pre-dates #2399 (no get_peers_with_diagnostic yet) — replay falls back to wire-only verification with a clear message instead of an opaque AttributeError. After #2399 lands on staging, both branches will run. cp-stub still builds clean. compose.yml validates. Replay's bash syntax + Python eval both verified locally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 11:32:40 -07:00
Hongming Wang	22314a6c91	Merge pull request #2402 from Molecule-AI/auto/buildinfo-verify-distinguish-unreachable fix(ci): distinguish unreachable from stale in /buildinfo verify step	2026-04-30 18:30:14 +00:00
Hongming Wang	d45241cae7	fix(ci): distinguish unreachable from stale in /buildinfo verify step The /buildinfo verify step (PR #2398) was treating "no /buildinfo response" the same as "tenant returned wrong SHA" — both bumped MISMATCH_COUNT and hard-failed the workflow. First post-merge run on staging caught a real edge case: ephemeral E2E tenants (slug e2e-20260430-...) get torn down by the E2E teardown trap between CP's healthz_ok snapshot and the verify step running, so the verify step would dial into DNS that no longer resolves and hard-fail on a benign condition. The bug class we actually care about is STALE (tenant up + serving old code, the #2395 root). UNREACHABLE post-redeploy is almost always a benign teardown race; real "tenant up but unreachable" is caught by CP's own healthz monitor + the alert pipeline, so double-counting it here was making this workflow flaky on every staging push that overlapped E2E. Wire: - Split MISMATCH_COUNT into STALE_COUNT + UNREACHABLE_COUNT. - STALE → hard-fail the workflow (the bug class we're guarding). - UNREACHABLE → ::warning::, don't fail. Reachable-mismatch still hard-fails. - Job summary surfaces both lists separately so on-call can tell at a glance which class fired. Mirror in redeploy-tenants-on-main.yml for shape parity (prod has fewer ephemeral tenants but identical asymmetry would be a gratuitous fork). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 11:25:46 -07:00
Hongming Wang	f13d2b2b7b	feat(tests): add production-shape local harness (Phase 1) The harness brings up the SaaS tenant topology on localhost using the SAME workspace-server/Dockerfile.tenant image that ships to production. Tests run against http://harness-tenant.localhost:8080 and exercise the same code path a real tenant takes: client → cf-proxy (nginx; CF tunnel + LB header rewrites) → tenant (Dockerfile.tenant — combined platform + canvas) → cp-stub (minimal Go CP stand-in for /cp/* paths) → postgres + redis Why this exists: bugs that survive `go run ./cmd/server` and ship to prod almost always live in env-gated middleware (TenantGuard, /cp/* proxy, canvas proxy), header rewrites, or the strict-auth / live-token mode. The harness activates ALL of them locally so #2395 + #2397-class bugs can be reproduced before deploy. Phase 1 surface: - cp-stub/main.go: minimal CP stand-in. /cp/auth/me, redeploy-fleet, /__stub/{peers,mode,state} for replay scripts. Catch-all returns 501 with a clear message when a new CP route appears. - cf-proxy/nginx.conf: rewrites Host to <slug>.localhost, injects X-Forwarded-*, disables buffering to mirror CF tunnel streaming semantics. - compose.yml: one service per topology layer; tenant builds from the actual production Dockerfile.tenant. - up.sh / down.sh / seed.sh: lifecycle scripts. - replays/peer-discovery-404.sh: reproduces #2397 + asserts the diagnostic helper from PR #2399 surfaces "404" + "registered". - replays/buildinfo-stale-image.sh: reproduces #2395 + asserts /buildinfo wire shape + GIT_SHA injection from PR #2398. - README.md: topology, quickstart, what the harness does NOT cover. Phases 2-3 (separate PRs): - Phase 2: convert tests/e2e/test_api.sh to target the harness URL instead of localhost; make harness-based replays a required CI gate. - Phase 3: config-coherence lint that diffs harness env list against production CP's env list, fails CI on drift. Verification: - cp-stub builds (go build ./...). - cp-stub responds to all stubbed endpoints (smoke-tested locally). - compose.yml passes `docker compose config --quiet`. - All shell scripts pass `bash -n` syntax check. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 11:22:46 -07:00
Hongming Wang	3b34dfefbc	feat(workspace): surface peer-discovery failure reason instead of "may be isolated" Closes #2397. Today, every empty-peer condition (true empty, 401/403, 404, 5xx, network) collapses to a single message: "No peers available (this workspace may be isolated)". The user has no way to tell whether they need to provision more workspaces (true isolation), restart the workspace (auth), re-register (404), page on-call (5xx), or check network (timeout) — five different operator actions, one ambiguous string. Wire: - new helper get_peers_with_diagnostic() in a2a_client.py returns (peers, error_summary). error_summary is None on 200; a short actionable string on every other branch. - get_peers() now shims through it so non-tool callers (system-prompt formatters) keep the bare-list contract. - tool_list_peers() switches to the diagnostic helper and surfaces the actual reason. The "may be isolated" string is removed; true empty now reads "no peers in the platform registry." Tests: - TestGetPeersWithDiagnostic: 200, 200-empty, 401, 403, 404, 5xx, network exception, 200-but-non-list-body, and the bare-list-shim regression guard. - TestToolListPeers: each diagnostic branch surfaces its reason + explicit assertion that "may be isolated" is gone. Coverage 91.53% (floor 86%). 122 a2a tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 11:09:26 -07:00
Hongming Wang	c06e2fec5e	Merge pull request #2396 from Molecule-AI/auto/typed-workspace-status refactor(workspace-status): typed constants + AST-based drift gate	2026-04-30 18:03:30 +00:00
Hongming Wang	dea306d267	Merge pull request #2398 from Molecule-AI/auto/buildinfo-deploy-verification feat(deploy): verify each tenant /buildinfo matches published SHA after redeploy	2026-04-30 18:00:36 +00:00
Hongming Wang	998e13c4bd	feat(deploy): verify each tenant /buildinfo matches published SHA after redeploy Closes the gap that let issue #2395 ship: redeploy-fleet workflows reported ssm_status=Success based on SSM RPC return code alone, while EC2 tenants silently kept serving the previous :latest digest because docker compose up without an explicit pull is a no-op when the local tag already exists. Wire: - new buildinfo package exposes GitSHA, set at link time via -ldflags from the GIT_SHA build-arg (default "dev" so test runs without ldflags fail closed against an unset deploy) - router exposes GET /buildinfo returning {git_sha} — public, no auth, cheap enough to curl from CI for every tenant - both Dockerfiles thread GIT_SHA into the Go build - publish-workspace-server-image.yml passes GIT_SHA=github.sha for both images - redeploy-tenants-on-main.yml + redeploy-tenants-on-staging.yml curl each tenant's /buildinfo after the redeploy SSM RPC and fail the workflow on digest mismatch; staging treats both :latest and :staging-latest as moving tags; verification is skipped only when an operator pinned a specific tag via workflow_dispatch Tests: - TestGitSHA_DefaultDevSentinel pins the dev default - TestBuildInfoEndpoint_ReturnsGitSHA pins the wire shape that the workflow's jq lookup depends on Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 10:55:08 -07:00
Hongming Wang	188db33794	refactor(workspace-status): catch missed literal in workspace_bootstrap.go + add literal-drift gate Two related fixes after self-review of #2396: 1. workspace_bootstrap.go:62 — `SET status = 'failed'` was missed in the initial sweep. Now parameterized as $3 with models.StatusFailed. Test fixed with the additional WithArgs sentinel. 2. Drift gate now scans production .go AST for hard-coded `UPDATE workspaces … SET status = '<literal>'` and fails with file:line. This catches the kind of miss the first commit just fixed — the original migration-vs-codebase axis only verified AllWorkspaceStatuses ⊆ enum, not "no raw literals in writes." Verified the gate fires: dropped a synthetic 'failed' literal into internal/handlers/_drift_sanity.go and confirmed the gate flagged "internal/handlers/_drift_sanity.go:6 → SET status = 'failed'". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 10:51:01 -07:00
Hongming Wang	fdf1b5d76a	refactor(workspace-status): typed constants + AST-based drift gate Eliminate raw 'awaiting_agent'/'hibernating'/'failed'/etc string literals from production status writes. Adds models.WorkspaceStatus typed alias and models.AllWorkspaceStatuses canonical slice; every UPDATE workspaces SET status = ... now passes a parameterized $N typed value rather than a hard-coded SQL literal. Defense-in-depth follow-up to migration 046 (#2388): the Postgres enum type was missing 'awaiting_agent' + 'hibernating' for ~5 days because sqlmock regex matching cannot enforce live enum constraints. The drift gate is now a proper Go AST + SQL parser (no regex), asserting the codebase ⊆ migration enum and every const appears in the canonical slice. With status as a parameterized typed value, future enum mismatches fail at the SQL layer in tests, not silently in prod. Test coverage: full suite passes with -race; drift gate green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 10:41:41 -07:00
Hongming Wang	17a0f49140	test(e2e): read delivery_mode from register response, not GET Step 5b assertion failed against staging: register response: {"delivery_mode":"poll","platform_inbound_secret":"...","status":"registered"} HTTP_CODE=200 ❌ Expected delivery_mode=poll, got — register UPDATE not honoring payload.delivery_mode The register call succeeded (200, status:registered, delivery_mode:poll). The assertion was reading the field from the workspace GET response — but GET /workspaces/:id (workspace.go:587 Get handler) doesn't fetch delivery_mode at all. The SELECT column list on line 597 pre-dates the delivery_mode column from #2339 PR 1, so empty is the only thing GET can return for it. Fix: read delivery_mode from the register response body. That's the canonical source — register is what writes the column, and its handler already echoes the resolved value back. The check is now meaningful ("the handler honored the explicit poll we sent") instead of testing GET's serialization gap. Surfacing delivery_mode in GET is a separate fix; not gating this test on it keeps the test focused on the awaiting_agent transitions it was written for. Filed mentally as a follow-up — registry_test.go already covers the resolveDeliveryMode logic directly, which is what users actually hit through the handler. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 10:35:21 -07:00
Hongming Wang	201f39a6d0	test(e2e): set delivery_mode=poll explicitly to decouple from image drift Second-round failure on the same test (run 25179171433): register response: {"error":"hostname \"example.invalid\" cannot be resolved (DNS error)"} HTTP_CODE=400 Root cause: registry.Register's resolveDeliveryMode was supposed to default runtime=external workspaces to poll mode (PR #2382), in which case validateAgentURL is skipped and example.invalid passes through. But the freshly-provisioned staging tenant for this test was running an older workspace-server image that lacked that branch — the implicit default was still push, validateAgentURL ran, and the DNS lookup 400'd. Same image-drift class as the production bug seen on the hongmingwang tenant 17:30Z (deployed image lagging main HEAD). Fix: send delivery_mode="poll" explicitly. Eliminates the test's dependence on resolveDeliveryMode's default branch being deployed. Step 5b reframed: was "verify external→poll default working", now "verify explicit-poll round-trips". The default-resolution behavior is exercised by handler-level tests in registry_test.go, which run against the SHA being merged (not whatever :latest happens to be on the fleet). That's the right place for it — E2E should test what users see, unit tests should pin what handlers compute. Pulling those apart removes a class of "intermittent on staging, green locally" failures. The deeper bug — fleet redeploy + provision both can serve stale images even when the tag has been republished — gets a separate issue. This commit just unblocks the merge. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 10:27:50 -07:00
Hongming Wang	eacc229e91	test(e2e): fix /registry/register payload — id (not workspace_id) + agent_card The new external-runtime regression test had two payload bugs that made step 5 fail with HTTP 400 on its first run: 1. Field name: sent {"workspace_id":...} but RegisterPayload (workspace-server/internal/models/workspace.go:58) declares `id` with binding:"required" — workspace_id is the heartbeat payload's field, not register's. 2. Missing required field: agent_card has binding:"required" and was absent. ShouldBindJSON 400'd before any handler logic ran, which is why the body said nothing useful. Why this got past local verification: the test was written from memory of the heartbeat shape, never run end-to-end before pushing, and curl with --fail-with-body prints the body to stdout but exit-22's under set -e — the body was suppressed before the log line could fire. Fix: - Send `id` + a minimal valid agent_card ({name, skills:[{id,name}]}) matching the canonical shape from tests/e2e/test_api.sh:96. - Pull the body into REGISTER_BODY shared between steps 5 and 7 so drift between the two register calls is impossible. - Drop --fail-with-body for these two calls and append HTTP_CODE via curl -w so the body is always visible when the call non-200s. The explicit grep for HTTP_CODE=200 + \|\|true on curl preserves the fail-fast contract. - Inline payload contract comment pointing at RegisterPayload so the next person editing this doesn't repeat the heartbeat-confusion mistake. The url=https://example.invalid:443 is fine: runtime=external resolves to poll mode (registry.go:resolveDeliveryMode case 3), and validateAgentURL only fires for push. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 10:15:54 -07:00
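
Illustrative note on commit 9b909c4459 above: the gated 50%-floor it describes can be sketched in a few lines. This is a minimal sketch, not the workflow's actual code (the commit implies shell arithmetic inside staging.yml and main.yml). The function name `unreachable_floor_trips` and the `min_fleet` parameter are hypothetical; the integer-half comparison and the `TOTAL_VERIFIED >= 4` gate are taken from the commit's trace and truth table.

```python
def unreachable_floor_trips(total_verified: int, unreachable: int,
                            min_fleet: int = 4) -> bool:
    """Return True when the run should hard-fail, False when it only soft-warns.

    Hypothetical helper; mirrors the rule described in commit 9b909c4459.
    """
    # Small fleets skip the floor entirely: per the commit, canary-verify
    # is the real gate when there are only 1-3 tenants.
    if total_verified < min_fleet:
        return False
    # Integer half with a strict greater-than, matching the shell trace
    # `$((TOTAL / 2))` followed by `-gt`.
    return unreachable > total_verified // 2


# The four rows of the commit's truth table.
for total, down in [(1, 1), (4, 2), (4, 3), (10, 6)]:
    verdict = "hard-fail" if unreachable_floor_trips(total, down) else "soft-warn"
    print(total, down, verdict)
```

Without the `min_fleet` gate, the 1-tenant row would evaluate `1 > 0` and hard-fail, which is the regression the commit message traces.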

3563 Commits