peer-discovery-404 imports workspace/a2a_client.py which depends on
httpx; the runner's stock Python doesn't have it, so the replay's
PARSE assertion (b) fails with ModuleNotFoundError on every run. The
WIRE assertion (a) — pure curl — passes, so the failure was masking
just enough to make the replay LOOK partially-broken when the tenant
side is fine.
Adding tests/harness/requirements.txt with only httpx instead of
sourcing workspace/requirements.txt: that file pulls a2a-sdk,
langchain-core, opentelemetry, sqlalchemy, temporalio, etc. — ~30s
of install for one replay's PARSE step. The harness's deps surface
should grow when a new replay introduces a new import, not by
default.
Workflow gains one step (`pip install -r tests/harness/requirements.txt`)
between the /etc/hosts setup and run-all-replays. No other changes.
Replaces the hardcoded base64 sentinel (630dd0da) with a per-run
generation in up.sh, exported into compose's interpolation environment.
Why:
- Hardcoding a 32-byte base64 string in the repo, even one labelled
"test-only", sets a bad muscle-memory pattern. The next agent or
contributor copies the shape into another harness — or worse, into a
staging .env — and the test-only sentinel turns into something
someone treats as a real key.
- Secret scanners flag key-shaped values regardless of the surrounding
comment claiming intent. Avoiding the literal entirely sidesteps the
false-positive.
- A fresh key per harness lifetime more closely mimics prod's
per-tenant isolation, exercising the same code paths without any
pretense of stable encrypted-data fixtures (which the harness wipes
on every ./down.sh anyway).
Implementation:
- up.sh: `openssl rand -base64 32` if SECRETS_ENCRYPTION_KEY isn't
already set in the caller's env. Honoring a pre-set value lets a
debug session pin a key for reproducibility (e.g. when investigating
encrypted-row corruption).
- compose.yml: `${SECRETS_ENCRYPTION_KEY:?…}` makes a misuse loud —
running `docker compose up` directly bypassing up.sh fails fast with
a clear error pointing at the right entry point, rather than a 100s
unhealthy-tenant timeout.
Both paths verified via `docker compose config`:
- with key exported: value interpolates cleanly
- without it: "required variable SECRETS_ENCRYPTION_KEY is missing a
value: must be set — run via tests/harness/up.sh, which generates
one per run"
Found via the first run of the harness-replays-required-check workflow
(#2410): the tenant container failed its healthcheck after 100s with
"refusing to boot without encryption in production". This is the
deferred CRITICAL flagged on PR #2401 — `crypto.InitStrict()` requires
SECRETS_ENCRYPTION_KEY when MOLECULE_ENV=production, and the harness
sets prod-mode but never seeded a key.
Fix: add a clearly-test 32-byte base64 value (encoding the literal
string "harness-test-only-not-for-prod!!") inline. Keeping
MOLECULE_ENV=production preserves the harness's value as a production-
shape replay surface — it now exercises the full encryption boot path
including the strict check, rather than skirting it via dev-mode.
Why inline rather than .env:
- The harness compose file is meant to be self-contained and
reproducible from a clean clone. An external .env would split the
config across two files for one synthetic value.
- The value is intentionally a sentinel; there's no operator decision
here to gate behind a per-deployment file.
After this lands the harness boots clean and `run-all-replays.sh` can
exercise the buildinfo + peer-discovery replays as designed. The
required-check workflow itself (#2410) needs no change.
First run on PR #2410 failed with 'container harness-tenant-1 is unhealthy'
but the dump-compose-logs step printed empty tenant logs because
run-all-replays.sh's trap-on-EXIT had already torn down the harness.
Setting KEEP_UP=1 leaves containers in place; the always-run Force
teardown step at the end owns cleanup explicitly. Now we'll actually
see why the tenant didn't become healthy.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Iterates a list of tenant slugs (default canary set on production,
operator-supplied on staging), curls each tenant's /buildinfo plus
canvas's /api/buildinfo, compares to origin/main's HEAD SHA, prints a
table with one of {current, stale, unreachable} per surface. Returns
non-zero if any surface is stale, so it can be wired into a periodic
alert later.
Why this exists: every "is the fix live?" question used to be
answered with a one-off curl + git rev-parse + manual diff. This
script does that uniformly across every public surface (workspace
tenants + canvas) and is parseable. The redeploy verifier (#2398)
covers the deploy moment; this covers any-time-after.
Reads EXPECTED_SHA from `gh api repos/Molecule-AI/molecule-core/
commits/main` so it always reflects the actual upstream tip, not
local working-copy state. Falls back to local origin/main with a
WARN if `gh` isn't logged in — debugging is still useful even if
the comparison may lag.
Depends on:
- #2409 (TenantGuard /buildinfo allowlist) — without it every
tenant looks "unreachable" because the route 404s before the
handler. Already merged on staging; will hit production after
the next staging→main fast-forward + redeploy.
- #2407 (canvas /api/buildinfo) — already on main + Vercel.
Usage:
./scripts/ops/check-prod-versions.sh # production canary set
TENANT_SLUGS="a b c" ./scripts/ops/check-prod-versions.sh # custom set
ENV=staging TENANT_SLUGS="..." ./scripts/ops/check-prod-versions.sh
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the gap between "the harness exists" and "the harness blocks bugs."
Phase 2 of the harness roadmap (per tests/harness/README.md): make
harness-based E2E a required CI check on every PR touching the tenant
binary or the harness itself.
Trigger: push + pull_request to staging+main, paths-filtered to
workspace-server/**, canvas/**, tests/harness/**, and this workflow.
merge_group support included so this becomes branch-protectable.
Single-job-with-conditional-steps pattern (matches e2e-api.yml). One
check run regardless of paths-filter outcome; satisfies branch
protection cleanly per the PR #2264 SKIPPED-in-set finding.
Why this exists: 2026-04-30 we shipped a TenantGuard allowlist gap
(/buildinfo added to router.go in #2398, never added to the allowlist)
that the existing buildinfo-stale-image.sh replay would have caught.
The harness was wired correctly; nobody ran it. Replays as a discipline
beat replays as a memory item.
The CI pipeline:
detect-changes (paths filter)
└ harness-replays (always)
├ no-op pass when paths-filter says no relevant change
└ otherwise: checkout + sibling plugin checkout +
/etc/hosts entry + run-all-replays.sh +
compose-logs-on-failure + force-teardown
Compose logs from tenant/cp-stub/cf-proxy/postgres are dumped on
failure so a CI red is debuggable without re-reproducing locally.
The trap in run-all-replays.sh handles teardown; the always-run
down.sh step is a belt-and-suspenders against trap-bypass kills.
Follow-ups (not in this PR):
- Add this check to staging branch protection once it's been green
for a few PRs (the new-workflow-instability hedge that other gates
followed).
- Eventually wire the buildx GHA cache to speed up tenant image
builds — currently every PR rebuilds the full Dockerfile.tenant
(Go + Next.js + template clones) from scratch. Acceptable for now;
optimize when the timeout-minutes:30 ceiling becomes painful.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The /buildinfo route added in #2398 to verify each tenant runs the
published SHA was 404'd by TenantGuard on every production tenant —
the allowlist had /health, /metrics, /registry/register,
/registry/heartbeat, but not /buildinfo. The redeploy workflows
curl /buildinfo from a CI runner with no X-Molecule-Org-Id header,
TenantGuard 404'd them, gin's NoRoute proxied to canvas, canvas
returned its HTML 404 page, jq read empty git_sha, and the verifier
silently soft-warned every tenant as "unreachable" — which the
workflow doesn't fail on.
Confirmed externally:
curl https://hongmingwang.moleculesai.app/buildinfo
→ HTTP 404 + Content-Type: text/html (Next.js "404: This page
could not be found.") even though /health on the same host
returns {"status":"ok"} from gin.
The buildinfo package's own doc already declares /buildinfo public
by design ("Public is intentional: it's a build identifier, not
operational state. The same string is already published as
org.opencontainers.image.revision on the container image, so no new
info is exposed.") — the allowlist just missed it.
Pin the alignment in tenant_guard_test.go:
TestTenantGuard_AllowlistBypassesCheck now asserts /buildinfo
returns 200 without an org header alongside /health and /metrics,
so a future allowlist edit can't silently regress the verifier
again.
Closes the silent-success failure mode: stale tenants will now
show up as STALE (hard-fail) rather than UNREACHABLE (soft-warn).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-fix, cmd/server/main.go gated the entire health-sweep goroutine on
`prov != nil`. On SaaS tenants (`MOLECULE_ORG_ID` set) the local Docker
provisioner is never initialized — only `cpProv`. So the goroutine
never started, and `sweepStaleRemoteWorkspaces` (which transitions
runtime='external' workspaces from 'online' to 'awaiting_agent' when
their last_heartbeat_at goes stale) never ran.
Net effect on production: every external-runtime workspace on SaaS
that lost its agent stayed 'online' indefinitely instead of falling
back to 'awaiting_agent' (re-registrable). The drift gate (#2388)
caught the migration side and #2382 fixed the SQL writes, but this
orchestration-side gate slipped through both because there was no
SaaS-mode E2E coverage on the heartbeat-loss → awaiting_agent
transition.
Caught by #2392 (live staging external-runtime regression E2E)
failing at step 6 — 180s with no heartbeat, expected
status=awaiting_agent, got online.
Fix: drop the `if prov != nil` gate. `StartHealthSweep` already
handles nil checker correctly (healthsweep.go:50-71): the Docker
sweep is gated inside the loop, the remote sweep always runs. Test
coverage already exists at TestStartHealthSweep_NilCheckerRunsRemoteSweep.
After this lands and tenants redeploy, #2392 step 6 passes and the
regression coverage closes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Workspace-server has GET /buildinfo (PR #2398) — `curl https://<slug>.
moleculesai.app/buildinfo` returns the live git SHA. Canvas had no
parallel: debugging "is this the deployed code?" required reading
Vercel's UI or response headers (deployment ID, not git SHA).
Add canvas /api/buildinfo returning {git_sha, git_ref, vercel_env}
sourced from VERCEL_GIT_COMMIT_SHA / _REF / VERCEL_ENV — Vercel injects
these at build time from the deploying commit. Outside Vercel (local
`next dev`, harness) all three are unset and the endpoint returns
`git_sha: "dev"`, the same sentinel workspace-server uses pre-ldflags-
injection.
Now both surfaces speak the same vocabulary:
curl https://<slug>.moleculesai.app/buildinfo
curl https://canvas.moleculesai.app/api/buildinfo
3 tests cover dev-fallback, Vercel-injected SHA pass-through, and JSON
content type.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Boots the harness, runs every script under replays/, tracks pass/fail,
and tears down on exit. Closes the README's TODO for the harness runner
that the per-replay-registration comment referenced.
Usage:
./run-all-replays.sh # boot, run, teardown
KEEP_UP=1 ./run-all-replays.sh # leave harness running on exit
REBUILD=1 ./run-all-replays.sh # rebuild images before booting
Trap-on-EXIT teardown ensures partial-failure runs don't leak Docker
resources. Returns non-zero if any replay failed; CI can adopt this as
a single command without per-replay registration. Phase 2 picks this up
to wire harness-based E2E as a required check.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
publish-runtime.yml had a broad smoke (AgentCard call-shape, well-known
mount alignment, new_text_message) inline as a heredoc. runtime-prbuild-
compat.yml had a narrow inline smoke (just `from main import main_sync`).
Result: a PR could introduce SDK shape regressions that pass at PR time
and only fail at publish time, post-merge.
Extract the broad smoke into scripts/wheel_smoke.py and invoke it from
both workflows. PR-time gate now matches publish-time gate — same script,
same assertions. Eliminates the drift hazard of two heredocs that have
to be kept in lockstep manually.
Verified locally:
* Built wheel from workspace/ source, installed in venv, ran smoke → pass
* Simulated AgentCard kwarg-rename regression → smoke catches it as
`ValueError: Protocol message AgentCard has no "supported_interfaces"
field` (the exact failure mode of #2179 / supported_protocols incident)
Path filter for runtime-prbuild-compat extended to include
scripts/wheel_smoke.py so smoke-only edits get PR-validated. publish-
runtime path filter intentionally NOT extended — smoke-only edits should
not auto-trigger a PyPI version bump.
Subset of #131 (the broader "invoke main() against stub config" goal
remains pending — main() needs a config dir + stub platform server).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-review of #2403 caught a regression: with a 1-tenant fleet (the
exact case the original #2402 fix targeted), the new floor would
re-introduce the flake. Trace:
TOTAL=1, UNREACHABLE=1, $((1/2))=0
if 1 -gt 0 → TRUE → exit 1
The 50%-rule only meaningfully distinguishes "real outage" from
"teardown race" when the fleet is large enough that "half down" is
statistically meaningful. With 1-3 tenants, canary-verify is the
actual gate (it runs against the canary first and aborts the rollout
if the canary fails to come up).
Gate the floor on TOTAL_VERIFIED >= 4. Truth table:
TOTAL UNREACHABLE RESULT
1 1 soft-warn (original e2e flake case)
4 2 soft-warn (exactly half)
4 3 hard-fail (75% — real outage)
10 6 hard-fail (60% — real outage)
Mirrored across staging.yml + main.yml.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Belt-and-suspenders sanity floor on top of the unreachable-soft-warn
introduced earlier in this PR. Addresses the residual gap noted in
review: if a new image crashes on startup, every tenant ends up
unreachable, and the soft-warn alone would let that ship as a green
deploy. Canary-verify catches it on the canary tenant first, but this
guard is a fallback for canary-skip dispatches and same-batch races.
Threshold is 50% of healthz_ok-snapshotted tenants — comfortably above
the typical e2e-* teardown rate (5-10/hour, ~1 ephemeral tenant per
batch) but below any plausible real-outage scenario.
Mirrored across staging.yml + main.yml for shape parity.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three findings from re-reviewing PR #2401 with fresh eyes:
1. Critical — port binding to 0.0.0.0
compose.yml's cf-proxy bound 8080:8080 (default 0.0.0.0). The harness
uses a hardcoded ADMIN_TOKEN so anyone on the local network or VPN
could hit /workspaces with admin privileges. Switch to 127.0.0.1:8080
so admin access is loopback-only — safe for E2E and prevents the
known-token leak.
2. Required — dead code in cp-stub
peersFailureMode + __stub/mode + __stub/peers were declared with
atomic.Value setters but no handler ever READ from them. CP doesn't
host /registry/peers (the tenant does), so the toggles couldn't
drive responses. Removed the dead vars + handlers; kept
redeployFleetCalls counter and __stub/state since those have a real
consumer in the buildinfo replay.
3. Required — replay's auth-context dependency
peer-discovery-404.sh's Python eval ran a2a_client.get_peers_with_
diagnostic() against the live tenant. Without a workspace token
file, auth_headers() yields empty headers — so the helper might
exercise a 401 branch instead of the 404 branch the replay claims
to test.
Split the assertion into (a) WIRE — direct curl proves the platform
returns 404 from /registry/<unregistered>/peers — and (b) PARSE —
feed the helper a mocked 404 via httpx patches, no network/auth.
Each branch tests exactly what it claims.
Also added a graceful skip when the workspace runtime in the
current checkout pre-dates #2399 (no get_peers_with_diagnostic
yet) — replay falls back to wire-only verification with a clear
message instead of an opaque AttributeError. After #2399 lands on
staging, both branches will run.
cp-stub still builds clean. compose.yml validates. Replay's bash
syntax + Python eval both verified locally.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The /buildinfo verify step (PR #2398) was treating "no /buildinfo response"
the same as "tenant returned wrong SHA" — both bumped MISMATCH_COUNT and
hard-failed the workflow. First post-merge run on staging caught a real
edge case: ephemeral E2E tenants (slug e2e-20260430-...) get torn down by
the E2E teardown trap between CP's healthz_ok snapshot and the verify step
running, so the verify step would dial into DNS that no longer resolves
and hard-fail on a benign condition.
The bug class we actually care about is STALE (tenant up + serving old
code, the #2395 root). UNREACHABLE post-redeploy is almost always a benign
teardown race; real "tenant up but unreachable" is caught by CP's own
healthz monitor + the alert pipeline, so double-counting it here was
making this workflow flaky on every staging push that overlapped E2E.
Wire:
- Split MISMATCH_COUNT into STALE_COUNT + UNREACHABLE_COUNT.
- STALE → hard-fail the workflow (the bug class we're guarding).
- UNREACHABLE → :⚠️:, don't fail. Reachable-mismatch still hard-fails.
- Job summary surfaces both lists separately so on-call can tell at a
glance which class fired.
Mirror in redeploy-tenants-on-main.yml for shape parity (prod has fewer
ephemeral tenants but identical asymmetry would be a gratuitous fork).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The harness brings up the SaaS tenant topology on localhost using the
SAME workspace-server/Dockerfile.tenant image that ships to production.
Tests run against http://harness-tenant.localhost:8080 and exercise the
same code path a real tenant takes:
client
→ cf-proxy (nginx; CF tunnel + LB header rewrites)
→ tenant (Dockerfile.tenant — combined platform + canvas)
→ cp-stub (minimal Go CP stand-in for /cp/* paths)
→ postgres + redis
Why this exists: bugs that survive `go run ./cmd/server` and ship to
prod almost always live in env-gated middleware (TenantGuard, /cp/*
proxy, canvas proxy), header rewrites, or the strict-auth / live-token
mode. The harness activates ALL of them locally so #2395 + #2397-class
bugs can be reproduced before deploy.
Phase 1 surface:
- cp-stub/main.go: minimal CP stand-in. /cp/auth/me, redeploy-fleet,
/__stub/{peers,mode,state} for replay scripts. Catch-all returns
501 with a clear message when a new CP route appears.
- cf-proxy/nginx.conf: rewrites Host to <slug>.localhost, injects
X-Forwarded-*, disables buffering to mirror CF tunnel streaming
semantics.
- compose.yml: one service per topology layer; tenant builds from
the actual production Dockerfile.tenant.
- up.sh / down.sh / seed.sh: lifecycle scripts.
- replays/peer-discovery-404.sh: reproduces #2397 + asserts the
diagnostic helper from PR #2399 surfaces "404" + "registered".
- replays/buildinfo-stale-image.sh: reproduces #2395 + asserts
/buildinfo wire shape + GIT_SHA injection from PR #2398.
- README.md: topology, quickstart, what the harness does NOT cover.
Phases 2-3 (separate PRs):
- Phase 2: convert tests/e2e/test_api.sh to target the harness URL
instead of localhost; make harness-based replays a required CI gate.
- Phase 3: config-coherence lint that diffs harness env list against
production CP's env list, fails CI on drift.
Verification:
- cp-stub builds (go build ./...).
- cp-stub responds to all stubbed endpoints (smoke-tested locally).
- compose.yml passes `docker compose config --quiet`.
- All shell scripts pass `bash -n` syntax check.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes#2397. Today, every empty-peer condition (true empty, 401/403, 404,
5xx, network) collapses to a single message: "No peers available (this
workspace may be isolated)". The user has no way to tell whether they need
to provision more workspaces (true isolation), restart the workspace
(auth), re-register (404), page on-call (5xx), or check network (timeout) —
five different operator actions, one ambiguous string.
Wire:
- new helper get_peers_with_diagnostic() in a2a_client.py returns
(peers, error_summary). error_summary is None on 200; a short
actionable string on every other branch.
- get_peers() now shims through it so non-tool callers (system-prompt
formatters) keep the bare-list contract.
- tool_list_peers() switches to the diagnostic helper and surfaces the
actual reason. The "may be isolated" string is removed; true empty
now reads "no peers in the platform registry."
Tests:
- TestGetPeersWithDiagnostic: 200, 200-empty, 401, 403, 404, 5xx,
network exception, 200-but-non-list-body, and the bare-list-shim
regression guard.
- TestToolListPeers: each diagnostic branch surfaces its reason +
explicit assertion that "may be isolated" is gone.
Coverage 91.53% (floor 86%). 122 a2a tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the gap that let issue #2395 ship: redeploy-fleet workflows reported
ssm_status=Success based on SSM RPC return code alone, while EC2 tenants
silently kept serving the previous :latest digest because docker compose up
without an explicit pull is a no-op when the local tag already exists.
Wire:
- new buildinfo package exposes GitSHA, set at link time via -ldflags from
the GIT_SHA build-arg (default "dev" so test runs without ldflags fail
closed against an unset deploy)
- router exposes GET /buildinfo returning {git_sha} — public, no auth,
cheap enough to curl from CI for every tenant
- both Dockerfiles thread GIT_SHA into the Go build
- publish-workspace-server-image.yml passes GIT_SHA=github.sha for both
images
- redeploy-tenants-on-main.yml + redeploy-tenants-on-staging.yml curl each
tenant's /buildinfo after the redeploy SSM RPC and fail the workflow on
digest mismatch; staging treats both :latest and :staging-latest as
moving tags; verification is skipped only when an operator pinned a
specific tag via workflow_dispatch
Tests:
- TestGitSHA_DefaultDevSentinel pins the dev default
- TestBuildInfoEndpoint_ReturnsGitSHA pins the wire shape that the
workflow's jq lookup depends on
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two related fixes after self-review of #2396:
1. workspace_bootstrap.go:62 — `SET status = 'failed'` was missed in the
initial sweep. Now parameterized as $3 with models.StatusFailed.
Test fixed with the additional WithArgs sentinel.
2. Drift gate now scans production .go AST for hard-coded
`UPDATE workspaces … SET status = '<literal>'` and fails with
file:line. This catches the kind of miss the first commit just
fixed — the original migration-vs-codebase axis only verified
AllWorkspaceStatuses ⊆ enum, not "no raw literals in writes."
Verified the gate fires: dropped a synthetic 'failed' literal into
internal/handlers/_drift_sanity.go and confirmed the gate flagged
"internal/handlers/_drift_sanity.go:6 → SET status = 'failed'".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Eliminate raw 'awaiting_agent'/'hibernating'/'failed'/etc string literals
from production status writes. Adds models.WorkspaceStatus typed alias and
models.AllWorkspaceStatuses canonical slice; every UPDATE workspaces SET
status = ... now passes a parameterized $N typed value rather than a
hard-coded SQL literal.
Defense-in-depth follow-up to migration 046 (#2388): the Postgres enum
type was missing 'awaiting_agent' + 'hibernating' for ~5 days because
sqlmock regex matching cannot enforce live enum constraints. The drift
gate is now a proper Go AST + SQL parser (no regex), asserting the
codebase ⊆ migration enum and every const appears in the canonical
slice. With status as a parameterized typed value, future enum mismatches
fail at the SQL layer in tests, not silently in prod.
Test coverage: full suite passes with -race; drift gate green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Step 5b assertion failed against staging:
register response: {"delivery_mode":"poll","platform_inbound_secret":"...","status":"registered"}
HTTP_CODE=200
❌ Expected delivery_mode=poll, got — register UPDATE not honoring payload.delivery_mode
The register call succeeded (200, status:registered, delivery_mode:poll).
The assertion was reading the field from the workspace GET response — but
GET /workspaces/:id (workspace.go:587 Get handler) doesn't fetch
delivery_mode at all. The SELECT column list on line 597 pre-dates the
delivery_mode column from #2339 PR 1, so empty is the only thing GET can
return for it.
Fix: read delivery_mode from the register response body. That's the
canonical source — register is what writes the column, and its handler
already echoes the resolved value back. The check is now meaningful
("the handler honored the explicit poll we sent") instead of testing
GET's serialization gap.
Surfacing delivery_mode in GET is a separate fix; not gating this test
on it keeps the test focused on the awaiting_agent transitions it was
written for. Filed mentally as a follow-up — registry_test.go already
covers the resolveDeliveryMode logic directly, which is what users
actually hit through the handler.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Second-round failure on the same test (run 25179171433):
register response: {"error":"hostname \"example.invalid\" cannot be resolved (DNS error)"}
HTTP_CODE=400
Root cause: registry.Register's resolveDeliveryMode was supposed to
default runtime=external workspaces to poll mode (PR #2382), in which
case validateAgentURL is skipped and example.invalid passes through.
But the freshly-provisioned staging tenant for this test was running
an older workspace-server image that lacked that branch — the implicit
default was still push, validateAgentURL ran, and the DNS lookup
400'd. Same image-drift class as the production bug seen on the
hongmingwang tenant 17:30Z (deployed image lagging main HEAD).
Fix: send delivery_mode="poll" explicitly. Eliminates the test's
dependence on resolveDeliveryMode's default branch being deployed.
Step 5b reframed: was "verify external→poll default working", now
"verify explicit-poll round-trips". The default-resolution behavior
is exercised by handler-level tests in registry_test.go, which run
against the SHA being merged (not whatever :latest happens to be on
the fleet). That's the right place for it — E2E should test what
users see, unit tests should pin what handlers compute. Pulling those
apart removes a class of "intermittent on staging, green locally"
failures.
The deeper bug — fleet redeploy + provision both can serve stale
images even when the tag has been republished — gets a separate
issue. This commit just unblocks the merge.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The new external-runtime regression test had two payload bugs that made
step 5 fail with HTTP 400 on its first run:
1. Field name: sent {"workspace_id":...} but RegisterPayload (workspace-
server/internal/models/workspace.go:58) declares `id` with
binding:"required" — workspace_id is the heartbeat payload's field,
not register's.
2. Missing required field: agent_card has binding:"required" and was
absent. ShouldBindJSON 400'd before any handler logic ran, which is
why the body said nothing useful.
Why this got past local verification: the test was written from memory
of the heartbeat shape, never run end-to-end before pushing, and curl
with --fail-with-body prints the body to stdout but exit-22's under
set -e — the body was suppressed before the log line could fire.
Fix:
- Send `id` + a minimal valid agent_card ({name, skills:[{id,name}]})
matching the canonical shape from tests/e2e/test_api.sh:96.
- Pull the body into REGISTER_BODY shared between steps 5 and 7 so
drift between the two register calls is impossible.
- Drop --fail-with-body for these two calls and append HTTP_CODE via
curl -w so the body is always visible when the call non-200s. The
explicit grep for HTTP_CODE=200 + ||true on curl preserves the
fail-fast contract.
- Inline payload contract comment pointing at RegisterPayload so the
next person editing this doesn't repeat the heartbeat-confusion
mistake.
The url=https://example.invalid:443 is fine: runtime=external resolves
to poll mode (registry.go:resolveDeliveryMode case 3), and validateAgentURL
only fires for push.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The harness had `STATUS == "ready"` as the terminal condition, but
/cp/admin/orgs returns `instance_status='running'` for the live tenant.
Test ran for 14 minutes seeing instance_status=running and timing out
because nothing matched 'ready'.
Mirrors test_staging_full_saas.sh:210-211 — the case "$STATUS" in
running) break path is the source of truth. Also adds the same
diagnostic burst on 'failed' so the next run surfaces last_error
instead of just "timed out."
Caught on the first dispatch run (id=25177415268) of this harness.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When gh run list returns [] (no E2E run on the main SHA — the common
case for canvas-only / cmd-only / sweep-only changes whose paths
don't trigger E2E), jq's `.[0]` is null and the interpolation
`"\(null)/\(null // "none")"` produces "null/none". The case
statement has no `null/none)` branch, so it falls into `*)` →
exit 1 → auto-promote-on-e2e fails → `:latest` doesn't get retagged
to the new SHA → tenants on `redeploy-tenants-on-main` end up
pulling the OLD `:latest` digest.
Surfaced 2026-04-30 17:00Z as the first observable consequence of
PR #2389 (App-token dispatch fix). Every prior auto-promote-on-e2e
run was triggered by E2E completion (the "Upstream is E2E itself"
short-circuit at line 151 fired before reaching the gate). #2389
made publish-image's completion event correctly fire workflow_run
listeners — auto-promote-on-e2e is one of those listeners — and
hit the latent jq bug on the first publish-upstream run.
Fix: change `.[0]` to `(.[0] // {})` in the jq filter so the empty-
array case becomes `none/none` (the documented "E2E paths-filtered
out for this SHA — proceed" branch) instead of the unhandled
`null/none`. Also default `.status` for the same defensive reason.
Verified the three input shapes locally:
[] → "none/none" ✓
[{status:completed,conclusion:success}] → "completed/success" ✓
[{status:in_progress,conclusion:null}] → "in_progress/none" ✓
Outer `|| echo "none/none"` fallback retained as defense-in-depth
for non-zero gh exits (network / auth failures).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pins the four workspaces.status=awaiting_agent transitions on a real
staging tenant, end-to-end. Catches the class of silent enum failures
that migration 046 fix-forwarded — specifically:
1. workspace.go:333 — POST /workspaces with runtime=external + no URL
parks the row in 'awaiting_agent'. Pre-046 the UPDATE silently
failed and the row stuck on 'provisioning'.
2. registry.go:resolveDeliveryMode — registering an external workspace
defaults delivery_mode='poll' (PR #2382). The harness asserts the
poll default after register.
3. registry/healthsweep.go:sweepStaleRemoteWorkspaces — after
REMOTE_LIVENESS_STALE_AFTER (90s default) with no heartbeat, the
workspace transitions back to 'awaiting_agent'. Pre-046 the sweep
UPDATE silently failed and the workspace stuck on 'online' forever.
4. Re-register from awaiting_agent → 'online' confirms the state is
operator-recoverable, which is the whole reason for using
awaiting_agent (vs. 'offline') as the external-runtime stale state.
The harness mirrors test_staging_full_saas.sh: tenant create →
DNS/TLS wait → tenant token retrieve → exercise → idempotent teardown
via EXIT/INT/TERM trap. Exit codes match the documented contract
{0,1,2,3,4}; raw bash exit codes are normalized so the safety-net
sweeper doesn't open false-positive incident issues.
The companion workflow gates on the source files that touch this
lifecycle: workspace.go, registry.go, workspace_restart.go,
healthsweep.go, liveness.go, every migration, the static drift gate,
and the script + workflow themselves. Daily 07:30 UTC cron catches
infra drift on quiet days. cancel-in-progress=false because aborting
a half-rolled tenant leaves orphan resources for the safety-net to
clean.
Verification:
- bash -n: ok
- shellcheck: only the documented A && B || C pattern, identical to
test_staging_full_saas.sh.
- YAML parser: ok.
- Workflow path filter matches every site that writes to the
workspace_status enum (cross-checked against the drift gate's
UPDATE workspaces / INSERT INTO workspaces enumeration).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Symmetric with the existing CPProvisionerAPI interface. Closes the
asymmetry where the SaaS provisioner field was an interface (mockable
in tests) but the Docker provisioner field was a concrete pointer
(not).
## Changes
- New ``provisioner.LocalProvisionerAPI`` interface — the 7 methods
WorkspaceHandler / TeamHandler call on h.provisioner today: Start,
Stop, IsRunning, ExecRead, RemoveVolume, VolumeHasFile,
WriteAuthTokenToVolume. Compile-time assertion confirms *Provisioner
satisfies it. Mirror of cp_provisioner.go's CPProvisionerAPI block.
- ``WorkspaceHandler.provisioner`` and ``TeamHandler.provisioner``
re-typed from ``*provisioner.Provisioner`` to
``provisioner.LocalProvisionerAPI``. Constructor parameter type is
unchanged — the assignment widens to the interface, so the 200+
callers of ``NewWorkspaceHandler`` / ``NewTeamHandler`` are
unaffected.
- Constructors gain a ``if p != nil`` guard before assigning to the
interface field. Without this, ``NewWorkspaceHandler(..., nil, ...)``
(the test fixture pattern across 200+ tests) yields a typed-nil
interface value where ``h.provisioner != nil`` evaluates *true*,
and the SaaS-vs-Docker fork incorrectly routes nil-fixture tests
into the Docker code path. Documented inline with reference to
the Go FAQ.
- Hardened the 5 Provisioner methods that lacked nil-receiver guards
(Start, ExecRead, WriteAuthTokenToVolume, RemoveVolume,
VolumeHasFile) — return ErrNoBackend on nil receiver instead of
panicking on p.cli dereference. Symmetric with Stop/IsRunning
(already hardened in #1813). Defensive cleanup so a future caller
that bypasses the constructor's nil-elision still degrades
cleanly.
- Extended TestZeroValuedBackends_NoPanic with 5 new sub-tests
covering the newly-hardened nil-receiver paths. Defense-in-depth:
a future refactor that drops one of the nil-checks fails red here
before reaching production.
## Why now
- Provisioner orchestration has been touched in #2366 / #2368 — the
interface symmetry is the natural follow-up captured in #2369.
- Future work (CP fleet redeploy endpoint, multi-backend
provisioners) wants this in place. Memory note
``project_provisioner_abstraction.md`` calls out pluggable
backends as a north-star.
- Memory note ``feedback_long_term_robust_automated.md`` —
compile-time gates + ErrNoBackend symmetry > runtime panics.
## Verification
- ``go build ./...`` clean.
- ``go test ./...`` clean — 1300+ tests pass, including the
previously-flaky Create-with-nil-provisioner paths that now
exercise the constructor's nil-elision correctly.
- ``go test ./internal/provisioner/ -run TestZeroValuedBackends_NoPanic
-v`` — all 11 nil-receiver subtests green (was 6, +5 for the
newly-hardened methods).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause (verified 2026-04-30): GITHUB_TOKEN-initiated
workflow_dispatch creates the dispatched run, but the resulting run's
completion event does NOT fire downstream `workflow_run` triggers.
This is the documented "no recursion" rule:
https://docs.github.com/en/actions/using-workflows/triggering-a-workflow#triggering-a-workflow-from-a-workflow
Evidence (publish-workspace-server-image runs on main):
run_id | head_sha | triggering_actor | canary | redeploy
------------+-----------+-----------------------+--------+----------
25151545007 | 6ef562ee | HongmingWang-Rabbit | YES | YES
25171773918 | 21313dc | github-actions[bot] | NO | NO
25173801008 | 59dec57 | github-actions[bot] | NO | NO
The 06:52Z run that "worked" was an operator-fired dispatch from the
terminal — actor was the operator's PAT. The two runs that "dropped"
were dispatched by auto-promote-staging.yml's `gh workflow run` step
authenticated via `secrets.GITHUB_TOKEN`, so the actor became
`github-actions[bot]` and the workflow_run cascade was suppressed.
Same workflow file, same dispatch call, same successful publish run
— only the auth token differed.
Fix: mint a molecule-ai GitHub App installation token before the
dispatch step and use it as `GH_TOKEN`. App-initiated dispatches
DO propagate the workflow_run cascade (the App user is a real
identity, not the GITHUB_TOKEN bot pseudonym).
The molecule-ai App (app_id=3398844, installation 124443072) is
already installed on the org with `actions:write` — no new App
needed. Only secrets are missing.
## Required setup before merge
The following repo secrets must be added at
https://github.com/Molecule-AI/molecule-core/settings/secrets/actions
or auto-promote will hard-fail at the new "Mint App token" step:
- `MOLECULE_AI_APP_ID` = `3398844`
- `MOLECULE_AI_APP_PRIVATE_KEY` = contents of a .pem file generated at
https://github.com/organizations/Molecule-AI/settings/installations/124443072
(Click "Generate a private key" if one doesn't exist yet.)
## Long-term cleanup
The polling tail step still exists because the auto-merge call
itself uses GITHUB_TOKEN, so the FF push to main doesn't fire
publish-workspace-server-image's `push` trigger naturally. Switching
the auto-merge call to use the SAME App token would eliminate the
polling tail entirely. Tracked in #2357.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Migration 043 (2026-04-25) introduced the workspace_status enum but
omitted two values application code had been writing for days, so every
UPDATE that tried to write either value failed silently in production:
'awaiting_agent' (since 2026-04-24, commit 1e8b5e01):
- handlers/workspace.go:333 — external workspace pre-register
- handlers/registry.go (via PR #2382) — liveness offline transition
- registry/healthsweep.go (via PR #2382) — heartbeat-staleness sweep
'hibernating' (since hibernation feature shipped):
- handlers/workspace_restart.go:271 — DB-level claim before stop
All four/five sites swallowed the enum-cast error. User-visible impact:
external workspaces never transition to a stale state when their agent
disconnects (canvas shows them stuck on 'online'/'degraded' indefinitely),
new external workspaces never advance past 'provisioning', and idle
workspaces never auto-hibernate (resources held forever).
PR #2382 didn't *cause* this — it inherited the gap and added two more
silent-fail paths on top. The pre-existing two had been broken for five
days and went unnoticed because:
1. sqlmock matches SQL by regex, not against the live enum constraint.
Every test passed despite the prod-only failure.
2. The handlers either drop the Exec error entirely (workspace.go:333)
or log+continue without an alert (the other three).
Fix in three pieces:
1. migrations/046_*.up.sql — ALTER TYPE workspace_status ADD VALUE
'awaiting_agent', 'hibernating'. IF NOT EXISTS makes it idempotent
across re-runs (RunMigrations re-applies until schema_migrations
records the file). ALTER TYPE ADD VALUE doesn't take a heavy lock
and commits immediately, safe under live traffic.
2. migrations/046_*.down.sql — full rename → recreate → cast → drop
recipe. Postgres has no DROP VALUE so this is the only honest
rollback. Pre-flights existing rows to compatible values
(awaiting_agent → offline, hibernating → hibernated) before the
type swap.
3. internal/db/workspace_status_enum_drift_test.go — static gate that
parses every UPDATE/INSERT against `workspaces` in workspace-server/
internal/, extracts every status literal, and asserts each is in
the enum union (CREATE TYPE + every ALTER TYPE ADD VALUE). The gate
runs in unit tests, no DB required, and would have caught both
omissions on the day they shipped. Pattern matches
feedback_behavior_based_ast_gates and feedback_mock_at_drifting_layer.
Verification:
- go test ./internal/db/ -count=1 -race ✓
- go vet ./... ✓
- Drift gate flips red if I delete either ADD VALUE from the migration
(validated via local mutation).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pin the 5 public functions adapters and the runtime hot-path import
through ``from platform_auth import``:
- ``auth_headers`` — every outbound httpx call merges this in
- ``self_source_headers`` — A2A peer + self-message header builder
- ``get_token`` — main.py reads on boot to decide register-vs-resume
- ``save_token`` — main.py persists the platform-issued token
- ``refresh_cache`` — 401-retry path drops in-process cache (#1877)
A grep across workspace/ shows 14+ runtime modules import these:
main.py, heartbeat.py, a2a_client.py, a2a_tools.py, consolidation.py,
events.py, executor_helpers.py (3 sites), molecule_ai_status.py,
builtin_tools/memory.py (3 sites), builtin_tools/temporal_workflow.py
(2 sites). Renaming any of the five (e.g. ``auth_headers`` →
``bearer_headers``) makes every one of those imports raise ImportError
at workspace boot — the failure surface is deep in heartbeat init,
nowhere near the rename site.
Same drift class as the BaseAdapter signature snapshot (#2378, #2380),
skill_loader gate (#2381), runtime_wedge gate (#2383). Reuses the
``_signature_snapshot.py`` helpers shipped in #2381.
Defense-in-depth: ``test_snapshot_has_required_functions`` asserts
the five names are still present, so removing one even with a
synchronized snapshot edit forces an explicit edit here with a
justification.
``clear_cache`` is intentionally NOT in the snapshot — it's a
test-only helper. Production code MUST NOT depend on it.
Verified red on deliberate rename: ``auth_headers`` →
``bearer_headers`` produces a clean diff of the missing function in
the failure message. Restored before commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>