Backward-compat replay gate for the A2A JSON-RPC protocol surface.
Every PR that touches normalizeA2APayload OR bumps the a2a-sdk
version pin runs every shape in testdata/a2a_corpus/ through the
current code and asserts:
valid/ — every shape MUST parse without error and produce a
canonical v0.3 payload (params.message.parts list).
invalid/ — every shape MUST be rejected with the documented
status code and error substring.
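The per-entry assertion for valid/ can be pictured as a small helper. This is a sketch; the helper name and map shapes are illustrative, not the real test code:

```go
package main

import "fmt"

// checkCanonicalV03 is a hypothetical helper mirroring what the valid/
// replay asserts per entry: parts list non-empty, messageId non-empty,
// and the v0.2 "content" field deleted.
func checkCanonicalV03(msg map[string]any) error {
	if _, ok := msg["content"]; ok {
		return fmt.Errorf("v0.2 content field not deleted")
	}
	parts, ok := msg["parts"].([]any)
	if !ok || len(parts) == 0 {
		return fmt.Errorf("parts list missing or empty")
	}
	if id, _ := msg["messageId"].(string); id == "" {
		return fmt.Errorf("messageId empty")
	}
	return nil
}

func main() {
	canonical := map[string]any{
		"parts":     []any{map[string]any{"kind": "text", "text": "hi"}},
		"messageId": "m-1",
	}
	fmt.Println(checkCanonicalV03(canonical)) // <nil>
	fmt.Println(checkCanonicalV03(map[string]any{"content": "hi"}) != nil) // true
}
```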
What this prevents
The 2026-04-29 v0.2 → v0.3 silent-drop bug (PR #2349) shipped
because the SDK bump PR didn't replay v0.2-shaped inputs against
the new code; the shape-mismatch surfaced only in production when
the receiver's Pydantic validator silently rejected inbound
messages.
This gate would have caught it pre-merge. Hand-verified: reverting
the v0.2 string→parts shim in normalizeA2APayload fails 3 of the
v0.2 corpus entries with the exact rejection class the production
bug exhibited.
Corpus contents (14 entries)
valid/ (11):
v0_2_string_content — basic v0.2 (the broken case)
v0_2_string_content_no_message_id — v0.2 + auto-fill messageId
v0_2_list_content — v0.2 with content as Part list
v0_3_parts_text_only — canonical v0.3
v0_3_parts_multi_text — multi-Part list
v0_3_parts_with_file — multimodal (text + file)
v0_3_parts_with_context — contextId for multi-turn
v0_3_streaming_method — message/stream variant
v0_3_unicode_text — emoji + multi-script
v0_3_long_text — 10KB text Part
no_jsonrpc_envelope — bare params/method without
outer envelope (legacy senders)
invalid/ (3):
no_content_or_parts — message has neither field
content_is_integer — wrong type for v0.2 content
content_is_bool — wrong type; kept separate from the integer
case so the failure message identifies
which type-class regressed
Plus 4 inline malformed-JSON cases (truncated, not-JSON, empty,
whitespace) that can't be expressed as JSON corpus entries.
Coverage tests
The gate has 5 test functions:
1. TestA2ACorpus_ValidShapesParse — replay valid/ corpus,
assert no error + canonical v0.3 output (parts list non-empty,
messageId non-empty, content field deleted).
2. TestA2ACorpus_InvalidShapesRejected — replay invalid/ corpus,
assert rejection matches recorded status + error substring.
3. TestA2ACorpus_MalformedJSONRejected — inline cases for
non-parseable bodies.
4. TestA2ACorpus_HasMinimumCoverage — at least one v0.2 and
one v0.3 entry exist (the corpus can never lose either
side of the v0.2/v0.3 bridge).
5. TestA2ACorpus_EveryEntryHasMetadata — _comment/_added/_source
on every entry per the README policy; _expect_error and
_expect_status on invalid entries.
Documentation
testdata/a2a_corpus/README.md describes the corpus contract:
- When to add entries (new SDK shape, new production-observed
shape).
- When NOT to add (test scaffolding, hypothetical futures).
- Removal policy (breaking change, deprecation window required).
Verification
- All 24 corpus subtests pass on current main.
- Hand-test: revert the v0.2 compat shim → 3 v0.2 entries fail
the gate with the exact rejection class the production bug
exhibited. Confirmed.
- Whole-module go test ./... green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses code-review C1 (test goroutine race) and I2 (CF 524) on PR #2362.
C1: TestRunRestartCycle_SaaSPath_DispatchesViaCPProv invoked runRestartCycle
end-to-end, which spawns `go h.sendRestartContext(...)`. That goroutine
outlived the test, then read db.DB while the next test's setupTestDB wrote
to it — DATA RACE under -race, cascading 30+ failures across the handlers
suite. Refactored: extracted `stopForRestart(ctx, id)` from runRestartCycle
as a pure dispatcher, and rewrote the SaaS-path test to call it directly
(no async goroutine spawned). Added a no-provisioner no-op guard test.
I2: Cloudflare 524 ("origin timed out") now triggers maybeMarkContainerDead
alongside 502/503/504. Same upstream signal — origin agent unresponsive.
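The I2 change amounts to widening the dead-origin status set. A minimal sketch, with a hypothetical helper name:

```go
package main

import "fmt"

// isDeadUpstreamStatus sketches the I2 change: Cloudflare 524
// ("origin timed out") now counts as a dead-origin signal alongside
// the classic gateway statuses. Helper name is illustrative.
func isDeadUpstreamStatus(code int) bool {
	switch code {
	case 502, 503, 504, 524: // 524 added by this fix
		return true
	}
	return false
}

func main() {
	fmt.Println(isDeadUpstreamStatus(524)) // true
	fmt.Println(isDeadUpstreamStatus(200)) // false
}
```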
Verified `go test -race -count=1 ./internal/handlers/...` green locally.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Independent review of #2362 caught a Critical gap: the previous commit
fixed the Stop dispatch in runRestartCycle but left the provisionWorkspace
dispatch unconditionally Docker-only. So on SaaS the auto-restart cycle
would Stop the EC2 successfully (good), then NPE inside provisionWorkspace's
`h.provisioner.VolumeHasFile` call. coalesceRestart's recover()-without-
re-raise (a deliberate platform-stability safeguard) silently swallowed
the panic, leaving the workspace permanently stuck in status='provisioning'
because the UPDATE on workspace_restart.go:450 had already run.
Net pre-amendment effect on SaaS: dead agent → structured 503 (good) →
workspace flipped to 'offline' (good) → cpProv.Stop succeeded (good) →
provisionWorkspace NPE swallowed (bad) → workspace permanently
'provisioning' until manual canvas restart. The headline claim of #2362
("SaaS auto-restart now works") was false on the path it shipped.
Fix: dispatch the reprovision call the same way every other call site
in the package does (workspace.go:431-433, workspace_restart.go:197+596) —
branch on `h.cpProv != nil` and call provisionWorkspaceCP for SaaS,
provisionWorkspace for Docker.
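The dispatch shape of the fix, sketched with stand-in types (all names here are illustrative):

```go
package main

import "fmt"

// Stand-in types for the two provisioner kinds (real names differ).
type dockerProvisioner struct{}
type cpProvisioner struct{}

type handler struct {
	provisioner *dockerProvisioner // local Docker
	cpProv      *cpProvisioner     // CP-backed EC2 (SaaS)
}

// reprovisionDispatch sketches the amended branch: SaaS goes through
// provisionWorkspaceCP, Docker through provisionWorkspace, matching
// the other call sites in the package.
func (h *handler) reprovisionDispatch() string {
	switch {
	case h.cpProv != nil:
		return "provisionWorkspaceCP"
	case h.provisioner != nil:
		return "provisionWorkspace"
	default:
		return "none" // both nil: RestartByID's gate keeps us out of here
	}
}

func main() {
	saas := &handler{cpProv: &cpProvisioner{}}
	local := &handler{provisioner: &dockerProvisioner{}}
	fmt.Println(saas.reprovisionDispatch())  // provisionWorkspaceCP
	fmt.Println(local.reprovisionDispatch()) // provisionWorkspace
}
```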
Tests:
- New TestRunRestartCycle_SaaSPath_DispatchesViaCPProv asserts cpProv.Stop
is called when the SaaS path runs (would have caught the NPE if
provisionWorkspace had been called instead).
- fakeCPProv updated: methods record calls and return nil/empty by
default rather than panicking. The previous "panic on unexpected call"
pattern was unsafe — the panic fires on the async restart goroutine
spawned by maybeMarkContainerDead AFTER the test assertions ran, so
the test passed by accident even though the production path was
broken (which is exactly how the Critical bug landed).
- Existing tests still pass (full handlers + provisioner suites green).
Branch-count audit refresh:
runRestartCycle dispatch decisions:
1. h.provisioner != nil → provisioner.Stop + provisionWorkspace ✓ (existing tests)
2. h.cpProv != nil → cpProv.Stop + provisionWorkspaceCP ✓ (NEW test)
3. both nil → coalesceRestart never called (RestartByID gate) ✓
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Class-of-bugs fix surfaced by hongmingwang.moleculesai.app's canvas chat
to a dead workspace returning a generic Cloudflare 502 page on
2026-04-30. Three independent gaps in the reactive-health path that
together leak dead-agent failures to canvas with no auto-recovery.
## Bug 1 — maybeMarkContainerDead is a no-op for SaaS tenants
`maybeMarkContainerDead` only consulted `h.provisioner` (local Docker
provisioner). SaaS tenants set `h.cpProv` (CP-backed EC2 provisioner)
and leave `h.provisioner` nil — so the function early-returned false
on every call and dead EC2 agents never triggered the offline-flip /
broadcast / restart cascade.
Fix: extend `CPProvisionerAPI` interface with `IsRunning(ctx, id)
(bool, error)` (already implemented on `*CPProvisioner`; just needs
to surface on the interface). `maybeMarkContainerDead` now branches:
local-Docker path uses `h.provisioner.IsRunning`; SaaS path uses
`h.cpProv.IsRunning` which calls the CP's `/cp/workspaces/:id/status`
endpoint to read the EC2 state.
## Bug 2 — RestartByID short-circuits on `h.provisioner == nil`
Same shape as Bug 1: the auto-restart cascade triggered by
`maybeMarkContainerDead` calls `RestartByID` which short-circuited
when the local Docker provisioner was missing. So even if Bug 1 were
fixed, the workspace-offline state would never recover.
Fix: change the gate to `h.provisioner == nil && h.cpProv == nil`
and update `runRestartCycle` to branch on which provisioner is
wired for the Stop call. (The HTTP `Restart` handler already does
this branching correctly — we're just bringing the auto-restart path
to parity.)
## Bug 3 — upstream 502/503/504 propagated as-is, masked by Cloudflare
When the agent's tunnel returns 5xx (the "tunnel up but no origin"
shape — agent process dead but cloudflared connection still healthy),
`dispatchA2A` returns successfully at the HTTP layer with a 5xx body.
`handleA2ADispatchError`'s reactive-health path doesn't run because
that path is only triggered on transport-level errors. The pre-fix
code propagated the 502 status to canvas; Cloudflare in front of the
platform then masked the 502 with its own opaque "error code: 502"
page, hiding any structured response and any Retry-After hint.
Fix: in `proxyA2ARequest`, when the upstream returns 502/503/504, run
`maybeMarkContainerDead` BEFORE propagating. If IsRunning confirms
the agent is dead → return a structured 503 with restarting=true +
Retry-After (CF doesn't mask 503s the same way). If running, propagate
the original status (don't recycle a healthy agent on a transient
hiccup — it might have legitimately returned 502).
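The decision in proxyA2ARequest can be sketched as a pure function. This is a hypothetical helper; the real code also emits the structured body and Retry-After header:

```go
package main

import "fmt"

// classifyUpstream5xx sketches the Bug 3 decision: gateway-class
// statuses from the tunnel trigger a liveness check first; only a
// confirmed-dead agent is converted to a structured 503.
func classifyUpstream5xx(status int, agentDead bool) (outStatus int, restarting bool) {
	switch status {
	case 502, 503, 504:
		if agentDead {
			return 503, true // structured body + Retry-After; CF won't mask it
		}
	}
	return status, false // healthy agent: propagate the original status
}

func main() {
	s, r := classifyUpstream5xx(502, true)
	fmt.Println(s, r) // 503 true
	s, r = classifyUpstream5xx(502, false)
	fmt.Println(s, r) // 502 false
}
```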
## Drive-by — a2aClient transport timeouts
a2aClient was `&http.Client{}` with no Transport timeouts. When a
workspace's EC2 black-holes TCP connects (instance terminated mid-flight,
SG flipped, NACL bug), the OS default is 75s on Linux / 21s on macOS —
long enough for Cloudflare's ~100s edge timeout to fire first and
surface a generic 502. Added DialContext (10s connect), TLSHandshake
(10s), and ResponseHeaderTimeout (60s). Client.Timeout DELIBERATELY
unset — that would pre-empt slow-cold-start flows (Claude Code OAuth
first-token, multi-minute agent synthesis). Long-tail body streaming
is still governed by per-request context deadline.
## Tests
- `TestMaybeMarkContainerDead_CPOnly_NotRunning` — IsRunning(false) →
marks workspace offline, returns true.
- `TestMaybeMarkContainerDead_CPOnly_Running` — IsRunning(true) →
no offline-flip, returns false (don't recycle a healthy agent).
- `TestProxyA2A_Upstream502_TriggersContainerDeadCheck` — agent server
  returns 502 + cpProv reports dead → caller gets 503 with
  restarting=true and Retry-After: 15.
- `TestProxyA2A_Upstream502_AliveAgent_PropagatesAsIs` — same upstream
502 but cpProv reports running → propagates 502 (existing behavior;
safety check that prevents over-eager recycling).
- Existing `TestMaybeMarkContainerDead_NilProvisioner` /
`TestMaybeMarkContainerDead_ExternalRuntime` still pass.
- Full handlers + provisioner test suites pass.
## Impact
Pre-fix: dead EC2 agent on a SaaS tenant → CF-masked 502 to canvas, no
auto-recovery, manual restart from canvas required.
Post-fix: dead EC2 agent on a SaaS tenant → structured 503 with
restarting=true + Retry-After to canvas, workspace flipped to offline,
auto-restart cycle triggered. Canvas can show a user-actionable
"agent is restarting, please wait" message instead of a generic 502.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Independent review of #2358 surfaced three gaps that the original
self-review missed. All three would manifest only on the FIRST real
staging→main promotion through the new tail step, so they'd silently
re-introduce the deploy-chain bug #2357 was supposed to fix.
1. **Missing `actions: write` permission.** `gh workflow run` POSTs to
`/repos/.../actions/workflows/.../dispatches`, which requires the
actions:write scope on GITHUB_TOKEN. The job had only contents:write
+ pull-requests:write, so the dispatch call would 403 on every run
and the publish chain would still not fire. Adding the scope.
2. **No workflow-level concurrency block.** When CI + E2E Staging
Canvas + E2E API Smoke + CodeQL all complete within seconds of each
other on a green staging push (the typical case), four separate
workflow_run events fire and four parallel auto-promote runs all
reach the dispatch tail. They poll the same PR, all observe the
same mergedAt, and all call `gh workflow run` — producing 2-4×
redundant publish builds racing for the same `:staging-latest`
retag and 2-4× canary-verify chains. Added
`concurrency.group: auto-promote-staging, cancel-in-progress: false`.
cancel-in-progress=false because killing a polling tail that's
about to dispatch would re-introduce the original bug.
3. **PR closed-without-merge ties up a runner for 30 min.** If the
merge queue rejects the PR (gates flip red post-approval), or an
operator closes it manually, mergedAt stays null forever and the
loop polls 60 × 30s burning a runner slot. Now also reads `state`
in the same `gh pr view` call and breaks early when STATE=CLOSED.
Verification on this PR is structural (workflow won't fire on a
staging→main promotion until this lands AND a subsequent staging
push triggers auto-promote). The actions:write fix in particular is
unverifiable until the next real run — the prior #2358 fix has
the same property, so we're stacking two unverifiable workflow
edits. That's intentional rather than risky: stage 1 (#2358) was
load-bearing for the deploy-chain restoration; stage 2 (this PR)
hardens it before it actually matters.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The auto-promote staging → main flow uses `gh pr merge --auto` with
GITHUB_TOKEN, which means GitHub suppresses downstream `push` events on
the resulting main commit. This is documented behavior — events created
by GITHUB_TOKEN do not trigger new workflow runs, with workflow_dispatch
and repository_dispatch as the only exceptions.
Effect: when the merge queue lands the auto-promote PR, the main push
DOES NOT fire publish-workspace-server-image. canary-verify + the
:staging-<sha> → :latest retag never run, so redeploy-tenants-on-main
also never fires. Tenants stay on stale code until someone manually
dispatches the chain (which is what just happened for issue #2339).
Fix here: after enqueuing auto-merge, poll for the PR to land, then
explicitly `gh workflow run publish-workspace-server-image.yml --ref
main`. workflow_dispatch is the documented exception, so the dispatch
event itself DOES create a new run. canary-verify and
redeploy-tenants-on-main chain via workflow_run as before.
Long-term (tracked in #2357): switch the auto-merge call above to a
GitHub App token (actions/create-github-app-token) so the merge event
itself can trigger the downstream chain naturally; the polling tail
becomes deletable.
Why a 30-min poll cap: merge queue typically lands a green promote PR
within 5-10 min. 30 min covers a slow CI run without hanging the
workflow indefinitely. If the merge times out, the step warns and
exits 0 — operator can manually dispatch as a fallback.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI run on PR #2355 surfaced `pq: invalid input syntax for type uuid:
ws-poll-e2e-1777529293-3363` — workspaces.id is UUID-typed and the
hand-rolled "ws-<tag>" shape fails the cast. Phase 1 returned
generic 'registration failed' which cascaded into Phase 3 'lookup
failed' (resolveAgentURL on a non-existent row) and Phase 4 'missing
workspace auth token' (no token extracted because Phase 1 didn't run
the bootstrap path).
Generate v4 UUIDs via uuidgen (with a python3 fallback), one each
for the poll workspace, the caller workspace, and the Phase 2
invalid-mode probe.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
End-to-end coverage for the canvas-chat unblocker. Exercises every
moving part of the #2339 stack against a real platform instance:
Phase 1 — register a workspace as delivery_mode=poll WITHOUT a URL;
verify the response carries delivery_mode=poll.
Phase 2 — invalid delivery_mode rejected with 400 (typo defense).
Phase 3 — POST A2A to the poll-mode workspace; verify proxyA2ARequest
short-circuits and returns 200 {status:queued, delivery_mode:poll,
method:message/send} without ever resolving an agent URL.
Phase 4 — verify the queued message appears in /activity?type=a2a_receive
with the right method + payload (the polling agent reads from here).
Phase 5 — since_id cursor returns ASC-ordered rows STRICTLY AFTER the
cursor; the cursor row itself must NOT be replayed. Sends two
follow-up messages and asserts ordering: rows[0] is the older new
event, rows[-1] is the newer.
Phase 6 — unknown / pruned cursor returns 410 Gone with an explanation.
Phase 7 — cross-workspace cursor isolation: a UUID belonging to one
workspace cannot be used to peek at another workspace's feed (returns
410, same as pruned, no info leak).
Idempotent: per-run unique workspace ids (date+pid). Trap-based cleanup
deletes the test rows on exit; no e2e_cleanup_all_workspaces call (see
feedback_never_run_cluster_cleanup_tests_on_live_platform.md).
Wired into .github/workflows/e2e-api.yml so it runs on every PR that
touches workspace-server/, tests/e2e/, or the workflow file itself —
same gate as the existing test_a2a_e2e + test_notify_attachments suites.
Stacked on #2354 (PR 3: since_id cursor).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Telegram getUpdates / Slack RTM shape: poll-mode workspaces pass the id
of the last activity_logs row they consumed, server returns rows
strictly after in chronological (ASC) order. Existing callers that don't
pass since_id keep DESC + most-recent-N — backwards-compatible.
Cursor lookup is scoped by workspace_id so a caller cannot enumerate or
peek at another workspace's events by passing a UUID belonging to a
different workspace. Cross-workspace and pruned cursors both return
410 Gone — no information leak (caller cannot distinguish "row never
existed" from "row exists but you can't see it").
since_id + since_secs both apply (AND). When since_id is set the order
flips to ASC because polling consumers need recorded-order; the
recent-feed shape (no since_id) keeps DESC.
Tests:
- TestActivityHandler_SinceID_ReturnsNewerASC — cursor lookup → main
query with cursorTime + ASC ordering.
- TestActivityHandler_SinceID_CursorNotFound_410 — pruned/unknown cursor.
- TestActivityHandler_SinceID_CrossWorkspaceCursor_410 — UUID belongs to
another workspace, scoped lookup hides it (same 410 path, no leak).
- TestActivityHandler_SinceID_CombinedWithSinceSecs — placeholder index
arithmetic with both filters.
Stacked on #2353 (PR 2: poll-mode short-circuit).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Skip SSRF/dispatch and queue to activity_logs for delivery_mode=poll
workspaces. The polling agent (e.g. molecule-mcp-claude-channel on an
operator's laptop) consumes via GET /activity?since_id= in PR 3 — no
public URL needed.
Order: budget -> normalize -> lookupDeliveryMode short-circuit ->
resolveAgentURL. Normalizing before the short-circuit keeps the
JSON-RPC method name on the activity_logs row so the polling agent
can dispatch correctly.
Fail-closed-to-push: any DB error reading delivery_mode defaults to
push (loud + recoverable) rather than poll (silent drop).
Tests:
- TestProxyA2A_PollMode_ShortCircuits_NoSSRF_NoDispatch — core invariant:
no resolveAgentURL, no Do(), records to activity_logs, returns 200
{status:"queued",delivery_mode:"poll",method:"message/send"}.
- TestProxyA2A_PushMode_NoShortCircuit — push path unaffected; the agent
server actually receives the request.
- TestProxyA2A_PollMode_FailsClosedToPush — DB error on mode lookup
must NOT silently queue; falls through to the push path.
Stacked on #2348 (PR 1: schema + register flow).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hard gate #4: codified module boundaries as Go tests, so a new
contributor (or AI agent) can't silently land an import that crosses
a layer.
Boundaries enforced (one architecture_test.go per package):
- wsauth has no internal/* deps — auth leaf, must be unit-testable in
isolation
- models has no internal/* deps — pure-types leaf, reverse dep would
create cycles since most packages depend on models
- db has no internal/* deps — DB layer below business logic, must be
testable with sqlmock without spinning up handlers/provisioner
- provisioner does not import handlers or router — unidirectional
layering: handlers wires provisioner into HTTP routes; the reverse
is a cycle
Each test parses .go files in its package via go/parser (no x/tools
dep needed) and asserts forbidden import paths don't appear. Failure
messages name the rule, the offending file, and explain WHY the
boundary exists so the diff reviewer learns the rule.
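The core of each boundary test, sketched as a runnable fragment; the module path is a made-up example:

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
	"strings"
)

// forbiddenImports sketches each architecture_test.go's core check:
// parse a source file with parser.ImportsOnly and flag any import path
// under a forbidden prefix.
func forbiddenImports(src, forbiddenPrefix string) ([]string, error) {
	f, err := parser.ParseFile(token.NewFileSet(), "x.go", src, parser.ImportsOnly)
	if err != nil {
		return nil, err
	}
	var bad []string
	for _, imp := range f.Imports {
		path := strings.Trim(imp.Path.Value, `"`)
		if strings.HasPrefix(path, forbiddenPrefix) {
			bad = append(bad, path)
		}
	}
	return bad, nil
}

func main() {
	// The hand-tested violation shape: wsauth importing an internal package.
	src := "package wsauth\nimport \"example.com/app/internal/orgtoken\"\n"
	bad, _ := forbiddenImports(src, "example.com/app/internal/")
	fmt.Println(bad) // [example.com/app/internal/orgtoken]
}
```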
Note: the original issue's first two proposed boundaries
(provisioner-no-DB, handlers-no-docker) don't match the codebase
today — provisioner already imports db (PR #2276 runtime-image
lookup) and handlers hold *docker.Client directly (terminal,
plugins, bundle, templates). I picked the four boundaries that
actually hold; the first two are aspirational and would need a
refactor before they could be codified.
Hand-tested by injecting a deliberate wsauth -> orgtoken violation:
the gate fires red with the rule message before merge.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hard gate Tier 2 item 2 of 4. Cron-driven full-lifecycle E2E that
catches regressions visible only at runtime — schema drift,
deployment-pipeline gaps, vendor outages, env-var rotations,
DNS / CF / Railway side-effects.
Empirical motivation from today:
- #2345 (A2A v0.2 silent drop) — passed unit tests, broke at JSON-RPC
parse layer between sender + receiver. Visible only when a sender
exercises the full path. Now-fixed by PR #2349, but a continuous
E2E would have surfaced it within 20 min of the regression.
- RFC #2312 chat upload — landed staging-branch but never reached
staging tenants because publish-workspace-server-image was main-
only. Caught by manual dogfooding hours after deploy. Same pattern.
Both classes are invisible to PR-time CI. The continuous gate fires
every 20 min against a real staging tenant and surfaces regressions
within minutes.
Cadence: cron `0,20,40 * * * *` (3x/hour). Offsets the existing
sweep-cf-orphans (:15) and sweep-cf-tunnels (:45) so the three ops
don't burst CF/AWS APIs at the same minute. Concurrency group
prevents overlapping runs if one hangs.
Cost: ~$0.50-1/day GHA + pennies of staging tenant lifecycle.
Reuses existing tests/e2e/test_staging_full_saas.sh — no new harness
to maintain. Bounded at 10 min wall-clock (vs 15 min default) so
stuck runs fail fast rather than holding up the next firing.
Defaults to E2E_RUNTIME=langgraph (fastest cold start; the regression
classes this gate catches don't need hermes-specific paths). Operators
can dispatch with runtime=hermes when they want SDK-native coverage.
Schedule-vs-dispatch hardening: hard-fail on missing
CP_STAGING_ADMIN_API_TOKEN for cron firing (silent-skip would mask
real outages); soft-skip for operator dispatch.
Refs:
- #2342 hard-gates Tier 2 item 2
- #2345 (A2A v0.2 fix that this gate would have caught earlier)
- #2335 / #2337 (deployment-pipeline gaps that this gate also catches)
Closes #2345.
## Symptom
Design Director silently dropped A2A briefs whose sender used the
v0.2 message format (`params.message.content` string) instead of v0.3
(`params.message.parts` part-list). The downstream a2a-sdk's v0.3
Pydantic validator rejected with "params.message.parts — Field
required" but the rejection only landed in tenant-side logs; the
sender saw HTTP 200/202 and assumed delivery.
UX Researcher therefore never received the kickoff. Multi-agent
pipeline silently idle.
## Fix
Convert at the proxy edge in normalizeA2APayload. Two cases handled,
one explicitly rejected:
v0.2 string content → wrap as [{kind: text, text: <content>}]
(the canonical v0.2 case from the dogfooding
report)
v0.2 list content → preserve list as parts (some older clients
put a list under `content`; treat as "client
meant parts, used wrong field name")
v0.3 parts present → no-op (hot path for normal traffic)
Neither present → return HTTP 400 with structured JSON-RPC
error pointing at the missing field
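The four-way table above, sketched on a bare message map (simplified; the real normalizeA2APayload operates on the whole JSON-RPC envelope and emits a structured JSON-RPC error on rejection):

```go
package main

import "fmt"

// normalizeMessage sketches the conversion: v0.3 parts pass through,
// v0.2 string/list content is lifted into parts, anything else is
// rejected loudly instead of being silently dropped downstream.
func normalizeMessage(msg map[string]any) (map[string]any, error) {
	if _, ok := msg["parts"]; ok {
		return msg, nil // v0.3 hot path: no-op
	}
	switch c := msg["content"].(type) {
	case string: // v0.2 string → single text Part
		msg["parts"] = []any{map[string]any{"kind": "text", "text": c}}
	case []any: // v0.2 list → client meant parts, used the wrong field name
		msg["parts"] = c
	default: // neither present, or unsupported type → loud 400
		return nil, fmt.Errorf("message requires content or parts")
	}
	delete(msg, "content")
	return msg, nil
}

func main() {
	out, _ := normalizeMessage(map[string]any{"content": "kickoff brief"})
	fmt.Println(len(out["parts"].([]any))) // 1
	_, err := normalizeMessage(map[string]any{})
	fmt.Println(err != nil) // true
}
```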
Why at the proxy edge: every workspace gets the compat for free
without each one bumping a2a-sdk separately. The SDK's own compat
adapter is strict about `parts` and rejects v0.2 senders.
Why reject loud on missing-both: pre-fix the SDK's Pydantic
rejection was post-handler-dispatch and invisible to the original
sender. Now misshapen payloads return a structured 400 to the actual
caller — kills the entire silent-drop class for this payload-shape
category.
## Tests
7 new cases on normalizeA2APayload (#2345) + 1 fixture update on the
existing _MissingMethodReturnsEmpty test:
TestNormalizeA2APayload_ConvertsV02StringContentToParts
TestNormalizeA2APayload_ConvertsV02ListContentToParts
TestNormalizeA2APayload_PreservesV03Parts (hot path)
TestNormalizeA2APayload_RejectsMessageWithNeitherContentNorParts
TestNormalizeA2APayload_RejectsContentWithUnsupportedType
TestNormalizeA2APayload_NoMessageNoCheck (e.g. tasks/list bypasses)
All 11 normalizeA2APayload tests pass + full handler suite (no
regressions).
## Refs
Hard-gates discussion: this is exactly the class of failure
(silent-drop on schema mismatch) that #2342 (continuous synthetic
E2E) would catch automatically. Tier 2 RFC item from #2345 (caller
gets structured JSON-RPC error on parse failure) is delivered above
via the loud-reject path.
Adds workspaces.delivery_mode (push, default | poll) and lets the register
handler accept poll-mode workspaces with no URL. This is the foundation
for the unified poll/push delivery design in #2339 — Telegram-getUpdates
shape for external runtimes that have no public URL.
What this PR does:
- Migration 045: NOT NULL TEXT column, default 'push', CHECK constraint
on the two valid values.
- models.Workspace + RegisterPayload + CreateWorkspacePayload gain a
DeliveryMode field. RegisterPayload.URL drops the `binding:"required"`
tag — the handler now enforces it conditionally on the resolved mode.
- Register handler: validates explicit delivery_mode if set; resolves
effective mode (payload value, else stored row value, else push) AFTER
the C18 token check; validates URL only when effective mode is push;
persists delivery_mode in the upsert; returns it in the response;
skips URL caching when payload.URL is empty.
- CreateWorkspace handler: persists delivery_mode (defaults to push) in
the same INSERT, and validates it before any side effects.
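The mode-resolution order and conditional URL check can be sketched as follows (the helper name is illustrative):

```go
package main

import "fmt"

// validateRegister sketches the resolution order: payload value, else
// stored row value, else "push"; the URL requirement then applies only
// to the resolved push mode.
func validateRegister(payloadMode, storedMode, url string) (string, error) {
	mode := payloadMode
	if mode == "" {
		mode = storedMode
	}
	if mode == "" {
		mode = "push" // schema default
	}
	if mode != "push" && mode != "poll" {
		return "", fmt.Errorf("invalid delivery_mode %q", mode)
	}
	if mode == "push" && url == "" {
		return "", fmt.Errorf("url is required for push delivery")
	}
	return mode, nil
}

func main() {
	m, err := validateRegister("poll", "", "")
	fmt.Println(m, err) // poll <nil>
	_, err = validateRegister("", "", "")
	fmt.Println(err != nil) // true: resolved to push, empty URL rejected
}
```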
What this PR does NOT do (intentional, follow-up PRs):
- PR 2: short-circuit ProxyA2A for poll-mode workspaces (skip SSRF +
dispatch, log a2a_receive activity, return 200).
- PR 3: since_id cursor on GET /activity for lossless polling.
- Plugin v0.2 in molecule-mcp-claude-channel: cursor persistence + a
register helper that creates poll-mode workspaces.
Backwards compatibility: every existing workspace stays push-mode (schema
default) with identical behavior. New tests:
TestRegister_PollMode_AcceptsEmptyURL,
TestRegister_PushMode_RejectsEmptyURL,
TestRegister_InvalidDeliveryMode,
TestRegister_PollMode_PreservesExistingValue. All existing register +
create tests updated to expect the new delivery_mode column in the
INSERT args.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #2332 item 1 (workspace awareness — agents don't surface
platform-native tools up front).
The dogfooding session surfaced that agents weren't using A2A
delegation, persistent memory, or send_message_to_user. The tools
were registered AND documented in the system prompt — but only in
sections #8 (Inter-Agent Communication) and #9 (Hierarchical Memory),
which agents read AFTER they've already started reasoning about a
plan from earlier sections.
This adds a tight inventory at section #1.5 (immediately after
Platform Instructions, before role-specific prompt files) — every
tool name + its short description in a bulleted block. Detailed
when_to_use docs in sections #8/#9 stay; this preamble is the
elevator pitch ("you have these"), the later sections are the
manual ("here's when and how").
Generated from `platform_tools.registry` ToolSpecs — every tool's
`name` + `short` flow through automatically, no manual sync. A new
`get_capabilities_preamble(mcp: bool)` helper in executor_helpers
mirrors the existing get_a2a_instructions / get_hma_instructions
pattern.
CLI-runtime agents (mcp=False) get an empty preamble — they see
_A2A_INSTRUCTIONS_CLI's hand-written subcommand vocabulary further
down, and the registry's MCP tool names would conflict.
Tests:
- test_capabilities_preamble_appears_in_mcp_prompt: header present
- test_capabilities_preamble_lists_every_registry_tool: every
  a2a + memory tool from the registry shows up (drift is caught at
  test time: adding a new tool to the registry surfaces here
  automatically)
- test_capabilities_preamble_precedes_prompt_files: ordering
invariant (toolkit before role docs)
- test_capabilities_preamble_skipped_for_cli_runtime: empty when
mcp=False
All 40 prompt + platform_tools tests pass.
Parity with #2337's redeploy-tenants-on-staging.yml. Both prod and
staging redeploys now have explicit serialization:
group: redeploy-tenants-on-main (per-workflow, global)
group: redeploy-tenants-on-staging (per-workflow, global)
cancel-in-progress: false on both — aborting a half-rolled-out fleet
would leave tenants stuck on whatever image they happened to be on
when cancelled. Better to finish the in-flight rollout before starting
the next one.
Pre-fix this workflow relied on GitHub's implicit workflow_run queueing,
which is "probably fine" but not defensible — explicit > implicit for
load-bearing pipeline behavior. Picked up as a #2337 review nit
(architecture finding 1: concurrency asymmetry between the two
redeploy workflows).
No behavior change in the common case. The change matters only when
two main pushes land within seconds AND the first redeploy is still
mid-rollout — currently rare; will become more common once #2335
(staging-trigger publish) feeds main more frequently via auto-promote.
Two follow-ups from #2335 review (tracked in #2336):
1. Add `concurrency:` block to publish-workspace-server-image.yml so
two rapid staging pushes don't race the same :staging-latest retag.
Group is per-branch (`${{ github.ref }}`) so staging and main can
build in parallel — they produce different :staging-<sha> tags and
last-write-wins on :staging-latest is acceptable across branches.
`cancel-in-progress: false` keeps in-flight builds — partially-pushed
images would break canary-fleet pin consistency.
2. Add redeploy-tenants-on-staging.yml. After #2335, every staging push
produces a fresh :staging-latest, but existing tenants only pick it
up on next reprovision. This workflow mirrors
redeploy-tenants-on-main but for staging:
- workflow_run-gated to branches: [staging]
- target_tag default 'staging-latest' (vs 'latest' for prod)
- CP_URL default https://staging-api.moleculesai.app
- CP_STAGING_ADMIN_API_TOKEN repo secret (operator must set)
- canary_slug empty by default — staging is itself the canary; no
sub-canary needed inside it. Soak still applies if operator
specifies a tenant for blast-radius control.
Schedule-vs-dispatch hardening matches sweep-cf-orphans /
sweep-cf-tunnels: hard-fail on auto-trigger when secret missing so misconfig
doesn't silently leave staging tenants on stale code; soft-skip on
operator dispatch.
Operator action required after merge:
Add CP_STAGING_ADMIN_API_TOKEN repo secret. Pull value from staging-
CP's CP_ADMIN_API_TOKEN env in Railway controlplane / staging
environment. Until set, the auto-trigger will fail the workflow run
(visible as red CI), surfacing the misconfiguration. Workflow runs
only on staging publish-workspace-server-image success, so no extra
load while it sits unconfigured.
Verification:
- YAML lint clean on both workflows.
- Reviewed redeploy-tenants-on-main as template; differences are scoped
to staging-specific values (URL, tag, secret name) + harden-on-missing-
secret pattern.
Refs #2335, #2336.
Root cause: this workflow only triggered on `branches: [main]`, but
staging-CP pins TENANT_IMAGE=:staging-latest (verified via Railway).
:staging-latest was only retagged on main push, so:
staging-branch code → never built → never reaches staging tenants
staging-CP serves → "yesterday's main" indefinitely
When staging→main was wedged (path-filter parity bug, canvas teardown
race — both fixed earlier today), :staging-latest stopped updating
entirely. RFC #2312 (chat upload HTTP-forward) landed on staging but
freshly-provisioned staging tenants kept failing chat upload because
they pulled pre-RFC-#2312 image. Verified by tearing down a fresh
tenant and observing the legacy "workspace container not running"
error from the docker-exec code path that RFC #2312 deleted.
Pre-2026-04-24 there was a related-but-different incident: TENANT_IMAGE
was a static :staging-<sha> pin that drifted 10 days behind. This new
incident is "the dynamic pin still drifts when its update workflow
doesn't fire."
Fix: add `staging` to the branches trigger. Tag policy is unchanged
(:staging-<sha> + :staging-latest on every push). canary-verify.yml
still runs on main push (workflow_run-gated to `branches: [main]`),
preserving the canary-verified :latest promotion for prod tenants.
Steady state after this:
- staging push → :staging-latest = staging-branch code → staging-CP
- main push → :staging-<sha> for canary, :staging-latest retag
(post-promote main code), and after canary green
→ :latest for prod tenants
What this does NOT change:
- canary-verify.yml flow (still main-only)
- redeploy-tenants-on-main.yml (still rolls prod fleet on main push)
- publish-canvas-image.yml (self-hosted standalone canvas; orthogonal)
- The :latest tag (canary-verified main, unchanged)
What this does fix:
- RFC #2312-class fixes that land on staging now actually reach
staging tenants without waiting for staging→main promote.
- The dogfooding observation "staging tenants seem to be running
yesterday's code" disappears as a class.
Drive-by: also fixed the typo in the path-filter list (was
`publish-platform-image.yml`, the actual file is
`publish-workspace-server-image.yml`).
The header comment claimed:
"file upload (HTTP-forward) + download (Docker-exec)"
and:
"Download still uses the v1 docker-cp path; migrating it lives in
the next PR in this stack"
Both are now wrong: RFC #2312 PR-D landed the download HTTP-forward path:
chat_files.go:336 builds an http.NewRequestWithContext to
${wsURL}/internal/file/read?path=<abs>, with the response streamed
back to the caller. The workspace-side Starlette handler is at
workspace/internal_file_read.py, mounted at workspace/main.py:440.
Update the header to reflect actual code: both upload + download are
HTTP-forward, share the same per-workspace platform_inbound_secret
auth, and work uniformly on local Docker and SaaS EC2.
Pure docs change — no behavior, no build/test impact.
Closes the observability gap surfaced in #2329 item 5: callers received
queue_id in the 202 enqueue response but had no public lookup. The only
existing observability path was check_task_status (delegation-flavored
A2A only — joins via request_body->>'delegation_id'). Cross-workspace
peer-direct A2A had no observability after enqueue.
This PR ships RFC #2331's Tier 1: minimum viable observability + caller-
specified TTL. No schema migration — expires_at column already exists
(migration 042); only DequeueNext was honoring it, with no caller path
to populate it.
Two changes:
1. extractExpiresInSeconds(body) — new helper mirroring
extractIdempotencyKey/extractDelegationIDFromBody. Pulls
params.expires_in_seconds from the JSON-RPC body. Zero (the unset
default) preserves today's infinite-TTL semantics. EnqueueA2A grew
an expiresAt *time.Time parameter; the proxy callsite computes
*time.Time from the extracted seconds and threads it through to
the INSERT.
2. GET /workspaces/:id/a2a/queue/:queue_id — new public handler.
Auth: caller's workspace token must match queue.caller_id OR
queue.workspace_id, OR be an org-level token. 404 (not 403) on
auth failure to avoid leaking queue_id existence. Response
includes status/attempts/last_error/timestamps/expires_at; embeds
response_body via LEFT JOIN against activity_logs when status=
completed for delegation-flavored items.
What this does NOT change:
- Drain semantics (heartbeat-driven dispatch).
- Native-session bypass (claude-agent-sdk, hermes still skip queue).
- Schema (column already exists).
- MCP tools (delegate_task_async / check_task_status keep their
contract; this is a parallel queue-id surface).
Tests:
- 7 cases on extractExpiresInSeconds covering absent/positive/
zero/negative/invalid-JSON/wrong-type/empty-params.
- go vet + go build clean.
- Full handlers test suite passes (no regressions from the
EnqueueA2A signature change — only one production caller).
Tier 2 (cross-workspace stitch + webhook callback) and Tier 3
(controllerized lifecycle) deferred per RFC #2331.
Issue: scripts/dev-start.sh assumed `go` was on PATH; on a fresh dev
box without Go installed, line 111 (`go run ./cmd/server`) failed
with `go: not found` and the script bailed before printing the
readiness banner. The script's own prerequisite list (lines 13-21)
said "Go 1.25+" but there was no signpost between "open the doc" and
"command not found."
Fix: detect `go` via `command -v`. If present, keep the existing
`go run` path (fast iteration, attaches to local log). If not,
fall back to `docker compose up -d --build platform` which uses the
published platform container — slower first run but the script
still works without forcing the dev to install Go just to read logs.
Either path leaves /health on :8080 so the rest of the script's
wait loop is unchanged.
If both paths fail, the error message names the install URL
(https://go.dev/dl/) and the fallback diagnostic (`/tmp/molecule-platform.log`)
so the dev has a single, actionable next step.
Verified: `sh -n` syntax check passes.
Closes #2329 item 2.
CP's tenant-delete cascade removes the DNS record (with sweep-cf-orphans
as a backstop) but does NOT delete the underlying Cloudflare Tunnel.
Each E2E provision creates one Tunnel named `tenant-<slug>`; without
cleanup these accumulate indefinitely on the account, consuming the
tunnel quota and cluttering the dashboard.
Observed 2026-04-30: dozens of `tenant-e2e-canvas-*` tunnels in Down
state with zero replicas, weeks past their tenant's deletion. Same
class of bug as the DNS-records leak that drove sweep-cf-orphans
(controlplane#239).
Parallel-shape to sweep-cf-orphans:
- Same dry-run-by-default + --execute pattern
- Same MAX_DELETE_PCT safety gate (default 90% — higher than DNS
sweep's 50% because tenant-shaped tunnels are orphans by design)
- Same schedule/dispatch hardening (hard-fail on missing secrets
when scheduled, soft-skip when dispatched)
- Cron offset to :45 to avoid CF API bursts colliding with the DNS
sweep at :15
Decision rules (in order):
1. Name doesn't match `tenant-<slug>` → keep (unknown — never sweep
tunnels that might belong to platform infra).
2. Tunnel has active connections (status=healthy or non-empty
connections array) → keep (defense-in-depth: don't kill a live
tunnel even if CP forgot the org).
3. Slug ∈ {prod_slugs ∪ staging_slugs} → keep.
4. Otherwise → delete (orphan).
Verified by:
- shell syntax check (bash -n)
- YAML lint
- Decide-logic offline smoke (7 cases, all pass)
- End-to-end dry-run smoke with stubbed CP + CF APIs
Required secrets (added to existing org-secrets):
CF_API_TOKEN must include account:cloudflare_tunnel:edit
scope (separate from zone:dns:edit used by
sweep-cf-orphans — same token if scope is
broad, or a new token if narrowly scoped).
CF_ACCOUNT_ID account that owns the tunnels (visible in
dash.cloudflare.com URL path).
CP_PROD_ADMIN_TOKEN reused from sweep-cf-orphans.
CP_STAGING_ADMIN_TOKEN reused from sweep-cf-orphans.
Note: the CP-side root cause (tenant-delete should cascade to tunnel
delete) is in molecule-controlplane and worth fixing separately. This
janitor is the operational backstop in the meantime — the same pattern
applied to DNS records while that root cause went unaddressed.
Setup wrote .playwright-staging-state.json at the END (step 7), only
after org create + provision-wait + TLS + workspace create +
workspace-online all succeeded. If setup crashed at steps 1-6, the org existed in
CP but the state file did not, so Playwright's globalTeardown bailed
out ("nothing to tear down") and the workflow safety-net pattern-swept
every e2e-canvas-<today>-* org to compensate. That sweep deleted
concurrent runs' live tenants — including their CF DNS records —
causing victims' next fetch to die with `getaddrinfo ENOTFOUND`.
Race observed 2026-04-30 on PR #2264 staging→main: three real-test
runs killed each other mid-test, blocking 68 commits of staging→main
promotion.
Fix: write the state file as setup's first action, right after slug
generation, before any CP call. Now:
- Crash before slug gen → no state file, no orphan to clean
- Crash during steps 1-6 → state file has slug; teardown deletes
it (DELETE 404s if org never created)
- Setup completes → state file has full state; teardown
deletes the slug
The workflow safety-net no longer pattern-sweeps; it reads the state
file and deletes only the recorded slug. Concurrent canvas-E2E runs no
longer poison each other.
Verified by:
- tsc --noEmit on staging-setup.ts + staging-teardown.ts
- YAML lint on e2e-staging-canvas.yml
- Code review: state file write moved to line 113 (post-makeSlug,
pre-CP) with the original line-249 write retained as a "promote
to full state" overwrite at the end
Acceptance criterion 3 of #2001 ("CI check that fails if TENANT_IMAGE
contains a SHA-shaped suffix") was deferred from PR #2168 because
querying Railway from a GitHub Actions runner needs RAILWAY_TOKEN
plumbed as a repo secret. The detection script + regression test in
#2168 cover detection; this is the automation-cadence layer.
Daily 13:00 UTC schedule (06:00 PT) + workflow_dispatch. Daily is the
right cadence for variables-tier config — Railway env var changes are
deliberate operator actions, low-frequency. Hourly would risk Railway
API rate-limit surprises.
Issue-on-failure pattern mirrors e2e-staging-sanity.yml — drift opens
a `railway-drift` priority-high issue (or comments on the open one),
and a subsequent clean run auto-closes it with a "drift resolved"
comment. No human-in-the-loop needed for the close.
Schedule-vs-dispatch secret hardening per
feedback_schedule_vs_dispatch_secrets_hardening:
- Schedule trigger HARD-FAILS on missing RAILWAY_AUDIT_TOKEN
(silent-success was the failure mode that bit us before)
- workflow_dispatch SOFT-SKIPS so an operator can dry-run the
workflow shape during initial token provisioning
Operator action required before this gate is live:
- Provision a Railway API token, read-only `variables` scope on the
molecule-platform project (id 7ccc8c68-61f4-42ab-9be5-586eeee11768)
- Store as repo secret RAILWAY_AUDIT_TOKEN
- Rotate per the standard 90-day schedule
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Branch protection treats matching-name check runs as a SET — any SKIPPED
member fails the required-check eval, even with SUCCESS siblings. The
two-jobs-sharing-name pattern (no-op + real-job) emits one SKIPPED + one
SUCCESS check run per workflow run; with multiple runs at the same SHA
(detect-changes triggers + auto-promote re-runs) the SET fills with
SKIPPED entries that block branch protection.
Verified live on PR #2264 (staging→main auto-promote): mergeStateStatus
stayed BLOCKED for 18+ hours despite APPROVED + MERGEABLE + all gates
green at the workflow level. `gh pr merge` returned "base branch policy
prohibits the merge"; `enqueuePullRequest` returned "No merge queue
found for branch 'main'". The check-runs API showed `E2E API Smoke
Test` and `Canvas tabs E2E` each had 2 SKIPPED + 2 SUCCESS at head SHA
66142c1e.
Fix: collapse no-op + real-job into ONE job with no job-level `if:`,
gating real work via per-step `if: needs.detect-changes.outputs.X ==
'true'`. The job always runs and emits exactly one SUCCESS check run
under the required-check name regardless of paths-filter outcome —
branch-protection-clean.
Same pattern as ci.yml's earlier conversion of Canvas/Platform/Python/
Shellcheck (PR #2322). Completes the parity fix that should have been
applied to all four path-filtered required checks at once.