Commit Graph

3488 Commits

Author SHA1 Message Date
Hongming Wang
8488a188c2 test(adapter_base): signature snapshot — drift gate for adapter public surface (#2364 item 2)
Every workspace template (langgraph, claude-code, hermes, etc.)
subclasses BaseAdapter. Renaming, removing, or re-typing a method
on the base class silently breaks templates: the override stops
being recognized as an override; callers of the old method name
silently invoke the default no-op; and the new method name stays
unimplemented in templates that haven't migrated.

The recent #87 universal-runtime and #1957 recordResource refactors
both renamed/added methods. Without a frozen snapshot, the next rename
ships quietly and surfaces only when a template's CI catches the
AttributeError days later — long after the merge window for an
easy revert.

This snapshot pins BaseAdapter's public method surface against a
checked-in JSON file. Same-shape pattern as PR #2363's A2A
protocol-compat replay gate, applied to a Python public-API
surface instead of JSON message shapes. Both close drift classes
by snapshotting the structural surface that consumers depend on.

Two tests:

  1. test_base_adapter_signature_matches_snapshot — full
     introspection diff against tests/snapshots/adapter_base_signature.json.
     Drift = test failure with both expected + actual JSON in
     the message so the reviewer sees what changed.
  2. test_snapshot_has_required_methods — defense-in-depth: even
     if both the source AND snapshot are updated together
     (intentional API removal), this catches removal of the
     short list of methods that EVERY template depends on (name,
     display_name, description, capabilities, memory_filename).
     Removing one of these requires explicit edit to the
     `required` set with a justification.

Verified the gate fires red on a deliberate rename
(memory_filename → memory_filename_RENAMED) — failure message
shows the full snapshot diff including parameter shapes and
return annotations.
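The gate itself is Python introspection against tests/snapshots/adapter_base_signature.json; the same drift-gate pattern is sketched below in Go with reflect, purely for illustration (the type and its three methods are stand-ins, not the real adapter surface):

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// BaseAdapter stands in for the template-facing base class; this
// method set is illustrative, not the real adapter surface.
type BaseAdapter struct{}

func (BaseAdapter) Name() string           { return "" }
func (BaseAdapter) DisplayName() string    { return "" }
func (BaseAdapter) MemoryFilename() string { return "" }

// signatureSnapshot maps each exported method to its type string,
// e.g. "func(main.BaseAdapter) string". Serializing this to JSON and
// checking it in means any rename or re-type changes the snapshot.
func signatureSnapshot(v interface{}) map[string]string {
	t := reflect.TypeOf(v)
	snap := make(map[string]string)
	for i := 0; i < t.NumMethod(); i++ {
		m := t.Method(i)
		snap[m.Name] = m.Type.String()
	}
	return snap
}

func main() {
	b, _ := json.MarshalIndent(signatureSnapshot(BaseAdapter{}), "", "  ")
	fmt.Println(string(b))
}
```

Diffing the serialized map against the checked-in copy in a test reproduces the "drift = failure with expected + actual JSON" behavior described above.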

Updating the snapshot is the explicit acknowledgment that a
template-affecting API change is intentional. Reviewer of the
introducing PR sees the snapshot diff and decides whether
template repos need coordinated updates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 05:18:39 -07:00
Hongming Wang
97058d5392
Merge pull request #2377 from Molecule-AI/auto/lazy-heal-helper-direct-tests
test(provision): direct unit tests for readOrLazyHealInboundSecret
2026-04-30 11:44:22 +00:00
Hongming Wang
233a912cbe test(provision): direct unit tests for readOrLazyHealInboundSecret
The helper landed in #2376 and is exercised via chat_files + registry
integration tests. Those tests conflate the helper's behavior with the
caller's response shape — a future refactor that broke the (secret,
healed, err) contract subtly (e.g. returning healed=true on a
read-success path, or swallowing a mint error) might still pass them.

Adds 4 direct sub-tests pinning each branch of the contract:

  - secret already present → (s, false, nil)
  - secret missing, mint succeeds → (minted, true, nil)
  - secret missing, mint fails → ("", false, err)
  - read fails (non-NoInboundSecret) → ("", false, err)

Each sub-case asserts the return tuple shape AND mock.ExpectationsWereMet
(for the success path) so a future helper change that skips a DB op
trips the gate immediately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 04:41:13 -07:00
Hongming Wang
677b4858ab
Merge pull request #2376 from Molecule-AI/auto/lazy-heal-secret-helper
refactor: extract readOrLazyHealInboundSecret to dedup chat_files + registry
2026-04-30 11:14:49 +00:00
Hongming Wang
30a569c742 refactor: extract readOrLazyHealInboundSecret to dedup chat_files + registry
The lazy-heal-on-miss pattern landed in two places this session:
PR #2372 (chat_files.go::resolveWorkspaceForwardCreds — Upload + Download)
and PR #2375 (registry.go::Register). Both implementations did the same
thing:

  read → if ErrNoInboundSecret then mint inline → return outcome

Different response-shape requirements but the same core mechanic. Three
sites' worth of drift potential: any future heal-time condition we add
(audit log, alert, secret rotation, observability) had to be applied to
each site, with partial application silently re-opening the gap.

Fix: extract readOrLazyHealInboundSecret in workspace_provision_shared.go
returning (secret, healed, err). Each caller maps the outcome to its
response shape:

  - chat_files: healed=true → 503 with retry hint; err != nil → 503 with
    RFC-#2312 reprovision hint
  - registry:   healed=true|false + err==nil → include in response;
    err != nil → omit field (workspace can retry on next register)

Net effect:

  - Single source of truth for the read+heal mechanic
  - Response-shape decisions stay in callers (they DO differ per feature)
  - Future heal-time conditions go in one place
  - Behavior preserved: existing TestRegister_NoInboundSecret_LazyHeals,
    TestRegister_NoInboundSecret_LazyHealMintFailureOmitsField,
    TestChatUpload_NoInboundSecret_LazyHeal*,
    TestChatDownload_NoInboundSecret_LazyHeal* all pass unchanged

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 04:11:43 -07:00
Hongming Wang
eef1969e30
Merge pull request #2375 from Molecule-AI/auto/registry-lazy-heal-inbound-secret
fix(registry): lazy-heal platform_inbound_secret on register for legacy workspaces
2026-04-30 10:47:49 +00:00
Hongming Wang
f3f5c4537b fix(registry): lazy-heal platform_inbound_secret on register for legacy workspaces
Pre-fix: a legacy SaaS workspace with NULL platform_inbound_secret
needed two round-trips before chat upload worked:

  1. Workspace registers → response missing platform_inbound_secret
  2. User attempts chat upload → chat_files lazy-heals platform-side
     (RFC #2312 backfill) → 503 + retry-after
  3. Workspace heartbeats → register response now includes the
     freshly-minted secret → workspace writes /configs/.platform_inbound_secret
  4. User retries chat upload → workspace bearer matches → 200

The platform-side lazy-heal in chat_files.go (#2366) closes the
existing-workspace gap, but the user-visible round-trip dance is
still ugly.

Fix: lazy-heal at register time too. When ReadPlatformInboundSecret
returns ErrNoInboundSecret, mint inline and include the freshly-
minted secret in the register response. Collapses the dance to a
single round-trip:

  1. Workspace registers → response includes lazy-healed secret
  2. User attempts chat upload → workspace bearer matches → 200

Failure model: best-effort. Mint failure logs and falls through to
omitting the field (workspace will retry on next register call).
The 200 response status is preserved — register success doesn't
hinge on the inbound-secret heal.

Tests:

  - TestRegister_NoInboundSecret_LazyHeals: pins the success branch.
    Mocks the UPDATE explicitly + asserts ExpectationsWereMet, so a
    regression that skipped the mint would fail loudly. Replaces
    the prior TestRegister_NoInboundSecret_OmitsField which
    "passed" on this branch only because sqlmock-unmatched-UPDATE
    coincidentally drove the omit-field error path.
  - TestRegister_NoInboundSecret_LazyHealMintFailureOmitsField:
    pins the failure branch — explicit UPDATE error → 200 + field
    absent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 03:44:50 -07:00
Hongming Wang
343e164f5f
Merge pull request #2374 from Molecule-AI/auto/wsauth-token-lookup-helper
refactor(wsauth): extract lookupTokenByHash to dedup auth predicate across 3 callers
2026-04-30 10:14:40 +00:00
Hongming Wang
64822dac49 refactor(wsauth): extract lookupTokenByHash to dedup auth predicate across 3 callers
ValidateToken, WorkspaceFromToken, and ValidateAnyToken each duplicated
the same JOIN+WHERE auth predicate:

    FROM workspace_auth_tokens t
    JOIN workspaces w ON w.id = t.workspace_id
    WHERE t.token_hash = $1
      AND t.revoked_at IS NULL
      AND w.status != 'removed'

Same drift class as the SaaS provision-mint bug fixed in #2366. A
future safety addition (e.g. exclude paused workspaces from auth) had
to be applied to all three queries; a partial application would
silently re-open one auth path while closing the others.

Fix: hoist the predicate into lookupTokenByHash, which projects
(id, workspace_id) — the union of fields any caller needs. Each
public function picks what it uses:

  - ValidateToken      — needs both (compares workspaceID, updates last_used_at by id)
  - WorkspaceFromToken — needs workspace_id
  - ValidateAnyToken   — needs id

The trivial perf cost of selecting one extra column per call is worth
the single-source-of-truth guarantee for the auth predicate.
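The hoisted-predicate shape as a sketch — the queryRower indirection stands in for *sql.DB so the example runs without a driver, and the names are illustrative:

```go
package main

import "fmt"

// tokenRow is the union projection any caller needs.
type tokenRow struct {
	ID          string
	WorkspaceID string
}

// lookupTokenQuery is the single source of truth for the auth
// predicate; a future condition (e.g. excluding paused workspaces)
// is added exactly once, here.
const lookupTokenQuery = `
SELECT t.id, t.workspace_id
FROM workspace_auth_tokens t
JOIN workspaces w ON w.id = t.workspace_id
WHERE t.token_hash = $1
  AND t.revoked_at IS NULL
  AND w.status != 'removed'`

// queryRower abstracts the DB handle for this sketch.
type queryRower interface {
	queryRow(query, hash string) (tokenRow, error)
}

func lookupTokenByHash(db queryRower, hash string) (tokenRow, error) {
	return db.queryRow(lookupTokenQuery, hash)
}

// Each public function picks the field(s) it uses from the union.
func workspaceFromToken(db queryRower, hash string) (string, error) {
	row, err := lookupTokenByHash(db, hash)
	return row.WorkspaceID, err
}

func validateAnyToken(db queryRower, hash string) (string, error) {
	row, err := lookupTokenByHash(db, hash)
	return row.ID, err
}

type fakeDB struct{ row tokenRow }

func (f fakeDB) queryRow(_, _ string) (tokenRow, error) { return f.row, nil }

func main() {
	ws, _ := workspaceFromToken(fakeDB{row: tokenRow{ID: "tok-1", WorkspaceID: "ws-1"}}, "hash")
	fmt.Println(ws)
}
```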

Test mock updates: two upstream test files (a2a_proxy_test, middleware
wsauth_middleware_test{,_canvasorbearer_test}) had hand-typed regex
matchers and row shapes pinned to the per-function SELECT projection.
Updated to the unified shape; behavior is unchanged.

All wsauth + middleware + handlers + full-module tests green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 03:11:38 -07:00
Hongming Wang
17760e10d2
Merge pull request #2373 from Molecule-AI/auto/admin-test-token-mock-coverage
test(admin_test_token): pin ADMIN_TOKEN IDOR-fix (#112) gate behavior
2026-04-30 10:02:13 +00:00
Hongming Wang
e403d74a3d test(admin_test_token): pin ADMIN_TOKEN IDOR-fix (#112) gate behavior
The admin test-token endpoint has a critical security check at
admin_test_token.go:64-72 — the IDOR fix from #112 that requires an
explicit ADMIN_TOKEN bearer when the env var is set. Pre-fix, the
route accepted ANY bearer that matched a live org token, allowing
cross-org test-token minting (and therefore cross-org workspace
authentication). The current code uses subtle.ConstantTimeCompare
against ADMIN_TOKEN.

Test coverage of the gate was zero: the existing tests exercised
the ADMIN_TOKEN-unset path (local dev / CI) but never set ADMIN_TOKEN.
A regression that:

  - removed the os.Getenv("ADMIN_TOKEN") check
  - inverted the comparison
  - replaced ConstantTimeCompare with bytes.Equal (timing leak)
  - re-introduced the AdminAuth fallback that allows org tokens

would not fail any test, and the breakage would re-open the IDOR
that #112 closed.

Adds four tests covering the gate matrix:

  - ADMIN_TOKEN set + no Authorization header → 401
  - ADMIN_TOKEN set + wrong Authorization → 401
  - ADMIN_TOKEN set + correct Authorization → 200
  - ADMIN_TOKEN unset + no Authorization → 200 (gate bypassed safely)

The 4-row matrix pins the gate's full truth table: any regression in
either dimension (gate enabled/disabled, header correct/wrong) trips
exactly one test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 02:59:08 -07:00
Hongming Wang
264e726672
Merge pull request #2372 from Molecule-AI/auto/chat-files-resolve-creds-helper
refactor(chat_files): extract resolveWorkspaceForwardCreds shared by Upload+Download
2026-04-30 09:54:56 +00:00
Hongming Wang
501a42d753 refactor(chat_files): extract resolveWorkspaceForwardCreds shared by Upload+Download
The 50-line "resolve URL + read inbound secret + lazy-heal on miss"
block was duplicated nearly verbatim between Upload and Download
handlers. Drift-prone — same class of risk as the original SaaS
provision drift fixed in #2366. A future change like:

  - secret rotation (re-mint when the row's older than X)
  - per-feature audit logging
  - additional fail-closed conditions

would have to be applied to both handlers, and a partial application
that healed Upload but skipped Download would surface only at runtime.

Fix: hoist the shared logic into resolveWorkspaceForwardCreds. The
function takes an op label ("upload"/"download") used in log messages
+ the 503 RFC-#2312 detail copy so operators can still distinguish
which feature ran. Both handlers reduce to:

    wsURL, secret, ok := resolveWorkspaceForwardCreds(c, ctx, workspaceID, "upload")
    if !ok { return }

Net -20 lines (helper amortizes the 50-line block across both call
sites). Existing test coverage (TestChatUpload_NoInboundSecret_*,
TestChatDownload_NoInboundSecret_* from PR #2370) covers all four
branches of the shared helper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 02:51:53 -07:00
Hongming Wang
29368dd749
Merge pull request #2371 from Molecule-AI/auto/team-expand-mint-fix
test(provision): pin PARENT_ID env injection contract in prepareProvisionContext
2026-04-30 09:45:19 +00:00
Hongming Wang
4ba12668f0 test(provision): pin PARENT_ID env injection contract in prepareProvisionContext
#2367 moved PARENT_ID env injection from inline TeamHandler.Expand
into the shared prepareProvisionContext (sourced from
payload.ParentID). The test was missing — a regression that:
  - dropped the injection
  - inverted the nil-check
  - leaked an empty PARENT_ID="" into env

would not fail any existing test, but workspace/coordinator.py reads
PARENT_ID on startup to track the parent-child relationship, so the
breakage would surface only at runtime.

Adds TestPrepareProvisionContext_ParentIDInjection with three
sub-cases:
  - nil ParentID → no PARENT_ID env
  - empty-string ParentID → no PARENT_ID env (don't pollute)
  - set ParentID → PARENT_ID env equals value
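The three sub-cases pin a contract like this sketch (the env map and helper name are illustrative):

```go
package main

import "fmt"

// injectParentID mirrors the pinned contract: nil and empty-string
// ParentID add nothing; a set value injects PARENT_ID into the
// provision-time env.
func injectParentID(env map[string]string, parentID *string) {
	if parentID == nil || *parentID == "" {
		return // don't pollute env with PARENT_ID=""
	}
	env["PARENT_ID"] = *parentID
}

func main() {
	env := map[string]string{}
	p := "ws-parent-1"
	injectParentID(env, &p)
	fmt.Println(env["PARENT_ID"])
}
```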

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 02:41:41 -07:00
Hongming Wang
ab6bcc030c
Merge pull request #2370 from Molecule-AI/auto/lazy-heal-test-coverage
test(chat_files): pin lazy-heal mint contract for both Upload and Download
2026-04-30 09:41:40 +00:00
Hongming Wang
6c065a02e6 test(chat_files): pin lazy-heal mint contract for both Upload and Download
The 2026-04-30 lazy-heal fix in chat_files.go (PR #2366) ATTEMPTS to
mint platform_inbound_secret on miss so legacy workspaces self-heal
without requiring destructive reprovision. The pre-existing
TestChatUpload_NoInboundSecret + TestChatDownload_NoInboundSecret
tests asserted the 503 response shape but did NOT pin that the mint
UPDATE actually fires — they happened to exercise the mint-failure
branch (sqlmock unmatched UPDATE = error = "Failed to mint" code path
returns 503 with "RFC #2312" detail, which still passed the original
assertions).

This means a regression that:
  - skipped the lazy-heal mint entirely
  - inverted the success/failure response branches
  - moved the mint to a different code path

would not fail those tests.

Fix:

  - TestChatUpload_NoInboundSecret_LazyHeal: mock the UPDATE
    successfully; assert sqlmock.ExpectationsWereMet (mint MUST run)
    + body contains "retry" + "30" (success branch).
  - TestChatUpload_NoInboundSecret_LazyHealFailure: mock the UPDATE
    to fail; assert body contains "Reprovision" (failure branch).
  - Same pair for the Download handler — independent code path means
    independent test.

Pins both branches of both handlers (4 tests) so future drift trips
the gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 02:38:28 -07:00
Hongming Wang
d9801d1e62
Merge pull request #2368 from Molecule-AI/auto/team-expand-mint-fix
fix(team): delegate Expand child-provision to shared mint pipeline (#2367)
2026-04-30 09:31:34 +00:00
Hongming Wang
bb52a1a365 fix(team): delegate Expand child-provisioning to shared mint pipeline (#2367)
Closes #2367.

TeamHandler.Expand provisioned child workspaces by directly calling
h.provisioner.Start, skipping mintWorkspaceSecrets and every other
preflight (secrets load, env mutators, identity injection, missing-env,
empty-config-volume auto-recover). Children shipped with NULL
platform_inbound_secret + never-issued auth_token — same drift class as
the SaaS bug just fixed in PR #2366, found while exercising a stronger
gate against this package.

Fix:

- TeamHandler now holds *WorkspaceHandler. Expand delegates each child
  provision to wh.provisionWorkspace, picking up the shared
  prepare/mint/preflight pipeline automatically. Future provision-time
  steps go in ONE place and team-expand inherits them.
- prepareProvisionContext gains PARENT_ID env injection sourced from
  payload.ParentID (which Expand now populates). This preserves the
  signal workspace/coordinator.py reads on startup, without threading
  env through provisioner.WorkspaceConfig manually.
- NewTeamHandler signature gains *WorkspaceHandler; router passes it.

Gate upgrade:

- TestProvisionFunctions_AllCallMintWorkspaceSecrets is now
  behavior-based: it walks every FuncDecl in the package and flags any
  function that calls h.provisioner.Start or h.cpProv.Start without
  also calling mintWorkspaceSecrets. Drift-resistant by construction —
  a future provision function with any name still trips the gate.
- Replaces the name-list version from PR #2366. The name list missed
  Expand precisely because Expand wasn't named provision*; the
  behavior-based detector caught it spontaneously when prototyped.
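A condensed sketch of the behavior-based detector using go/ast. Simplified for illustration: it matches any `.Start` selector rather than resolving the receiver to h.provisioner/h.cpProv, and the embedded source is a toy:

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

const src = `package handlers

func provisionGood(h *H) {
	h.mintWorkspaceSecrets()
	h.provisioner.Start()
}

func expandBad(h *H) {
	h.provisioner.Start() // no mint call: must be flagged
}`

// startWithoutMint walks every FuncDecl and flags any function that
// calls .Start without also calling mintWorkspaceSecrets —
// drift-resistant by construction, since a future provision function
// with any name still trips the gate.
func startWithoutMint(filesrc string) ([]string, error) {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, "handlers.go", filesrc, 0)
	if err != nil {
		return nil, err
	}
	var flagged []string
	for _, decl := range f.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		if !ok || fn.Body == nil {
			continue
		}
		callsStart, callsMint := false, false
		ast.Inspect(fn.Body, func(n ast.Node) bool {
			call, ok := n.(*ast.CallExpr)
			if !ok {
				return true
			}
			if sel, ok := call.Fun.(*ast.SelectorExpr); ok {
				switch sel.Sel.Name {
				case "Start":
					callsStart = true
				case "mintWorkspaceSecrets":
					callsMint = true
				}
			}
			return true
		})
		if callsStart && !callsMint {
			flagged = append(flagged, fn.Name.Name)
		}
	}
	return flagged, nil
}

func main() {
	flagged, err := startWithoutMint(src)
	if err != nil {
		panic(err)
	}
	fmt.Println(flagged) // prints [expandBad]
}
```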

Tests: full workspace-server module green; gate previously verified to
fire red on Expand pre-fix and on deliberate mintWorkspaceSecrets
removal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 02:28:29 -07:00
Hongming Wang
b9b0a46f2e
Merge pull request #2366 from Molecule-AI/auto/workspace-provision-shared
fix(provision): share Docker+SaaS prepare path so both mint workspace secrets (RFC #2312)
2026-04-30 09:21:38 +00:00
Hongming Wang
3f8286ea47 fix(provision): share Docker+SaaS prepare path so both mint workspace secrets (RFC #2312)
Root cause of the 2026-04-30 silent-503 chat-upload bug: provisionWorkspaceCP
(SaaS) skipped issueAndInjectInboundSecret while provisionWorkspaceOpts
(Docker) called it. Every prod SaaS workspace provisioned with NULL
platform_inbound_secret → upload returned 503 with the v2-enrollment
message on every attempt.

Structural fix:
- Extract prepareProvisionContext (secrets load, env mutators, preflight,
  cfg build), mintWorkspaceSecrets (auth_token + platform_inbound_secret),
  markProvisionFailed (broadcast + DB update) into workspace_provision_shared.go
- Refactor both provision modes to call the shared helpers
- Add provisionAbort struct so the missing-env failure class can carry its
  structured "missing" payload through the shared abort path
- Unify last_sample_error: previously the decrypt-fail path skipped it while
  others set it; users now see every failure class in the UI

Drift prevention:
- AST gate TestProvisionFunctions_AllCallMintWorkspaceSecrets asserts every
  function in the provisionFunctions set calls mintWorkspaceSecrets at least
  once (same shape as the audit-coverage gate from #335). New provision paths
  must either call mint or be added to provisionExemptFunctions with a
  one-line justification
- Behavioral test TestMintWorkspaceSecrets_PersistsInboundSecretInSaaSMode
  pins the contract: SaaS mode MUST persist platform_inbound_secret to the DB
  column even though it skips file injection

Existing-workspace recovery (chat_files.go lazy-heal):
- Upload + Download handlers detect NULL platform_inbound_secret and call
  IssuePlatformInboundSecret inline, returning 503 with retry_after_seconds=30
- Self-heals workspaces that were provisioned before this fix without
  requiring destructive reprovision

Tests: full handlers + workspace-server module green; AST gate verified to
fire red on deliberate violation (commented-out mint call surfaces the
exact function name + actionable remediation message).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 02:18:08 -07:00
Hongming Wang
49c3433a70
Merge pull request #2365 from Molecule-AI/auto/cf-dead-origin-status-gap
fix(a2a): cover CF 521/522/523 in dead-origin status set
2026-04-30 08:51:28 +00:00
Hongming Wang
e06ebaefdf
Merge pull request #2346 from Molecule-AI/auto/issue-2341-migration-collision
ci: hard gate against migration version collisions (#2341)
2026-04-30 08:50:19 +00:00
Hongming Wang
b5df2126b9 fix(test): convert migration-collision tests from pytest to unittest (#2341)
CI failure: the Ops scripts (unittest) job runs `python -m unittest
discover` which doesn't have pytest installed. test_check_migration_
collisions.py imported pytest unconditionally, failing module import:

  ImportError: Failed to import test module: test_check_migration_collisions
  Traceback (most recent call last):
    File ".../test_check_migration_collisions.py", line 12, in <module>
      import pytest
  ModuleNotFoundError: No module named 'pytest'

The tests use no pytest-specific features (just bare assert + plain
class). Sibling test_sweep_cf_decide.py in the same dir already uses
unittest.TestCase. Convert this one to match: drop the pytest import,
make TestMigrationFileRe inherit from unittest.TestCase.

unittest.TestLoader.discover() requires TestCase subclasses for
auto-discovery, so the fix is two lines (drop import, add base).
Bare assert statements work fine inside TestCase methods.

Verified: `python3 -m unittest scripts.ops.test_check_migration_collisions -v`
runs all 9 tests, all pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 01:47:27 -07:00
Hongming Wang
8e508a7a2f fix(a2a): cover CF 521/522/523 in dead-origin status set
Independent review on PR #2362 caught: the dead-agent classifier at
a2a_proxy.go included 502/503/504/524 but missed the rest of the CF
origin-failure family (521/522/523), which are MORE indicative of a
dead EC2 than 524:

- 521 "Web server is down" — CF can't open TCP to origin (most direct
  dead-EC2 signal; fires when the workspace EC2 has been terminated
  and CF still has the CNAME pointing at it).
- 522 "Connection timed out" — TCP didn't complete in ~15s (typical
  of SG/NACL flap or agent process hung on accept).
- 523 "Origin is unreachable" — CF can't route to origin (DNS gone,
  network path broken).

Pre-fix any of these would propagate as-is to the canvas and the user
would see a 5xx without the reactive auto-restart firing — exactly
the SaaS-blind class of failure PR #2362 was meant to close.

Refactor: extracted isUpstreamDeadStatus(int) helper so the matrix is
in one place, with TestIsUpstreamDeadStatus locking in 18 status
codes (7 dead, 11 not-dead including 520 and 525 which look CF-shaped
but indicate different failures).
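A sketch of the extracted helper — the seven dead statuses match the matrix above, and 520/525 stay deliberately excluded:

```go
package main

import "fmt"

// isUpstreamDeadStatus keeps the dead-origin matrix in one place:
// gateway 5xx (502/503/504) plus the Cloudflare origin-failure family
// 521/522/523/524. 520 ("unknown error") and 525 (SSL handshake
// failed) look CF-shaped but indicate different failures, so they
// must not trigger a restart.
func isUpstreamDeadStatus(status int) bool {
	switch status {
	case 502, 503, 504, 521, 522, 523, 524:
		return true
	}
	return false
}

func main() {
	for _, s := range []int{502, 520, 521, 525} {
		fmt.Println(s, isUpstreamDeadStatus(s))
	}
}
```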

Also tightened TestStopForRestart_NoProvisioner_NoOp per the same
review: now uses sqlmock.ExpectationsWereMet to assert the dispatcher
doesn't touch the DB on the both-nil path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 01:39:04 -07:00
Hongming Wang
2742c3c837
Merge pull request #2363 from Molecule-AI/auto/a2a-corpus-replay-gate
test(a2a): protocol-shape replay corpus gate (#2345 follow-up)
2026-04-30 08:29:20 +00:00
Hongming Wang
747c12e582 test(a2a): protocol-shape replay corpus gate (#2345 follow-up)
Backward-compat replay gate for the A2A JSON-RPC protocol surface.
Every PR that touches normalizeA2APayload OR bumps the a-2-a-sdk
version pin runs every shape in testdata/a2a_corpus/ through the
current code and asserts:

  valid/   — every shape MUST parse without error and produce a
             canonical v0.3 payload (params.message.parts list).

  invalid/ — every shape MUST be rejected with the documented
             status code and error substring.

What this prevents

The 2026-04-29 v0.2 → v0.3 silent-drop bug (PR #2349) shipped
because the SDK bump PR didn't replay v0.2-shaped inputs against
the new code; the shape-mismatch surfaced only in production when
the receiver's Pydantic validator silently rejected inbound
messages.

This gate would have caught it pre-merge. Hand-verified: reverting
the v0.2 string→parts shim in normalizeA2APayload fails 3 of the
v0.2 corpus entries with the exact rejection class the production
bug exhibited.
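The shim's mechanic, sketched on a bare map. The field handling follows the shapes described above; the messageId auto-fill value and error wording are illustrative:

```go
package main

import "fmt"

// normalizeV02Message sketches the v0.2 string→parts compat shim:
// v0.2 carries "content" (a plain string, or already a Part list);
// canonical v0.3 carries "parts" (a list of typed Parts) and must
// not also carry "content".
func normalizeV02Message(msg map[string]any) (map[string]any, error) {
	if _, ok := msg["parts"]; ok {
		return msg, nil // already canonical v0.3
	}
	content, ok := msg["content"]
	if !ok {
		return nil, fmt.Errorf("message has neither parts nor content")
	}
	switch c := content.(type) {
	case string:
		msg["parts"] = []any{map[string]any{"type": "text", "text": c}}
	case []any:
		msg["parts"] = c // v0.2 with content as a Part list
	default:
		return nil, fmt.Errorf("content has unsupported type %T", content)
	}
	delete(msg, "content") // canonical payload deletes the legacy field
	if _, ok := msg["messageId"]; !ok {
		msg["messageId"] = "auto-generated" // real code mints an ID
	}
	return msg, nil
}

func main() {
	out, err := normalizeV02Message(map[string]any{"content": "hi"})
	fmt.Println(out, err)
}
```

Reverting this shim is exactly the hand-test described below: v0.2 string-content entries stop producing a parts list and the corpus gate fails.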

Corpus contents (14 entries)

valid/ (11):
  v0_2_string_content              — basic v0.2 (the broken case)
  v0_2_string_content_no_message_id — v0.2 + auto-fill messageId
  v0_2_list_content                — v0.2 with content as Part list
  v0_3_parts_text_only             — canonical v0.3
  v0_3_parts_multi_text            — multi-Part list
  v0_3_parts_with_file             — multimodal (text + file)
  v0_3_parts_with_context          — contextId for multi-turn
  v0_3_streaming_method            — message/stream variant
  v0_3_unicode_text                — emoji + multi-script
  v0_3_long_text                   — 10KB text Part
  no_jsonrpc_envelope              — bare params/method without
                                     outer envelope (legacy senders)

invalid/ (3):
  no_content_or_parts              — message has neither field
  content_is_integer               — wrong type for v0.2 content
  content_is_bool                  — wrong type, separate from int
                                     so the failure msg identifies
                                     which type-class regressed

Plus 4 inline malformed-JSON cases (truncated, not-JSON, empty,
whitespace) that can't be expressed as JSON corpus entries.

Coverage tests

The gate has five test functions:

1. TestA2ACorpus_ValidShapesParse        — replay valid/ corpus,
   assert no error + canonical v0.3 output (parts list non-empty,
   messageId non-empty, content field deleted).
2. TestA2ACorpus_InvalidShapesRejected   — replay invalid/ corpus,
   assert rejection matches recorded status + error substring.
3. TestA2ACorpus_MalformedJSONRejected   — inline cases for
   non-parseable bodies.
4. TestA2ACorpus_HasMinimumCoverage      — at least one v0.2 +
   one v0.3 entry exists (loses neither side of the bridge).
5. TestA2ACorpus_EveryEntryHasMetadata   — _comment/_added/_source
   on every entry per the README policy; _expect_error and
   _expect_status on invalid entries.

Documentation

testdata/a2a_corpus/README.md describes the corpus contract:
  - When to add entries (new SDK shape, new production-observed
    shape).
  - When NOT to add (test scaffolding, hypothetical futures).
  - Removal policy (breaking change, deprecation window required).

Verification

- All 24 corpus subtests pass on current main.
- Hand-test: revert the v0.2 compat shim → 3 v0.2 entries fail
  the gate with the exact rejection class the production bug
  exhibited. Confirmed.
- Whole-module go test ./... green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 01:26:02 -07:00
Hongming Wang
344e3e8914
Merge pull request #2362 from Molecule-AI/auto/a2a-upstream-5xx-mark-dead
fix(a2a): detect dead EC2 agents on upstream 5xx + reactive auto-restart for SaaS
2026-04-30 08:14:14 +00:00
Hongming Wang
a27cf8f39f fix(restart): extract stopForRestart helper + add 524 to dead-agent list
Addresses code-review C1 (test goroutine race) and I2 (CF 524) on PR #2362.

C1: TestRunRestartCycle_SaaSPath_DispatchesViaCPProv invoked runRestartCycle
end-to-end, which spawns `go h.sendRestartContext(...)`. That goroutine
outlived the test, then read db.DB while the next test's setupTestDB wrote
to it — DATA RACE under -race, cascading 30+ failures across the handlers
suite. Refactored: extracted `stopForRestart(ctx, id)` from runRestartCycle
as a pure dispatcher, and rewrote the SaaS-path test to call it directly
(no async goroutine spawned). Added a no-provisioner no-op guard test.

I2: Cloudflare 524 ("origin timed out") now triggers maybeMarkContainerDead
alongside 502/503/504. Same upstream signal — origin agent unresponsive.

Verified `go test -race -count=1 ./internal/handlers/...` green locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 00:58:22 -07:00
Hongming Wang
28b4e38002 fix(restart): branch provisionWorkspace dispatch on cpProv (PR #2362 amendment)
Independent review of #2362 caught a Critical gap: the previous commit
fixed the Stop dispatch in runRestartCycle but left the provisionWorkspace
dispatch unconditionally Docker-only. So on SaaS the auto-restart cycle
would Stop the EC2 successfully (good), then NPE inside provisionWorkspace's
`h.provisioner.VolumeHasFile` call. coalesceRestart's recover()-without-
re-raise (a deliberate platform-stability safeguard) silently swallowed
the panic, leaving the workspace permanently stuck in status='provisioning'
because the UPDATE on workspace_restart.go:450 had already run.

Net pre-amendment effect on SaaS: dead agent → structured 503 (good) →
workspace flipped to 'offline' (good) → cpProv.Stop succeeded (good) →
provisionWorkspace NPE swallowed (bad) → workspace permanently
'provisioning' until manual canvas restart. The headline claim of #2362
("SaaS auto-restart now works") was false on the path it shipped.

Fix: dispatch the reprovision call the same way every other call site
in the package does (workspace.go:431-433, workspace_restart.go:197+596) —
branch on `h.cpProv != nil` and call provisionWorkspaceCP for SaaS,
provisionWorkspace for Docker.

Tests:

- New TestRunRestartCycle_SaaSPath_DispatchesViaCPProv asserts cpProv.Stop
  is called when the SaaS path runs (would have caught the NPE if
  provisionWorkspace had been called instead).

- fakeCPProv updated: methods record calls and return nil/empty by
  default rather than panicking. The previous "panic on unexpected call"
  pattern was unsafe — the panic fires on the async restart goroutine
  spawned by maybeMarkContainerDead AFTER the test assertions ran, so
  the test passed by accident even though the production path was
  broken (which is exactly how the Critical bug landed).

- Existing tests still pass (full handlers + provisioner suites green).

Branch-count audit refresh:

  runRestartCycle dispatch decisions:
    1. h.provisioner != nil → provisioner.Stop + provisionWorkspace ✓ (existing tests)
    2. h.cpProv != nil       → cpProv.Stop + provisionWorkspaceCP   ✓ (NEW test)
    3. both nil              → coalesceRestart never called (RestartByID gate) ✓

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 00:35:51 -07:00
Hongming Wang
9f35788aee fix(a2a): detect dead EC2 agents on upstream 5xx + reactive auto-restart for SaaS
Class-of-bugs fix, surfaced when hongmingwang.moleculesai.app's canvas
chat to a dead workspace returned a generic Cloudflare 502 page on
2026-04-30. Three independent gaps in the reactive-health path
together leaked dead-agent failures to canvas with no auto-recovery.

## Bug 1 — maybeMarkContainerDead is a no-op for SaaS tenants

`maybeMarkContainerDead` only consulted `h.provisioner` (local Docker
provisioner). SaaS tenants set `h.cpProv` (CP-backed EC2 provisioner)
and leave `h.provisioner` nil — so the function early-returned false
on every call and dead EC2 agents never triggered the offline-flip /
broadcast / restart cascade.

Fix: extend `CPProvisionerAPI` interface with `IsRunning(ctx, id)
(bool, error)` (already implemented on `*CPProvisioner`; just needs
to surface on the interface). `maybeMarkContainerDead` now branches:
local-Docker path uses `h.provisioner.IsRunning`; SaaS path uses
`h.cpProv.IsRunning` which calls the CP's `/cp/workspaces/:id/status`
endpoint to read the EC2 state.

## Bug 2 — RestartByID short-circuits on `h.provisioner == nil`

Same shape as Bug 1: the auto-restart cascade triggered by
`maybeMarkContainerDead` calls `RestartByID` which short-circuited
when the local Docker provisioner was missing. So even if Bug 1 were
fixed, the workspace-offline state would never recover.

Fix: change the gate to `h.provisioner == nil && h.cpProv == nil`
and update `runRestartCycle` to branch on which provisioner is
wired for the Stop call. (The HTTP `Restart` handler already does
this branching correctly — we're just bringing the auto-restart path
to parity.)

## Bug 3 — upstream 502/503/504 propagated as-is, masked by Cloudflare

When the agent's tunnel returns 5xx (the "tunnel up but no origin"
shape — agent process dead but cloudflared connection still healthy),
`dispatchA2A` succeeds at the transport layer and hands back the 5xx
status. `handleA2ADispatchError`'s reactive-health path doesn't run
because that path is only triggered on transport-level errors. The
pre-fix code propagated the 502 status to canvas; Cloudflare in front
of the platform then masked the 502 with its own opaque "error code:
502" page, hiding any structured response and any Retry-After hint.

Fix: in `proxyA2ARequest`, when the upstream returns 502/503/504, run
`maybeMarkContainerDead` BEFORE propagating. If IsRunning confirms
the agent is dead → return a structured 503 with restarting=true +
Retry-After (CF doesn't mask 503s the same way). If running, propagate
the original status (don't recycle a healthy agent on a transient
hiccup — it might have legitimately returned 502).
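The decision table for the upstream-5xx branch, as a minimal sketch (the function name and Retry-After value here are illustrative; the real code lives in `proxyA2ARequest`):

```go
package main

import (
	"fmt"
	"net/http"
)

// classifyUpstream maps an upstream status + the dead-check result to
// what the caller sees: a structured 503 with Retry-After when the
// agent is confirmed dead (CF doesn't mask 503s the same way),
// otherwise the original status untouched.
func classifyUpstream(status int, agentDead bool) (int, map[string]string) {
	switch status {
	case http.StatusBadGateway, http.StatusServiceUnavailable, http.StatusGatewayTimeout:
		if agentDead {
			// dead agent confirmed: structured restarting=true response
			return http.StatusServiceUnavailable, map[string]string{"Retry-After": "15"}
		}
	}
	// healthy agent or non-5xx: propagate as-is (it might have
	// legitimately returned 502 on a transient hiccup)
	return status, nil
}

func main() {
	code, hdr := classifyUpstream(502, true)
	fmt.Println(code, hdr["Retry-After"])
}
```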

## Drive-by — a2aClient transport timeouts

a2aClient was a bare `&http.Client{}` with no Transport timeouts. When
a workspace's EC2 black-holes TCP connection attempts (instance
terminated mid-flight, SG flipped, NACL bug), the OS default connect
timeout is 75s on Linux / 21s on macOS —
long enough for Cloudflare's ~100s edge timeout to fire first and
surface a generic 502. Added DialContext (10s connect), TLSHandshake
(10s), and ResponseHeaderTimeout (60s). Client.Timeout DELIBERATELY
unset — that would pre-empt slow-cold-start flows (Claude Code OAuth
first-token, multi-minute agent synthesis). Long-tail body streaming
is still governed by per-request context deadline.

## Tests

- `TestMaybeMarkContainerDead_CPOnly_NotRunning` — IsRunning(false) →
  marks workspace offline, returns true.
- `TestMaybeMarkContainerDead_CPOnly_Running` — IsRunning(true) →
  no offline-flip, returns false (don't recycle a healthy agent).
- `TestProxyA2A_Upstream502_TriggersContainerDeadCheck` — agent server
  returns 502 + cpProv reports dead → caller gets 503 with restarting=
  true and Retry-After: 15.
- `TestProxyA2A_Upstream502_AliveAgent_PropagatesAsIs` — same upstream
  502 but cpProv reports running → propagates 502 (existing behavior;
  safety check that prevents over-eager recycling).
- Existing `TestMaybeMarkContainerDead_NilProvisioner` /
  `TestMaybeMarkContainerDead_ExternalRuntime` still pass.
- Full handlers + provisioner test suites pass.

## Impact

Pre-fix: dead EC2 agent on a SaaS tenant → CF-masked 502 to canvas, no
auto-recovery, manual restart from canvas required.

Post-fix: dead EC2 agent on a SaaS tenant → structured 503 with
restarting=true + Retry-After to canvas, workspace flipped to offline,
auto-restart cycle triggered. Canvas can show a user-actionable
"agent is restarting, please wait" message instead of a generic 502.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 00:28:22 -07:00
Hongming Wang
92a29bb37c
Merge pull request #2360 from Molecule-AI/auto/2358-followup-permissions-concurrency
fix(ci): close gaps in auto-promote dispatch tail (#2358 follow-up)
2026-04-30 07:07:21 +00:00
Hongming Wang
26d5c5ba1f fix(ci): close gaps in auto-promote dispatch tail (#2358 follow-up)
Independent review of #2358 surfaced three gaps that the original
self-review missed. All three would manifest only on the FIRST real
staging→main promotion through the new tail step, so they'd silently
re-introduce the deploy-chain bug #2357 was supposed to fix.

1. **Missing `actions: write` permission.** `gh workflow run` POSTs to
   `/repos/.../actions/workflows/.../dispatches`, which requires the
   actions:write scope on GITHUB_TOKEN. The job had only contents:write
   + pull-requests:write, so the dispatch call would 403 on every run
   and the publish chain would still not fire. Adding the scope.

2. **No workflow-level concurrency block.** When CI + E2E Staging
   Canvas + E2E API Smoke + CodeQL all complete within seconds of each
   other on a green staging push (the typical case), four separate
   workflow_run events fire and four parallel auto-promote runs all
   reach the dispatch tail. They poll the same PR, all observe the
   same mergedAt, and all call `gh workflow run` — producing 2-4×
   redundant publish builds racing for the same `:staging-latest`
   retag and 2-4× canary-verify chains. Added
   `concurrency.group: auto-promote-staging, cancel-in-progress: false`.
   cancel-in-progress=false because killing a polling tail that's
   about to dispatch would re-introduce the original bug.

3. **PR closed-without-merge ties up a runner for 30 min.** If the
   merge queue rejects the PR (gates flip red post-approval), or an
   operator closes it manually, mergedAt stays null forever and the
   loop polls 60 × 30s burning a runner slot. Now also reads `state`
   in the same `gh pr view` call and breaks early when STATE=CLOSED.

Verification on this PR is structural (workflow won't fire on a
staging→main promotion until this lands AND a subsequent staging
push triggers auto-promote). The actions:write fix in particular is
unverifiable until the next real run — the prior #2358 fix has
the same property, so we're stacking two unverifiable workflow
edits. That's intentional rather than risky: stage 1 (#2358) was
load-bearing for the deploy-chain restoration; stage 2 (this PR)
hardens it before it actually matters.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 00:03:31 -07:00
Hongming Wang
d850ec7c8c
Merge pull request #2358 from Molecule-AI/auto/issue-2357-promote-dispatch-chain
fix(ci): dispatch publish chain after auto-promote merge (#2357)
2026-04-30 06:36:02 +00:00
Hongming Wang
9a7f61661b fix(ci): dispatch publish chain after auto-promote merge (#2357)
The auto-promote staging → main flow uses `gh pr merge --auto` with
GITHUB_TOKEN, which means GitHub suppresses downstream `push` events on
the resulting main commit. This is documented behavior — events created
by GITHUB_TOKEN do not trigger new workflow runs, with workflow_dispatch
and repository_dispatch as the only exceptions.

Effect: when the merge queue lands the auto-promote PR, the main push
DOES NOT fire publish-workspace-server-image. canary-verify + the
:staging-<sha> → :latest retag never run, so redeploy-tenants-on-main
also never fires. Tenants stay on stale code until someone manually
dispatches the chain (which is what just happened for issue #2339).

Fix here: after enqueuing auto-merge, poll for the PR to land, then
explicitly `gh workflow run publish-workspace-server-image.yml --ref
main`. workflow_dispatch is the documented exception, so the dispatch
event itself DOES create a new run. canary-verify and
redeploy-tenants-on-main chain via workflow_run as before.

Long-term (tracked in #2357): switch the auto-merge call above to a
GitHub App token (actions/create-github-app-token) so the merge event
itself can trigger the downstream chain naturally; the polling tail
becomes deletable.

Why a 30-min poll cap: merge queue typically lands a green promote PR
within 5-10 min. 30 min covers a slow CI run without hanging the
workflow indefinitely. If the merge times out, the step warns and
exits 0 — operator can manually dispatch as a fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 23:31:13 -07:00
Hongming Wang
c1f993ca36
Merge pull request #2355 from Molecule-AI/auto/issue-2339-pr4-e2e-poll-roundtrip
test(e2e): poll-mode + since_id cursor round-trip (#2339 PR 4)
2026-04-30 06:25:29 +00:00
Hongming Wang
08252b3cd7 fix(e2e): use real UUIDs for poll-mode test workspace ids
CI run on PR #2355 surfaced `pq: invalid input syntax for type uuid:
ws-poll-e2e-1777529293-3363` — workspaces.id is UUID-typed and the
hand-rolled "ws-<tag>" shape fails the cast. Phase 1 returned
generic 'registration failed' which cascaded into Phase 3 'lookup
failed' (resolveAgentURL on a non-existent row) and Phase 4 'missing
workspace auth token' (no token extracted because Phase 1 didn't run
the bootstrap path).

Generate v4 UUIDs via uuidgen (with a python3 fallback), one each
for the poll workspace, the caller workspace, and the Phase 2
invalid-mode probe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 23:10:36 -07:00
Hongming Wang
a495b86a06 test(e2e): poll-mode + since_id cursor round-trip (#2339 PR 4)
End-to-end coverage for the canvas-chat unblocker. Exercises every
moving part of the #2339 stack against a real platform instance:

Phase 1 — register a workspace as delivery_mode=poll WITHOUT a URL;
verify the response carries delivery_mode=poll.
Phase 2 — invalid delivery_mode rejected with 400 (typo defense).
Phase 3 — POST A2A to the poll-mode workspace; verify proxyA2ARequest
short-circuits and returns 200 {status:queued, delivery_mode:poll,
method:message/send} without ever resolving an agent URL.
Phase 4 — verify the queued message appears in /activity?type=a2a_receive
with the right method + payload (the polling agent reads from here).
Phase 5 — since_id cursor returns ASC-ordered rows STRICTLY AFTER the
cursor; the cursor row itself must NOT be replayed. Sends two
follow-up messages and asserts ordering: rows[0] is the older new
event, rows[-1] is the newer.
Phase 6 — unknown / pruned cursor returns 410 Gone with an explanation.
Phase 7 — cross-workspace cursor isolation: a UUID belonging to one
workspace cannot be used to peek at another workspace's feed (returns
410, same as pruned, no info leak).

Idempotent: per-run unique workspace ids (date+pid). Trap-based cleanup
deletes the test rows on exit; no e2e_cleanup_all_workspaces call (see
feedback_never_run_cluster_cleanup_tests_on_live_platform.md).

Wired into .github/workflows/e2e-api.yml so it runs on every PR that
touches workspace-server/, tests/e2e/, or the workflow file itself —
same gate as the existing test_a2a_e2e + test_notify_attachments suites.

Stacked on #2354 (PR 3: since_id cursor).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 23:07:10 -07:00
Hongming Wang
b5bde0399a
Merge pull request #2354 from Molecule-AI/auto/issue-2339-pr3-activity-cursor
feat(activity): since_id cursor on GET /activity (#2339 PR 3)
2026-04-30 05:55:05 +00:00
Hongming Wang
a81b0e1e3d feat(activity): since_id cursor on GET /activity (#2339 PR 3)
Telegram getUpdates / Slack RTM shape: poll-mode workspaces pass the id
of the last activity_logs row they consumed; the server returns rows
strictly after it in chronological (ASC) order. Existing callers that
don't pass since_id keep DESC + most-recent-N — backwards-compatible.

Cursor lookup is scoped by workspace_id so a caller cannot enumerate or
peek at another workspace's events by passing a UUID belonging to a
different workspace. Cross-workspace and pruned cursors both return
410 Gone — no information leak (caller cannot distinguish "row never
existed" from "row exists but you can't see it").

since_id + since_secs both apply (AND). When since_id is set, the order
flips to ASC because polling consumers need recorded order; the
recent-feed shape (no since_id) keeps DESC.

Tests:
- TestActivityHandler_SinceID_ReturnsNewerASC — cursor lookup → main
  query with cursorTime + ASC ordering.
- TestActivityHandler_SinceID_CursorNotFound_410 — pruned/unknown cursor.
- TestActivityHandler_SinceID_CrossWorkspaceCursor_410 — UUID belongs to
  another workspace, scoped lookup hides it (same 410 path, no leak).
- TestActivityHandler_SinceID_CombinedWithSinceSecs — placeholder index
  arithmetic with both filters.

Stacked on #2353 (PR 2: poll-mode short-circuit).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 22:51:52 -07:00
Hongming Wang
706a388806
Merge pull request #2353 from Molecule-AI/auto/issue-2339-pr2-poll-shortcircuit-v2
feat(a2a): poll-mode short-circuit in ProxyA2A (#2339 PR 2)
2026-04-30 05:29:03 +00:00
Hongming Wang
91a1d5377d feat(a2a): poll-mode short-circuit in ProxyA2A (#2339 PR 2)
Skip SSRF/dispatch and queue to activity_logs for delivery_mode=poll
workspaces. The polling agent (e.g. molecule-mcp-claude-channel on an
operator's laptop) consumes via GET /activity?since_id= in PR 3 — no
public URL needed.

Order: budget -> normalize -> lookupDeliveryMode short-circuit ->
resolveAgentURL. Normalizing before the short-circuit keeps the
JSON-RPC method name on the activity_logs row so the polling agent
can dispatch correctly.

Fail-closed-to-push: any DB error reading delivery_mode defaults to
push (loud + recoverable) rather than poll (silent drop).

Tests:
- TestProxyA2A_PollMode_ShortCircuits_NoSSRF_NoDispatch — core invariant:
  no resolveAgentURL, no Do(), records to activity_logs, returns 200
  {status:"queued",delivery_mode:"poll",method:"message/send"}.
- TestProxyA2A_PushMode_NoShortCircuit — push path unaffected; the agent
  server actually receives the request.
- TestProxyA2A_PollMode_FailsClosedToPush — DB error on mode lookup
  must NOT silently queue; falls through to the push path.

Stacked on #2348 (PR 1: schema + register flow).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 22:22:28 -07:00
Hongming Wang
3da2392f95
Merge pull request #2348 from Molecule-AI/auto/issue-2339-pr1-delivery-mode
feat(workspaces): delivery_mode column + poll-mode register flow (#2339 PR 1)
2026-04-30 05:18:03 +00:00
Hongming Wang
ec6e47cbe3
Merge pull request #2351 from Molecule-AI/auto/issue-2344-architecture-lint
test(arch): codify 4 module boundaries as architecture tests (#2344)
2026-04-30 05:16:09 +00:00
Hongming Wang
68f18424f5 test(arch): codify 4 module boundaries as architecture tests (#2344)
Hard gate #4: codified module boundaries as Go tests, so a new
contributor (or AI agent) can't silently land an import that crosses
a layer.

Boundaries enforced (one architecture_test.go per package):

- wsauth has no internal/* deps — auth leaf, must be unit-testable in
  isolation
- models has no internal/* deps — pure-types leaf, reverse dep would
  create cycles since most packages depend on models
- db has no internal/* deps — DB layer below business logic, must be
  testable with sqlmock without spinning up handlers/provisioner
- provisioner does not import handlers or router — unidirectional
  layering: handlers wires provisioner into HTTP routes; the reverse
  is a cycle

Each test parses .go files in its package via go/parser (no x/tools
dep needed) and asserts forbidden import paths don't appear. Failure
messages name the rule, the offending file, and explain WHY the
boundary exists so the diff reviewer learns the rule.

Note: the original issue's first two proposed boundaries
(provisioner-no-DB, handlers-no-docker) don't match the codebase
today — provisioner already imports db (PR #2276 runtime-image
lookup) and handlers hold *docker.Client directly (terminal,
plugins, bundle, templates). I picked the four boundaries that
actually hold; the first two are aspirational and would need a
refactor before they could be codified.

Hand-tested by injecting a deliberate wsauth -> orgtoken violation:
the gate fires red with the rule message before merge.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 22:12:58 -07:00
Hongming Wang
664bbd8899
Merge pull request #2350 from Molecule-AI/auto/issue-2342-continuous-synth-e2e
ci: continuous synthetic E2E against staging (#2342)
2026-04-30 05:08:35 +00:00
Hongming Wang
0b83faa33c
Merge pull request #2349 from Molecule-AI/auto/issue-2345-a2a-v02-compat-clean
fix(a2a): v0.2 → v0.3 compat shim at proxy edge (#2345)
2026-04-30 05:05:04 +00:00
Hongming Wang
db5d11ffca ci: continuous synthetic E2E against staging (#2342)
Hard gate Tier 2 item 2 of 4. Cron-driven full-lifecycle E2E that
catches regressions visible only at runtime — schema drift,
deployment-pipeline gaps, vendor outages, env-var rotations,
DNS / CF / Railway side-effects.

Empirical motivation from today:
  - #2345 (A2A v0.2 silent drop) — passed unit tests, broke at JSON-RPC
    parse layer between sender + receiver. Visible only when a sender
    exercises the full path. Now-fixed by PR #2349, but a continuous
    E2E would have surfaced it within 20 min of the regression.
  - RFC #2312 chat upload — landed staging-branch but never reached
    staging tenants because publish-workspace-server-image was main-
    only. Caught by manual dogfooding hours after deploy. Same pattern.

Both classes are invisible to PR-time CI. The continuous gate fires
every 20 min against a real staging tenant and surfaces regressions
within minutes.

Cadence: cron `0,20,40 * * * *` (3x/hour). Offsets the existing
sweep-cf-orphans (:15) and sweep-cf-tunnels (:45) so the three ops
don't burst CF/AWS APIs at the same minute. Concurrency group
prevents overlapping runs if one hangs.

Cost: ~$0.50-1/day GHA + pennies of staging tenant lifecycle.

Reuses existing tests/e2e/test_staging_full_saas.sh — no new harness
to maintain. Bounded at 10 min wall-clock (vs 15 min default) so
stuck runs fail fast rather than holding up the next firing.
Defaults to E2E_RUNTIME=langgraph (fastest cold start; the regression
classes this gate catches don't need hermes-specific paths). Operators
can dispatch with runtime=hermes when they want SDK-native coverage.

Schedule-vs-dispatch hardening: hard-fail on missing
CP_STAGING_ADMIN_API_TOKEN for cron firing (silent-skip would mask
real outages); soft-skip for operator dispatch.

Refs:
  - #2342 hard-gates Tier 2 item 2
  - #2345 (A2A v0.2 fix that this gate would have caught earlier)
  - #2335 / #2337 (deployment-pipeline gaps that this gate also catches)
2026-04-29 22:04:57 -07:00
Hongming Wang
140fc5fb10 fix(a2a): v0.2 → v0.3 compat shim at proxy edge (#2345)
Closes #2345.

## Symptom

Design Director silently dropped A2A briefs whose sender used the
v0.2 message format (`params.message.content` string) instead of v0.3
(`params.message.parts` part-list). The downstream a2a-sdk's v0.3
Pydantic validator rejected with "params.message.parts — Field
required" but the rejection only landed in tenant-side logs; the
sender saw HTTP 200/202 and assumed delivery.

UX Researcher therefore never received the kickoff. Multi-agent
pipeline silently idle.

## Fix

Convert at the proxy edge in normalizeA2APayload. Two legacy cases
are converted, the v0.3 hot path is a no-op, and missing-both is
explicitly rejected:

  v0.2 string content   → wrap as [{kind: text, text: <content>}]
                          (the canonical v0.2 case from the dogfooding
                          report)
  v0.2 list content     → preserve list as parts (some older clients
                          put a list under `content`; treat as "client
                          meant parts, used wrong field name")
  v0.3 parts present    → no-op (hot path for normal traffic)
  Neither present       → return HTTP 400 with structured JSON-RPC
                          error pointing at the missing field

Why at the proxy edge: every workspace gets the compat for free
without each one bumping a2a-sdk separately. The SDK's own compat
adapter is strict about `parts` and rejects v0.2 senders.

Why reject loud on missing-both: pre-fix the SDK's Pydantic
rejection was post-handler-dispatch and invisible to the original
sender. Now misshapen payloads return a structured 400 to the actual
caller — kills the entire silent-drop class for this payload-shape
category.

## Tests

7 new cases on normalizeA2APayload (#2345) + 1 fixture update on the
existing _MissingMethodReturnsEmpty test:

  TestNormalizeA2APayload_ConvertsV02StringContentToParts
  TestNormalizeA2APayload_ConvertsV02ListContentToParts
  TestNormalizeA2APayload_PreservesV03Parts (hot path)
  TestNormalizeA2APayload_RejectsMessageWithNeitherContentNorParts
  TestNormalizeA2APayload_RejectsContentWithUnsupportedType
  TestNormalizeA2APayload_NoMessageNoCheck (e.g. tasks/list bypasses)

All 11 normalizeA2APayload tests pass + full handler suite (no
regressions).

## Refs

Hard-gates discussion: this is exactly the class of failure
(silent-drop on schema mismatch) that #2342 (continuous synthetic
E2E) would catch automatically. Tier 2 RFC item from #2345 (caller
gets structured JSON-RPC error on parse failure) is delivered above
via the loud-reject path.
2026-04-29 22:01:41 -07:00
Hongming Wang
d5b00d6ac1 feat(workspaces): delivery_mode column + poll-mode register flow (#2339 PR 1)
Adds workspaces.delivery_mode ('push', the default, or 'poll') and lets
the register handler accept poll-mode workspaces with no URL. This is the foundation
for the unified poll/push delivery design in #2339 — Telegram-getUpdates
shape for external runtimes that have no public URL.

What this PR does:

  - Migration 045: NOT NULL TEXT column, default 'push', CHECK constraint
    on the two valid values.
  - models.Workspace + RegisterPayload + CreateWorkspacePayload gain a
    DeliveryMode field. RegisterPayload.URL drops the `binding:"required"`
    tag — the handler now enforces it conditionally on the resolved mode.
  - Register handler: validates explicit delivery_mode if set; resolves
    effective mode (payload value, else stored row value, else push) AFTER
    the C18 token check; validates URL only when effective mode is push;
    persists delivery_mode in the upsert; returns it in the response;
    skips URL caching when payload.URL is empty.
  - CreateWorkspace handler: persists delivery_mode (defaults to push) in
    the same INSERT, validates it before any side effects.
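The mode-resolution + conditional-URL rule from the register bullet above, as a standalone sketch (function name and error strings are illustrative):

```go
package main

import (
	"errors"
	"fmt"
)

// validateRegister resolves the effective delivery mode (payload value,
// else stored row value, else the "push" schema default), rejects
// invalid modes, and requires a URL only for push mode.
func validateRegister(payloadMode, storedMode, url string) (string, error) {
	mode := payloadMode
	if mode == "" {
		mode = storedMode // preserve the existing row's value
	}
	if mode == "" {
		mode = "push" // schema default
	}
	if mode != "push" && mode != "poll" {
		return "", errors.New("invalid delivery_mode") // typo defense → 400
	}
	if mode == "push" && url == "" {
		return "", errors.New("url required for push-mode workspaces")
	}
	return mode, nil
}

func main() {
	mode, err := validateRegister("poll", "", "")
	fmt.Println(mode, err) // poll-mode registration accepted without a URL
}
```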

What this PR does NOT do (intentional, follow-up PRs):

  - PR 2: short-circuit ProxyA2A for poll-mode workspaces (skip SSRF +
    dispatch, log a2a_receive activity, return 200).
  - PR 3: since_id cursor on GET /activity for lossless polling.
  - Plugin v0.2 in molecule-mcp-claude-channel: cursor persistence + a
    register helper that creates poll-mode workspaces.

Backwards compatibility: every existing workspace stays push-mode (schema
default) with identical behavior. New tests:
TestRegister_PollMode_AcceptsEmptyURL,
TestRegister_PushMode_RejectsEmptyURL,
TestRegister_InvalidDeliveryMode,
TestRegister_PollMode_PreservesExistingValue. All existing register +
create tests updated to expect the new delivery_mode column in the
INSERT args.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 21:47:14 -07:00