Commit Graph

3996 Commits

Hongming Wang
cbb8ee0807
Merge pull request #2080 from Molecule-AI/fix/retarget-action-handle-duplicate-pr-1884
ci(retarget): handle 422 'duplicate PR' by closing redundant main-PR (closes #1884)
2026-04-26 07:56:13 +00:00
Hongming Wang
b5f9cbbc55 ci(retarget): handle 422 'duplicate PR' by closing redundant main-PR (closes #1884)
When a bot opens a PR against main and there's already another PR on
the same head branch targeting staging, GitHub's PATCH /pulls returns
422 with:

  "A pull request already exists for base branch 'staging' and
   head branch '<branch>'"

Pre-fix: the retarget Action exited 1 with no further action. The
target-main PR sat there as a duplicate, the workflow run showed
red, and someone had to manually close the duplicate. Today's case
(#1881 duplicate of #1820) had to be closed manually.

Fix: catch that specific 422 message and close the main-PR as
redundant instead of failing. Any OTHER 422 (or other error) still
fails loud — the grep matches the specific duplicate-base text, not
a blanket "any 422 means duplicate".

Behaviour matrix:

  PATCH succeeds                           → retargeted, explainer
                                              comment posted
  PATCH 422 "already exists for staging"   → close main-PR with
                                              explainer (NEW)
  PATCH any other failure                  → workflow fails (preserves
                                              loud-fail for real bugs)

Tests: GitHub Actions has no inline unit-test framework here.
The workflow YAML parses (validated locally) and the bash logic is
straightforward. Real verification will be the next duplicate-PR
scenario in production.
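The decision the workflow makes can be sketched as a small classifier. This is an illustrative Python sketch, not the workflow's actual bash; the function name and return labels are hypothetical, but the matched message text is the one quoted above.

```python
def classify_patch_failure(status: int, body: str) -> str:
    """Map a failed PATCH /pulls to an action. Only the specific
    duplicate-base 422 closes the redundant main-PR; any other
    422 (or any other error) still fails the workflow loudly."""
    if status == 422 and "A pull request already exists for base branch" in body:
        return "close-main-pr"
    return "fail-workflow"
```

In the real Action the same narrowing is done with a grep over the API response body rather than a blanket "any 422 means duplicate" check.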

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 00:53:55 -07:00
Hongming Wang
8543bae83f
Merge branch 'staging' into fix/canvas-multilevel-layout-ux 2026-04-26 00:36:54 -07:00
rabbitblood
6494e9192b refactor(ops): apply simplify findings on #2027 PR
Code-quality + efficiency review of PR #2079:

- Hoist all_slugs = prod_slugs | staging_slugs out of decide() into the
  caller (was rebuilt on every record — 1k records × ~50-slug union per
  call). decide() signature now (r, all_slugs, ec2_names).
- Compile regexes at module scope (_WS_RE, _E2E_RE, _TENANT_RE) +
  hoist platform-core literal set (_PLATFORM_CORE_NAMES). Same change
  mirrored in the bash heredoc.
- Drop decorative # Rule N: comments (numbering was out of order, 3
  before 2 — actively confusing).
- Move the "edits must mirror" reminder OUTSIDE the CANONICAL DECIDE
  block in the .sh file, eliminating the .replace() comment-skip hack
  in TestParityWithBashScript.
- Drop per-line .strip() in _slice_canonical (would mask a real
  indentation bug; both blocks already at column 0).
- subTest() in TestPlatformCore loops so a single failure no longer
  short-circuits the rest of the items.
- merge_group + concurrency on test-ops-scripts.yml (parity with
  ci.yml gate behaviour).
- Fix a stray apostrophe (in a "don't" inline comment) that closed the
  python heredoc's single quote and broke bash -n.

All 25 tests still pass. bash -n clean.
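The hoisting pattern described above looks roughly like this. A minimal sketch: the regex pattern and the rule body here are illustrative assumptions, not the real sweep_cf_decide.py rules; only the shape (module-scope compile, caller-supplied all_slugs) mirrors the commit.

```python
import re

# Compiled once at module scope instead of per call (mirrors _WS_RE etc.).
_WS_RE = re.compile(r"^ws-[0-9a-f-]+$")

def decide(r, all_slugs, ec2_names):
    """Per-record decision. all_slugs is built once by the caller
    (prod_slugs | staging_slugs) and passed in, instead of being
    rebuilt for every one of ~1k records."""
    name = r["name"]
    if _WS_RE.match(name):
        # Hypothetical rule body: live if some EC2 name carries this prefix.
        return "live" if any(n.startswith(name) for n in ec2_names) else "orphan"
    return "keep"
```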

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 00:28:15 -07:00
rabbitblood
ba78a5c00d test(ops): unit tests for sweep-cf-orphans decide() (#2027)
Closes #2027.

The CF orphan sweep deletes DNS records — a misclassification could nuke
a live workspace's tunnel. The decision function had MAX_DELETE_PCT
percentage gating but no automated test of category → action mapping.

Approach: extract the decide() function to scripts/ops/sweep_cf_decide.py
as a verbatim copy bracketed by `# CANONICAL DECIDE BEGIN/END` markers.
The shell script keeps its inline heredoc (so the operational path is
untouched) but bracketed by the same markers. A parity test
(TestParityWithBashScript) reads both files and asserts the bracketed
blocks match line-for-line — drift fails CI loudly.

Coverage (25 tests, 1 file, stdlib unittest only):
- Rule 1 platform-core: apex, _vercel, _domainkey, www/api/app/doc/send/status/staging-api
- Rule 3 ws-*: live (matches EC2 prefix) on prod + staging; orphan on prod + staging
- Rule 4 e2e-*: live + orphan on staging; orphan on prod
- Rule 2 generic tenant: live prod + staging; unknown subdomain kept-for-safety
- Rule 5 fallthrough: external domain + unrelated apex
- Rule priority: api.moleculesai.app stays platform-core (not tenant); _vercel stays verification
- Safety gate: under/at/over default 50% threshold; zero-total no-divide; custom threshold
- Empty live-sets: documents that decide() alone classifies everything
  as orphan; the percentage gate is the defense

CI: new .github/workflows/test-ops-scripts.yml runs `python -m unittest
discover` against scripts/ops/ on every PR/push that touches the
directory. Lightweight — no requirements file, stdlib only.

Local: `cd scripts/ops && python -m unittest test_sweep_cf_decide -v` →
25 tests, all OK.
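The parity check's core can be sketched as a marker-bracketed slice and compare. The marker strings are the ones named above; the helper name is close to the commit's _slice_canonical but this is a sketch, not the test file itself.

```python
def slice_canonical(text: str) -> list[str]:
    """Return the lines strictly between the BEGIN/END markers.
    Note: no per-line .strip() — indentation drift should fail."""
    lines = text.splitlines()
    begin = lines.index("# CANONICAL DECIDE BEGIN") + 1
    end = lines.index("# CANONICAL DECIDE END")
    return lines[begin:end]
```

The parity test reads both files, slices each, and asserts the two blocks match line-for-line, so any edit to one copy without the other fails CI loudly.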

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 00:22:30 -07:00
Hongming Wang
5e36c6638c feat(platform,canvas): classify "datastore unavailable" as 503 + dedicated UI
User reported the canvas threw a generic "API GET /workspaces: 500
{auth check failed}" error when local Postgres + Redis were both
down. Two problems:

1. The error code (500) and message ("auth check failed") said
   nothing useful. The actual condition was "platform can't reach
   its datastore to validate your token" — a Service Unavailable
   class, not Internal Server Error.

2. The canvas had no way to distinguish infra-down from a real
   auth bug, so it rendered the raw API string in the same
   generic-error overlay it uses for everything.

Fix in two layers:

Server (wsauth_middleware.go):
  - New abortAuthLookupError helper centralises all three sites
    that previously returned `500 {"error":"auth check failed"}`
    when HasAnyLiveTokenGlobal or orgtoken.Validate hit a DB error.
  - Now returns 503 + structured body
    `{"error": "...", "code": "platform_unavailable"}`. 503 is
    the correct semantic ("retry shortly, infra is unavailable")
    and the code field is the contract the canvas reads.
  - Body deliberately excludes the underlying DB error string —
    production hostnames / connection-string fragments must not
    leak into a user-visible error toast.

Canvas (api.ts):
  - New PlatformUnavailableError class. api.ts inspects 503
    responses for the platform_unavailable code and throws the
    typed error instead of the generic "API GET /…: 503 …"
    message. Generic 503s (upstream-busy, etc.) keep the legacy
    path so existing busy-retry UX isn't disrupted.

Canvas (page.tsx):
  - New PlatformDownDiagnostic component renders when the
    initial hydration catches PlatformUnavailableError.
    Surfaces the actual condition with operator-actionable
    copy ("brew services start postgresql@14 / redis") +
    pointer to the platform log + a Reload button.

Tests:
  - Go: TestAdminAuth_DatastoreError_Returns503PlatformUnavailable
    pins the response shape (status, code field, no DB-error leak)
  - Canvas: 5 tests for PlatformUnavailableError classification —
    typed throw on 503+code match, generic-Error fallback for
    503-without-code (upstream busy), 500 stays generic, non-JSON
    body falls back to generic.

1015 canvas tests + full Go middleware suite pass.
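The classification contract the canvas reads can be sketched language-neutrally. A Python sketch of the api.ts rule (the real code is TypeScript); the return labels are hypothetical, the `code` field value is the contract named above.

```python
import json

def classify_response(status: int, body: str) -> str:
    """Only a 503 whose JSON body carries code == "platform_unavailable"
    gets the typed PlatformUnavailableError path; everything else —
    generic 503s, 500s, non-JSON bodies — keeps the legacy path."""
    if status == 503:
        try:
            payload = json.loads(body)
        except ValueError:
            return "generic"  # non-JSON body falls back
        if payload.get("code") == "platform_unavailable":
            return "platform-unavailable"
    return "generic"
```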

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 00:01:56 -07:00
Hongming Wang
194121c674
Merge pull request #2063 from Molecule-AI/feat/redeploy-tenants-on-main-merge
ci(redeploy): auto-redeploy tenant EC2s after every main merge
2026-04-26 07:00:59 +00:00
Hongming Wang
944ddcb4e5
Merge pull request #2062 from Molecule-AI/fix/sweep-script-env-override
fix(scripts): make sweep-cf-orphans MAX_DELETE_PCT env override actually work
2026-04-26 06:55:14 +00:00
Hongming Wang
20cce3c27c
Merge pull request #2078 from Molecule-AI/fix/api-401-probe-before-redirect
fix(api): probe /cp/auth/me before redirecting on 401
2026-04-26 06:51:38 +00:00
Hongming Wang
5a3dbb95e1 fix(api): probe /cp/auth/me before redirecting on 401
The actual root-cause fix for the staging-tabs E2E saga (#2073/#2074/#2075).

Old behaviour: ANY 401 from any fetch on a SaaS tenant subdomain
called redirectToLogin → window.location.href = AuthKit. This is
wrong. Plenty of 401s don't mean "session is dead":

  - workspace-scoped endpoints (/workspaces/:id/peers, /plugins)
    require a workspace-scoped token, not the tenant admin bearer
  - resource-permission mismatches (user has tenant access but not
    this specific workspace)
  - misconfigured proxies returning 401 spuriously

A single transient one of those yanked authenticated users back to
AuthKit. Same bug yanked the staging-tabs E2E off the tenant origin
mid-test for 6+ hours tonight, leading to the cascade of test-side
mocks (#2073/#2074/#2075) that worked around the symptom without
fixing the cause.

This PR fixes it at the source. The new logic:

  - 401 on /cp/auth/* path → that IS the canonical session-dead
    signal → redirect (unchanged)
  - 401 on any other path with slug present → probe /cp/auth/me:
      probe 401 → session genuinely dead → redirect
      probe 200 → session fine, endpoint refused this token →
                  throw a real Error, caller renders error state
      probe network err → assume session-fine (conservative) →
                  throw real Error
  - slug empty (localhost / LAN / reserved subdomain) → throw
    without redirect (unchanged)

The probe adds one extra fetch on a 401, only when slug is set
and the path isn't already auth-scoped. That's rare and
worthwhile — a transient probe round-trip is cheap; an unwanted
auth redirect is a UX disaster.
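The decision tree above can be sketched as one function. A Python sketch of the TypeScript logic; `probe` stands in for the GET /cp/auth/me round-trip (returns a status, raises on network error), and the return labels are hypothetical.

```python
def on_401(path: str, slug: str, probe) -> str:
    """What to do when any fetch returns 401."""
    if path.startswith("/cp/auth/"):
        return "redirect"            # canonical session-dead signal
    if not slug:
        return "throw"               # localhost / LAN / reserved subdomain
    try:
        status = probe()             # one extra fetch: GET /cp/auth/me
    except OSError:
        return "throw"               # assume session fine (conservative)
    return "redirect" if status == 401 else "throw"
```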

Tests:
  - api-401.test.ts rewritten with the full matrix:
      * /cp/auth/me 401 → redirect (no probe, that IS the signal)
      * non-auth 401 + probe 401 → redirect
      * non-auth 401 + probe 200 → throw, no redirect  ← the fix
      * non-auth 401 + probe network err → throw, no redirect
      * empty slug paths (localhost/LAN/reserved) → throw, no probe
  - 43 tests in canvas/src/lib/__tests__/api*.test.ts all pass
  - tsc clean

The staging-tabs E2E spec's universal-401 route handler stays as
defense-in-depth (silences resource-load console noise + guards
against panels without try/catch), but the comment now describes
its role honestly: api.ts is the primary fix, the route is the
safety net.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 23:49:28 -07:00
Hongming Wang
b47a1b87b0 chore: refresh stale orphan-sweeper Stop-failure comment
Convergence-pass review noted the comment at orphan_sweeper.go:171
still describes the pre-cb126014 contract ("Stop returns nil even
when container is gone, but a future change could surface real
errors"). The future is now — Stop does surface real errors today.
Tightened the comment to match the live contract:
isContainerNotFound is treated as success, anything else returns
the wrapped Docker error, sweeper retries on the next cycle.

Pure comment change, no behavior diff.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 23:34:57 -07:00
Hongming Wang
cb12601414 fix(platform): make Provisioner.Stop return real errors so cleanup gates fire
Review caught a critical issue with 12c49183: the headline "skip
RemoveVolume when Stop fails" guarantee was dead code. `Provisioner.Stop`
unconditionally `return nil`'d after logging the underlying
ContainerRemove error, so the new `if err := h.provisioner.Stop(...);
err != nil { skip volume }` guard in workspace_crud.go AND the same
guard in the orphan sweeper could never fire. RemoveVolume always
ran, predictably failing with "volume in use" when Stop hadn't
actually killed the container — which is the exact production bug
the commit claimed to fix.

Now Stop:
  - returns nil on successful remove (no change)
  - returns nil when the container is already gone (uses the existing
    isContainerNotFound helper — that's the cleanup post-condition,
    not a failure)
  - returns the wrapped Docker error otherwise (daemon timeout, ctx
    cancellation, socket EOF — anything that means the container
    might still be alive)

Audited every Provisioner.Stop caller in the tree (team.go,
workspace_restart.go ×4, workspace.go) — all of them already
discard the return value, so the widened error surface is purely
opt-in for the new cleanup paths and breaks no existing behaviour.

Other review-driven fixes in this commit:

- workspace_crud.go: detached `broadcaster.RecordAndBroadcast` from
  the request ctx too. RecordAndBroadcast does INSERT INTO
  structure_events + Redis Publish; if the canvas hangs up, a
  request-ctx-bound INSERT can be cancelled mid-write and the
  WORKSPACE_REMOVED event never lands, leaving other WS clients
  ignorant of the cascade.

- orphan_sweeper.go: added isLikelyWorkspaceID guard before turning
  Docker container prefixes into SQL LIKE patterns. The Docker
  name filter is a SUBSTRING match (not prefix), so non-workspace
  containers like `my-ws-tool` slip through; the in-loop HasPrefix
  in provisioner trims most, but the in-sweeper alphabet check
  (hex + dashes only) is the second line of defence and also
  blocks SQL LIKE wildcards (`_`, `%`) from reaching the query.
  Two new tests pin this — TestSweepOnce_FiltersNonWorkspacePrefixes
  and TestIsLikelyWorkspaceID with 10 alphabet cases.

- provisioner.go: comment added to ListWorkspaceContainerIDPrefixes
  flagging the substring/HasPrefix relationship as load-bearing.
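The alphabet guard is simple enough to sketch. A Python sketch of the Go isLikelyWorkspaceID helper (illustrative, not the real source): hex + dashes only, which as a side effect keeps SQL LIKE wildcards out of the query.

```python
import string

_ALLOWED = set(string.hexdigits.lower()) | {"-"}

def is_likely_workspace_id(prefix: str) -> bool:
    """Second line of defence after the in-loop HasPrefix trim:
    rejects non-workspace names like my-ws-tool and blocks the
    LIKE wildcards _ and % from ever reaching the query."""
    return bool(prefix) and all(c in _ALLOWED for c in prefix.lower())
```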

Verified: full Go test suite passes; all 8 sweeper tests pass
(2 new for the LIKE-pattern guard); existing dispatch / delete /
provisioner tests unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 23:32:48 -07:00
Hongming Wang
12c4918318 fix(platform): stop leaking workspace containers on delete
Symptom: deleting workspaces from the canvas marked DB rows
status='removed' but left Docker containers running indefinitely.
After a session of org imports + cancellations, we counted 10
running ws-* containers all backed by 'removed' DB rows, eating
~1100% CPU on the Docker VM.

Two compounding bugs in handlers/workspace_crud.go's delete cascade:

1. The cleanup loop used `c.Request.Context()` for the Docker
   stop/remove calls. When the canvas's `api.del` resolved on the
   platform's 200, gin cancelled the request ctx — and any in-flight
   Docker call cancelled with `context canceled`, leaving the
   container alive. Old logs:
       "Delete descendant <id> volume removal warning:
        ... context canceled"

2. `provisioner.Stop`'s error return was discarded and `RemoveVolume`
   ran unconditionally afterward. When Stop didn't actually kill the
   container (transient daemon error, ctx cancellation as in #1), the
   volume removal would predictably fail with "volume in use" and
   the container kept running with the volume mounted. Old logs:
       "Delete descendant <id> volume removal warning:
        Error response from daemon: remove ... volume is in use"

Fix layered in two parts:

- workspace_crud.go: detach cleanup with `context.WithoutCancel(ctx)`
  + a 30s bounded timeout. Stop's error is now checked and on
  failure we skip RemoveVolume entirely (the orphan sweeper below
  catches what we deferred).

- New registry/orphan_sweeper.go: periodic reconcile pass (every 60s,
  initial run on boot). Lists running ws-* containers via Docker name
  filter, intersects with DB rows where status='removed', stops +
  removes volumes for the leaks. Defence in depth — even a brand-new
  Stop failure mode heals on the next sweep instead of leaking
  forever.

Provisioner gains a tiny ListWorkspaceContainerIDPrefixes helper
that wraps ContainerList with the `name=ws-` filter; the sweeper
takes an OrphanReaper interface (matches the ContainerChecker
pattern in healthsweep.go) so unit tests don't need a real Docker
daemon.

main.go wires the sweeper alongside the existing liveness +
health-sweep + provisioning-timeout monitors, all under
supervised.RunWithRecover so a panic restarts the goroutine.

6 new sweeper tests cover the reconcile path, the
no-running-containers short-circuit, the daemon-error skip, the
Stop-failure-leaves-volume invariant (the same trap that motivated
this fix), the volume-remove-error-is-non-fatal continuation,
and the nil-reaper no-op.

Verified: full Go test suite passes; manually purged the 10 leaked
containers + their orphan volumes from the dev host with `docker
rm -f` + `docker volume rm` (one-off cleanup; the sweeper would
have caught them on the next cycle once deployed).
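The reconcile pass boils down to a set intersection plus a stop-and-remove per leak. A Python sketch of the Go sweeper's core (names hypothetical; `reaper` is a stub standing in for the OrphanReaper interface so the logic is testable without Docker):

```python
def sweep_once(running_prefixes, removed_ids, reaper):
    """Intersect running ws-* container ID prefixes with workspace
    rows already marked status='removed'; reap the leaks."""
    leaked = [p for p in running_prefixes
              if any(wid.startswith(p) for wid in removed_ids)]
    for prefix in leaked:
        reaper.append(prefix)   # stand-in for Stop + RemoveVolume
    return leaked
```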

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 12:36:22 -07:00
Hongming Wang
23bea6e793
Merge pull request #2075 from Molecule-AI/fix/canvas-e2e-filter-resource-404
fix(canvas/e2e): filter generic 'Failed to load resource' + add URL diagnostics
2026-04-25 19:09:19 +00:00
Hongming Wang
bef6fca395 fix(canvas/e2e): filter generic "Failed to load resource" + add URL diagnostics
After #2074, the staging-tabs spec stopped failing on the auth-redirect
locator timeout (good — the broadened 401-mock works) but started
failing on a different aggregate check:

  Error: unexpected console errors:
  Failed to load resource: the server responded with a status of 404
  Failed to load resource: the server responded with a status of 404
  Failed to load resource: the server responded with a status of 404

Browser console messages for resource-load failures omit the URL,
so the message is uninformative on its own — we can't filter
selectively (e.g. "is this a missing-CSS noise or a real broken
endpoint?"). The previous filter list (sentry/vercel/WebSocket/
favicon/molecule-icon) catches specific known-noisy strings but
this generic "Failed to load resource" doesn't contain any of them.

Two changes:

1. Add page.on('requestfailed') logging plus a page.on('response')
   listener that logs any response with status >= 400, capturing the
   URL of every failed request. Logs go to test stdout
   (visible in the workflow log) — leaves a breadcrumb so a real
   bug isn't completely hidden when we filter the generic message.

2. Add "Failed to load resource" to the filter list. With (1) in
   place we still see the URLs for diagnosis; the generic console
   message is just noise.

Real JS exceptions (panel crash, undefined access, etc.) come with
a file path and stack trace and aren't matched by either filter,
so the gate still catches actual bugs.
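The gate's filtering can be sketched as a substring check over the known-noisy list. The list entries are the ones named in this commit; ordering and the helper name are illustrative.

```python
NOISY = ("sentry", "vercel", "WebSocket", "favicon", "molecule-icon",
         "Failed to load resource")

def is_noise(msg: str) -> bool:
    """Drop known-noisy console strings. Real JS exceptions carry a
    file path + stack trace and match none of these, so they still
    fail the aggregate check."""
    return any(s in msg for s in NOISY)
```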

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 12:07:07 -07:00
Hongming Wang
cdfe4e7b85
Merge pull request #2074 from Molecule-AI/fix/canvas-e2e-broaden-401-mock
fix(canvas/e2e): broaden 401-mock to all fetches
2026-04-25 18:43:07 +00:00
Hongming Wang
a84b167d4d fix(canvas/e2e): broaden 401-mock to all fetches, not just /workspaces/*
#2073 caught workspace-scoped 401s but missed non-workspace paths.
SkillsTab.tsx alone fetches /plugins and /plugins/sources, both
outside the /workspaces/<id>/* tree. Either of those 401s with the
tenant admin bearer in SaaS mode → canvas/src/lib/api.ts:62-74
redirects to AuthKit → page navigates away mid-test → next locator
times out.

Same failure signature observed at 16:03Z post-#2073 merge:

  e2e/staging-tabs.spec.ts:45:7 › tab: skills
  TimeoutError: locator.scrollIntoViewIfNeeded: Timeout 5000ms
  - navigated to "https://scenic-pumpkin-83.authkit.app/?..."

Broaden the route to "**" with `request.resourceType() !== "fetch"`
short-circuit (preserves HTML/JS/CSS pass-through) and a
/cp/auth/me skip (the dedicated mock above wins). Same 401 →
empty-body conversion logic; just a wider net.
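The broadened route's predicate can be sketched like this. A Python paraphrase of the Playwright handler's gating (the real spec inspects request.resourceType() and the request URL; names here are hypothetical):

```python
def should_mock(resource_type: str, path: str) -> bool:
    """Decide whether the wide "**" route intercepts a request."""
    if resource_type != "fetch":
        return False          # HTML/JS/CSS pass through untouched
    if path.startswith("/cp/auth/me"):
        return False          # the dedicated /cp/auth/me mock wins
    return True               # eligible for the 401 → empty-body swap
```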

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 11:40:48 -07:00
Hongming Wang
2ee4b67cab chore: third-pass review polish — empty-stream gate test + Callable type
Pass 3 review came back Approve with two optional polish items.
Both taken to fully converge the loop:

1. Regression test for the empty-stream wedge-clear gate (added in
   3c4eef49). A degenerate stream that iterates without raising but
   emits NEITHER an AssistantMessage NOR a ResultMessage must NOT
   clear the wedge flag — pre-set wedge persists, the next heartbeat
   still reports runtime_state="wedged". Pins the gate against
   future regression.

2. Replaced the type annotation `"dict[str, callable[[dict], str]]"`
   (lowercase `callable`, string-quoted) with the proper
   `dict[str, Callable[[dict], str]]` using `Callable` from
   `collections.abc`. Benign before (`from __future__ import
   annotations` makes the annotation a string Python never
   evaluates), but pyright/mypy may flag the lowercase form.

65 Python tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 08:52:32 -07:00
Hongming Wang
3c4eef49aa chore: second-pass review polish — symmetry + clearer test fixtures
Round-2 review of the wedge/idle/progress bundle came back Approve
with 4 optional polish items. All taken:

1. Migration 043 down file gained `SET LOCAL lock_timeout = '5s'`
   matching the up file. A rollback under the same load that
   motivated the up-file guard would otherwise stall writers.

2. _clear_sdk_wedge_on_success now gates on actual stream content
   (result_text or assistant_chunks). A degenerate "iterator
   returned without raising but emitted nothing" case (possible
   from a partial stream or stub SDK) no longer falsely advertises
   recovery — only a real successful query (≥1 ResultMessage or
   AssistantMessage TextBlock) clears the wedge.

3. isUpstreamBusyError dropped the redundant
   `strings.Contains(msg, "context deadline exceeded")` fallback.
   *url.Error.Unwrap propagates the typed sentinel since Go 1.13;
   errors.Is(err, context.DeadlineExceeded) catches the real
   net/http shape. The substring was a foot-gun (would also match
   user-content with that phrase). Test fixture updated to use
   `fmt.Errorf("Post: %w", context.DeadlineExceeded)` which
   reflects what net/http actually returns.

4. TestIsUpstreamBusyError added a context.Canceled case (both
   typed and wrapped via %w) — pins the new applyIdleTimeout
   classification.

No critical/required findings on second pass; reviewer verdict was
Approve. Items above are polish for symmetry and test clarity.

1010 canvas + 64 Python + full Go suites pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 08:48:30 -07:00
Hongming Wang
892de784b3 fix: review-driven hardening of wedge detector + idle timeout + progress feed
Bundle review of pieces 1/2/3 surfaced two critical issues plus a
handful of required + optional fixes. All addressed.

Critical:

1. Migration 043 was missing 'paused' and 'hibernated' from the
   workspace_status enum. Both are real production statuses written
   by workspace_restart.go (lines 283 and 406), introduced by
   migration 029_workspace_hibernation. The original `USING
   status::workspace_status` cast would have errored mid-transaction
   on any production DB containing those values. Added both. Also
   added `SET LOCAL lock_timeout = '5s'` so the migration aborts
   instead of stalling the workspace fleet behind a slow SELECT.

2. The chat activity-feed window kept only 8 lines, and a single
   multi-tool turn (Read 5 files + Grep + Bash + Edit + delegate)
   easily flushed older context before the user could read it.
   Extracted appendActivityLine to chat/activityLog.ts with a
   20-line window AND consecutive-duplicate collapse (same tool
   on the same target twice in a row is noise, not new progress).
   5 unit tests pin the behavior.

Required:

3. The SDK wedge flag was sticky-only — a single transient
   Control-request-timeout from a flaky network blip locked the
   workspace into degraded for the whole process lifetime, even
   when the next query() would have succeeded. Added
   _clear_sdk_wedge_on_success(), called from _run_query's success
   path. The next heartbeat after a working query reports
   runtime_state empty and the platform recovers the workspace to
   online without a manual restart. New regression test.

4. _report_tool_use now sets target_id = WORKSPACE_ID for self-
   actions, matching the convention other self-logged activity
   rows use. DB consumers joining on target_id see a well-defined
   value instead of NULL.

Optional taken:

5. Tightened _WEDGE_ERROR_PATTERNS from "control request timeout"
   to "control request timeout: initialize" — suffix-anchored so a
   future SDK error on an in-flight tool-call control message
   doesn't get misclassified as the unrecoverable post-init wedge.

6. Dropped the redundant "context canceled" substring fallback in
   isUpstreamBusyError. errors.Is(err, context.Canceled) is the
   typed check; the substring would also match healthy client-side
   aborts, which we don't want classified as upstream-busy.

Verified: 1010 canvas tests + 64 Python tests + full Go suite pass;
migration applies cleanly on dev DB with all 8 enum values; reverse
migration restores TEXT.
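The activity-window behavior from item 2 can be sketched in a few lines. A Python sketch of the TypeScript appendActivityLine (illustrative; the real helper lives in chat/activityLog.ts):

```python
def append_activity_line(log, line, window=20):
    """Consecutive-duplicate collapse + bounded sliding window.
    Same tool on the same target twice in a row is noise, not
    new progress, so it does not consume a window slot."""
    if log and log[-1] == line:
        return log
    return (log + [line])[-window:]
```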

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 08:43:10 -07:00
Hongming Wang
bf1dc6b6a5 feat(platform): idle-based A2A timeout, drop 5-min canvas hardcode
The previous canvas-default 5-min absolute deadline pre-empted any
chat that legitimately ran longer (multi-turn tool use, large
synthesis tasks) and made every wedged-SDK call burn 5 full minutes
before the user saw anything. Replaced with a per-dispatch idle
timeout: cancel the request only when the broadcaster has been
silent for `idleTimeoutDuration` (60s). Any progress event for the
workspace — agent_log tool-use rows, task_update, a2a_send,
a2a_receive — resets the clock.

Mechanics:

- new applyIdleTimeout helper subscribes to events.Broadcaster's
  per-workspace SSE channel, drains its messages, resets a
  time.Timer on each one, cancels the wrapped ctx when the timer
  fires. The cleanup goroutine and subscription live only until the
  returned cancel func is called.
- dispatchA2A now takes workspaceID as a parameter, applies the
  idle timeout always (canvas + agent), and combines its cancel
  with the existing 30-min agent-to-agent ceiling cancel into one
  func the caller defers.
- Canvas dispatches no longer have an absolute ceiling at all —
  the idle timer is the only "give up" signal. A healthy chat
  reporting tool-use telemetry every few seconds runs forever;
  a wedged runtime fails in 60s instead of 5 min.
- isUpstreamBusyError now also recognises context.Canceled (the
  error class our idle cancel produces, distinct from
  DeadlineExceeded). Same 503-busy retry semantics.

Tests:

- TestApplyIdleTimeout_FiresOnSilence — 60ms idle, no events,
  ctx cancels with context.Canceled.
- TestApplyIdleTimeout_ResetsOnEvent — event mid-window extends
  the deadline; ctx alive past original deadline, then cancels
  on the second silence window.
- TestApplyIdleTimeout_NilBroadcasterDegradesGracefully — defensive
  no-op for paths that don't wire a broadcaster.
- 3 existing dispatchA2A tests updated for the new workspaceID
  param + the always-non-nil cancel return shape.

This pairs with Piece 1's per-tool-use telemetry (166c7f77): the
broadcaster events that reset the idle timer ARE the agent_log
rows the workspace started emitting per tool call. So the same
event stream feeds both the chat progress feed AND the proxy's
deadline.

Full Go test suite passes.
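The idle rule itself (cancel only after `idle` seconds of broadcaster silence, any event resets the clock) can be checked deterministically. A Python sketch over logical timestamps, not the Go timer implementation:

```python
def idle_deadline(events, idle, start=0.0):
    """Return when the request would be cancelled: `idle` seconds
    after the LAST event that arrived before a silence window, not
    `idle` seconds after start. events = sorted event timestamps."""
    last = start
    for t in events:
        if t - last > idle:
            break          # already cancelled before this event landed
        last = t
    return last + idle
```

This is why a healthy chat emitting tool-use telemetry every few seconds runs indefinitely while a wedged runtime fails in one idle window.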

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 08:34:55 -07:00
Hongming Wang
166c7f77af feat(chat): stream per-tool progress into MyChat live feed
Two halves of the same UX win — the user wants to see what Claude is
doing while a chat reply is in flight instead of staring at "0s" for
minutes.

Workspace side (claude_sdk_executor.py):
  - The executor's _run_query message loop already iterated the SDK
    stream for AssistantMessage.TextBlock content. Now also detects
    ToolUseBlock / ServerToolUseBlock entries (by class name, since
    the conftest stub doesn't define them) and fires-and-forgets a
    POST /workspaces/:id/activity row of type agent_log per tool use.
  - _summarize_tool_use maps the common tools (Read, Write, Edit,
    Bash, Glob, Grep, WebFetch, WebSearch, Task, TodoWrite) to a
    one-line summary with the file path / pattern / command, falling
    back to "🛠 <tool>(…)" for anything else. Truncated at 200 chars.
  - Posts directly to /workspaces/:id/activity rather than going
    through a2a_tools.report_activity, which would also push a
    /registry/heartbeat current_task and double-log as a TASK_UPDATED
    line in the same chat feed.
  - All failures swallowed silently — telemetry must not break
    the conversation.

Canvas side (ChatTab.tsx):
  - The existing ACTIVITY_LOGGED handler streams a2a_send /
    a2a_receive / task_update events into a sliding-window
    activityLog state. Two issues fixed:
      1. No `msg.workspace_id === workspaceId` filter — a sibling
         workspace's a2a_send was leaking into the wrong chat
         panel as "→ Delegating to X...". Added an early return.
      2. No agent_log render branch. Added one that renders the
         summary verbatim (the workspace already prefixed its
         own emoji icon, so no double-icon).
  - Existing 8-line sliding window keeps the UI scoped; older
    progress lines naturally roll off as new ones arrive.

Result: when DD is delegating to Visual Designer + reading
config files + running Bash to lint, the spinner area shows:
  📄 Read /configs/system-prompt.md
   Bash: pnpm test
  → Delegating to Visual Designer...
  ← Visual Designer responded (47s)

instead of bare "0s · Processing with Claude Code..." for minutes.

63 Python tests + 58 canvas chat tests pass; tsc clean.
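The summariser's shape, per the description above. A reduced sketch of _summarize_tool_use: only the Read/fallback formats are quoted in this commit; the Bash format and argument keys are assumptions.

```python
def summarize_tool_use(tool, args):
    """One-line, ≤200-char summary of a tool use for the activity feed."""
    if tool == "Read":
        line = f"📄 Read {args.get('file_path', '')}"
    elif tool == "Bash":
        line = f"Bash: {args.get('command', '')}"   # assumed key/format
    else:
        line = f"🛠 {tool}(…)"                       # documented fallback
    return line[:200]                                # truncate at 200 chars
```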

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 08:28:55 -07:00
Hongming Wang
14fab6e544
Merge pull request #2073 from Molecule-AI/fix/canvas-e2e-mock-workspace-apis
fix(canvas/e2e): swap workspace-scoped 401s for empty 200s in staging-tabs spec
2026-04-25 15:23:07 +00:00
Hongming Wang
979d4a0b7a fix(canvas/e2e): swap workspace-scoped 401s for empty 200s
The staging-tabs E2E has been failing for 6+ hours on the same
locator timeout — diagnosed earlier today as the canvas's
lib/api.ts:62-74 redirect-on-401 path firing mid-test:

  e2e/staging-tabs.spec.ts:45:7 › tab: skills
  TimeoutError: locator.scrollIntoViewIfNeeded: Timeout 5000ms
  - navigated to "https://scenic-pumpkin-83.authkit.app/?..."

Several side-panel tabs (Peers, Skills, Channels, Memory, Audit,
and anything workspace-scoped) hit endpoints under
`/workspaces/<id>/*` that require a workspace-scoped token, NOT
the tenant admin bearer the test uses. The endpoints respond 401
in SaaS mode. canvas/src/lib/api.ts:62-74 reacts to ANY 401 by
setting `window.location.href` to AuthKit — yanking the page off
the tenant origin mid-test.

The test comment at line 18 already acknowledged the 401 class
("Peers tab: 401 without workspace-scoped token") but assumed
those would surface as "errored content" rather than a hard
navigation. The redirect logic in api.ts was added later and
breaks the assumption.

Fix: add a Playwright route handler that catches any 401 from
`/workspaces/<id>/*` paths and replaces with `200 + empty body`.
Body shape is best-effort by URL — list endpoints (paths not
ending in a UUID-shaped segment) get `[]`, single-resource
endpoints get `{}`. Both are valid JSON and well-written panels
render an empty state for either rather than crashing.

The two route patterns (`/workspaces/...` and `/cp/auth/me`)
don't overlap — the existing `/cp/auth/me` mock continues to
gate AuthGate's session check independently.

Verification:
- Type-check passes (tsc clean for the spec; pre-existing errors
  in unrelated test files unchanged)
- Can't run staging E2E locally without CP admin token; CI will
  exercise the real path against the freshly-provisioned tenant
- E2E Staging SaaS (full lifecycle) is currently green at 08:07Z,
  confirming the underlying staging infra works — the failures
  have been narrowly in this Playwright-tabs spec

Targets staging per molecule-core convention.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 08:08:05 -07:00
Hongming Wang
4eb09e2146 feat(platform,workspace): SDK-wedge detection + workspace_status ENUM
Heartbeat lies. The asyncio task that POSTs /registry/heartbeat lives
in its own process slot, so a workspace whose claude_agent_sdk has
wedged on `Control request timeout: initialize` keeps reporting
"online" — every chat send hangs the full 5-min platform deadline
even though the runtime is dead in the water. This commit teaches
the workspace to admit it's wedged and the platform to honor that
admission by flipping status → degraded.

Five layers, all in one commit because they share a contract:

1. Migration 043 — convert workspaces.status from free-form TEXT to
   a real `workspace_status` Postgres ENUM with the 6 values
   production code actually writes (provisioning, online, offline,
   degraded, failed, removed). Locks the value set; future typo
   writes error at the DB instead of silently storing rogue strings.
   Down migration reverts to TEXT and drops the type.

2. workspace-server/internal/models — `HeartbeatPayload` gains a
   `runtime_state string` field. Empty = healthy. Currently the only
   non-empty value the handler honors is "wedged"; future symptoms
   can extend without another migration.

3. workspace-server/internal/handlers/registry.go — `evaluateStatus`
   gains a wedge branch BEFORE the existing error_rate >= 0.5 path:
   if `RuntimeState=="wedged"` and currently online, flip to
   degraded and broadcast WORKSPACE_DEGRADED with the wedge sample
   error. Recovery (`degraded → online`) now requires BOTH
   error_rate < 0.1 AND runtime_state cleared, so a workspace still
   reporting wedged stays degraded even when its error count
   happens to be 0 (the wedge captures a runtime state, not an
   error count).

4. workspace/claude_sdk_executor.py — module-level `_sdk_wedged_reason`
   flag set when execute()'s catch block sees an error matching
   `_WEDGE_ERROR_PATTERNS` (currently just "control request
   timeout"). Sticky for the process lifetime; the SDK's internal
   client-process state is corrupted on this error and only a
   workspace restart (= new Python process = fresh module state)
   clears it. Helpers `is_wedged()` / `wedge_reason()` /
   `_reset_sdk_wedge_for_test()` exposed.

5. workspace/heartbeat.py — heartbeat body now layers on
   `_runtime_state_payload()` for both the happy path and the
   401-retry path. Lazy-imports claude_sdk_executor so non-Claude
   runtimes (where the module may not even be importable) keep
   working unchanged.

Canvas required no changes — `STATUS_CONFIG.degraded` was already
defined in design-tokens.ts (amber dot, "Degraded" label) and
WorkspaceNode.tsx already renders `lastSampleError` underneath the
status pill when status === "degraded". The existing wiring just
never fired because nothing was writing degraded in this code path.
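The transition rules above, restated as a small pure function
(TypeScript purely for illustration; the shipped logic is the Go in
registry.go, and these names are assumptions):

```typescript
type Status = "online" | "degraded";

function nextStatus(
  current: Status,
  errorRate: number,
  runtimeState: string, // "" = healthy, "wedged" = SDK wedge reported
): Status {
  if (current === "online") {
    if (runtimeState === "wedged") return "degraded"; // wedge branch runs first
    if (errorRate >= 0.5) return "degraded"; // existing error-rate path
    return "online";
  }
  // degraded -> online only when BOTH conditions clear: a workspace
  // still reporting wedged stays degraded even at error_rate 0.
  if (errorRate < 0.1 && runtimeState === "") return "online";
  return "degraded";
}
```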

Tests:
- 3 Go handler tests for the new transitions (online → degraded on
  wedged, degraded stays put while still wedged, degraded → online
  after wedge clears)
- 5 Python wedge-detector tests (default clean, mark sets flag,
  sticky-first-wins, execute() flips on Control request timeout,
  execute() does NOT flip on unrelated errors)
- Migration smoke-tested against the local dev DB (3 existing rows,
  all enum-compatible; migration applied cleanly, post-state has
  the column as workspace_status type and the index preserved)

Verified: 79 Python tests pass; full Go test suite passes; migration
applies clean on a real DB; reverse migration restores the column to
TEXT.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 00:59:15 -07:00
Hongming Wang
c159d85eb5 fix(a2a): review-driven hardening — prefix-anchored type check, error_detail cap, shared hint module
Three required fixes from the bundle review of 391e1872:

1. workspace/a2a_client.py: substring `type_name in msg` could miss
   the diagnostic prefix when an exception's message embedded a
   different class name mid-string (e.g. `OSError("see ConnectionError
   below")` → printed as plain msg, type lost). Switched to a
   prefix-anchored check (`msg.startswith(f"{type_name}:")` etc.) so
   the type label is always added when not already at the start of
   the message.

2. workspace/a2a_tools.py: `activity_logs.error_detail` is unbounded
   TEXT on the platform (handlers/activity.go does not validate
   length). A buggy or hostile peer could stream arbitrarily large
   error messages into the caller's activity log. Cap at 4096 chars
   at the producer — comfortably above any real exception traceback,
   well below any size that could pose a DoS risk.

3. New regression test for JSON-RPC `code=0` — pins the
   `code is not None` semantics so the code is preserved in the
   detail rather than collapsing into the no-code path. Code=0 is
   not valid per the spec, but a malformed peer can still emit it
   and we want it visible for diagnosis.

Plus one optional taken: extracted the A2A-error → hint mapping into
canvas/src/components/tabs/chat/a2aErrorHint.ts. The two prior copies
(AgentCommsPanel.inferCauseHint + ActivityTab.inferA2AErrorHint) had
already drifted — Activity tab gained `not found`/`offline` cases the
chat panel never picked up, AgentCommsPanel handled empty-input
explicitly while Activity didn't. The shared module is the merged
superset, with 10 unit tests pinning each named pattern + the
"most specific first" ordering (Claude SDK wedge wins over generic
timeout).
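A plausible shape for the merged module (pattern strings and hint
texts here are illustrative assumptions, not copies of
a2aErrorHint.ts):

```typescript
// Ordered most-specific-first, so the Claude SDK wedge pattern wins
// over the generic timeout match.
const HINTS: Array<[RegExp, string]> = [
  [/control request timeout/i, "Claude SDK init wedge: restart the workspace"],
  [/timeout/i, "peer may be busy or stuck"],
  [/connection.?reset/i, "transient network blip; retry, then check logs"],
  [/not found/i, "target workspace does not exist"],
  [/offline/i, "target workspace is offline"],
];

function inferA2AErrorHint(detail: string): string | null {
  if (detail.trim() === "") return null; // empty input: no hint
  for (const [pattern, hint] of HINTS) {
    if (pattern.test(detail)) return hint;
  }
  return null;
}
```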

Skipped (per analysis):
- Unicode-naive 120-char slice — Python str[:N] slices on code
  points, not bytes. Safe.
- Nested [A2A_ERROR] confusion — non-issue per reviewer; outer
  prefix winning still produces a structured render.
- MessagePreview + JsonBlock dual render on errors — intentional
  drilldown; raw JSON is below the fold for operators who need it.
- console.warn dedup — refetches don't happen per-event so spam
  risk is low.
- str(data)[:200] materialization — A2A response bodies aren't
  typically MB-sized.

Verified: 1005 canvas tests pass (10 new hint tests); 10 Python
send_a2a_message tests pass (1 new for code=0); tsc clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 23:47:44 -07:00
Hongming Wang
391e187281 fix(a2a,canvas): make delivery failures comprehensible instead of "[A2A_ERROR] "
Symptom: Activity tab and Agent Comms surfaced bare "[A2A_ERROR] "
(prefix + nothing) for failed delegations. Operator had no signal
to act on — no exception type, no target, no hint about what went
wrong, no next step. Fix is in three layers.

1. workspace/a2a_client.py — every error path now produces an
   actionable detail string:

   - except branch: some httpx exceptions (RemoteProtocolError,
     ConnectionReset variants) stringify to "". Pre-fix the catch
     was `f"{_A2A_ERROR_PREFIX}{e}"` → bare prefix. Now falls back
     to `<TypeName> (no message — likely connection reset or silent
     timeout)` and always appends `[target=<url>]` for traceability
     in chained delegations.
   - JSON-RPC error branch: previously dropped error.code on the
     floor and printed "unknown" when message was missing. Now
     surfaces both, including the well-defined "JSON-RPC error
     with no message (code=N)" path.
   - "neither result nor error" branch: pre-fix returned
     str(payload) which the canvas rendered as a successful
     response block. Now tagged as A2A_ERROR with a payload
     snippet so downstream UI routes through the error path.

2. workspace/a2a_tools.py — tool_delegate_task now passes
   error_detail (the stripped error message) through to the
   activity-log POST. The platform's activity_logs.error_detail
   column is the canvas's red error chip source; populating it
   makes the failure visible in the row header without the user
   having to expand into raw response_body JSON. The summary line
   also gets a 120-char prefix of the cause so the collapsed row
   reads "React Engineer failed: ConnectionResetError: ... [target=...]"
   instead of "React Engineer failed".

3. canvas/src/components/tabs/ActivityTab.tsx — MessagePreview
   now detects [A2A_ERROR]-prefixed bodies and renders a
   structured error block (red chip, stripped detail, cause hint)
   instead of the previous gray text-block that showed the literal
   "[A2A_ERROR]" string. inferA2AErrorHint mirrors the patterns
   from AgentCommsPanel.inferCauseHint so the same symptom reads
   the same way in both surfaces (Claude SDK init wedge → restart
   workspace; timeout → busy/stuck; connection-reset → transient
   blip then check logs).

Tests: 9 send_a2a_message tests pass (including a new regression
test for the empty-stringifying-exception case that the user
reported); 995 canvas tests pass; tsc clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 23:40:05 -07:00
Hongming Wang
54f7c75c81 fix(canvas): make AgentCommsPanel load failures observable
Reported symptom: canvas edges show "1 call · just now" between two
agents, but the Agent Comms tab for the source workspace renders
"No agent-to-agent communications yet" — even though
GET /workspaces/<id>/activity?source=agent&limit=50 returns a2a_send
+ a2a_receive rows.

Confirmed via curl that the API does return the rows the panel
should map. The panel's load handler was the suspect, but it had:

  .catch(() => setLoading(false))

which swallowed every failure path — network errors, JSON parse,
ANY throw inside the .then body — without leaving a single trace in
the console. The panel just sat on its empty state and gave the user
zero signal to act on. (And by extension, gave us nothing to debug
remotely either.)

Two changes:

1. Wrap the per-row `toCommMessage` call in a try/catch so one
   malformed activity row (unexpected request_body shape, etc.)
   doesn't throw out of the for-loop and skip the
   setMessages(msgs) line. Previously the panel would silently
   drop the entire batch when ANY row failed to parse.

2. Replace the bare `.catch(() => setLoading(false))` with a
   logging variant. Now a future "panel stuck empty" report comes
   with `AgentCommsPanel: load activity failed <err>` or
   `AgentCommsPanel: failed to map activity row {...}` in the
   console — diagnosable instead of opaque.
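The hardened per-row mapping, sketched (names assumed; the real
handler lives inside the panel's load effect):

```typescript
type CommMessage = { id: string; summary: string };

function mapRows(
  rows: unknown[],
  toCommMessage: (row: unknown) => CommMessage,
): CommMessage[] {
  const msgs: CommMessage[] = [];
  for (const row of rows) {
    try {
      msgs.push(toCommMessage(row));
    } catch (err) {
      // Previously one bad row threw out of the loop and dropped the
      // whole batch; now it logs and the rest of the batch survives.
      console.warn("AgentCommsPanel: failed to map activity row", row, err);
    }
  }
  return msgs;
}
```

The fetch-level catch gets the same treatment: log
`AgentCommsPanel: load activity failed` with the error before
clearing the loading flag, instead of a bare `() => setLoading(false)`.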

Behavior on the happy path is unchanged (5 existing tests still
pass; tsc clean). This is purely defensive: it makes the failure
path visible so the next stuck-empty report can be root-caused
instead of guessed at.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 23:27:50 -07:00
Hongming Wang
28911ded40 fix(canvas): split shared autoFitTimerRef so settle + tracking fits don't cross-cancel
Bundle-level review caught an implicit coupling in useCanvasViewport
between two distinct fit effects:

  - settle fit: 1200ms one-shot when provisioning transitions to zero
    (deploy just finished — settle on the whole org once)
  - tracking fit: 500ms debounced per molecule:fit-deploying-org event
    (track the org's bounds as children land during the deploy)

Both effects shared a single autoFitTimerRef, so each one's
clearTimeout call could silently cancel the other's pending fit.
Today's behavior happened to land in the right order out of luck —
the tracking handler fires per-arrival during the deploy, then the
settle effect arms after the last child completes. But nothing in
the code enforces that ordering; a future refactor that, say,
fires the settle effect from the same event sequence as the
tracking timer (mid-deploy status flicker) would silently drop the
settle fit because the tracking timer's clearTimeout ran last.

Splitting into settleFitTimerRef + trackingFitTimerRef makes the
two effects fully independent. Cleanup clears both. Tests still pass
(995/995); the refactor is mechanical.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 23:19:02 -07:00
Hongming Wang
fc54601999
Merge pull request #2067 from Molecule-AI/fix/canary-openai-key-staging
ci(canary): inject E2E_OPENAI_API_KEY so A2A turn doesn't 500
2026-04-25 06:12:30 +00:00
Hongming Wang
52d203a098
Merge pull request #2068 from Molecule-AI/ci/sweep-stale-e2e-orgs
ci: hourly sweep of stale e2e-* orgs on staging
2026-04-25 06:12:29 +00:00
Hongming Wang
fe075ee1ba ci: hourly sweep of stale e2e-* orgs on staging
Adds a janitor workflow that runs every hour and deletes any
e2e-prefixed staging org older than MAX_AGE_MINUTES (default 120).
Catches orgs left behind when per-test-run teardown didn't fire:
CI cancellation, runner crash, transient AWS error mid-cascade,
bash trap missed (signal 9), etc.

Why it exists despite per-run teardown:
- Per-run teardown is best-effort by definition. Any process death
  after the test starts but before the trap fires leaves debris.
- GH Actions cancellation kills the runner with no grace period —
  the workflow's `if: always()` step usually catches this but can
  still fail on transient CP 5xx at the wrong moment.
- The CP cascade itself has best-effort branches today
  (cascadeTerminateWorkspaces logs+continues on individual EC2
  termination failures; DNS deletion same shape). Those need
  cleanup-correctness work in the CP, but a safety net belongs in
  CI either way — defense in depth.

Behaviour:
- Cron every hour. Manual workflow_dispatch with overrideable
  max_age_minutes + dry_run inputs for one-off cleanups.
- Concurrency group prevents two sweeps fighting.
- SAFETY_CAP=50 — refuses to delete more than 50 orgs in a single
  tick. If the CP admin endpoint goes weird and returns no
  created_at (or returns no orgs at all), every e2e-* would look
  stale; the cap catches the runaway-nuke case.
- DELETE is idempotent CP-side via org_purges.last_step, so a
  half-deleted org from a prior sweep gets picked up cleanly on the
  next tick.
- Per-org delete failures don't fail the workflow. Next hourly tick
  retries. The workflow only fails loud at the safety-cap gate.

Tonight's specific motivation: ~10 canvas-tabs E2E retries in 2 hours
with various failure modes; each provisioned a fresh tenant + EC2 +
DNS + DB row. Some fraction leaked. Without this loop, ops has to
periodically run the manual sweep-cf-orphans.sh script. With it,
staging self-heals.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 23:07:57 -07:00
Hongming Wang
43c28710ac
Merge pull request #2066 from Molecule-AI/fix/e2e-staging-status-field
fix(e2e): poll instance_status not status — staging E2E never matched the field, masked all real bugs
2026-04-25 05:58:36 +00:00
Hongming Wang
06c85bd185
Merge pull request #2045 from Molecule-AI/feat/flat-rate-pricing-1833
feat(canvas): flat-rate pricing — rename Starter→Team, Pro→Growth (Issue #1833)
2026-04-25 05:54:06 +00:00
Hongming Wang
e0f338e8ae fix(canvas): plug timer leak + optimistic-install semantics in SkillsTab
Three review-driven fixes plus regression coverage for the bugs
landed in 176b703d / deedb5ef:

1. clearTimeout the prior reload handle before scheduling a new one in
   both installFromSource and handleUninstall. Two installs within the
   PLUGIN_RELOAD_DELAY_MS window (15s) used to queue two
   loadInstalled() calls; the unmount cleanup only cleared the latest
   handle, and the second reconciliation could overwrite a still-
   correct optimistic state with a stale snapshot mid-restart.

2. Drop `setInstalledLoaded(true)` from the optimistic block. That
   flag's contract is "the initial GET has succeeded at least once" —
   it gates the auto-expand-registry effect. A user installing a
   custom-source plugin BEFORE the initial fetch returned would flip
   the gate prematurely, the auto-expand would never fire, and a
   follow-up loadInstalled racing with the optimistic write could
   overwrite our entry with [] mid-restart.

3. Don't force `supported_on_runtime: true` on the optimistic record.
   The "inert on this runtime" badge in the row renders on the value
   `=== false`. Forcing true would hide the badge for 15s if the user
   installed a plugin that doesn't actually support the workspace's
   runtime; the real value lands at refetch. Leaving the field
   undefined keeps the badge neutral until reconciliation arrives.
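The clear-before-reschedule pattern from fix 1, as a minimal sketch
(timer plumbing is injected so the collapse behavior is testable
without real timers; the component itself uses plain
setTimeout/clearTimeout with PLUGIN_RELOAD_DELAY_MS):

```typescript
type Timers = {
  arm: (cb: () => void, ms: number) => number;
  disarm: (handle: number) => void;
};

function scheduleReload(
  ref: { current: number | null },
  timers: Timers,
  reload: () => void,
  delayMs: number,
): void {
  // Two installs inside the delay window must collapse into a single
  // reconciliation: clear the prior handle before arming a new one.
  if (ref.current !== null) timers.disarm(ref.current);
  ref.current = timers.arm(reload, delayMs);
}
```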

Plus a behavioral test (SkillsTab.install.test.tsx) that asserts:
  - the install POST URL contains the workspaceId (not "undefined")
  - the row's "Install" button is replaced by the green "Installed"
    tag synchronously after POST resolves, without advancing any
    timer — locks in the optimistic-update contract so a future
    refactor can't silently regress it.

995 canvas tests pass (2 new); tsc clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 22:47:46 -07:00
Hongming Wang
deedb5eff6 fix(canvas): optimistic plugin install so the UI flips to "Installed" instantly
After clicking Install, the button reverted from "Installing..." → "Install"
the moment the POST returned, then sat there for ~15s before the green
"Installed" tag appeared. The 15s gap is PLUGIN_RELOAD_DELAY_MS — we
delay the GET /workspaces/:id/plugins refetch to wait for the workspace
to restart (the listing handler returns [] while the container is
restarting because findRunningContainer comes up empty).

Uninstall already does optimistic local-state mutation (line 244 prior
to this commit) so the green tag → install button transition is
instant. Install was the inconsistent half — push the registry entry
into `installed` immediately after POST returns 200 and let the
delayed refetch reconcile.

The optimistic record uses the registry entry's metadata (name,
version, description, tags, runtimes, skills) and sets
supported_on_runtime=true. If reconciliation later disagrees (server
filter, install actually failed at the runtime layer), the refetch
overwrites the local record. Worst case is a brief 15s window where
we show "Installed" for a plugin that won't load — same window the
user previously experienced as "stuck on Install button" — but flipped
to the correct expected state.

Custom-source installs (github://, etc.) don't have a registry entry
to use, so they keep the old behavior of waiting for the refetch. Most
users install from the registry list in the UI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 22:41:51 -07:00
Hongming Wang
9a785e9c32 ci(canary): inject E2E_OPENAI_API_KEY so A2A turn doesn't 500
The canary workflow has been failing for ~30 consecutive runs (issue
#1500, opened 2026-04-21) on the same line:

  [hermes-agent error 500] No LLM provider configured. Run `hermes
  model` to select a provider, or run `hermes setup` for first-time
  configuration.

Root cause: the canary's env block was missing E2E_OPENAI_API_KEY.
Without it, tests/e2e/test_staging_full_saas.sh provisions the workspace
with empty secrets; template-hermes start.sh seeds ~/.hermes/.env with
no provider keys; derive-provider.sh resolves the model slug
`openai/gpt-4o` to PROVIDER=openrouter (hermes has no native openai
provider in its registry); A2A request at step 8/11 fails with the
"No LLM provider configured" error from hermes-agent.

The full-lifecycle workflow (e2e-staging-saas.yml line 84) carries the
same secret correctly. Mirror its pattern + add a fail-fast preflight
so future regressions surface in <5s instead of after 8 min of
provision-then-die.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 22:37:13 -07:00
Hongming Wang
176b703dbc fix(canvas): plugin install POSTed to /workspaces/undefined/plugins
SkillsTab read `data.id` from its props and used the value to build
two API URLs:
  POST   /workspaces/${data.id}/plugins
  DELETE /workspaces/${data.id}/plugins/${pluginName}

But `data` is the React Flow node.data blob (WorkspaceNodeData) —
the workspace id lives on `node.id`, NOT on `node.data`. WorkspaceNodeData
extends `Record<string, unknown>`, which makes `data.id` type-check
silently as `unknown` instead of erroring. So every install/uninstall
hit `/workspaces/undefined/plugins`, the server's not-found path
returned 503 "workspace container not running" (misleading — the real
issue was the bogus URL), and the user got a confusing toast.

Every other tab in SidePanel takes `workspaceId={selectedNodeId}` as
an explicit prop. SkillsTab was the lone outlier, presumably because
"data has all the fields I need" is the obvious-looking shortcut that
TypeScript can't catch through the index-signature interface.

Fix: make `workspaceId` an explicit prop on SkillsTab, drop the
`data.id` reads, thread the prop from SidePanel like the other tabs.
Test fixture updated to pass it.

Verified: 993 canvas tests pass; tsc clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 22:36:35 -07:00
Hongming Wang
ee429cfee7 fix(canvas,dotenv): review-driven hardening of fit gate + parser parity
Independent code review surfaced two required documentation fixes and
one growth-correctness gap. All addressed here.

Auto-fit gate (useCanvasViewport):

The previous "subtree-grew-by-count" check missed the delete-then-add
case: subtree of 6 → delete one → 5 → a different child arrives → 6
again. A length-only comparison reads no growth and the fit is
skipped, leaving the new node off-screen. Switched to an id-set
membership snapshot so any brand-new id forces the fit even when the
count is unchanged.

The gate logic is now extracted as a pure exported function
`shouldFitGrowing(currentIds, prevIds, userPannedAt, lastAutoFitAt)`
so the regression-prone decision can be unit-tested in isolation
without standing up React Flow + DOM event refs. 8 cases cover:
first-fit, empty-prior, brand-new id, status-update with user pan,
no-pan-ever, pan-before-last-fit, delete-then-add same length, and
shrink-only with user pan.
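A sketch of the gate (the real signature is quoted above; the exact
semantics here are a plausible reading of the description, not a copy):

```typescript
function shouldFitGrowing(
  currentIds: Set<string>,
  prevIds: Set<string> | null,
  userPannedAt: number | null,
  lastAutoFitAt: number | null,
): boolean {
  // First fit for this root: nothing recorded yet.
  if (prevIds === null) return true;
  // Any brand-new id forces the fit, even when the count is unchanged
  // (delete-then-add) and even if the user has panned.
  for (const id of currentIds) {
    if (!prevIds.has(id)) return true;
  }
  // No new ids (status updates / shrink-only): respect a user pan
  // that happened after our last auto-fit.
  if (userPannedAt === null) return true;
  return lastAutoFitAt !== null && userPannedAt < lastAutoFitAt;
}
```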

Parser parity (dotenv.go + next.config.ts):

Existing-env semantics were undocumented in both parsers. Both now
explicitly note that an explicitly-set empty string (`KEY=` from the
parent shell) counts as "set" — the file value does NOT backfill —
matching the Go (os.LookupEnv) and Node (`process.env[k] !==
undefined`) primitives.

`export ` prefix uses a literal space; `export\tFOO=bar` is
intentionally rejected. Added the same comment in both parsers
to lock in this parity invariant since the commit message claims
"if one parser changes, the other has to."

Skipped (per analysis):
- Drag-pan respect for left-click drag-pan during deploy. The
  growth-check safety net means any pan gets overridden on the
  next arrival anyway, which is the desired behavior for the
  "watch the org deploy" use case. After deploy completes, no
  more fit-deploying-org events fire so drag-pan works freely.
- Map cleanup for lastFitSubtreeIdsRef. Per-tab session, UUID
  keys, tiny entries — not worth the cleanup hook.

993 canvas tests pass (8 new); Go dotenv tests pass; tsc clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 22:23:51 -07:00
Hongming Wang
e900a773ac fix(canvas): keep tracking org bounds during deploy after first fit
Symptom: org import zoomed to fit the parent + first child, then froze
at that framing while the remaining children kept materialising
off-screen. The user had to manually pan/zoom to see the new arrivals.

Two stacked bugs in useCanvasViewport's deploy-time auto-fit:

1. The user-pan-respect gate stamps userPannedAtRef on EVERY
   pointerdown that lands inside .react-flow__pane. That fires for
   ordinary clicks (deselect, click-near-a-card, modal-close-bubble
   from the import dialog) — not just for actual pan gestures. One
   accidental pre-import click was enough to lock out every fit for
   the rest of the deploy. Wheel is the canonical unambiguous
   pan/zoom signal; drop pointerdown.

2. Even with a real pan during deploy, when more children land the
   org's bounds grow and the user has lost context — the new
   arrivals are off-screen and the deploy is the primary thing they
   want to watch right now. The guard had no growth awareness, so
   one pan cancelled all follow-up fits unconditionally. Now we
   track the subtree size at the last fit (per root), and if the
   current subtree is larger we force the fit through regardless of
   the user-pan timestamp. When the subtree size hasn't changed
   (status updates on already-positioned nodes), the user-pan
   respect still applies — so post-deploy exploration isn't
   yanked back.

The Map keyed by root id supports back-to-back imports of different
orgs without one's growth count blocking the other's first fit.

985 canvas tests pass; tsc clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 21:37:54 -07:00
Hongming Wang
ec7ecd5461 fix(canvas): load monorepo .env in next.config so WS connects in dev
Symptom: spawn animation missing on org import. Workspaces appeared in
their final positions all at once instead of materialising one-by-one.

Root cause: the WS pill said "Reconnecting" forever because the canvas
was trying to connect to ws://localhost:3000/ws — its own port, where
Next.js dev doesn't serve a WebSocket — instead of the platform's
ws://localhost:8080/ws.

Why: deriveWsBaseUrl() falls back to window.location when
NEXT_PUBLIC_WS_URL is unset. Next.js auto-loads .env from the project
root only — and the canonical NEXT_PUBLIC_WS_URL /
NEXT_PUBLIC_PLATFORM_URL live in the monorepo root .env, alongside the
Go platform's MOLECULE_ENV / DATABASE_URL. Without an extra
canvas/.env.local copy (which would still be a per-developer manual
step), the canvas dev server starts blind to those vars.

Fix: next.config.ts now walks upward from __dirname looking for the
monorepo root (same workspace-server/go.mod sentinel the platform's
dotenv loader uses) and merges the root .env into process.env BEFORE
Next.js compiles. Existing env wins over file values, so docker
runs / CI / explicit exports still dominate.

The parser is a TypeScript mirror of workspace-server/cmd/server/
dotenv.go's parseDotEnvLine — same rules (export prefix, quotes,
inline comments, BOM) so a single .env line behaves identically across
both processes. If one parser changes, the other has to.

Production unaffected: `output: "standalone"` bakes resolved env into
the build, the workspace-server sentinel isn't shipped in deploy
artifacts, and the existing-env-wins rule means container env
dominates anywhere this file is consulted at runtime.

Verified: canvas dev startup log now shows
"[next.config] loaded 49 vars from /Users/.../molecule-core/.env";
served bundle has the correct ws://localhost:8080/ws URL; WS pill
flips to "Connected" after a hard refresh and per-workspace spawn
animations fire on the next org import as expected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 21:29:05 -07:00
Hongming Wang
4014513b94 fix(dotenv): empty value with inline comment was returning the comment
The repo's own .env contains lines like
  CONFIGS_DIR=                   # Path to workspace-configs-templates/...
where the value is empty + an inline comment. The pre-fix parser:
  1. v = "                   # Path to ..."
  2. TrimLeft → "# Path to ..."
  3. Inline-comment loop looked for " #" or "\t#" — neither matches
     because the leading whitespace is gone.
  4. Returned the comment text as the value.

Result: os.Setenv("CONFIGS_DIR", "# Path to ...") clobbered the auto-
discovery fallback. The TemplatesHandler then opened the comment as
a directory, ReadDir errored silently, and GET /templates returned
[]. Canvas's Templates panel showed "No templates found in
workspace-configs-templates/" even though 8 valid templates existed
on disk.

Fix: strip leading whitespace from the value FIRST, then run a
position-aware comment scan that treats `#` as a comment marker iff
it's at the start of the (trimmed) value or preceded by whitespace.
A bare `#` mid-value (e.g. `KEY=token#fragment`) still survives.

Quoted-value handling moved above the comment scan so
`KEY="value # not"` keeps the `#` as part of the value — pulled the
quote-detection into the same TrimLeft-then-check shape as the bare
path. The unterminated-quote case still falls through to bare-value
handling.
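The fixed value handling, sketched in TypeScript (the shipped parsers
are dotenv.go and its next.config.ts mirror; this function name is an
assumption):

```typescript
function parseDotEnvValue(raw: string): string {
  // Strip leading whitespace FIRST, so the comment scan sees the
  // value's real first character.
  let v = raw.replace(/^[\s\uFEFF]+/, "");
  // Quoted values are handled before the comment scan, so `#` inside
  // quotes stays part of the value; an unterminated quote falls
  // through to bare-value handling.
  if (v.length >= 2 && (v[0] === '"' || v[0] === "'")) {
    const end = v.indexOf(v[0], 1);
    if (end !== -1) return v.slice(1, end);
  }
  // Position-aware scan: `#` is a comment marker iff it starts the
  // trimmed value or is preceded by whitespace. A bare mid-value `#`
  // (KEY=token#fragment) survives.
  for (let i = 0; i < v.length; i++) {
    if (v[i] === "#" && (i === 0 || v[i - 1] === " " || v[i - 1] === "\t")) {
      v = v.slice(0, i);
      break;
    }
  }
  return v.replace(/\s+$/, "");
}
```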

Three regression tests added covering the exact .env line that
broke (`CONFIGS_DIR=    # ...`), spaces-only with comment, and tab-
only with comment.

Verified end-to-end: GET /templates now returns all 8 templates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 21:17:21 -07:00
Hongming Wang
9a223afba1 fix(dotenv,socket): review-driven hardening of .env loader + WS poll
Independent code review surfaced three required fixes and one cheap
optional one. All addressed here.

dotenv parser:
- `export FOO=bar` was parsed as key `"export FOO"` (with embedded
  space) and silently os.Setenv'd, so a developer pasting from a
  direnv `.envrc` would get junk vars. Now strips the prefix.
- Quoted values weren't unwrapped: `FOO="hello world"` produced value
  `"hello world"` with literal quotes. Now strips one matched pair of
  surrounding `"` or `'`. Inside a quoted value `#` is part of the
  value, not a comment marker (matches godotenv convention).
- UTF-8 BOM at file start (Windows editors) would have produced a
  first key like U+FEFF + "FOO". Now stripped via TrimPrefix.

dotenv loader:
- findDotEnv()'s upward walk would happily pick up `~/.env` or a
  sibling-repo `.env` if the binary was run from `~/Documents/other-
  project/`. Real foot-gun on shared dev boxes. Now gated on a
  monorepo sentinel: the candidate directory must contain
  `workspace-server/go.mod`. Falls through to "no .env found" (=
  pre-fix behavior) when the sentinel is absent.
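The sentinel-gated walk, sketched (assumed names; the shipped
versions are findDotEnv in dotenv.go and the mirror in
next.config.ts — this variant keeps walking upward until it sees a
directory holding both the .env and the sentinel):

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

function findDotEnv(startDir: string): string | null {
  let dir = path.resolve(startDir);
  for (;;) {
    const candidate = path.join(dir, ".env");
    // Monorepo sentinel: only a directory that also contains
    // workspace-server/go.mod counts, so ~/.env or a sibling repo's
    // .env is never picked up.
    const sentinel = path.join(dir, "workspace-server", "go.mod");
    if (fs.existsSync(candidate) && fs.existsSync(sentinel)) return candidate;
    const parent = path.dirname(dir);
    if (parent === dir) return null; // hit filesystem root: no .env found
    dir = parent;
  }
}
```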

socket fallback poll:
- startFallbackPoll() previously fired only on onclose, so the very
  first connect attempt — when onclose hasn't fired yet because we
  never had a successful onopen — left the canvas with no HTTP poll
  for the duration of the failing handshake (Chrome can hold a
  SYN-SENT WebSocket open ~75s before giving up). Now also called at
  the top of connect(); the timer-already-running guard makes it a
  no-op when one cycle later onclose calls it again.
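The guard that makes the double call safe, sketched (names assumed):

```typescript
let fallbackTimer: ReturnType<typeof setInterval> | null = null;

function startFallbackPoll(pollFn: () => void, intervalMs: number): void {
  if (fallbackTimer !== null) return; // already polling: no-op
  fallbackTimer = setInterval(pollFn, intervalMs);
}

function stopFallbackPoll(): void {
  if (fallbackTimer !== null) {
    clearInterval(fallbackTimer);
    fallbackTimer = null;
  }
}
```

Calling it both at the top of connect() and again from onclose is
then harmless: the second call sees the running timer and returns.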

Test coverage added: export prefix, single+double quoted values, hash
inside quotes preserved, unterminated quote falls back to bare value,
CRLF stripping locked in, BOM stripping, and a sentinel-rejection
regression test that creates a temp .env with no workspace-server
sibling and asserts findDotEnv refuses to load it.

Verified: 985 canvas tests + 30 dotenv subtests + 4 dotenv integration
tests all pass; tsc clean; rebuilt platform from monorepo root with
stripped env still loads .env (49 vars) and /workspaces returns 200.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 21:09:18 -07:00
Hongming Wang
21db85d691 fix(canvas): cascade delete locally so children disappear without WS
Deleting a parent on a wedged WS used to leave the child cards on
the canvas as orphaned roots until the user manually refreshed.

Why: Canvas.tsx and DetailsTab.tsx both called `removeNode(parentId)`
after `DELETE /workspaces/:id?confirm=true` returned 200. `removeNode`
deliberately re-parents children rather than cascading — it relies on
the per-descendant WORKSPACE_REMOVED WS events the platform emits as
part of the cascade to drop each child individually. When the WS is
unhealthy those events never arrive, so the local store keeps the
children alive (now re-parented to root since their actual parent is
gone).

Fix: new `removeSubtree(rootId)` action on the canvas store mirrors
the server-side cascade — drops the root + every descendant + every
incident edge in one atomic set(). Both delete call sites now use it.
The WS events still arrive when WS is healthy and become idempotent
no-ops because the nodes are already gone.
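The cascade can be sketched as a pure function over a flat node/edge shape (the store types and names here are illustrative, not the shipped Zustand store):

```typescript
// Drop the root, every descendant, and every incident edge in one pass,
// mirroring the server-side cascade.
interface Node { id: string; parentId: string | null }
interface Edge { id: string; source: string; target: string }

function removeSubtree(nodes: Node[], edges: Edge[], rootId: string) {
  const doomed = new Set<string>([rootId]);
  // Grow the doomed set until no node's parent is in it.
  let grew = true;
  while (grew) {
    grew = false;
    for (const n of nodes) {
      if (n.parentId !== null && doomed.has(n.parentId) && !doomed.has(n.id)) {
        doomed.add(n.id);
        grew = true;
      }
    }
  }
  return {
    nodes: nodes.filter((n) => !doomed.has(n.id)),
    // An edge is incident if either endpoint is doomed.
    edges: edges.filter((e) => !doomed.has(e.source) && !doomed.has(e.target)),
  };
}
```

In the store, both filters land in one set(), which is why a later WORKSPACE_REMOVED event for an already-dropped node is an idempotent no-op.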

Why a new action instead of changing removeNode: removeNode's
re-parenting behavior is correct for non-cascading flows (drag-out,
manual node detach in the future). Adding a sibling action keeps
both call shapes available rather than forcing every caller to opt
out of cascade.

6 new unit tests cover root cascade, mid-level cascade, leaf
no-op-cascade, selection clearing across the subtree, selection
preservation outside the subtree, and edge cleanup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 20:51:09 -07:00
Hongming Wang
e58ecf2974 fix(e2e): scrollIntoView before toBeVisible — clipped tabs were "missing"
Seventh E2E bug, surfaced after the AuthGate mock from the previous
commit finally let the harness reach the tab-iteration loop:

  Error: tab-skills button missing — TABS list may have drifted
  Locator: locator('#tab-skills')

The TABS bar in SidePanel is `overflow-x-auto` (intentional — there
are 13 tabs and they don't all fit on smaller viewports; the
right-edge fade gradient signals the overflow). Tabs after position
~3 are clipped, and Playwright's `toBeVisible()` returns false for
clipped elements (it checks getBoundingClientRect against viewport).

Fix: `scrollIntoViewIfNeeded()` before the visibility assertion,
mirroring what SidePanel's own keyboard handler does on arrow-key
navigation. The tab is then in view and `toBeVisible()` passes.

This was the test's seventh and (probably) final harness bug. The
full chain, mapped back from "staging E2E timed out at 1200s" this
morning:

  1. instance_status field name (#2066)
  2. staging.moleculesai.app DNS zone (#2066)
  3. X-Molecule-Org-Id TenantGuard header (#2066)
  4. Hydration selector waited pre-click (#2066)
  5. networkidle never settles (this PR's parent commits)
  6. AuthGate /cp/auth/me redirect
  7. Tab buttons clipped by overflow-x-auto

If THIS run still fails, the failure surfaces in actual product
behavior (a tab's panel content), not test mechanics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 20:37:36 -07:00
Hongming Wang
f8c900909e fix(platform): auto-load .env from CWD on startup
Local dev runs (`/tmp/molecule-server` after `go build`) used to 401 on
/workspaces the moment the DB had any workspace token in it: the binary
inherited a bare shell env with no MOLECULE_ENV, so AdminAuth's dev
fail-open branch (gated on MOLECULE_ENV=development) didn't fire.

The repo's .env already has MOLECULE_ENV=development plus DATABASE_URL,
REDIS_URL, ADMIN_TOKEN=, etc. Until now you had to `set -a && source
.env` in the launching shell — a paper cut, but worse, it's a paper
cut in EVERY automated dev workflow (IDE run configs, integration
test harnesses, the smoke-test loop in this branch's manual testing).

Fix: cmd/server now walks upward from CWD looking for a .env (capped
at 6 levels) and merges KEY=VALUE pairs into os.Environ before any
other code reads env. Already-set vars win over file values, so
docker run -e / CI exports / `KEY=val ./binary` still dominate — only
unset keys get filled in.

Why no godotenv dep: the format we use is plain KEY=VALUE with `#`
comments, no interpolation, no quoting (verified against the live
.env: 49 kv lines, zero references to ${...} or `export`). A 30-line
parser is auditable and avoids supply-chain surface.
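The lookup and merge rules can be sketched as two small functions. The shipped code is Go in cmd/server; this is a hypothetical TypeScript sketch with the filesystem check injected so the walk is testable:

```typescript
// Walk upward from CWD looking for a .env, capped at maxLevels.
function findDotEnv(
  cwd: string,
  exists: (path: string) => boolean,
  maxLevels = 6,
): string | null {
  let dir = cwd;
  for (let i = 0; i < maxLevels; i++) {
    const candidate = dir === "/" ? "/.env" : `${dir}/.env`;
    if (exists(candidate)) return candidate;
    if (dir === "/") break;
    dir = dir.slice(0, dir.lastIndexOf("/")) || "/";
  }
  return null; // nothing found: load silently does nothing
}

// Already-set vars win over file values; only unset keys get filled in.
function mergeEnv(
  env: Record<string, string>,
  fileVars: Record<string, string>,
): Record<string, string> {
  const out = { ...env };
  for (const [k, v] of Object.entries(fileVars)) {
    if (!(k in out)) out[k] = v;
  }
  return out;
}
```

The existing-env-wins rule is what keeps docker run -e, CI exports, and `KEY=val ./binary` dominant even when a .env is present.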

Why it's safe in production: Dockerfile doesn't COPY .env into the
image and .env is gitignored, so prod containers have no .env on
disk to load — the function's findDotEnv() loop finds nothing and
returns silently. If an operator deliberately drops one in, the
existing-env-wins rule means container-injected env still dominates.

Verified by booting `env -i HOME=$HOME PATH=$PATH /tmp/molecule-server`
from the repo root with a stripped env: log shows
".env: /Users/.../molecule-core/.env — loaded 49, 0 already set" and
/workspaces returns 200 instead of 401.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 20:33:28 -07:00
Hongming Wang
0b4dfbd121 fix(canvas): suppress stale provisioning banners + add WS-down HTTP fallback poll
Two related fixes for the case where the canvas thinks workspaces are
stuck provisioning when they're actually online:

1. ProvisioningTimeout banners now gate on wsStatus === "connected".
   While the WS is in connecting/disconnected state, the local
   "provisioning" status reflects the last event received before the
   drop — workspaces may have transitioned to online minutes ago. The
   8m timeout was firing against frozen state and showing a wall of
   yellow warnings on already-online workspaces.

2. Socket layer now starts a 10s rehydrate poll when the WS goes
   unhealthy (onclose) and stops it on onopen/disconnect. The
   reconnect attempts continue in parallel; whichever recovers first
   wins. rehydrate()'s existing dedup gate prevents the open-time
   rehydrate from racing with a fallback poll. Without this the
   store could stay frozen for minutes while WS exponential backoff
   chewed through retries.
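The gating rule in fix (1) reduces to a small predicate; a sketch, with illustrative names and the 8m threshold from the commit text:

```typescript
type WsStatus = "connected" | "connecting" | "disconnected";

const PROVISIONING_TIMEOUT_MS = 8 * 60 * 1000;

function shouldShowProvisioningBanner(
  status: string,
  wsStatus: WsStatus,
  provisioningSinceMs: number,
  nowMs: number,
): boolean {
  // While the WS is down, local status is frozen at the last event
  // received before the drop, so never warn against stale state.
  if (wsStatus !== "connected") return false;
  return (
    status === "provisioning" &&
    nowMs - provisioningSinceMs >= PROVISIONING_TIMEOUT_MS
  );
}
```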

Plus the previously-uncommitted TemplatePalette flushSync change so
the import modal unmounts synchronously before doImport runs (otherwise
React batches the close with the import's setState prefix and the
modal backdrop hides the spawn animation).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 20:22:15 -07:00
Hongming Wang
6c70b413e0 fix(e2e): mock /cp/auth/me — AuthGate redirect was preventing canvas render
Sixth E2E bug, surfaced after the page.goto-domcontentloaded fix
finally let the navigation complete. The harness now reaches the
canvas-root selector wait but still times out because the canvas
never renders:

  TimeoutError: page.waitForSelector: Timeout 45000ms exceeded.
  waiting for [aria-label="Molecule AI workspace canvas"]

Root cause: canvas/src/components/AuthGate.tsx wraps the page,
fetches /cp/auth/me on mount, and redirects to the login page when
the response is 401. The bearer header we set via
context.setExtraHTTPHeaders works for platform API calls but does
NOT satisfy /cp/auth/me — that endpoint is cookie-based (WorkOS
session). So:

  1. AuthGate mounts
  2. Calls fetchSession() → /cp/auth/me → 401 (no session cookie)
  3. AuthGate transitions to anonymous → redirectToLogin()
  4. Browser navigates away from tenant URL
  5. The React Flow canvas root with the aria-label never mounts
  6. waitForSelector times out at 45s

Fix: context.route() intercepts /cp/auth/me and returns a fake
Session JSON so AuthGate resolves to "authenticated" and renders
its children. The session contents are cosmetic — Session.org_id
and Session.user_id appear in a few canvas surfaces but never fail
on dummy values.
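The interception amounts to a handler that fulfills the request with a canned Session body. A sketch mirroring Playwright's route/fulfill shape; the session field values are illustrative dummies, per the commit text:

```typescript
interface FulfillArgs { status: number; contentType: string; body: string }
interface RouteLike { fulfill(args: FulfillArgs): void }

// Returns a fake Session so AuthGate resolves to "authenticated".
function fulfillFakeSession(route: RouteLike): void {
  route.fulfill({
    status: 200,
    contentType: "application/json",
    body: JSON.stringify({
      user_id: "e2e-user", // hypothetical dummy values; cosmetic only
      org_id: "e2e-org",
    }),
  });
}
// In the harness, roughly:
//   await context.route("**/cp/auth/me", (route) => fulfillFakeSession(route));
```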

This is the cleanest fix path. Alternatives considered + rejected:
  - Add a ?e2e=1 backdoor to AuthGate: production code shouldn't
    have a "skip auth" flag, even gated.
  - Real WorkOS login flow in Playwright: too much overhead per run.
  - Skip the canvas UI test, test only API: defeats the point of
    the staging E2E (which is to catch UI regressions before
    promotion).

After this lands the harness should reach the workspace-node click
step and exercise tabs — only then can a real product bug (rather
than a test-harness bug) surface. The 6-bug chain mapped to:
  1. instance_status field name (#2066)
  2. staging.moleculesai.app DNS zone (#2066)
  3. X-Molecule-Org-Id TenantGuard header (#2066)
  4. Hydration selector waited pre-click (#2066)
  5. networkidle never settles (this commit's parent)
  6. AuthGate /cp/auth/me redirect (this commit)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 19:59:04 -07:00
Hongming Wang
1d71b4e9e5 fix(canvas): bundle of UX hardening — modals, position stability, error UX, paste
Single-themed bundle of fixes accumulated while polishing the canvas
chat / agent-comms / plugins / position flows. Each piece is small;
the connective tissue is "things observable from the canvas right
panel and the org-deploy flow that surprised real users".

UI / composer
  - Legend: add close X + persisted-localStorage state + reopener
    pill; default open for first-time users.
  - SidePanel: rename "Skills" tab label → "Plugins" (single-line;
    internal panelTab enum value, component name, and store keys
    unchanged).
  - SkillsTab: registry tri-state UI (loading / error / empty) with
    actionable Retry button + 10s explicit fetch timeout. Handle
    AbortSignal.timeout's DOMException by name (TimeoutError /
    AbortError) — Chromium's "signal timed out" message wouldn't
    match the prior naive /timeout/ regex. Reset mountedRef on every
    mount: pre-existing StrictMode dev-mode bug where cleanup-only
    `current = false` was never re-set, permanently wedging every
    `if (mountedRef.current) setX(...)` guard and producing a
    "Loading…" panel that never resolved on hard refresh.
  - ChatTab: paste-image-from-clipboard via onPaste handler; unique
    monotonic-counter filenames so same-second pastes don't collide
    on name+size dedup. mime→ext map avoids `image/svg+xml`-style
    raw extensions on synthesised filenames. Bypasses the
    DataTransfer constructor so Safari < 14.1 / older Edge work.
  - ChatTab: drop stuck error toast when the WS path already
    delivered the agent reply but the HTTP path errored late
    (sendingFromAPIRef gate now covers the .catch() handler).
  - ChatTab: filter heartbeat-style internal self-messages from the
    My Chat tab so historical rows with source_id=NULL don't
    surface as user-typed input.
  - Modal portals: OrgImportPreflightModal + MissingKeysModal
    (ProviderPickerModal + AllKeysModal) now createPortal to
    document.body and clamp max-h to 80vh. Escapes the ancestor
    containing block (TemplatePalette's fixed+filtered sidebar
    re-anchored descendants' position:fixed to itself, hiding
    modals behind workspace cards). MissingKeysModal bumped to
    z-[60] for stack ordering when both modals are open.
  - OrgImportPreflightModal saveOne: ref-based microtask-safe
    in-flight gate replaces the brittle "set startValue inside a
    setState updater and read on the next line" pattern (React 18
    doesn't guarantee functional updaters run synchronously; that
    path strands `saving:true` and never calls createSecret). Same
    useRef pattern guards SkillsTab.loadRegistry against concurrent
    fires and Fast-Refresh-stranded promises; force=true parameter
    on retry click bypasses the gate.
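The pasted-image naming scheme from the ChatTab bullet above can be sketched as follows (the map entries and name format are assumptions, not the shipped code):

```typescript
// mime -> extension map avoids raw extensions like "svg+xml" on
// synthesised filenames.
const MIME_EXT: Record<string, string> = {
  "image/png": "png",
  "image/jpeg": "jpg",
  "image/gif": "gif",
  "image/webp": "webp",
  "image/svg+xml": "svg",
};

let pasteCounter = 0;

function pastedImageName(mime: string): string {
  // Monotonic counter, not a timestamp: two pastes in the same second
  // still get distinct names and survive name+size dedup.
  pasteCounter += 1;
  const ext = MIME_EXT[mime] ?? "bin";
  return `pasted-image-${pasteCounter}.${ext}`;
}
```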

Agent comms
  - AgentCommsPanel: derive UI-facing `flow` field instead of using
    activity_type-derived direction. Self-logged a2a_receive rows
    (source_id == workspace_id, what the agent runtime writes to log
    its own outbound delegation replies) now correctly render as
    OUTBOUND with → arrow + right-justified bubble. Previously they
    rendered "← From Self" with Restart pointing at THIS workspace.
  - AgentCommsPanel: error rows replace the unactionable
    "X failed [A2A_ERROR]" body with banner + underlying-error
    code-block + cause-hint (matched on Claude Code SDK init wedge,
    deadline-exceeded, agent-thrown exception, empty-error) +
    Restart [peer] / Open [peer] action buttons.
  - AgentCommsPanel: render text bodies through ReactMarkdown +
    remark-gfm so multi-part replies (tables, code) render properly.
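The flow derivation above reduces to one rule plus a default; a simplified sketch (field names follow the commit text, the exact rule set in toCommMessage may differ):

```typescript
interface ActivityRow {
  activity_type: string;      // e.g. "a2a_receive" | "a2a_send"
  source_id: string | null;
  workspace_id: string;
}

function deriveFlow(row: ActivityRow): "inbound" | "outbound" {
  // A self-logged a2a_receive (source == this workspace) records the
  // agent's own outbound delegation reply: render it as OUTBOUND.
  if (row.source_id === row.workspace_id) return "outbound";
  return row.activity_type === "a2a_receive" ? "inbound" : "outbound";
}
```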

Multi-part text extractor
  - extractReplyText (live A2A response in ChatTab) and
    extractResponseText (chat history loader in message-parser):
    now COLLECT from every source — top-level parts, parts.root.text,
    and artifacts — joined with "\n". Previous "first source wins"
    silently dropped multi-part replies (Hermes summary+detail,
    Claude Code long-form table). Tests cover joined-from-parts,
    joined-from-artifacts, joined-from-both.
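The collect-don't-pick behaviour can be sketched like this (the part and artifact shapes are simplified assumptions, not the real A2A response types):

```typescript
interface Part { text?: string; root?: { text?: string } }
interface Reply { parts?: Part[]; artifacts?: { parts?: Part[] }[] }

function extractReplyText(reply: Reply): string {
  const pieces: string[] = [];
  const take = (p: Part) => {
    if (p.text) pieces.push(p.text);
    else if (p.root?.text) pieces.push(p.root.text);
  };
  for (const p of reply.parts ?? []) take(p);
  for (const a of reply.artifacts ?? []) for (const p of a.parts ?? []) take(p);
  // Every source contributes, joined with "\n"; no "first source wins".
  return pieces.join("\n");
}
```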

Position stability
  - canvas-topology.buildNodesAndEdges: auto-rescue heuristic now
    accepts currentParentSizes map; uses max(initial min, currently
    grown) for the bbox check. Fixes "child jumps to weird location
    after 30s" — the periodic socket health-check rehydrate
    (silenceSec > 30) was rebuilding nodes from scratch, and the
    rescue's reliance on grid-derived initial size false-flagged
    children the user dragged into the user-grown area.
  - canvas.hydrate: pass live measured dimensions from the existing
    store into buildNodesAndEdges.
  - socket.RehydrateDedup: pure exported helper class that gates
    rehydrate calls. Two states — in-flight (in-flight Promise reused
    by concurrent callers) + post-completion window (1.5s, returns
    Promise.resolve()). Initialised with -Infinity so first call
    always passes the gate. Wired into ReconnectingSocket.rehydrate.

A2A edges
  - New A2AEdge custom React Flow edge component portals its label
    out of the SVG layer via EdgeLabelRenderer so labels (a) render
    above workspace cards instead of being hidden behind them and
    (b) accept clicks. Click selects source + switches panel to
    Activity, but only on a NEW selection (preserves current tab on
    re-click of an already-selected source).
  - buildA2AEdges output tagged type:"a2a"; edgeTypes wired in
    Canvas.tsx.

Tests
  - 14 new vitest cases across 4 files (964 → 978 passing):
    OrgImportPreflightModal saveOne single-fire / double-click,
    any-of rendering; AgentCommsPanel toCommMessage flow derivation
    in all four shapes; canvas-topology rescue respects-grown /
    rescues-genuine-drift / fallback-without-live-size; socket
    RehydrateDedup gate behaviour; message-parser multi-part
    response extraction.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 19:54:43 -07:00
Hongming Wang
65b531acf6 fix(workspace): tag self-originated A2A POSTs with X-Workspace-ID
Workspace runtime fired four classes of A2A request to the platform
without the X-Workspace-ID header that identifies the source
workspace: heartbeat self-messages, initial_prompt, idle-loop fires,
and peer-to-peer A2A from runtime tools. The platform's a2a_receive
logger keys source_id off that header — without it, every such row
was written with source_id=NULL, which the canvas's My Chat tab
filters as ?source=canvas (i.e. "user typed this") and rendered the
internal triggers as if the human user had sent them. The
"Delegation results are ready..." heartbeat trigger was visible to
end users in the chat history; delegate_task A2A calls between agents
were misclassified the same way.

Centralise the header construction in a new platform_auth helper
self_source_headers(workspace_id) that returns auth_headers() PLUS
{X-Workspace-ID: <id>}. Apply it to:

  - heartbeat.py self-message (refactored from inline header dict)
  - main.py initial_prompt POST
  - main.py idle_prompt POST
  - a2a_client.py send_a2a_message (peer A2A from runtime)
  - builtin_tools/a2a_tools.py delegate_task (was missing ALL headers)
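The helper itself is Python (platform_auth.self_source_headers); a language-agnostic sketch of the merge it implements, with illustrative auth-header contents:

```typescript
// Stand-in for the runtime's auth_headers(); contents are illustrative.
function authHeaders(): Record<string, string> {
  return { Authorization: "Bearer <runtime-token>" };
}

// auth_headers() PLUS the header the platform's a2a_receive logger
// keys source_id off.
function selfSourceHeaders(workspaceId: string): Record<string, string> {
  return { ...authHeaders(), "X-Workspace-ID": workspaceId };
}
```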

Tests:
  - test_heartbeat.py asserts the X-Workspace-ID header is set on
    the self-message POST.
  - test_a2a_tools_module.py asserts the same on delegate_task POSTs;
    FakeClient.post mocks updated to accept the headers kwarg.

Production effect lands the moment workspace containers are rebuilt
with this code; existing rows in activity_logs keep their NULL
source_id (legacy data). The canvas-side filter (#follow-up)
covers the historical-rows case until backfill.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 19:54:43 -07:00