Closes the gap where the Director would say "ZIP is ready at /tmp/foo.zip"
in plain text instead of attaching a download chip — the runtime literally
had no API for outbound file attachments. The canvas + platform's
chat-uploads infrastructure already supported the inbound (user → agent)
direction (commit 94d9331c); this PR wires the outbound side.
End-to-end shape:
agent: send_message_to_user("Done!", attachments=["/tmp/build.zip"])
↓ runtime
POST /workspaces/<self>/chat/uploads (multipart)
↓ platform
/workspace/.molecule/chat-uploads/<uuid>-build.zip
→ returns {uri: workspace:/...build.zip, name, mimeType, size}
↓ runtime
POST /workspaces/<self>/notify
{message: "Done!", attachments: [{uri, name, mimeType, size}]}
↓ platform
Broadcasts AGENT_MESSAGE with attachments + persists to activity_logs
with response_body = {result: "Done!", parts: [{kind:file, file:{...}}]}
↓ canvas
WS push: canvas-events.ts adds attachments to agentMessages queue
Reload: ChatTab.loadMessagesFromDB → extractFilesFromTask sees parts[]
Either path → ChatTab renders download chip via existing path
Files changed:
workspace-server/internal/handlers/activity.go
- NotifyAttachment struct {URI, Name, MimeType, Size}
- Notify body accepts attachments[], broadcasts in payload,
persists as response_body.parts[].kind="file"
canvas/src/store/canvas-events.ts
- AGENT_MESSAGE handler reads payload.attachments, type-validates
  each entry, attaches to agentMessages queue (see the sketch
  after this list)
- Skips empty events (was: skipped only when content empty)
workspace/a2a_tools.py
- tool_send_message_to_user(message, attachments=[paths])
- New _upload_chat_files helper: opens each path, multipart POSTs
to /chat/uploads, returns the platform's metadata
- Fail-fast on missing file / upload error — never sends a notify
with a half-rendered attachment chip
workspace/a2a_mcp_server.py
- inputSchema declares attachments param so claude-code SDK
surfaces it to the model
- Defensive filter on the dispatch path (drops non-string entries
if the model sends a malformed payload)
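A minimal sketch of that canvas-events validation step. The attachment
field names come from the flow diagram above; the exact guard in
canvas-events.ts may differ:

```ts
interface ChatAttachment { uri: string; name: string; mimeType: string; size: number }

// Validate payload.attachments before queueing: keep only entries that are
// objects with the expected string/number fields instead of trusting the wire.
function validateAttachments(raw: unknown): ChatAttachment[] {
  if (!Array.isArray(raw)) return [];
  return raw.filter((a): a is ChatAttachment =>
    typeof a === "object" && a !== null &&
    typeof (a as ChatAttachment).uri === "string" &&
    typeof (a as ChatAttachment).name === "string" &&
    typeof (a as ChatAttachment).mimeType === "string" &&
    typeof (a as ChatAttachment).size === "number",
  );
}
```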
Tests:
- 4 new Python tests: success path, missing file, upload 5xx, and
  no-attachments backwards compat
- 1 new Go test: Notify-with-attachments persists parts[] in
  response_body so chat reload reconstructs the chip
Why /tmp paths work even though they're outside the canvas's allowed
roots: the runtime tool reads the bytes locally and re-uploads through
/chat/uploads, which lands the file under /workspace (an allowed root).
The agent can specify any readable path.
Does NOT include: agent → agent file transfer. Different design problem
(cross-workspace download auth: peer would need a credential to call
sender's /chat/download). Tracked as a follow-up under task #114.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
[Molecule-Platform-Evolvement-Manager]
Addresses github-code-quality finding on PR #2064:
> Comparison between inconvertible types
> Variable 'info' cannot be of type null, but it is compared to
> an expression of type null.
By line 75, `info` has been narrowed to non-null via the
`if (!info) return null;` guard at line 56 — so `open={info !== null}`
always evaluates to `true`. Switch to JSX shorthand `open` for
clarity and to silence the static check.
Behaviorally identical; the modal still opens whenever the parent
renders this component (which only happens with non-null info).
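A minimal before/after sketch; the component and prop names besides
`info` and `open` are hypothetical:

```tsx
import * as React from "react";

// Stand-in for the real modal component (assumption; not named in the PR).
declare function Dialog(props: { open?: boolean; title: string }): React.ReactElement;

function InfoModal({ info }: { info: { title: string } | null }) {
  if (!info) return null; // narrows `info` to non-null below

  // Before: open={info !== null}  -- statically always true after the guard
  // After:  JSX boolean shorthand, behaviorally identical
  return <Dialog open title={info.title} />;
}
```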
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Critical follow-up to PR #2126's review. Two real bugs:
1. **Runtime QUEUED never resolved.** Platform's drain stitch updates
the platform's delegate_result row when a queued delegation finally
completes, but never pushes back to the runtime. The LLM polling
check_delegation_status saw status="queued" forever — combined with
the new docstring guidance ("queued → wait, peer will reply"), the
model would wait indefinitely on a state that never resolves.
Strictly worse than pre-PR behavior where it would have at least
bypassed.
2. **Live updates dead code.** delegation.go writes activity rows by
direct INSERT INTO activity_logs, bypassing the LogActivity helper
that fires ACTIVITY_LOGGED. Adding "delegation" to the canvas's
ACTIVITY_LOGGED filter (PR #2126 first cut) was inert — initial
GET worked, live updates did not.
Fix:
(1) Runtime side, workspace/builtin_tools/delegation.py:
- New `_refresh_queued_from_platform(task_id)` async helper that
pulls /workspaces/<self>/delegations and finds the platform-side
delegate_result row for our task_id.
- check_delegation_status calls _refresh when local status is
QUEUED, so the LLM's poll itself drives state convergence.
- Best-effort: GET failure leaves local state untouched, next
poll retries.
- Docstring updated to reflect the actual behavior ("polls
transparently — keep polling and you'll see the flip").
- 4 new tests cover: QUEUED → completed via refresh; QUEUED →
failed via refresh; refresh keeps QUEUED when platform hasn't
resolved; refresh swallows network errors safely.
(2) Canvas side, AgentCommsPanel.tsx WS push handler:
- Listens for DELEGATION_SENT / DELEGATION_STATUS / DELEGATION_COMPLETE
/ DELEGATION_FAILED in addition to ACTIVITY_LOGGED.
- Each event's payload synthesized into an ActivityEntry shape
so toCommMessage's existing delegation branch maps it. Status
derived: STATUS uses payload.status, COMPLETE → "completed",
FAILED → "failed", SENT → "pending".
- The ACTIVITY_LOGGED branch keeps the "delegation" type accepted
as a no-op-today / future-proof path: if delegation handlers
are ever refactored to call LogActivity, this lights up
automatically without another canvas change.
Doesn't change: the docstring guidance ("queued → wait, don't bypass")
is now actually load-bearing because the refresh path will deliver
the eventual outcome. Without the refresh, the guidance was a trap.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two bugs that compounded into the "Director does the work itself" UX:
1. workspace/builtin_tools/delegation.py: _execute_delegation only
handled HTTP 200 in the response branch. When the peer's a2a-proxy
returned HTTP 202 + {queued: true} (single-SDK-session bottleneck
on the peer), the loop fell through. Two iterations later the
`if "error" in result` check tried to access an unbound `result`,
the async task ended quietly, and the delegation stayed at FAILED
with error="None". The LLM checking status saw "failed" + the
platform's "Delegation queued — target at capacity" log line in
chat context, concluded the peer was permanently unavailable, and
bypassed delegation to do the work itself.
Fix: explicit 202+queued branch. Adds DelegationStatus.QUEUED,
marks the local delegation as QUEUED, mirrors to the platform,
and returns cleanly without retrying. The retry loop is for
transient transport errors — queueing is a real ack, not a failure
to retry against (retrying would just re-queue the same task).
check_delegation_status docstring extended with explicit per-status
guidance: pending/in_progress → wait, queued → wait (peer busy on
prior task, reply WILL arrive), completed → use result, failed →
real error in error field; only fall back on failed, never queued.
2. canvas/src/components/tabs/chat/AgentCommsPanel.tsx: filter dropped
every delegation row because it whitelisted only a2a_send /
a2a_receive. activity_type='delegation' rows (written by the
platform's /delegate handler with method='delegate' or
'delegate_result') never reached toCommMessage. User saw "No
agent-to-agent communications yet" while 6+ delegations existed
in the DB.
Fix: include "delegation" in both the initial filter and the
WS push filter, plus a delegation branch in toCommMessage that
maps the row as outbound (always — platform proxies on our behalf)
and uses summary as the primary text source (sketch below).
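A minimal sketch of that mapper branch, shown as a standalone function
rather than the branch inside toCommMessage, with the row and message
shapes reduced to the fields this commit mentions (the real types are
richer):

```ts
// Simplified shapes; the real ActivityRow / comm-message types carry more fields.
interface ActivityRow {
  activity_type: string;       // "a2a_send" | "a2a_receive" | "delegation" | ...
  target_id?: string | null;
  summary?: string | null;
  status?: string | null;
}
interface CommMessage {
  direction: "outbound" | "inbound";
  peerId: string;
  text: string;
  status?: string;
}

function mapDelegationRow(row: ActivityRow): CommMessage | null {
  if (row.activity_type !== "delegation") return null;
  if (!row.target_id) return null;           // no target: don't render a ghost row
  return {
    direction: "outbound",                   // platform proxies delegations on our behalf
    peerId: row.target_id,
    text: row.summary ?? "",                 // summary is the primary text source
    status: row.status ?? undefined,
  };
}
```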
Tests:
- 3 new Python tests cover the 202+queued path: status becomes
QUEUED not FAILED; no retry on queued (counted by URL match
against the A2A target since the mock is shared across all
AsyncClient calls); bare 202 without {queued:true} still
falls through to the existing retry-then-FAILED path.
- 3 new TS tests cover the delegation mapper: 'delegate' row
maps as outbound to target with summary text; queued
'delegate_result' preserves status='queued' (load-bearing for
the LLM's wait-vs-bypass decision); missing target_id returns
null instead of rendering a ghost.
Does NOT solve: the underlying single-SDK-session bottleneck that
causes peers to queue in the first place. Tracked as task #102
(parallel SDK sessions per workspace) — real architectural work.
This PR makes the runtime handle the queueing correctly so the LLM
doesn't bail out, and makes the delegations visible in Agent Comms
so operators can see what's happening.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
[Molecule-Platform-Evolvement-Manager]
Closes the first item from #2071 (Canvas test gaps follow-up):
adds behavioural coverage for the shared template-deploy hook that
both TemplatePalette (sidebar) and EmptyState (welcome grid) drive.
10 cases across 4 buckets:
**Happy path (4):**
- preflight ok → POST /workspaces → onDeployed fires with new id
- caller-supplied canvasCoords flows into the POST body
- default coords fall in [100,500) × [100,400) when canvasCoords omitted
- template.runtime is preferred over the resolveRuntime fallback
(locks the deduped-fallback table contract added in #2061)
**Preflight failures (2):**
- network throw sets error AND clears `deploying` (regression test
for the "stranded button" bug called out in the SUT's inline
comment — drop the try block and you'll fail this test)
- not-ok-with-missing-keys opens the modal without firing POST
**Modal lifecycle (2):**
- 'keys added' click retries POST without re-running preflight
(verifies the executeDeploy / deploy split — preflight call count
stays at 1, POST count goes to 1)
- 'cancel' click closes modal without firing POST
**POST failures (2):**
- Error rejection surfaces the message
- non-Error rejection surfaces the "Deploy failed" fallback
Mocks `@/lib/api`, `@/lib/deploy-preflight`, and `@/components/MissingKeysModal`
(stand-in component exposes the two callbacks as test-id buttons —
the real radix modal is irrelevant to this hook's behavior). Test
file follows the `vi.hoisted` + import-after-mocks pattern from
`canvas/src/app/__tests__/orgs-page.test.tsx`.
## Test plan
- [x] All 10 cases pass locally (`vitest run useTemplateDeploy.test.tsx`)
- [x] No changes to the SUT — pure additive coverage
- [ ] CI green
Follow-ups for the rest of #2071 (separate PRs):
- A2AEdge rendering + click-to-select-source
- OrgCancelButton cancel flow + optimistic state
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
[Molecule-Platform-Evolvement-Manager]
## What was breaking
Two distinct failure modes in `.github/workflows/secret-scan.yml`,
both visible after PR #2115 / #2117 hit the merge queue:
1. **`merge_group` events**: the script reads `github.event.before /
after` to determine BASE/HEAD. Those properties only exist on
`push` events. On `merge_group` events both came back empty, the
script fell through to "no BASE → scan entire tree" mode, and
false-positived on `canvas/src/lib/validation/__tests__/secret-formats.test.ts`
which contains a `ghp_xxxx…` literal as a masking-function fixture.
(Run 24966890424 — exit 1, "matched: ghp_[A-Za-z0-9]{36,}".)
2. **`push` events with shallow clone**: `fetch-depth: 2` doesn't
always cover BASE across true merge commits. When BASE is in the
payload but absent from the local object DB, `git diff` errors
out with `fatal: bad object <sha>` and the job exits 128.
(Run 24966796278 — push at 20:53Z merging #2115.)
## Fixes
- Add a dedicated fetch step for `merge_group.base_sha` (mirrors
the existing pull_request base fetch) so the diff base is in the
object DB before `git diff` runs.
- Move event-specific SHAs into a step `env:` block so the script
uses a clean `case` over `${{ github.event_name }}` instead of
a single `if pull_request / else push` that left merge_group on
the empty branch.
- Add an on-demand fetch for the push-event BASE when it isn't in
the shallow clone, plus a `git cat-file -e` guard before the
diff so we fall through cleanly to the "scan entire tree" path
if the fetch fails (correct, just slower) instead of exiting 128.
## Defense-in-depth
`secret-formats.test.ts` had two literal continuous-string fixtures
(`'ghp_xxxx…'`, `'github_pat_xxxx…'`). The ghp_ one matched the
secret-scan regex. Switched both to the `'prefix_' + 'x'.repeat(N)`
pattern already used elsewhere in the same file — runtime value is
the same, but the literal source text no longer matches the regex
even if the BASE detection ever falls back to tree-scan mode again.
## Test plan
- [x] No remaining regex matches in the secret-formats.test.ts source
- [x] YAML structure preserved
- [ ] CI passes on this PR's pull_request scan (was already passing)
- [ ] CI passes on this PR's merge_group scan (the new path)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-up to #2110 (which generalised pruneStaleKeys to Map<string, T>).
Identified by the simplify reviewer on that PR as the only other
in-tree caller of the same shape: `for (const id of map.keys()) { if
(!liveIds.has(id)) map.delete(id); }`.
Net: -3 lines, one less hand-rolled GC loop. No behaviour change —
the helper does exactly what the inline block did.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Simplify pass on top of #2069 fix:
- Export FALLBACK_POLL_MS from canvas/src/store/socket.ts and import
it as TOMBSTONE_TTL_MS in deleteTombstones.ts. Single source of
truth — tuning one without the other would silently re-open the
hydrate-races-delete window. Required-fix per simplify reviewer.
- Compress deleteTombstones.ts docstring from 30 lines to 10 — keep
the "what + why module-level"; drop the long-form problem
description (issue #2069 carries it).
- Compress canvas.ts call-site comments at removeSubtree (4 lines →
2) and hydrate (2 lines → 2 but tighter).
- Don't reassign the workspaces parameter inside hydrate — use a
const `live` and thread it through the two downstream calls
(computeAutoLayout, buildNodesAndEdges). Same effect, no lint
smell.
- Trim the canvas.test.ts integration-test preamble.
No behaviour change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #2069. removeSubtree dropped a parent + descendants locally
after DELETE returned 200, but a GET /workspaces request that was
IN-FLIGHT before the DELETE completed could land AFTER and hydrate
the store with a stale snapshot — re-introducing the deleted nodes
on the canvas until the next 10s fallback poll corrected it.
New module canvas/src/store/deleteTombstones.ts holds a transient
process-lifetime Map<id, deletedAt>. removeSubtree calls
markDeleted(removedIds); hydrate calls wasRecentlyDeleted(id) to
filter the incoming workspaces. TTL is 10s — matches the WS-fallback
poll cadence so a single round-trip is covered, after which a
legitimately re-imported id flows through normally.
GC happens lazily at every read AND at write time so the map stays
bounded — no separate timer / interval / unmount plumbing.
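A minimal sketch of the module under those constraints. Only markDeleted,
wasRecentlyDeleted, and the 10s TTL are named in this commit; everything
else is illustrative:

```ts
// Process-lifetime map of recently deleted workspace ids.
const TOMBSTONE_TTL_MS = 10_000; // matches the WS fallback poll cadence

const tombstones = new Map<string, number>(); // id -> deletedAt (ms epoch)

function gc(now: number): void {
  for (const [id, deletedAt] of tombstones) {
    if (now - deletedAt > TOMBSTONE_TTL_MS) tombstones.delete(id);
  }
}

export function markDeleted(ids: Iterable<string>, now = Date.now()): void {
  gc(now); // GC at write time keeps the map bounded
  for (const id of ids) tombstones.set(id, now);
}

export function wasRecentlyDeleted(id: string, now = Date.now()): boolean {
  gc(now); // ...and at read time, so no timers or unmount plumbing needed
  return tombstones.has(id);
}
```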
Tests:
- canvas/src/store/__tests__/deleteTombstones.test.ts: 7 cases
covering immediate flag, never-marked, TTL boundary (9999ms vs
10001ms), GC-on-read, GC-on-write, re-mark resets timestamp,
iterable input.
- canvas/src/store/__tests__/canvas.test.ts: end-to-end "hydrate
cannot resurrect ids that removeSubtree just dropped (#2069)"
exercises the full chain at the store level.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Simplify pass on top of #2070 fix:
- Rename pruneStaleSubtreeIds → pruneStaleKeys, generalize to
Map<string, T> so the same shape can absorb other keyed-by-node-id
caches (ProvisioningTimeout.tsx tracking map is the obvious next
caller — left as a follow-up to keep this PR scoped).
- Trim the helper docstring to remove implementation-detail rot
(O(map_size), cadence claims). The ref-block comment carries the
rationale where it actually matters (at the call site).
- Add identity-preservation test: survivors must keep their original
Set reference. Guards against a future "rebuild instead of delete"
regression that would silently invalidate downstream === checks.
No behaviour change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #2070. The Map<rootId, Set<nodeId>> in useCanvasViewport.ts
accumulated entries indefinitely — adds on every successful auto-fit,
never deletes when a root left state.nodes (cascade delete or manual
remove). Operationally invisible until thousands of imports, but the
fix is cheap.
Adds pruneStaleSubtreeIds(map, liveNodeIds) — a pure helper exported
alongside the existing shouldFitGrowing helper, called at the top of
runFit before any read or write to the map. Bounds the map to "roots
present right now" instead of "every root ever auto-fitted in this
session." O(map_size) per fit; runs only at user-driven cadence.
Tests in __tests__/useCanvasViewport.test.ts cover the four cases:
delete-some / no-op / clear-all / never-add.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Simplify pass on top of the canary fix:
- Drop the three CP commit SHAs from comments — issue #2090 covers
the audit trail, SHAs would rot.
- Pull the inline `900` into TLS_TIMEOUT_SEC=$((15 * 60)) so the
bash mirrors the TS side (15 min) at a glance.
- TENANT_HOST extraction now strips http(s) AND any port suffix, so
getent doesn't silently fail on a ws://host:443 style URL.
- sed-redact Authorization/Cookie out of the curl -v dump, defensive
against future callers adding an auth header to this probe.
Pure cleanup; no behaviour change to the happy path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Canary #2090 has been red for 6 consecutive runs over 4+ hours, all
timing out at the TLS-readiness step exactly at the 10-min cap. Time
window correlates with three CP commits that landed today/yesterday
and changed EC2 boot behaviour:
- molecule-controlplane@a3eb8be — fix(ec2): force fresh clone of /opt/adapter
- molecule-controlplane@ed70405 — feat(sweep): wire up healthcheck loop
- molecule-controlplane@4ab339e — fix(provisioner): aggregate cleanup errors
Two changes here, both surgical:
1. Bump the bash-side TLS deadline from 600s to 900s, and the canvas TS
mirror from 10m to 15m. Stays below the 20-min provision envelope
(so a genuinely-stuck tenant still fails loud at the earlier
provision step instead of masquerading as TLS).
2. On TLS-timeout, dump a diagnostic burst before exiting:
- getent hosts $TENANT_HOST (DNS resolution state)
- curl -kv $TENANT_URL/health (TLS handshake + HTTP layer)
The previous failure log was just "no 2xx in N min" with no signal
for which layer was actually broken. After this, the next timeout
tells us whether DNS, TLS handshake, or HTTP layer is the culprit
so the CP root cause can be isolated without speculation.
This is the unblock; a separate molecule-controlplane issue tracks the
underlying regression suspicion.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three files conflicted with staging changes that landed while this PR
sat open. Resolved each by combining both intents (not picking one side):
- a2a_proxy.go: keep the branch's idle-timeout signature
(workspaceID parameter + comment) AND apply staging's #1483 SSRF
defense-in-depth check at the top of dispatchA2A. Type-assert
h.broadcaster (now an EventEmitter interface per staging) back to
*Broadcaster for applyIdleTimeout's SubscribeSSE call; falls through
to no-op when the assertion fails (test-mock case).
- a2a_proxy_test.go: keep both new test suites — branch's
TestApplyIdleTimeout_* (3 cases for the idle-timeout helper) AND
staging's TestDispatchA2A_RejectsUnsafeURL (#1483 regression). Updated
the staging test's dispatchA2A call to pass the workspaceID arg
introduced by the branch's signature change.
- workspace_crud.go: combine both Delete-cleanup intents:
* Branch's cleanupCtx detachment (WithoutCancel + 30s) so canvas
hang-up doesn't cancel mid-Docker-call (the container-leak fix)
* Branch's stopAndRemove helper that skips RemoveVolume when Stop
fails (orphan sweeper handles)
* Staging's #1843 stopErrs aggregation so Stop failures bubble up
as 500 to the client (the EC2 orphan-instance prevention)
Both concerns satisfied: cleanup runs to completion past canvas
hangup AND failed Stop calls surface to caller.
Build clean, all platform tests pass.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Closes a 4+ cycle Canvas tabs E2E flake pattern that's been blocking
staging→main PRs since 2026-04-24+ (#2096, #2094, #2055, #2079, ...).
Root cause: TLS_TIMEOUT_MS=180s (3 min) is too tight for the layered
realities of staging tenant TLS readiness:
1. Cloudflare DNS propagation through the edge (1-2 min typical)
2. Tenant CF Tunnel registering the new hostname (1-2 min)
3. CF edge ACME cert provisioning + cache (1-3 min)
Each layer can add 1-3 min on its own under heavy staging load — the
realistic worst case is well past the 3-min cap.
Provision and workspace-online timeouts were already raised to 20 min
(staging-setup.ts:42-46 history). The TLS gate was the remaining
under-budgeted step. Bumping to 10 min keeps it inside the 20-min
PROVISION envelope so a genuinely-stuck tenant still fails loud at
the earlier provision step rather than masquerading as a TLS issue.
Both call sites raised together:
- canvas/e2e/staging-setup.ts: TLS_TIMEOUT_MS = 10 * 60 * 1000
- tests/e2e/test_staging_full_saas.sh: TLS_DEADLINE += 600
Each carries an inline rationale comment so the next reviewer sees
the layer-by-layer decomposition without re-reading the issue thread.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the canvas-side loop on #2054. Phases 1+2 plumbed
provision_timeout_ms from template manifest → workspace API →
canvas socket → node-data → ProvisioningTimeout resolver. The
template-hermes manifest declares provision_timeout_seconds: 720
(filed as a separate template-repo PR). With that flow live, the
canvas-side hardcoded RUNTIME_PROFILES.hermes entry is redundant.
Removed:
- RUNTIME_PROFILES.hermes (was 720000ms hardcoded in canvas/src/lib/runtimeProfiles.ts)
Doc updates:
- RUNTIME_PROFILES jsdoc explains the map is now empty by design —
new runtimes that need a non-default cold-boot threshold should
declare runtime_config.provision_timeout_seconds in their template
manifest, NOT add an entry here.
Tests updated (3):
- "returns hermes override when runtime = hermes" → "hermes returns
default — value moved server-side post-#2054 phase 3". Asserts
RUNTIME_PROFILES.hermes is undefined.
- The two server-override tests now compare against
DEFAULT_RUNTIME_PROFILE since hermes no longer has a profile entry.
19/19 pass locally. The end-state for hermes:
workspace-server reads template manifest at request time →
workspace API includes provision_timeout_ms: 720000 →
canvas hydrate populates node.data.provisionTimeoutMs →
ProvisioningTimeout resolver picks it up via overrides.
Same effective threshold (720s), now declarative and one-edit-point
per runtime.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
simplify-review note: the pipe/comma-delimited serialized node string is
brittle if a future string-typed field is added without sanitization. Document
which fields are user-typed (name — already sanitized) vs primitive
(id is UUID, runtime is a slug, provisionTimeoutMs is numeric) so
the next field-add doesn't accidentally introduce an injection
vector for the splitter.
Skipped (false-positive review finding): the agent flagged the
prop > runtime-profile order as inconsistent with the docstring,
but the docstring explicitly lists the prop at #2 (between node and
runtime-profile) — matches both the implementation AND the original
behavior pre-#2054 (the prop was 'timeoutMs ?? runtime-profile').
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 1 of moving runtime UX knobs server-side. Builds the canvas
foundation: a workspace can carry its own provision_timeout_ms
(sourced server-side from a template manifest in a follow-up PR),
and ProvisioningTimeout's resolver respects it per-node.
Today the resolver had Props-level timeoutMs that applied to ALL
nodes — fine for tests but wrong for production where one batch
could mix runtimes (hermes 12-min cold boot alongside docker 2-min).
The runtime profile fallback already handles per-runtime defaults;
this PR adds the per-WORKSPACE override layer above that.
Resolution priority (most specific wins):
1. node.provisionTimeoutMs — server-declared per-workspace
override (this PR's new field)
2. timeoutMs prop — single-threshold test override
3. runtime profile in @/lib/runtimeProfiles
4. DEFAULT_RUNTIME_PROFILE
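A sketch of that resolution order. The wrapper name and plain-number
profiles here are simplifications; the real code splits this between the
prop handling in ProvisioningTimeout and provisionTimeoutForRuntime:

```ts
// Illustrative only; profiles are simplified to plain milliseconds here.
const RUNTIME_PROFILES: Record<string, number | undefined> = { /* per-runtime ms */ };
const DEFAULT_RUNTIME_PROFILE_MS = 2 * 60 * 1000; // assumed default

function resolveProvisionTimeout(
  nodeOverrideMs: number | undefined, // 1. node.provisionTimeoutMs (server-declared)
  propTimeoutMs: number | undefined,  // 2. timeoutMs prop (test override)
  runtime: string,
): number {
  return (
    nodeOverrideMs ??
    propTimeoutMs ??
    RUNTIME_PROFILES[runtime] ??      // 3. runtime profile
    DEFAULT_RUNTIME_PROFILE_MS        // 4. default
  );
}
```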
Changes:
- WorkspaceData (socket): add optional provision_timeout_ms
- WorkspaceNodeData: add optional provisionTimeoutMs
- canvas-topology hydrate: thread the field through to node.data
- ProvisioningTimeout: extend the serialized-string node iteration
to carry provisionTimeoutMs (4-field positional split); pass as
the second arg to provisionTimeoutForRuntime
- 3 new tests in ProvisioningTimeout.test.tsx covering hydrate
threading, null fall-through, and resolver priority
Phase 2 (separate PR, blocked on workspace-server template-config
loader): workspace-server reads provision_timeout_seconds from
template config.yaml at provision time, includes
provision_timeout_ms in the workspace API/socket response. Phase 3
(template-repo PR): template-hermes config.yaml declares
provision_timeout_seconds: 720; canvas RUNTIME_PROFILES.hermes
becomes redundant and can be removed.
19/19 tests pass (3 new + 16 existing).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User reported the canvas threw a generic "API GET /workspaces: 500
{auth check failed}" error when local Postgres + Redis were both
down. Two problems:
1. The error code (500) and message ("auth check failed") said
nothing useful. The actual condition was "platform can't reach
its datastore to validate your token" — a Service Unavailable
class, not Internal Server Error.
2. The canvas had no way to distinguish infra-down from a real
auth bug, so it rendered the raw API string in the same
generic-error overlay it uses for everything.
Fix in two layers:
Server (wsauth_middleware.go):
- New abortAuthLookupError helper centralises all three sites
that previously returned `500 {"error":"auth check failed"}`
when HasAnyLiveTokenGlobal or orgtoken.Validate hit a DB error.
- Now returns 503 + structured body
`{"error": "...", "code": "platform_unavailable"}`. 503 is
the correct semantic ("retry shortly, infra is unavailable")
and the code field is the contract the canvas reads.
- Body deliberately excludes the underlying DB error string —
production hostnames / connection-string fragments must not
leak into a user-visible error toast.
Canvas (api.ts):
- New PlatformUnavailableError class. api.ts inspects 503
responses for the platform_unavailable code and throws the
typed error instead of the generic "API GET /…: 503 …"
message. Generic 503s (upstream-busy, etc.) keep the legacy
path so existing busy-retry UX isn't disrupted.
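A minimal sketch of the classification step (fetch plumbing omitted; the
503 status and the code field are the contract described above, the rest
is illustrative):

```ts
export class PlatformUnavailableError extends Error {}

// Called for non-ok responses. Only 503 + the platform_unavailable code is
// promoted to the typed error; everything else keeps the legacy generic path.
async function classifyErrorResponse(res: Response, method: string, path: string): Promise<never> {
  if (res.status === 503) {
    try {
      const body = await res.clone().json();
      if (body?.code === "platform_unavailable") {
        throw new PlatformUnavailableError(body.error ?? "platform unavailable");
      }
    } catch (e) {
      if (e instanceof PlatformUnavailableError) throw e;
      // non-JSON body: fall through to the generic error below
    }
  }
  throw new Error(`API ${method} ${path}: ${res.status} ${await res.text()}`);
}
```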
Canvas (page.tsx):
- New PlatformDownDiagnostic component renders when the
initial hydration catches PlatformUnavailableError.
Surfaces the actual condition with operator-actionable
copy ("brew services start postgresql@14 / redis") +
pointer to the platform log + a Reload button.
Tests:
- Go: TestAdminAuth_DatastoreError_Returns503PlatformUnavailable
pins the response shape (status, code field, no DB-error leak)
- Canvas: 5 tests for PlatformUnavailableError classification —
typed throw on 503+code match, generic-Error fallback for
503-without-code (upstream busy), 500 stays generic, non-JSON
body falls back to generic.
1015 canvas tests + full Go middleware suite pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The actual cause-fix for the staging-tabs E2E saga (#2073/#2074/#2075).
Old behaviour: ANY 401 from any fetch on a SaaS tenant subdomain
called redirectToLogin → window.location.href = AuthKit. This is
wrong. Plenty of 401s don't mean "session is dead":
- workspace-scoped endpoints (/workspaces/:id/peers, /plugins)
require a workspace-scoped token, not the tenant admin bearer
- resource-permission mismatches (user has tenant access but not
this specific workspace)
- misconfigured proxies returning 401 spuriously
A single transient one of those yanked authenticated users back to
AuthKit. Same bug yanked the staging-tabs E2E off the tenant origin
mid-test for 6+ hours tonight, leading to the cascade of test-side
mocks (#2073/#2074/#2075) that worked around the symptom without
fixing the cause.
This PR fixes it at the source. The new logic:
- 401 on /cp/auth/* path → that IS the canonical session-dead
signal → redirect (unchanged)
- 401 on any other path with slug present → probe /cp/auth/me:
probe 401 → session genuinely dead → redirect
probe 200 → session fine, endpoint refused this token →
throw a real Error, caller renders error state
probe network err → assume session-fine (conservative) →
throw real Error
- slug empty (localhost / LAN / reserved subdomain) → throw
without redirect (unchanged)
The probe adds one extra fetch on a 401, only when slug is set
and the path isn't already auth-scoped. That's rare and
worthwhile — a transient probe round-trip is cheap; an unwanted
auth redirect is a UX disaster.
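A sketch of the decision tree. Apart from the /cp/auth/me path and the
existing redirectToLogin helper, the names here are stand-ins:

```ts
declare function redirectToLogin(): void; // existing helper in api.ts

// Decide what to do with a 401 on `path`; always throws so callers stop awaiting.
async function handle401(path: string, slug: string | null): Promise<never> {
  // 401 on the auth endpoints IS the canonical session-dead signal.
  if (path.startsWith("/cp/auth/")) {
    redirectToLogin();
    throw new Error("session expired");
  }
  // No tenant slug (localhost / LAN / reserved subdomain): never redirect.
  if (!slug) throw new Error(`401 on ${path}`);

  // Otherwise probe the session before deciding.
  try {
    const probe = await fetch("/cp/auth/me");
    if (probe.status === 401) {
      redirectToLogin(); // session genuinely dead
      throw new Error("session expired");
    }
  } catch {
    // network error on the probe: assume the session is fine (conservative)
  }
  // Session fine; this endpoint just refused this token.
  throw new Error(`401 on ${path} (session still valid)`);
}
```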
Tests:
- api-401.test.ts rewritten with the full matrix:
* /cp/auth/me 401 → redirect (no probe, that IS the signal)
* non-auth 401 + probe 401 → redirect
* non-auth 401 + probe 200 → throw, no redirect ← the fix
* non-auth 401 + probe network err → throw, no redirect
* empty slug paths (localhost/LAN/reserved) → throw, no probe
- 43 tests in canvas/src/lib/__tests__/api*.test.ts all pass
- tsc clean
The staging-tabs E2E spec's universal-401 route handler stays as
defense-in-depth (silences resource-load console noise + guards
against panels without try/catch), but the comment now describes
its role honestly: api.ts is the primary fix, the route is the
safety net.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After #2074, the staging-tabs spec stopped failing on the auth-redirect
locator timeout (good — the broadened 401-mock works) but started
failing on a different aggregate check:
Error: unexpected console errors:
Failed to load resource: the server responded with a status of 404
Failed to load resource: the server responded with a status of 404
Failed to load resource: the server responded with a status of 404
Browser console messages for resource-load failures omit the URL,
so the message is uninformative on its own — we can't filter
selectively (e.g. "is this a missing-CSS noise or a real broken
endpoint?"). The previous filter list (sentry/vercel/WebSocket/
favicon/molecule-icon) catches specific known-noisy strings but
this generic "Failed to load resource" doesn't contain any of them.
Two changes:
1. Add page.on('requestfailed') logging plus a page.on('response')
   hook for status >= 400 responses, capturing the URL of any failed
   request. Logs to test stdout
(visible in the workflow log) — leaves a breadcrumb so a real
bug isn't completely hidden when we filter the generic message.
2. Add "Failed to load resource" to the filter list. With (1) in
place we still see the URLs for diagnosis; the generic console
message is just noise.
Real JS exceptions (panel crash, undefined access, etc.) come with
a file path and stack trace and aren't matched by either filter,
so the gate still catches actual bugs.
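A sketch of both changes in Playwright terms; the filter variable name is
an assumption, the strings are the ones named above:

```ts
import { test } from "@playwright/test";

test("tabs", async ({ page }) => {
  // (1) breadcrumbs: log the URL of any failed request / error response
  page.on("requestfailed", (req) =>
    console.log(`[requestfailed] ${req.url()} :: ${req.failure()?.errorText}`));
  page.on("response", (res) => {
    if (res.status() >= 400) console.log(`[response ${res.status()}] ${res.url()}`);
  });

  // (2) the generic message joins the known-noise console filter
  const consoleErrorFilters = [
    "sentry", "vercel", "WebSocket", "favicon", "molecule-icon",
    "Failed to load resource", // URL-less in the browser console; URLs logged above
  ];
  void consoleErrorFilters;
  // ... rest of the spec unchanged
});
```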
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#2073 caught workspace-scoped 401s but missed non-workspace paths.
SkillsTab.tsx alone fetches /plugins and /plugins/sources, both
outside the /workspaces/<id>/* tree. Either of those 401s with the
tenant admin bearer in SaaS mode → canvas/src/lib/api.ts:62-74
redirects to AuthKit → page navigates away mid-test → next locator
times out.
Same failure signature observed at 16:03Z post-#2073 merge:
e2e/staging-tabs.spec.ts:45:7 › tab: skills
TimeoutError: locator.scrollIntoViewIfNeeded: Timeout 5000ms
- navigated to "https://scenic-pumpkin-83.authkit.app/?..."
Broaden the route to "**" with `request.resourceType() !== "fetch"`
short-circuit (preserves HTML/JS/CSS pass-through) and a
/cp/auth/me skip (the dedicated mock above wins). Same 401 →
empty-body conversion logic; just a wider net.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bundle review of pieces 1/2/3 surfaced two critical issues plus a
handful of required + optional fixes. All addressed.
Critical:
1. Migration 043 was missing 'paused' and 'hibernated' from the
workspace_status enum. Both are real production statuses written
by workspace_restart.go (lines 283 and 406), introduced by
migration 029_workspace_hibernation. The original `USING
status::workspace_status` cast would have errored mid-transaction
on any production DB containing those values. Added both. Also
added `SET LOCAL lock_timeout = '5s'` so the migration aborts
instead of stalling the workspace fleet behind a slow SELECT.
2. The chat activity-feed window kept only 8 lines, and a single
multi-tool turn (Read 5 files + Grep + Bash + Edit + delegate)
easily flushed older context before the user could read it.
Extracted appendActivityLine to chat/activityLog.ts with a
20-line window AND consecutive-duplicate collapse (same tool
on the same target twice in a row is noise, not new progress).
5 unit tests pin the behavior.
Required:
3. The SDK wedge flag was sticky-only — a single transient
Control-request-timeout from a flaky network blip locked the
workspace into degraded for the whole process lifetime, even
when the next query() would have succeeded. Added
_clear_sdk_wedge_on_success(), called from _run_query's success
path. The next heartbeat after a working query reports
runtime_state empty and the platform recovers the workspace to
online without a manual restart. New regression test.
4. _report_tool_use now sets target_id = WORKSPACE_ID for self-
actions, matching the convention other self-logged activity
rows use. DB consumers joining on target_id see a well-defined
value instead of NULL.
Optional taken:
5. Tightened _WEDGE_ERROR_PATTERNS from "control request timeout"
to "control request timeout: initialize" — suffix-anchored so a
future SDK error on an in-flight tool-call control message
doesn't get misclassified as the unrecoverable post-init wedge.
6. Dropped the redundant "context canceled" substring fallback in
isUpstreamBusyError. errors.Is(err, context.Canceled) is the
typed check; the substring would also match healthy client-side
aborts, which we don't want classified as upstream-busy.
Verified: 1010 canvas tests + 64 Python tests + full Go suite pass;
migration applies cleanly on dev DB with all 8 enum values; reverse
migration restores TEXT.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two halves of the same UX win — the user wants to see what Claude is
doing while a chat reply is in flight instead of staring at "0s" for
minutes.
Workspace side (claude_sdk_executor.py):
- The executor's _run_query message loop already iterated the SDK
stream for AssistantMessage.TextBlock content. Now also detects
ToolUseBlock / ServerToolUseBlock entries (by class name, since
the conftest stub doesn't define them) and fires-and-forgets a
POST /workspaces/:id/activity row of type agent_log per tool use.
- _summarize_tool_use maps the common tools (Read, Write, Edit,
Bash, Glob, Grep, WebFetch, WebSearch, Task, TodoWrite) to a
one-line summary with the file path / pattern / command, falling
back to "🛠 <tool>(…)" for anything else. Truncated at 200 chars.
- Posts directly to /workspaces/:id/activity rather than going
through a2a_tools.report_activity, which would also push a
/registry/heartbeat current_task and double-log as a TASK_UPDATED
line in the same chat feed.
- All failures swallowed silently — telemetry must not break
the conversation.
Canvas side (ChatTab.tsx):
- The existing ACTIVITY_LOGGED handler streams a2a_send /
a2a_receive / task_update events into a sliding-window
activityLog state. Two issues fixed:
1. No `msg.workspace_id === workspaceId` filter — a sibling
workspace's a2a_send was leaking into the wrong chat
panel as "→ Delegating to X...". Added an early return.
2. No agent_log render branch. Added one that renders the
summary verbatim (the workspace already prefixed its
own emoji icon, so no double-icon).
- Existing 8-line sliding window keeps the UI scoped; older
progress lines naturally roll off as new ones arrive.
Result: when DD is delegating to Visual Designer + reading
config files + running Bash to lint, the spinner area shows:
📄 Read /configs/system-prompt.md
⚡ Bash: pnpm test
→ Delegating to Visual Designer...
← Visual Designer responded (47s)
instead of bare "0s · Processing with Claude Code..." for minutes.
63 Python tests + 58 canvas chat tests pass; tsc clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The staging-tabs E2E has been failing for 6+ hours on the same
locator timeout — diagnosed earlier today as the canvas's
lib/api.ts:62-74 redirect-on-401 path firing mid-test:
e2e/staging-tabs.spec.ts:45:7 › tab: skills
TimeoutError: locator.scrollIntoViewIfNeeded: Timeout 5000ms
- navigated to "https://scenic-pumpkin-83.authkit.app/?..."
Several side-panel tabs (Peers, Skills, Channels, Memory, Audit,
and anything workspace-scoped) hit endpoints under
`/workspaces/<id>/*` that require a workspace-scoped token, NOT
the tenant admin bearer the test uses. The endpoints respond 401
in SaaS mode. canvas/src/lib/api.ts:62-74 reacts to ANY 401 by
setting `window.location.href` to AuthKit — yanking the page off
the tenant origin mid-test.
The test comment at line 18 already acknowledged the 401 class
("Peers tab: 401 without workspace-scoped token") but assumed
those would surface as "errored content" rather than a hard
navigation. The redirect logic in api.ts was added later and
breaks the assumption.
Fix: add a Playwright route handler that catches any 401 from
`/workspaces/<id>/*` paths and replaces with `200 + empty body`.
Body shape is best-effort by URL — list endpoints (paths not
ending in a UUID-shaped segment) get `[]`, single-resource
endpoints get `{}`. Both are valid JSON and well-written panels
render an empty state for either rather than crashing.
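A sketch of that route handler; the matcher string and the UUID
heuristic are written out here as assumptions:

```ts
import { test } from "@playwright/test";

const UUID_RE = /[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

test.beforeEach(async ({ page }) => {
  await page.route("**/workspaces/**", async (route) => {
    const res = await route.fetch(); // let the real request through first
    if (res.status() !== 401) return route.fulfill({ response: res });
    // 401 -> 200 + empty body; list endpoints get [], single resources get {}
    const body = UUID_RE.test(new URL(route.request().url()).pathname) ? "{}" : "[]";
    await route.fulfill({ status: 200, contentType: "application/json", body });
  });
});
```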
The two route patterns (`/workspaces/...` and `/cp/auth/me`)
don't overlap — the existing `/cp/auth/me` mock continues to
gate AuthGate's session check independently.
Verification:
- Type-check passes (tsc clean for the spec; pre-existing errors
in unrelated test files unchanged)
- Can't run staging E2E locally without CP admin token; CI will
exercise the real path against the freshly-provisioned tenant
- E2E Staging SaaS (full lifecycle) is currently green at 08:07Z,
confirming the underlying staging infra works — the failures
have been narrowly in this Playwright-tabs spec
Targets staging per molecule-core convention.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three required fixes from the bundle review of 391e1872:
1. workspace/a2a_client.py: substring `type_name in msg` could miss
the diagnostic prefix when an exception's message embedded a
different class name mid-string (e.g. `OSError("see ConnectionError
below")` → printed as plain msg, type lost). Switched to a
prefix-anchored check (`msg.startswith(f"{type_name}:")` etc.) so
the type label is always added when not already at the start of
the message.
2. workspace/a2a_tools.py: `activity_logs.error_detail` is unbounded
TEXT on the platform (handlers/activity.go does not validate
length). A buggy or hostile peer could stream arbitrarily large
error messages into the caller's activity log. Cap at 4096 chars
at the producer — comfortably above any real exception traceback,
well below an obvious-DoS threshold.
3. New regression test for JSON-RPC `code=0` — pins the
`code is not None` semantics so the code is preserved in the
detail rather than collapsing into the no-code path. Code=0 is
not valid per the spec, but a malformed peer can still emit it
and we want it visible for diagnosis.
Plus one optional taken: extracted the A2A-error → hint mapping into
canvas/src/components/tabs/chat/a2aErrorHint.ts. The two prior copies
(AgentCommsPanel.inferCauseHint + ActivityTab.inferA2AErrorHint) had
already drifted — Activity tab gained `not found`/`offline` cases the
chat panel never picked up, AgentCommsPanel handled empty-input
explicitly while Activity didn't. The shared module is the merged
superset, with 10 unit tests pinning each named pattern + the
"most specific first" ordering (Claude SDK wedge wins over generic
timeout).
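A sketch of the shared mapper's shape. The pattern-to-hint wording is
paraphrased from the symptoms listed in the earlier commit, not copied
from the module, and the exported name is assumed to follow one of the
two prior copies:

```ts
// Ordered most-specific-first so e.g. the SDK wedge wins over a generic timeout.
const HINTS: Array<[RegExp, string]> = [
  [/control request timeout: initialize/i, "Claude SDK init wedge: restart the workspace"],
  [/not found/i, "Target workspace not found"],
  [/offline/i, "Target workspace is offline"],
  [/timeout/i, "Peer is busy or stuck"],
  [/connection reset/i, "Transient network blip: retry, then check logs"],
];

export function inferA2AErrorHint(detail: string): string | null {
  if (!detail.trim()) return null; // empty input handled explicitly
  for (const [pattern, hint] of HINTS) {
    if (pattern.test(detail)) return hint;
  }
  return null;
}
```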
Skipped (per analysis):
- Unicode-naive 120-char slice — Python str[:N] slices on code
points, not bytes. Safe.
- Nested [A2A_ERROR] confusion — non-issue per reviewer; outer
prefix winning still produces a structured render.
- MessagePreview + JsonBlock dual render on errors — intentional
drilldown; raw JSON is below the fold for operators who need it.
- console.warn dedup — refetches don't happen per-event so spam
risk is low.
- str(data)[:200] materialization — A2A response bodies aren't
typically MB-sized.
Verified: 1005 canvas tests pass (10 new hint tests); 10 Python
send_a2a_message tests pass (1 new for code=0); tsc clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Symptom: Activity tab and Agent Comms surfaced bare "[A2A_ERROR] "
(prefix + nothing) for failed delegations. Operator had no signal
to act on — no exception type, no target, no hint about what went
wrong, no next step. Fix is in three layers.
1. workspace/a2a_client.py — every error path now produces an
actionable detail string:
- except branch: some httpx exceptions (RemoteProtocolError,
ConnectionReset variants) stringify to "". Pre-fix the catch
was `f"{_A2A_ERROR_PREFIX}{e}"` → bare prefix. Now falls back
to `<TypeName> (no message — likely connection reset or silent
timeout)` and always appends `[target=<url>]` for traceability
in chained delegations.
- JSON-RPC error branch: previously dropped error.code on the
floor and printed "unknown" when message was missing. Now
surfaces both, including the well-defined "JSON-RPC error
with no message (code=N)" path.
- "neither result nor error" branch: pre-fix returned
str(payload) which the canvas rendered as a successful
response block. Now tagged as A2A_ERROR with a payload
snippet so downstream UI routes through the error path.
2. workspace/a2a_tools.py — tool_delegate_task now passes
error_detail (the stripped error message) through to the
activity-log POST. The platform's activity_logs.error_detail
column is the canvas's red error chip source; populating it
makes the failure visible in the row header without the user
having to expand into raw response_body JSON. The summary line
also gets a 120-char prefix of the cause so the collapsed row
reads "React Engineer failed: ConnectionResetError: ... [target=...]"
instead of "React Engineer failed".
3. canvas/src/components/tabs/ActivityTab.tsx — MessagePreview
now detects [A2A_ERROR]-prefixed bodies and renders a
structured error block (red chip, stripped detail, cause hint)
instead of the previous gray text-block that showed the literal
"[A2A_ERROR]" string. inferA2AErrorHint mirrors the patterns
from AgentCommsPanel.inferCauseHint so the same symptom reads
the same way in both surfaces (Claude SDK init wedge → restart
workspace; timeout → busy/stuck; connection-reset → transient
blip then check logs).
Tests: 9 send_a2a_message tests pass (including a new regression
test for the empty-stringifying-exception case that the user
reported); 995 canvas tests pass; tsc clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reported symptom: canvas edges show "1 call · just now" between two
agents, but the Agent Comms tab for the source workspace renders
"No agent-to-agent communications yet" — even though
GET /workspaces/<id>/activity?source=agent&limit=50 returns a2a_send
+ a2a_receive rows.
Confirmed via curl that the API does return the rows the panel
should map. The panel's load handler was the suspect, but it had:
.catch(() => setLoading(false))
which swallowed every failure path — network errors, JSON parse,
ANY throw inside the .then body — without leaving a single trace in
the console. The panel just sat on its empty state and gave the user
zero signal to act on. (And by extension, gave us nothing to debug
remotely either.)
Two changes:
1. Wrap the per-row `toCommMessage` call in a try/catch so one
malformed activity row (unexpected request_body shape, etc.)
doesn't throw out of the for-loop and skip the
setMessages(msgs) line. Previously the panel would silently
drop the entire batch when ANY row failed to parse.
2. Replace the bare `.catch(() => setLoading(false))` with a
logging variant. Now a future "panel stuck empty" report comes
with `AgentCommsPanel: load activity failed <err>` or
`AgentCommsPanel: failed to map activity row {...}` in the
console — diagnosable instead of opaque.
Behavior on the happy path is unchanged (5 existing tests still
pass; tsc clean). This is purely defensive: it makes the failure
path visible so the next stuck-empty report can be root-caused
instead of guessed at.
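A sketch of the hardened loader; names other than toCommMessage,
setMessages, and setLoading are stand-ins:

```ts
interface CommMessage { text: string } // simplified
declare function fetchActivityRows(): Promise<unknown[]>; // stand-in for the real GET
declare function toCommMessage(row: unknown): CommMessage | null;
declare function setMessages(msgs: CommMessage[]): void;
declare function setLoading(v: boolean): void;

async function loadActivity(): Promise<void> {
  try {
    const rows = await fetchActivityRows();
    const msgs: CommMessage[] = [];
    for (const row of rows) {
      try {
        const msg = toCommMessage(row);
        if (msg) msgs.push(msg);
      } catch (err) {
        // one malformed row no longer drops the whole batch
        console.warn("AgentCommsPanel: failed to map activity row", row, err);
      }
    }
    setMessages(msgs);
  } catch (err) {
    // the bare .catch(() => setLoading(false)) used to swallow this silently
    console.warn("AgentCommsPanel: load activity failed", err);
  } finally {
    setLoading(false);
  }
}
```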
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bundle-level review caught an implicit coupling in useCanvasViewport
between two distinct fit effects:
- settle fit: 1200ms one-shot when provisioning transitions to zero
(deploy just finished — settle on the whole org once)
- tracking fit: 500ms debounced per molecule:fit-deploying-org event
(track the org's bounds as children land during the deploy)
Both effects shared a single autoFitTimerRef, so each one's
clearTimeout call could silently cancel the other's pending fit.
Today's behavior happened to land in the right order out of luck —
the tracking handler fires per-arrival during the deploy, then the
settle effect arms after the last child completes. But nothing in
the code enforces that ordering; a future refactor that, say,
fires the settle effect from the same event sequence as the
tracking timer (mid-deploy status flicker) would silently drop the
settle fit because the tracking timer's clearTimeout ran last.
Splitting into settleFitTimerRef + trackingFitTimerRef makes the
two effects fully independent. Cleanup clears both. Tests still pass
(995/995); the refactor is mechanical.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three review-driven fixes plus regression coverage for the bugs
landed in 176b703d / deedb5ef:
1. clearTimeout the prior reload handle before scheduling a new one in
both installFromSource and handleUninstall. Two installs within the
PLUGIN_RELOAD_DELAY_MS window (15s) used to queue two
loadInstalled() calls; the unmount cleanup only cleared the latest
handle, and the second reconciliation could overwrite a still-
correct optimistic state with a stale snapshot mid-restart.
2. Drop `setInstalledLoaded(true)` from the optimistic block. That
flag's contract is "the initial GET has succeeded at least once" —
it gates the auto-expand-registry effect. A user installing a
custom-source plugin BEFORE the initial fetch returned would flip
the gate prematurely, the auto-expand would never fire, and a
followup loadInstalled racing with the optimistic write could
overwrite our entry with [] mid-restart.
3. Don't force `supported_on_runtime: true` on the optimistic record.
The "inert on this runtime" badge in the row renders on the value
`=== false`. Forcing true would hide the badge for 15s if the user
installed a plugin that doesn't actually support the workspace's
runtime; the real value lands at refetch. Leaving the field
undefined keeps the badge neutral until reconciliation arrives.
Plus a behavioral test (SkillsTab.install.test.tsx) that asserts:
- the install POST URL contains the workspaceId (not "undefined")
- the row's "Install" button is replaced by the green "Installed"
tag synchronously after POST resolves, without advancing any
timer — locks in the optimistic-update contract so a future
refactor can't silently regress it.
995 canvas tests pass (2 new); tsc clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After clicking Install, the button reverted from "Installing..." → "Install"
the moment the POST returned, then sat there for ~15s before the green
"Installed" tag appeared. The 15s gap is PLUGIN_RELOAD_DELAY_MS — we
delay the GET /workspaces/:id/plugins refetch to wait for the workspace
to restart (the listing handler returns [] while the container is
restarting because findRunningContainer comes up empty).
Uninstall already does optimistic local-state mutation (line 244 prior
to this commit) so the green tag → install button transition is
instant. Install was the inconsistent half — push the registry entry
into `installed` immediately after POST returns 200 and let the
delayed refetch reconcile.
The optimistic record uses the registry entry's metadata (name,
version, description, tags, runtimes, skills) and sets
supported_on_runtime=true. If reconciliation later disagrees (server
filter, install actually failed at the runtime layer), the refetch
overwrites the local record. Worst case is a brief 15s window where
we show "Installed" for a plugin that won't load — same window the
user previously experienced as "stuck on Install button" — but flipped
to the correct expected state.
Custom-source installs (github://, etc.) don't have a registry entry
to use, so they keep the old behavior of waiting for the refetch. Most
users install from the registry list in the UI.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SkillsTab read `data.id` from its props and used the value to build
two API URLs:
  POST /workspaces/${data.id}/plugins
  DELETE /workspaces/${data.id}/plugins/${pluginName}
But `data` is the React Flow node.data blob (WorkspaceNodeData) —
the workspace id lives on `node.id`, NOT on `node.data`. WorkspaceNodeData
extends `Record<string, unknown>`, which makes `data.id` type-check
silently as `unknown` instead of erroring. So every install/uninstall
hit `/workspaces/undefined/plugins`, the server's not-found path
returned 503 "workspace container not running" (misleading — the real
issue was the bogus URL), and the user got a confusing toast.
Every other tab in SidePanel takes `workspaceId={selectedNodeId}` as
an explicit prop. SkillsTab was the lone outlier, presumably because
"data has all the fields I need" is the obvious-looking shortcut that
TypeScript can't catch through the index-signature interface.
Fix: make `workspaceId` an explicit prop on SkillsTab, drop the
`data.id` reads, thread the prop from SidePanel like the other tabs.
Test fixture updated to pass it.
Verified: 993 canvas tests pass; tsc clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Independent code review surfaced two required documentation fixes and
one growth-correctness gap. All addressed here.
Auto-fit gate (useCanvasViewport):
The previous "subtree-grew-by-count" check missed the delete-then-add
case: subtree of 6 → delete one → 5 → a different child arrives → 6
again. A length-only comparison reads no growth and the fit is
skipped, leaving the new node off-screen. Switched to an id-set
membership snapshot so any brand-new id forces the fit even when the
count is unchanged.
The gate logic is now extracted as a pure exported function
`shouldFitGrowing(currentIds, prevIds, userPannedAt, lastAutoFitAt)`
so the regression-prone decision can be unit-tested in isolation
without standing up React Flow + DOM event refs. 8 cases cover:
first-fit, empty-prior, brand-new id, status-update with user pan,
no-pan-ever, pan-before-last-fit, delete-then-add same length, and
shrink-only with user pan.
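A sketch of the extracted gate under the semantics described above
(equal-timestamp tie-breaking is an assumption):

```ts
export function shouldFitGrowing(
  currentIds: Set<string>,
  prevIds: Set<string> | undefined, // ids present at the last fit (undefined: never fitted)
  userPannedAt: number | null,      // last unambiguous pan/zoom (wheel), ms epoch
  lastAutoFitAt: number | null,     // last auto-fit, ms epoch
): boolean {
  // First fit for this root: always fit.
  if (!prevIds || prevIds.size === 0) return true;

  // Any brand-new id forces the fit, even when the count is unchanged
  // (the delete-then-add case a length-only comparison misses).
  for (const id of currentIds) {
    if (!prevIds.has(id)) return true;
  }

  // No new ids (status updates only): respect a user pan newer than the last fit.
  if (userPannedAt === null) return true;
  if (lastAutoFitAt !== null && userPannedAt < lastAutoFitAt) return true;
  return false;
}
```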
Parser parity (dotenv.go + next.config.ts):
Existing-env semantics were undocumented in both parsers. Both now
explicitly note that an explicitly-set empty string (`KEY=` from the
parent shell) counts as "set" — the file value does NOT backfill —
matching the Go (os.LookupEnv) and Node (`process.env[k] !==
undefined`) primitives.
`export ` prefix uses a literal space; `export\tFOO=bar` is
intentionally rejected. Added the same comment in both parsers
to lock in this parity invariant since the commit message claims
"if one parser changes, the other has to."
Skipped (per analysis):
- Drag-pan respect for left-click drag-pan during deploy. The
growth-check safety net means any pan gets overridden on the
next arrival anyway, which is the desired behavior for the
"watch the org deploy" use case. After deploy completes, no
more fit-deploying-org events fire so drag-pan works freely.
- Map cleanup for lastFitSubtreeIdsRef. Per-tab session, UUID
keys, tiny entries — not worth the cleanup hook.
993 canvas tests pass (8 new); Go dotenv tests pass; tsc clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Symptom: org import zoomed to fit the parent + first child, then froze
at that framing while the remaining children kept materialising
off-screen. The user had to manually pan/zoom to see the new arrivals.
Two stacked bugs in useCanvasViewport's deploy-time auto-fit:
1. The user-pan-respect gate stamps userPannedAtRef on EVERY
pointerdown that lands inside .react-flow__pane. That fires for
ordinary clicks (deselect, click-near-a-card, modal-close-bubble
from the import dialog) — not just for actual pan gestures. One
accidental pre-import click was enough to lock out every fit for
the rest of the deploy. Wheel is the canonical unambiguous
pan/zoom signal; drop pointerdown.
2. Even with a real pan during deploy, when more children land the
org's bounds grow and the user has lost context — the new
arrivals are off-screen and the deploy is the primary thing they
want to watch right now. The guard had no growth awareness, so
one pan cancelled all follow-up fits unconditionally. Now we
track the subtree size at the last fit (per root), and if the
current subtree is larger we force the fit through regardless of
the user-pan timestamp. When the subtree size hasn't changed
(status updates on already-positioned nodes), the user-pan
respect still applies — so post-deploy exploration isn't
yanked back.
The Map keyed by root id supports back-to-back imports of different
orgs without one's growth count blocking the other's first fit.
985 canvas tests pass; tsc clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Symptom: spawn animation missing on org import. Workspaces appeared in
their final positions all at once instead of materialising one-by-one.
Root cause: the WS pill said "Reconnecting" forever because the canvas
was trying to connect to ws://localhost:3000/ws — its own port, where
Next.js dev doesn't serve a WebSocket — instead of the platform's
ws://localhost:8080/ws.
Why: deriveWsBaseUrl() falls back to window.location when
NEXT_PUBLIC_WS_URL is unset. Next.js auto-loads .env from the project
root only — and the canonical NEXT_PUBLIC_WS_URL /
NEXT_PUBLIC_PLATFORM_URL live in the monorepo root .env, alongside the
Go platform's MOLECULE_ENV / DATABASE_URL. Without an extra
canvas/.env.local copy (which would still be a per-developer manual
step), the canvas dev server starts blind to those vars.
Fix: next.config.ts now walks upward from __dirname looking for the
monorepo root (same workspace-server/go.mod sentinel the platform's
dotenv loader uses) and merges the root .env into process.env BEFORE
Next.js compiles. Existing env wins over file values, so docker
runs / CI / explicit exports still dominate.
The parser is a TypeScript mirror of workspace-server/cmd/server/
dotenv.go's parseDotEnvLine — same rules (export prefix, quotes,
inline comments, BOM) so a single .env line behaves identically across
both processes. If one parser changes, the other has to.
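A sketch of the loader half, assuming the parser mirror described above
is available as a separate function; the go.mod sentinel and the
existing-env-wins rule are from this description:

```ts
import * as fs from "node:fs";
import * as path from "node:path";

declare function parseDotEnv(src: string): Array<[string, string]>; // TS mirror of parseDotEnvLine

// Walk upward from this file until we hit the monorepo root, identified by
// the same sentinel the Go dotenv loader uses: workspace-server/go.mod.
function findMonorepoEnv(startDir: string): string | null {
  let dir = startDir;
  while (true) {
    if (fs.existsSync(path.join(dir, "workspace-server", "go.mod"))) {
      const envPath = path.join(dir, ".env");
      return fs.existsSync(envPath) ? envPath : null;
    }
    const parent = path.dirname(dir);
    if (parent === dir) return null; // reached filesystem root: give up
    dir = parent;
  }
}

// Merge file values into process.env before Next.js compiles.
// Existing env always wins, so docker/CI/explicit exports dominate.
function loadRootEnv(): void {
  const envPath = findMonorepoEnv(__dirname);
  if (!envPath) return;
  for (const [key, value] of parseDotEnv(fs.readFileSync(envPath, "utf8"))) {
    if (process.env[key] === undefined) process.env[key] = value;
  }
}
```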
Production unaffected: `output: "standalone"` bakes resolved env into
the build, the workspace-server sentinel isn't shipped in deploy
artifacts, and the existing-env-wins rule means container env
dominates anywhere this file is consulted at runtime.
Verified: canvas dev startup log now shows
"[next.config] loaded 49 vars from /Users/.../molecule-core/.env";
served bundle has the correct ws://localhost:8080/ws URL; WS pill
flips to "Connected" after a hard refresh and per-workspace spawn
animations fire on the next org import as expected.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Independent code review surfaced three required fixes and one cheap
optional one. All addressed here.
dotenv parser:
- `export FOO=bar` was parsed as key `"export FOO"` (with embedded
space) and silently os.Setenv'd, so a developer pasting from a
direnv `.envrc` would get junk vars. Now strips the prefix.
- Quoted values weren't unwrapped: `FOO="hello world"` produced value
`"hello world"` with literal quotes. Now strips one matched pair of
surrounding `"` or `'`. Inside a quoted value `#` is part of the
value, not a comment marker (matches godotenv convention).
- UTF-8 BOM at file start (Windows editors) would have produced a
first key like U+FEFF + "FOO". Now stripped via TrimPrefix.
dotenv loader:
- findDotEnv()'s upward walk would happily pick up `~/.env` or a
sibling-repo `.env` if the binary was run from `~/Documents/other-
project/`. Real foot-gun on shared dev boxes. Now gated on a
monorepo sentinel: the candidate directory must contain
`workspace-server/go.mod`. Falls through to "no .env found" (=
pre-fix behavior) when the sentinel is absent.
socket fallback poll:
- startFallbackPoll() previously fired only on onclose, so the very
first connect attempt — when onclose hasn't fired yet because we
never had a successful onopen — left the canvas with no HTTP poll
for the duration of the failing handshake (Chrome can hold a
SYN-SENT WebSocket open ~75s before giving up). Now also called at
the top of connect(); the timer-already-running guard makes it a
no-op when one cycle later onclose calls it again.
Test coverage added: export prefix, single+double quoted values, hash
inside quotes preserved, unterminated quote falls back to bare value,
CRLF stripping locked in, BOM stripping, and a sentinel-rejection
regression test that creates a temp .env with no workspace-server
sibling and asserts findDotEnv refuses to load it.
Verified: 985 canvas tests + 30 dotenv subtests + 4 dotenv integration
tests all pass; tsc clean; rebuilt platform from monorepo root with
stripped env still loads .env (49 vars) and /workspaces returns 200.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Deleting a parent on a wedged WS used to leave the child cards on
the canvas as orphaned roots until the user manually refreshed.
Why: Canvas.tsx and DetailsTab.tsx both called `removeNode(parentId)`
after `DELETE /workspaces/:id?confirm=true` returned 200. `removeNode`
deliberately re-parents children rather than cascading — it relies on
the per-descendant WORKSPACE_REMOVED WS events the platform emits as
part of the cascade to drop each child individually. When the WS is
unhealthy those events never arrive, so the local store keeps the
children alive (now re-parented to root since their actual parent is
gone).
Fix: new `removeSubtree(rootId)` action on the canvas store mirrors
the server-side cascade — drops the root + every descendant + every
incident edge in one atomic set(). Both delete call sites now use it.
The WS events still arrive when WS is healthy and become idempotent
no-ops because the nodes are already gone.
Why a new action instead of changing removeNode: removeNode's
re-parenting behavior is correct for non-cascading flows (drag-out,
manual node detach in the future). Adding a sibling action keeps
both call shapes available rather than forcing every caller to opt
out of cascade.
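A minimal sketch of the action's shape, with the store reduced to plain
arrays and parentId links; the real store applies this in a single set()
and carries more state:

```ts
interface CanvasNode { id: string; parentId?: string }
interface CanvasEdge { id: string; source: string; target: string }
interface CanvasState { nodes: CanvasNode[]; edges: CanvasEdge[]; selectedNodeId: string | null }

// Collect rootId plus every descendant, then drop the nodes, their incident
// edges, and any selection inside the subtree in one update.
function removeSubtree(state: CanvasState, rootId: string): CanvasState {
  const doomed = new Set<string>([rootId]);
  let grew = true;
  while (grew) { // expand over parentId links until the set stops growing
    grew = false;
    for (const n of state.nodes) {
      if (n.parentId && doomed.has(n.parentId) && !doomed.has(n.id)) {
        doomed.add(n.id);
        grew = true;
      }
    }
  }
  return {
    nodes: state.nodes.filter((n) => !doomed.has(n.id)),
    edges: state.edges.filter((e) => !doomed.has(e.source) && !doomed.has(e.target)),
    selectedNodeId:
      state.selectedNodeId && doomed.has(state.selectedNodeId) ? null : state.selectedNodeId,
  };
}
```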
6 new unit tests cover root cascade, mid-level cascade, leaf
no-op-cascade, selection clearing across the subtree, selection
preservation outside the subtree, and edge cleanup.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>