Closes the gap where the Director would say "ZIP is ready at /tmp/foo.zip"
in plain text instead of attaching a download chip — the runtime literally
had no API for outbound file attachments. The canvas + platform's
chat-uploads infrastructure already supported the inbound (user → agent)
direction (commit 94d9331c); this PR wires the outbound side.
End-to-end shape:
agent: send_message_to_user("Done!", attachments=["/tmp/build.zip"])
↓ runtime
POST /workspaces/<self>/chat/uploads (multipart)
↓ platform
/workspace/.molecule/chat-uploads/<uuid>-build.zip
→ returns {uri: workspace:/...build.zip, name, mimeType, size}
↓ runtime
POST /workspaces/<self>/notify
{message: "Done!", attachments: [{uri, name, mimeType, size}]}
↓ platform
Broadcasts AGENT_MESSAGE with attachments + persists to activity_logs
with response_body = {result: "Done!", parts: [{kind:file, file:{...}}]}
↓ canvas
WS push: canvas-events.ts adds attachments to agentMessages queue
Reload: ChatTab.loadMessagesFromDB → extractFilesFromTask sees parts[]
Either path → ChatTab renders download chip via existing path
Files changed:
workspace-server/internal/handlers/activity.go
- NotifyAttachment struct {URI, Name, MimeType, Size}
- Notify body accepts attachments[], broadcasts in payload,
persists as response_body.parts[].kind="file"
canvas/src/store/canvas-events.ts
- AGENT_MESSAGE handler reads payload.attachments, type-validates
each entry, attaches to agentMessages queue
- Skips empty events (was: skipped only when content empty)
workspace/a2a_tools.py
- tool_send_message_to_user(message, attachments=[paths])
- New _upload_chat_files helper: opens each path, multipart POSTs
to /chat/uploads, returns the platform's metadata
- Fail-fast on missing file / upload error — never sends a notify
with a half-rendered attachment chip
workspace/a2a_mcp_server.py
- inputSchema declares attachments param so claude-code SDK
surfaces it to the model
- Defensive filter on the dispatch path (drops non-string entries
if the model sends a malformed payload)
Tests:
- 4 new Python: success path, missing file, upload 5xx, no-attach
backwards compat
- 1 new Go: Notify-with-attachments persists parts[] in
response_body so chat reload reconstructs the chip
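A minimal sketch of the fail-fast ordering described above, assuming the upload and notify calls are injected as plain callables. The function name, the error strings, and the `upload` / `notify` parameters are hypothetical stand-ins, not the actual code in workspace/a2a_tools.py:

```python
import os
from typing import Callable

# Assumed metadata shape: {"uri": ..., "name": ..., "mimeType": ..., "size": ...}
Metadata = dict


def send_message_with_attachments(
    message: str,
    paths: list[str],
    upload: Callable[[str], Metadata],
    notify: Callable[[str, list[Metadata]], None],
) -> str:
    """Upload every path first; call notify only if all uploads succeeded."""
    uploaded: list[Metadata] = []
    for path in paths:
        if not os.path.isfile(path):
            # Missing file: abort before any notify is sent.
            return f"Error: attachment not found: {path}"
        try:
            uploaded.append(upload(path))
        except Exception as exc:
            # Any upload failure aborts the whole send.
            return f"Error: upload failed for {path}: {exc}"
    notify(message, uploaded)
    return "sent"
```

Because notify runs only after the loop finishes, a failure on any attachment means no message is sent at all, which is the property the missing-file and upload-5xx tests pin.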
Why /tmp paths work even though they're outside the canvas's allowed
roots: the runtime tool reads the bytes locally and re-uploads through
/chat/uploads, which lands the file under /workspace (an allowed root).
The agent can specify any readable path.
Does NOT include: agent → agent file transfer. Different design problem
(cross-workspace download auth: peer would need a credential to call
sender's /chat/download). Tracked as a follow-up under task #114.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
[Molecule-Platform-Evolvement-Manager]
Addresses github-code-quality finding on PR #2064:
> Comparison between inconvertible types
> Variable 'info' cannot be of type null, but it is compared to
> an expression of type null.
By line 75, `info` has been narrowed to non-null via the
`if (!info) return null;` guard at line 56 — so `open={info !== null}`
always evaluates to `true`. Switch to JSX shorthand `open` for
clarity and to silence the static check.
Behaviorally identical; the modal still opens whenever the parent
renders this component (which only happens with non-null info).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Critical follow-up to PR #2126's review. Two real bugs:
1. **Runtime QUEUED never resolved.** Platform's drain stitch updates
the platform's delegate_result row when a queued delegation finally
completes, but never pushes back to the runtime. The LLM polling
check_delegation_status saw status="queued" forever — combined with
the new docstring guidance ("queued → wait, peer will reply"), the
model would wait indefinitely on a state that never resolves.
Strictly worse than pre-PR behavior where it would have at least
bypassed.
2. **Live updates dead code.** delegation.go writes activity rows by
direct INSERT INTO activity_logs, bypassing the LogActivity helper
that fires ACTIVITY_LOGGED. Adding "delegation" to the canvas's
ACTIVITY_LOGGED filter (PR #2126 first cut) was inert — initial
GET worked, live updates did not.
Fix:
(1) Runtime side, workspace/builtin_tools/delegation.py:
- New `_refresh_queued_from_platform(task_id)` async helper that
pulls /workspaces/<self>/delegations and finds the platform-side
delegate_result row for our task_id.
- check_delegation_status calls _refresh when local status is
QUEUED, so the LLM's poll itself drives state convergence.
- Best-effort: GET failure leaves local state untouched, next
poll retries.
- Docstring updated to reflect the actual behavior ("polls
transparently — keep polling and you'll see the flip").
- 4 new tests cover: QUEUED → completed via refresh; QUEUED →
failed via refresh; refresh keeps QUEUED when platform hasn't
resolved; refresh swallows network errors safely.
(2) Canvas side, AgentCommsPanel.tsx WS push handler:
- Listens for DELEGATION_SENT / DELEGATION_STATUS / DELEGATION_COMPLETE
/ DELEGATION_FAILED in addition to ACTIVITY_LOGGED.
- Each event's payload synthesized into an ActivityEntry shape
so toCommMessage's existing delegation branch maps it. Status
derived: STATUS uses payload.status, COMPLETE → "completed",
FAILED → "failed", SENT → "pending".
- The ACTIVITY_LOGGED branch keeps the "delegation" type accepted
as a no-op-today / future-proof path: if delegation handlers
are ever refactored to call LogActivity, this lights up
automatically without another canvas change.
Doesn't change: the docstring guidance ("queued → wait, don't bypass")
is now actually load-bearing because the refresh path will deliver
the eventual outcome. Without the refresh, the guidance was a trap.
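The runtime-side refresh can be pictured as a pure reconciliation step. This is a sketch only: the row field names (`task_id`, `status`, `result`, `error`) and the convention that `None` means "the GET failed" are assumptions for illustration, and the real helper is async and performs the HTTP call itself:

```python
from typing import Any, Optional

TERMINAL_STATUSES = {"completed", "failed"}


def refresh_queued(
    local: dict[str, Any],
    platform_rows: Optional[list[dict[str, Any]]],
) -> dict[str, Any]:
    """Return the local delegation state, updated if the platform resolved it.

    platform_rows is None when the GET failed; in that case the local
    state is returned untouched so the next poll can retry.
    """
    if local.get("status") != "queued" or platform_rows is None:
        return local
    for row in platform_rows:
        same_task = row.get("task_id") == local.get("task_id")
        if same_task and row.get("status") in TERMINAL_STATUSES:
            return {
                **local,
                "status": row["status"],
                "result": row.get("result"),
                "error": row.get("error"),
            }
    # Platform has not resolved the delegation yet: stay queued.
    return local
```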
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Review-feedback follow-up. Pre-fix, A2A_IDLE_TIMEOUT_SECONDS=foo or =-30
fell back to the default with zero log signal — operator sets the wrong
value, sees "no effect," wastes hours debugging "why is my override not
working." Now bad-input cases log a clear message naming the variable,
the bad value, and the default applied.
Refactor: extract parseIdleTimeoutEnv(string) → time.Duration so the
parse logic is unit-testable. defaultIdleTimeoutDuration is a const so
tests reference it without re-deriving the value.
8 new unit tests cover empty / valid / negative / zero / non-numeric /
float / trailing-units inputs.
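The parse logic lives in Go; the following Python sketch only illustrates the shape of the behavior. It assumes that an empty value silently selects the default, that zero counts as invalid, and that the default is the 5-minute value; the logger name and message wording are hypothetical:

```python
import logging
import re

DEFAULT_IDLE_TIMEOUT_SECONDS = 300  # assumed to mirror the 5-minute default
log = logging.getLogger("a2a_proxy")  # hypothetical logger name


def parse_idle_timeout_env(raw: str) -> int:
    """Parse A2A_IDLE_TIMEOUT_SECONDS, logging and defaulting on bad input."""
    if raw == "":
        return DEFAULT_IDLE_TIMEOUT_SECONDS  # unset: default, no warning
    # Only plain positive integers are accepted; this rejects negatives,
    # floats, trailing units such as "30s", and non-numeric text.
    if re.fullmatch(r"[0-9]+", raw) is None or int(raw) == 0:
        log.warning(
            "A2A_IDLE_TIMEOUT_SECONDS=%r is invalid; using default %ds",
            raw,
            DEFAULT_IDLE_TIMEOUT_SECONDS,
        )
        return DEFAULT_IDLE_TIMEOUT_SECONDS
    return int(raw)
```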
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two compounding bugs caused the "context canceled" wave on 2026-04-26
(15+ failed user/agent A2A calls in 1hr across 6 workspaces, including
the user's "send it in the chat" message that the director never
received):
1. **a2a_proxy.go:applyIdleTimeout cancels the dispatch after 60s of
broadcaster silence** for the workspace. Resets on any SSE event
for the workspace, fires cancel() if no event arrives in time.
2. **registry.go:Heartbeat broadcast was conditional** —
`if payload.CurrentTask != prevTask`. The runtime POSTs
/registry/heartbeat every 30s, but if current_task hasn't changed
the handler emits ZERO broadcasts. evaluateStatus only broadcasts
on online/degraded transitions — also no-op when steady.
Net: a claude-code agent on a long packaging step or slow tool call
keeps the same current_task for >60s → no broadcasts → idle timer
fires → in-flight request cancelled mid-flight with the "context
canceled" error the user sees in the activity log.
Fix:
(a) Heartbeat handler always emits a `WORKSPACE_HEARTBEAT` BroadcastOnly
event (no DB write — same path as TASK_UPDATED). At the existing 30s
runtime cadence this resets the idle timer twice per minute.
Cost is one in-memory channel send per active SSE subscriber + one
WS hub fan-out per heartbeat — far below any noise floor.
(b) idleTimeoutDuration default bumped 60s → 5min as a safety net for
any future regression where the heartbeat path goes silent (e.g.
runtime crashed mid-request before its next heartbeat). Made
env-overridable via A2A_IDLE_TIMEOUT_SECONDS for ops who want to
tune (canary tests fail-fast, prod tenants with slow plugins want
longer). Either fix alone closes today's gap; both together is
defence in depth.
The runtime side already POSTs /registry/heartbeat every 30s via
workspace/heartbeat.py — no runtime change needed.
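To see why either fix closes the gap, a small simulation of the idle timer helps. This is an illustrative model, not the Go code in a2a_proxy.go: time is in whole seconds, the timer starts at 0, and every broadcast resets it.

```python
from typing import Optional


def first_idle_cancel(
    broadcast_times: list[int], timeout: int, horizon: int
) -> Optional[int]:
    """Return the second at which the idle timer fires, or None if it never
    fires while the request is in flight (0 <= t <= horizon)."""
    last_reset = 0
    for t in sorted(broadcast_times):
        if t > horizon:
            break
        if t - last_reset >= timeout:
            return last_reset + timeout  # timer expired before this event
        last_reset = t
    if horizon - last_reset >= timeout:
        return last_reset + timeout
    return None


# A 180s request whose current_task never changes:
# - conditional heartbeat broadcast -> no events -> cancelled at 60s
# - unconditional 30s heartbeat     -> never cancelled
# - no events but a 300s timeout    -> never cancelled
```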
Test: TestHeartbeatHandler_AlwaysBroadcastsHeartbeat pins the property
that an SSE subscriber observes a WORKSPACE_HEARTBEAT broadcast on a
same-task heartbeat (the regression scenario). All 16 existing handler
tests still pass.
Doesn't fix: task #102 (single SDK session bottleneck) — peers will
still queue when busy. But this PR ensures the queue/wait flow
actually completes instead of being killed by the idle timer
mid-wait.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two bugs that compounded into the "Director does the work itself" UX:
1. workspace/builtin_tools/delegation.py: _execute_delegation only
handled HTTP 200 in the response branch. When the peer's a2a-proxy
returned HTTP 202 + {queued: true} (single-SDK-session bottleneck
on the peer), the loop fell through. Two iterations later the
`if "error" in result` check tried to access an unbound `result`,
the coroutine ended quietly, and the delegation stayed at FAILED
with error="None". The LLM checking status saw "failed" + the
platform's "Delegation queued — target at capacity" log line in
chat context, concluded the peer was permanently unavailable, and
bypassed delegation to do the work itself.
Fix: explicit 202+queued branch. Adds DelegationStatus.QUEUED,
marks the local delegation as QUEUED, mirrors to the platform,
and returns cleanly without retrying. The retry loop is for
transient transport errors — queueing is a real ack, not a failure
to retry against (retrying would just re-queue the same task).
check_delegation_status docstring extended with explicit per-status
guidance: pending/in_progress → wait, queued → wait (peer busy on
prior task, reply WILL arrive), completed → use result, failed →
real error in error field; only fall back on failed, never queued.
2. canvas/src/components/tabs/chat/AgentCommsPanel.tsx: filter dropped
every delegation row because it whitelisted only a2a_send /
a2a_receive. activity_type='delegation' rows (written by the
platform's /delegate handler with method='delegate' or
'delegate_result') never reached toCommMessage. User saw "No
agent-to-agent communications yet" while 6+ delegations existed
in the DB.
Fix: include "delegation" in both the initial filter and the
WS push filter, plus a delegation branch in toCommMessage that
maps the row as outbound (always — platform proxies on our behalf)
and uses summary as the primary text source.
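The runtime-side decision in fix 1 can be sketched as two small pure functions. The outcome labels and function names are hypothetical; only the status values and the rule "fall back on failed, never on queued" come from the description above:

```python
from enum import Enum


class DelegationStatus(str, Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    QUEUED = "queued"
    COMPLETED = "completed"
    FAILED = "failed"


def classify_response(status_code: int, body: dict) -> str:
    """Decide what the delegation loop should do with a peer response."""
    if status_code == 200:
        return "handle_result"
    if status_code == 202 and body.get("queued") is True:
        # A real ack from the peer: mark QUEUED and stop, do not retry.
        return "queued"
    # Anything else, including a bare 202, stays on the retry path.
    return "retry"


def should_fall_back(status: DelegationStatus) -> bool:
    """Only a real failure justifies doing the work locally."""
    return status is DelegationStatus.FAILED
```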
Tests:
- 3 new Python tests cover the 202+queued path: status becomes
QUEUED not FAILED; no retry on queued (counted by URL match
against the A2A target since the mock is shared across all
AsyncClient calls); bare 202 without {queued:true} still
falls through to the existing retry-then-FAILED path.
- 3 new TS tests cover the delegation mapper: 'delegate' row
maps as outbound to target with summary text; queued
'delegate_result' preserves status='queued' (load-bearing for
the LLM's wait-vs-bypass decision); missing target_id returns
null instead of rendering a ghost.
Does NOT solve: the underlying single-SDK-session bottleneck that
causes peers to queue in the first place. Tracked as task #102
(parallel SDK sessions per workspace) — real architectural work.
This PR makes the runtime handle the queueing correctly so the LLM
doesn't bail out, and makes the delegations visible in Agent Comms
so operators can see what's happening.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
[Molecule-Platform-Evolvement-Manager]
## What was broken
`canary-staging.yml`'s teardown safety-net step filtered candidate
slugs with `f'e2e-{today}-canary-'`. But `test_staging_full_saas.sh`
emits canary slugs as `e2e-canary-${date}-${RUN_ID_SUFFIX}` — date
SECOND, mode FIRST. Full-mode slugs are the other way around
(`e2e-${date}-${RUN_ID_SUFFIX}`), and the canary workflow seems to
have been copy-pasted from there without re-checking the slug
generator.
Net effect: the safety-net step ran on every cancelled / failed
canary, hit the CP, got the org list, filtered to zero matches,
and exited cleanly. Every cancelled canary EC2 leaked until the
once-an-hour `sweep-stale-e2e-orgs.yml` cron eventually caught it
(120-min default age threshold means ≥1h leak in the worst case).
## Today's incident
Canary run 24966995140 cancelled at 21:03Z. EC2
`tenant-e2e-canary-20260426-canary-24966` still running 1h25m
later, manually terminated by the CEO. Three earlier cancellations
today (16:04Z, 19:26Z, 20:02Z) hit the same gap — visible as the
hourly canary failure pattern in #2090.
## Fix
- Filter prefix corrected to `e2e-canary-${today}-` (mode FIRST,
date SECOND) to match the actual slug emitter.
- Added per-run scoping (`-canary-${GITHUB_RUN_ID}-` suffix) when
GITHUB_RUN_ID is set, mirroring the e2e-staging-saas.yml safety
net's per-run scoping that was added after the 2026-04-21
cross-run cleanup incident — guards against a queued canary's
safety-net step deleting an in-flight different canary's slug
while the queue's `cancel-in-progress: false` lets two reach the
teardown step concurrently.
- Added a comment block tracing the bug + the prior incident so
the next maintainer doesn't re-introduce the same mistake.
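The prefix mismatch is easy to reproduce in isolation. In this sketch the helper names are hypothetical and the sample slug is the truncated one from this incident; per-run scoping is left out because the exact suffix format is not spelled out here:

```python
def old_prefix(today: str) -> str:
    # Date FIRST, mode SECOND: matches no canary slug.
    return f"e2e-{today}-canary-"


def new_prefix(today: str) -> str:
    # Mode FIRST, date SECOND: matches the canary slug emitter.
    return f"e2e-canary-{today}-"


def filter_slugs(slugs: list[str], prefix: str) -> list[str]:
    """Keep only the slugs the safety net would consider for teardown."""
    return [s for s in slugs if s.startswith(prefix)]


slugs = ["e2e-canary-20260426-canary-24966", "e2e-20260426-abc"]
leaked = filter_slugs(slugs, old_prefix("20260426"))  # empty: nothing reaped
caught = filter_slugs(slugs, new_prefix("20260426"))  # the canary slug only
```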
## Test plan
- [x] Manual trace: today's slug `e2e-canary-20260426-canary-24966...`
now matches `e2e-canary-20260426-canary-24966` prefix
- [x] YAML parses
- [ ] Next canary cancellation cleans up automatically
## Companion PR
The PRIMARY symptom (TLS-timeout failures, not the leaked EC2)
traces to a separate bug in `molecule-controlplane`: tunnel/DNS
creation errors are logged-and-continued rather than failing
provision. PR coming separately.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two paper cuts the fix addresses:
1. nuke-and-rebuild.sh wipes the compose stack but never re-populates
workspace-configs-templates/, org-templates/, or plugins/. Those dirs
are .gitignored — the curated set lives in manifest.json as external
repos cloned via clone-manifest.sh (idempotent). Without that step,
a fresh checkout or a post-deletion run leaves the dirs empty, which
silently hides the entire template palette in Canvas + falls back to
bare default workspace provisioning. Symptom: "Deploy your first
agent" shows zero templates.
2. The existing ws-* container reap was already in the script (good),
but it only fires when this script runs. Folks running `docker compose
down -v` directly leave orphan ws-* containers behind. Documented
that explicitly in the script comment so future readers understand
why those lines are critical.
The fix is just `bash clone-manifest.sh` added to the script. clone-
manifest.sh is idempotent — populated dirs short-circuit, so a re-nuke
on a healthy machine pays only a few stat calls.
scripts/test-nuke-and-rebuild.sh exercises the canonical workflow end-
to-end:
- plants a fake orphan ws-* container, then asserts it gets reaped
- renames the manifest dirs to simulate a fresh checkout, then
asserts they get repopulated
- waits for /health and asserts the platform sees the same template
count on disk as via /configs in the container (catches bind-mount
drift)
- asserts the image-auto-refresh watcher (PR #2114) starts, since
that's load-bearing for the CD chain users now rely on
The test pre-flights port 5432/6379/8080 and exits 0 with a SKIP
message if a non-target compose project is holding them — common when
parallel monorepo checkouts coexist on one Docker daemon.
scripts/ is intentionally outside CI shellcheck per ci.yml comment, but
both files pass `shellcheck --severity=warning` anyway.
Defers but does not solve the runtime root-cause for orphan ws-* after
plain `docker compose down -v`: the orphan-sweeper in the platform only
reaps containers whose workspace row says status='removed', so a wiped
DB → no row → sweeper ignores them. Proper fix needs container labels
keyed to a per-platform-instance UUID so the sweeper can confidently
reap "containers I provisioned that aren't in my DB anymore" without
nuking a sibling platform's containers on a shared daemon. Tracked as
task #109's follow-up; out of scope for this PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #2107 bumped the bash-side TLS-readiness deadline in
tests/e2e/test_staging_full_saas.sh from 600s to 900s (15 min) AND
added a diagnostic burst on the fail path so the next failure would
identify the broken layer (DNS / TLS / HTTP). What I missed: the
canary workflow's own timeout-minutes was also 15. So GitHub Actions
killed the job at the 15:00 wall-clock mark BEFORE the bash `fail`
+ diagnostic could fire — every cancellation silent, no failure
comment on #2090, no diagnostic data attached.
Visible in the 21:03 UTC canary run: cancelled at 14:03 step time
(15:18 wall) without ever reaching the diagnostic block.
Bump to 25 min — gives ~10 min headroom over the 15-min bash deadline
for setup (org create + tenant provision + admin token fetch) plus
the diagnostic dump plus teardown. Still tighter than the sibling
staging E2E jobs (20/40/45 min) so a genuine wedge surfaces here
first.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The existing sweeper only reaps ws-* containers whose workspace row
has status='removed'. That misses the entire wiped-DB case: an
operator does `docker compose down -v` (kills the postgres volume),
the previous platform's ws-* containers keep running, the new
platform boots into an empty workspaces table — first pass finds
zero candidates and those containers leak forever. Symptom users
hit today: 7 ws-* containers from 11h ago, no rows in DB, no
visibility in Canvas, eating CPU + memory.
Fix shape:
1. Provisioner stamps every ws-* container + volume with
`molecule.platform.managed=true`. Without a label, the sweeper
would have to assume any unlabeled ws-* container might belong
to a sibling platform stack on a shared Docker daemon.
2. Provisioner exposes ListManagedContainerIDPrefixes — a label-filter
counterpart to the existing name-filter.
3. Sweeper splits sweepOnce into two independent passes:
- sweepRemovedRows (unchanged behavior; status='removed' only)
- sweepLabeledOrphansWithoutRows (new; labeled containers whose
workspace_id has no row in the table at all)
Each pass has its own short-circuit so an empty result or transient
error in one doesn't block the other — load-bearing because the
wiped-DB pass exists precisely for cases where the removed-row
pass finds nothing.
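The core of the new pass is a set difference between labeled containers and workspace rows. The sketch below is a Python illustration of that idea, not the Go sweeper. It assumes a prefix is "UUID-like" when it contains only lowercase hex digits and dashes, and that a row "exists" when some workspace id starts with the prefix; both rules are guesses made for the example:

```python
import re

# Hypothetical validity rule for a workspace-id prefix.
_UUID_PREFIX = re.compile(r"[0-9a-f-]+")


def find_labeled_orphans(
    managed_prefixes: list[str], workspace_ids: set[str]
) -> list[str]:
    """Return managed container id prefixes that have no workspace row."""
    # Filter malformed prefixes before they could reach a query.
    candidates = [p for p in managed_prefixes if _UUID_PREFIX.fullmatch(p)]
    return [
        p
        for p in candidates
        if not any(ws_id.startswith(p) for ws_id in workspace_ids)
    ]
```

With an empty workspaces table (the wiped-DB case) every well-formed labeled prefix is returned, which is exactly the situation the removed-row pass cannot see.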
Safe under multi-platform-on-shared-daemon: only containers carrying
our label get reaped, sibling stacks' containers are invisible to this
pass. (For now the label is a constant string; a future per-instance
UUID layer can refine "ours" further if a real shared-daemon scenario
emerges.)
Migration: existing platforms running pre-PR builds have UNLABELED
ws-* containers. After this lands they continue to NOT be reaped by
the new path (no label = invisible). They'll only be cleaned via
manual intervention or once the operator recreates them — same as
today. No regression.
Tests cover all five branches of the new pass: happy-path reap,
no-reap when row exists, mixed reap-some-keep-some, Docker error
short-circuits cleanly, non-UUID prefixes get filtered before the
SQL query.
Pairs with PR #2122 (script-level fix). Together they close the
orphan-leak path for both `bash scripts/nuke-and-rebuild.sh` users
(handled by the script) AND `docker compose down -v` users (handled
by the runtime).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
[Molecule-Platform-Evolvement-Manager]
Closes the first item from #2071 (Canvas test gaps follow-up):
adds behavioural coverage for the shared template-deploy hook that
both TemplatePalette (sidebar) and EmptyState (welcome grid) drive.
10 cases across 4 buckets:
**Happy path (4):**
- preflight ok → POST /workspaces → onDeployed fires with new id
- caller-supplied canvasCoords flows into the POST body
- default coords fall in [100,500) × [100,400) when canvasCoords omitted
- template.runtime is preferred over the resolveRuntime fallback
(locks the deduped-fallback table contract added in #2061)
**Preflight failures (2):**
- network throw sets error AND clears `deploying` (regression test
for the "stranded button" bug called out in the SUT's inline
comment — drop the try block and you'll fail this test)
- not-ok-with-missing-keys opens the modal without firing POST
**Modal lifecycle (2):**
- 'keys added' click retries POST without re-running preflight
(verifies the executeDeploy / deploy split — preflight call count
stays at 1, POST count goes to 1)
- 'cancel' click closes modal without firing POST
**POST failures (2):**
- Error rejection surfaces the message
- non-Error rejection surfaces the "Deploy failed" fallback
Mocks `@/lib/api`, `@/lib/deploy-preflight`, and `@/components/MissingKeysModal`
(stand-in component exposes the two callbacks as test-id buttons —
the real radix modal is irrelevant to this hook's behavior). Test
file follows the `vi.hoisted` + import-after-mocks pattern from
`canvas/src/app/__tests__/orgs-page.test.tsx`.
## Test plan
- [x] All 10 cases pass locally (`vitest run useTemplateDeploy.test.tsx`)
- [x] No changes to the SUT — pure additive coverage
- [ ] CI green
Follow-ups for the rest of #2071 (separate PRs):
- A2AEdge rendering + click-to-select-source
- OrgCancelButton cancel flow + optimistic state
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
[Molecule-Platform-Evolvement-Manager]
## What was breaking
Two distinct failure modes in `.github/workflows/secret-scan.yml`,
both visible after PR #2115 / #2117 hit the merge queue:
1. **`merge_group` events**: the script reads `github.event.before /
after` to determine BASE/HEAD. Those properties only exist on
`push` events. On `merge_group` events both came back empty, the
script fell through to "no BASE → scan entire tree" mode, and
false-positived on `canvas/src/lib/validation/__tests__/secret-formats.test.ts`
which contains a `ghp_xxxx…` literal as a masking-function fixture.
(Run 24966890424 — exit 1, "matched: ghp_[A-Za-z0-9]{36,}".)
2. **`push` events with shallow clone**: `fetch-depth: 2` doesn't
always cover BASE across true merge commits. When BASE is in the
payload but absent from the local object DB, `git diff` errors
out with `fatal: bad object <sha>` and the job exits 128.
(Run 24966796278 — push at 20:53Z merging #2115.)
## Fixes
- Add a dedicated fetch step for `merge_group.base_sha` (mirrors
the existing pull_request base fetch) so the diff base is in the
object DB before `git diff` runs.
- Move event-specific SHAs into a step `env:` block so the script
uses a clean `case` over `${{ github.event_name }}` instead of
a single `if pull_request / else push` that left merge_group on
the empty branch.
- Add an on-demand fetch for the push-event BASE when it isn't in
the shallow clone, plus a `git cat-file -e` guard before the
diff so we fall through cleanly to the "scan entire tree" path
if the fetch fails (correct, just slower) instead of exiting 128.
## Defense-in-depth
`secret-formats.test.ts` had two literal continuous-string fixtures
(`'ghp_xxxx…'`, `'github_pat_xxxx…'`). The ghp_ one matched the
secret-scan regex. Switched both to the `'prefix_' + 'x'.repeat(N)`
pattern already used elsewhere in the same file — runtime value is
the same, but the literal source text no longer matches the regex
even if the BASE detection ever falls back to tree-scan mode again.
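The effect of the fixture change can be checked against the regex directly. The sketch below models what a line-based scanner sees in the source text; the fixture lines are illustrative rather than copied from the test file, and the literal form is assembled by concatenation so this example does not itself contain a matching string:

```python
import re

SECRET_RE = re.compile(r"ghp_[A-Za-z0-9]{36,}")


def source_line_matches(line: str) -> bool:
    """True if a scanner reading this source line would flag it."""
    return SECRET_RE.search(line) is not None


# Source text of a continuous-string fixture (what the file used to hold).
literal_fixture = "const t = 'ghp_" + "x" * 36 + "';"

# Source text of the concatenated form: the run after ghp_ is broken by
# the quote character, so the regex cannot match the source line.
built_fixture = "const t = 'ghp_' + 'x'.repeat(36);"

# The runtime value is unchanged and still looks like a token.
runtime_value = "ghp_" + "x" * 36
```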
## Test plan
- [x] No remaining regex matches in the secret-formats.test.ts source
- [x] YAML structure preserved
- [ ] CI passes on this PR's pull_request scan (was already passing)
- [ ] CI passes on this PR's merge_group scan (the new path)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-up to #2110 (which generalised pruneStaleKeys to Map<string, T>).
Identified by the simplify reviewer on that PR as the only other
in-tree caller of the same shape: `for (const id of map.keys()) { if
(!liveIds.has(id)) map.delete(id); }`.
Net: -3 lines, one less hand-rolled GC loop. No behaviour change —
the helper does exactly what the inline block did.
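For reference, the shape of the helper is small enough to show in full. This is a Python rendering of the idea only; the real pruneStaleKeys is TypeScript and operates on a Map:

```python
from typing import Any


def prune_stale_keys(entries: dict[str, Any], live_ids: set[str]) -> None:
    """Delete, in place, every key that is not in live_ids."""
    # Collect first: a dict cannot change size while being iterated.
    for key in [k for k in entries if k not in live_ids]:
        del entries[key]
```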
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Simplify pass on top of #2069 fix:
- Export FALLBACK_POLL_MS from canvas/src/store/socket.ts and import
it as TOMBSTONE_TTL_MS in deleteTombstones.ts. Single source of
truth — tuning one without the other would silently re-open the
hydrate-races-delete window. Required-fix per simplify reviewer.
- Compress deleteTombstones.ts docstring from 30 lines to 10 — keep
the "what + why module-level"; drop the long-form problem
description (issue #2069 carries it).
- Compress canvas.ts call-site comments at removeSubtree (4 lines →
2) and hydrate (2 lines → 2 but tighter).
- Don't reassign the workspaces parameter inside hydrate — use a
const `live` and thread it through the two downstream calls
(computeAutoLayout, buildNodesAndEdges). Same effect, no lint
smell.
- Trim the canvas.test.ts integration-test preamble.
No behaviour change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #2069. removeSubtree dropped a parent + descendants locally
after DELETE returned 200, but a GET /workspaces request that was
IN-FLIGHT before the DELETE completed could land AFTER and hydrate
the store with a stale snapshot — re-introducing the deleted nodes
on the canvas until the next 10s fallback poll corrected it.
New module canvas/src/store/deleteTombstones.ts holds a transient
process-lifetime Map<id, deletedAt>. removeSubtree calls
markDeleted(removedIds); hydrate calls wasRecentlyDeleted(id) to
filter the incoming workspaces. TTL is 10s — matches the WS-fallback
poll cadence so a single round-trip is covered, after which a
legitimately re-imported id flows through normally.
GC happens lazily at every read AND at write time so the map stays
bounded — no separate timer / interval / unmount plumbing.
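A compact model of the tombstone map, written in Python for illustration (the module itself is TypeScript). The class wrapper, the injectable clock, and the choice to expire at exactly 10000 ms are assumptions; the tests described below only pin 9999 ms and 10001 ms:

```python
import time
from typing import Callable, Iterable

TOMBSTONE_TTL_MS = 10_000


class DeleteTombstones:
    def __init__(
        self, now_ms: Callable[[], float] = lambda: time.monotonic() * 1000
    ) -> None:
        self._now_ms = now_ms
        self._deleted_at: dict[str, float] = {}

    def _gc(self) -> None:
        """Drop expired entries; called on every read and write."""
        now = self._now_ms()
        expired = [
            k for k, t in self._deleted_at.items() if now - t >= TOMBSTONE_TTL_MS
        ]
        for key in expired:
            del self._deleted_at[key]

    def mark_deleted(self, ids: Iterable[str]) -> None:
        self._gc()
        now = self._now_ms()
        for id_ in ids:
            self._deleted_at[id_] = now  # re-marking resets the timestamp

    def was_recently_deleted(self, id_: str) -> bool:
        self._gc()
        return id_ in self._deleted_at
```

Injecting the clock keeps the TTL boundary testable without sleeping.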
Tests:
- canvas/src/store/__tests__/deleteTombstones.test.ts: 7 cases
covering immediate flag, never-marked, TTL boundary (9999ms vs
10001ms), GC-on-read, GC-on-write, re-mark resets timestamp,
iterable input.
- canvas/src/store/__tests__/canvas.test.ts: end-to-end "hydrate
cannot resurrect ids that removeSubtree just dropped (#2069)"
exercises the full chain at the store level.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Picks up the GHCR digest watcher added in PR #2114 with no operator
action: just `docker compose up` and the platform self-heals to the
latest workspace-template image within 5 minutes of publish.
Default ON for local dev because that's where the runtime → workspace
iteration loop is tightest. .env.example documents the override knob
for the rare "running a long test that shouldn't be disturbed by a
publish" case.
Co-authored-by: Hongming Wang <hongmingwangalt@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After landing the 1-required-review gate on staging in cycle 24, every
agent-authored PR sits with `REVIEW_REQUIRED` until someone notices.
CODEOWNERS solves the routing half: every changed path matches `*`, so
GitHub auto-requests review from @hongmingwang-moleculeai (the
personal account, separate from the HongmingWang-Rabbit agent
identity). PRs land in the personal account's notification queue
automatically.
The `* @hongmingwang-moleculeai` line is informational (route the
request) rather than enforced — branch protection's
require_code_owner_reviews flag is off, so any approving review still
satisfies the 1-review gate. Flip that on later if you want CODEOWNERS
approval to be the *required* review type.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an opt-in goroutine that polls GHCR every 5 minutes for digest
changes on each workspace-template-*:latest tag and invokes the same
refresh logic /admin/workspace-images/refresh exposes. With this, the
chain from "merge runtime PR" to "containers running new code" is fully
hands-off — no operator step between auto-tag → publish-runtime →
cascade → template image rebuild → host pull + recreate.
Opt-in via IMAGE_AUTO_REFRESH=true. SaaS deploys whose pipeline already
pulls every release should leave it off (would be redundant work);
self-hosters get true zero-touch.
Why a refactor of admin_workspace_images.go is in this PR:
The HTTP handler held all the refresh logic inline. To share it with
the new watcher without HTTP loopback, extracted WorkspaceImageService
with a Refresh(ctx, runtimes, recreate) (RefreshResult, error) shape.
HTTP handler is now a thin wrapper; behavior is preserved (same JSON
response, same 500-on-list-failure, same per-runtime soft-fail).
Watcher design notes:
- Last-observed digest tracked in memory (not persisted). On boot the
first observation per runtime is seed-only — no spurious refresh
fires on every restart.
- On Refresh error, the seen digest rolls back so the next tick retries.
Without this rollback a transient Docker glitch would convince the
watcher the work was done.
- Per-runtime fetch errors don't block other runtimes (one template's
brief 500 doesn't pause the others).
- digestFetcher injection seam in tick() lets unit tests cover all
bookkeeping branches without standing up an httptest GHCR server.
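The bookkeeping in these notes can be sketched as a single tick function. This is a Python illustration of the described behavior, not the Go watcher; the signature and the injected `fetch_digest` / `refresh` callables are hypothetical:

```python
from typing import Callable


def tick(
    seen: dict[str, str],
    runtimes: list[str],
    fetch_digest: Callable[[str], str],
    refresh: Callable[[str], None],
) -> list[str]:
    """Run one poll cycle; return the runtimes that were refreshed."""
    refreshed: list[str] = []
    for runtime in runtimes:
        try:
            digest = fetch_digest(runtime)
        except Exception:
            continue  # one runtime's fetch error does not block the others
        previous = seen.get(runtime)
        seen[runtime] = digest
        if previous is None or previous == digest:
            continue  # first observation is seed-only; unchanged is a no-op
        try:
            refresh(runtime)
        except Exception:
            if previous is not None:
                seen[runtime] = previous  # roll back so the next tick retries
            continue
        refreshed.append(runtime)
    return refreshed
```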
Verified live: probed GHCR's /token + manifest HEAD against
workspace-template-claude-code; got HTTP 200 + a real
Docker-Content-Digest. Same calls the watcher makes.
Co-authored-by: Hongming Wang <hongmingwangalt@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: point new-runtime-template flow at the GitHub template repo
The 'Writing a new adapter' section was a 6-step manual checklist that
re-derived the canonical shape every time. Now that
Molecule-AI/molecule-ai-workspace-template-starter exists as a GitHub
template, the flow collapses to:
gh repo create ... --template Molecule-AI/molecule-ai-workspace-template-starter
Plus a fill-in-the-TODO-markers table.
Why this matters: the starter ships with the
'repository_dispatch: [runtime-published]' cascade receiver pre-wired,
which means new templates pick up runtime PyPI publishes automatically
without the one-time setup PR each existing template needed (PRs #6-#22
across the 8 template repos that we just opened to retrofit). At
'hundreds of runtimes' scale this is the difference between linear PR-
toil and zero PR-toil per template addition.
Also adds: 'When the starter itself needs to evolve' — explicit pattern
for keeping the canonical shape in one place when it changes.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
* docs(workspace-runtime): drop PYPI_TOKEN refs — OIDC is the new auth
Reflects PR #2113 (PyPI Trusted Publisher / OIDC migration). No static
PyPI token exists in the repo anymore, so the docs shouldn't claim one
does. Replaces the PYPI_TOKEN row in the Required Secrets table with an
"Auth" section pointing at the OIDC config; TEMPLATE_DISPATCH_TOKEN is
still the only repo secret the cascade needs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Hongming Wang <hongmingwangalt@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drops the static PYPI_TOKEN secret in favor of OIDC trusted publishing.
PyPI now mints a short-lived upload credential after verifying the
workflow's OIDC claim against the trusted-publisher config registered
for molecule-ai-workspace-runtime (Molecule-AI/molecule-core,
publish-runtime.yml, environment pypi-publish).
Why:
- A leaked PYPI_TOKEN would let any holder publish arbitrary versions of
molecule-ai-workspace-runtime to PyPI from anywhere — bypassing the
monorepo's review and CI gates entirely. The 8 template repos pull
this package; a malicious publish poisons all of them.
- Trusted Publisher (OIDC) makes that exfil path moot: no long-lived
credential exists to leak. Only this exact workflow, on this repo,
in the pypi-publish environment, can upload.
After this lands and the first OIDC publish succeeds, the PYPI_TOKEN
repo secret should be deleted (it becomes dead weight + a leak surface
with no purpose).
Belt-and-suspenders companion to PR #56 in molecule-ai-workspace-runtime
(sibling repo lockdown). Without OIDC, the sibling lockdown alone
doesn't prevent local `python -m build && twine upload` from a laptop
with a personal PyPI maintainer credential.
Co-authored-by: Hongming Wang <hongmingwangalt@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The original smoke step had `assert a2a_client._A2A_QUEUED_PREFIX`,
which is a feature-flag-style check: it fails spuriously whenever
staging is mid-release of that specific feature. Caught when the dry-run
publish (run 24965411618) failed because _A2A_QUEUED_PREFIX hadn't
landed on staging yet (it lives in PR #2061's series, separate from the
PR #2103 chain that shipped this workflow).
Replaced with checks for stable invariants of the package contract:
- a2a_client._A2A_ERROR_PREFIX exists (always has, since the
[A2A_ERROR] sentinel is the foundational error-tagging primitive)
- adapters.get_adapter is callable
- BaseAdapter has the .name() static method (interface anchor)
- AdapterConfig has __init__ (dataclass present)
These four cover the cases the smoke test actually needs to catch:
import-path rewrites broken by build_runtime_package.py, missing
modules, dataclass shape regressions. They don't fire when a specific
feature is mid-merge.
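For illustration, the four invariant checks could be sketched roughly as below. This is a minimal sketch, not the workflow's actual smoke step: the function name check_package_invariants and the stand-in classes are hypothetical, and the real step would import a2a_client, adapters, BaseAdapter and AdapterConfig from the built package instead of using stubs.

```python
from dataclasses import dataclass
from types import SimpleNamespace


def check_package_invariants(a2a_client, adapters, base_adapter_cls, adapter_config_cls):
    """Return a list of failure messages; an empty list means the smoke check passed."""
    failures = []
    # The error-tagging sentinel must be present and truthy.
    if not getattr(a2a_client, "_A2A_ERROR_PREFIX", None):
        failures.append("a2a_client._A2A_ERROR_PREFIX is missing")
    # The adapter lookup entry point must be callable.
    if not callable(getattr(adapters, "get_adapter", None)):
        failures.append("adapters.get_adapter is not callable")
    # Interface anchor: BaseAdapter exposes name().
    if not callable(getattr(base_adapter_cls, "name", None)):
        failures.append("BaseAdapter.name is missing")
    # Mostly proves the class imported at all, mirroring the check described above.
    if not hasattr(adapter_config_cls, "__init__"):
        failures.append("AdapterConfig has no __init__")
    return failures


# Hypothetical stand-ins so the sketch runs without the real package installed.
class BaseAdapter:
    @staticmethod
    def name():
        return "stub"


@dataclass
class AdapterConfig:
    model: str = "stub-model"


fake_client = SimpleNamespace(_A2A_ERROR_PREFIX="[A2A_ERROR]")
fake_adapters = SimpleNamespace(get_adapter=lambda adapter_name: BaseAdapter)

print(check_package_invariants(fake_client, fake_adapters, BaseAdapter, AdapterConfig))  # []
```

Returning a list of messages instead of asserting on the first failure means a single run reports every broken invariant at once.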
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Hongming Wang <hongmingwangalt@gmail.com>
Simplify pass on top of #2070 fix:
- Rename pruneStaleSubtreeIds → pruneStaleKeys, generalize to
Map<string, T> so the same shape can absorb other keyed-by-node-id
caches (ProvisioningTimeout.tsx tracking map is the obvious next
caller — left as a follow-up to keep this PR scoped).
- Trim the helper docstring to remove implementation-detail rot
(O(map_size), cadence claims). The ref-block comment carries the
rationale where it actually matters (at the call site).
- Add identity-preservation test: survivors must keep their original
Set reference. Guards against a future "rebuild instead of delete"
regression that would silently invalidate downstream === checks.
No behaviour change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #2070. The Map<rootId, Set<nodeId>> in useCanvasViewport.ts
accumulated entries indefinitely — adds on every successful auto-fit,
never deletes when a root left state.nodes (cascade delete or manual
remove). Operationally invisible until thousands of imports, but the
fix is cheap.
Adds pruneStaleSubtreeIds(map, liveNodeIds) — a pure helper exported
alongside the existing shouldFitGrowing helper, called at the top of
runFit before any read or write to the map. Bounds the map to "roots
present right now" instead of "every root ever auto-fitted in this
session." O(map_size) per fit; runs only at user-driven cadence.
Tests in __tests__/useCanvasViewport.test.ts cover the four cases:
delete-some / no-op / clear-all / never-add.
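The pruning idea, sketched in Python for illustration only. The real helper is TypeScript and operates on a Map; the name prune_stale_keys, its return value, and the sample ids here are assumptions, not the shipped code.

```python
def prune_stale_keys(cache, live_ids):
    """Delete entries whose key is not in live_ids, mutating cache in place.

    Returns the number of entries removed. Surviving values are not rebuilt,
    so a caller holding a reference to a survivor still sees the same object.
    """
    # Collect first, then delete: a dict must not change size while iterated.
    stale = [key for key in cache if key not in live_ids]
    for key in stale:
        del cache[key]
    return len(stale)


fitted = {"root-a": {"a1", "a2"}, "root-b": {"b1"}}
survivor = fitted["root-a"]

removed = prune_stale_keys(fitted, live_ids={"root-a", "n1"})
print(removed, sorted(fitted))       # 1 ['root-a']
print(fitted["root-a"] is survivor)  # True
```

Deleting in place, instead of building a new mapping, is what keeps survivor identity stable and bounds the map to the roots present at call time.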
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Defense-in-depth for the #2090-class incident (2026-04-24): GitHub's
hosted Copilot Coding Agent leaked a ghs_* installation token into
tenant-proxy/package.json via npm init slurping the URL from a
token-embedded origin remote. We can't fix upstream's clone hygiene,
so we gate at the PR layer.
Single workflow, dual purpose:
1. PR / push / merge_group gate on this repo (molecule-monorepo).
Refuses any change whose diff additions contain a credential-shaped
string. Same shape as Block forbidden paths — error message tells
the agent how to recover without echoing the secret value.
2. Reusable workflow entry point (workflow_call) for the rest of the
org. Other Molecule-AI repos enroll with a 3-line workflow:
    jobs:
      secret-scan:
        uses: Molecule-AI/molecule-monorepo/.github/workflows/secret-scan.yml@main
This makes molecule-monorepo the single source of truth for the
regex set; consumer repos pick up new patterns without per-repo PRs.
Pattern set covers GitHub family (ghp_, ghs_, gho_, ghu_, ghr_,
github_pat_), Anthropic / OpenAI / Slack / AWS. Mirror of the
runtime's bundled pre-commit hook (molecule-ai-workspace-runtime:
molecule_runtime/scripts/pre-commit-checks.sh) — keep aligned when
either side adds a pattern.
Self-exclude on .github/workflows/secret-scan.yml so the file's own
regex literals don't block its merge.
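A rough sketch of the gate's core idea: scan only the added lines of a diff and report where a credential-shaped string appears without echoing it. The patterns below are illustrative assumptions (prefixes taken from the list above, lengths guessed loosely); the workflow's real regex set is not reproduced here, and scan_diff_additions is a hypothetical name.

```python
import re

# Assumed, illustrative patterns; not the workflow's actual regex set.
PATTERNS = {
    "github-token": re.compile(r"\bgh[pousr]_[A-Za-z0-9]{20,}"),
    "github-fine-grained-pat": re.compile(r"\bgithub_pat_[A-Za-z0-9_]{20,}"),
    "aws-access-key-id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}


def scan_diff_additions(diff_text):
    """Return (line_number, pattern_name) for each added line that looks like a credential.

    Only '+' lines are inspected ('+++' file headers are skipped), and the
    matched text is never returned, so a report cannot echo the secret.
    """
    findings = []
    for lineno, line in enumerate(diff_text.splitlines(), start=1):
        if not line.startswith("+") or line.startswith("+++"):
            continue
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings


fake_token = "ghs_" + "A" * 36  # synthetic value, not a real credential
diff = "\n".join([
    "+++ b/package.json",
    '+  "url": "https://x-access-token:' + fake_token + '@github.com/org/repo.git"',
    "-  old line",
    "+  harmless line",
])
print(scan_diff_additions(diff))  # [(2, 'github-token')]
```

Reporting the line number and pattern name only is what lets the error message guide recovery without repeating the leaked value in CI logs.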
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Simplify pass on top of the canary fix:
- Drop the three CP commit SHAs from comments — issue #2090 covers
the audit trail, SHAs would rot.
- Pull the inline `900` into TLS_TIMEOUT_SEC=$((15 * 60)) so the
bash mirrors the TS side (15 min) at a glance.
- TENANT_HOST extraction now strips http(s) AND any port suffix, so
getent doesn't silently fail on a ws://host:443 style URL.
- sed-redact Authorization/Cookie out of the curl -v dump, defensive
against future callers adding an auth header to this probe.
Pure cleanup; no behaviour change to the happy path.
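The host extraction and header redaction, re-expressed in Python as a sketch. The actual change is bash plus sed; tenant_host and redact_headers are hypothetical names, and this version assumes URLs without userinfo or IPv6 literals.

```python
import re

TLS_TIMEOUT_SEC = 15 * 60  # 900 seconds, matching the 15 min on the TS side

SENSITIVE_HEADER = re.compile(r"(?im)^([<>]?[ \t]*(?:authorization|cookie):).*$")


def tenant_host(url):
    """Strip the scheme, any path, and any :port suffix, leaving a bare hostname."""
    host = re.sub(r"^[A-Za-z][A-Za-z0-9+.-]*://", "", url)  # drop scheme://
    host = host.split("/", 1)[0]                            # drop /path
    return re.sub(r":\d+$", "", host)                       # drop :port


def redact_headers(verbose_dump):
    """Replace the value of Authorization and Cookie header lines in a verbose dump."""
    return SENSITIVE_HEADER.sub(r"\1 [REDACTED]", verbose_dump)


print(tenant_host("https://tenant.example.com:443/health"))  # tenant.example.com
print(tenant_host("ws://host:443"))                          # host
print(redact_headers("> Authorization: Bearer abc\n> Host: tenant.example.com"))
```

The redaction matches a header name at the start of a line only, so a response header such as Set-Cookie would need its own pattern.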
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Canary #2090 has been red for 6 consecutive runs over 4+ hours, all
timing out at the TLS-readiness step exactly at the 10-min cap. Time
window correlates with three CP commits that landed today/yesterday
and changed EC2 boot behaviour:
- molecule-controlplane@a3eb8be — fix(ec2): force fresh clone of /opt/adapter
- molecule-controlplane@ed70405 — feat(sweep): wire up healthcheck loop
- molecule-controlplane@4ab339e — fix(provisioner): aggregate cleanup errors
Two changes here, both surgical:
1. Bump the bash-side TLS deadline from 600s to 900s, and the canvas TS
mirror from 10m to 15m. Stays below the 20-min provision envelope
(so a genuinely-stuck tenant still fails loud at the earlier
provision step instead of masquerading as TLS).
2. On TLS-timeout, dump a diagnostic burst before exiting:
- getent hosts $TENANT_HOST (DNS resolution state)
- curl -kv $TENANT_URL/health (TLS handshake + HTTP layer)
The previous failure log was just "no 2xx in N min" with no signal
for which layer was actually broken. After this, the next timeout
tells us whether DNS, TLS handshake, or HTTP layer is the culprit
so the CP root cause can be isolated without speculation.
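The deadline-plus-diagnostics shape can be sketched as follows. This is an illustrative Python analogue of the bash loop, not the script itself: wait_until_ready, its parameters, and the placeholder diagnostic outputs are assumptions, and the clock and sleep are injectable so the example runs instantly.

```python
import time


def wait_until_ready(probe, deadline_sec, interval_sec, diagnostics,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll probe() until it returns True or deadline_sec elapses.

    On timeout, run every diagnostic callable and return the outputs so the
    failure log shows which layer (DNS, TLS handshake, HTTP) was broken.
    Returns (ready, diagnostic_outputs).
    """
    start = clock()
    while clock() - start < deadline_sec:
        if probe():
            return True, []
        sleep(interval_sec)
    return False, [(name, run()) for name, run in diagnostics]


# Fake clock so the 15-minute deadline elapses without real waiting.
fake_now = [0.0]


def fake_sleep(seconds):
    fake_now[0] += seconds


ready, report = wait_until_ready(
    probe=lambda: False,  # stands in for "did /health return 2xx?"
    deadline_sec=15 * 60,
    interval_sec=10,
    diagnostics=[
        ("dns", lambda: "getent output would go here"),
        ("http", lambda: "curl -kv output would go here"),
    ],
    clock=lambda: fake_now[0],
    sleep=fake_sleep,
)
print(ready, [name for name, _ in report])  # False ['dns', 'http']
```

Running the diagnostics only on the timeout path keeps the happy path unchanged while making the next failure attributable to a layer.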
This is the unblock; a separate molecule-controlplane issue tracks the
suspected underlying regression.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
External architecture review flagged the SECRETS_ENCRYPTION_KEY env var
on the platform as encryption-at-rest theater. The reviewer read only
the platform repo and missed that the master key actually lives in AWS
KMS at the control plane layer, with envelope encryption wrapping each
tenant secret blob.
Adds docs/architecture/secrets-key-custody.md as the canonical source
of truth for the full chain:
- Two-mode envelope (KMS_KEY_ARN vs static-key fallback)
- Per-blob AES-256-GCM with KMS-wrapped DEKs
- Where each key actually lives (KMS, CP env, tenant env)
- Threat model per attacker capability
- Rotation story (annual KMS CMK rotation, manual DEK rotation on incident)
- Audit posture (SOC2 / ISO 27001 questionnaire bullets)
Patches three downstream docs that previously stopped at the env-var
level and link them to the new custody doc:
- development/constraints-and-rules.md (Rule 11)
- architecture/database-schema.md (workspace_secrets paragraph)
- architecture/molecule-technical-doc.md (env-vars table)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>