Three small fixes from the self-review of #2209:
1. **Required: concurrency group.** Two pushes to main in quick
succession (manual UI merge then auto-promote-staging's ff-push,
or any back-to-back main pushes) would race two auto-sync runs
against the same staging branch — second `git push origin staging`
fails non-fast-forward, surfacing as a red CI alert for what should
be a no-op. Add `concurrency: { group: auto-sync-main-to-staging,
cancel-in-progress: false }` so the second run waits for the first
and sees its result.
2. **Hygiene: `git merge --abort` on conflict.** The conflict-error
path exits 1 with the work tree in a half-merged state. Doesn't
affect future runs (each gets a fresh checkout) but is an
unpleasant artifact for anyone who shells into the runner. Abort
first, then exit.
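Sketched, the hygiene fix looks like this; the remote/branch names and the error text are illustrative stand-ins, not the workflow's exact script.

```shell
# Hedged sketch of the conflict path. Remote/branch names and the
# error message are illustrative, not copied from the workflow.
merge_main_or_abort() {
  git merge --ff-only origin/main 2>/dev/null && return 0
  git merge --no-ff origin/main -m "chore: sync main -> staging" && return 0
  # Conflict: restore a clean work tree BEFORE exiting nonzero, so
  # anyone shelling into the runner doesn't find a half-merged checkout.
  git merge --abort
  echo "merge conflict syncing main into staging; resolve manually" >&2
  return 1
}
```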
3. **Doc accuracy: "Loop safety" comment.** The original said the
chain terminates because "main is either a no-op or advances
further." That's true but understates the actual safety: GitHub
Actions explicitly does NOT trigger downstream workflow runs from
`GITHUB_TOKEN`-authored pushes. So the loop is impossible by
construction, not just by happy coincidence of ref state. Updated
the comment to reflect the actual mechanism.
Plus a step-name nit: "Fast-forward staging → main" reads as if main
is the target. Renamed to "Fast-forward staging to main" for
consistency with the workflow's name (main → staging).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Background
`auto-promote-staging.yml` advances main via `git merge --ff-only`
+ `git push origin main` — clean fast-forward, no merge commit. But
manual `staging → main` merges via the GitHub UI / API create a merge
commit on main that staging doesn't have. The next `staging → main`
PR then evaluates as "BEHIND" because staging is missing that merge
commit, requiring a manual `gh pr update-branch` round-trip.
This pattern bit us twice on 2026-04-28 (PRs #2202 and #2205, both
manual bridges to land pipeline fixes themselves). Each needed
update-branch + re-CI before they could merge. Annoying and
avoidable.
What this workflow does
Triggered on every push to main (regardless of source: auto-promote,
UI merge, API merge, direct push):
1. Check whether main is already in staging's ancestry. If yes,
no-op — auto-promote-staging keeps them aligned via ff push,
and the no-op case is the steady state.
2. If not (manual merge commit on main, or direct main hotfix):
try `git merge --ff-only origin/main` first. Works when staging
hasn't diverged with its own commits.
3. If ff fails (staging has its own in-flight feature work):
`git merge --no-ff origin/main -m "chore: sync main → staging"`.
Absorbs main's tip while keeping staging's own history.
4. Push staging.
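The four steps above can be sketched as a single function; this is a hedged sketch assuming a checkout with `origin` configured, and the merge-commit message is a stand-in for the workflow's actual one.

```shell
# Hedged sketch of steps 1-4; not the workflow's literal script.
sync_main_into_staging() {
  git fetch -q origin main staging
  git checkout -q staging
  # Step 1: main already in staging's ancestry -> steady-state no-op.
  if git merge-base --is-ancestor origin/main HEAD; then
    echo "no-op: main is already contained in staging"
    return 0
  fi
  # Step 2: plain fast-forward when staging hasn't diverged.
  if ! git merge --ff-only origin/main 2>/dev/null; then
    # Step 3: staging has in-flight work -> absorb main via merge commit.
    git merge --no-ff origin/main -m "chore: sync main -> staging" || return 1
  fi
  # Step 4: publish the synced staging branch.
  git push -q origin staging
}
```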
Loop safety
Pushing the synced staging triggers auto-promote-staging.yml, which
checks gates on staging's new tip and, if green, ff-pushes staging
to main. Since staging now ⊇ main, the resulting push to main is
either a no-op (no ref change → no push event fires → auto-sync
doesn't re-trigger) or advances main further. In the latter case
auto-sync fires once more, sees main already in staging's ancestry,
no-ops. Bounded.
Conflict handling
If the merge step hits conflicts (staging and main diverged with
incompatible changes), the workflow fails with a clear summary
pointing to manual resolution. This shouldn't happen in practice —
staging is the integration branch; conflicts indicate a direct main
hotfix touching the same code as in-flight staging work.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous assertion `'Silent Agent' not in result` was pinning
the buggy behavior — peers without an agent_card were silently
dropped from the prompt. With the fallback to DB name+role those
peers are correctly visible. Flip the assertion so the test pins
the new (correct) rendering and would catch a regression to the
silent-drop behavior.
Bug: a Design Director coordinator with 6 freshly-created worker peers
rendered an empty `## Your Peers` section in its system prompt — the
hosting registry endpoint correctly returned all 6 peers, but
`summarize_peer_cards()` silently dropped every entry whose
`agent_card` column was null (the default until A2A discovery has
run end-to-end against the worker). The coordinator then refused to
delegate any task because "no peers exist".
Fix: fall back to the registry row's `name` and `role` columns when
`agent_card` is missing, malformed, or wrong-typed, instead of
skipping the peer. The registry endpoint
(`workspace-server/internal/handlers/discovery.go:queryPeerMaps`) has
always returned both fields — they were just being thrown away on
the consumer side. `build_peer_section()` now renders `Role: …` when
the agent_card-derived skill list is empty so the coordinator's
prompt still has something concrete to delegate against.
Also hoists `import json` out of the per-peer loop body to module
level (was previously imported once per iteration).
Tests: new `test_shared_runtime_peer_summary.py` pins all four
fallback cases (null / malformed string / wrong type / null + no
DB name) plus the agent-card-present happy path and the mixed-list
case the coordinator actually consumes. This is the first test
coverage the peer summary in `shared_runtime.py` has ever had; no
prior tests existed.
Refs: 2026-04-27 Design Director discovery report from infra team.
Two latent bash bugs in the canonical secret-scan workflow caught
during the post-merge review of molecule-controlplane #301 (a
private consumer that inlined this workflow's logic and got both
fixes there). Same bugs apply here; fixing in canonical means every
public consumer (gh-identity, github-app-auth, the 8 workspace
template repos) inherits the fix on their next workflow_call.
Bug 1: `printf "$OFFENDING"` is a format-string sink.
OFFENDING is built from filenames: `${f} (matched: ${pattern})\n`.
When passed to printf as the first argument, `%` characters in a
filename are interpreted as conversion specifiers — corrupting the
error message or printing `%(missing)` artifacts. No filename in
the current tree triggers it, but a future test fixture, build
artifact, or contributor-supplied path could.
Fix: `printf '%b' "$OFFENDING"` interprets the literal `\n` we
appended without treating OFFENDING as a format string.
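Minimal repro of the difference, using a hypothetical `%s`-bearing filename:

```shell
# A '%s' embedded in a filename demonstrates the sink: the first printf
# consumes it as a conversion specifier; the '%b' form leaves it intact
# while still expanding the trailing \n we appended.
OFFENDING='weird%sname.txt (matched: aws_secret)\n'

unsafe=$(printf "$OFFENDING")      # format-string sink: %s silently eaten
safe=$(printf '%b' "$OFFENDING")   # literal % preserved, \n expanded

echo "unsafe: $unsafe"             # weirdname.txt (matched: aws_secret)
echo "safe:   $safe"               # weird%sname.txt (matched: aws_secret)
```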
Bug 2: `for f in $CHANGED` word-splits on whitespace.
Filenames containing spaces would split into multiple tokens. The
self-exclude check (`[ "$f" = "$SELF" ] && continue`) and the diff
lookup would both operate on partial-path tokens. No filename in
the current tree has whitespace, but the failure would be silent
if one ever did.
Fix: `while IFS= read -r f; do ... done <<< "$CHANGED"` reads
whole lines as filenames. Added `[ -z "$f" ] && continue` to
match the original `for` loop's implicit empty-input skip.
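The two loop shapes side by side, over a hypothetical file list with one space-bearing path:

```shell
# CHANGED mimics `git diff --name-only` output: one path per line,
# the second containing spaces (hypothetical fixture name).
CHANGED='docs/readme.md
test fixtures/sample data.txt'

for_count=0
for f in $CHANGED; do               # word-splits: 4 tokens, 2 of them bogus
  for_count=$((for_count + 1))
done

read_count=0
while IFS= read -r f; do            # one whole line per iteration
  [ -z "$f" ] && continue           # keep the for loop's empty-input skip
  read_count=$((read_count + 1))
done <<< "$CHANGED"

echo "for: $for_count, while-read: $read_count"   # for: 4, while-read: 2
```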
Both fixes are mechanically straightforward (~16 lines net diff,
mostly comments documenting the why). No behavior change for
filenames in the current tree; strictly better for the edge cases.
The same fixes already shipped in molecule-controlplane via #301
which inlined a copy of this workflow. The runtime's bundled
pre-commit hook (molecule-ai-workspace-runtime:
molecule_runtime/scripts/pre-commit-checks.sh) likely has the same
bugs — flagged as a follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the final gap in the SaaS pipeline. After auto-promote-staging
fast-forwards main, publish-workspace-server-image builds new
`:staging-<sha>` images, but `:latest` (what prod tenants pull) only
moves on either a manual `promote-latest.yml` dispatch or a canary-
verify retag (gated on Phase 2 fleet that doesn't exist).
This workflow closes that gap by retagging
`platform:staging-<sha>` + `platform-tenant:staging-<sha>` → `:latest`
whenever E2E Staging SaaS passes for a `main` push. Uses crane
(no Docker daemon needed). Verifies both images exist before retagging
either, so a half-published state is impossible.
Why trigger only on `main` (not staging):
- `:latest` is what prod tenants pull. Only SHAs that have reached
`main` (via auto-promote-staging) should advance `:latest`.
- Triggering on staging would let a staging-only revert advance
`:latest` to a SHA that never reaches `main`, breaking the
invariant "production runs what's on `main`".
Why a separate workflow rather than folding into e2e-staging-saas.yml:
- Keeps test concerns and release concerns separate.
- Disabling promote during an incident is one workflow toggle, not
an edit to the long E2E file.
- When Phase 2 canary work eventually lands, the canary path can
replace this trigger without touching the E2E workflow.
Doc-aligned: per molecule-controlplane/docs/canary-tenants.md,
"green staging E2E → :latest" is the recommended approach for the
current scale (≤20 paying tenants); canary fleet is deferred until
blast radius grows.
Pipeline after this lands is fully self-healing:
staging push → 4 gates green → auto-promote fast-forwards main
→ publish-workspace-server-image → E2E Staging SaaS
→ THIS WORKFLOW retags :latest → tenant fleet auto-pulls in 5 min
(or redeploy-tenants-on-main fans out faster)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Observed 2026-04-28: auto-promote ran for staging head 96955f7b with
all gates actually green (verified via /commits/<sha>/check-runs API)
yet `check-all-gates-green` reported `CodeQL → missing/none` and
aborted. Same SHA was promotable; auto-promote couldn't see it.
Cause: `gh run list --workflow="CodeQL"` matched two workflows in
this repo:
- codeql.yml (explicit, scans both staging and main)
- codeql (GitHub UI-configured Code-quality default setup,
internal, scans default branch only)
gh CLI rejects ambiguous `--workflow=<name>` lookups and returns no
result → the gate fell through to `missing/none` and ALL_GREEN was
set false. Every staging push since both names existed has been
silently deadlocked.
Fix: switch GATES from display-name strings to workflow file paths.
File paths are the unique identifier for a workflow file in
.github/workflows/; display names are decoration and can collide.
The same `gh run list --workflow=<file.yml>` query that fails on
"CodeQL" succeeds on "codeql.yml" because the file path resolves
unambiguously.
No behavior change for the other three gates (CI, E2E Canvas, E2E
API Smoke) since their names didn't collide — they keep working,
they just identify by ci.yml / e2e-staging-canvas.yml / e2e-api.yml
now. The log line shape changes from `CI → completed/success` to
`ci.yml → completed/success` which is fine for ops grep.
When adding/removing a gate going forward: file paths only. Keep
branch-protection required-checks (check-run display names) in
sync as a separate manual step.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The auto-promote-staging.yml gate-check (line 99) treats "workflow
didn't run" as failure. Path-filtered triggers on E2E API Smoke Test
and E2E Staging Canvas meant a platform-only or test-only push to
staging — say, the prior PR #2201 which only touched
tests/e2e/test_staging_full_saas.sh — never triggered the canvas
workflow, and auto-promote saw `missing/none`, marked all_green=false,
and aborted. The same class of failure applies to any push that
doesn't touch the gate's watched paths: a deadlock by design, never
noticed because the gate was new.
Fix per Design B (always-run + fast-skip):
- Drop `paths:` from the push/pull_request triggers on both gate
workflows. The workflow now always fires on every staging+main
push/PR.
- Add a `detect-changes` job using `dorny/paths-filter@v3` that
decides whether to do real work, scoped to the same paths the
trigger filter used to watch.
- Real work job (e2e-api / playwright) gates on
`needs: detect-changes; if: needs.detect-changes.outputs.X == 'true'`.
- Add a sibling `no-op` job that runs when the filter output is
false, emitting `::notice::… no-op pass`. The workflow run's
conclusion is `success` either way — auto-promote sees green and
proceeds.
Manual `workflow_dispatch` and the weekly canvas `schedule`
short-circuit detect-changes to always-run; those triggers exist
precisely to exercise the suite and shouldn't be silently no-op'd.
Why this approach over making auto-promote-staging smarter:
The alternative (Design A, considered + rejected) was to teach
auto-promote-staging to read each gate's `paths:` filter and treat
"no run because filter excluded the commit" as conditional pass.
That couples auto-promote to other workflows' YAML schema and breaks
silently if a gate is renamed or its filter changes. Design B keeps
the auto-promote contract simple ("each gate emits success") and
makes each gate self-describing — adding a new gate doesn't require
touching auto-promote.
Cost: ~10-30s of runner overhead per gate per push for the no-op when
paths don't match. Negligible vs the alternative of deadlocked
auto-promote chains.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
E2E Staging SaaS has been failing on every cron + push run since
2026-04-27 with `LEAK: org … still present post-teardown (count=1)`,
exit 4. Root cause: the curl timeout on the teardown DELETE was 30s
and the post-DELETE leak check was a single 10s sleep — but the
DELETE handler runs the full GDPR Art. 17 cascade synchronously,
including EC2 termination which AWS reports in 30–60s. Real-world
wall time on a prod-shaped run was 57s on 2026-04-27 (hongmingwang
DELETE); the 30s curl timeout aborted the request mid-cascade and
the 10s post-sleep check found the row still present (status not
yet 'purged').
Two-part fix to match real cascade timing:
1. DELETE curl gets its own --max-time 120 (was 30) so the
synchronous cascade has room to complete in-band.
2. The leak check polls up to 60s for status='purged' instead of
one rigid 10s sleep. Covers two cases:
- DELETE returns 5xx mid-cascade but the cascade finishes anyway
(we still observe a clean state).
- DELETE legitimately exceeds 120s — eventual-consistency catches
the eventual purge instead of false-flagging a leak.
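The poll shape, sketched with a hypothetical `get_org_status` helper standing in for the real curl against the tenant API; the budget and interval defaults mirror the numbers above.

```shell
# Hedged sketch of the leak-check poll. get_org_status is a hypothetical
# stand-in for the real curl; defaults reflect the 60s window described.
wait_for_purge() {
  budget=${1:-60}; interval=${2:-2}; elapsed=0
  while [ "$elapsed" -lt "$budget" ]; do
    if [ "$(get_org_status)" = "purged" ]; then
      echo "clean: org reached status=purged after ${elapsed}s"
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "LEAK: org still present after ${budget}s" >&2
  return 4   # same exit code the workflow uses for a leak
}
```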
The 5–15s estimate in `molecule-controlplane/internal/handlers/
purge.go`'s comment is the API-call cost only, not the AWS-side
time-to-termination it waits on. The async-purge refactor noted in
that comment would let us drop these timeouts back to ~15s — file
that under future work.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #132. Extends the cascade propagation probe (added in #2197
and clarified in #2198) with a content-integrity check.
The previous probe verified pip can RESOLVE the version we just
published (catches surface 1+2 propagation lag — metadata + simple
index). It did NOT verify pip can DOWNLOAD bytes that match what we
uploaded — leaving a window where a Fastly stale-content scenario
(rare but PyPI has had it: e.g. 2026-04-01 incident where a CDN node
served a previous version's wheel under the new version's URL for
~90s after upload) would pass the probe and ship corrupt builds to
all 8 receiver templates.
Two-stage check, both must pass before the cascade fans out:
(a) `pip install --no-cache-dir PACKAGE==VERSION` succeeds —
version is resolvable. (Existing, unchanged.)
(b) `pip download` of the same wheel + `sha256sum` matches the
hash captured pre-upload from `dist/*.whl`. (New.)
Captured BEFORE upload via a new `wheel_hash` step that exposes
`steps.wheel_hash.outputs.wheel_sha256`, bubbled up as
`needs.publish.outputs.wheel_sha256`, and consumed by the cascade
probe via the EXPECTED_SHA256 env var.
`pip download` is the right primitive: it writes the actual .whl
file (vs `pip install` which unpacks and discards), so we can
sha256sum it directly. Combined with --no-cache-dir + a wiped
/tmp/probe-dl per poll, every poll re-fetches from the live Fastly
edge — no local-cache mask.
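Stage (b)'s comparison, sketched; in the real workflow the wheel path comes from `pip download` and the expected hash from the pre-upload `wheel_hash` step output, both stand-ins here.

```shell
# Hedged sketch of stage (b): compare a downloaded wheel against the
# hash captured pre-upload. Paths and plumbing are illustrative.
verify_wheel_hash() {
  wheel=$1; expected=$2
  actual=$(sha256sum "$wheel" | awk '{print $1}')
  if [ "$actual" = "$expected" ]; then
    echo "content OK: $actual"
  else
    echo "HASH MISMATCH: expected $expected got $actual" >&2
    return 1   # fail the poll; do not let the cascade fan out
  fi
}
```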
Per-poll cost: ~3-5s pip install + ~3s pip download + 4s sleep.
30-poll budget = ~5-6 min wall on a slow runner (vs the previous
~4-5 min for resolve-only). Well within the cascade's tolerance for
a known-rare CDN issue, and the overwhelmingly common case (Fastly
serves matching bytes immediately) exits on the first poll.
Verified locally: pip download of the current PyPI-latest
(molecule-ai-workspace-runtime 0.1.29) produced
sha256=7e782b2d50812257…, exactly matching PyPI's own metadata
endpoint. The mismatch path is exercised inline (different builds
of the same version produce different hashes by definition — the
build_runtime_package.py output is timestamp-deterministic only
within a single CI invocation).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #134. The post-merge review of #2196 flagged that the combined
workflow's `paths:` filter (the union of both jobs' needs:
`workspace/**` + `scripts/build_runtime_package.py` + the workflow
itself) caused the `pypi-latest-install` job to fire on every
doc-only / adapter-only / unrelated workspace/ edit. The PyPI artifact
that job tests against can't change based on our workspace/ source —
only on actual PyPI publishes — so those runs add noise without
information.
Splits the previously-merged combined workflow:
runtime-pin-compat.yml (kept):
- PyPI-latest install + import smoke (was: pypi-latest-install)
- Narrow `paths:` filter — only fires when workspace/requirements.txt
or this workflow file changes
- Cron-driven daily for upstream-yank detection (unchanged)
runtime-prbuild-compat.yml (new):
- PR-built wheel + import smoke (was: local-build-install)
- Broad `paths:` filter — fires on any workspace/ source change,
scripts/build_runtime_package.py, or this workflow file
- No cron (workspace/ doesn't change between firings)
Job behavior is identical to before; only the trigger surface is
narrower per job. Each workflow's name is its own status check, so
branch protection (which currently lists neither as required) can
gate them independently in future.
The prior comment in the combined file explicitly acknowledged the
asymmetry and proposed this split as a follow-up; this is that
follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cascade's PyPI-propagation gate polled `/pypi/<pkg>/<ver>/json`,
which is one of THREE surfaces pip touches when resolving an install:
1. /pypi/<pkg>/<ver>/json — metadata endpoint (the old check)
2. /simple/<pkg>/ — pip's primary download index
3. files.pythonhosted.org — CDN-fronted wheel binary
Each has its own cache. Any one of them can lag behind the others,
and the previous gate would let the cascade fire while (2) or (3)
still served the previous version. Downstream `pip install` in the
template repos then resolved to the OLD wheel, the docker layer
cache locked that stale resolution in, and subsequent rebuilds kept
shipping the old runtime — the "five times in one night" cache trap
referenced in the prior comment.
Replace the metadata-only poll with an actual `pip install
--no-cache-dir --force-reinstall --no-deps PACKAGE==VERSION` from
a fresh venv. If pip can resolve and install the exact version we
just published, every receiver template will too — pip itself is
the ground truth for what the receivers will see, no proxy guessing
about which surface is lagging.
- Venv created once outside the loop; only `pip install` runs in
the poll body.
- --no-cache-dir + --force-reinstall ensures every poll hits the
live PyPI surfaces (no local-cache mask).
- --no-deps keeps each poll fast — we only care about resolving
THIS package, not its dep tree.
- Loop budget: 30 attempts × 4s ≈ 2 min (vs prior 30 × 2s = 60s).
Generous vs typical PyPI propagation, surfaces real upstream
issues past the budget.
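The loop shape, with `probe_install` as a hypothetical stand-in for the pip invocation inside the pre-created venv:

```shell
# Hedged sketch of the propagation poll. probe_install stands in for
# `pip install --no-cache-dir --force-reinstall --no-deps pkg==ver`
# run in the venv; defaults mirror the 30 x 4s budget above.
poll_pypi_propagation() {
  attempts=${1:-30}; delay=${2:-4}; i=1
  while [ "$i" -le "$attempts" ]; do
    if probe_install; then
      echo "installable after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "version not installable within budget; real upstream issue" >&2
  return 1
}
```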
Verified locally:
- Probing a non-existent version (0.1.999999) → pip exits 1, loop
retries.
- Probing the current PyPI-latest → pip exits 0, `pip show`
returns the version, loop succeeds.
Closes #130.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #128's chicken-and-egg. The original gate installed the
CURRENTLY-PUBLISHED molecule-ai-workspace-runtime from PyPI, then
overlaid workspace/requirements.txt, then smoke-imported. That
catches problems with the already-shipped artifact (the daily-cron
upstream-yank case), but it cannot catch problems introduced by the
PR itself: the imports it exercises are from the OLD wheel, not the
PR's source. A PR that adds `from a2a.utils.foo import bar` (where
`bar` is added in a2a-sdk 1.5 and the runtime currently pins 1.3)
slips through:
1. Pip resolves the existing PyPI wheel + a2a-sdk 1.3.
2. Smoke imports the OLD main.py — no reference to `bar` → green.
3. Merge → publish-runtime.yml ships a wheel WITH the new import.
4. Tenant images redeploy → all crash on first boot with
ImportError: cannot import name 'bar' from 'a2a.utils.foo'.
Splits the workflow into two jobs:
- pypi-latest-install (renamed from default-install): unchanged
behavior. Runs on the daily cron and on requirements.txt /
workflow edits. Catches upstream PyPI yanks + the
already-shipped artifact going stale.
- local-build-install (new): runs scripts/build_runtime_package.py
on the PR's workspace/, builds the wheel with python -m build
(mirroring publish-runtime.yml byte-for-byte), installs that
wheel, then runs the same smoke import. Tests the artifact
that WOULD be published if this PR merges.
Path filter widened to workspace/** so any runtime-source change
triggers the local-build job. The pypi-latest job's filter is the
same union; its internal logic is unchanged so the daily-cron and
upstream-detection use cases continue to work.
Verified locally: built the wheel from current workspace/ source via
the same script + python -m build invocation, installed into a fresh
venv, imported from molecule_runtime.main import main_sync
successfully.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The existing wheel-smoke catches AgentCard kwarg-shape regressions
(state_transition_history, supported_protocols) but doesn't catch the
SDK-contract drift class that #2193 just fixed in production: the
a2a-sdk 1.x rename of /.well-known/agent.json →
/.well-known/agent-card.json, plus AGENT_CARD_WELL_KNOWN_PATH moving
to a2a.utils.constants. main.py's readiness probe hardcoded the old
literal and 404'd every attempt, silently dropping every workspace's
initial_prompt for ~weeks before a user reported it.
Two additions to the smoke block:
1. Mount alignment: build an AgentCard, call create_agent_card_routes(),
and assert AGENT_CARD_WELL_KNOWN_PATH is among the mounted paths.
Catches a future SDK release that decouples the constant value
from the route factory's mount path. The source-tree test
(workspace/tests/test_agent_card_well_known_path.py) catches the
main.py side; this catches the SDK side BEFORE PyPI upload.
2. Message helper smoke: import a2a.helpers.new_text_message and
instantiate one. The v0→v1 cheat sheet (memory:
reference_a2a_sdk_v0_to_v1_migration.md) flagged this as a real
migration find — main.py and a2a_executor.py call it in hot
paths, so an import break errors every reply before the message
even leaves the workspace.
Verified by running the equivalent Python inside
ghcr.io/molecule-ai/workspace-template-langgraph:latest:
✓ well-known mount alignment OK (/.well-known/agent-card.json)
✓ message helper import + call OK
Closes the structural-fix half of the #2193 finding from the code-
review-and-quality pass: "the wheel publish smoke didn't catch this.
This is the 7th a2a-sdk migration find of this kind. Task #131 is the
right root-cause fix."
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Independent code review caught a real bug in the previous commit's
stale-token revoke pass. The platform's restart endpoint
(workspace_restart.go:104) stops the workspace container synchronously,
then dispatches re-provisioning to a goroutine (line 173). For a
workspace that's been idle past the 5-minute grace window — extremely
common: user comes back to a long-idle workspace and clicks Restart —
this opens a race window:
1. Container stopped → ListWorkspaceContainerIDPrefixes returns no
entry → workspace becomes a stale-token candidate.
2. issueAndInjectToken runs in the goroutine: revokes old tokens,
issues a fresh one, writes it to /configs/.auth_token.
3. If the sweeper's predicate-only UPDATE
`WHERE workspace_id = $1 AND revoked_at IS NULL` runs AFTER
IssueToken commits but is racing the SELECT-then-UPDATE window,
it revokes the freshly-issued token alongside the old ones.
4. Container starts with a now-revoked token → 401 forever.
The fix carries the SAME staleness predicate from the SELECT into the
per-workspace UPDATE: a token created within the grace window can't
match `< now() - grace` and is automatically excluded. The operation
is now idempotent against fresh inserts.
Also addresses other findings from the same review:
- Add `status NOT IN ('removed', 'provisioning')` to the SELECT
(R2 + first-line C1 defence). 'provisioning' is set synchronously
in workspace_restart.go before the async re-provision begins, so
it's a reliable in-flight signal that narrows the candidate set.
- Stop calling wsauth.RevokeAllForWorkspace from the sweeper —
that helper revokes EVERY live token unconditionally; the sweeper
needs "every STALE live token" which is a different (safer)
operation. Inline the UPDATE so we own the predicate end-to-end.
Drop the wsauth import (no longer needed in this package).
- Tighten expectStaleTokenSweepNoOp regex to anchor at start and
require the status filter, so a future query whose first line
coincidentally starts with "SELECT DISTINCT t.workspace_id" can't
silently absorb the helper's expectation (R3).
- Defensive `if reaper == nil { return }` at top of
sweepStaleTokensWithoutContainer — even though StartOrphanSweeper
already short-circuits on nil, a future refactor that wires this
pass directly without checking would otherwise mass-revoke in
CP/SaaS mode (F2).
- Comment in the function explaining why empty likes is intentionally
NOT a short-circuit (asymmetry with the first two passes is the
whole point — "no containers running" is the load-bearing case).
- Add TestSweepOnce_StaleTokenRevokeUsesStalenessPredicate that
asserts the UPDATE shape (predicate present, grace bound). A
real-Postgres integration test would prove the race resolution
end-to-end; this catches the regression where someone simplifies
the UPDATE back to predicate-only.
- Add TestSweepStaleTokens_NilReaperEarlyExit pinning the F2 guard.
Existing tests updated to match the new query/UPDATE shape with tight
regexes that pin all the safety guards (status filter, staleness
predicate in both SELECT and UPDATE).
Full Go suite green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Heals the user-reported "auth token conflict after volume wipe" failure
mode. When an operator nukes a workspace's /configs volume outside the
platform's restart endpoint (common via `docker compose down -v` or
manual cleanup scripts), the DB still holds live workspace_auth_tokens
for that workspace while the recreated container has an empty
/configs/.auth_token. Subsequent /registry/register calls 401 forever:
requireWorkspaceToken sees live tokens, container has no token to
present, and the workspace is permanently wedged until an operator
manually revokes via SQL.
The platform's restart endpoint already handles this correctly via
wsauth.RevokeAllForWorkspace inside issueAndInjectToken. This change
adds a third orphan-sweeper pass — sweepStaleTokensWithoutContainer —
as the safety net for the equivalent action taken outside the API.
Detection criterion: workspace has at least one live (non-revoked)
token whose most-recent activity (COALESCE(last_used_at, created_at))
is older than staleTokenGrace (5 minutes), AND no live Docker
container's name prefix matches the workspace ID.
Safety filters that bound the revoke radius:
1. Only runs in single-tenant Docker mode. The orphan sweeper is
wired only when prov != nil in cmd/server/main.go — CP/SaaS mode
never gets here, so an empty container list cannot be confused
with "no Docker at all" (which would otherwise revoke every
workspace's tokens in production SaaS).
2. staleTokenGrace = 5min skips tokens issued/used in the last
5 minutes. Bounds the race with mid-provisioning (token issued
moments before docker run completes) and brief restart windows
— a healthy workspace touches last_used_at on every 30s heartbeat,
so 5min is 10× the heartbeat interval.
3. The query joins workspaces.status != 'removed' so deleted
workspaces are not revoked here (handled at delete time by the
explicit RevokeAllForWorkspace call).
4. make_interval(secs => $2) avoids a time.Duration.String() →
"5m0s" mismatch with Postgres interval grammar that I caught
during implementation.
5. Each revocation logs the workspace ID so operators can correlate
"workspace just lost auth" with this sweeper, not blame a
network blip.
Failure mode: revoke fails (transient DB error). Loop bails to avoid
log spam; next 60s cycle retries. Worst case a workspace stays
401-blocked an extra minute.
Tests: 5 new tests covering the headline scenario, the safety gate
(workspace with container is NOT revoked), revoke-failure-bails-loop,
query-error-non-fatal, and Docker-list-failure-skips-cycle. All 11
existing sweepOnce tests updated to register the new third-pass query
expectation via a small `expectStaleTokenSweepNoOp` helper that keeps
their existing assertions readable.
Full Go test suite green: registry, wsauth, handlers, and all other
packages.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The initial-prompt readiness probe in workspace/main.py hardcoded the
pre-1.x well-known path. After the a2a-sdk 1.x bump the SDK started
mounting the agent card at the new canonical path (the value of
`a2a.utils.constants.AGENT_CARD_WELL_KNOWN_PATH`), so the probe
returned 404 every attempt and silently fell through to "server not
ready after 30s, skipping". Net effect: every workspace silently
dropped its `initial_prompt` from config.yaml — the agent never sent
the kickoff self-message, and users hit a fresh chat with no context.
Reported by an external user as "/.well-known/agent.json 404 — the
a2a-sdk agent card route was not being mounted at the expected path".
The route IS mounted; the probe was looking at the wrong place.
Fix imports `AGENT_CARD_WELL_KNOWN_PATH` from `a2a.utils.constants`
and uses it directly in the probe URL — the SDK constant is now the
single source of truth, so any future rename travels through
automatically.
Adds two static regression tests pinning the invariant:
1. No hardcoded `/.well-known/agent.json` literal anywhere in
main.py.
2. The probe URL fstring interpolates AGENT_CARD_WELL_KNOWN_PATH
(catches a "fix" that imports the constant for show but reverts
to a literal in the actual GET).
Verified manually inside ghcr.io/molecule-ai/workspace-template-langgraph
that AGENT_CARD_WELL_KNOWN_PATH == '/.well-known/agent-card.json' and
that `create_agent_card_routes(card)` mounts at exactly that path —
constant + mount are aligned in the runtime image, so the probe will
now find the server.
Full workspace test suite: 1209 passed, 2 xfailed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Manual fresh-user clean-slate test surfaced three friction points in
the existing dev-start.sh:
1. The script ran docker compose -f docker-compose.infra.yml
directly, bypassing infra/scripts/setup.sh — so the workspace
template registry was never populated and the canvas template
palette came up empty (the "Template palette is empty"
troubleshooting hit).
2. ADMIN_TOKEN was not handled at all. Without it, the AdminAuth
fail-open gate worked initially but slammed shut the moment the
first workspace registered a token — at which point the canvas
could no longer call /workspaces or /templates. New users hit
401s with no obvious next step.
3. The script wasn't mentioned in docs/quickstart.md. New users
followed the documented 4-step manual flow and never discovered
the single command existed.
Fixes:
- dev-start.sh now calls infra/scripts/setup.sh, which brings up
full infra (postgres + redis + langfuse + clickhouse + temporal)
AND populates the template/plugin registry from manifest.json.
- On first run, dev-start.sh writes MOLECULE_ENV=development to
.env. This activates middleware.isDevModeFailOpen() which lets
the canvas keep calling admin endpoints without a bearer (the
intended local-dev escape hatch). The .env is preserved on
re-runs and sourced before the platform launches.
- The script intentionally does NOT auto-generate an ADMIN_TOKEN.
A first attempt did, and broke the canvas because isDevModeFailOpen
requires ADMIN_TOKEN empty AND MOLECULE_ENV=development together.
Setting ADMIN_TOKEN in dev would close the hatch and the canvas
has no way to read that token in a dev build (no
NEXT_PUBLIC_ADMIN_TOKEN bake step here). The .env comment block
explicitly warns future contributors not to add it.
- Both processes' logs go to /tmp/molecule-{platform,canvas}.log
instead of stdout-mixed so the readiness banner stays clean.
- Health-poll loops cap at 30s with a clear timeout error pointing
to the log file, instead of hanging forever.
- The readiness banner now lists the log paths AND tells the user
the next step is "open localhost:3000 → add API key in Config →
Secrets & API Keys → Global", instead of just listing service
URLs.
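The fail-open contract the second and third bullets depend on can be sketched as follows; the real check is the platform's isDevModeFailOpen middleware, and this Python stand-in is our reading of it:

```python
def is_dev_mode_fail_open(env: dict) -> bool:
    """Admin endpoints accept bearer-less calls only when BOTH hold:
    ADMIN_TOKEN is unset/empty AND MOLECULE_ENV == 'development'."""
    return not env.get("ADMIN_TOKEN") and env.get("MOLECULE_ENV") == "development"
```

This is why dev-start.sh must not auto-generate an ADMIN_TOKEN: setting one flips the first condition and closes the hatch for the canvas.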
Quickstart doc rewrite leads with:
git clone ...
cd molecule-monorepo
./scripts/dev-start.sh
The 4-step manual flow is preserved as "Manual setup (advanced)"
for contributors who want per-component logs.
Verified end-to-end from clean Docker (no containers, no volumes,
no .env) three times: total wall-clock ~12s for a re-run with
cached npm/docker layers. The platform's HTTP 200 on /workspaces
without a bearer confirms the dev-mode auth hatch is active.
Three different intermittent failures observed during a single
manual-test session — RemoteProtocolError, ReadTimeout, ConnectError —
each surfaced as a "Failed to deliver to <peer>" error chip in the
canvas Agent Comms panel even though the next attempt would have
succeeded (verified by direct probes from the same source workspace
to the same peer). The error chip even said "Usually a transient
network blip — retry once," but it left the retry to a human.
Fix: send_a2a_message now retries internally, up to 5 attempts (1
initial + 4 retries) with exponential backoff (1s, 2s, 4s, 8s,
16s-capped), each backoff jittered ±25% to break sync across
siblings. Cumulative wall-clock capped at 600s by
_DELEGATE_TOTAL_BUDGET_S so a string of 5×300s ReadTimeouts can't
make the caller wait 25 minutes — once the deadline elapses, retries
stop even if attempts remain.
Retry only on transport-layer transients:
- ConnectError / ConnectTimeout (peer's listening socket not ready)
- RemoteProtocolError (peer closed TCP without writing — observed
when a peer's prior in-flight Claude SDK session aborted)
- ReadError / WriteError (network blip on Docker bridge)
- ReadTimeout (peer wrote no response in 300s)
Application-level errors are NOT retried — they're deterministic and
retrying just wastes wall-clock:
- HTTP 4xx (peer rejected the request format)
- JSON parse failures (peer returned garbage)
- JSON-RPC error in response body (peer's runtime errored cleanly)
- Programmer-bug exceptions (ValueError, etc.)
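The retry/no-retry split above amounts to a predicate. The real code matches httpx exception types; this sketch matches on type names so it runs without httpx installed:

```python
# Transport-layer transients, matched by exception type name here
# (the real code matches the httpx exception classes directly).
_TRANSIENT = {
    "ConnectError", "ConnectTimeout",  # peer's listening socket not ready
    "RemoteProtocolError",             # peer closed TCP without writing
    "ReadError", "WriteError",         # network blip on the bridge
    "ReadTimeout",                     # peer wrote no response in time
}

def is_transient(exc: Exception) -> bool:
    """True only for transport transients worth retrying; application-level
    failures (4xx, JSON errors, programmer bugs) are deterministic and fail fast."""
    return type(exc).__name__ in _TRANSIENT
```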
8 new tests pin the contract:
- retry succeeds after 2 RemoteProtocolErrors
- retry succeeds after 1 ConnectError
- all 5 attempts fail → returns formatted last-error
- capped at exactly _DELEGATE_MAX_ATTEMPTS (regression cover for
"did someone bump the constant accidentally?")
- JSON-RPC error response NOT retried (1 attempt only)
- non-httpx exception NOT retried (programmer bugs stay loud)
- total budget caps the loop even if attempts remain
- backoff schedule grows exponentially with ±25% jitter
Refactor: extracted _format_a2a_error() so the success and exhausted
paths share one error-formatting routine. _delegate_backoff_seconds()
is a pure function so the schedule is unit-testable without monkey-
patching asyncio.sleep.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Canvas Agent Comms bubbles for outbound delegation showed only
"Delegating to <peer>" boilerplate during the live update window —
the actual task text only surfaced after a refresh re-fetched the row
from /workspaces/:id/activity. Symptom flagged today during a fresh
delegation manual test where the bubble said "Delegating to Perf
Auditor" instead of the user's "audit moleculesai.app for
performance" prompt.
Root cause: LogActivity's broadcast payload at activity.go:510-518
deliberately omitted request_body and response_body, so the canvas's
live-update path (AgentCommsPanel.tsx:271-289) saw `p.request_body =
undefined` and toCommMessage fell back to the
`Delegating to ${peerName}` template string. The DB row stored the
real task / reply, which is why GET-on-mount worked.
Fix: include both bodies in the broadcast as json.RawMessage values
(no re-marshal cost — they were already encoded for the DB insert
above). Same pattern as tool_trace, which has been included since #1814.
Each side is bounded by the workspace-side caller's own caps: the
runtime's report_activity helper caps error_detail at 4096 chars and
summary at 256; request/response are constrained by the runtime's
own limits — typical delegate_task payload is hundreds of chars to a
few KB. If a much-larger broadcast becomes a concern later, a soft
cap can be added at this site without breaking the contract.
Three regression tests pin the broadcast shape:
- request_body present → canvas renders the actual task text
- response_body present → canvas renders the actual reply text
- response_body nil → omitted from payload (no empty-bubble flicker)
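The fixed broadcast shape, rendered in Python for illustration (the real site is Go and splices the pre-encoded bodies as json.RawMessage; the field names follow the commit, the builder itself is hypothetical):

```python
def build_broadcast(event, request_body, response_body):
    """Attach the already-encoded bodies verbatim (no re-marshal),
    omitting response_body entirely when it is nil/None so the canvas
    never renders an empty bubble."""
    payload = dict(event)
    payload["request_body"] = request_body      # pass-through, pre-encoded
    if response_body is not None:
        payload["response_body"] = response_body
    return payload
```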
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cascade-deleting a 7-workspace org returned 500 with
"workspace marked removed, but 2 stop call(s) failed — please retry:
stop eeb99b5d-...: force-remove ws-eeb99b5d-607: Error response
from daemon: removal of container ws-eeb99b5d-607 is already in
progress"
even though the DB-side post-condition succeeded (removed_count=7) and
the containers WERE removed shortly after. The fanout fired Stop() on
every workspace concurrently and the orphan sweeper happened to reap
two of them at the same instant, so Docker rejected the second
ContainerRemove with "removal already in progress" — a race-condition
ack, not a real failure. Retrying just races the same in-flight
removal.
The post-condition we care about (the container WILL be gone) is
identical to a successful removal, so Stop() should treat it the
same way it already treats "No such container" — a no-op return nil
that lets the caller proceed with volume cleanup. Real daemon
failures (timeout, EOF, ctx cancel) still surface as errors.
Two pieces:
- New isRemovalInProgress() predicate using the same string-match
approach as isContainerNotFound (docker/docker has no typed
errdef for this; the CLI itself relies on the message).
- Stop() now treats the predicate as success, with a log line
distinct from the not-found path so debugging can tell which
race fired.
Both substrings ("removal of container" + "already in progress") must
match — "already in progress" alone would false-positive on unrelated
operations like image pulls. Truth table pinned in 7 new test cases.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>