Both backends panicked when called on a zero-valued or nil receiver:
Provisioner.{Stop,IsRunning} dereferenced p.cli; CPProvisioner.{Stop,
IsRunning} dereferenced p.httpClient. The orphan sweeper and shutdown
paths can call these speculatively where the receiver isn't fully
wired — the panic crashed the goroutine instead of the caller seeing
a clean error.
Three changes:
1. Add ErrNoBackend (typed sentinel) and nil-guard the four methods.
- Provisioner.{Stop,IsRunning}: guard p == nil || p.cli == nil at
the top.
- CPProvisioner.Stop: guard p == nil up top, then httpClient nil
AFTER resolveInstanceID + empty-instance check (the empty
instance_id path doesn't need HTTP and stays a no-op success
even on zero-valued receivers — preserved historical contract
from TestIsRunning_EmptyInstanceIDReturnsFalse).
- CPProvisioner.IsRunning: same shape — empty instance_id stays
(false, nil); httpClient-nil with non-empty instance_id returns
ErrNoBackend.
2. Flip the t.Skip on TestDockerBackend_Contract +
TestCPProvisionerBackend_Contract — both contract tests run now
that the panics are gone. Skipped scenarios were the regression
guard for this fix.
3. Add TestZeroValuedBackends_NoPanic — explicit assertion that
zero-valued and nil receivers return cleanly (no panic). Docker
backend always returns ErrNoBackend on zero-valued; CPProvisioner
may return (false, nil) when the DB-lookup layer absorbs the case
(no instance to query → no HTTP needed). Both are acceptable per
the issue's contract — the gate is no-panic.
Tests:
- 6 sub-cases across the new TestZeroValuedBackends_NoPanic
- TestDockerBackend_Contract + TestCPProvisionerBackend_Contract
now run their 2 scenarios (4 sub-cases each)
- All existing provisioner tests still green
- go build ./... + go vet ./... + go test ./... clean
Closes drift-risk #6 in docs/architecture/backends.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collapses test_compliance_default_when_yaml_omits_block,
_when_yaml_block_is_empty, _explicit_optout_still_works into one
parametrized test_compliance_default_via_load_config with three
ids (yaml_omits_block, yaml_block_empty, yaml_explicit_optout).
The dataclass-default test stays separate (no tmp_path needed).
Coverage and assertions identical; net -19 lines, same 4 logical cases.
prompt_injection check moves out of per-case to a single tail-assert
since no payload overrode it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #2056 flipped ComplianceConfig.mode default from "" to "owasp_agentic"
so every shipped template gets prompt-injection detection + PII redaction
by default. The flip is correct + already shipping, but no test asserts
the new default — a silent revert (or a refactor that reintroduces the
old "" default) would pass workspace/tests/ and ship a workspace with
compliance silently off.
Add 4 regression tests:
- test_compliance_dataclass_default — ComplianceConfig() with no args
returns mode='owasp_agentic' + prompt_injection='detect'
- test_compliance_default_when_yaml_omits_block — load_config on a yaml
without `compliance:` key still produces owasp_agentic
- test_compliance_default_when_yaml_block_is_empty — load_config on
`compliance: {}` (a common shape during template editing) still
produces owasp_agentic; covers the load_config()
`.get("mode", "owasp_agentic")` default-fill path
- test_compliance_explicit_optout_still_works — `mode: ""` in yaml
must disable compliance (the documented opt-out path)
23/23 tests pass locally (4 new + 19 existing).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
platform_auth.py validates WORKSPACE_ID at module load — EC2 user-data
sets it from cloud-init, but the CI smoke-test was missing it and
failed with 'WORKSPACE_ID is empty'. Set a placeholder UUID so the
import gate exercises only the dep-resolution path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-fix: workspace-server's provision-timeout sweep was hardcoded
at 10 min for all runtimes. The CP-side bootstrap-watcher (cp#245)
correctly gives hermes 25 min for cold-boot (hermes installs
include apt + uv + Python venv + Node + hermes-agent — 13–25 min on
slow apt mirrors is normal). The two timeout systems disagreed:
the watcher would happily wait 25 min, but the workspace-server's
sweep killed healthy hermes boots mid-install at the 10-min mark and
marked them failed.
Today's example: #2061's E2E run on 2026-04-26 at 08:06:34Z
created a hermes workspace, EC2 cloud-init was visibly making
progress on apt-installs (libcjson1, libmbedcrypto7t64) when the
sweep flipped status to 'failed' at 08:17:00Z (10:26 elapsed). The
test threw "Workspace failed: " (empty error from sql.NullString
serialization) and CI failed on a healthy boot.
Fix: provisioningTimeoutFor(runtime) — same shape as the CP's
bootstrapTimeoutFn:
- hermes: 30 min (watcher's 25 min + 5 min slack)
- others: 10 min (unchanged — claude-code/langgraph/etc. boot
in <5 min, 10 min is plenty)
PROVISION_TIMEOUT_SECONDS env override still works (applies to all
runtimes — operators who care about the runtime distinction
shouldn't use the override anyway).
Sweep query change: pulls (id, runtime, age_sec) per row instead
of pre-filtering by age in SQL. Per-row Go evaluation picks the
correct timeout. Slightly more rows scanned but bounded by the
status='provisioning' partial index — workspaces in flight, not
historical.
Tests:
- TestProvisioningTimeout_RuntimeAware — locks in the per-runtime
mapping
- TestSweepStuckProvisioning_HermesGets30MinSlack — hermes at
11 min must NOT be flipped
- TestSweepStuckProvisioning_HermesPastDeadline — hermes at
31 min IS flipped, payload includes runtime
- Existing tests updated for the new query shape
Verified:
- go build ./... clean
- go vet ./... clean
- go test ./... all green
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Review of the runtime-pin-compat workflow:
- Add merge_group trigger so when this becomes a required check the
queue green-checks it (mirrors ci.yml convention).
- Cache pip on workspace/requirements.txt — actions/setup-python@v5
with cache: pip + cache-dependency-path. Saves ~30s per fire.
- Document the load-bearing install order: runtime FIRST so pip
honors the runtime's declared a2a-sdk constraint (the surface that
broke 2026-04-24); workspace/requirements.txt SECOND so a2a-sdk
is upgraded to the runtime image's pinned version. Import smoke
validates the upgraded combination.
Skipped: branch-protection wiring (separate ops decision, not in
scope here); ci.yml integration (the standalone schedule trigger
is the load-bearing reason to keep this workflow separate).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes Molecule-AI/molecule-controlplane#253.
Prevents recurrence of the 5-hour staging outage from 2026-04-24:
molecule-ai-workspace-runtime 0.1.13 declared `a2a-sdk<1.0` in its
metadata but actually imported `a2a.server.routes` (1.0+ only). pip
resolved successfully; every tenant workspace crashed at import. The
canary tenant ultimately caught it but only after 5 hours of degraded
staging. PR #249 fixed the version pin manually; nothing automated
catches the same class of bug for the next release.
This workflow:
- Installs molecule-ai-workspace-runtime fresh from PyPI in a Python
3.11 venv (mirrors EC2 user-data install pattern)
- Layers in workspace/requirements.txt (the runtime image's actual
dep set, including the a2a-sdk[http-server]>=1.0,<2.0 pin)
- Runs `from molecule_runtime.main import main_sync` — same import
the runtime entrypoint does
- Fails CI if pip resolution silently produced a combo that the
runtime can't actually import
Triggers:
- PR + push to main/staging touching workspace/requirements.txt or
this workflow (catches local pin changes)
- Daily 13:00 UTC schedule (catches upstream PyPI publishes that
break the pin combo without any change in our repo)
- workflow_dispatch (manual)
Concurrency cancels in-progress runs on the same ref.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`Delete`'s call to `h.provisioner.Stop()` was silently swallowing
errors — and on the SaaS/EC2 backend, Stop() is the call that
terminates the EC2 via the control plane. When Stop returned an
error (CP transient 5xx, network blip), the workspace was marked
'removed' in the DB but the EC2 stayed running with no row to
track it. The "14 orphan workspace EC2s on a 0-customer account"
incident in #1843 (40 vCPU on a 64 vCPU AWS limit) traced to this
silent-leak path.
This change aggregates Stop errors across both descendant and
self-stop calls and surfaces them as 500 to the client, matching
the loud-fail pattern from CP #262 (DeprovisionInstance) and the
DNS cleanup propagation (#269).
Idempotency:
- The DB row is already 'removed' before Stop runs (intentional,
per #73 — guards against register/heartbeat resurrection).
- `resolveInstanceID` reads instance_id without a status filter,
so a retry can replay Stop with the same instance_id.
- CP's TerminateInstance is idempotent on already-terminated EC2s.
- So a retry-after-500 either re-attempts the terminate (succeeds)
or finds the instance already gone (also succeeds).
Behaviour change at the API layer:
- Before: 200 `{"status":"removed","cascade_deleted":N}` regardless
of Stop outcome.
- After: 500 `{"error":"...","removed_count":N,"stop_failures":K}`
on Stop failure; 200 on success.
RemoveVolume errors stay log-and-continue — those are local
/var/data cleanup, not infra-leak class.
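One way the aggregation could look, sketched with `errors.Join` (Go 1.20+). The `stopper` interface, `stopAll`, and `flakyStopper` are hypothetical; as noted in this message, the real handler field is a concrete type, so this illustrates the pattern rather than the handler code.

```go
package main

import (
	"errors"
	"fmt"
)

// stopper is a hypothetical interface used only for this sketch.
type stopper interface {
	Stop(workspaceID string) error
}

// stopAll attempts every Stop call and joins the failures instead of
// returning on the first one, so a single bad descendant does not hide others.
func stopAll(s stopper, ids []string) (failures int, err error) {
	var errs []error
	for _, id := range ids {
		if stopErr := s.Stop(id); stopErr != nil {
			errs = append(errs, fmt.Errorf("stop %s: %w", id, stopErr))
		}
	}
	// errors.Join returns nil when there is nothing to join.
	return len(errs), errors.Join(errs...)
}

// flakyStopper fails for the listed IDs, simulating a transient CP 5xx.
type flakyStopper struct{ failing map[string]bool }

func (f flakyStopper) Stop(id string) error {
	if f.failing[id] {
		return errors.New("control plane returned 502")
	}
	return nil
}

func main() {
	s := flakyStopper{failing: map[string]bool{"child-2": true}}
	n, err := stopAll(s, []string{"child-1", "child-2", "self"})
	fmt.Println(n, err)
}
```

A handler built on this would return 500 with the failure count when `err != nil`, and 200 otherwise.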
Test debt acknowledged: the WorkspaceHandler's `provisioner` field
is the concrete `*provisioner.Provisioner` type, not an interface.
Adding a regression test for the new error-propagation path
requires either a refactor (introduce a Provisioner interface) or
a docker-backed integration test. Filing the refactor as a
follow-up; the change here is small and mirrors a proven pattern
(CP #262 + #269 both ship without exhaustive new test coverage
for the same reason).
Verified:
- go build ./... clean
- go vet ./... clean
- go test ./... green across the whole module (existing TestDelete
cases unchanged behaviour for happy path)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Code-quality + efficiency review of PR #2081:
- Drop comma-ok on map type-asserts in filterPeersByQuery —
queryPeerMaps writes name/role unconditionally as string, so the
silent-empty-string fallback was cargo-culted defense that would
HIDE a real upstream shape change in tests rather than surface it.
Plain p["name"].(string) panics on violation, caught by tests.
- Trim filterPeersByQuery doc from 5 lines to 1 — function is 15
lines and self-evident.
- Refactor 6 separate Test functions into one table-driven
TestPeers_QFilter with 6 sub-tests. Net ~80 lines saved + naming
becomes readable subtest names instead of TestPeers_Q_Foo_Bar.
- Set-based peer-id comparison (peerIDSet) replaces fragile
peers[0]["id"] == "ws-alpha" asserts that would silently mask a
future sort/order regression on the production code.
- Fix the broken TestPeers_Q_NoMatches assertion: re-encoding an
unmarshalled []map collapses both null and [] to [], so the
previous json.Marshal(peers) == "[]" check was tautological. Move
the [] vs null distinction to a dedicated test
(TestPeers_Q_NoMatches_RawBodyIsArrayNotNull) that inspects the
recorder body BEFORE unmarshal.
runPeersWithQuery now returns both parsed peers and raw body so the
nil-guard test can use the bytes directly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Peers handler at workspace-server/internal/handlers/discovery.go
ignored the ?q= query param entirely — every caller got the full peer
list regardless of what they searched for. The handler exposes peer
identities + URLs, so leaking the unfiltered set on a "filtered"
endpoint is an info-disclosure bug (CWE-862).
Fix: read c.Query("q") and post-filter the in-memory peers slice by
case-insensitive substring match against name OR role. Filtering is
done in Go after the existing 3 SQL reads — keeps the SQL bytes
identical to the no-filter path (no injection vector, no DB-driver
collation surprises) at a small cost. The peer set is bounded by a
single workspace's parent + children + siblings (typically <50
rows), so the in-memory pass is negligible.
Empty / whitespace-only q is a no-op — preserves the no-filter
allocation profile.
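A minimal sketch of the post-filter, assuming peers are `[]map[string]any` with string `name` and `role` keys. The fixture data in `main` is invented for illustration, and the real function body may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// filterPeersByQuery keeps peers whose name OR role contains q,
// case-insensitively. Empty or whitespace-only q returns the input untouched.
func filterPeersByQuery(peers []map[string]any, q string) []map[string]any {
	q = strings.ToLower(strings.TrimSpace(q))
	if q == "" {
		return peers
	}
	// Non-nil result so a no-match response encodes as [] rather than null.
	out := make([]map[string]any, 0, len(peers))
	for _, p := range peers {
		name := strings.ToLower(p["name"].(string))
		role := strings.ToLower(p["role"].(string))
		if strings.Contains(name, q) || strings.Contains(role, q) {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	// Hypothetical fixture; IDs echo the test names in this commit.
	peers := []map[string]any{
		{"id": "ws-alpha", "name": "Alpha Agent", "role": "engineer"},
		{"id": "ws-beta", "name": "Beta Agent", "role": "Designer"},
		{"id": "ws-gamma", "name": "Gamma Agent", "role": "qa"},
	}
	for _, q := range []string{"ALPHA", "design", "  ", "nomatch"} {
		fmt.Printf("%q -> %d peer(s)\n", q, len(filterPeersByQuery(peers, q)))
	}
}
```

Because the filter runs after the SQL reads, the query string never reaches the database.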
Tests (6 new in discovery_test.go):
- TestPeers_NoQ_ReturnsAll — regression baseline (3 peers, no filter)
- TestPeers_Q_FiltersByName — q=alpha → ws-alpha only
- TestPeers_Q_CaseInsensitive — q=ALPHA → ws-alpha (locks in ToLower)
- TestPeers_Q_FiltersByRole — q=design → ws-beta (role-side match)
- TestPeers_Q_NoMatches — empty array, JSON [] not null
- TestPeers_Q_WhitespaceOnly — q=' ' treated as no-filter
Helpers peersFilterFixture + runPeersWithQuery + peerNames keep each
test scoped to the q-behaviour, not re-declaring SQL expectations.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a bot opens a PR against main and there's already another PR on
the same head branch targeting staging, GitHub's PATCH /pulls returns
422 with:
"A pull request already exists for base branch 'staging' and
head branch '<branch>'"
Pre-fix: the retarget Action exited 1 with no further action. The
target-main PR sat there as a duplicate, the workflow run showed
red, and someone had to manually close the duplicate. Today's case
(#1881 duplicate of #1820) had to be closed manually.
Fix: catch that specific 422 message and close the main-PR as
redundant instead of failing. Any OTHER 422 (or other error) still
fails loud — the grep matches the specific duplicate-base text, not
a blanket "any 422 means duplicate".
Behaviour matrix:
PATCH succeeds → retargeted, explainer
comment posted
PATCH 422 "already exists for staging" → close main-PR with
explainer (NEW)
PATCH any other failure → workflow fails (preserves
loud-fail for real bugs)
Tests: GitHub Actions don't have an inline unit-test framework here.
The workflow YAML parses (validated locally) and the bash logic is
straightforward. Real verification will be the next duplicate-PR
scenario in production.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Code-quality + efficiency review of PR #2079:
- Hoist all_slugs = prod_slugs | staging_slugs out of decide() into the
caller (was rebuilt on every record — 1k records × ~50-slug union per
call). decide() signature now (r, all_slugs, ec2_names).
- Compile regexes at module scope (_WS_RE, _E2E_RE, _TENANT_RE) +
hoist platform-core literal set (_PLATFORM_CORE_NAMES). Same change
mirrored in the bash heredoc.
- Drop decorative # Rule N: comments (numbering was out of order, 3
before 2 — actively confusing).
- Move the "edits must mirror" reminder OUTSIDE the CANONICAL DECIDE
block in the .sh file, eliminating the .replace() comment-skip hack
in TestParityWithBashScript.
- Drop per-line .strip() in _slice_canonical (would mask a real
indentation bug; both blocks already at column 0).
- subTest() in TestPlatformCore loops so a single failure no longer
short-circuits the rest of the items.
- merge_group + concurrency on test-ops-scripts.yml (parity with
ci.yml gate behaviour).
- Fix the apostrophe in "don't" in an inline comment, which closed the
python heredoc's single-quote and broke bash -n.
All 25 tests still pass. bash -n clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #2027.
The CF orphan sweep deletes DNS records — a misclassification could nuke
a live workspace's tunnel. The decision function had MAX_DELETE_PCT
percentage gating but no automated test of category → action mapping.
Approach: extract the decide() function to scripts/ops/sweep_cf_decide.py
as a verbatim copy bracketed by `# CANONICAL DECIDE BEGIN/END` markers.
The shell script keeps its inline heredoc (so the operational path is
untouched) but bracketed by the same markers. A parity test
(TestParityWithBashScript) reads both files and asserts the bracketed
blocks match line-for-line — drift fails CI loudly.
Coverage (25 tests, 1 file, stdlib unittest only):
- Rule 1 platform-core: apex, _vercel, _domainkey, www/api/app/doc/send/status/staging-api
- Rule 3 ws-*: live (matches EC2 prefix) on prod + staging; orphan on prod + staging
- Rule 4 e2e-*: live + orphan on staging; orphan on prod
- Rule 2 generic tenant: live prod + staging; unknown subdomain kept-for-safety
- Rule 5 fallthrough: external domain + unrelated apex
- Rule priority: api.moleculesai.app stays platform-core (not tenant); _vercel stays verification
- Safety gate: under/at/over default 50% threshold; zero-total no-divide; custom threshold
- Empty live-sets: documents that decide() alone classifies as orphan, gate is the defense
CI: new .github/workflows/test-ops-scripts.yml runs `python -m unittest
discover` against scripts/ops/ on every PR/push that touches the
directory. Lightweight — no requirements file, stdlib only.
Local: `cd scripts/ops && python -m unittest test_sweep_cf_decide -v` →
25 tests, all OK.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User reported the canvas threw a generic "API GET /workspaces: 500
{auth check failed}" error when local Postgres + Redis were both
down. Two problems:
1. The error code (500) and message ("auth check failed") said
nothing useful. The actual condition was "platform can't reach
its datastore to validate your token" — a Service Unavailable
class, not Internal Server Error.
2. The canvas had no way to distinguish infra-down from a real
auth bug, so it rendered the raw API string in the same
generic-error overlay it uses for everything.
Fix in two layers:
Server (wsauth_middleware.go):
- New abortAuthLookupError helper centralises all three sites
that previously returned `500 {"error":"auth check failed"}`
when HasAnyLiveTokenGlobal or orgtoken.Validate hit a DB error.
- Now returns 503 + structured body
`{"error": "...", "code": "platform_unavailable"}`. 503 is
the correct semantic ("retry shortly, infra is unavailable")
and the code field is the contract the canvas reads.
- Body deliberately excludes the underlying DB error string —
production hostnames / connection-string fragments must not
leak into a user-visible error toast.
Canvas (api.ts):
- New PlatformUnavailableError class. api.ts inspects 503
responses for the platform_unavailable code and throws the
typed error instead of the generic "API GET /…: 503 …"
message. Generic 503s (upstream-busy, etc.) keep the legacy
path so existing busy-retry UX isn't disrupted.
Canvas (page.tsx):
- New PlatformDownDiagnostic component renders when the
initial hydration catches PlatformUnavailableError.
Surfaces the actual condition with operator-actionable
copy ("brew services start postgresql@14 / redis") +
pointer to the platform log + a Reload button.
Tests:
- Go: TestAdminAuth_DatastoreError_Returns503PlatformUnavailable
pins the response shape (status, code field, no DB-error leak)
- Canvas: 5 tests for PlatformUnavailableError classification —
typed throw on 503+code match, generic-Error fallback for
503-without-code (upstream busy), 500 stays generic, non-JSON
body falls back to generic.
1015 canvas tests + full Go middleware suite pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The root-cause fix for the staging-tabs E2E saga (#2073/#2074/#2075).
Old behaviour: ANY 401 from any fetch on a SaaS tenant subdomain
called redirectToLogin → window.location.href = AuthKit. This is
wrong. Plenty of 401s don't mean "session is dead":
- workspace-scoped endpoints (/workspaces/:id/peers, /plugins)
require a workspace-scoped token, not the tenant admin bearer
- resource-permission mismatches (user has tenant access but not
this specific workspace)
- misconfigured proxies returning 401 spuriously
A single transient one of those yanked authenticated users back to
AuthKit. Same bug yanked the staging-tabs E2E off the tenant origin
mid-test for 6+ hours tonight, leading to the cascade of test-side
mocks (#2073/#2074/#2075) that worked around the symptom without
fixing the cause.
This PR fixes it at the source. The new logic:
- 401 on /cp/auth/* path → that IS the canonical session-dead
signal → redirect (unchanged)
- 401 on any other path with slug present → probe /cp/auth/me:
probe 401 → session genuinely dead → redirect
probe 200 → session fine, endpoint refused this token →
throw a real Error, caller renders error state
probe network err → assume session-fine (conservative) →
throw real Error
- slug empty (localhost / LAN / reserved subdomain) → throw
without redirect (unchanged)
The probe adds one extra fetch on a 401, only when slug is set
and the path isn't already auth-scoped. That's rare and
worthwhile — a transient probe round-trip is cheap; an unwanted
auth redirect is a UX disaster.
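The decision table above, restated as a pure function. The real logic lives in the canvas TypeScript (api.ts); this Go version only models the branching, every name in it is hypothetical, and checking the empty slug before the auth path is an assumption about precedence.

```go
package main

import (
	"fmt"
	"strings"
)

type probeResult int

const (
	probeOK           probeResult = iota // /cp/auth/me returned 200
	probeUnauthorized                    // /cp/auth/me returned 401
	probeNetworkError                    // probe itself failed
)

type action int

const (
	actionThrow    action = iota // surface an error to the caller
	actionRedirect               // session is dead, go to login
)

// on401 decides what to do with a 401. probe is only invoked when the
// slug is set and the failing path is not already auth-scoped.
func on401(path, slug string, probe func() probeResult) action {
	if slug == "" {
		return actionThrow // localhost / LAN / reserved subdomain
	}
	if strings.HasPrefix(path, "/cp/auth/") {
		return actionRedirect // canonical session-dead signal
	}
	if probe() == probeUnauthorized {
		return actionRedirect
	}
	// Probe 200 or network error: assume the session is fine.
	return actionThrow
}

func main() {
	ok := func() probeResult { return probeOK }
	dead := func() probeResult { return probeUnauthorized }
	fmt.Println(on401("/workspaces/1/peers", "acme", ok) == actionThrow)
	fmt.Println(on401("/workspaces/1/peers", "acme", dead) == actionRedirect)
	fmt.Println(on401("/cp/auth/me", "acme", ok) == actionRedirect)
}
```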
Tests:
- api-401.test.ts rewritten with the full matrix:
* /cp/auth/me 401 → redirect (no probe, that IS the signal)
* non-auth 401 + probe 401 → redirect
* non-auth 401 + probe 200 → throw, no redirect ← the fix
* non-auth 401 + probe network err → throw, no redirect
* empty slug paths (localhost/LAN/reserved) → throw, no probe
- 43 tests in canvas/src/lib/__tests__/api*.test.ts all pass
- tsc clean
The staging-tabs E2E spec's universal-401 route handler stays as
defense-in-depth (silences resource-load console noise + guards
against panels without try/catch), but the comment now describes
its role honestly: api.ts is the primary fix, the route is the
safety net.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Convergence-pass review noted the comment at orphan_sweeper.go:171
still describes the pre-cb126014 contract ("Stop returns nil even
when container is gone, but a future change could surface real
errors"). The future is now — Stop does surface real errors today.
Tightened the comment to match the live contract:
isContainerNotFound is treated as success, anything else returns
the wrapped Docker error, sweeper retries on the next cycle.
Pure comment change, no behavior diff.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Review caught a critical issue with 12c49183: the headline "skip
RemoveVolume when Stop fails" guarantee was dead code. `Provisioner.Stop`
unconditionally returned nil after logging the underlying
ContainerRemove error, so the new `if err := h.provisioner.Stop(...);
err != nil { skip volume }` guard in workspace_crud.go AND the same
guard in the orphan sweeper could never fire. RemoveVolume always
ran, predictably failing with "volume in use" when Stop hadn't
actually killed the container — which is the exact production bug
the commit claimed to fix.
Now Stop:
- returns nil on successful remove (no change)
- returns nil when the container is already gone (uses the existing
isContainerNotFound helper — that's the cleanup post-condition,
not a failure)
- returns the wrapped Docker error otherwise (daemon timeout, ctx
cancellation, socket EOF — anything that means the container
might still be alive)
Audited every Provisioner.Stop caller in the tree (team.go,
workspace_restart.go ×4, workspace.go) — all of them already
discard the return value, so the widened error surface is purely
opt-in for the new cleanup paths and breaks no existing behaviour.
Other review-driven fixes in this commit:
- workspace_crud.go: detached `broadcaster.RecordAndBroadcast` from
the request ctx too. RecordAndBroadcast does INSERT INTO
structure_events + Redis Publish; if the canvas hangs up, a
request-ctx-bound INSERT can be cancelled mid-write and the
WORKSPACE_REMOVED event never lands, leaving other WS clients
ignorant of the cascade.
- orphan_sweeper.go: added isLikelyWorkspaceID guard before turning
Docker container prefixes into SQL LIKE patterns. The Docker
name filter is a SUBSTRING match (not prefix), so non-workspace
containers like `my-ws-tool` slip through; the in-loop HasPrefix
in provisioner trims most, but the in-sweeper alphabet check
(hex + dashes only) is the second line of defence and also
blocks SQL LIKE wildcards (`_`, `%`) from reaching the query.
Two new tests pin this — TestSweepOnce_FiltersNonWorkspacePrefixes
and TestIsLikelyWorkspaceID with 10 alphabet cases.
- provisioner.go: comment added to ListWorkspaceContainerIDPrefixes
flagging the substring/HasPrefix relationship as load-bearing.
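An illustrative version of the alphabet check, assuming lowercase hex only and rejecting the empty string. Both of those details are assumptions; the change itself specifies only "hex + dashes".

```go
package main

import "fmt"

// isLikelyWorkspaceID accepts only lowercase hex digits and dashes.
// That excludes the SQL LIKE wildcards '_' and '%' as a side effect.
func isLikelyWorkspaceID(s string) bool {
	if s == "" {
		return false
	}
	for _, r := range s {
		switch {
		case r >= '0' && r <= '9':
		case r >= 'a' && r <= 'f':
		case r == '-':
		default:
			return false
		}
	}
	return true
}

func main() {
	for _, candidate := range []string{"a1b2c3d4-e5f6", "my-ws-tool", "abc_%", ""} {
		fmt.Printf("%q -> %v\n", candidate, isLikelyWorkspaceID(candidate))
	}
}
```

Only the first candidate passes; `my-ws-tool` fails on `m`, and `abc_%` fails on the wildcard characters.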
Verified: full Go test suite passes; all 8 sweeper tests pass
(2 new for the LIKE-pattern guard); existing dispatch / delete /
provisioner tests unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Symptom: deleting workspaces from the canvas marked DB rows
status='removed' but left Docker containers running indefinitely.
After a session of org imports + cancellations, we counted 10
running ws-* containers all backed by 'removed' DB rows, eating
~1100% CPU on the Docker VM.
Two compounding bugs in handlers/workspace_crud.go's delete cascade:
1. The cleanup loop used `c.Request.Context()` for the Docker
stop/remove calls. When the canvas's `api.del` resolved on the
platform's 200, gin cancelled the request ctx — and any in-flight
Docker call cancelled with `context canceled`, leaving the
container alive. Old logs:
"Delete descendant <id> volume removal warning:
... context canceled"
2. `provisioner.Stop`'s error return was discarded and `RemoveVolume`
ran unconditionally afterward. When Stop didn't actually kill the
container (transient daemon error, ctx cancellation as in #1), the
volume removal would predictably fail with "volume in use" and
the container kept running with the volume mounted. Old logs:
"Delete descendant <id> volume removal warning:
Error response from daemon: remove ... volume is in use"
Fix layered in two parts:
- workspace_crud.go: detach cleanup with `context.WithoutCancel(ctx)`
+ a 30s bounded timeout. Stop's error is now checked and on
failure we skip RemoveVolume entirely (the orphan sweeper below
catches what we deferred).
- New registry/orphan_sweeper.go: periodic reconcile pass (every 60s,
initial run on boot). Lists running ws-* containers via Docker name
filter, intersects with DB rows where status='removed', stops +
removes volumes for the leaks. Defence in depth — even a brand-new
Stop failure mode heals on the next sweep instead of leaking
forever.
Provisioner gains a tiny ListWorkspaceContainerIDPrefixes helper
that wraps ContainerList with the `name=ws-` filter; the sweeper
takes an OrphanReaper interface (matches the ContainerChecker
pattern in healthsweep.go) so unit tests don't need a real Docker
daemon.
main.go wires the sweeper alongside the existing liveness +
health-sweep + provisioning-timeout monitors, all under
supervised.RunWithRecover so a panic restarts the goroutine.
6 new sweeper tests cover the reconcile path, the
no-running-containers short-circuit, the daemon-error skip, the
Stop-failure-leaves-volume invariant (the same trap that motivated
this fix), the volume-remove-error-is-non-fatal continuation,
and the nil-reaper no-op.
Verified: full Go test suite passes; manually purged the 10 leaked
containers + their orphan volumes from the dev host with `docker
rm -f` + `docker volume rm` (one-off cleanup; the sweeper would
have caught them on the next cycle once deployed).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After #2074, the staging-tabs spec stopped failing on the auth-redirect
locator timeout (good — the broadened 401-mock works) but started
failing on a different aggregate check:
Error: unexpected console errors:
Failed to load resource: the server responded with a status of 404
Failed to load resource: the server responded with a status of 404
Failed to load resource: the server responded with a status of 404
Browser console messages for resource-load failures omit the URL,
so the message is uninformative on its own — we can't filter
selectively (e.g. "is this a missing-CSS noise or a real broken
endpoint?"). The previous filter list (sentry/vercel/WebSocket/
favicon/molecule-icon) catches specific known-noisy strings but
this generic "Failed to load resource" doesn't contain any of them.
Two changes:
1. Add page.on('requestfailed') + page.on('response') (status >= 400) logging
to capture the URL of any failed request. Logs to test stdout
(visible in the workflow log) — leaves a breadcrumb so a real
bug isn't completely hidden when we filter the generic message.
2. Add "Failed to load resource" to the filter list. With (1) in
place we still see the URLs for diagnosis; the generic console
message is just noise.
Real JS exceptions (panel crash, undefined access, etc.) come with
a file path and stack trace and aren't matched by either filter,
so the gate still catches actual bugs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#2073 caught workspace-scoped 401s but missed non-workspace paths.
SkillsTab.tsx alone fetches /plugins and /plugins/sources, both
outside the /workspaces/<id>/* tree. Either of those 401s with the
tenant admin bearer in SaaS mode → canvas/src/lib/api.ts:62-74
redirects to AuthKit → page navigates away mid-test → next locator
times out.
Same failure signature observed at 16:03Z post-#2073 merge:
e2e/staging-tabs.spec.ts:45:7 › tab: skills
TimeoutError: locator.scrollIntoViewIfNeeded: Timeout 5000ms
- navigated to "https://scenic-pumpkin-83.authkit.app/?..."
Broaden the route to "**" with `request.resourceType() !== "fetch"`
short-circuit (preserves HTML/JS/CSS pass-through) and a
/cp/auth/me skip (the dedicated mock above wins). Same 401 →
empty-body conversion logic; just a wider net.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pass 3 review came back Approve with two optional polish items.
Both taken to fully converge the loop:
1. Regression test for the empty-stream wedge-clear gate (added in
3c4eef49). A degenerate stream that iterates without raising but
emits NEITHER an AssistantMessage NOR a ResultMessage must NOT
clear the wedge flag — pre-set wedge persists, the next heartbeat
still reports runtime_state="wedged". Pins the gate against
future regression.
2. Replaced the type annotation `"dict[str, callable[[dict], str]]"`
(lowercase `callable`, string-quoted) with the proper
`dict[str, Callable[[dict], str]]` using `Callable` from
`collections.abc`. Benign before (`from __future__ import
annotations` makes the annotation a string Python never
evaluates), but pyright/mypy may flag the lowercase form.
65 Python tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Round-2 review of the wedge/idle/progress bundle came back Approve
with 4 optional polish items. All taken:
1. Migration 043 down file gained `SET LOCAL lock_timeout = '5s'`
matching the up file. A rollback under the same load that
motivated the up-file guard would otherwise stall writers.
2. _clear_sdk_wedge_on_success now gates on actual stream content
(result_text or assistant_chunks). A degenerate "iterator
returned without raising but emitted nothing" case (possible
from a partial stream or stub SDK) no longer falsely advertises
recovery — only a real successful query (≥1 ResultMessage or
AssistantMessage TextBlock) clears the wedge.
3. isUpstreamBusyError dropped the redundant
`strings.Contains(msg, "context deadline exceeded")` fallback.
*url.Error.Unwrap propagates the typed sentinel since Go 1.13;
errors.Is(err, context.DeadlineExceeded) catches the real
net/http shape. The substring was a foot-gun (would also match
user-content with that phrase). Test fixture updated to use
`fmt.Errorf("Post: %w", context.DeadlineExceeded)` which
reflects what net/http actually returns.
4. TestIsUpstreamBusyError added a context.Canceled case (both
typed and wrapped via %w) — pins the new applyIdleTimeout
classification.
No critical/required findings on second pass; reviewer verdict was
Approve. Items above are polish for symmetry and test clarity.
1010 canvas + 64 Python + full Go suites pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bundle review of pieces 1/2/3 surfaced two critical issues plus a
handful of required + optional fixes. All addressed.
Critical:
1. Migration 043 was missing 'paused' and 'hibernated' from the
workspace_status enum. Both are real production statuses written
by workspace_restart.go (lines 283 and 406), introduced by
migration 029_workspace_hibernation. The original `USING
status::workspace_status` cast would have errored mid-transaction
on any production DB containing those values. Added both. Also
added `SET LOCAL lock_timeout = '5s'` so the migration aborts
instead of stalling the workspace fleet behind a slow SELECT.
2. The chat activity-feed window kept only 8 lines, and a single
multi-tool turn (Read 5 files + Grep + Bash + Edit + delegate)
easily flushed older context before the user could read it.
Extracted appendActivityLine to chat/activityLog.ts with a
20-line window AND consecutive-duplicate collapse (same tool
on the same target twice in a row is noise, not new progress).
5 unit tests pin the behavior.
Required:
3. The SDK wedge flag was sticky-only — a single transient
`Control request timeout` from a flaky network blip locked the
workspace into degraded for the whole process lifetime, even
when the next query() would have succeeded. Added
_clear_sdk_wedge_on_success(), called from _run_query's success
path. The next heartbeat after a working query reports
runtime_state empty and the platform recovers the workspace to
online without a manual restart. New regression test.
4. _report_tool_use now sets target_id = WORKSPACE_ID for self-
actions, matching the convention other self-logged activity
rows use. DB consumers joining on target_id see a well-defined
value instead of NULL.
Optional taken:
5. Tightened _WEDGE_ERROR_PATTERNS from "control request timeout"
to "control request timeout: initialize". The added ": initialize"
suffix keeps a future SDK error on an in-flight tool-call control
message from being misclassified as the unrecoverable post-init wedge.
6. Dropped the redundant "context canceled" substring fallback in
isUpstreamBusyError. errors.Is(err, context.Canceled) is the
typed check; the substring would also match healthy client-side
aborts, which we don't want classified as upstream-busy.
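The effect of the tightened pattern in item 5 can be sketched as below. This assumes the match is a case-insensitive substring test against the stringified exception; the helper name and the second error message are hypothetical.

```python
_WEDGE_ERROR_PATTERNS = ("control request timeout: initialize",)


def matches_wedge_pattern(exc: BaseException) -> bool:
    # Assumption: patterns are lowercase and compared case-insensitively.
    msg = str(exc).lower()
    return any(pattern in msg for pattern in _WEDGE_ERROR_PATTERNS)
```

With the old, shorter pattern the second case in the checks below (a made-up tool-call timeout message) would also have matched.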
Verified: 1010 canvas tests + 64 Python tests + full Go suite pass;
migration applies cleanly on dev DB with all 8 enum values; reverse
migration restores TEXT.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous canvas-default 5-min absolute deadline pre-empted any
chat that legitimately ran longer (multi-turn tool use, large
synthesis tasks) and made every wedged-SDK call burn 5 full minutes
before the user saw anything. Replaced with a per-dispatch idle
timeout: cancel the request only when the broadcaster has been
silent for `idleTimeoutDuration` (60s). Any progress event for the
workspace — agent_log tool-use rows, task_update, a2a_send,
a2a_receive — resets the clock.
Mechanics:
- new applyIdleTimeout helper subscribes to events.Broadcaster's
per-workspace SSE channel, drains its messages, resets a
time.Timer on each one, cancels the wrapped ctx when the timer
fires. The cleanup goroutine and subscription live only until
the returned cancel func is called.
- dispatchA2A now takes workspaceID as a parameter, applies the
idle timeout always (canvas + agent), and combines its cancel
with the existing 30-min agent-to-agent ceiling cancel into one
func the caller defers.
- Canvas dispatches no longer have an absolute ceiling at all;
the idle timer is the only "give up" signal. A healthy chat
reporting tool-use telemetry every few seconds is never cut off;
a wedged runtime fails in 60s instead of 5 min.
- isUpstreamBusyError now also recognises context.Canceled (the
error class our idle cancel produces, distinct from
DeadlineExceeded). Same 503-busy retry semantics.
Tests:
- TestApplyIdleTimeout_FiresOnSilence — 60ms idle, no events,
ctx cancels with context.Canceled.
- TestApplyIdleTimeout_ResetsOnEvent — event mid-window extends
the deadline; ctx alive past original deadline, then cancels
on the second silence window.
- TestApplyIdleTimeout_NilBroadcasterDegradesGracefully — defensive
no-op for paths that don't wire a broadcaster.
- 3 existing dispatchA2A tests updated for the new workspaceID
param + the always-non-nil cancel return shape.
This pairs with Piece 1's per-tool-use telemetry (166c7f77): the
broadcaster events that reset the idle timer ARE the agent_log
rows the workspace started emitting per tool call. So the same
event stream feeds both the chat progress feed AND the proxy's
deadline.
Full Go test suite passes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two halves of the same UX win — the user wants to see what Claude is
doing while a chat reply is in flight instead of staring at "0s" for
minutes.
Workspace side (claude_sdk_executor.py):
- The executor's _run_query message loop already iterated the SDK
stream for AssistantMessage.TextBlock content. Now also detects
ToolUseBlock / ServerToolUseBlock entries (by class name, since
the conftest stub doesn't define them) and fires-and-forgets a
POST /workspaces/:id/activity row of type agent_log per tool use.
- _summarize_tool_use maps the common tools (Read, Write, Edit,
Bash, Glob, Grep, WebFetch, WebSearch, Task, TodoWrite) to a
one-line summary with the file path / pattern / command, falling
back to "🛠 <tool>(…)" for anything else. Truncated at 200 chars.
- Posts directly to /workspaces/:id/activity rather than going
through a2a_tools.report_activity, which would also push a
/registry/heartbeat current_task and double-log as a TASK_UPDATED
line in the same chat feed.
- All failures swallowed silently — telemetry must not break
the conversation.
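A cut-down sketch of the summary mapping described above, not the executor's actual table. The input keys (`file_path`, `command`, `pattern`) and the Grep icon are assumptions; the Read/Bash formats and the fallback follow the examples in this message.

```python
_MAX_SUMMARY_CHARS = 200

# Tool name -> (format template, input key). Illustrative subset only.
_TOOL_TEMPLATES = {
    "Read": ("📄 Read {}", "file_path"),
    "Bash": ("⚡ Bash: {}", "command"),
    "Grep": ("🔎 Grep {}", "pattern"),
}


def summarize_tool_use(name: str, tool_input: dict) -> str:
    spec = _TOOL_TEMPLATES.get(name)
    if spec is None:
        summary = f"🛠 {name}(…)"  # fallback for unmapped tools
    else:
        template, key = spec
        summary = template.format(tool_input.get(key, ""))
    return summary[:_MAX_SUMMARY_CHARS]  # truncate long commands/paths
```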
Canvas side (ChatTab.tsx):
- The existing ACTIVITY_LOGGED handler streams a2a_send /
a2a_receive / task_update events into a sliding-window
activityLog state. Two issues fixed:
1. No `msg.workspace_id === workspaceId` filter — a sibling
workspace's a2a_send was leaking into the wrong chat
panel as "→ Delegating to X...". Added an early return.
2. No agent_log render branch. Added one that renders the
summary verbatim (the workspace already prefixed its
own emoji icon, so no double-icon).
- Existing 8-line sliding window keeps the UI scoped; older
progress lines naturally roll off as new ones arrive.
Result: when DD is delegating to Visual Designer + reading
config files + running Bash to lint, the spinner area shows:
📄 Read /configs/system-prompt.md
⚡ Bash: pnpm test
→ Delegating to Visual Designer...
← Visual Designer responded (47s)
instead of bare "0s · Processing with Claude Code..." for minutes.
63 Python tests + 58 canvas chat tests pass; tsc clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The staging-tabs E2E has been failing for 6+ hours on the same
locator timeout — diagnosed earlier today as the canvas's
lib/api.ts:62-74 redirect-on-401 path firing mid-test:
e2e/staging-tabs.spec.ts:45:7 › tab: skills
TimeoutError: locator.scrollIntoViewIfNeeded: Timeout 5000ms
- navigated to "https://scenic-pumpkin-83.authkit.app/?..."
Several side-panel tabs (Peers, Skills, Channels, Memory, Audit,
and anything workspace-scoped) hit endpoints under
`/workspaces/<id>/*` that require a workspace-scoped token, NOT
the tenant admin bearer the test uses. The endpoints respond 401
in SaaS mode. canvas/src/lib/api.ts:62-74 reacts to ANY 401 by
setting `window.location.href` to AuthKit — yanking the page off
the tenant origin mid-test.
The test comment at line 18 already acknowledged the 401 class
("Peers tab: 401 without workspace-scoped token") but assumed
those would surface as "errored content" rather than a hard
navigation. The redirect logic in api.ts was added later and
breaks the assumption.
Fix: add a Playwright route handler that catches any 401 from
`/workspaces/<id>/*` paths and replaces with `200 + empty body`.
Body shape is best-effort by URL — list endpoints (paths not
ending in a UUID-shaped segment) get `[]`, single-resource
endpoints get `{}`. Both are valid JSON and well-written panels
render an empty state for either rather than crashing.
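The actual handler lives in the Playwright spec (TypeScript). The body-shape rule alone can be sketched as follows, assuming "UUID-shaped" means the canonical 8-4-4-4-12 hex form; the function name and example paths are hypothetical.

```python
import re

_UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)


def stub_body_for(path: str) -> str:
    """Pick the JSON body used to replace a 401 on a workspace-scoped path."""
    path = path.split("?", 1)[0].rstrip("/")  # ignore query string
    last_segment = path.rsplit("/", 1)[-1]
    # Path ending in a UUID -> single resource; anything else -> list.
    return "{}" if _UUID_RE.match(last_segment) else "[]"
```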
The two route patterns (`/workspaces/...` and `/cp/auth/me`)
don't overlap — the existing `/cp/auth/me` mock continues to
gate AuthGate's session check independently.
Verification:
- Type-check passes (tsc clean for the spec; pre-existing errors
in unrelated test files unchanged)
- Can't run staging E2E locally without CP admin token; CI will
exercise the real path against the freshly-provisioned tenant
- E2E Staging SaaS (full lifecycle) is currently green at 08:07Z,
confirming the underlying staging infra works — the failures
have been narrowly in this Playwright-tabs spec
Targets staging per molecule-core convention.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Heartbeat lies. The asyncio task that POSTs /registry/heartbeat runs
independently of the SDK client, so a workspace whose claude_agent_sdk has
wedged on `Control request timeout: initialize` keeps reporting
"online" — every chat send hangs the full 5-min platform deadline
even though the runtime is dead in the water. This commit teaches
the workspace to admit it's wedged and the platform to honor that
admission by flipping status → degraded.
Five layers, all in one commit because they share a contract:
1. Migration 043 — convert workspaces.status from free-form TEXT to
a real `workspace_status` Postgres ENUM with the 6 values
production code actually writes (provisioning, online, offline,
degraded, failed, removed). Locks the value set; future typo
writes error at the DB instead of silently storing rogue strings.
Down migration reverts to TEXT and drops the type.
2. workspace-server/internal/models — `HeartbeatPayload` gains a
`runtime_state string` field. Empty = healthy. Currently the only
non-empty value the handler honors is "wedged"; future symptoms
can extend without another migration.
3. workspace-server/internal/handlers/registry.go — `evaluateStatus`
gains a wedge branch BEFORE the existing error_rate >= 0.5 path:
if `RuntimeState=="wedged"` and currently online, flip to
degraded and broadcast WORKSPACE_DEGRADED with the wedge sample
error. Recovery (`degraded → online`) now requires BOTH
error_rate < 0.1 AND runtime_state cleared, so a workspace still
reporting wedged stays degraded even when its error count
happens to be 0 (the wedge captures a runtime state, not an
error count).
4. workspace/claude_sdk_executor.py — module-level `_sdk_wedged_reason`
flag set when execute()'s catch block sees an error matching
`_WEDGE_ERROR_PATTERNS` (currently just "control request
timeout"). Sticky for the process lifetime; the SDK's internal
client-process state is corrupted on this error and only a
workspace restart (= new Python process = fresh module state)
clears it. Helpers `is_wedged()` / `wedge_reason()` /
`_reset_sdk_wedge_for_test()` exposed.
5. workspace/heartbeat.py — heartbeat body now layers on
`_runtime_state_payload()` for both the happy path and the
401-retry path. Lazy-imports claude_sdk_executor so non-Claude
runtimes (where the module may not even be importable) keep
working unchanged.
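The handler is Go; the transition rules from layer 3 can be sketched as a pure function. This assumes the pre-existing error_rate >= 0.5 path only applies to online workspaces and that "cleared" means an empty runtime_state; the function name is illustrative.

```python
def evaluate_status(current: str, error_rate: float, runtime_state: str) -> str:
    """Return the next workspace status for one heartbeat."""
    wedged = runtime_state == "wedged"
    if current == "online":
        if wedged:
            return "degraded"  # wedge branch runs before the error-rate path
        if error_rate >= 0.5:
            return "degraded"
    elif current == "degraded":
        # Recovery needs BOTH a low error rate and a cleared runtime_state.
        if error_rate < 0.1 and runtime_state == "":
            return "online"
    return current
```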
Canvas required no changes — `STATUS_CONFIG.degraded` was already
defined in design-tokens.ts (amber dot, "Degraded" label) and
WorkspaceNode.tsx already renders `lastSampleError` underneath the
status pill when status === "degraded". The existing wiring just
never fired because nothing was writing degraded in this code path.
Tests:
- 3 Go handler tests for the new transitions (online → degraded on
wedged, degraded stays put while still wedged, degraded → online
after wedge clears)
- 5 Python wedge-detector tests (default clean, mark sets flag,
sticky-first-wins, execute() flips on Control request timeout,
execute() does NOT flip on unrelated errors)
- Migration smoke-tested against the local dev DB (3 existing rows,
all enum-compatible; migration applied cleanly, post-state has
the column as workspace_status type and the index preserved)
Verified: 79 Python tests pass; full Go test suite passes; migration
applies cleanly on a real DB; reverse migration restores the column to
TEXT.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three required fixes from the bundle review of 391e1872:
1. workspace/a2a_client.py: substring `type_name in msg` could miss
the diagnostic prefix when an exception's message embedded a
different class name mid-string (e.g. `OSError("see ConnectionError
below")` → printed as plain msg, type lost). Switched to a
prefix-anchored check (`msg.startswith(f"{type_name}:")` etc.) so
the type label is always added when not already at the start of
the message.
2. workspace/a2a_tools.py: `activity_logs.error_detail` is unbounded
TEXT on the platform (handlers/activity.go does not validate
length). A buggy or hostile peer could stream arbitrarily large
error messages into the caller's activity log. Cap at 4096 chars
at the producer — comfortably above any real exception traceback,
well below an obvious-DoS threshold.
3. New regression test for JSON-RPC `code=0` — pins the
`code is not None` semantics so the code is preserved in the
detail rather than collapsing into the no-code path. Code=0 is
not one of the spec's predefined error codes, but a peer can still
emit it and we want it visible for diagnosis.
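Fixes 1 and 2 can be sketched roughly as below. Only the `startswith` check named above is shown (the real code has further variants), and both helper names are hypothetical.

```python
_MAX_ERROR_DETAIL_CHARS = 4096


def label_with_type(exc: Exception) -> str:
    """Prefix the message with the exception type unless it already leads."""
    type_name = type(exc).__name__
    msg = str(exc)
    if msg.startswith(f"{type_name}:"):
        return msg  # label already at the start; do not double it
    return f"{type_name}: {msg}"


def cap_error_detail(detail: str) -> str:
    # Producer-side cap so a peer cannot stream unbounded text into the log.
    return detail[:_MAX_ERROR_DETAIL_CHARS]
```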
Plus one optional taken: extracted the A2A-error → hint mapping into
canvas/src/components/tabs/chat/a2aErrorHint.ts. The two prior copies
(AgentCommsPanel.inferCauseHint + ActivityTab.inferA2AErrorHint) had
already drifted — Activity tab gained `not found`/`offline` cases the
chat panel never picked up, AgentCommsPanel handled empty-input
explicitly while Activity didn't. The shared module is the merged
superset, with 10 unit tests pinning each named pattern + the
"most specific first" ordering (Claude SDK wedge wins over generic
timeout).
Skipped (per analysis):
- Unicode-naive 120-char slice — Python str[:N] slices on code
points, not bytes. Safe.
- Nested [A2A_ERROR] confusion — non-issue per reviewer; outer
prefix winning still produces a structured render.
- MessagePreview + JsonBlock dual render on errors — intentional
drilldown; raw JSON is below the fold for operators who need it.
- console.warn dedup — refetches don't happen per-event so spam
risk is low.
- str(data)[:200] materialization — A2A response bodies aren't
typically MB-sized.
Verified: 1005 canvas tests pass (10 new hint tests); 10 Python
send_a2a_message tests pass (1 new for code=0); tsc clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Symptom: Activity tab and Agent Comms surfaced bare "[A2A_ERROR] "
(prefix + nothing) for failed delegations. Operator had no signal
to act on — no exception type, no target, no hint about what went
wrong, no next step. Fix is in three layers.
1. workspace/a2a_client.py — every error path now produces an
actionable detail string:
- except branch: some httpx exceptions (RemoteProtocolError,
ConnectionReset variants) stringify to "". Pre-fix the catch
was `f"{_A2A_ERROR_PREFIX}{e}"` → bare prefix. Now falls back
to `<TypeName> (no message — likely connection reset or silent
timeout)` and always appends `[target=<url>]` for traceability
in chained delegations.
- JSON-RPC error branch: previously dropped error.code on the
floor and printed "unknown" when message was missing. Now
surfaces both, including the well-defined "JSON-RPC error
with no message (code=N)" path.
- "neither result nor error" branch: pre-fix returned
str(payload) which the canvas rendered as a successful
response block. Now tagged as A2A_ERROR with a payload
snippet so downstream UI routes through the error path.
2. workspace/a2a_tools.py — tool_delegate_task now passes
error_detail (the stripped error message) through to the
activity-log POST. The platform's activity_logs.error_detail
column is the canvas's red error chip source; populating it
makes the failure visible in the row header without the user
having to expand into raw response_body JSON. The summary line
also gets a 120-char prefix of the cause so the collapsed row
reads "React Engineer failed: ConnectionResetError: ... [target=...]"
instead of "React Engineer failed".
3. canvas/src/components/tabs/ActivityTab.tsx — MessagePreview
now detects [A2A_ERROR]-prefixed bodies and renders a
structured error block (red chip, stripped detail, cause hint)
instead of the previous gray text-block that showed the literal
"[A2A_ERROR]" string. inferA2AErrorHint mirrors the patterns
from AgentCommsPanel.inferCauseHint so the same symptom reads
the same way in both surfaces (Claude SDK init wedge → restart
workspace; timeout → busy/stuck; connection-reset → transient
blip then check logs).
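A hedged sketch of the layer-1 detail strings, assuming the prefix constant is "[A2A_ERROR] ". The fallback wording is approximate, and `SilentError` is a hypothetical stand-in for an exception whose str() is empty.

```python
_A2A_ERROR_PREFIX = "[A2A_ERROR] "


def detail_from_exception(exc: Exception, target_url: str) -> str:
    msg = str(exc)
    if not msg:
        # Fall back to the type name so the detail is never a bare prefix.
        msg = f"{type(exc).__name__} (no message; likely connection reset or silent timeout)"
    return f"{_A2A_ERROR_PREFIX}{msg} [target={target_url}]"


def detail_from_jsonrpc_error(error: dict) -> str:
    code = error.get("code")
    message = error.get("message")
    if message and code is not None:
        detail = f"{message} (code={code})"
    elif message:
        detail = str(message)
    elif code is not None:  # `is not None` keeps code=0 visible
        detail = f"JSON-RPC error with no message (code={code})"
    else:
        detail = "JSON-RPC error with no message"
    return f"{_A2A_ERROR_PREFIX}{detail}"


class SilentError(Exception):
    """Hypothetical exception that stringifies to an empty string."""
```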
Tests: 9 send_a2a_message tests pass (including a new regression
test for the empty-stringifying-exception case that the user
reported); 995 canvas tests pass; tsc clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reported symptom: canvas edges show "1 call · just now" between two
agents, but the Agent Comms tab for the source workspace renders
"No agent-to-agent communications yet" — even though
GET /workspaces/<id>/activity?source=agent&limit=50 returns a2a_send
+ a2a_receive rows.
Confirmed via curl that the API does return the rows the panel
should map. The panel's load handler was the suspect, but it had:
.catch(() => setLoading(false))
which swallowed every failure path — network errors, JSON parse,
ANY throw inside the .then body — without leaving a single trace in
the console. The panel just sat on its empty state and gave the user
zero signal to act on. (And by extension, gave us nothing to debug
remotely either.)
Two changes:
1. Wrap the per-row `toCommMessage` call in a try/catch so one
malformed activity row (unexpected request_body shape, etc.)
doesn't throw out of the for-loop and skip the
setMessages(msgs) line. Previously the panel would silently
drop the entire batch when ANY row failed to parse.
2. Replace the bare `.catch(() => setLoading(false))` with a
logging variant. Now a future "panel stuck empty" report comes
with `AgentCommsPanel: load activity failed <err>` or
`AgentCommsPanel: failed to map activity row {...}` in the
console — diagnosable instead of opaque.
Behavior on the happy path is unchanged (5 existing tests still
pass; tsc clean). This is purely defensive: it makes the failure
path visible so the next stuck-empty report can be root-caused
instead of guessed at.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bundle-level review caught an implicit coupling in useCanvasViewport
between two distinct fit effects:
- settle fit: 1200ms one-shot when provisioning transitions to zero
(deploy just finished — settle on the whole org once)
- tracking fit: 500ms debounced per molecule:fit-deploying-org event
(track the org's bounds as children land during the deploy)
Both effects shared a single autoFitTimerRef, so each one's
clearTimeout call could silently cancel the other's pending fit.
Today's behavior happened to land in the right order by luck:
the tracking handler fires per-arrival during the deploy, then the
settle effect arms after the last child completes. But nothing in
the code enforces that ordering; a future refactor that, say,
fires the settle effect from the same event sequence as the
tracking timer (mid-deploy status flicker) would silently drop the
settle fit because the tracking timer's clearTimeout ran last.
Splitting into settleFitTimerRef + trackingFitTimerRef makes the
two effects fully independent. Cleanup clears both. Tests still pass
(995/995); the refactor is mechanical.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a janitor workflow that runs every hour and deletes any
e2e-prefixed staging org older than MAX_AGE_MINUTES (default 120).
Catches orgs left behind when per-test-run teardown didn't fire:
CI cancellation, runner crash, transient AWS error mid-cascade,
bash trap missed (signal 9), etc.
Why it exists despite per-run teardown:
- Per-run teardown is best-effort by definition. Any process death
after the test starts but before the trap fires leaves debris.
- GH Actions cancellation kills the runner with no grace period —
the workflow's `if: always()` step usually catches this but can
still fail on transient CP 5xx at the wrong moment.
- The CP cascade itself has best-effort branches today
(cascadeTerminateWorkspaces logs+continues on individual EC2
termination failures; DNS deletion same shape). Those need
cleanup-correctness work in the CP, but a safety net belongs in
CI either way — defense in depth.
Behaviour:
- Cron every hour. Manual workflow_dispatch with overrideable
max_age_minutes + dry_run inputs for one-off cleanups.
- Concurrency group prevents two sweeps fighting.
- SAFETY_CAP=50 — refuses to delete more than 50 orgs in a single
tick. If the CP admin endpoint goes weird and returns no
created_at (or returns no orgs at all), every e2e-* would look
stale; the cap catches the runaway-nuke case.
- DELETE is idempotent CP-side via org_purges.last_step, so a
half-deleted org from a prior sweep gets picked up cleanly on the
next tick.
- Per-org delete failures don't fail the workflow. Next hourly tick
retries. The workflow only fails loud at the safety-cap gate.
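The selection and safety-cap logic can be sketched as below; the workflow itself is not Python. The `slug` field name is an assumption, and treating a missing `created_at` as stale is inferred from the SAFETY_CAP rationale above.

```python
from datetime import datetime, timedelta, timezone

SAFETY_CAP = 50


def select_stale_orgs(orgs, now, max_age_minutes=120, safety_cap=SAFETY_CAP):
    """Return slugs of e2e-prefixed orgs older than the age limit."""
    cutoff = now - timedelta(minutes=max_age_minutes)
    stale = []
    for org in orgs:
        if not org["slug"].startswith("e2e-"):
            continue  # never touch non-test orgs
        created_at = org.get("created_at")
        # A missing timestamp looks stale; the cap below guards that case.
        if created_at is None or created_at < cutoff:
            stale.append(org["slug"])
    if len(stale) > safety_cap:
        raise RuntimeError(
            f"refusing to delete {len(stale)} orgs (cap {safety_cap})"
        )
    return stale


def sweep_is_refused(orgs, now) -> bool:
    try:
        select_stale_orgs(orgs, now)
    except RuntimeError:
        return True
    return False
```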
Tonight's specific motivation: ~10 canvas-tabs E2E retries in 2 hours
with various failure modes; each provisioned a fresh tenant + EC2 +
DNS + DB row. Some fraction leaked. Without this loop, ops has to
periodically run the manual sweep-cf-orphans.sh script. With it,
staging self-heals.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>