fix/s8-bind-loopback-dev
18 Commits
9ceda9d81f |
refactor(events): migrate 18 files to typed EventType constants (RFC #2945 PR-B-1)
Mechanical migration of bare event-name strings in BroadcastOnly / RecordAndBroadcast call sites to the typed constants from internal/events/types.go (RFC #2945 PR-B). Wire format unchanged (both shapes serialize to identical WSMessage.Event literals); pinned by TestAllEventTypes_IsSnapshot in #2965.

Migrated (18 files, scope: handlers/, scheduler/, registry/, bundle/, channels/):
- handlers/{approvals,a2a_proxy_helpers,a2a_queue,activity,agent,delegation,external_rotate,org_import,registry,workspace,workspace_bootstrap,workspace_crud,workspace_provision_shared,workspace_restart}.go
- channels/manager.go (caught by hostile-reviewer pass — initial scope missed channels/, found via grep on the post-migration tree)
- scheduler/scheduler.go
- registry/provisiontimeout.go
- bundle/importer.go

Hostile self-review (3 weakest spots, addressed)
------------------------------------------------
1. Missed call sites — initial scope omitted channels/. Post-migration `grep -rEn 'BroadcastOnly\([^,]+,[^,]*"[A-Z_]+"|RecordAndBroadcast\([^,]+,[^,]*"[A-Z_]+"' internal/` found 2 stragglers in channels/manager.go. Migrated. Final grep on the same pattern returns only the docstring example in types.go (intentional).
2. gofmt drift — auto-import injection produced non-canonical import ordering. `gofmt -w` applied ONLY to the 18 modified files (NOT the whole tree, to avoid sweeping unrelated pre-existing drift into this PR's diff). Three pre-existing un-gofmt'd files in handlers/ (a2a_proxy.go, a2a_proxy_test.go, a2a_queue_test.go) left as-is — they're unchanged by this PR and their drift predates it.
3. Wire format — paranoia check: do the constants serialize to the exact strings consumers (canvas TS, hermes plugin, anything parsing WSMessage.Event) expect? Yes. Pinned by the snapshot test. The migration is name-only; not a single character of wire output changes.

Verified
- go build ./... clean
- go vet ./internal/... clean
- gofmt -l on the 5 migrated package dirs: only pre-existing files
- Full tests: handlers/, channels/, scheduler/, registry/, events/, bundle/ all green (5 ok, 0 fail)

PR-B-2 (canvas TS mirror + cross-language parity gate) remains as the final piece of RFC #2945 PR-B. Tracked separately so this PR stays mechanical + reviewable.

Refs RFC #2945, PR #2965 (PR-B types).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
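A minimal sketch of what the migration looks like at one call site — the constant name and the broadcaster signature here are illustrative, not lifted from the tree:

```go
// Illustrative sketch only — the real constant set and the BroadcastOnly /
// RecordAndBroadcast signatures live in internal/events/types.go and the call
// sites; this shows the shape of the migration, not repo code.
package events

// EventType is the typed alias the call sites migrate to.
type EventType string

// WorkspaceProvisionFailed serializes to the same WSMessage.Event literal the
// bare string produced before, so the wire format is byte-for-byte unchanged.
const WorkspaceProvisionFailed EventType = "WORKSPACE_PROVISION_FAILED"

// Call-site shape, before → after (broadcaster signature assumed):
//
//	broadcaster.BroadcastOnly(wsID, "WORKSPACE_PROVISION_FAILED", payload)
//	broadcaster.BroadcastOnly(wsID, WorkspaceProvisionFailed, payload)
```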
be271aef8b |
fix(orphan-sweeper): exclude runtime='external' from stale-token revoke
The Docker-mode orphan sweeper was incorrectly targeting external runtime workspaces, revoking their auth tokens ~6 minutes after creation (one sweep cycle past the 5-min grace). External workspaces have NO local container by design — their agent runs off-host. The "no live container" predicate the sweep uses to detect wiped-volume orphans matches every external workspace unconditionally, which was killing the only auth credential the off-host agent has.

Reproducer: create a runtime=external workspace, paste the auth token into molecule-mcp / curl, wait 5 minutes. Next request returns `HTTP 401 — token may be revoked`. Platform log shows `Orphan sweeper: revoking stale tokens for workspace <id> (no live container; volume likely wiped)`.

Fix: add `AND w.runtime != 'external'` to the sweep's SELECT. The existing test regexes (third-pass query expectations + the shared expectStaleTokenSweepNoOp helper) are tightened to require the new predicate, so a regression that drops it fails CI immediately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
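A sketch of what the third pass's candidate SELECT looks like with the new predicate — parameter positions and join details are assumed; the named predicates are the ones the commits in this branch describe:

```go
// Sketch only: the real query lives in the orphan sweeper's third pass. Anything
// beyond the predicates named in these commits (revoked_at IS NULL, the
// COALESCE(last_used_at, created_at) staleness bound, the status filter, and the
// new runtime exclusion) is an assumption.
package registry

const staleTokenCandidatesQuery = `
SELECT DISTINCT t.workspace_id
  FROM workspace_auth_tokens t
  JOIN workspaces w ON w.id = t.workspace_id
 WHERE t.revoked_at IS NULL
   AND COALESCE(t.last_used_at, t.created_at) < now() - make_interval(secs => $1)
   AND w.status NOT IN ('removed', 'provisioning')
   AND w.runtime != 'external' -- the fix: external agents run off-host, so "no container" is expected
`
```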
0064f02c00 |
test(sweeper): integration coverage for manifest-override + accessor consolidation
Two follow-ups from PR #2494's review:

1. Two new sweep tests exercise the lookup path through sweepStuckProvisioning end-to-end:
   - ManifestOverrideSparesRow: claude-code 11min old, manifest=20min → no UPDATE, no broadcast (sparing works through the sweeper)
   - ManifestOverrideStillFlipsPastDeadline: claude-code 21min old, manifest=20min → flipped + payload.timeout_secs=1200
   Closes the gap the unit test on provisioningTimeoutFor alone left open: a future refactor could drop the lookup arg from the sweeper's call and the unit test alone would never catch it. Verified by regression-injecting `lookup→nil` in sweepStuckProvisioning — both new tests fail, the old ones still pass.

2. addProvisionTimeoutMs now goes through ProvisionTimeoutSecondsForRuntime instead of calling provisionTimeouts.get directly. Single accessor path for the same data — the canvas response and the sweeper now resolve identically by construction.

No production behavior change; tests + accessor cleanup only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
18edf88d59 |
fix(sweeper): honour template-manifest provision_timeout_seconds
Real wiring gap discovered while investigating issue #2486: a cluster of prod claude-code workspaces failed at exactly 10 min. The runtimeProvisionTimeoutsCache (#2054 phase 2) reads runtime_config.provision_timeout_seconds from each template's config.yaml so the **canvas** spinner respects per-template timeouts — but the **sweeper** in registry/provisiontimeout.go hardcoded 10 min (claude-code) / 30 min (hermes) and never consulted the manifest. So a template that declared a longer window had a UI that waited correctly but a sweeper that killed the row at the hardcoded floor anyway.

Resolution order pinned by the new TestProvisioningTimeout_ManifestOverride:
1. PROVISION_TIMEOUT_SECONDS env (ops-debug global override)
2. Template manifest lookup (per-runtime, beats the hermes default too)
3. Hermes default (30 min — CP bootstrap-watcher 25 min + 5 min slack)
4. DefaultProvisioningTimeout (10 min)

Wiring:
- registry: new RuntimeTimeoutLookup function type, threaded through StartProvisioningTimeoutSweep + sweepStuckProvisioning + the pre-existing provisioningTimeoutFor.
- handlers: ProvisionTimeoutSecondsForRuntime exposes the cache's lookup as a method so main.go can pass it without breaking the handlers→registry import direction.
- cmd/server/main.go: wire wh.ProvisionTimeoutSecondsForRuntime into the sweep boot.

Verified:
- go test -race ./... passes (every workspace-server package).
- Regression-injected the lookup arm: 3 manifest-override subcases fail with the actual-vs-expected gap, confirming the new test is load-bearing.
- The original two timeout tests (env-override, hermes default) keep passing — a `lookup=nil` argument preserves their semantics.

Operator action enabled: a template wanting a 15-min window can now just set `runtime_config.provision_timeout_seconds: 900` in its config.yaml and the sweeper honours it on the next workspace-server restart.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
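A minimal sketch of the four-step resolution order above, assuming a `func(runtime string) (time.Duration, bool)` lookup shape — the real signatures and constant names live in registry/provisiontimeout.go:

```go
// Sketch only — RuntimeTimeoutLookup and PROVISION_TIMEOUT_SECONDS come from the
// commit; everything else (names, env parsing details) is illustrative.
package registry

import (
	"os"
	"strconv"
	"time"
)

type RuntimeTimeoutLookup func(runtime string) (time.Duration, bool)

const (
	defaultProvisioningTimeout = 10 * time.Minute // step 4
	hermesProvisioningTimeout  = 30 * time.Minute // step 3: watcher's 25 min + 5 min slack
)

func provisioningTimeoutFor(runtime string, lookup RuntimeTimeoutLookup) time.Duration {
	// 1. Global ops-debug override wins for every runtime.
	if s := os.Getenv("PROVISION_TIMEOUT_SECONDS"); s != "" {
		if secs, err := strconv.Atoi(s); err == nil && secs > 0 {
			return time.Duration(secs) * time.Second
		}
	}
	// 2. Per-runtime template-manifest value (runtime_config.provision_timeout_seconds).
	if lookup != nil {
		if d, ok := lookup(runtime); ok && d > 0 {
			return d
		}
	}
	// 3. Hermes cold-boots slowly; give it the bootstrap-watcher window plus slack.
	if runtime == "hermes" {
		return hermesProvisioningTimeout
	}
	// 4. Everything else boots fast; 10 min is plenty.
	return defaultProvisioningTimeout
}
```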
fdf1b5d76a |
refactor(workspace-status): typed constants + AST-based drift gate
Eliminate raw 'awaiting_agent'/'hibernating'/'failed'/etc string literals from production status writes. Adds models.WorkspaceStatus typed alias and models.AllWorkspaceStatuses canonical slice; every UPDATE workspaces SET status = ... now passes a parameterized $N typed value rather than a hard-coded SQL literal.

Defense-in-depth follow-up to migration 046 (#2388): the Postgres enum type was missing 'awaiting_agent' + 'hibernating' for ~5 days because sqlmock regex matching cannot enforce live enum constraints. The drift gate is now a proper Go AST + SQL parser (no regex), asserting the codebase ⊆ migration enum and every const appears in the canonical slice. With status as a parameterized typed value, future enum mismatches fail at the SQL layer in tests, not silently in prod.

Test coverage: full suite passes with -race; drift gate green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
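A sketch of the typed-status shape this describes — constant names and slice contents beyond the statuses named in these commits are assumptions:

```go
// Sketch only — models.WorkspaceStatus and models.AllWorkspaceStatuses come from
// the commit; the individual constant names are illustrative.
package models

type WorkspaceStatus string

const (
	StatusProvisioning  WorkspaceStatus = "provisioning"
	StatusAwaitingAgent WorkspaceStatus = "awaiting_agent"
	StatusHibernating   WorkspaceStatus = "hibernating"
	StatusFailed        WorkspaceStatus = "failed"
	StatusRemoved       WorkspaceStatus = "removed"
)

// AllWorkspaceStatuses is the canonical slice the AST-based drift gate checks
// against the migration enum.
var AllWorkspaceStatuses = []WorkspaceStatus{
	StatusProvisioning, StatusAwaitingAgent, StatusHibernating, StatusFailed, StatusRemoved,
}

// Write sites pass the typed value as a parameter instead of inlining a SQL literal:
//
//	_, err := db.ExecContext(ctx,
//		`UPDATE workspaces SET status = $1 WHERE id = $2`,
//		StatusFailed, workspaceID)
```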
284511f02e |
feat(external): default external runtime to poll-mode + awaiting_agent
Paired molecule-core change for the molecule-cli `molecule connect` RFC (https://github.com/Molecule-AI/molecule-cli/issues/10). After this PR an `external`-runtime workspace's full lifecycle matches the operator-driven model: it boots in awaiting_agent, the CLI connects in poll mode without operator-side flag tuning, the heartbeat-loss path lands back on awaiting_agent (re-registrable) instead of the terminal-feeling 'offline'.

Two changes in workspace-server:

1) `resolveDeliveryMode` (registry.go) now reads `runtime` alongside `delivery_mode`. Resolution order:
   a. payload.delivery_mode if non-empty (operator override)
   b. row's existing delivery_mode if non-empty (preserves prior registration)
   c. **NEW:** "poll" if row.runtime = "external" — external operators run on laptops without public HTTPS; push-mode would hard-fail at validateAgentURL anyway. (`molecule connect` registers without --mode and expects this default.)
   d. "push" otherwise (historical default for platform-managed runtimes — langgraph, hermes, claude-code, etc.)

2) Heartbeat-loss for external workspaces lands them in `awaiting_agent` instead of `offline`. Two code paths:
   - `liveness.go` — Redis TTL expiration. Uses a CASE expression so the conditional is one UPDATE (no extra round-trip for non-external runtimes, no TOCTOU between runtime read and status write).
   - `healthsweep.go::sweepStaleRemoteWorkspaces` — DB-side last_heartbeat_at age scan. This sweep is already external-only by query filter, so the UPDATE just hard-codes the new status.
   The Docker-side `sweepOnlineWorkspaces` keeps `offline` — recovery there is "restart the container", not "re-register from the operator's box".

Why awaiting_agent over offline for external:
- Matches the status the workspace was created in (workspace.go:333).
- The CLI re-registers on every invocation; awaiting_agent → online is the natural transition. offline is a terminal-feeling status that implies operator intervention is needed.
- An operator who closed their laptop overnight should see awaiting_agent in canvas, not 'offline (something is wrong)'.

Test plan:
- Existing: 9 `resolveDeliveryMode` test sites updated to the new query shape. Sqlmock now reads `delivery_mode, runtime` columns.
- New: TestRegister_ExternalRuntime_DefaultsToPoll asserts the external→poll branch. TestRegister_NonExternalRuntime_StillDefaultsToPush guards against the new branch overshooting (langgraph keeps push).
- Liveness: regex updated to match the CASE expression.
- Healthsweep: `TestSweepStaleRemoteWorkspaces_MarksStaleAwaitingAgent` (renamed for grep-ability), Docker-side sweepOnlineWorkspaces test unchanged (verified to still match `'offline'`).
- Full handlers + registry suite green under -race (12.873s + 2.264s).

No migration needed — `status` is a free-form text column; both 'offline' and 'awaiting_agent' are existing values used elsewhere (workspace.go uses awaiting_agent on initial external creation).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
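A minimal sketch of the resolution order in change (1), with the row values passed as plain parameters and the function signature assumed:

```go
// Sketch of the a/b/c/d order above. The real resolveDeliveryMode reads the row
// via SQL (sqlmock now expects delivery_mode, runtime columns); this flattens it
// to parameters so only the decision logic remains.
package registry

func resolveDeliveryMode(payloadMode, rowMode, rowRuntime string) string {
	switch {
	case payloadMode != "": // a. operator override in the register payload
		return payloadMode
	case rowMode != "": // b. preserve a prior registration's mode
		return rowMode
	case rowRuntime == "external": // c. NEW: laptops without public HTTPS can't take push
		return "poll"
	default: // d. historical default for platform-managed runtimes
		return "push"
	}
}
```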
317196463a |
fix(orphan-sweeper): close TOCTOU race with issueAndInjectToken on restart
Independent code review caught a real bug in the previous commit's
stale-token revoke pass. The platform's restart endpoint
(workspace_restart.go:104) stops the workspace container synchronously,
then dispatches re-provisioning to a goroutine (line 173). For a
workspace that's been idle past the 5-minute grace window — extremely
common: user comes back to a long-idle workspace and clicks Restart —
this opens a race window:
1. Container stopped → ListWorkspaceContainerIDPrefixes returns no
entry → workspace becomes a stale-token candidate.
2. issueAndInjectToken runs in the goroutine: revokes old tokens,
issues a fresh one, writes it to /configs/.auth_token.
3. If the sweeper's predicate-only UPDATE
`WHERE workspace_id = $1 AND revoked_at IS NULL` runs AFTER
IssueToken commits but is racing the SELECT-then-UPDATE window,
it revokes the freshly-issued token alongside the old ones.
4. Container starts with a now-revoked token → 401 forever.
The fix carries the SAME staleness predicate from the SELECT into the
per-workspace UPDATE: a token created within the grace window can't
match `< now() - grace` and is automatically excluded. The operation
is now idempotent against fresh inserts.
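A sketch of the predicate-carrying UPDATE described here, with parameter positions and the surrounding sweeper code assumed:

```go
// Sketch only — the point is that the SELECT's staleness bound is repeated in the
// per-workspace UPDATE, so a token issued inside the grace window can never match.
package registry

const revokeStaleTokensForWorkspace = `
UPDATE workspace_auth_tokens
   SET revoked_at = now()
 WHERE workspace_id = $1
   AND revoked_at IS NULL
   AND COALESCE(last_used_at, created_at) < now() - make_interval(secs => $2)
`
```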
Also addresses other findings from the same review:
- Add `status NOT IN ('removed', 'provisioning')` to the SELECT
(R2 + first-line C1 defence). 'provisioning' is set synchronously
in workspace_restart.go before the async re-provision begins, so
it's a reliable in-flight signal that narrows the candidate set.
- Stop calling wsauth.RevokeAllForWorkspace from the sweeper —
that helper revokes EVERY live token unconditionally; the sweeper
needs "every STALE live token" which is a different (safer)
operation. Inline the UPDATE so we own the predicate end-to-end.
Drop the wsauth import (no longer needed in this package).
- Tighten expectStaleTokenSweepNoOp regex to anchor at start and
require the status filter, so a future query whose first line
coincidentally starts with "SELECT DISTINCT t.workspace_id" can't
silently absorb the helper's expectation (R3).
- Defensive `if reaper == nil { return }` at top of
sweepStaleTokensWithoutContainer — even though StartOrphanSweeper
already short-circuits on nil, a future refactor that wires this
pass directly without checking would otherwise mass-revoke in
CP/SaaS mode (F2).
- Comment in the function explaining why empty likes is intentionally
NOT a short-circuit (asymmetry with the first two passes is the
whole point — "no containers running" is the load-bearing case).
- Add TestSweepOnce_StaleTokenRevokeUsesStalenessPredicate that
asserts the UPDATE shape (predicate present, grace bound). A
real-Postgres integration test would prove the race resolution
end-to-end; this catches the regression where someone simplifies
the UPDATE back to predicate-only.
- Add TestSweepStaleTokens_NilReaperEarlyExit pinning the F2 guard.
Existing tests updated to match the new query/UPDATE shape with tight
regexes that pin all the safety guards (status filter, staleness
predicate in both SELECT and UPDATE).
Full Go suite green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3332e6878b |
fix(orphan-sweeper): revoke stale tokens for workspaces with no live container
Heals the user-reported "auth token conflict after volume wipe" failure
mode. When an operator nukes a workspace's /configs volume outside the
platform's restart endpoint (common via `docker compose down -v` or
manual cleanup scripts), the DB still holds live workspace_auth_tokens
for that workspace while the recreated container has an empty
/configs/.auth_token. Subsequent /registry/register calls 401 forever:
requireWorkspaceToken sees live tokens, container has no token to
present, and the workspace is permanently wedged until an operator
manually revokes via SQL.
The platform's restart endpoint already handles this correctly via
wsauth.RevokeAllForWorkspace inside issueAndInjectToken. This change
adds a third orphan-sweeper pass — sweepStaleTokensWithoutContainer —
as the safety net for the equivalent action taken outside the API.
Detection criterion: workspace has at least one live (non-revoked)
token whose most-recent activity (COALESCE(last_used_at, created_at))
is older than staleTokenGrace (5 minutes), AND no live Docker
container's name prefix matches the workspace ID.
Safety filters that bound the revoke radius:
1. Only runs in single-tenant Docker mode. The orphan sweeper is
wired only when prov != nil in cmd/server/main.go — CP/SaaS mode
never gets here, so an empty container list cannot be confused
with "no Docker at all" (which would otherwise revoke every
workspace's tokens in production SaaS).
2. staleTokenGrace = 5min skips tokens issued/used in the last
5 minutes. Bounds the race with mid-provisioning (token issued
moments before docker run completes) and brief restart windows
— a healthy workspace touches last_used_at every 30s heartbeat,
so 5min is 10× the heartbeat interval.
3. The query joins workspaces.status != 'removed' so deleted
workspaces are not revoked here (handled at delete time by the
explicit RevokeAllForWorkspace call).
4. make_interval(secs => $2) avoids a time.Duration.String() →
"5m0s" mismatch with Postgres interval grammar that I caught
during implementation.
5. Each revocation logs the workspace ID so operators can correlate
"workspace just lost auth" with this sweeper, not blame a
network blip.
Failure mode: revoke fails (transient DB error). Loop bails to avoid
log spam; next 60s cycle retries. Worst case a workspace stays
401-blocked an extra minute.
Tests: 5 new tests covering the headline scenario, the safety gate
(workspace with container is NOT revoked), revoke-failure-bails-loop,
query-error-non-fatal, and Docker-list-failure-skips-cycle. All 11
existing sweepOnce tests updated to register the new third-pass query
expectation via a small `expectStaleTokenSweepNoOp` helper that keeps
their existing assertions readable.
Full Go test suite green: registry, wsauth, handlers, and all other
packages.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4915d1d59e |
fix(orphan-sweeper): reap labeled containers with no DB row (wiped-DB)
The existing sweeper only reaps ws-* containers whose workspace row
has status='removed'. That misses the entire wiped-DB case: an
operator does `docker compose down -v` (kills the postgres volume),
the previous platform's ws-* containers keep running, the new
platform boots into an empty workspaces table — first pass finds
zero candidates and those containers leak forever. Symptom users
hit today: 7 ws-* containers from 11h ago, no rows in DB, no
visibility in Canvas, eating CPU + memory.
Fix shape:
1. Provisioner stamps every ws-* container + volume with
`molecule.platform.managed=true`. Without a label, the sweeper
would have to assume any unlabeled ws-* container might belong
to a sibling platform stack on a shared Docker daemon.
2. Provisioner exposes ListManagedContainerIDPrefixes — a label-filter
counterpart to the existing name-filter.
3. Sweeper splits sweepOnce into two independent passes:
- sweepRemovedRows (unchanged behavior; status='removed' only)
- sweepLabeledOrphansWithoutRows (new; labeled containers whose
workspace_id has no row in the table at all)
Each pass has its own short-circuit so an empty result or transient
error in one doesn't block the other — load-bearing because the
wiped-DB pass exists precisely for cases where the removed-row
pass finds nothing.
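A sketch of the new pass's shape — the reaper interface, the UUID pre-filter, and the no-row criterion are from this commit; method names, the query, and the reap call are assumptions:

```go
// Sketch only — the real pass is sweepLabeledOrphansWithoutRows in the registry
// package; plumbing here is illustrative.
package registry

import (
	"context"
	"database/sql"
	"regexp"
)

type OrphanReaper interface {
	// Label-filtered counterpart to the name-filtered listing
	// (containers stamped molecule.platform.managed=true).
	ListManagedContainerIDPrefixes(ctx context.Context) ([]string, error)
	StopAndRemoveVolume(ctx context.Context, workspaceID string) error
}

var uuidRe = regexp.MustCompile(`^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$`)

func sweepLabeledOrphansWithoutRows(ctx context.Context, db *sql.DB, reaper OrphanReaper) error {
	prefixes, err := reaper.ListManagedContainerIDPrefixes(ctx)
	if err != nil {
		return err // Docker error: skip this pass; the other pass and the next cycle are unaffected
	}
	for _, id := range prefixes {
		if !uuidRe.MatchString(id) {
			continue // non-UUID prefixes are filtered before they ever reach SQL
		}
		var exists bool
		if err := db.QueryRowContext(ctx,
			`SELECT EXISTS (SELECT 1 FROM workspaces WHERE id = $1)`, id).Scan(&exists); err != nil {
			return err
		}
		if !exists {
			_ = reaper.StopAndRemoveVolume(ctx, id) // wiped-DB orphan: reap container + volume
		}
	}
	return nil
}
```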
Safe under multi-platform-on-shared-daemon: only containers carrying
our label get reaped, sibling stacks' containers are invisible to this
pass. (For now the label is a constant string; a future per-instance
UUID layer can refine "ours" further if a real shared-daemon scenario
emerges.)
Migration: existing platforms running pre-PR builds have UNLABELED
ws-* containers. After this lands they continue to NOT be reaped by
the new path (no label = invisible). They'll only be cleaned via
manual intervention or once the operator recreates them — same as
today. No regression.
Tests cover all five branches of the new pass: happy-path reap,
no-reap when row exists, mixed reap-some-keep-some, Docker error
short-circuits cleanly, non-UUID prefixes get filtered before the
SQL query.
Pairs with PR #2122 (script-level fix). Together they close the
orphan-leak path for both `bash scripts/nuke-and-rebuild.sh` users
(handled by the script) AND `docker compose down -v` users (handled
by the runtime).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
d0f198b24f |
merge: resolve staging conflicts (a2a_proxy + workspace_crud)
Three files conflicted with staging changes that landed while this PR sat open. Resolved each by combining both intents (not picking one side):

- a2a_proxy.go: keep the branch's idle-timeout signature (workspaceID parameter + comment) AND apply staging's #1483 SSRF defense-in-depth check at the top of dispatchA2A. Type-assert h.broadcaster (now an EventEmitter interface per staging) back to *Broadcaster for applyIdleTimeout's SubscribeSSE call; falls through to no-op when the assertion fails (test-mock case).
- a2a_proxy_test.go: keep both new test suites — branch's TestApplyIdleTimeout_* (3 cases for the idle-timeout helper) AND staging's TestDispatchA2A_RejectsUnsafeURL (#1483 regression). Updated the staging test's dispatchA2A call to pass the workspaceID arg introduced by the branch's signature change.
- workspace_crud.go: combine both Delete-cleanup intents:
  * Branch's cleanupCtx detachment (WithoutCancel + 30s) so canvas hang-up doesn't cancel mid-Docker-call (the container-leak fix)
  * Branch's stopAndRemove helper that skips RemoveVolume when Stop fails (orphan sweeper handles)
  * Staging's #1843 stopErrs aggregation so Stop failures bubble up as 500 to the client (the EC2 orphan-instance prevention)
  Both concerns satisfied: cleanup runs to completion past canvas hangup AND failed Stop calls surface to caller.

Build clean, all platform tests pass.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
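A minimal sketch of the assertion-with-fallthrough used in the a2a_proxy.go resolution — everything beyond the EventEmitter / *Broadcaster names is scaffolding so the shape compiles on its own:

```go
// Sketch only — the real types and applyIdleTimeout live in the handlers package;
// the method set here is invented scaffolding, not repo API.
package handlers

type EventEmitter interface {
	Emit(event string, payload any)
}

type Broadcaster struct{}

func (b *Broadcaster) Emit(event string, payload any) {}

// SubscribeSSE stands in for the concrete-only method the idle-timeout helper needs.
func (b *Broadcaster) SubscribeSSE(workspaceID string) <-chan any { return nil }

func applyIdleTimeoutIfConcrete(emitter EventEmitter, workspaceID string) {
	// Staging widened the handler field to the EventEmitter interface; the idle-timeout
	// helper still needs *Broadcaster for SubscribeSSE. Assert and fall through: test
	// mocks that only satisfy EventEmitter make this a no-op.
	if b, ok := emitter.(*Broadcaster); ok {
		_ = b.SubscribeSSE(workspaceID)
	}
}
```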
be1beff4a0 |
fix(registry): runtime-aware provision-timeout sweep — give hermes 30 min
Pre-fix: workspace-server's provision-timeout sweep was hardcoded at 10 min for all runtimes. The CP-side bootstrap-watcher (cp#245) correctly gives hermes 25 min for cold-boot (hermes installs include apt + uv + Python venv + Node + hermes-agent — 13–25 min on slow apt mirrors is normal). The two timeout systems disagreed: the watcher would happily wait 25 min, but the workspace-server's 10-min sweep killed healthy hermes boots mid-install at 10 min and marked them failed.

Today's example: #2061's E2E run on 2026-04-26 at 08:06:34Z created a hermes workspace, EC2 cloud-init was visibly making progress on apt-installs (libcjson1, libmbedcrypto7t64) when the sweep flipped status to 'failed' at 08:17:00Z (10:26 elapsed). The test threw "Workspace failed: " (empty error from sql.NullString serialization) and CI failed on a healthy boot.

Fix: provisioningTimeoutFor(runtime) — same shape as the CP's bootstrapTimeoutFn:
- hermes: 30 min (watcher's 25 min + 5 min slack)
- others: 10 min (unchanged — claude-code/langgraph/etc. boot in <5 min, 10 min is plenty)

PROVISION_TIMEOUT_SECONDS env override still works (applies to all runtimes — operators who care about the runtime distinction shouldn't use the override anyway).

Sweep query change: pulls (id, runtime, age_sec) per row instead of pre-filtering by age in SQL. Per-row Go evaluation picks the correct timeout. Slightly more rows scanned but bounded by the status='provisioning' partial index — workspaces in flight, not historical.

Tests:
- TestProvisioningTimeout_RuntimeAware — locks in the per-runtime mapping
- TestSweepStuckProvisioning_HermesGets30MinSlack — hermes at 11 min must NOT be flipped
- TestSweepStuckProvisioning_HermesPastDeadline — hermes at 31 min IS flipped, payload includes runtime
- Existing tests updated for the new query shape

Verified:
- go build ./... clean
- go vet ./... clean
- go test ./... all green

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
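A sketch of the per-row evaluation this describes — the query text, column names, and the flip callback are assumptions; the per-runtime mapping is the one above:

```go
// Sketch only — the real sweep is sweepStuckProvisioning; the point is pulling
// (id, runtime, age_sec) per row and choosing the deadline in Go instead of
// pre-filtering by a single hardcoded age in SQL.
package registry

import (
	"context"
	"database/sql"
	"time"
)

func sweepStuckProvisioningOnce(ctx context.Context, db *sql.DB, flipToFailed func(id, runtime string, timeoutSecs int)) error {
	rows, err := db.QueryContext(ctx, `
		SELECT id, runtime, EXTRACT(EPOCH FROM now() - created_at) AS age_sec
		  FROM workspaces
		 WHERE status = 'provisioning'`)
	if err != nil {
		return err
	}
	defer rows.Close()

	for rows.Next() {
		var id, runtime string
		var ageSec float64
		if err := rows.Scan(&id, &runtime, &ageSec); err != nil {
			return err
		}
		// Per-row Go evaluation picks the deadline: hermes gets the 30-min window,
		// everything else keeps the 10-min default.
		timeout := 10 * time.Minute
		if runtime == "hermes" {
			timeout = 30 * time.Minute
		}
		if ageSec >= timeout.Seconds() {
			flipToFailed(id, runtime, int(timeout.Seconds()))
		}
	}
	return rows.Err()
}
```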
b47a1b87b0 |
chore: refresh stale orphan-sweeper Stop-failure comment
Convergence-pass review noted the comment at orphan_sweeper.go:171
still describes the pre-cb126014 contract ("Stop returns nil even
when container is gone, but a future change could surface real
errors"). The future is now — Stop does surface real errors today.
Tightened the comment to match the live contract:
isContainerNotFound is treated as success, anything else returns
the wrapped Docker error, sweeper retries on the next cycle.
Pure comment change, no behavior diff.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cb12601414 |
fix(platform): make Provisioner.Stop return real errors so cleanup gates fire
Review caught a critical issue with
12c4918318 |
fix(platform): stop leaking workspace containers on delete
Symptom: deleting workspaces from the canvas marked DB rows
status='removed' but left Docker containers running indefinitely.
After a session of org imports + cancellations, we counted 10
running ws-* containers all backed by 'removed' DB rows, eating
~1100% CPU on the Docker VM.
Two compounding bugs in handlers/workspace_crud.go's delete cascade:
1. The cleanup loop used `c.Request.Context()` for the Docker
stop/remove calls. When the canvas's `api.del` resolved on the
platform's 200, gin cancelled the request ctx — and any in-flight
Docker call cancelled with `context canceled`, leaving the
container alive. Old logs:
"Delete descendant <id> volume removal warning:
... context canceled"
2. `provisioner.Stop`'s error return was discarded and `RemoveVolume`
ran unconditionally afterward. When Stop didn't actually kill the
container (transient daemon error, ctx cancellation as in #1), the
volume removal would predictably fail with "volume in use" and
the container kept running with the volume mounted. Old logs:
"Delete descendant <id> volume removal warning:
Error response from daemon: remove ... volume is in use"
Fix layered in two parts:
- workspace_crud.go: detach cleanup with `context.WithoutCancel(ctx)`
+ a 30s bounded timeout. Stop's error is now checked and on
failure we skip RemoveVolume entirely (the orphan sweeper below
catches what we deferred).
- New registry/orphan_sweeper.go: periodic reconcile pass (every 60s,
initial run on boot). Lists running ws-* containers via Docker name
filter, intersects with DB rows where status='removed', stops +
removes volumes for the leaks. Defence in depth — even a brand-new
Stop failure mode heals on the next sweep instead of leaking
forever.
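A sketch of the detached-cleanup shape in the first part of the fix — helper names and the provisioner interface are assumptions; `context.WithoutCancel` plus the bounded timeout and the skip-volume-on-Stop-failure behavior are what this commit describes:

```go
// Sketch only — the real code lives in handlers/workspace_crud.go's delete cascade.
package handlers

import (
	"context"
	"log"
	"time"
)

type provisioner interface {
	Stop(ctx context.Context, workspaceID string) error
	RemoveVolume(ctx context.Context, workspaceID string) error
}

func cleanupWorkspace(reqCtx context.Context, prov provisioner, workspaceID string) {
	// Detach from the request context so canvas hanging up (gin cancelling reqCtx)
	// can't cancel an in-flight Docker call; still bound the work to 30s.
	cleanupCtx, cancel := context.WithTimeout(context.WithoutCancel(reqCtx), 30*time.Second)
	defer cancel()

	if err := prov.Stop(cleanupCtx, workspaceID); err != nil {
		// Don't try to remove a volume that may still be mounted; the orphan
		// sweeper reconciles this workspace on its next 60s cycle.
		log.Printf("delete %s: stop failed, deferring to orphan sweeper: %v", workspaceID, err)
		return
	}
	if err := prov.RemoveVolume(cleanupCtx, workspaceID); err != nil {
		log.Printf("delete %s: volume removal warning: %v", workspaceID, err) // non-fatal
	}
}
```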
Provisioner gains a tiny ListWorkspaceContainerIDPrefixes helper
that wraps ContainerList with the `name=ws-` filter; the sweeper
takes an OrphanReaper interface (matches the ContainerChecker
pattern in healthsweep.go) so unit tests don't need a real Docker
daemon.
main.go wires the sweeper alongside the existing liveness +
health-sweep + provisioning-timeout monitors, all under
supervised.RunWithRecover so a panic restarts the goroutine.
6 new sweeper tests cover the reconcile path, the
no-running-containers short-circuit, the daemon-error skip, the
Stop-failure-leaves-volume invariant (the same trap that motivated
this fix), the volume-remove-error-is-non-fatal continuation,
and the nil-reaper no-op.
Verified: full Go test suite passes; manually purged the 10 leaked
containers + their orphan volumes from the dev host with `docker
rm -f` + `docker volume rm` (one-off cleanup; the sweeper would
have caught them on the next cycle once deployed).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ec52d155f4 |
fix(sweeper): emit WORKSPACE_PROVISION_FAILED so canvas updates UI
The provision-timeout sweeper was emitting a new WORKSPACE_PROVISION_TIMEOUT event type, but the canvas event handler (canvas-events.ts:234) only has a case for WORKSPACE_PROVISION_FAILED — the sweep's event fell through silently. DB was being marked 'failed' but the UI stayed on 'starting' indefinitely until the user hard-refreshed.

Reusing the existing event name keeps the UI reaction uniform across both fail paths (runtime-crash via bootstrap-watcher and boot-timeout via sweeper). Operators who need to distinguish can read the `source` payload field — "bootstrap_watcher" vs "provision_timeout_sweep".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fcd3a6eaf0 |
fix(test): align ssrf_test.go localhost test cases with isSafeURL behaviour (#1192)
* feat(canvas): rewrite MemoryInspectorPanel to match backend API

  Issue #909 (chunk 3 of #576). The existing MemoryInspectorPanel used the wrong API endpoint (/memory instead of /memories) and wrong field names (key/value/version instead of id/content/scope/namespace/created_at). It also lacked LOCAL/TEAM/GLOBAL scope tabs and a namespace filter.

  Changes:
  - Fix endpoint: GET /workspaces/:id/memories with ?scope= query param
  - Fix MemoryEntry type to match actual API: id, content, scope, namespace, created_at, similarity_score
  - Add LOCAL/TEAM/GLOBAL scope tabs
  - Add namespace filter input
  - Remove Edit functionality (no update endpoint in backend)
  - Delete uses DELETE /workspaces/:id/memories/:id (by id, not key)
  - Full rewrite of 27 tests to match new API and UI structure
  - Uses ConfirmDialog (not native dialogs) for delete confirmation
  - All dark zinc theme (no light colors)

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: tighten types + improve provision-timeout message (#1135, #1136)

  #1135 — TypeScript: make BudgetData.budget_used and WorkspaceMetrics fields optional to match actual partial-response shapes from provisioning-stuck workspaces. Runtime already guarded with ?? 0.

  #1136 — provisiontimeout.go: replace the misleading "check required env vars" hint (preflight catches that case upfront) with an accurate message about the container starting but failing to call /registry/register.

  🤖 Generated with [Claude Code](https://claude.com/claude-code)

* fix(test): align ssrf_test.go localhost test cases with isSafeURL behaviour

  isSafeURL blocks 127.0.0.1 via ip.IsLoopback() even in dev environments. The test cases with `wantErr: false` for localhost were incorrect — the test would fail when go test runs. Fix by changing wantErr to true for both localhost test cases.

  Rationale: loopback blocking at this layer is intentional. Access control is enforced by WorkspaceAuth + CanCommunicate at the A2A routing layer, not by the URL validation. Opening this would widen the SSRF attack surface without adding real dev flexibility.

  Closes: ssrf_test.go inconsistency reported 2026-04-21

  Co-Authored-By: Claude Sonnet 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Molecule AI Core-UIUX <core-uiux@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
c3f7447e86 |
fix: harden stuck-provisioning UX — details crash, preflight, sweeper
Workspaces stuck in status='provisioning' previously surfaced in three
bad ways:
1. **Details tab crashed** with `Cannot read properties of undefined
(reading 'toLocaleString')`. `BudgetSection` + `WorkspaceUsage`
assumed full response shapes but a provisioning-stuck workspace
returns partial `{}`. Guard each deep field with `?? 0` and cover
the partial-response case with regression tests.
2. **Missing required env vars failed silently** 15+ minutes later as
a cosmetic "Provisioning Timeout" banner. The in-container preflight
catches them but by then the container has already crashed without
calling /registry/register, so the workspace sat in 'provisioning'
forever. Mirror the preflight server-side: parse config.yaml's
`runtime_config.required_env` before launch, fail fast with a
WORKSPACE_PROVISION_FAILED event naming the missing vars.
3. **No backend timeout** ever flipped a stuck workspace to 'failed'.
Add a registry sweeper (10m default, env-overridable) that detects
workspaces stuck past the window, flips them to 'failed', and emits
WORKSPACE_PROVISION_TIMEOUT. Race-safe: the UPDATE re-checks the
status + age predicate so a concurrent register/restart wins.
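A sketch of the race-safe flip in (3) — column names and the age bound are assumptions; the load-bearing part is that the UPDATE repeats the status + age predicate so a concurrent register/restart wins:

```go
// Sketch only — the real sweep query lives in the registry package; $2 carries the
// (env-overridable) timeout in seconds.
package registry

const flipStuckProvisioningQuery = `
UPDATE workspaces
   SET status = 'failed'
 WHERE id = $1
   AND status = 'provisioning'
   AND created_at < now() - make_interval(secs => $2)
`
```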
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
d8026347e5 |
chore: open-source restructure — rename dirs, remove internal files, scrub secrets
Renames: - platform/ → workspace-server/ (Go module path stays as "platform" for external dep compat — will update after plugin module republish) - workspace-template/ → workspace/ Removed (moved to separate repos or deleted): - PLAN.md — internal roadmap (move to private project board) - HANDOFF.md, AGENTS.md — one-time internal session docs - .claude/ — gitignored entirely (local agent config) - infra/cloudflare-worker/ → Molecule-AI/molecule-tenant-proxy - org-templates/molecule-dev/ → standalone template repo - .mcp-eval/ → molecule-mcp-server repo - test-results/ — ephemeral, gitignored Security scrubbing: - Cloudflare account/zone/KV IDs → placeholders - Real EC2 IPs → <EC2_IP> in all docs - CF token prefix, Neon project ID, Fly app names → redacted - Langfuse dev credentials → parameterized - Personal runner username/machine name → generic Community files: - CONTRIBUTING.md — build, test, branch conventions - CODE_OF_CONDUCT.md — Contributor Covenant 2.1 All Dockerfiles, CI workflows, docker-compose, railway.toml, render.yaml, README, CLAUDE.md updated for new directory names. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |