molecule-core

Author	SHA1	Message	Date
Hongming Wang	4e90f3f5b7	Merge pull request #2081 from Molecule-AI/fix/peers-q-filter-1038 fix(discovery): apply ?q= filter to Peers list (#1038)	2026-04-26 09:21:44 +00:00
Hongming Wang	c07a71523b	Merge pull request #2083 from Molecule-AI/feat/runtime-pin-compat-gate-cp253 test(ci): runtime + a2a-sdk pin compatibility gate (controlplane#253)	2026-04-26 09:21:42 +00:00
Hongming Wang	b232015eee	Merge pull request #2085 from Molecule-AI/test/compliance-default-2059 test(config): lock ComplianceConfig default to owasp_agentic (#2059)	2026-04-26 09:21:41 +00:00
Hongming Wang	966821b7d8	Merge pull request #2086 from Molecule-AI/fix/provisioner-nil-guards-1813 fix(provisioner): nil guards on Stop/IsRunning, unblock contract tests (closes #1813)	2026-04-26 09:20:22 +00:00
Hongming Wang	48b494def3	fix(provisioner): nil guards on Stop/IsRunning, unblock contract tests (closes #1813) Both backends panicked when called on a zero-valued or nil receiver: Provisioner.{Stop,IsRunning} dereferenced p.cli; CPProvisioner.{Stop, IsRunning} dereferenced p.httpClient. The orphan sweeper and shutdown paths can call these speculatively where the receiver isn't fully wired — the panic crashed the goroutine instead of the caller seeing a clean error. Three changes: 1. Add ErrNoBackend (typed sentinel) and nil-guard the four methods. - Provisioner.{Stop,IsRunning}: guard p == nil || p.cli == nil at the top. - CPProvisioner.Stop: guard p == nil up top, then httpClient nil AFTER resolveInstanceID + empty-instance check (the empty instance_id path doesn't need HTTP and stays a no-op success even on zero-valued receivers — preserved historical contract from TestIsRunning_EmptyInstanceIDReturnsFalse). - CPProvisioner.IsRunning: same shape — empty instance_id stays (false, nil); httpClient-nil with non-empty instance_id returns ErrNoBackend. 2. Flip the t.Skip on TestDockerBackend_Contract + TestCPProvisionerBackend_Contract — both contract tests run now that the panics are gone. Skipped scenarios were the regression guard for this fix. 3. Add TestZeroValuedBackends_NoPanic — explicit assertion that zero-valued and nil receivers return cleanly (no panic). Docker backend always returns ErrNoBackend on zero-valued; CPProvisioner may return (false, nil) when the DB-lookup layer absorbs the case (no instance to query → no HTTP needed). Both are acceptable per the issue's contract — the gate is no-panic. Tests: - 6 sub-cases across the new TestZeroValuedBackends_NoPanic - TestDockerBackend_Contract + TestCPProvisionerBackend_Contract now run their 2 scenarios (4 sub-cases each) - All existing provisioner tests still green - go build ./... + go vet ./... + go test ./... clean Closes drift-risk #6 in docs/architecture/backends.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 02:17:51 -07:00
rabbitblood	4a4a740804	refactor(test_config): parametrize the 3 yaml-default cases (simplify on #2085) Collapses test_compliance_default_when_yaml_omits_block, _when_yaml_block_is_empty, _explicit_optout_still_works into one parametrized test_compliance_default_via_load_config with three ids (yaml_omits_block, yaml_block_empty, yaml_explicit_optout). The dataclass-default test stays separate (no tmp_path needed). Coverage and assertions identical; net -19 lines, same 4 logical cases. prompt_injection check moves out of per-case to a single tail-assert since no payload overrode it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 02:03:59 -07:00
rabbitblood	577294b8f4	test(config): lock ComplianceConfig default to owasp_agentic (#2059) PR #2056 flipped ComplianceConfig.mode default from "" to "owasp_agentic" so every shipped template gets prompt-injection detection + PII redaction by default. The flip is correct + already shipping, but no test asserts the new default — a silent revert (or a refactor that reintroduces the old "" default) would pass workspace/tests/ and ship a workspace with compliance silently off. Add 4 regression tests: - test_compliance_dataclass_default — ComplianceConfig() with no args returns mode='owasp_agentic' + prompt_injection='detect' - test_compliance_default_when_yaml_omits_block — load_config on a yaml without `compliance:` key still produces owasp_agentic - test_compliance_default_when_yaml_block_is_empty — load_config on `compliance: {}` (a common shape during template editing) still produces owasp_agentic; covers the load_config() `.get("mode", "owasp_agentic")` default-fill path - test_compliance_explicit_optout_still_works — `mode: ""` in yaml must disable compliance (the documented opt-out path) 23/23 tests pass locally (4 new + 19 existing). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 02:01:57 -07:00
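The default-fill behaviour those four tests lock in can be sketched as below. This is a minimal stand-in rather than the repository's code: `ComplianceConfig` is reduced to the two fields the message names, and `compliance_from_raw` is a hypothetical helper that takes an already-parsed YAML mapping in place of the real `load_config`.

```python
from dataclasses import dataclass


@dataclass
class ComplianceConfig:
    # Defaults as described in the commit message above.
    mode: str = "owasp_agentic"
    prompt_injection: str = "detect"


def compliance_from_raw(raw: dict) -> ComplianceConfig:
    """Hypothetical stand-in for the compliance part of load_config()."""
    # A missing key and an empty `compliance: {}` block both become {}.
    block = raw.get("compliance") or {}
    return ComplianceConfig(
        # .get() only fills the default when the key is absent, so an
        # explicit `mode: ""` survives as the opt-out value.
        mode=block.get("mode", "owasp_agentic"),
        prompt_injection=block.get("prompt_injection", "detect"),
    )
```

The design point is that the default lives in two places (the dataclass and the `.get` fallback), which is why the message tests both paths separately.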
rabbitblood	5ce7af2d2c	fix(ci): set WORKSPACE_ID for the runtime-pin smoke import platform_auth.py validates WORKSPACE_ID at module load — EC2 user-data sets it from cloud-init, but the CI smoke-test was missing it and failed with 'WORKSPACE_ID is empty'. Set a placeholder UUID so the import gate exercises only the dep-resolution path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 01:59:56 -07:00
Hongming Wang	38fead35b4	Merge pull request #2084 from Molecule-AI/fix/provision-timeout-runtime-aware fix(registry): runtime-aware provision-timeout sweep — give hermes 30 min	2026-04-26 08:46:35 +00:00
Hongming Wang	be1beff4a0	fix(registry): runtime-aware provision-timeout sweep — give hermes 30 min Pre-fix: workspace-server's provision-timeout sweep was hardcoded at 10 min for all runtimes. The CP-side bootstrap-watcher (cp#245) correctly gives hermes 25 min for cold-boot (hermes installs include apt + uv + Python venv + Node + hermes-agent — 13–25 min on slow apt mirrors is normal). The two timeout systems disagreed: the watcher would happily wait 25 min, but the workspace-server's 10-min sweep killed healthy hermes boots mid-install at 10 min and marked them failed. Today's example: #2061's E2E run on 2026-04-26 at 08:06:34Z created a hermes workspace, EC2 cloud-init was visibly making progress on apt-installs (libcjson1, libmbedcrypto7t64) when the sweep flipped status to 'failed' at 08:17:00Z (10:26 elapsed). The test threw "Workspace failed: " (empty error from sql.NullString serialization) and CI failed on a healthy boot. Fix: provisioningTimeoutFor(runtime) — same shape as the CP's bootstrapTimeoutFn: - hermes: 30 min (watcher's 25 min + 5 min slack) - others: 10 min (unchanged — claude-code/langgraph/etc. boot in <5 min, 10 min is plenty) PROVISION_TIMEOUT_SECONDS env override still works (applies to all runtimes — operators who care about the runtime distinction shouldn't use the override anyway). Sweep query change: pulls (id, runtime, age_sec) per row instead of pre-filtering by age in SQL. Per-row Go evaluation picks the correct timeout. Slightly more rows scanned but bounded by the status='provisioning' partial index — workspaces in flight, not historical. Tests: - TestProvisioningTimeout_RuntimeAware — locks in the per-runtime mapping - TestSweepStuckProvisioning_HermesGets30MinSlack — hermes at 11 min must NOT be flipped - TestSweepStuckProvisioning_HermesPastDeadline — hermes at 31 min IS flipped, payload includes runtime - Existing tests updated for the new query shape Verified: - go build ./... clean - go vet ./... clean - go test ./... 
all green Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 01:44:09 -07:00
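The per-runtime timeout and per-row evaluation described above can be sketched as follows. The real `provisioningTimeoutFor` is Go; this Python version only mirrors the logic. The function names, the `env` parameter, the strict `>` comparison, and the way the override string is parsed are all assumptions made for illustration.

```python
import os

DEFAULT_TIMEOUT_SEC = 10 * 60  # non-hermes runtimes, unchanged
HERMES_TIMEOUT_SEC = 30 * 60   # watcher's 25 min + 5 min slack


def provisioning_timeout_for(runtime: str, env=None) -> int:
    """Return the provision timeout in seconds for one runtime."""
    env = os.environ if env is None else env
    override = env.get("PROVISION_TIMEOUT_SECONDS", "")
    # Assumed parsing: a positive integer override applies to every runtime.
    if override.isdigit() and int(override) > 0:
        return int(override)
    return HERMES_TIMEOUT_SEC if runtime == "hermes" else DEFAULT_TIMEOUT_SEC


def stuck_ids(rows, env=None):
    """rows: iterable of (id, runtime, age_sec); returns ids past their deadline."""
    return [
        wid
        for wid, runtime, age_sec in rows
        if age_sec > provisioning_timeout_for(runtime, env)
    ]
```

Evaluating the deadline per row is what lets a hermes workspace at 11 minutes survive a sweep that flips a langgraph workspace of the same age.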
rabbitblood	b817251c85	refactor(ci): apply simplify findings on #2083 Review of the runtime-pin-compat workflow: - Add merge_group trigger so when this becomes a required check the queue green-checks it (mirrors ci.yml convention). - Cache pip on workspace/requirements.txt — actions/setup-python@v5 with cache: pip + cache-dependency-path. Saves ~30s per fire. - Document the load-bearing install order: runtime FIRST so pip honors the runtime's declared a2a-sdk constraint (the surface that broke 2026-04-24); workspace/requirements.txt SECOND so a2a-sdk is upgraded to the runtime image's pinned version. Import smoke validates the upgraded combination. Skipped: branch-protection wiring (separate ops decision, not in scope here); ci.yml integration (the standalone schedule trigger is the load-bearing reason to keep this workflow separate). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 01:32:56 -07:00
Hongming Wang	c4681c335e	Merge pull request #2082 from Molecule-AI/fix/workspace-delete-propagate-stop-errors-1843 fix(workspace-crud): propagate Stop errors on delete (closes #1843)	2026-04-26 08:31:28 +00:00
rabbitblood	9b42a5e311	test(ci): runtime + a2a-sdk pin compatibility gate (controlplane#253) Closes Molecule-AI/molecule-controlplane#253. Prevents recurrence of the 5-hour staging outage from 2026-04-24: molecule-ai-workspace-runtime 0.1.13 declared `a2a-sdk<1.0` in its metadata but actually imported `a2a.server.routes` (1.0+ only). pip resolved successfully; every tenant workspace crashed at import. The canary tenant ultimately caught it but only after 5 hours of degraded staging. PR #249 fixed the version pin manually; nothing automated catches the same class of bug for the next release. This workflow: - Installs molecule-ai-workspace-runtime fresh from PyPI in a Python 3.11 venv (mirrors EC2 user-data install pattern) - Layers in workspace/requirements.txt (the runtime image's actual dep set, including the a2a-sdk[http-server]>=1.0,<2.0 pin) - Runs `from molecule_runtime.main import main_sync` — same import the runtime entrypoint does - Fails CI if pip resolution silently produced a combo that the runtime can't actually import Triggers: - PR + push to main/staging touching workspace/requirements.txt or this workflow (catches local pin changes) - Daily 13:00 UTC schedule (catches upstream PyPI publishes that break the pin combo without any change in our repo) - workflow_dispatch (manual) Concurrency cancels in-progress runs on the same ref. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 01:30:36 -07:00
Hongming Wang	54e86549ee	fix(workspace-crud): propagate Stop errors on delete (closes #1843) `Delete`'s call to `h.provisioner.Stop()` was silently swallowing errors — and on the SaaS/EC2 backend, Stop() is the call that terminates the EC2 via the control plane. When Stop returned an error (CP transient 5xx, network blip), the workspace was marked 'removed' in the DB but the EC2 stayed running with no row to track it. The "14 orphan workspace EC2s on a 0-customer account" incident in #1843 (40 vCPU on a 64 vCPU AWS limit) traced to this silent-leak path. This change aggregates Stop errors across both descendant and self-stop calls and surfaces them as 500 to the client, matching the loud-fail pattern from CP #262 (DeprovisionInstance) and the DNS cleanup propagation (#269). Idempotency: - The DB row is already 'removed' before Stop runs (intentional, per #73 — guards against register/heartbeat resurrection). - `resolveInstanceID` reads instance_id without a status filter, so a retry can replay Stop with the same instance_id. - CP's TerminateInstance is idempotent on already-terminated EC2s. - So a retry-after-500 either re-attempts the terminate (succeeds) or finds the instance already gone (also succeeds). Behaviour change at the API layer: - Before: 200 `{"status":"removed","cascade_deleted":N}` regardless of Stop outcome. - After: 500 `{"error":"...","removed_count":N,"stop_failures":K}` on Stop failure; 200 on success. RemoveVolume errors stay log-and-continue — those are local /var/data cleanup, not infra-leak class. Test debt acknowledged: the WorkspaceHandler's `provisioner` field is the concrete `*provisioner.Provisioner` type, not an interface. Adding a regression test for the new error-propagation path requires either a refactor (introduce a Provisioner interface) or a docker-backed integration test.
Filing the refactor as a follow-up; the change here is small and mirrors a proven pattern (CP #262 + #269 both ship without exhaustive new test coverage for the same reason). Verified: - go build ./... clean - go vet ./... clean - go test ./... green across the whole module (existing TestDelete cases unchanged behaviour for happy path) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 01:28:50 -07:00
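The aggregate-then-fail shape of the new delete path can be sketched as below. The real handler is Go; this Python version only shows the control flow. `delete_workspaces` and the `stop` callable are hypothetical, and because the message does not say how N is counted, both count fields here simply use the number of ids passed in.

```python
def delete_workspaces(ids, stop):
    """ids: the workspace plus its descendants; stop: callable that raises on failure.

    Returns (http_status, json_body) in the shapes quoted in the message.
    """
    failures = []
    for wid in ids:
        try:
            stop(wid)
        except Exception as exc:  # collect every failure instead of stopping at the first
            failures.append(f"{wid}: {exc}")
    if failures:
        return 500, {
            "error": "; ".join(failures),
            "removed_count": len(ids),       # assumed meaning of N
            "stop_failures": len(failures),
        }
    return 200, {"status": "removed", "cascade_deleted": len(ids)}
```

Collecting failures across the whole cascade, rather than returning on the first one, means one bad descendant does not prevent Stop from being attempted on the rest.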
Hongming Wang	56802e1124	Merge branch 'staging' into fix/canvas-multilevel-layout-ux	2026-04-26 01:03:29 -07:00
rabbitblood	641b1391e2	refactor(discovery): apply simplify findings on #1038 PR Code-quality + efficiency review of PR #2081: - Drop comma-ok on map type-asserts in filterPeersByQuery — queryPeerMaps writes name/role unconditionally as string, so the silent-empty-string fallback was cargo-culted defense that would HIDE a real upstream shape change in tests rather than surface it. Plain p["name"].(string) panics on violation, caught by tests. - Trim filterPeersByQuery doc from 5 lines to 1 — function is 15 lines and self-evident. - Refactor 6 separate Test functions into one table-driven TestPeers_QFilter with 6 sub-tests. Net ~80 lines saved + naming becomes readable subtest names instead of TestPeers_Q_Foo_Bar. - Set-based peer-id comparison (peerIDSet) replaces fragile peers[0]["id"] == "ws-alpha" asserts that would silently mask a future sort/order regression on the production code. - Fix the broken TestPeers_Q_NoMatches assertion: re-encoding an unmarshalled []map collapses both null and [] to [], so the previous json.Marshal(peers) == "[]" check was tautological. Move the [] vs null distinction to a dedicated test (TestPeers_Q_NoMatches_RawBodyIsArrayNotNull) that inspects the recorder body BEFORE unmarshal. runPeersWithQuery now returns both parsed peers and raw body so the nil-guard test can use the bytes directly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 01:02:19 -07:00
rabbitblood	5fe6397765	fix(discovery): apply ?q= filter to Peers list (#1038) The Peers handler at workspace-server/internal/handlers/discovery.go ignored the ?q= query param entirely — every caller got the full peer list regardless of what they searched for. The handler exposes peer identities + URLs, so leaking the unfiltered set on a "filtered" endpoint is an info-disclosure bug (CWE-862). Fix: read c.Query("q") and post-filter the in-memory peers slice by case-insensitive substring match against name OR role. Filtering is done in Go after the existing 3 SQL reads — keeps the SQL bytes identical to the no-filter path (no injection vector, no DB-driver collation surprises) at a small cost. The peer set is bounded by a single workspace's parent + children + siblings (typically <50 rows), so the in-memory pass is negligible. Empty / whitespace-only q is a no-op — preserves the no-filter allocation profile. Tests (6 new in discovery_test.go): - TestPeers_NoQ_ReturnsAll — regression baseline (3 peers, no filter) - TestPeers_Q_FiltersByName — q=alpha → ws-alpha only - TestPeers_Q_CaseInsensitive — q=ALPHA → ws-alpha (locks in ToLower) - TestPeers_Q_FiltersByRole — q=design → ws-beta (role-side match) - TestPeers_Q_NoMatches — empty array, JSON [] not null - TestPeers_Q_WhitespaceOnly — q=' ' treated as no-filter Helpers peersFilterFixture + runPeersWithQuery + peerNames keep each test scoped to the q-behaviour, not re-declaring SQL expectations. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 00:57:44 -07:00
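The post-filter described above (case-insensitive substring match on name OR role, with empty or whitespace-only `q` as a no-op) can be sketched as follows. The real `filterPeersByQuery` is Go; this Python version mirrors only the matching rule, and the peer dictionaries in the usage example are invented fixtures.

```python
def filter_peers_by_query(peers, q):
    """Return the peers whose name or role contains q, ignoring case."""
    needle = q.strip().lower()
    if not needle:
        # Empty or whitespace-only query: return the input list unchanged.
        return peers
    return [
        p
        for p in peers
        if needle in p["name"].lower() or needle in p["role"].lower()
    ]


# Invented fixture, loosely shaped like the three peers the tests mention.
PEERS = [
    {"id": "ws-alpha", "name": "Alpha", "role": "engineer"},
    {"id": "ws-beta", "name": "Beta", "role": "designer"},
    {"id": "ws-gamma", "name": "Gamma", "role": "qa"},
]
```

For example, `filter_peers_by_query(PEERS, "design")` keeps only the peer whose role contains "design", while a query of a single space returns every peer.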
Hongming Wang	cbb8ee0807	Merge pull request #2080 from Molecule-AI/fix/retarget-action-handle-duplicate-pr-1884 ci(retarget): handle 422 'duplicate PR' by closing redundant main-PR (closes #1884)	2026-04-26 07:56:13 +00:00
Hongming Wang	b5f9cbbc55	ci(retarget): handle 422 'duplicate PR' by closing redundant main-PR (closes #1884) When a bot opens a PR against main and there's already another PR on the same head branch targeting staging, GitHub's PATCH /pulls returns 422 with: "A pull request already exists for base branch 'staging' and head branch '<branch>'" Pre-fix: the retarget Action exited 1 with no further action. The target-main PR sat there as a duplicate, the workflow run showed red, and someone had to manually close the duplicate. Today's case (#1881 duplicate of #1820) had to be closed manually. Fix: catch that specific 422 message and close the main-PR as redundant instead of failing. Any OTHER 422 (or other error) still fails loud — the grep matches the specific duplicate-base text, not a blanket "any 422 means duplicate". Behaviour matrix: PATCH succeeds → retargeted, explainer comment posted PATCH 422 "already exists for staging" → close main-PR with explainer (NEW) PATCH any other failure → workflow fails (preserves loud-fail for real bugs) Tests: GitHub Actions don't have an inline unit-test framework here. The workflow YAML parses (validated locally) and the bash logic is straightforward. Real verification will be the next duplicate-PR scenario in production. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 00:53:55 -07:00
Hongming Wang	8543bae83f	Merge branch 'staging' into fix/canvas-multilevel-layout-ux	2026-04-26 00:36:54 -07:00
rabbitblood	6494e9192b	refactor(ops): apply simplify findings on #2027 PR Code-quality + efficiency review of PR #2079: - Hoist all_slugs = prod_slugs | staging_slugs out of decide() into the caller (was rebuilt on every record — 1k records × ~50-slug union per call). decide() signature now (r, all_slugs, ec2_names). - Compile regexes at module scope (_WS_RE, _E2E_RE, _TENANT_RE) + hoist platform-core literal set (_PLATFORM_CORE_NAMES). Same change mirrored in the bash heredoc. - Drop decorative # Rule N: comments (numbering was out of order, 3 before 2 — actively confusing). - Move the "edits must mirror" reminder OUTSIDE the CANONICAL DECIDE block in the .sh file, eliminating the .replace() comment-skip hack in TestParityWithBashScript. - Drop per-line .strip() in _slice_canonical (would mask a real indentation bug; both blocks already at column 0). - subTest() in TestPlatformCore loops so a single failure no longer short-circuits the rest of the items. - merge_group + concurrency on test-ops-scripts.yml (parity with ci.yml gate behaviour). - Fix don't apostrophe in inline comment that closed the python heredoc's single-quote and broke bash -n. All 25 tests still pass. bash -n clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 00:28:15 -07:00
rabbitblood	ba78a5c00d	test(ops): unit tests for sweep-cf-orphans decide() (#2027) Closes #2027. The CF orphan sweep deletes DNS records — a misclassification could nuke a live workspace's tunnel. The decision function had MAX_DELETE_PCT percentage gating but no automated test of category → action mapping. Approach: extract the decide() function to scripts/ops/sweep_cf_decide.py as a verbatim copy bracketed by `# CANONICAL DECIDE BEGIN/END` markers. The shell script keeps its inline heredoc (so the operational path is untouched) but bracketed by the same markers. A parity test (TestParityWithBashScript) reads both files and asserts the bracketed blocks match line-for-line — drift fails CI loudly. Coverage (25 tests, 1 file, stdlib unittest only): - Rule 1 platform-core: apex, _vercel, _domainkey, www/api/app/doc/send/status/staging-api - Rule 3 ws-: live (matches EC2 prefix) on prod + staging; orphan on prod + staging - Rule 4 e2e-: live + orphan on staging; orphan on prod - Rule 2 generic tenant: live prod + staging; unknown subdomain kept-for-safety - Rule 5 fallthrough: external domain + unrelated apex - Rule priority: api.moleculesai.app stays platform-core (not tenant); _vercel stays verification - Safety gate: under/at/over default 50% threshold; zero-total no-divide; custom threshold - Empty live-sets: documents that decide() alone classifies as orphan, gate is the defense CI: new .github/workflows/test-ops-scripts.yml runs `python -m unittest discover` against scripts/ops/ on every PR/push that touches the directory. Lightweight — no requirements file, stdlib only. Local: `cd scripts/ops && python -m unittest test_sweep_cf_decide -v` → 25 tests, all OK. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 00:22:30 -07:00
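The marker-bracketed parity check can be sketched as below, using only the standard library. The exact marker spellings (`# CANONICAL DECIDE BEGIN` and `# CANONICAL DECIDE END` on their own lines at column 0) are an assumption read off the shorthand in the message, and `slice_canonical` is an illustrative helper, not the repository's `_slice_canonical`.

```python
BEGIN = "# CANONICAL DECIDE BEGIN"  # assumed exact marker text
END = "# CANONICAL DECIDE END"      # assumed exact marker text


def slice_canonical(text: str) -> list[str]:
    """Return the lines strictly between the BEGIN and END markers.

    Raises ValueError if either marker line is missing, so a deleted
    marker fails the check instead of comparing two empty blocks.
    """
    lines = text.splitlines()
    start = lines.index(BEGIN)
    end = lines.index(END, start + 1)
    return lines[start + 1:end]


def blocks_match(python_source: str, shell_source: str) -> bool:
    """True when both files carry an identical bracketed block."""
    return slice_canonical(python_source) == slice_canonical(shell_source)
```

Comparing the raw lines, without stripping whitespace, means an indentation drift between the two copies also counts as a mismatch.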
Hongming Wang	5e36c6638c	feat(platform,canvas): classify "datastore unavailable" as 503 + dedicated UI User reported the canvas threw a generic "API GET /workspaces: 500 {auth check failed}" error when local Postgres + Redis were both down. Two problems: 1. The error code (500) and message ("auth check failed") said nothing useful. The actual condition was "platform can't reach its datastore to validate your token" — a Service Unavailable class, not Internal Server Error. 2. The canvas had no way to distinguish infra-down from a real auth bug, so it rendered the raw API string in the same generic-error overlay it uses for everything. Fix in two layers: Server (wsauth_middleware.go): - New abortAuthLookupError helper centralises all three sites that previously returned `500 {"error":"auth check failed"}` when HasAnyLiveTokenGlobal or orgtoken.Validate hit a DB error. - Now returns 503 + structured body `{"error": "...", "code": "platform_unavailable"}`. 503 is the correct semantic ("retry shortly, infra is unavailable") and the code field is the contract the canvas reads. - Body deliberately excludes the underlying DB error string — production hostnames / connection-string fragments must not leak into a user-visible error toast. Canvas (api.ts): - New PlatformUnavailableError class. api.ts inspects 503 responses for the platform_unavailable code and throws the typed error instead of the generic "API GET /…: 503 …" message. Generic 503s (upstream-busy, etc.) keep the legacy path so existing busy-retry UX isn't disrupted. Canvas (page.tsx): - New PlatformDownDiagnostic component renders when the initial hydration catches PlatformUnavailableError. Surfaces the actual condition with operator-actionable copy ("brew services start postgresql@14 / redis") + pointer to the platform log + a Reload button. 
Tests: - Go: TestAdminAuth_DatastoreError_Returns503PlatformUnavailable pins the response shape (status, code field, no DB-error leak) - Canvas: 5 tests for PlatformUnavailableError classification — typed throw on 503+code match, generic-Error fallback for 503-without-code (upstream busy), 500 stays generic, non-JSON body falls back to generic. 1015 canvas tests + full Go middleware suite pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 00:01:56 -07:00
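The client-side classification rule can be sketched as follows. The real code is TypeScript in api.ts; this Python version mirrors only the decision. `error_for_response` is a hypothetical helper, and the generic message format is modelled on the "API GET /workspaces: 500 ..." string quoted above.

```python
import json


class PlatformUnavailableError(Exception):
    """Typed error for a 503 carrying code == "platform_unavailable"."""


def error_for_response(method: str, path: str, status: int, body: str) -> Exception:
    """Pick the error to raise for a failed API response."""
    if status == 503:
        try:
            payload = json.loads(body)
        except ValueError:
            payload = None  # non-JSON body: fall through to the generic error
        if isinstance(payload, dict) and payload.get("code") == "platform_unavailable":
            return PlatformUnavailableError(payload.get("error", "platform unavailable"))
    # Generic 503s (for example upstream-busy) and all other statuses stay generic.
    return Exception(f"API {method} {path}: {status} {body}")
```

Keying on the `code` field rather than on the status alone is what keeps existing busy-retry handling for ordinary 503s untouched.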
Hongming Wang	194121c674	Merge pull request #2063 from Molecule-AI/feat/redeploy-tenants-on-main-merge ci(redeploy): auto-redeploy tenant EC2s after every main merge	2026-04-26 07:00:59 +00:00
Hongming Wang	944ddcb4e5	Merge pull request #2062 from Molecule-AI/fix/sweep-script-env-override fix(scripts): make sweep-cf-orphans MAX_DELETE_PCT env override actually work	2026-04-26 06:55:14 +00:00
Hongming Wang	20cce3c27c	Merge pull request #2078 from Molecule-AI/fix/api-401-probe-before-redirect fix(api): probe /cp/auth/me before redirecting on 401	2026-04-26 06:51:38 +00:00
Hongming Wang	5a3dbb95e1	fix(api): probe /cp/auth/me before redirecting on 401 The actual cause-fix for the staging-tabs E2E saga (#2073/#2074/#2075). Old behaviour: ANY 401 from any fetch on a SaaS tenant subdomain called redirectToLogin → window.location.href = AuthKit. This is wrong. Plenty of 401s don't mean "session is dead": - workspace-scoped endpoints (/workspaces/:id/peers, /plugins) require a workspace-scoped token, not the tenant admin bearer - resource-permission mismatches (user has tenant access but not this specific workspace) - misconfigured proxies returning 401 spuriously A single transient one of those yanked authenticated users back to AuthKit. Same bug yanked the staging-tabs E2E off the tenant origin mid-test for 6+ hours tonight, leading to the cascade of test-side mocks (#2073/#2074/#2075) that worked around the symptom without fixing the cause. This PR fixes it at the source. The new logic: - 401 on /cp/auth/* path → that IS the canonical session-dead signal → redirect (unchanged) - 401 on any other path with slug present → probe /cp/auth/me: probe 401 → session genuinely dead → redirect probe 200 → session fine, endpoint refused this token → throw a real Error, caller renders error state probe network err → assume session-fine (conservative) → throw real Error - slug empty (localhost / LAN / reserved subdomain) → throw without redirect (unchanged) The probe adds one extra fetch on a 401, only when slug is set and the path isn't already auth-scoped. That's rare and worthwhile — a transient probe round-trip is cheap; an unwanted auth redirect is a UX disaster. 
Tests: - api-401.test.ts rewritten with the full matrix: * /cp/auth/me 401 → redirect (no probe, that IS the signal) * non-auth 401 + probe 401 → redirect * non-auth 401 + probe 200 → throw, no redirect ← the fix * non-auth 401 + probe network err → throw, no redirect * empty slug paths (localhost/LAN/reserved) → throw, no probe - 43 tests in canvas/src/lib/__tests__/api*.test.ts all pass - tsc clean The staging-tabs E2E spec's universal-401 route handler stays as defense-in-depth (silences resource-load console noise + guards against panels without try/catch), but the comment now describes its role honestly: api.ts is the primary fix, the route is the safety net. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 23:49:28 -07:00
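The 401 decision matrix above can be sketched as a pure function. The real logic is TypeScript in api.ts; this Python version only encodes the branches. `on_401` and the `probe` callable are hypothetical, checking the empty slug before the path is an assumption, and so is treating every probe status other than 401 like the documented 200 case.

```python
def on_401(path: str, slug: str, probe) -> str:
    """Return "redirect" or "throw" for a 401 response.

    probe: zero-argument callable returning the HTTP status of
    /cp/auth/me, or raising OSError on a network failure.
    """
    if not slug:
        # localhost / LAN / reserved subdomain: never redirect, never probe.
        return "throw"
    if path.startswith("/cp/auth/"):
        # A 401 from the auth endpoints is itself the session-dead signal.
        return "redirect"
    try:
        status = probe()
    except OSError:
        # Network error on the probe: conservatively assume the session is fine.
        return "throw"
    return "redirect" if status == 401 else "throw"
```

For instance, `on_401("/plugins", "acme", lambda: 200)` yields `"throw"`: the session is alive and only that endpoint refused the token, so the caller renders an error state instead of leaving the page.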
Hongming Wang	b47a1b87b0	chore: refresh stale orphan-sweeper Stop-failure comment Convergence-pass review noted the comment at orphan_sweeper.go:171 still describes the pre-cb126014 contract ("Stop returns nil even when container is gone, but a future change could surface real errors"). The future is now — Stop does surface real errors today. Tightened the comment to match the live contract: isContainerNotFound is treated as success, anything else returns the wrapped Docker error, sweeper retries on the next cycle. Pure comment change, no behavior diff. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 23:34:57 -07:00
Hongming Wang	cb12601414	fix(platform): make Provisioner.Stop return real errors so cleanup gates fire Review caught a critical issue with `12c49183`: the headline "skip RemoveVolume when Stop fails" guarantee was dead code. `Provisioner.Stop` unconditionally `return nil`'d after logging the underlying ContainerRemove error, so the new `if err := h.provisioner.Stop(...); err != nil { skip volume }` guard in workspace_crud.go AND the same guard in the orphan sweeper could never fire. RemoveVolume always ran, predictably failing with "volume in use" when Stop hadn't actually killed the container — which is the exact production bug the commit claimed to fix. Now Stop: - returns nil on successful remove (no change) - returns nil when the container is already gone (uses the existing isContainerNotFound helper — that's the cleanup post-condition, not a failure) - returns the wrapped Docker error otherwise (daemon timeout, ctx cancellation, socket EOF — anything that means the container might still be alive) Audited every Provisioner.Stop caller in the tree (team.go, workspace_restart.go ×4, workspace.go) — all of them already discard the return value, so the widened error surface is purely opt-in for the new cleanup paths and breaks no existing behaviour. Other review-driven fixes in this commit: - workspace_crud.go: detached `broadcaster.RecordAndBroadcast` from the request ctx too. RecordAndBroadcast does INSERT INTO structure_events + Redis Publish; if the canvas hangs up, a request-ctx-bound INSERT can be cancelled mid-write and the WORKSPACE_REMOVED event never lands, leaving other WS clients ignorant of the cascade. - orphan_sweeper.go: added isLikelyWorkspaceID guard before turning Docker container prefixes into SQL LIKE patterns. 
The Docker name filter is a SUBSTRING match (not prefix), so non-workspace containers like `my-ws-tool` slip through; the in-loop HasPrefix in provisioner trims most, but the in-sweeper alphabet check (hex + dashes only) is the second line of defence and also blocks SQL LIKE wildcards (`_`, `%`) from reaching the query. Two new tests pin this — TestSweepOnce_FiltersNonWorkspacePrefixes and TestIsLikelyWorkspaceID with 10 alphabet cases. - provisioner.go: comment added to ListWorkspaceContainerIDPrefixes flagging the substring/HasPrefix relationship as load-bearing. Verified: full Go test suite passes; all 8 sweeper tests pass (2 new for the LIKE-pattern guard); existing dispatch / delete / provisioner tests unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 23:32:48 -07:00
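The alphabet check described above (hex characters and dashes only) can be sketched as follows. The real `isLikelyWorkspaceID` is Go; this Python version mirrors the idea. Accepting only lowercase hex and rejecting the empty string are assumptions, since the message does not state either.

```python
import string

# Lowercase hex digits plus the dash; uppercase is excluded by assumption.
_ALLOWED = set(string.digits + "abcdef") | {"-"}


def is_likely_workspace_id(prefix: str) -> bool:
    """True when prefix is non-empty and made only of hex digits and dashes."""
    return bool(prefix) and all(ch in _ALLOWED for ch in prefix)


def like_patterns(prefixes):
    """Build SQL LIKE patterns only from prefixes that pass the check."""
    return [p + "%" for p in prefixes if is_likely_workspace_id(p)]
```

Because `_` and `%` are outside the allowed alphabet, a prefix that passes the check cannot smuggle a LIKE wildcard into the pattern; the only `%` is the one appended here.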
Hongming Wang	12c4918318	fix(platform): stop leaking workspace containers on delete Symptom: deleting workspaces from the canvas marked DB rows status='removed' but left Docker containers running indefinitely. After a session of org imports + cancellations, we counted 10 running ws-* containers all backed by 'removed' DB rows, eating ~1100% CPU on the Docker VM. Two compounding bugs in handlers/workspace_crud.go's delete cascade: 1. The cleanup loop used `c.Request.Context()` for the Docker stop/remove calls. When the canvas's `api.del` resolved on the platform's 200, gin cancelled the request ctx — and any in-flight Docker call cancelled with `context canceled`, leaving the container alive. Old logs: "Delete descendant <id> volume removal warning: ... context canceled" 2. `provisioner.Stop`'s error return was discarded and `RemoveVolume` ran unconditionally afterward. When Stop didn't actually kill the container (transient daemon error, ctx cancellation as in #1), the volume removal would predictably fail with "volume in use" and the container kept running with the volume mounted. Old logs: "Delete descendant <id> volume removal warning: Error response from daemon: remove ... volume is in use" Fix layered in two parts: - workspace_crud.go: detach cleanup with `context.WithoutCancel(ctx)` + a 30s bounded timeout. Stop's error is now checked and on failure we skip RemoveVolume entirely (the orphan sweeper below catches what we deferred). - New registry/orphan_sweeper.go: periodic reconcile pass (every 60s, initial run on boot). Lists running ws-* containers via Docker name filter, intersects with DB rows where status='removed', stops + removes volumes for the leaks. Defence in depth — even a brand-new Stop failure mode heals on the next sweep instead of leaking forever. 
Provisioner gains a tiny ListWorkspaceContainerIDPrefixes helper that wraps ContainerList with the `name=ws-` filter; the sweeper takes an OrphanReaper interface (matches the ContainerChecker pattern in healthsweep.go) so unit tests don't need a real Docker daemon. main.go wires the sweeper alongside the existing liveness + health-sweep + provisioning-timeout monitors, all under supervised.RunWithRecover so a panic restarts the goroutine. 6 new sweeper tests cover the reconcile path, the no-running-containers short-circuit, the daemon-error skip, the Stop-failure-leaves-volume invariant (the same trap that motivated this fix), the volume-remove-error-is-non-fatal continuation, and the nil-reaper no-op. Verified: full Go test suite passes; manually purged the 10 leaked containers + their orphan volumes from the dev host with `docker rm -f` + `docker volume rm` (one-off cleanup; the sweeper would have caught them on the next cycle once deployed). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 12:36:22 -07:00
Hongming Wang	23bea6e793	Merge pull request #2075 from Molecule-AI/fix/canvas-e2e-filter-resource-404 fix(canvas/e2e): filter generic 'Failed to load resource' + add URL diagnostics	2026-04-25 19:09:19 +00:00
Hongming Wang	bef6fca395	fix(canvas/e2e): filter generic "Failed to load resource" + add URL diagnostics After #2074, the staging-tabs spec stopped failing on the auth-redirect locator timeout (good — the broadened 401-mock works) but started failing on a different aggregate check: Error: unexpected console errors: Failed to load resource: the server responded with a status of 404 Failed to load resource: the server responded with a status of 404 Failed to load resource: the server responded with a status of 404 Browser console messages for resource-load failures omit the URL, so the message is uninformative on its own — we can't filter selectively (e.g. "is this a missing-CSS noise or a real broken endpoint?"). The previous filter list (sentry/vercel/WebSocket/ favicon/molecule-icon) catches specific known-noisy strings but this generic "Failed to load resource" doesn't contain any of them. Two changes: 1. Add page.on('requestfailed') + page.on('response>=400') logging to capture the URL of any failed request. Logs to test stdout (visible in the workflow log) — leaves a breadcrumb so a real bug isn't completely hidden when we filter the generic message. 2. Add "Failed to load resource" to the filter list. With (1) in place we still see the URLs for diagnosis; the generic console message is just noise. Real JS exceptions (panel crash, undefined access, etc.) come with a file path and stack trace and aren't matched by either filter, so the gate still catches actual bugs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 12:07:07 -07:00
Hongming Wang	cdfe4e7b85	Merge pull request #2074 from Molecule-AI/fix/canvas-e2e-broaden-401-mock fix(canvas/e2e): broaden 401-mock to all fetches	2026-04-25 18:43:07 +00:00
Hongming Wang	a84b167d4d	fix(canvas/e2e): broaden 401-mock to all fetches, not just /workspaces/* #2073 caught workspace-scoped 401s but missed non-workspace paths. SkillsTab.tsx alone fetches /plugins and /plugins/sources, both outside the /workspaces/<id>/* tree. Either of those 401s with the tenant admin bearer in SaaS mode → canvas/src/lib/api.ts:62-74 redirects to AuthKit → page navigates away mid-test → next locator times out. Same failure signature observed at 16:03Z post-#2073 merge: e2e/staging-tabs.spec.ts:45:7 › tab: skills TimeoutError: locator.scrollIntoViewIfNeeded: Timeout 5000ms - navigated to "https://scenic-pumpkin-83.authkit.app/?..." Broaden the route to "**" with `request.resourceType() !== "fetch"` short-circuit (preserves HTML/JS/CSS pass-through) and a /cp/auth/me skip (the dedicated mock above wins). Same 401 → empty-body conversion logic; just a wider net. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 11:40:48 -07:00
Hongming Wang	2ee4b67cab	chore: third-pass review polish — empty-stream gate test + Callable type Pass 3 review came back Approve with two optional polish items. Both taken to fully converge the loop: 1. Regression test for the empty-stream wedge-clear gate (added in `3c4eef49`). A degenerate stream that iterates without raising but emits NEITHER an AssistantMessage NOR a ResultMessage must NOT clear the wedge flag — pre-set wedge persists, the next heartbeat still reports runtime_state="wedged". Pins the gate against future regression. 2. Replaced the type annotation `"dict[str, callable[[dict], str]]"` (lowercase `callable`, string-quoted) with the proper `dict[str, Callable[[dict], str]]` using `Callable` from `collections.abc`. Benign before (`from __future__ import annotations` makes the annotation a string Python never evaluates), but pyright/mypy may flag the lowercase form. 65 Python tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 08:52:32 -07:00
Hongming Wang	3c4eef49aa	chore: second-pass review polish — symmetry + clearer test fixtures Round-2 review of the wedge/idle/progress bundle came back Approve with 4 optional polish items. All taken: 1. Migration 043 down file gained `SET LOCAL lock_timeout = '5s'` matching the up file. A rollback under the same load that motivated the up-file guard would otherwise stall writers. 2. _clear_sdk_wedge_on_success now gates on actual stream content (result_text or assistant_chunks). A degenerate "iterator returned without raising but emitted nothing" case (possible from a partial stream or stub SDK) no longer falsely advertises recovery — only a real successful query (≥1 ResultMessage or AssistantMessage TextBlock) clears the wedge. 3. isUpstreamBusyError dropped the redundant `strings.Contains(msg, "context deadline exceeded")` fallback. *url.Error.Unwrap propagates the typed sentinel since Go 1.13; errors.Is(err, context.DeadlineExceeded) catches the real net/http shape. The substring was a foot-gun (would also match user-content with that phrase). Test fixture updated to use `fmt.Errorf("Post: %w", context.DeadlineExceeded)` which reflects what net/http actually returns. 4. TestIsUpstreamBusyError added a context.Canceled case (both typed and wrapped via %w) — pins the new applyIdleTimeout classification. No critical/required findings on second pass; reviewer verdict was Approve. Items above are polish for symmetry and test clarity. 1010 canvas + 64 Python + full Go suites pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 08:48:30 -07:00
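The content gate from item 2 of the commit above can be sketched roughly as follows. This is a minimal illustration, not the code in claude_sdk_executor.py: the function names `mark_wedged_if_match`, `clear_wedge_on_success` and `runtime_state` are hypothetical, and plain substring matching on the longer pattern is an assumption.

```python
from typing import Optional, Sequence

# Pattern taken from the commit log; matching by lowercase substring is an assumption.
_WEDGE_ERROR_PATTERNS = ("control request timeout: initialize",)

# Module-level flag: None means healthy, a string holds the first wedge reason seen.
_sdk_wedged_reason: Optional[str] = None


def mark_wedged_if_match(error_text: str) -> None:
    """Set the wedge flag when the error matches a known pattern (first reason wins)."""
    global _sdk_wedged_reason
    lowered = error_text.lower()
    if _sdk_wedged_reason is None and any(p in lowered for p in _WEDGE_ERROR_PATTERNS):
        _sdk_wedged_reason = error_text


def clear_wedge_on_success(result_text: str, assistant_chunks: Sequence[str]) -> None:
    """Clear the flag only when the stream actually produced content.

    A stream that ends without raising but emits nothing does not count as recovery.
    """
    global _sdk_wedged_reason
    if result_text or assistant_chunks:
        _sdk_wedged_reason = None


def runtime_state() -> str:
    """Value a heartbeat would report: "wedged" or empty for healthy."""
    return "wedged" if _sdk_wedged_reason is not None else ""
```

With this shape, an empty stream leaves a pre-set wedge in place, which is the behavior the later regression test (`2ee4b67cab`) is described as pinning.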
Hongming Wang	892de784b3	fix: review-driven hardening of wedge detector + idle timeout + progress feed Bundle review of pieces 1/2/3 surfaced two critical issues plus a handful of required + optional fixes. All addressed. Critical: 1. Migration 043 was missing 'paused' and 'hibernated' from the workspace_status enum. Both are real production statuses written by workspace_restart.go (lines 283 and 406), introduced by migration 029_workspace_hibernation. The original `USING status::workspace_status` cast would have errored mid-transaction on any production DB containing those values. Added both. Also added `SET LOCAL lock_timeout = '5s'` so the migration aborts instead of stalling the workspace fleet behind a slow SELECT. 2. The chat activity-feed window kept only 8 lines, and a single multi-tool turn (Read 5 files + Grep + Bash + Edit + delegate) easily flushed older context before the user could read it. Extracted appendActivityLine to chat/activityLog.ts with a 20-line window AND consecutive-duplicate collapse (same tool on the same target twice in a row is noise, not new progress). 5 unit tests pin the behavior. Required: 3. The SDK wedge flag was sticky-only: a single transient Control-request-timeout from a flaky network blip locked the workspace into degraded for the whole process lifetime, even when the next query() would have succeeded. Added _clear_sdk_wedge_on_success(), called from _run_query's success path. The next heartbeat after a working query reports runtime_state empty and the platform recovers the workspace to online without a manual restart. New regression test. 4. _report_tool_use now sets target_id = WORKSPACE_ID for self-actions, matching the convention other self-logged activity rows use. DB consumers joining on target_id see a well-defined value instead of NULL. Optional taken: 5. 
Tightened _WEDGE_ERROR_PATTERNS from "control request timeout" to "control request timeout: initialize" — suffix-anchored so a future SDK error on an in-flight tool-call control message doesn't get misclassified as the unrecoverable post-init wedge. 6. Dropped the redundant "context canceled" substring fallback in isUpstreamBusyError. errors.Is(err, context.Canceled) is the typed check; the substring would also match healthy client-side aborts, which we don't want classified as upstream-busy. Verified: 1010 canvas tests + 64 Python tests + full Go suite pass; migration applies cleanly on dev DB with all 8 enum values; reverse migration restores TEXT. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 08:43:10 -07:00
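The 20-line window with consecutive-duplicate collapse from item 2 of the commit above might look roughly like this. The real helper is the TypeScript `appendActivityLine` in chat/activityLog.ts; this Python transposition, including the name `append_activity_line` and the choice to return the input list unchanged on a duplicate, is an assumption for illustration only.

```python
from typing import List

MAX_ACTIVITY_LINES = 20  # window size stated in the commit message


def append_activity_line(
    log: List[str], line: str, max_lines: int = MAX_ACTIVITY_LINES
) -> List[str]:
    """Return the activity log with `line` appended.

    A line identical to the most recent entry is dropped (same tool on the
    same target twice in a row is noise), and only the newest `max_lines`
    entries are kept.
    """
    if log and log[-1] == line:
        return log
    return (log + [line])[-max_lines:]
```

Only consecutive repeats collapse here; the same line reappearing after a different one is kept as new progress.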
Hongming Wang	bf1dc6b6a5	feat(platform): idle-based A2A timeout, drop 5-min canvas hardcode The previous canvas-default 5-min absolute deadline pre-empted any chat that legitimately ran longer (multi-turn tool use, large synthesis tasks) and made every wedged-SDK call burn 5 full minutes before the user saw anything. Replaced with a per-dispatch idle timeout: cancel the request only when the broadcaster has been silent for `idleTimeoutDuration` (60s). Any progress event for the workspace — agent_log tool-use rows, task_update, a2a_send, a2a_receive — resets the clock. Mechanics: - new applyIdleTimeout helper subscribes to events.Broadcaster's per-workspace SSE channel, drains its messages, resets a time.Timer on each one, cancels the wrapped ctx when the timer fires. Cleanup goroutine + subscription lives only as long as the returned cancel func is uncalled. - dispatchA2A now takes workspaceID as a parameter, applies the idle timeout always (canvas + agent), and combines its cancel with the existing 30-min agent-to-agent ceiling cancel into one func the caller defers. - Canvas dispatches no longer have an absolute ceiling at all — the idle timer is the only "give up" signal. A healthy chat reporting tool-use telemetry every few seconds runs forever; a wedged runtime fails in 60s instead of 5 min. - isUpstreamBusyError now also recognises context.Canceled (the error class our idle cancel produces, distinct from DeadlineExceeded). Same 503-busy retry semantics. Tests: - TestApplyIdleTimeout_FiresOnSilence — 60ms idle, no events, ctx cancels with context.Canceled. - TestApplyIdleTimeout_ResetsOnEvent — event mid-window extends the deadline; ctx alive past original deadline, then cancels on the second silence window. - TestApplyIdleTimeout_NilBroadcasterDegradesGracefully — defensive no-op for paths that don't wire a broadcaster. - 3 existing dispatchA2A tests updated for the new workspaceID param + the always-non-nil cancel return shape. 
This pairs with Piece 1's per-tool-use telemetry (`166c7f77`): the broadcaster events that reset the idle timer ARE the agent_log rows the workspace started emitting per tool call. So the same event stream feeds both the chat progress feed AND the proxy's deadline. Full Go test suite passes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 08:34:55 -07:00
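The reset-on-progress rule behind the idle timeout can be illustrated with explicit timestamps. The commit's implementation is Go (a `time.Timer` reset from a broadcaster subscription, cancelling a context); the sketch below is a deliberately simplified Python model with a hypothetical `IdleDeadline` class, and it does not reproduce the subscription or cancellation mechanics.

```python
class IdleDeadline:
    """Idle deadline that moves forward every time progress is observed."""

    def __init__(self, idle_timeout: float, now: float) -> None:
        self.idle_timeout = idle_timeout  # seconds of allowed silence
        self.last_progress = now          # timestamp of the latest event

    def on_progress(self, now: float) -> None:
        """Record a progress event (tool-use row, task_update, a2a_send, ...)."""
        self.last_progress = now

    def expired(self, now: float) -> bool:
        """True once the stream has been silent for the full idle window."""
        return now - self.last_progress >= self.idle_timeout
```

Under this model a dispatch that reports progress every few seconds never expires, while a silent one expires after one idle window, which matches the contrast the commit draws with the old 5-minute absolute deadline.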
Hongming Wang	166c7f77af	feat(chat): stream per-tool progress into MyChat live feed Two halves of the same UX win — the user wants to see what Claude is doing while a chat reply is in flight instead of staring at "0s" for minutes. Workspace side (claude_sdk_executor.py): - The executor's _run_query message loop already iterated the SDK stream for AssistantMessage.TextBlock content. Now also detects ToolUseBlock / ServerToolUseBlock entries (by class name, since the conftest stub doesn't define them) and fires-and-forgets a POST /workspaces/:id/activity row of type agent_log per tool use. - _summarize_tool_use maps the common tools (Read, Write, Edit, Bash, Glob, Grep, WebFetch, WebSearch, Task, TodoWrite) to a one-line summary with the file path / pattern / command, falling back to "🛠 <tool>(…)" for anything else. Truncated at 200 chars. - Posts directly to /workspaces/:id/activity rather than going through a2a_tools.report_activity, which would also push a /registry/heartbeat current_task and double-log as a TASK_UPDATED line in the same chat feed. - All failures swallowed silently — telemetry must not break the conversation. Canvas side (ChatTab.tsx): - The existing ACTIVITY_LOGGED handler streams a2a_send / a2a_receive / task_update events into a sliding-window activityLog state. Two issues fixed: 1. No `msg.workspace_id === workspaceId` filter — a sibling workspace's a2a_send was leaking into the wrong chat panel as "→ Delegating to X...". Added an early return. 2. No agent_log render branch. Added one that renders the summary verbatim (the workspace already prefixed its own emoji icon, so no double-icon). - Existing 8-line sliding window keeps the UI scoped; older progress lines naturally roll off as new ones arrive. Result: when DD is delegating to Visual Designer + reading config files + running Bash to lint, the spinner area shows: 📄 Read /configs/system-prompt.md ⚡ Bash: pnpm test → Delegating to Visual Designer... 
← Visual Designer responded (47s) instead of bare "0s · Processing with Claude Code..." for minutes. 63 Python tests + 58 canvas chat tests pass; tsc clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 08:28:55 -07:00
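A tool-use summarizer of the kind described for `_summarize_tool_use` could be sketched as below. The Read and Bash formats and the fallback follow the examples quoted in the commit message; the input keys (`file_path`, `command`, `pattern`), the Grep icon, and the public name `summarize_tool_use` are assumptions, and only three of the listed tools are covered.

```python
from typing import Any, Callable, Dict

MAX_SUMMARY_CHARS = 200  # truncation limit stated in the commit message

# Per-tool formatters. Input key names are assumed, not taken from the SDK.
_SUMMARIZERS: Dict[str, Callable[[Dict[str, Any]], str]] = {
    "Read": lambda i: f"📄 Read {i.get('file_path', '?')}",
    "Bash": lambda i: f"⚡ Bash: {i.get('command', '?')}",
    "Grep": lambda i: f"🔎 Grep {i.get('pattern', '?')}",  # icon is hypothetical
}


def summarize_tool_use(name: str, tool_input: Dict[str, Any]) -> str:
    """Map a tool call to a one-line summary, with a generic fallback."""
    formatter = _SUMMARIZERS.get(name)
    summary = formatter(tool_input) if formatter else f"🛠 {name}(…)"
    return summary[:MAX_SUMMARY_CHARS]
```

Because the summary already carries its own icon, a consumer can render it verbatim, as the canvas-side agent_log branch is described as doing.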
Hongming Wang	14fab6e544	Merge pull request #2073 from Molecule-AI/fix/canvas-e2e-mock-workspace-apis fix(canvas/e2e): swap workspace-scoped 401s for empty 200s in staging-tabs spec	2026-04-25 15:23:07 +00:00
Hongming Wang	979d4a0b7a	fix(canvas/e2e): swap workspace-scoped 401s for empty 200s The staging-tabs E2E has been failing for 6+ hours on the same locator timeout — diagnosed earlier today as the canvas's lib/api.ts:62-74 redirect-on-401 path firing mid-test: e2e/staging-tabs.spec.ts:45:7 › tab: skills TimeoutError: locator.scrollIntoViewIfNeeded: Timeout 5000ms - navigated to "https://scenic-pumpkin-83.authkit.app/?..." Several side-panel tabs (Peers, Skills, Channels, Memory, Audit, and anything workspace-scoped) hit endpoints under `/workspaces/<id>/` that require a workspace-scoped token, NOT the tenant admin bearer the test uses. The endpoints respond 401 in SaaS mode. canvas/src/lib/api.ts:62-74 reacts to ANY 401 by setting `window.location.href` to AuthKit — yanking the page off the tenant origin mid-test. The test comment at line 18 already acknowledged the 401 class ("Peers tab: 401 without workspace-scoped token") but assumed those would surface as "errored content" rather than a hard navigation. The redirect logic in api.ts was added later and breaks the assumption. Fix: add a Playwright route handler that catches any 401 from `/workspaces/<id>/` paths and replaces with `200 + empty body`. Body shape is best-effort by URL — list endpoints (paths not ending in a UUID-shaped segment) get `[]`, single-resource endpoints get `{}`. Both are valid JSON and well-written panels render an empty state for either rather than crashing. The two route patterns (`/workspaces/...` and `/cp/auth/me`) don't overlap — the existing `/cp/auth/me` mock continues to gate AuthGate's session check independently. 
Verification: - Type-check passes (tsc clean for the spec; pre-existing errors in unrelated test files unchanged) - Can't run staging E2E locally without CP admin token; CI will exercise the real path against the freshly-provisioned tenant - E2E Staging SaaS (full lifecycle) is currently green at 08:07Z, confirming the underlying staging infra works — the failures have been narrowly in this Playwright-tabs spec Targets staging per molecule-core convention. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 08:08:05 -07:00
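The "body shape by URL" rule in the fix above can be sketched as a small pure function. The actual handler is a Playwright route in the TypeScript spec; this Python version, the name `empty_body_for`, and the exact UUID pattern are assumptions made for illustration.

```python
import re

# UUID-shaped path segment (8-4-4-4-12 hex digits); the exact pattern is assumed.
_UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)


def empty_body_for(path: str) -> str:
    """Pick the empty JSON body that replaces a 401 response.

    Paths ending in a UUID-shaped segment are treated as single-resource
    endpoints and get "{}"; everything else is treated as a list and gets "[]".
    """
    last_segment = path.rstrip("/").rsplit("/", 1)[-1]
    return "{}" if _UUID_RE.match(last_segment) else "[]"
```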
Hongming Wang	4eb09e2146	feat(platform,workspace): SDK-wedge detection + workspace_status ENUM Heartbeat lies. The asyncio task that POSTs /registry/heartbeat lives in its own process slot, so a workspace whose claude_agent_sdk has wedged on `Control request timeout: initialize` keeps reporting "online" — every chat send hangs the full 5-min platform deadline even though the runtime is dead in the water. This commit teaches the workspace to admit it's wedged and the platform to honor that admission by flipping status → degraded. Five layers, all in one commit because they share a contract: 1. Migration 043 — convert workspaces.status from free-form TEXT to a real `workspace_status` Postgres ENUM with the 6 values production code actually writes (provisioning, online, offline, degraded, failed, removed). Locks the value set; future typo writes error at the DB instead of silently storing rogue strings. Down migration reverts to TEXT and drops the type. 2. workspace-server/internal/models — `HeartbeatPayload` gains a `runtime_state string` field. Empty = healthy. Currently the only non-empty value the handler honors is "wedged"; future symptoms can extend without another migration. 3. workspace-server/internal/handlers/registry.go — `evaluateStatus` gains a wedge branch BEFORE the existing error_rate >= 0.5 path: if `RuntimeState=="wedged"` and currently online, flip to degraded and broadcast WORKSPACE_DEGRADED with the wedge sample error. Recovery (`degraded → online`) now requires BOTH error_rate < 0.1 AND runtime_state cleared, so a workspace still reporting wedged stays degraded even when its error count happens to be 0 (the wedge captures a runtime state, not an error count). 4. workspace/claude_sdk_executor.py — module-level `_sdk_wedged_reason` flag set when execute()'s catch block sees an error matching `_WEDGE_ERROR_PATTERNS` (currently just "control request timeout"). 
Sticky for the process lifetime; the SDK's internal client-process state is corrupted on this error and only a workspace restart (= new Python process = fresh module state) clears it. Helpers `is_wedged()` / `wedge_reason()` / `_reset_sdk_wedge_for_test()` exposed. 5. workspace/heartbeat.py — heartbeat body now layers on `_runtime_state_payload()` for both the happy path and the 401-retry path. Lazy-imports claude_sdk_executor so non-Claude runtimes (where the module may not even be importable) keep working unchanged. Canvas required no changes — `STATUS_CONFIG.degraded` was already defined in design-tokens.ts (amber dot, "Degraded" label) and WorkspaceNode.tsx already renders `lastSampleError` underneath the status pill when status === "degraded". The existing wiring just never fired because nothing was writing degraded in this code path. Tests: - 3 Go handler tests for the new transitions (online → degraded on wedged, degraded stays put while still wedged, degraded → online after wedge clears) - 5 Python wedge-detector tests (default clean, mark sets flag, sticky-first-wins, execute() flips on Control request timeout, execute() does NOT flip on unrelated errors) - Migration smoke-tested against the local dev DB (3 existing rows, all enum-compatible; migration applied cleanly, post-state has the column as workspace_status type and the index preserved) Verified: 79 Python tests pass; full Go test suite passes; migration applies clean on a real DB; reverse migration restores the column to TEXT. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 00:59:15 -07:00
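The status transitions described in layer 3 can be modeled as a small pure function. The handler itself is Go (`evaluateStatus` in registry.go); this Python sketch uses a hypothetical `evaluate_status` signature and assumes the pre-existing error_rate >= 0.5 path flips online to degraded, which the commit message implies but does not spell out.

```python
def evaluate_status(current: str, error_rate: float, runtime_state: str) -> str:
    """Return the next workspace status for one heartbeat."""
    wedged = runtime_state == "wedged"
    if current == "online":
        # The wedge branch is checked before the error-rate path.
        if wedged:
            return "degraded"
        if error_rate >= 0.5:  # assumed behavior of the existing path
            return "degraded"
    elif current == "degraded":
        # Recovery needs BOTH a low error rate AND a cleared runtime_state.
        if error_rate < 0.1 and not wedged:
            return "online"
    return current
```

The three cases the Go handler tests are described as covering map onto this directly: online to degraded on wedged, degraded staying put while still wedged, and degraded to online once the wedge clears.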
Hongming Wang	c159d85eb5	fix(a2a): review-driven hardening — prefix-anchored type check, error_detail cap, shared hint module Three required fixes from the bundle review of `391e1872`: 1. workspace/a2a_client.py: substring `type_name in msg` could miss the diagnostic prefix when an exception's message embedded a different class name mid-string (e.g. `OSError("see ConnectionError below")` → printed as plain msg, type lost). Switched to a prefix-anchored check (`msg.startswith(f"{type_name}:")` etc.) so the type label is always added when not already at the start of the message. 2. workspace/a2a_tools.py: `activity_logs.error_detail` is unbounded TEXT on the platform (handlers/activity.go does not validate length). A buggy or hostile peer could stream arbitrarily large error messages into the caller's activity log. Cap at 4096 chars at the producer — comfortably above any real exception traceback, well below an obvious-DoS threshold. 3. New regression test for JSON-RPC `code=0` — pins the `code is not None` semantics so the code is preserved in the detail rather than collapsing into the no-code path. Code=0 is not valid per the spec, but a malformed peer can still emit it and we want it visible for diagnosis. Plus one optional taken: extracted the A2A-error → hint mapping into canvas/src/components/tabs/chat/a2aErrorHint.ts. The two prior copies (AgentCommsPanel.inferCauseHint + ActivityTab.inferA2AErrorHint) had already drifted — Activity tab gained `not found`/`offline` cases the chat panel never picked up, AgentCommsPanel handled empty-input explicitly while Activity didn't. The shared module is the merged superset, with 10 unit tests pinning each named pattern + the "most specific first" ordering (Claude SDK wedge wins over generic timeout). Skipped (per analysis): - Unicode-naive 120-char slice — Python str[:N] slices on code points, not bytes. Safe. - Nested [A2A_ERROR] confusion — non-issue per reviewer; outer prefix winning still produces a structured render. 
- MessagePreview + JsonBlock dual render on errors — intentional drilldown; raw JSON is below the fold for operators who need it. - console.warn dedup — refetches don't happen per-event so spam risk is low. - str(data)[:200] materialization — A2A response bodies aren't typically MB-sized. Verified: 1005 canvas tests pass (10 new hint tests); 10 Python send_a2a_message tests pass (1 new for code=0); tsc clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 23:47:44 -07:00
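Fixes 1 and 2 from the commit above, the prefix-anchored type label and the 4096-character cap, can be sketched as follows. The `startswith` check and the cap value come from the commit message; the function names `describe_exception` and `cap_error_detail` are hypothetical.

```python
MAX_ERROR_DETAIL_CHARS = 4096  # producer-side cap stated in the commit message


def describe_exception(exc: BaseException) -> str:
    """Label the message with the exception type unless it already leads with it.

    A substring test (`type_name in msg`) would be fooled by a message that
    merely mentions some class name mid-string; anchoring at the start is not.
    """
    type_name = type(exc).__name__
    msg = str(exc)
    if msg.startswith(f"{type_name}:"):
        return msg
    return f"{type_name}: {msg}"


def cap_error_detail(detail: str) -> str:
    """Bound the detail so a peer cannot stream unbounded text into the log."""
    return detail[:MAX_ERROR_DETAIL_CHARS]
```

For instance, `describe_exception(OSError("see ConnectionError below"))` keeps the `OSError:` label, which is the case the commit says the substring check could lose.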
Hongming Wang	391e187281	fix(a2a,canvas): make delivery failures comprehensive instead of "[A2A_ERROR] " Symptom: Activity tab and Agent Comms surfaced bare "[A2A_ERROR] " (prefix + nothing) for failed delegations. Operator had no signal to act on — no exception type, no target, no hint about what went wrong, no next step. Fix is in three layers. 1. workspace/a2a_client.py — every error path now produces an actionable detail string: - except branch: some httpx exceptions (RemoteProtocolError, ConnectionReset variants) stringify to "". Pre-fix the catch was `f"{_A2A_ERROR_PREFIX}{e}"` → bare prefix. Now falls back to `<TypeName> (no message — likely connection reset or silent timeout)` and always appends `[target=<url>]` for traceability in chained delegations. - JSON-RPC error branch: previously dropped error.code on the floor and printed "unknown" when message was missing. Now surfaces both, including the well-defined "JSON-RPC error with no message (code=N)" path. - "neither result nor error" branch: pre-fix returned str(payload) which the canvas rendered as a successful response block. Now tagged as A2A_ERROR with a payload snippet so downstream UI routes through the error path. 2. workspace/a2a_tools.py — tool_delegate_task now passes error_detail (the stripped error message) through to the activity-log POST. The platform's activity_logs.error_detail column is the canvas's red error chip source; populating it makes the failure visible in the row header without the user having to expand into raw response_body JSON. The summary line also gets a 120-char prefix of the cause so the collapsed row reads "React Engineer failed: ConnectionResetError: ... [target=...]" instead of "React Engineer failed". 3. canvas/src/components/tabs/ActivityTab.tsx — MessagePreview now detects [A2A_ERROR]-prefixed bodies and renders a structured error block (red chip, stripped detail, cause hint) instead of the previous gray text-block that showed the literal "[A2A_ERROR]" string. 
inferA2AErrorHint mirrors the patterns from AgentCommsPanel.inferCauseHint so the same symptom reads the same way in both surfaces (Claude SDK init wedge → restart workspace; timeout → busy/stuck; connection-reset → transient blip then check logs). Tests: 9 send_a2a_message tests pass (including a new regression test for the empty-stringifying-exception case that the user reported); 995 canvas tests pass; tsc clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 23:40:05 -07:00
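The two error paths in layer 1 can be sketched as below. The `[A2A_ERROR] ` prefix, the `[target=<url>]` suffix and the "JSON-RPC error with no message (code=N)" wording come from the commit messages; the function names, the other message formats, and the slightly reworded empty-message fallback are assumptions.

```python
from typing import Any, Dict

A2A_ERROR_PREFIX = "[A2A_ERROR] "


def format_a2a_exception(exc: BaseException, target_url: str) -> str:
    """Build an error string that is never a bare prefix."""
    msg = str(exc)
    if not msg:
        # Some exceptions stringify to ""; fall back to the type name.
        msg = f"{type(exc).__name__} (no message; likely connection reset or silent timeout)"
    return f"{A2A_ERROR_PREFIX}{msg} [target={target_url}]"


def describe_jsonrpc_error(error: Dict[str, Any]) -> str:
    """Surface both code and message; `code is not None` keeps code=0 visible."""
    code = error.get("code")
    message = error.get("message")
    if message and code is not None:
        return f"{message} (code={code})"
    if message:
        return str(message)
    if code is not None:
        return f"JSON-RPC error with no message (code={code})"
    return "JSON-RPC error with no message"
```

Testing `code is not None` rather than truthiness is what keeps a malformed peer's `code=0` from collapsing into the no-code path, as the follow-up commit `c159d85eb5` describes.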
Hongming Wang	54f7c75c81	fix(canvas): make AgentCommsPanel load failures observable Reported symptom: canvas edges show "1 call · just now" between two agents, but the Agent Comms tab for the source workspace renders "No agent-to-agent communications yet" — even though GET /workspaces/<id>/activity?source=agent&limit=50 returns a2a_send + a2a_receive rows. Confirmed via curl that the API does return the rows the panel should map. The panel's load handler was the suspect, but it had: .catch(() => setLoading(false)) which swallowed every failure path — network errors, JSON parse, ANY throw inside the .then body — without leaving a single trace in the console. The panel just sat on its empty state and gave the user zero signal to act on. (And by extension, gave us nothing to debug remotely either.) Two changes: 1. Wrap the per-row `toCommMessage` call in a try/catch so one malformed activity row (unexpected request_body shape, etc.) doesn't throw out of the for-loop and skip the setMessages(msgs) line. Previously the panel would silently drop the entire batch when ANY row failed to parse. 2. Replace the bare `.catch(() => setLoading(false))` with a logging variant. Now a future "panel stuck empty" report comes with `AgentCommsPanel: load activity failed <err>` or `AgentCommsPanel: failed to map activity row {...}` in the console — diagnosable instead of opaque. Behavior on the happy path is unchanged (5 existing tests still pass; tsc clean). This is purely defensive: it makes the failure path visible so the next stuck-empty report can be root-caused instead of guessed at. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 23:27:50 -07:00
Hongming Wang	28911ded40	fix(canvas): split shared autoFitTimerRef so settle + tracking fits don't cross-cancel Bundle-level review caught an implicit coupling in useCanvasViewport between two distinct fit effects: - settle fit: 1200ms one-shot when provisioning transitions to zero (deploy just finished — settle on the whole org once) - tracking fit: 500ms debounced per molecule:fit-deploying-org event (track the org's bounds as children land during the deploy) Both effects shared a single autoFitTimerRef, so each one's clearTimeout call could silently cancel the other's pending fit. Today's behavior happened to land in the right order out of luck — the tracking handler fires per-arrival during the deploy, then the settle effect arms after the last child completes. But nothing in the code enforces that ordering; a future refactor that, say, fires the settle effect from the same event sequence as the tracking timer (mid-deploy status flicker) would silently drop the settle fit because the tracking timer's clearTimeout ran last. Splitting into settleFitTimerRef + trackingFitTimerRef makes the two effects fully independent. Cleanup clears both. Tests still pass (995/995); the refactor is mechanical. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 23:19:02 -07:00
Hongming Wang	fc54601999	Merge pull request #2067 from Molecule-AI/fix/canary-openai-key-staging ci(canary): inject E2E_OPENAI_API_KEY so A2A turn doesn't 500	2026-04-25 06:12:30 +00:00
Hongming Wang	52d203a098	Merge pull request #2068 from Molecule-AI/ci/sweep-stale-e2e-orgs ci: hourly sweep of stale e2e-* orgs on staging	2026-04-25 06:12:29 +00:00
Hongming Wang	fe075ee1ba	ci: hourly sweep of stale e2e-* orgs on staging Adds a janitor workflow that runs every hour and deletes any e2e-prefixed staging org older than MAX_AGE_MINUTES (default 120). Catches orgs left behind when per-test-run teardown didn't fire: CI cancellation, runner crash, transient AWS error mid-cascade, bash trap missed (signal 9), etc. Why it exists despite per-run teardown: - Per-run teardown is best-effort by definition. Any process death after the test starts but before the trap fires leaves debris. - GH Actions cancellation kills the runner with no grace period — the workflow's `if: always()` step usually catches this but can still fail on transient CP 5xx at the wrong moment. - The CP cascade itself has best-effort branches today (cascadeTerminateWorkspaces logs+continues on individual EC2 termination failures; DNS deletion same shape). Those need cleanup-correctness work in the CP, but a safety net belongs in CI either way — defense in depth. Behaviour: - Cron every hour. Manual workflow_dispatch with overrideable max_age_minutes + dry_run inputs for one-off cleanups. - Concurrency group prevents two sweeps fighting. - SAFETY_CAP=50 — refuses to delete more than 50 orgs in a single tick. If the CP admin endpoint goes weird and returns no created_at (or returns no orgs at all), every e2e-* would look stale; the cap catches the runaway-nuke case. - DELETE is idempotent CP-side via org_purges.last_step, so a half-deleted org from a prior sweep gets picked up cleanly on the next tick. - Per-org delete failures don't fail the workflow. Next hourly tick retries. The workflow only fails loud at the safety-cap gate. Tonight's specific motivation: ~10 canvas-tabs E2E retries in 2 hours with various failure modes; each provisioned a fresh tenant + EC2 + DNS + DB row. Some fraction leaked. Without this loop, ops has to periodically run the manual sweep-cf-orphans.sh script. With it, staging self-heals. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 23:07:57 -07:00
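The selection and safety-cap logic of the sweep can be sketched as a pure function. The workflow itself is a CI job; this Python model is only an illustration, and the org record shape (`slug`, `created_at` as a timezone-aware datetime), the `SafetyCapExceeded` exception and the function name are all hypothetical.

```python
from datetime import datetime, timedelta
from typing import Any, Dict, List

MAX_AGE_MINUTES = 120  # default age threshold from the commit message
SAFETY_CAP = 50        # maximum deletions per tick from the commit message


class SafetyCapExceeded(RuntimeError):
    """Raised when a sweep would delete more orgs than the cap allows."""


def select_stale_orgs(
    orgs: List[Dict[str, Any]],
    now: datetime,
    max_age_minutes: int = MAX_AGE_MINUTES,
    safety_cap: int = SAFETY_CAP,
) -> List[str]:
    """Return slugs of e2e-* orgs older than the threshold, or refuse."""
    cutoff = now - timedelta(minutes=max_age_minutes)
    stale = []
    for org in orgs:
        if not org["slug"].startswith("e2e-"):
            continue
        created_at = org.get("created_at")
        # A missing created_at makes the org look stale; the cap below is
        # what guards against that turning into a mass delete.
        if created_at is None or created_at < cutoff:
            stale.append(org["slug"])
    if len(stale) > safety_cap:
        raise SafetyCapExceeded(f"{len(stale)} stale orgs exceeds cap of {safety_cap}")
    return stale
```

Raising instead of truncating the list mirrors the described behavior that the workflow fails loudly only at the safety-cap gate.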
Hongming Wang	43c28710ac	Merge pull request #2066 from Molecule-AI/fix/e2e-staging-status-field fix(e2e): poll instance_status not status — staging E2E never matched the field, masked all real bugs	2026-04-25 05:58:36 +00:00

3663 Commits