External callers (third-party SDKs, the channel plugin) authenticate
purely via bearer and frequently don't set the X-Workspace-ID header.
Without the header, activity_logs.source_id ends up NULL — breaking the
peer_id signal on notifications, the "Agent Comms by peer" canvas tab,
and any analytics that breaks down inbound A2A by sender.
The bearer is the authoritative caller identity per the wsauth contract
(it's what proves who you are); the header is a display/routing hint
that must agree with it. So we derive callerID from the bearer's owning
workspace whenever the header is absent. The existing validateCallerToken
guard fires after this and enforces token-to-callerID binding the same
way it always has.
Org-token requests are skipped — those grant org-wide access and don't
bind to a single workspace, so the canvas-class semantics (callerID="")
are preserved. Bearer-resolution failures (revoked, removed workspace)
fall through to canvas-class as well, never 401.
New wsauth.WorkspaceFromToken exposes the bearer→workspace lookup as a
modular interface; mirrors ValidateAnyToken's defense-in-depth JOIN on
workspaces.status != 'removed'.
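A minimal sketch of that precedence, assuming an interface shape for the
lookup — only WorkspaceFromToken, the org-token skip, and the canvas-class
fallthrough come from this change; the other names are illustrative, and
validateCallerToken still runs on the result exactly as before:

    package a2aproxy

    import "context"

    type workspace struct{ ID string }

    type tokenStore interface {
        // WorkspaceFromToken resolves a bearer to its owning, non-removed workspace.
        WorkspaceFromToken(ctx context.Context, bearer string) (*workspace, error)
        IsOrgToken(bearer string) bool
    }

    // resolveCallerID picks the callerID ProxyA2A should record: the explicit
    // header when present, otherwise the bearer's owning workspace. Org tokens
    // and resolution failures keep canvas-class semantics (empty callerID).
    func resolveCallerID(ctx context.Context, ts tokenStore, header, bearer string) string {
        if header != "" {
            return header
        }
        if ts.IsOrgToken(bearer) {
            return "" // org-wide access, no single workspace to bind to
        }
        ws, err := ts.WorkspaceFromToken(ctx, bearer)
        if err != nil || ws == nil {
            return "" // revoked token / removed workspace: fall through, never 401
        }
        return ws.ID
    }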
Tests: 4 unit tests on WorkspaceFromToken + 3 integration tests on
ProxyA2A covering the three observable paths (bearer-derived,
org-token skipped, derive-failure fallthrough).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PRs that don't touch canvas/** paths skip the Canvas (Next.js) job via
its `if: needs.changes.outputs.canvas == 'true'` guard, and GitHub
reports a SKIPPED conclusion for that check. Branch protection on
staging requires Canvas (Next.js) — and treats SKIPPED as not-passed —
blocking merge on every workspace-server-only or migration-only PR.
This is the design pattern documented in feedback memory
"branch_protection_check_name_parity": split into a real job + a
no-op shadow that share the same `name:`. Exactly one runs per PR;
both report the same check context, and at least one always reports
SUCCESS, satisfying the required check.
The no-op job runs in a few seconds (a single `echo` step) and produces
the right check context for any PR that doesn't touch canvas/**.
Concrete blocker that prompted this: PR #2314 (RFC #2312 PR-B) sat
APPROVED + CI-green + UP-TO-DATE for half an hour with mergeStateStatus
BLOCKED, traced via the GraphQL `isRequired` field to a single
SKIPPED Canvas (Next.js) check. PRs #2319 (PR-F) and the rest of the
RFC #2312 stack would have hit the same wall.
Foundation for the HTTP-forward architecture that replaces Docker-exec
in chat upload + 5 follow-on handlers. This PR is intentionally scoped
to schema + token mint + provisioner wiring; no caller reads the secret
yet so behavior is unchanged.
Why a second per-workspace bearer (not reuse the existing
workspace_auth_tokens row):
workspace_auth_tokens          workspaces.platform_inbound_secret
─────────────────────          ───────────────────────────────────
workspace → platform           platform → workspace
hash stored, plaintext gone    plaintext stored (platform reads back)
workspace presents bearer      platform presents bearer
platform validates by hash     workspace validates by file compare
Distinct roles, distinct rotation lifecycle, distinct audit signal —
splitting them later would require a fleet-wide rolling rotation, so we
pay the schema cost up front.
Changes:
* migration 044: ADD COLUMN workspaces.platform_inbound_secret TEXT
* wsauth.IssuePlatformInboundSecret + ReadPlatformInboundSecret
* issueAndInjectInboundSecret hook in workspace_provision: mints
on every workspace create / re-provision; Docker mode writes
plaintext to /configs/.platform_inbound_secret alongside .auth_token,
SaaS mode persists to DB only (workspace will receive via
/registry/register response in a follow-up PR)
* 8 unit tests against sqlmock — covering the happy path, rotation, NULL
column, empty string, missing workspace row, and empty workspaceID
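A hedged sketch of the mint/read pair — the SQL shape and signatures are
assumptions; only the function names and the plaintext storage direction
are from this change:

    package wsauth

    import (
        "context"
        "crypto/rand"
        "database/sql"
        "encoding/hex"
    )

    // IssuePlatformInboundSecret mints a fresh platform→workspace bearer and
    // stores it in plaintext so the platform can read it back and present it.
    func IssuePlatformInboundSecret(ctx context.Context, db *sql.DB, workspaceID string) (string, error) {
        buf := make([]byte, 32)
        if _, err := rand.Read(buf); err != nil {
            return "", err
        }
        secret := hex.EncodeToString(buf)
        _, err := db.ExecContext(ctx,
            `UPDATE workspaces SET platform_inbound_secret = $1 WHERE id = $2`,
            secret, workspaceID)
        return secret, err
    }

    // ReadPlatformInboundSecret returns the stored plaintext; NULL columns and
    // missing rows are treated as empty-string here purely for illustration.
    func ReadPlatformInboundSecret(ctx context.Context, db *sql.DB, workspaceID string) (string, error) {
        var s sql.NullString
        err := db.QueryRowContext(ctx,
            `SELECT platform_inbound_secret FROM workspaces WHERE id = $1`, workspaceID).Scan(&s)
        if err == sql.ErrNoRows {
            return "", nil
        }
        return s.String, err
    }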
PR-B (next) wires up workspace-side `/internal/chat/uploads/ingest`
that validates the bearer against /configs/.platform_inbound_secret.
Refs #2312 (parent RFC), #2308 (chat upload 503 incident).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #2309 added an early-return that 422'd uploads to external workspaces
with "file upload not supported." Both halves of that diagnosis were wrong:
1. External workspaces SHOULD support uploads — gating with a 422
locks out intended functionality and presents that as a design decision.
2. The 503 the user actually hit was on an INTERNAL workspace, not
an external one. The runtime check never even ran.
Real root cause (separate fix incoming):
- findContainer(...) requires a non-nil h.docker.
- In SaaS (MOLECULE_ORG_ID set), main.go selects the CP provisioner
instead of the local Docker provisioner — dockerCli is nil.
- findContainer short-circuits to "" → 503 "container not running"
on every workspace, internal or external, on Railway-hosted
SaaS where workspaces actually live on EC2.
This PR strips the misleading gate so #2308 can be re-investigated
against the real symptom. The proper fix routes the multipart upload
over HTTP to the workspace's URL when dockerCli is nil — tracked
as a follow-up.
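A hypothetical sketch of that follow-up routing (not in this PR); every
name below is an assumption except the nil-dockerCli condition and the
forward-over-HTTP direction:

    package chatupload

    import (
        "io"
        "net/http"
    )

    type handler struct {
        docker any // nil in SaaS: main.go picked the CP provisioner
        client *http.Client
    }

    // forwardOrExec decides the upload path: no local Docker client means no
    // container to exec into, so the multipart body goes to the workspace's URL.
    func (h *handler) forwardOrExec(w http.ResponseWriter, r *http.Request, workspaceURL string) {
        if h.docker == nil {
            target := workspaceURL + "/internal/chat/uploads/ingest" // illustrative ingest path
            req, err := http.NewRequestWithContext(r.Context(), http.MethodPost, target, r.Body)
            if err != nil {
                http.Error(w, err.Error(), http.StatusBadGateway)
                return
            }
            req.Header.Set("Content-Type", r.Header.Get("Content-Type")) // keep the multipart boundary
            resp, err := h.client.Do(req)
            if err != nil {
                http.Error(w, err.Error(), http.StatusBadGateway)
                return
            }
            defer resp.Body.Close()
            w.WriteHeader(resp.StatusCode)
            io.Copy(w, resp.Body)
            return
        }
        // existing Docker-exec path continues here for container-backed workspaces
    }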
Refs #2308.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Creates a fresh tenant via /cp/admin/orgs, provisions an internal CEO
(claude-code default) + external child as its sub-agent, registers the
child, and probes peer visibility from three angles:
- DB-shape: child appears in /workspaces?parent_id=<parent>
- /registry/<child>/peers (child's bearer): does it see parent?
- /registry/<parent>/peers (parent's bearer, if exposed)
EXIT-trap teardown sends DELETE /cp/admin/tenants/:slug with the
required {"confirm":slug} body and polls /cp/admin/orgs for purge
confirmation (mirrors test_staging_full_saas.sh).
The harness was authored as the staging counterpart to the local
two-workspace reproduction script: the local script doesn't generalize to
staging's tenant-proxy auth chain, so each surface needs its own probe.
Run:
MOLECULE_ADMIN_TOKEN=<CP admin bearer> tests/e2e/test_2307_peer_visibility_staging.sh
Refs #2307.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Symptom: pasting a screenshot into the canvas chat for a runtime="external"
workspace returned `503 {"error":"workspace container not running"}` —
accurate from the upload handler's POV (no container exists for external
workspaces) but misleading because it implies the container has crashed.
Fix: detect runtime="external" via DB lookup BEFORE the container-find
step and return 422 with:
- error: "file upload not supported for external workspaces"
- detail: explains why + points at admin/secrets workaround +
references issue #2308 for the v0.2 native-support roadmap
- runtime: "external" (machine-readable for clients)
Why 422 not 200/501:
- 422 = Unprocessable Entity — the request is well-formed but the
workspace's runtime can't accept it. Standard REST semantics.
- 200 with empty result would lie; 501 implies the API itself is
unimplemented (it's not — works for non-external workspaces); 503
was the misleading status this PR fixes.
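A hedged sketch of the ordering the fix relies on — the runtime lookup
and the 422 happen before any container plumbing; struct and column names
are assumptions, the response fields follow the contract above:

    package chatfiles

    import (
        "database/sql"
        "encoding/json"
        "net/http"
    )

    type handler struct{} // minimal receiver for the sketch

    // rejectExternalRuntime returns true when it has written the 422 response,
    // i.e. the caller must stop before ever reaching findContainer.
    func (h *handler) rejectExternalRuntime(w http.ResponseWriter, db *sql.DB, workspaceID string) bool {
        var rt string
        if err := db.QueryRow(`SELECT runtime FROM workspaces WHERE id = $1`, workspaceID).Scan(&rt); err != nil {
            return false // let the normal path surface lookup errors
        }
        if rt != "external" {
            return false // internal workspaces continue to the container-find step
        }
        w.Header().Set("Content-Type", "application/json")
        w.WriteHeader(http.StatusUnprocessableEntity) // well-formed request, runtime can't accept it
        json.NewEncoder(w).Encode(map[string]string{
            "error":   "file upload not supported for external workspaces",
            "detail":  "external workspaces have no managed container; see #2308 for the v0.2 native-support roadmap",
            "runtime": "external",
        })
        return true
    }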
Verified via live E2E against localhost:
- Created `runtime=external,external=true` workspace
- Posted multipart to /workspaces/:id/chat/uploads
- Got 422 with the expected structured body
Unit test (`chat_files_external_test.go`) pins the contract via sqlmock
+ httptest. Notable: the handler is constructed with `templates: nil`
to prove the runtime check happens BEFORE any docker plumbing — if a
future change moves the check below findContainer, the test crashes
on nil-deref instead of silently regressing.
Out of scope (for v0.2 follow-up):
- Native external-workspace file ingest via artifacts table or the
channel-plugin's inbox/ pattern. Requires separate design pass.
Closes #2308
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Catches the class of bot-generated, structurally invalid Go that took
the staging Platform (Go) check red for hours on 2026-04-22 (PR #1769 commit
66ea0b64 nested a function declaration inside another function's body).
The patch tool applied it; the Go parser rejected it; every Go PR
targeting staging during the window failed CI through no fault of its
own.
The hook now runs `cd workspace-server && go build ./...` when any .go
file in workspace-server/ is staged. If the build fails, the commit is
rejected with the first 20 lines of build output. When go isn't
installed, the hook skips with a warning (CI runners + bots without go
bypass it cleanly).
Cost: ~5-10s per commit that touches Go, on a warm cache. Acceptable
for the class of bug it catches — the alternative (catching at PR time
via CI) is too late: the malformed commit has already been shared.
This is one of the three guards proposed in #1770. The other two
(branch-protection on `Platform (Go)` as required check; SHARED_RULES
clarification on bot-PR overrides) are admin / process changes that
need your action.
Closes the pre-commit half of #1770. Branch-protection + SHARED_RULES
work tracks separately.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Step 2 of #1815. Step 1 (instrumentation in canvas/vitest.config.ts)
already shipped — the inline comment there explicitly defers wiring
into CI to a follow-up because turning on a 70% threshold blind would
either fail CI immediately or paper over a real gap with an ad-hoc
exclude list.
This PR ships the observability half:
- Replaces `npx vitest run` with `npx vitest run --coverage` in the
canvas-build job. Coverage gets reported on every PR; no threshold
gate yet (vitest.config.ts intentionally doesn't set thresholds).
- Adds an artifact upload step for canvas/coverage/ (HTML + json-summary)
so reviewers can browse the coverage report from any PR. 7-day
retention; if-no-files-found=warn so a skipped step doesn't fail the job.
Step 3 (thresholds + hard gate) is the natural follow-up — track in a
new sub-issue once we've seen ~5-10 PRs of baseline data and know
where current coverage sits. The issue body proposed lines:70 /
functions:70 / branches:65 / statements:70; that may need adjustment
once the baseline lands.
Closes the Step-2 portion of #1815. Step 3 stays open or gets a fresh
issue depending on your preference.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a prominent section to CONTRIBUTING.md documenting that public
content (blog, marketing, OG images, SEO briefs, DevRel demos) belongs
in Molecule-AI/docs, not molecule-core. Mirrors the routing cheat-sheet
from #2060 with the table of content-type → target repo, and points
contributors at the existing `Block forbidden paths` CI gate as the
loud-fail signal.
Per the issue: 11 content PRs were silently blocked over 48h before
being closed and redirected. This in-repo notice gives contributors
(human and agent) a discoverable spot to learn the rule before opening
the wrong PR. The CI gate is already enforcing the policy; this just
makes the rule self-service.
Closes #2060
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The harness runner (scripts/measure-coordinator-task-bounds-runner.sh)
calls `/workspaces/:id/activity?since_secs=$A2A_TIMEOUT` to scope a
trace to a specific test window. The query param was silently
ignored — `ActivityHandler.List` accepted only `type`, `source`, and
`limit`, so the runner got the most-recent-100 events regardless of
how long ago they happened. That works for fresh-tenant tests where
activity_logs is nearly empty pre-run, but breaks on busy tenants and on
tests that exceed 100 events.
Adds `since_secs` parsing with three behaviors:
- Valid positive int → `AND created_at >= NOW() - make_interval(secs => $N)`
on the SQL. Parameterised; values bound via lib/pq, not interpolated.
`make_interval(secs => $N)` is required — the `INTERVAL '$N seconds'`
literal form rejects placeholder substitution inside the string.
- Above 30 days (2_592_000s) → silently clamped to the cap. Defends
against a paranoid client triggering a multi-month full-table scan
via `since_secs=999999999`.
- Negative, zero, or non-integer → 400 with a structured error, NOT
silently dropped. Silent drop is exactly the bug this is fixing
— a typoed param shouldn't be lost as most-recent-100.
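A minimal sketch of those three behaviors, assuming lib/pq-style positional
placeholders; everything except since_secs, make_interval, and the 30-day
cap is illustrative:

    package activity

    import (
        "fmt"
        "net/http"
        "strconv"
    )

    const maxSinceSecs = 2_592_000 // 30-day clamp

    // sinceClause turns ?since_secs= into a parameterised SQL fragment.
    // ok=false with a nil err means the param was omitted; a non-nil err
    // means the handler should respond 400 with a structured error.
    func sinceClause(r *http.Request, argPos int) (clause string, arg int, ok bool, err error) {
        raw := r.URL.Query().Get("since_secs")
        if raw == "" {
            return "", 0, false, nil // omitted: no extra clause, no extra bound arg
        }
        n, convErr := strconv.Atoi(raw)
        if convErr != nil || n <= 0 {
            return "", 0, false, fmt.Errorf("since_secs must be a positive integer, got %q", raw)
        }
        if n > maxSinceSecs {
            n = maxSinceSecs // silent clamp: defends against multi-month full-table scans
        }
        // make_interval(secs => $N) accepts a bound placeholder; INTERVAL '$N seconds' would not.
        return fmt.Sprintf(" AND created_at >= NOW() - make_interval(secs => $%d)", argPos), n, true, nil
    }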
Tests cover all four paths: accepted (with arg-binding assertion via
sqlmock.WithArgs), clamped at 30 days, invalid rejected (5 sub-cases),
and omitted (verifies no extra clause / arg leak via strict WithArgs
count).
RFC #2251 §V1.0 step 6 (platform-side-transition audit) also depends
on this for time-window filtering of activity_logs.
Closes #2268
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per-workspace `restartState` entries (introduced under the name
`restartMu` pre-#2266, renamed to `restartStates` in #2266) are
created via `LoadOrStore` in `workspace_restart.go` but never
deleted. On a long-running platform process serving many short-lived
workspaces (E2E tests, transient sandbox tenants), the sync.Map grows
monotonically — ~16 bytes per workspace ever created.
Fix: call `restartStates.Delete(wsID)` after stopAndRemove +
ClearWorkspaceKeys for each cascaded descendant and the parent. Mirrors
the existing per-ID cleanup loop. `sync.Map.Delete` is safe on absent
keys, so workspaces that were never restarted (no LoadOrStore call)
are a no-op.
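Sketch of where the Delete lands; only restartStates.Delete(wsID) is from
this change, the surrounding loop shape and helper names are assumed:

    package workspaces

    import "sync"

    var restartStates sync.Map // workspaceID -> restart state, created via LoadOrStore

    func stopAndRemove(wsID string)      {} // existing teardown (stubbed)
    func clearWorkspaceKeys(wsID string) {} // existing key cleanup (stubbed)

    // deleteWorkspaces mirrors the existing per-ID cleanup loop: cascaded
    // descendants first, then the parent, each dropping its restart-state entry.
    func deleteWorkspaces(descendants []string, parent string) {
        for _, wsID := range append(descendants, parent) {
            stopAndRemove(wsID)
            clearWorkspaceKeys(wsID)
            restartStates.Delete(wsID) // safe on absent keys: never-restarted workspaces are a no-op
        }
    }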
This is a pre-existing leak — #2266 did not introduce it; just renamed
the holder. Filing as a separate commit to keep the change minimal and
reviewable.
Closes #2269
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pre-#2290 `force: true` flag on POST /org/import skipped the
required-env preflight, letting orgs import without their declared
required keys (e.g. ANTHROPIC_API_KEY). The ux-ab-lab incident: that
import path was used, the org shipped without ANTHROPIC_API_KEY in
global_secrets, and every workspace 401'd on the first LLM call.
Per the picks recorded on #2290:
- Q1=C: template-derived required_env (no schema change — already
the existing aggregation via collectOrgEnv).
- Q2=remove: drop the bypass entirely. The seed/dev-org flow that
legitimately needs to skip becomes a separate dry-run-import path
with its own audit trail, not a permission bypass.
- Q3=block-at-import-only: provision-time drift logging is a
follow-up; for this PR, blocking at import is the gate.
Surface change:
- Force field removed from POST /org/import request body.
- 412 "suggestion" text drops the "or pass force=true" guidance.
- Legacy callers sending {"force": true} are silently tolerated
(Go's json.Unmarshal drops unknown fields), so no client-side
breakage; the bypass effect is just gone.
Audited callers in this repo:
- canvas/src/components/TemplatePalette.tsx — never sends force.
- scripts/post-rebuild-setup.sh — never sends force.
- Only external tooling sent force=true. Those callers must now set
the global secret via POST /settings/secrets before importing.
Adds TestOrgImport_ForceFieldRemoved as a structural pin: if a future
change re-adds Force to the body struct, the test fails and forces an
explicit reckoning with the #2290 rationale.
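A hedged sketch of what the structural pin can look like — the request
struct and its fields are stand-ins; only the test name and intent are
from this change:

    package orgimport

    import (
        "reflect"
        "testing"
    )

    // importRequest stands in for the POST /org/import body struct.
    type importRequest struct {
        TemplateID string `json:"template_id"`
    }

    func TestOrgImport_ForceFieldRemoved(t *testing.T) {
        if _, found := reflect.TypeOf(importRequest{}).FieldByName("Force"); found {
            t.Fatal("Force re-added to the /org/import body struct; revisit the #2290 rationale before restoring the bypass")
        }
    }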
Closes #2290
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #2265 renamed the harness trace endpoint and event name; sync the
cross-repo scripts/README.md to match.
Closes #2270
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #2289.
Some workspace template images ship `/usr/local/bin/{git,gh}` wrappers
that bake `GH_TOKEN` into argv handling (preferred — auto-PR creation
authenticates without explicit token plumbing); other templates have
plain `/usr/bin/git` installed via apt with no wrapper. The hardcoded
`_GIT = "/usr/local/bin/git"` crashed every auto-push attempt on the
latter image class:
  File "/app/molecule_runtime/executor_helpers.py", line 524, in _auto_push_and_pr_sync
    subprocess.run(['/usr/local/bin/git', 'rev-parse', '--is-inside-work-tree'], ...)
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/bin/git'
`shutil.which("git")` walks PATH in order — finds the `/usr/local/bin/`
wrapper first when it exists, falls back to `/usr/bin/git` otherwise.
GH_TOKEN injection still wins on wrapper-equipped images; auto-push
no longer crashes on bare-apt images.
Verified locally: `shutil.which("git")` resolves to `/usr/bin/git` on
the bug-reporter's image; `shutil.which("gh")` resolves to the
homebrew path on dev. Both paths exist + are executable on respective
hosts.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Surfaced via cross-template review of the a2a-sdk v0→v1 migration:
every adapter executor (claude-code, gemini-cli, crewai, openclaw,
autogen) builds A2A response Messages independently using
`new_text_message(text)` from the SDK, which omits `task_id` and
`context_id`. The runtime's own canonical pattern in
`workspace/a2a_executor.py:466-475` correctly threads both:
    Message(
        message_id=uuid.uuid4().hex,
        role=Role.ROLE_AGENT,
        parts=_parts,
        task_id=task_id,        # ← canonical
        context_id=context_id,  # ← canonical
    )
When adapters skip these correlation fields, the platform's a2a proxy
can't reliably tie the response back to the originating task.
This is a divergence from canonical, not necessarily a strict bug
(task_id may be optional with a default) — but it's enough of a
correlation/observability gap that the canonical pattern bothers to
thread it.
Add `new_response_message(context, text, files=None)` to
executor_helpers.py — single home for response Message construction.
Templates can migrate from `new_text_message(text)` to this helper
in stacked PRs once the runtime publishes to PyPI.
The helper:
- Reads `context.task_id`/`context.context_id` from the inbound
RequestContext, falling back to fresh UUIDs (RequestContextBuilder
always sets them in production; fallback is for unit tests).
- Sets `role=Role.ROLE_AGENT` (the v1 enum value).
- Builds text Parts via `Part(text=...)` and file Parts via
`Part(url="workspace:<path>", filename=..., media_type=...)`.
- Returns a v1 protobuf Message ready for
`event_queue.enqueue_event(...)`.
Why "files=None" with the workspace: URI scheme as the file Part
shape: matches the canonical pattern in a2a_executor.py exactly so
the platform's chat-attachment download path (executor_helpers.py
`resolve_attachment_uri`) interprets responses uniformly across all
adapters.
Tests (5, all pass with --no-cov against the live runtime image):
- test_new_response_message_text_only
- test_new_response_message_with_files
- test_new_response_message_files_only_no_text
- test_new_response_message_falls_back_when_context_ids_unset
- test_new_response_message_handles_missing_attrs
The conftest's a2a stubs needed an extension for Message + Role +
Part with kwargs preservation. Strictly additive — no existing tests
affected. (The 19 pre-existing failures in test_executor_helpers.py
are unrelated debt from the commit_memory/recall_memory rewrite,
visible on staging baseline before this change.)
Per-template migration is the follow-up: claude-code, gemini-cli,
crewai, openclaw, autogen all call `new_text_message(text)` today;
each gets a per-repo PR replacing it with
`new_response_message(context, text)`. This PR ships the helper
first so the templates have something to import.
Refs: PR #2266/#2267 (restart-race), claude-code #15 (FilePart fix),
gemini-cli #10/crewai #8/openclaw #9/autogen #8 (rename PRs).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-review caught a regression I introduced in #2266: if cycle() panics
(e.g. a future provisionWorkspace nil-deref or any runtime error from
the DB / Docker / encryption stacks it touches), the loop never reaches
`state.running = false`. The flag stays true forever, the early-return
guard at the top of coalesceRestart fires for every subsequent call,
and that workspace is permanently locked out of restarts until the
platform process restarts.
The pre-fix code had similar exposure (panic killed the goroutine
before defer wsMu.Unlock() ran in some Go versions), but my pending-
flag version made it worse: the guard is sticky, not ephemeral.
Fix: defer the state-clear so it always runs on exit, including panic.
Recover (and DON'T re-raise) so the panic doesn't propagate to the
goroutine boundary and crash the whole platform process — RestartByID
is always called via `go h.RestartByID(...)` from HTTP handlers, and
an unrecovered goroutine panic in Go terminates the program. Crashing
the platform for every tenant because one workspace's cycle panicked
is the wrong availability tradeoff. The panic message + full stack
trace via runtime/debug.Stack() are still logged for debuggability.
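A minimal sketch of that defer, with the state struct assumed; only the
recover-log-and-clear behaviour is from this change:

    package restart

    import (
        "log"
        "runtime/debug"
        "sync"
    )

    type restartState struct {
        mu      sync.Mutex
        running bool
    }

    func runCycle(state *restartState, cycle func()) {
        defer func() {
            if r := recover(); r != nil {
                // Swallow the panic: this runs on its own goroutine, and an
                // unrecovered goroutine panic would take down the whole process.
                log.Printf("restart cycle panicked: %v\n%s", r, debug.Stack())
            }
            state.mu.Lock()
            state.running = false // always clear, even on panic, so later restarts aren't locked out
            state.mu.Unlock()
        }()
        cycle()
    }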
Regression test in TestCoalesceRestart_PanicInCycleClearsState:
1. First call's cycle panics. coalesceRestart's defer must swallow
the panic — assert no panic propagates out (would crash the
platform process from a goroutine in production).
2. Second call must run a fresh cycle (proves running was cleared).
All 7 tests pass with -race -count=10.
Surfaced via /code-review-and-quality self-review of #2266; the
re-raise-after-recover anti-pattern (originally argued as "don't
mask bugs") came up in the comprehensive review and was corrected
to log-with-stack-and-suppress for availability.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The naive mutex-with-TryLock pattern in RestartByID was silently dropping
the second of two close-together restart requests. SetSecret and SetModel
both fire `go restartFunc(...)` from their HTTP handlers, and both DB
writes commit before either restart goroutine reaches loadWorkspaceSecrets.
If the second goroutine arrives while the first holds the per-workspace
mutex, TryLock returns false and the second is logged-and-dropped:
Auto-restart: skipping <id> — restart already in progress
The first goroutine's loadWorkspaceSecrets ran before the second write
committed, so the new container boots without that env var. Surfaced
during the RFC #2251 V1.0 measurement as hermes returning "No LLM
provider configured" when MODEL_PROVIDER landed after the API-key write
and lost its restart to the mutex (HERMES_DEFAULT_MODEL absent →
start.sh fell back to nousresearch/hermes-4-70b → derived
provider=openrouter → no OPENROUTER_API_KEY → request-time error).
The same race hits any back-to-back secret/model save flow including
the canvas's "set MiniMax key + pick model" UX.
Fix: pending-flag / coalescing pattern. Any restart request that arrives
while one is in flight sets `pending=true` and returns. The in-flight
runner, on completion, checks the flag and runs another cycle. This
collapses N concurrent requests into at most 2 sequential cycles (the
current one + one more that picks up everyone who arrived during it),
while guaranteeing the final container always sees the latest secrets.
Concrete contract:
- 1 request, no concurrency: 1 cycle
- N concurrent requests during 1 in-flight cycle: 2 cycles total
- N sequential requests (no overlap): N cycles
- Per-workspace state — different workspaces never serialize
Coalescing is extracted into `coalesceRestart(workspaceID, cycle func())`
so the gate logic is testable without the full WorkspaceHandler / DB /
provisioner stack. RestartByID now wraps that with the production cycle
function. runRestartCycle calls provisionWorkspace SYNCHRONOUSLY (drops
the historical `go`) so the loop's pending-flag check happens AFTER the
new container is up — without that, the next cycle's Stop call would
race the previous cycle's still-spawning provision goroutine.
sendRestartContext stays async; it's a one-way notification.
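A hedged sketch of the pending-flag gate, with the state shape assumed;
the control flow follows the contract above:

    package restart

    import "sync"

    type coalesceState struct {
        mu      sync.Mutex
        running bool
        pending bool
    }

    var states sync.Map // workspaceID -> *coalesceState: per-workspace, so workspaces never serialize

    func coalesceRestart(workspaceID string, cycle func()) {
        v, _ := states.LoadOrStore(workspaceID, &coalesceState{})
        st := v.(*coalesceState)

        st.mu.Lock()
        if st.running {
            st.pending = true // an in-flight cycle will run once more on our behalf
            st.mu.Unlock()
            return
        }
        st.running = true
        st.mu.Unlock()

        for {
            cycle() // synchronous: the pending check below happens only after the new container is up

            st.mu.Lock()
            if !st.pending {
                st.running = false
                st.mu.Unlock()
                return
            }
            st.pending = false // one more cycle picks up everyone who arrived during this one
            st.mu.Unlock()
        }
    }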
Tests in workspace_restart_coalesce_test.go cover the contract points
above and are race-detector clean over 10 iterations:
- Single call → 1 cycle
- 5 concurrent during in-flight → exactly 2 cycles total
- 3 sequential → 3 cycles
- Pending-during-cycle picked up (targeted bug repro)
- State cleared after drain (running flag reset)
- Per-workspace isolation (no cross-workspace serialization)
Refs: molecule-core#2256 (V1.0 gate measurement); root cause for the
"No LLM provider configured" symptom seen during hermes/MiniMax repro.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The runner was speculatively calling `/workspaces/:id/heartbeat-history` —
that endpoint doesn't exist on workspace-server. On local dev it 404'd;
on tenant builds the platform's :8080 canvas-proxy fallback intercepted
it and returned 28KB of Next.js HTML which then landed in the JSON event
log. Neither outcome was useful trace data.
`GET /workspaces/:id/activity` is the existing endpoint that reads
activity_logs. That table already records the events the RFC §V1.0
step 6 'platform-side transition' check needs (a2a_send / a2a_receive /
task_update / agent_log / error, plus duration_ms + status). Rename
the runner's fetch + emitted event accordingly.
Verified: GET /workspaces/<uuid>/activity?since_secs=60 returns 200
with `[]` against the local platform; no SaaS skip needed since the
endpoint exists in both environments.
Refs: molecule-core#2256 (V1.0 gate #1 measurement comment).
Three review-driven fixes to the runner before #2261 merges:
1. `WAIT_ONLINE_SECS / 3` truncated; an operator passing 200 actually
waited 198s. Round up so 200 → 67 polls × 3s = 201s ≥ requested.
2. The heartbeat-history endpoint isn't on tenant workspace-servers —
the platform's :8080 fallback proxies unmatched paths to the
canvas Next.js, so the SaaS run captured 28KB of HTML in the
`heartbeat_trace` event log. Skip the fetch in MODE=saas; emit an
explicit `<skipped: ...>` placeholder. Local mode behaviour
unchanged.
3. ORG_ID and ORG_SLUG had no client-side format check, so a typo'd
value got swallowed by TenantGuard's intentionally-opaque 404
(which doesn't tell the operator whether slug, UUID, or auth was
wrong). Validate the UUID and slug shapes up front so a mismatch
produces an actionable error.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two docs covering load-bearing patterns from today's work that
weren't previously discoverable:
1. workspace/platform_tools/README.md — explains the ToolSpec
single-source-of-truth pattern (#2240), the CLI-block alignment
gap that hand-maintained generation can't close (#2258), the
snapshot golden files + LF-pinning (#2260), and the add/rename/
remove playbook. The next reader who lands in
workspace/platform_tools/ now has the design rationale + the
safe-edit procedure colocated with the code.
2. scripts/README.md — disambiguates the three measure-coordinator-
task-bounds.sh files that now exist across two repos:
- scripts/measure-coordinator-task-bounds.sh (canonical OSS, this repo)
- scripts/measure-coordinator-task-bounds-runner.sh (Hermes/MiniMax variant, this repo)
- scripts/measure-coordinator-task-bounds.sh (production-shape, in molecule-controlplane)
Cross-references reference_harness_pair_pattern (auto-memory) for
the cross-repo design rationale. Documents the common safety
pattern (cleanup trap, DRY_RUN, non-target guard,
cleanup_*_failed events) and the heartbeat-trace caveat.
Refs: #2240, #2254, #2257, #2258, #2259, #2260; molecule-controlplane#321.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The original measure-coordinator-task-bounds.sh was hardcoded for
local-dev (workspace-server on :8080) with claude-code/langgraph
templates and OPENROUTER_API_KEY. Running it against staging requires
both auth-chain plumbing (per-tenant ADMIN_TOKEN + X-Molecule-Org-Id
TenantGuard header + tenant subdomain routing) and template/secret
flexibility (e.g. Hermes/MiniMax for Token Plan keys).
This adds:
* `measure-coordinator-task-bounds-runner.sh` — separate runner that
wraps the same workspace-server API calls but takes everything as
env-var inputs. Two MODE values:
- `local` → direct workspace-server (no auth/tenant scoping)
- `saas` → tenant subdomain + per-tenant ADMIN_TOKEN bearer +
X-Molecule-Org-Id TenantGuard header. Auto-fetches
tenant token via CP /cp/admin/orgs/<slug>/admin-token
given ORG_SLUG + CP_ADMIN_API_TOKEN, OR accepts a
pre-resolved TENANT_ADMIN_TOKEN.
* Configurable PM_TEMPLATE / CHILD_TEMPLATE / MODEL / SECRET_NAME /
SECRET_VALUE — defaults match the original (claude-code-default +
langgraph + OpenRouter). Hermes/MiniMax example documented in the
header.
* Per-poll status_change events during wait_online, so a workspace
that never reaches online surfaces its last status (provisioning,
failed, etc.) instead of a bare timeout.
* WAIT_ONLINE_SECS knob (default 180s; SaaS cold-start needs ~420s
for first hermes-image pull on a freshly-provisioned EC2 tenant).
* `${args[@]+...}` guard on the api() helper — avoids `set -u`
exploding on an empty header array (the local-dev hot-path).
The original script also gained a SECRET_VALUE block earlier in the
session — that change (separately staged) makes the secret-name
configurable without forcing every operator through the new runner.
V1.0 gate #1 (RFC #2251, Issue 4 repro) measurement results posted
as a separate comment on molecule-core#2256.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-review follow-up on #2258 (registry snapshot tests, just merged).
The byte-exact snapshot comparisons in test_platform_tools.py would
fail mysteriously on a Windows contributor's machine with
core.autocrlf=true: checkout would convert LF → CRLF, the test would
fail locally with no useful diagnostic, and the regen instructions
in the test-file header would produce LF files that disagree with
the working copy.
Pin workspace/tests/snapshots/*.txt to text eol=lf so this can't
happen. All three current snapshots are already LF; the attribute
ensures it stays that way.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-review follow-ups on #2257:
- Drop `local exit_code=$?` from cleanup(). `trap`-handler return values
are ignored, so capturing $? only misled a future reader into thinking
exit-code preservation was happening.
- Replace silenced `>/dev/null 2>&1` DELETE with `-w '%{http_code}'`
capture. ADMIN_TOKEN expiring mid-run was the realistic failure mode
here — previously we swallowed it under the silenced redirect, leaving
workspaces leaked with no signal. Now a 401/403/5xx surfaces as a
`cleanup_failed` JSON event with a remediation hint pointing at
cleanup-rogue-workspaces.sh; 404 is treated as success (the
post-condition — workspace absent — holds).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>