molecule-core

Author	SHA1	Message	Date
Hongming Wang	e0f9434eaf	fix(files-eic): silence ssh known-hosts warning that 500'd Hermes config load GET /workspaces/:id/files/config.yaml on hongming.moleculesai.app's Hermes workspace returned 500 with body: ssh cat: exit status 1 (Warning: Permanently added '[127.0.0.1]:37951' (ED25519) to the list of known hosts.) Root cause: ssh emits the "Permanently added" notice on every fresh tunnel connection, even with UserKnownHostsFile=/dev/null (that prevents persistence, not the warning). It lands on stderr, fooling readFileViaEIC's classifier: if len(out) == 0 && stderr.Len() == 0 { return nil, os.ErrNotExist } return nil, fmt.Errorf("ssh cat: %w (%s)", runErr, ...) stderr was non-empty (the warning), so we returned the wrapped error → 500 from the HTTP layer instead of 404. Fix: add `-o LogLevel=ERROR` to BOTH writeFileViaEIC and readFileViaEIC ssh invocations. Silences info+warning while keeping real auth/tunnel errors visible (those emit at ERROR level). Test: TestSSHArgs_LogLevelErrorBothSites pins the flag in both blocks. Mutation-tested: stripping the flag from one site fails the gate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 19:58:49 -07:00
Hongming Wang	daefdd21c5	Merge pull request #2819 from Molecule-AI/fix/auto-promote-cancelled-not-failure fix(auto-promote): treat E2E completed/cancelled as defer, not failure	2026-05-05 02:33:10 +00:00
Hongming Wang	8df8487bbe	fix(auto-promote): treat E2E completed/cancelled as defer, not failure Bug: the case statement at line 189 grouped completed/failure \| completed/cancelled \| completed/timed_out into the same "abort + exit 1" branch. cancelled ≠ failure — when per-SHA concurrency (memory: feedback_concurrency_group_per_sha) cancels an older E2E run because a newer push landed, the workflow blocked the whole auto-promote chain on a non-failure. Caught 2026-05-05 02:03 on sha `31f9a5e`: E2E got cancelled by concurrency, auto-promote :latest aborted with exit 1, the next auto-promote-staging cycle had to manually clean up. Split: failure/timed_out keep the abort path. cancelled gets its own clean-defer branch (same shape as in_progress) — proceed=false without exit 1, with a step-summary explaining likely concurrency supersession and pointing operators at manual dispatch if they need that specific SHA promoted. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 19:26:29 -07:00
molecule-ai[bot]	9a835ef631	Merge pull request #2817 from Molecule-AI/staging staging → main: auto-promote `856c967`	2026-05-04 19:21:07 -07:00
Hongming Wang	174e594690	Merge pull request #2816 from Molecule-AI/auto-sync/main-31f9a5e8 chore: sync main → staging (auto, ff to `31f9a5e8`)	2026-05-04 19:08:17 -07:00
Hongming Wang	856c967950	Merge pull request #2811 from Molecule-AI/fix/org-import-saas-outer-gate-1777939899 fix(provision): consolidate org-import gate + Auto self-marks-failed	2026-05-05 01:58:37 +00:00
Hongming Wang	73f7e0c03b	Merge pull request #2815 from Molecule-AI/fix/curl-stderr-regression fix(workflows): preserve curl stderr in 8 status-capture sites (PR #2810 follow-up)	2026-05-05 01:57:10 +00:00
molecule-ai[bot]	31f9a5e85e	Merge pull request #2812 from Molecule-AI/staging staging → main: auto-promote `9c9be4c`	2026-05-05 01:55:48 +00:00
Hongming Wang	c5dd14d8db	fix(workflows): preserve curl stderr in 8 status-capture sites Self-review of PR #2810 caught a regression: my mass-fix added `2>/dev/null` to every curl invocation, suppressing stderr. The original `\|\| echo "000"` shape only swallowed exit codes — stderr (curl's `-sS`-shown dial errors, timeouts, DNS failures) still went to the runner log so operators could see WHY a connection failed. After PR #2810 the next deploy failure would log only the bare HTTP code with no context. That's exactly the kind of diagnostic loss that makes outages take longer to triage. Drop `2>/dev/null` from each curl line — keep it on the `cat` fallback (which legitimately suppresses "no such file" when curl crashed before -w ran). The `>tempfile` redirect alone captures curl's stdout (where -w writes) without touching stderr. Same 8 files as #2810: redeploy-tenants-on-{main,staging}, sweep-stale-e2e-orgs, e2e-staging-{sanity,saas,external,canvas}, canary-staging. Tests: - All 8 files pass the lint - YAML valid Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 18:54:50 -07:00
Hongming Wang	7e1fdf5847	refactor(provision): use HasProvisioner() at all gate-y both-nil checks SSOT pass — replace 4 bare `h.provisioner == nil && h.cpProv == nil` checks with `!h.HasProvisioner()`. When a third backend lands (k8s, containerd, whatever), HasProvisioner gets one new field; bare both-nil checks would each need to be hunted and updated. Sites: - a2a_proxy_helpers.go:166 — maybeMarkContainerDead skip-no-backend - workspace_restart.go:118 — Restart endpoint guard - workspace_restart.go:363 — RestartByID coalescer guard - workspace_restart.go:660 — Resume endpoint guard Adds TestNoBareBothNilCheck (source-level) so the antipattern can't slip back in. Out of scope but discovered during the audit (filed separately): - team.go:207 — team-collapse Stop is Docker-only, leaks EC2 on SaaS - workspace_crud.go:423 — workspace delete cleanup is Docker-only, leaks EC2 on SaaS Both need a StopWorkspaceAuto mirror of provisionWorkspaceAuto. Same class of bug as today's org-import incident, different verb (stop vs provision). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 18:51:53 -07:00
Hongming Wang	d084d7e61a	fix(provision): consolidate org-import gate + Auto self-marks-failed Two changes that close the silent-drop bug class: 1. Add WorkspaceHandler.HasProvisioner() and use it as the org-import gate. Pre-fix, org_import.go:178 read `h.provisioner != nil` (Docker- only) — on SaaS tenants where cpProv is wired but Docker is nil, the entire 220-line provisioning prep block was skipped. The Auto call PR #2798 added at line 395 was unreachable on SaaS. Repro: 2026-05-05 01:14 — hongming prod tenant, 7-workspace org import, every workspace sat in 'provisioning' for 10 min until the sweeper marked it failed with the misleading "container started but never called /registry/register". 2. provisionWorkspaceAuto self-marks-failed on the no-backend path. Defense in depth: even if a future caller bypasses HasProvisioner gating or ignores the bool return (TeamHandler pre-#2367 did exactly this), the workspace ends in a clean failed state with an actionable error message instead of lingering until the 10-min sweep. Auto becomes the single source of truth for "start a workspace" — routing AND the no-backend failure path. Create's redundant if-not-Auto-then-mark-failed block collapses (kept only the workspace_config UPSERT, which is a Create-specific UI concern for rendering runtime/model on the Config tab). Tests: - TestProvisionWorkspaceAuto_NoBackendMarksFailed pins the new contract - TestHasProvisioner_TrueOnCPOnly catches the SaaS-only blind spot - TestHasProvisioner_TrueOnDockerOnly preserves self-hosted shape - TestHasProvisioner_FalseWhenNeitherWired pins the gate-out path - TestOrgImportGate_UsesHasProvisionerNotBareField source-pins the gate (verified: FAILS against the buggy `h.provisioner != nil` shape, PASSES with `h.workspace.HasProvisioner()`) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 18:47:02 -07:00
Hongming Wang	9c9be4cf12	Merge pull request #2810 from Molecule-AI/fix/workflow-curl-status-pollution fix(workflows): rewrite curl status-capture to prevent exit-code pollution + add lint	2026-05-05 01:33:37 +00:00
Hongming Wang	f256bfa9c6	Merge pull request #2809 from Molecule-AI/feat/codex-tab-bridge-daemon-snippet feat(external-templates): codex tab now includes bridge-daemon inbound path	2026-05-05 01:33:12 +00:00
Hongming Wang	463316772b	fix(workflows): rewrite curl status-capture to prevent exit-code pollution The 2026-05-04 redeploy-tenants-on-main run for sha `2b862f6` emitted "HTTP 000000" and failed the deploy. Root cause: when curl exits non- zero (connection reset → 56, --fail-with-body 4xx/5xx → 22), the `-w '%{http_code}'` already wrote a status to stdout; the inline `\|\| echo "000"` then fires AND appends another "000" to the captured substitution stdout. Result: HTTP_CODE="<actual><000>" — fails string comparisons against "200" while looking superficially right. Same class of bug the synth-E2E §7c gate hit twice (PRs #2779/#2783 + #2797). Memory feedback_curl_status_capture_pollution.md. Mass fix in 8 workflows: route -w into a tempfile so curl's exit code can't pollute stdout. Wrap with set +e/-e so the non-zero curl exit doesn't trip the outer pipeline. redeploy-tenants-on-main.yml (production-critical, caught the bug) redeploy-tenants-on-staging.yml (sibling) sweep-stale-e2e-orgs.yml (cleanup loop) e2e-staging-sanity.yml (E2E safety-net teardown) e2e-staging-saas.yml e2e-staging-external.yml e2e-staging-canvas.yml canary-staging.yml Plus a new lint workflow `lint-curl-status-capture.yml` that runs on every PR/push touching `.github/workflows/**`. Multi-line aware: collapses bash `\` continuations, then matches the buggy $(curl ... -w '%{http_code}' ... \|\| echo "000") subshell shape. Distinguishes from the SAFE $(cat tempfile \|\| echo "000") shape (cat with missing file emits empty stdout, no pollution). Verified: - All 8 workflows pass the lint locally - A known-bad injection is caught - A known-safe cat-fallback passes through - yaml.safe_load clean on all changed files Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 18:29:38 -07:00
Hongming Wang	dfd0bc528c	fix(external-templates): codex-channel-molecule via git+ URL (not on PyPI yet) Mirrors the pattern hermes-channel-molecule uses (line 256). Drops the broken `pip install codex-channel-molecule` which would 404. PyPI publish workflow is a separate piece of work — until then, git+https install is the path operators get. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 18:29:23 -07:00
Hongming Wang	4ea6f437e9	feat(external-templates): codex tab now includes the bridge-daemon inbound path The codex tab in the External Connect modal had a "outbound-tools-only first cut" caveat — operators got the MCP wiring for codex calling platform tools, but there was no documented inbound path. Canvas messages couldn't wake an idle codex session. That gap is now filled by codex-channel-molecule (github.com/Molecule-AI/codex-channel-molecule), shipped today as the codex counterpart to hermes-channel-molecule. The daemon long-polls the platform inbox, runs `codex exec --resume <session>` per inbound message, captures the assistant reply, routes it back via send_message_to_user / delegate_task, and acks the inbox row. Per-thread session continuity persisted to disk so daemon restarts don't lose conversation context. This commit: - Updates externalCodexTemplate to include `pip install codex-channel-molecule` (step 1) and a foreground `nohup codex-channel-molecule` invocation (step 3) using the same env-var contract as the MCP server (WORKSPACE_ID + PLATFORM_URL + MOLECULE_WORKSPACE_TOKEN). - Adds a "Canvas messages don't wake codex" common-issues entry to the TAB_HELP codex section pointing at the bridge daemon log. - Updates the doc comment to record the upstream deprecation path: when openai/codex#17543 lands, the bridge becomes redundant and the wired MCP server delivers push natively. Verified: TestExternalTemplates_NoMoleculeOrgIDPlaceholder still passes (no MOLECULE_ORG_ID re-introduction); full handlers suite green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 18:28:35 -07:00
Hongming Wang	a872202fe7	Merge pull request #2808 from Molecule-AI/auto-sync/main-2b862f65 chore: sync main → staging (auto, ff to `2b862f65`)	2026-05-04 18:11:12 -07:00
molecule-ai[bot]	2b862f65f9	Merge pull request #2807 from Molecule-AI/staging staging → main: auto-promote `0f389ba`	2026-05-04 17:52:41 -07:00
molecule-ai[bot]	53760a8a2f	Merge pull request #2805 from Molecule-AI/staging staging → main: auto-promote `461e5dc`	2026-05-05 00:40:12 +00:00
Hongming Wang	0f389ba325	Merge pull request #2804 from Molecule-AI/fix/external-templates-drop-molecule-org-id fix(external-templates): drop MOLECULE_ORG_ID from codex/openclaw/hermes snippets	2026-05-05 00:38:45 +00:00
Hongming Wang	472862bc50	fix(external-templates): drop MOLECULE_ORG_ID from operator-facing snippets Codex / openclaw / hermes-channel snippets each instructed operators to set `MOLECULE_ORG_ID = "<your org id>"`. The molecule_runtime MCP subprocess these snippets spawn never reads MOLECULE_ORG_ID — that env var is consumed only by workspace-server's TenantGuard middleware, server-side, on the tenant box itself (set by the control plane via user-data on provision). External operator → tenant calls pass TenantGuard via the isSameOriginCanvas path (Origin matches Host), with auth via Bearer token + X-Workspace-ID. The universal_mcp snippet — which calls into the same molecule_runtime — has always (correctly) omitted MOLECULE_ORG_ID; this brings codex / openclaw / hermes-channel into line. Symptom that caught it: an external codex CLI session, after pasting the codex-tab snippet, surfaced "MOLECULE_ORG_ID is still set to '<your org id>'" as an unresolved blocker — agent reasonably treated the placeholder as required setup. Operator has no value to fill. Pinned with a structural test (TestExternalTemplates_NoMoleculeOrgIDPlaceholder) so the placeholder can't drift back across all six external-tab templates. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 17:30:07 -07:00
Hongming Wang	461e5dcad0	Merge pull request #2803 from Molecule-AI/fix/memory-cutover-misconfig-warn fix(memory v2): warn at boot when cutover env half-configured	2026-05-05 00:27:24 +00:00
Hongming Wang	b5435b4732	fix(memory v2): warn at boot when cutover env half-configured MEMORY_V2_CUTOVER=true gates the admin export/import path on the v2 plugin, but the cutoverActive() check in admin_memories.go silently returns false when the plugin isn't wired: func (h AdminMemoriesHandler) cutoverActive() bool { if os.Getenv(envMemoryV2Cutover) != "true" { return false } return h.plugin != nil && h.resolver != nil } Two operator misconfigs hit the silent-fallback path: 1. MEMORY_V2_CUTOVER=true set, MEMORY_PLUGIN_URL unset → wiring.Build returns nil → handler stays on legacy SQL path → operator sees no error, assumes cutover is live, but every request still writes the legacy table. 2. MEMORY_V2_CUTOVER=true set, MEMORY_PLUGIN_URL set, but plugin unreachable at boot → wiring.Build still returns the bundle (intentional — circuit breaker handles ongoing unavailability), but every cutover write quietly falls back via the breaker. → only signal: legacy table keeps growing. Both are exactly the "structurally invisible until prod" failure mode; the only real-world detection today is "notice the legacy table is still being written to," which no operator will check. Add loud, distinctive WARN log lines at Build() time for both shapes. Boot logs are operator-visible, so a half-config is immediately obvious without needing dashboards. Tests: 4 new (cutover+no-URL → warn, neither set → silent, cutover+probe- fail → loud warn, probe-fail-without-cutover → quiet generic) * 6 existing (still pass; pin no-warning-on-happy-path) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 17:24:11 -07:00
Hongming Wang	4b16c95450	staging → main: auto-promote `f1b72af` staging → main: auto-promote `e39d818`	2026-05-04 17:11:20 -07:00
Hongming Wang	f1b72af97e	Merge pull request #2798 from Molecule-AI/fix/org-import-saas-routing-1777938328 fix(org-import): route through provisionWorkspaceAuto so SaaS gets EC2 — closes #2486	2026-05-04 23:54:37 +00:00
Hongming Wang	31facfc5c4	Merge pull request #2797 from Molecule-AI/fix/synth-e2e-9c-parse fix(synth-e2e): correct §9c stale-409 capture (curl --fail-with-body pollution)	2026-05-04 23:50:59 +00:00
Hongming Wang	19e7acdc22	fix(org-import): route through provisionWorkspaceAuto so SaaS gets EC2 Org-import called h.workspace.provisionWorkspace directly — same silent- drop bug that bit TeamHandler.Expand on 2026-05-04 (see workspace.go :121-125 comment + #2486). Symptom on SaaS: every claude-code workspace sat in "provisioning" until the 600s sweeper marked it failed with "container started but never called /registry/register" — because no container ever existed; the goroutine returned silently when the Docker provisioner field was nil. User reproduced 2026-05-04 ~22:30Z importing a 7-workspace template on the hongming prod tenant. Tenant CP logs (queried live via SSM) showed ZERO "Provisioner: goroutine entered" or "CPProvisioner: goroutine entered" lines for any of the 7 failed workspace UUIDs in the 60min window — confirming the goroutine never ran past line 384 of org_import.go because provisionWorkspace returned early in SaaS mode. The fix is one line: replace h.workspace.provisionWorkspace with h.workspace.provisionWorkspaceAuto. Auto is the single source of truth for backend selection (workspace.go:130) — picks CP-mode when h.cpProv is wired, Docker-mode when h.provisioner is wired, returns false when neither. ALSO adds a generic source-level gate (TestNoCallSiteCallsDirectProvisionerExceptAuto) so the next future caller can't repeat the pattern. Walks every non-test .go file in handlers/ and fails if any direct call to provisionWorkspace( or provisionWorkspaceCP( appears outside the dispatcher's own definition file. The gate currently allows workspace_restart.go which has its own manual if-h.cpProv-else dispatch (functionally equivalent to Auto, not the bug class — but is architectural duplication; follow-up filed for proper de-dup). Test plan: - TestOrgImport_UsesAutoNotDirectDockerPath: pin the org_import.go call site - TestNoCallSiteCallsDirectProvisionerExceptAuto: generic gate against future drift - TestTeamExpand_UsesAutoNotDirectDockerPath (existing): symmetric for team.go All 3 + the rest of the handler suite pass. Closes #2486 Pairs with: PR #2794 (configurable provision concurrency) which made it possible to bisect concurrency-vs-routing as the cause Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 16:49:07 -07:00
Hongming Wang	1ce51abea4	fix(synth-e2e): correct §9c stale-409 capture — curl exit code polluted status The §9c "Memory KV Edit round-trip" gate (added in #2787) captured the expected-409 status code via: $(tenant_call ... -w "%{http_code}" \|\| echo "000") tenant_call uses CURL_COMMON which carries --fail-with-body. On the expected 409, curl exits 22; the `\|\| echo "000"` then fires and appends "000" to the captured stdout — yielding "409000" instead of "409", failing the gate even though the contract was satisfied. Caught on PR #2792's first E2E run (status got "409000"). Has been silently failing the staging-SaaS E2E since #2787 merged earlier today; nothing else surfaced it because the workflow is informational, not required. Fix: route -w into its own tempfile so curl's exit code can't pollute the captured stdout. Wrap with set +e/-e so the 22 doesn't trip the outer pipeline. Same shape as the §7c gate fix that PR #2779/#2783 landed for the same class of bug. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 16:46:35 -07:00
Hongming Wang	0ec226e119	Merge pull request #2795 from Molecule-AI/feat/python-critical-path-coverage-floor ci(coverage): per-file 75% floor for MCP/inbox/auth Python critical paths (Phase A of #2790)	2026-05-04 23:39:06 +00:00
Hongming Wang	872b781f64	Merge pull request #2792 from Molecule-AI/feat/drop-shared-context feat: drop shared_context — use memory v2 team namespace	2026-05-04 23:37:49 +00:00
Hongming Wang	0dd1244510	Merge pull request #2794 from Molecule-AI/fix/cfg-prov-conc-iso feat(org-import): make provision concurrency configurable via env	2026-05-04 23:37:15 +00:00
Hongming Wang	26fa220bef	ci(coverage): per-file 75% floor for MCP/inbox/auth Python critical paths Closes part of #2790 (Phase A). The Python total floor at 86% (set in workspace/pytest.ini, issue #1817) averages over ~6000 lines, so a single MCP-critical file could regress to ~50% with no CI complaint as long as other modules compensate. This is the same distribution gap that #1823 closed Go-side: total floor passes while a critical handler sits at 0%. Added gates for these five files (per-file floor 75%): - workspace/a2a_mcp_server.py — MCP dispatcher (PR #2766 / #2771) - workspace/mcp_cli.py — molecule-mcp standalone CLI entry - workspace/a2a_tools.py — workspace-scoped tool implementations - workspace/inbox.py — multi-workspace inbox + per-workspace cursors - workspace/platform_auth.py — per-workspace token resolver These handle multi-tenant routing, auth tokens, and inbox dispatch. Risk shape mirrors Go-side tokens/secrets — a 0%/50% file here is exactly where the PR #2766 dispatcher bug class slips through without a structural test. Floor 75% is strictly additive — current actuals 80-96% (measured 2026-05-04). No existing PR fails. Ratchet plan in COVERAGE_FLOOR.md target 90% by 2026-08-04. Implementation: pytest already writes .coverage; new step emits a JSON view scoped to the critical files via `coverage json --include="*name"`, then jq extracts each file's percent_covered. Exact key match by basename so workspace/builtin_tools/a2a_tools.py (a different 100% file) doesn't shadow workspace/a2a_tools.py. Verified locally with the actual coverage data: - floor=75 → 0 failures (matches current state) - floor=81 → 1 failure (a2a_tools.py at 80%) — proves the gate trips Pairs with PR #2791 (Phase B — schema↔dispatcher AST drift gate). Phase C (molecule-mcp e2e harness) remains the largest piece in #2790. YAML validated locally before commit per feedback_validate_yaml_before_commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 16:35:21 -07:00
Hongming Wang	5559e96400	Merge branch 'temp-staging' into try-merge # Conflicts: # tests/e2e/test_staging_full_saas.sh	2026-05-04 16:34:55 -07:00
Hongming Wang	3bc7749e84	feat(org-import): make provision concurrency configurable via env Org-import was hard-capped at 3 concurrent workspace provisions (#1084), calibrated for Docker-mode workspaces where each provision was a docker-run. Now that workspaces are EC2 instances, AWS RunInstances parallelises happily and the artificial cap of 3 makes a 7-workspace org-import take 3-4× longer than necessary (3 batches × ~70s/provision ≈ 4 min wall time when AWS could absorb all 7 in parallel for ~70s). This PR makes the cap configurable via MOLECULE_PROVISION_CONCURRENCY: unset → 3 (Docker-mode default, unchanged) "0" → effectively unlimited (SaaS / EC2 backend; AWS rate-limit + vCPU quota are the real backpressure) N>0 → exactly N N<0 → fall back to default 3 + warning log garbage → fall back to default 3 + warning log The "0 = unlimited" mapping is the user-facing convention requested for SaaS deployments — operators don't have to pick an arbitrary large number. Implementation hands off 1<<20 internally so the channel-based semaphore stays a no-op without infinite-buffer risk. Test coverage (org_provision_concurrency_test.go, 6 cases / 15 subtests): - unset → default - "0" → large unlimited cap - positive integer exact (1, 5, 10, 50) - negative → default + warning - non-numeric → default + warning - whitespace-trimmed (" 7 " → 7) Boot-time log line confirms the resolved cap so an operator can verify their env is being honored without re-deploying. Does NOT address the separate 600s "never registered" timeout the user also reported during org-import — that's filed as molecule-core#2793 for proper investigation (parallel-provision contention, network routing, register-retry budget, or container-start failure are all candidates and need live SSM capture to bisect). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 16:33:49 -07:00
Hongming Wang	6d7a7fc86f	feat(org-import): make provision concurrency configurable via env Org-import was hard-capped at 3 concurrent workspace provisions (#1084), calibrated for Docker-mode workspaces where each provision was a docker-run. Now that workspaces are EC2 instances, AWS RunInstances parallelises happily and the artificial cap of 3 makes a 7-workspace org-import take 3-4× longer than necessary (3 batches × ~70s/provision ≈ 4 min wall time when AWS could absorb all 7 in parallel for ~70s). This PR makes the cap configurable via MOLECULE_PROVISION_CONCURRENCY: unset → 3 (Docker-mode default, unchanged) "0" → effectively unlimited (SaaS / EC2 backend; AWS rate-limit + vCPU quota are the real backpressure) N>0 → exactly N N<0 → fall back to default 3 + warning log garbage → fall back to default 3 + warning log The "0 = unlimited" mapping is the user-facing convention requested for SaaS deployments — operators don't have to pick an arbitrary large number. Implementation hands off 1<<20 internally so the channel-based semaphore stays a no-op without infinite-buffer risk. Test coverage (org_provision_concurrency_test.go, 6 cases / 15 subtests): - unset → default - "0" → large unlimited cap - positive integer exact (1, 5, 10, 50) - negative → default + warning - non-numeric → default + warning - whitespace-trimmed (" 7 " → 7) Boot-time log line confirms the resolved cap so an operator can verify their env is being honored without re-deploying. Does NOT address the separate 600s "never registered" timeout the user also reported during org-import — that's filed as molecule-core#2793 for proper investigation (parallel-provision contention, network routing, register-retry budget, or container-start failure are all candidates and need live SSM capture to bisect). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 16:32:56 -07:00
Hongming Wang	ecb3c75d74	Merge pull request #2791 from Molecule-AI/feat/mcp-dispatcher-schema-drift-gate test(mcp): structural gate — schema↔dispatcher drift (Phase B of #2790)	2026-05-04 23:32:19 +00:00
Hongming Wang	2f7beb9bce	feat: drop shared_context — use memory v2 team namespace instead Parent → child knowledge sharing previously lived behind a `shared_context` list in config.yaml: at boot, every child workspace HTTP-fetched its parent's listed files via GET /workspaces/:id/shared-context and prepended them as a "## Parent Context" block. That paid the full transfer cost on every boot regardless of whether the agent needed it, single-parent SPOF, no team or org scope, and broken if the parent was unreachable. Replace with memory v2's team:<id> namespace: agents call recall_memory on demand. For large blob-shaped artefacts see RFC #2789 (platform-owned shared file storage). Removed: - workspace/coordinator.py: get_parent_context() - workspace/prompt.py: parent_context arg + injection block - workspace/adapter_base.py: import + call + arg pass - workspace/config.py: shared_context field + parser entry - workspace-server/internal/handlers/templates.go: SharedContext handler - workspace-server/internal/router/router.go: GET /shared-context route - canvas/src/components/tabs/ConfigTab.tsx: Shared Context tag input - canvas/src/components/tabs/config/form-inputs.tsx: schema field + default - canvas/src/components/tabs/config/yaml-utils.ts: serializer entry - 6 tests pinning the removed behavior; 5 doc references Added regression gates so any reintroduction is loud: - workspace/tests/test_prompt.py: build_system_prompt must NOT emit "## Parent Context" - workspace/tests/test_config.py: legacy YAML key loads cleanly but shared_context attr must NOT exist on WorkspaceConfig - tests/e2e/test_staging_full_saas.sh §9d: GET /shared-context must NOT return 200 against a live tenant Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 16:30:26 -07:00
Hongming Wang	bd881f8756	test(mcp): structural gate — schema↔dispatcher drift catches dropped kwargs Closes part of #2790 (Phase B). Prevents a recurrence of the PR #2766 → PR #2771 cycle: PR #2766 added ``source_workspace_id`` to four tools' ``input_schema`` and tool implementations, but the dispatcher in ``a2a_mcp_server.handle_tool_call`` silently dropped the kwarg for ``commit_memory`` / ``recall_memory`` / ``chat_history`` / ``get_workspace_info``. Schema lied; LLMs populated the param; every call fell back to ``WORKSPACE_ID``, defeating multi-tenant isolation. Existing dispatcher tests asserted return-value substrings (``"working" in result``) instead of kwarg flow, so the bug shipped to main and was only caught by re-reviewing post-merge. This change adds an AST-driven gate. For every ToolSpec in platform_tools.registry.TOOLS, the gate finds the matching ``elif name == "<tool>"`` arm in a2a_mcp_server.py and asserts that every property declared in input_schema.properties is read by an ``arguments.get("<property>", ...)`` call inside that arm. A new schema field the dispatcher forgets to forward fails CI loudly. Three tests: - test_every_dispatch_arm_reads_every_schema_property: main drift gate. Walks registry, matches dispatch arms by name, diffs declared vs read keys. - test_dispatch_arms_reach_every_registered_tool: inverse direction. A registered tool with no dispatch arm is "Unknown tool" at runtime, even though docs/wrappers/schema all advertise it. Catches PRs that add a ToolSpec but forget the dispatcher. - test_drift_gate_self_check_finds_known_arms: pin the AST parser. If handle_tool_call is refactored into a different shape (dict dispatch, registry-driven, etc.) and _load_dispatch_arms returns {}, the main gate vacuously passes — this self-check makes that failure mode explicit by requiring 12 known arms to be discovered. Verified the gate catches the PR #2766 bug: stripping ``source_workspace_id=arguments.get(...)`` from the commit_memory arm fails the gate with a descriptive error pointing at the missing kwarg and referencing the prior incident. Restored → 3 tests pass. Suite: 1733 passed (was 1730 + 3 new), 3 skipped, 2 xfailed. Why AST, not runtime invocation: the runtime mock-based tests in test_a2a_mcp_server.py already assert kwargs flow correctly for four explicitly-tested tools. This gate is cheaper (~1ms), catches new properties before someone has to remember the runtime test, and runs as a structural invariant. Phase A (Python coverage floor) and Phase C (molecule-mcp e2e harness) remain in #2790 as separate follow-ups. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 16:29:54 -07:00
hongmingwang-moleculeai	e39d818ac4	Merge pull request #2787 from Molecule-AI/feat/memory-tab-edit-affordance feat(memory tab): add Edit affordance with optimistic-locking	2026-05-04 23:20:51 +00:00
molecule-ai[bot]	ed4d24fb8c	Merge pull request #2786 from Molecule-AI/staging staging → main: auto-promote `095171f`	2026-05-04 16:19:31 -07:00
Hongming Wang	3a5544a9e6	feat(memory tab): add Edit affordance with optimistic-locking Memory tab supported only Add+Delete. Correcting an entry meant deleting and re-adding, losing the row's version counter and any concurrent-write guard the agent depends on. Now: per-row Edit button reveals an inline editor (value textarea + TTL). Save POSTs to the existing /memory upsert endpoint with if_match_version pinned to the entry's current version. On 409 the UI surfaces a retry hint and reloads. Tests: - 11 vitest cases covering pre-fill (JSON vs string), payload shape (parsed JSON, fallback to plain text, TTL inclusion/omission), cancel, 409 retry path, generic error path, and the no-version back-compat case. - E2E gate 9c in test_staging_full_saas.sh: seed → GET version → conditional update → assert new value → stale-version POST must 409. Pins the optimistic-locking contract end-to-end on staging. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 16:18:08 -07:00
Hongming Wang	095171f163	Merge pull request #2785 from Molecule-AI/fix/readfile-eic-symmetry fix(workspace files API): GET ReadFile via SSH-EIC for SaaS workspaces (fixes 'No config.yaml found' on Config tab)	2026-05-04 23:05:34 +00:00
Hongming Wang	9c7b34cb7f	fix(workspace files API): GET ReadFile via SSH-EIC for SaaS workspaces Pre-fix WriteFile (templates.go:436) had an `instance_id != ""` branch that dispatched to writeFileViaEIC (SSH through EC2 Instance Connect), but ReadFile (templates.go:362) skipped that branch entirely. ReadFile always tried `findContainer` (which only works for local-Docker workspaces, not SaaS EC2-per-workspace ones) and fell through to `resolveTemplateDir` (which returns the seed template, not the persisted workspace state). Net effect on production: every Canvas Config tab open against a SaaS workspace returned 404 "No config.yaml found" because GET couldn't see what PUT had written. Visible to users after PR #2781 ("show-misconfigured-state") surfaced the 404 as an error UX. Caught by the synth-E2E 7c gate's GET-back assertion, but misdiagnosed as a "test bug" and the GET assertion was dropped in PR #2783 (rather than fixed at the source). This PR closes the loop: 1. New `readFileViaEIC` helper in template_files_eic.go that mirrors writeFileViaEIC's SSH-via-EIC dance and runs `sudo -n cat <path>`. Returns os.ErrNotExist on missing file (cat exits 1 with empty stdout under `2>/dev/null`) so the handler maps it cleanly to 404. 2. ReadFile dispatch now mirrors WriteFile's: when `instance_id` is non-empty, use readFileViaEIC; otherwise fall through to the local-Docker / template-dir path. 3. ReadFile's DB query expanded to also select instance_id + runtime (was just name). Three sqlmock-based tests updated to match the new column shape; the existing local-Docker fallback path stays green by passing instance_id="" in the mock rows. Follow-up (separate PR): the synth-E2E 7c gate should restore the GET-back marker assertion now that the read/write paths are unified. That'll also catch any future Files API regression in the round-trip. This PR doesn't touch the gate to keep the scope tight. Verification: - go build ./... clean - full handlers test suite green (0.4s for ReadFile subset; 5.8s full) - The 3 ReadFile sqlmock tests still cover the local-Docker fallback (instance_id=""); SaaS EIC dispatch is covered by the upcoming re-enabled synth-E2E 7c GET assertion (deferred to follow-up) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 16:02:26 -07:00
Hongming Wang	8514ff1a96	Merge pull request #2780 from Molecule-AI/staging staging → main: auto-promote `e67a854`	2026-05-04 15:54:02 -07:00
Hongming Wang	1785732bbb	Merge pull request #2783 from Molecule-AI/fix/synth-gate-drop-get-roundtrip fix(synth-e2e): drop GET-back round-trip from 7c gate; PUT-200 only	2026-05-04 22:34:58 +00:00
Hongming Wang	066a0772ee	fix(synth-e2e): drop GET-back round-trip from 7c gate After the curl parse fix in #2779, the gate started reliably catching a DIFFERENT bug than it was designed for: the Files API's PUT and GET hit different paths/hosts and don't see each other's writes. PUT /workspaces/<id>/files/config.yaml → template_files_eic.go writeFileViaEIC → SSH-as-ubuntu through EIC tunnel into the workspace EC2 → `sudo install -D /dev/stdin /configs/config.yaml` → Lands at host:/configs on the workspace EC2 (correct: bind- mounted into the workspace container) GET /workspaces/<id>/files/config.yaml → templates.go ReadFile → `findContainer` looks for a docker container ON THE PLATFORM-TENANT HOST (not the workspace EC2) → Workspace containers don't run on platform-tenant; this returns empty → Fallback: read from h.resolveTemplateDir(wsName) on the platform-tenant host — i.e., the seed template directory, not the persisted workspace config So the GET reliably returns the original template config, not what PUT just wrote. The user-facing Save & Restart still works because the container reads /configs/config.yaml directly via bind-mount — the asymmetry only bites the gate. This is a separate latent bug worth its own task: unify the Files API read/write path (likely: ReadFile should also use SSH-EIC to the workspace EC2 for instance-backed workspaces, mirroring WriteFile). Tracked separately. For now, drop the GET-back assertion and keep just the PUT-200 check. The PUT-200 still catches today's bug class (#2769 EACCES on /opt/configs would have failed PUT with 500). When the read/write paths are unified, restore the marker check. Verification: - bash -n clean - The PUT-200 check would have caught PR #2769's bug (500 EACCES) - The dropped GET-back check would not have prevented today's user bug (PR #2769 was caught by the user, not by the gate, and the gate only existed afterward) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 15:32:47 -07:00
Hongming Wang	3f2cc8cdd6	Merge pull request #2781 from Molecule-AI/feat/canvas-show-misconfigured-state feat(canvas): render misconfigured workspaces with configuration_status from agent_card	2026-05-04 22:20:12 +00:00
Hongming Wang	5c80b9c3d6	feat(canvas): render misconfigured workspaces with the configuration_status from agent_card Closes molecule-controlplane#467 (issue filed against CP, but resolution landed canvas-side because the workspace-server ALREADY returns the agent_card JSONB blob with configuration_status / configuration_error fields populated by molecule-core PR #2756). No CP-side change needed — the gap was the canvas's blindness to those fields. Before this PR, a workspace whose adapter.setup() failed (typically missing/rotated LLM credential) appeared identical to a healthy one in the canvas tile: green "Online" status, no error indication. The operator had to dig into workspace logs to discover the env var to set. This PR surfaces the state via the existing status-pill UX: 1. STATUS_CONFIG gains a "not_configured" entry — amber dot/glow, "Not configured" label. Distinct from "online" (emerald) and "failed" (red) — the workspace is reachable, it just needs config. 2. canvas-topology exposes getConfigurationStatus / getConfigurationError helpers — strict equality on the JSONB field so unknown values pass through as null instead of crashing the tile renderer. 3. WorkspaceNode derives an `effectiveStatus` that overrides data.status with "not_configured" when (status === "online" AND agent_card.configuration_status === "not_configured"). The override only applies on top of "online" — a genuinely offline / failed / provisioning workspace keeps its existing treatment. 4. The configuration_error string surfaces in two places: the tile's aria-label (screen reader access) + a truncated preview row at the bottom of the tile (same visual as the existing "degraded error preview" — mirrors the established pattern for in-tile error surfacing). Test coverage: 11 new in canvas-topology-configuration-status.test.ts. Each helper covered for the happy path, missing fields, defensive ignores of unknown values, and an end-to-end "stale ready overrides old error" guard. Once this lands + canvas redeploys, operators see "Not configured: Neither OPENAI_API_KEY nor MINIMAX_API_KEY is set" right on the workspace tile instead of a confused-looking green "online" workspace that silently 503s every JSON-RPC request. Pairs with: molecule-core PR #2756 (decouple agent-card from setup), #2775 (boot_routes pin), #2778 (secret_redactor) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 15:14:40 -07:00
Hongming Wang	a8850bac55	Merge pull request #2778 from Molecule-AI/fix/redact-secrets-1777932233 fix(runtime): redact secret-shaped tokens from JSON-RPC error.data	2026-05-04 22:13:29 +00:00
Hongming Wang	adfa34c4ae	Merge pull request #2779 from Molecule-AI/fix/synth-gate-curl-parse fix(synth-e2e): correct curl status-code parse in 7c gate	2026-05-04 22:11:54 +00:00

1 2 3 4 5 ...

4211 Commits