Closes the bug class surfaced by Canvas E2E #2632: a workspace ends up
status='failed' with last_sample_error=NULL, and operators (or the
E2E poll loop) see the useless "Workspace failed: (no last_sample_error)"
with no triage signal.
Two pieces:
1. **bundle/importer.go markFailed** — the UPDATE was setting only
status, leaving last_sample_error NULL. Same incident class as the
silent-drop bugs in PRs #2811 + #2824, different code path.
markProvisionFailed in workspace_provision_shared.go has set the
message column for a long time; this writer drifted from that
convention. Fix: include last_sample_error in the SET clause + the
broadcast.
2. **AST drift gate** (db/workspace_status_failed_message_drift_test.go)
— Go AST walk that finds every db.DB.{Exec,Query,QueryRow}Context
call whose argument list binds models.StatusFailed and asserts the
SQL literal contains last_sample_error. Catches the next caller
that drifts the same convention. Verified to FAIL against the bug
shape (reverted importer.go temporarily — gate flagged the exact
line) and PASS against the fix.
Why an AST gate vs a regex: a pre-fix attempt with a regex over UPDATE
statements flagged status='online' / status='hibernating' /
status='removed' UPDATEs as false positives. Walking the AST and only
flagging calls that pass the StatusFailed constant eliminates those.
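The core of the walk can be sketched as follows — a minimal, self-contained reconstruction, not the shipped test. Names like `findDriftingCalls` are hypothetical, and the suffix match on `Context` is looser than the real gate's `db.DB.{Exec,Query,QueryRow}Context` targeting:

```go
package main

import (
	"go/ast"
	"go/parser"
	"go/token"
	"strings"
)

// findDriftingCalls flags any *Context DB call that binds a
// StatusFailed constant but whose SQL string literal never mentions
// last_sample_error — the exact drift shape the importer hit.
func findDriftingCalls(src string) []token.Pos {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, "src.go", src, 0)
	if err != nil {
		return nil
	}
	var drifting []token.Pos
	ast.Inspect(f, func(n ast.Node) bool {
		call, ok := n.(*ast.CallExpr)
		if !ok {
			return true
		}
		sel, ok := call.Fun.(*ast.SelectorExpr)
		if !ok || !strings.HasSuffix(sel.Sel.Name, "Context") {
			return true
		}
		bindsFailed, mentionsColumn := false, false
		for _, arg := range call.Args {
			switch a := arg.(type) {
			case *ast.SelectorExpr:
				if a.Sel.Name == "StatusFailed" {
					bindsFailed = true
				}
			case *ast.BasicLit:
				if a.Kind == token.STRING && strings.Contains(a.Value, "last_sample_error") {
					mentionsColumn = true
				}
			}
		}
		if bindsFailed && !mentionsColumn {
			drifting = append(drifting, call.Pos())
		}
		return true
	})
	return drifting
}

func main() {
	buggy := `package p
func f() { db.ExecContext(ctx, "UPDATE workspaces SET status=$1", models.StatusFailed) }`
	fixed := `package p
func f() { db.ExecContext(ctx, "UPDATE workspaces SET status=$1, last_sample_error=$2", models.StatusFailed, msg) }`
	if len(findDriftingCalls(buggy)) != 1 {
		panic("gate should flag the buggy shape")
	}
	if len(findDriftingCalls(fixed)) != 0 {
		panic("gate should pass the fixed shape")
	}
}
```

Because the filter keys on the bound constant rather than the SQL text, UPDATEs that set other statuses never enter the check at all — which is what kills the regex approach's false positives.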
Out of scope (filed separately if needed):
- The Canvas E2E that surfaced the missing message (#2632) is now a
required check on staging via PR #2827. Once this fix lands the
next staging push should re-run #2632's failing case and produce
a meaningful last_sample_error.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Multi-model review of #2827 caught: the script as-shipped would have
silently weakened branch protection on EVERY non-checks dimension
the moment anyone ran it. Live staging had
enforce_admins=true, dismiss_stale_reviews=false, strict=true,
allow_fork_syncing=false, bypass_pull_request_allowances={
HongmingWang-Rabbit + molecule-ai app
}
Script wrote the opposite for all five. Per memory
feedback_dismiss_stale_reviews_blocks_promote.md, the
dismiss_stale_reviews flip alone is the load-bearing one — it would
silently re-block every auto-promote PR (cost the user 2.5h once).
This PR:
1. apply.sh: per-branch payloads (build_staging_payload /
build_main_payload) that codify the deliberate per-branch policy
already on the repo, with the script's net contribution being
ONLY the new check names (Canvas tabs E2E + E2E API Smoke on
staging, Canvas tabs E2E on main).
2. apply.sh: R3 preflight that hits /commits/{sha}/check-runs and
asserts every desired check name has at least one historical run
on the branch tip. Catches typos like "Canvas Tabs E2E" vs
"Canvas tabs E2E" — pre-fix a typo would silently block every PR
forever waiting for a context that never emits. Skip via
--skip-preflight for genuinely-new workflows whose first run
hasn't fired.
3. drift_check.sh: compares the FULL normalised payload (admin,
review, lock, conversation, fork-syncing, deletion, force-push)
not just the checks list. Pre-fix the drift gate would have
missed a UI click that flipped enforce_admins or
dismiss_stale_reviews. Drops app_id from the comparison since
GH auto-resolves -1 to a specific app id post-write.
4. branch-protection-drift.yml: per memory
feedback_schedule_vs_dispatch_secrets_hardening.md — schedule +
pull_request triggers HARD-FAIL when GH_TOKEN_FOR_ADMIN_API is
missing (silent skip masks the gate disappearing).
workflow_dispatch keeps soft-skip for one-off operator runs.
Verified by running drift_check against live state: pre-fix would
have shown 5 destructive drifts on staging + 5 on main. Post-fix
shows ONLY the 2 intended additions on staging + 1 on main, which
go away after `apply.sh` runs.
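The full-payload comparison in item 3 can be sketched like this — an illustrative Go reconstruction (field names and the `drifted` helper are assumptions, not the shipped shell): normalise both payloads, strip `app_id` everywhere since GitHub rewrites -1 to a concrete app id post-write, then deep-compare the rest.

```go
package main

import (
	"encoding/json"
	"reflect"
)

// normalize parses a protection payload and drops app_id at every
// nesting level, since GH auto-resolves -1 to a specific app id
// after the write and that difference is not drift.
func normalize(raw []byte) map[string]any {
	var m map[string]any
	if err := json.Unmarshal(raw, &m); err != nil {
		return nil
	}
	dropAppID(m)
	return m
}

func dropAppID(m map[string]any) {
	delete(m, "app_id")
	for _, v := range m {
		if child, ok := v.(map[string]any); ok {
			dropAppID(child)
		}
	}
}

// drifted compares the FULL normalised payloads, not just the
// checks list, so a UI flip of enforce_admins shows up.
func drifted(desired, live []byte) bool {
	return !reflect.DeepEqual(normalize(desired), normalize(live))
}

func main() {
	desired := []byte(`{"enforce_admins":true,"checks":{"app_id":-1,"contexts":["Canvas tabs E2E"]}}`)
	live := []byte(`{"enforce_admins":true,"checks":{"app_id":12345,"contexts":["Canvas tabs E2E"]}}`)
	if drifted(desired, live) {
		panic("app_id alone must not count as drift")
	}
	flipped := []byte(`{"enforce_admins":false,"checks":{"app_id":12345,"contexts":["Canvas tabs E2E"]}}`)
	if !drifted(desired, flipped) {
		panic("enforce_admins flip must count as drift")
	}
}
```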
Adds the `delegations` table and the DelegationLedger writer that PRs #2-#4
of RFC #2829 build on. Schema-only foundation — no behavior change in this
PR. PR-2 wires the ledger into the existing handlers and ships the result-
push-to-inbox cutover behind a feature flag.
Why a dedicated table when activity_logs already records every delegation
event:
Today, "what is currently in flight for this workspace" is reconstructed
by GROUPing activity_logs by delegation_id and ORDER BY created_at DESC.
PR-3's stuck-task sweeper needs the join
SELECT delegation_id FROM delegations
WHERE status = 'in_progress'
AND last_heartbeat < now() - interval '10 minutes'
which the event stream can express only via a window over every
(delegation_id, latest event) pair — a planner-killing query at scale.
The dedicated table makes the sweeper an indexed scan.
Same posture as tenant_resources (PR #2343, memory
`reference_tenant_resources_audit`): activity_logs remains the audit-
grade source of truth, delegations is the queryable view for dashboards
+ sweeper joins. Symmetric writes — both tables are written, neither
blocks orchestration on the other's failure.
Schema highlights:
- delegation_id PRIMARY KEY (caller-chosen, idempotent retry on
restart is a no-op via ON CONFLICT DO NOTHING)
- caller_id / callee_id NOT FK — workspace delete must NOT cascade-
delete delegation history (audit retention)
- status CHECK constraint enforces the lifecycle
(queued|dispatched|in_progress|completed|failed|stuck)
- last_heartbeat NULL-able; PR-3 sweeper compares to NOW()
- deadline default now()+6h matches longest-observed legit delegation
(memory-namespace migrations) — protects against forever-heartbeating
wedged agents
- Partial index `idx_delegations_inflight_heartbeat` keeps the sweeper
hot path tiny (only non-terminal rows)
- UNIQUE(caller_id, idempotency_key) WHERE NOT NULL — natural
collision becomes ON CONFLICT no-op without colliding across callers
DelegationLedger.SetStatus enforces forward-only on terminal states
(completed/failed/stuck cannot be revised) as defense-in-depth on the
schema CHECK. Same-status replay is a no-op. Missing-row SetStatus is
a no-op (transient inconsistency the next agent retry will heal).
Heartbeat updates only in-flight rows — terminal-state delegations are
silently skipped.
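The SetStatus rules above reduce to a small transition table. A minimal sketch (status names from the schema CHECK; the `transition`/`verdict` names are hypothetical, not the ledger's API):

```go
package main

// Terminal states per the schema CHECK: once reached, never revised.
var terminal = map[string]bool{"completed": true, "failed": true, "stuck": true}

type verdict int

const (
	apply  verdict = iota // write the new status
	noop                  // same-status replay: retry already recorded it
	reject                // terminal states are frozen (defense-in-depth
	                      // on the schema CHECK)
)

func transition(current, next string) verdict {
	switch {
	case current == next:
		return noop
	case terminal[current]:
		return reject
	default:
		return apply
	}
}

func main() {
	if transition("in_progress", "completed") != apply {
		panic("forward move must apply")
	}
	if transition("completed", "in_progress") != reject {
		panic("terminal states cannot be revised")
	}
	if transition("failed", "failed") != noop {
		panic("same-status replay is a no-op")
	}
}
```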
Coverage:
- 17 unit tests against sqlmock-backed *sql.DB (Insert happy path,
missing-required guards, truncation, lifecycle transitions, terminal
forward-only protection, replay no-op, missing-row no-op, empty-input
rejection, heartbeat semantics, transition table shape)
- Migration roundtrip verified on a real Postgres 15 instance:
up creates the expected schema with all 4 indexes + CHECK, down
drops everything cleanly.
Refs RFC #2829.
Pre-fix, TerminalTab tried to open /ws/terminal/<id> for every workspace,
including external ones (which have no shell endpoint on the
workspace-server). The server returned 404, status flipped to "error",
and the user saw "Connection failed" with a Reconnect button — which
reads as a bug when really the runtime intentionally has no TTY.
Now: when data.runtime is in RUNTIMES_WITHOUT_TERMINAL (currently just
"external"), TerminalTab renders a NotAvailablePanel with a big
terminal-off icon and a one-line explanation including the runtime
name. The xterm + WebSocket dance is skipped entirely — no spurious
404s, no scary error UI, no Reconnect that can't help.
The runtime is determined from the data prop now threaded by
SidePanel.tsx (existing pattern for ChatTab/ConfigTab/etc).
Tests: 4 new in TerminalTab.notAvailable.test.tsx pin: external
renders banner with runtime name, external doesn't open WS, claude-
code mounts normally (regression cover for the early-return scope),
data omitted falls through (back-compat).
Build clean. 1258 tests pass.
Closes #10.
The 2026-05-05 hongming silent-drop incident shipped because the
backends.md parity matrix didn't enforce a "go through the dispatcher"
rule — three handlers (TeamHandler.Expand, OrgHandler.createWorkspaceTree,
workspace_crud.go's stopAndRemove) silently bypassed routing on
SaaS for ~6 months across two distinct verbs.
This doc pass:
- Adds a "How to dispatch" section that's the canonical answer to
"where do I call Start / Stop / Has from?". Names the three
dispatchers (provisionWorkspaceAuto, StopWorkspaceAuto,
HasProvisioner), their fallbacks, and the allowed exceptions.
- Updates the matrix lifecycle rows so every dispatched operation
points at the dispatcher source, not the per-backend bodies.
- Adds Org-import + Team-collapse rows so the bulk paths are visible
to anyone scanning for parity gaps.
- Lists the source-level pins (4 of them) under Enforcement so
future contributors see them as load-bearing tests, not noise.
- Adds a "When you add a NEW dispatch site" section so the next verb
(Pause / Hibernate / Snapshot) lands as a dispatcher mirror, not
as another bespoke handler that drifts from the existing two.
- Refreshes Last audit to 2026-05-05.
No code change; doc-only. The SoT abstractions described here landed
in PRs #2811 + #2824.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #9.
Three pieces, all small:
1. **docs/e2e-coverage.md** — source of truth for which E2E suites
guard which surfaces. Today three were running but informational
only on staging; that's how the org-import silent-drop bug shipped
without a test catching it pre-merge. Now the matrix shows what's
required where + a follow-up note for the two suites that need an
always-emit refactor before they can be required.
2. **tools/branch-protection/apply.sh** — branch protection as code.
Lets `staging` and `main` required-checks live in a reviewable
shell script instead of UI clicks that get lost between admins.
This PR's net change: add `E2E API Smoke Test` and `Canvas tabs E2E`
as required on staging. Both already use the always-emit path-filter
pattern (no-op step emits SUCCESS when the workflow's paths weren't
touched), so making them required can't deadlock unrelated PRs.
3. **branch-protection-drift.yml** — daily cron + drift_check.sh
that compares live protection against apply.sh's desired state.
Catches out-of-band UI edits before they drift further. Fails the
workflow on mismatch; ops re-runs apply.sh or updates the script.
Out of scope (filed as follow-ups):
- e2e-staging-saas + e2e-staging-external use plain `paths:` filters
and never trigger when paths are unchanged. They need refactoring
to the always-emit shape (same as e2e-api / e2e-staging-canvas)
before they can be required.
- main branch protection mirrors staging here; if main wants the
E2E SaaS / External added later, do it in apply.sh and rerun.
Operator must apply once after merge:
bash tools/branch-protection/apply.sh
The drift check picks it up from there.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User feedback (2026-05-04 conversation):
> "Skills and Tools are having their own tab as plugin, and Prompt
> Files are in the file system which can be directly edited. Am I
> missing something?"
> "Tools should be merged into plugin then, and for prompt files... it
> should be in another section than in skill& tools"
The "Skills & Tools" section in ConfigTab had three TagList inputs:
- Skills: managed via the dedicated SkillsTab (per-workspace
skill folders) — duplicate UI affordance
- Tools: managed via the Plugins tab (install a plugin → its
tools become available) — duplicate UI affordance
- Prompt Files: load order for system-prompt files — semantically
unrelated to skills/tools
Drop the Skills + Tools inputs. Move Prompt Files into its own
section with explanatory copy that names the auto-loaded files
(system-prompt.md, CLAUDE.md, AGENTS.md) and points users at the
Files tab for actual editing.
Schema fields `config.skills` and `config.tools` are KEPT (load-bearing
for runtime skill loading + tool registry); only the inline editor goes
away. Operators who need to edit them can still use the Raw YAML toggle.
Tests:
- New ConfigTab.sections.test.tsx with 4 cases:
1. "Skills & Tools" section title is gone
2. Skills tag input is absent
3. Tools tag input is absent
4. Prompt Files section exists with explanatory copy
Sibling ConfigTab tests (hermes, provider) all still pass (20/20).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #2813 (team-collapse) and #2814 (workspace delete).
Two leaks, one class. Both call sites had the same shape pre-fix:
if h.provisioner != nil {
h.provisioner.Stop(ctx, wsID)
}
On SaaS where h.provisioner (Docker) is nil and h.cpProv is set, that
gate evaluates false and the EC2 keeps running. Workspace gets marked
removed in DB; EC2 lives on until the orphan sweeper catches it.
Same drift class as PR #2811's org-import provision bug — a Docker-
only check on what should be a both-backend operation. Confirmed in
production: PR #2811's verification step deleted a test workspace and
the EC2 stayed running until I terminated it manually.
Fix: WorkspaceHandler.StopWorkspaceAuto(ctx, wsID) — symmetric mirror
of provisionWorkspaceAuto. CP first, Docker second, no-op when neither
is wired (a workspace nobody is running can't be stopped — that's a
no-op, not a failure, distinct from provision's mark-failed contract).
Three call-site changes:
- team.go:208 (Collapse) → h.wh.StopWorkspaceAuto(ctx, childID)
- workspace_crud.go:432 (stopAndRemove) → h.StopWorkspaceAuto(...);
RemoveVolume stays Docker-only behind an explicit gate since
CP-managed workspaces have no host-bind volumes
- TeamHandler.provisioner field + NewTeamHandler's *Provisioner param
removed as dead code (Stop was the only call site)
Volume cleanup separation is intentional: the abstraction is "stop
the running workload," not "tear down all state." Callers that need
volume cleanup keep their `if h.provisioner != nil { RemoveVolume }`
gate AFTER the Stop call.
Tests:
- TestStopWorkspaceAuto_RoutesToCPWhenSet — SaaS path
- TestStopWorkspaceAuto_RoutesToDockerWhenOnlyDocker — self-hosted
- TestStopWorkspaceAuto_NoBackendIsNoOp — pins the contract distinction
from provisionWorkspaceAuto's mark-failed
- TestNoCallSiteCallsBareStop — source-level pin against
`.provisioner.Stop(` / `.cpProv.Stop(` outside the dispatcher,
per-backend bodies, restart helper, and the Docker-daemon-direct
short-lived-container path. Strips Go comments before substring
match so archaeology in code comments doesn't trip the gate.
- Verified: pin FAILS against the buggy shape (workspace_crud.go
reversion); team.go reversion compile-fails because the field is
gone — even stronger than the test.
Out of scope (tracked under #2799):
- workspace_restart.go's manual if-cpProv-else dispatch with retry
semantics tuned for the restart hot path. Functionally equivalent
+ wraps cpStopWithRetry, so it's not the bug class this PR closes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`codex-channel-molecule` 0.1.0 is now on PyPI, so operators no longer need
the `git+https://...` URL workaround.
Verified: `pip install codex-channel-molecule` from a clean venv installs
the wheel and the `codex-channel-molecule --help` console script runs.
PyPI: https://pypi.org/project/codex-channel-molecule/0.1.0/
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GET /workspaces/:id/files/config.yaml on hongming.moleculesai.app's
Hermes workspace returned 500 with body:
ssh cat: exit status 1 (Warning: Permanently added '[127.0.0.1]:37951'
(ED25519) to the list of known hosts.)
Root cause: ssh emits the "Permanently added" notice on every fresh
tunnel connection, even with UserKnownHostsFile=/dev/null (that
prevents persistence, not the warning). It lands on stderr, fooling
readFileViaEIC's classifier:
if len(out) == 0 && stderr.Len() == 0 {
return nil, os.ErrNotExist
}
return nil, fmt.Errorf("ssh cat: %w (%s)", runErr, ...)
stderr was non-empty (the warning), so we returned the wrapped error
→ 500 from the HTTP layer instead of 404.
Fix: add `-o LogLevel=ERROR` to BOTH writeFileViaEIC and readFileViaEIC
ssh invocations. Silences info+warning while keeping real auth/tunnel
errors visible (those emit at ERROR level).
Test: TestSSHArgs_LogLevelErrorBothSites pins the flag in both blocks.
Mutation-tested: stripping the flag from one site fails the gate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bug: the case statement at line 189 grouped completed/failure |
completed/cancelled | completed/timed_out into the same "abort
+ exit 1" branch. cancelled ≠ failure — when per-SHA concurrency
(memory: feedback_concurrency_group_per_sha) cancels an older E2E
run because a newer push landed, the workflow blocked the whole
auto-promote chain on a non-failure.
Caught 2026-05-05 02:03 on sha 31f9a5e: E2E got cancelled by
concurrency, auto-promote :latest aborted with exit 1, the next
auto-promote-staging cycle had to manually clean up.
Split: failure/timed_out keep the abort path. cancelled gets its
own clean-defer branch (same shape as in_progress) — proceed=false
without exit 1, with a step-summary explaining likely concurrency
supersession and pointing operators at manual dispatch if they
need that specific SHA promoted.
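The post-split decision table, sketched in Go for clarity (the workflow itself is a bash case statement; names here are illustrative):

```go
package main

// decision: proceed promotes; abort exits 1; neither means a clean
// defer (proceed=false, no failure) with a step-summary note.
type decision struct {
	proceed bool
	abort   bool
}

func classifyRun(status, conclusion string) decision {
	if status != "completed" {
		return decision{} // in_progress etc.: clean defer
	}
	switch conclusion {
	case "success":
		return decision{proceed: true}
	case "failure", "timed_out":
		return decision{abort: true} // real failures keep the abort path
	case "cancelled":
		return decision{} // superseded by a newer push: defer, don't fail
	default:
		return decision{abort: true}
	}
}

func main() {
	if d := classifyRun("completed", "cancelled"); d.abort {
		panic("cancelled must not abort the promote chain")
	}
	if d := classifyRun("completed", "failure"); !d.abort {
		panic("failure must keep the abort path")
	}
}
```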
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-review of PR #2810 caught a regression: my mass-fix added
`2>/dev/null` to every curl invocation, suppressing stderr. The
original `|| echo "000"` shape only swallowed exit codes — stderr
(curl's `-sS`-shown dial errors, timeouts, DNS failures) still went
to the runner log so operators could see WHY a connection failed.
After PR #2810 the next deploy failure would log only the bare
HTTP code with no context. That's exactly the kind of diagnostic
loss that makes outages take longer to triage.
Drop `2>/dev/null` from each curl line — keep it on the `cat`
fallback (which legitimately suppresses "no such file" when curl
crashed before -w ran). The `>tempfile` redirect alone captures
curl's stdout (where -w writes) without touching stderr.
Same 8 files as #2810: redeploy-tenants-on-{main,staging},
sweep-stale-e2e-orgs, e2e-staging-{sanity,saas,external,canvas},
canary-staging.
Tests:
- All 8 files pass the lint
- YAML valid
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SSOT pass — replace 4 bare `h.provisioner == nil && h.cpProv == nil`
checks with `!h.HasProvisioner()`. When a third backend lands (k8s,
containerd, whatever), HasProvisioner gets one new field; bare both-nil
checks would each need to be hunted and updated.
Sites:
- a2a_proxy_helpers.go:166 — maybeMarkContainerDead skip-no-backend
- workspace_restart.go:118 — Restart endpoint guard
- workspace_restart.go:363 — RestartByID coalescer guard
- workspace_restart.go:660 — Resume endpoint guard
Adds TestNoBareBothNilCheck (source-level) so the antipattern can't
slip back in.
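The SSOT itself is a one-liner — sketched here with simplified types so the "one new field" claim is concrete:

```go
package main

type dockerProvisioner struct{}
type cpProvisioner struct{}

// WorkspaceHandler sketch: concrete types stand in for the real ones.
type WorkspaceHandler struct {
	provisioner *dockerProvisioner
	cpProv      *cpProvisioner
}

// HasProvisioner is the single place that answers "is any backend
// wired". A third backend (k8s, containerd, ...) means one new clause
// here instead of hunting down every bare both-nil comparison.
func (h *WorkspaceHandler) HasProvisioner() bool {
	return h.provisioner != nil || h.cpProv != nil
}

func main() {
	saas := &WorkspaceHandler{cpProv: &cpProvisioner{}}
	if !saas.HasProvisioner() { // the blind spot bare Docker-only checks missed
		panic("CP-only must count as provisioned")
	}
	if (&WorkspaceHandler{}).HasProvisioner() {
		panic("neither wired must gate out")
	}
}
```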
Out of scope but discovered during the audit (filed separately):
- team.go:207 — team-collapse Stop is Docker-only, leaks EC2 on SaaS
- workspace_crud.go:423 — workspace delete cleanup is Docker-only,
leaks EC2 on SaaS
Both need a StopWorkspaceAuto mirror of provisionWorkspaceAuto. Same
class of bug as today's org-import incident, different verb (stop vs
provision).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two changes that close the silent-drop bug class:
1. Add WorkspaceHandler.HasProvisioner() and use it as the org-import
gate. Pre-fix, org_import.go:178 read `h.provisioner != nil` (Docker-
only) — on SaaS tenants where cpProv is wired but Docker is nil, the
entire 220-line provisioning prep block was skipped. The Auto call
PR #2798 added at line 395 was unreachable on SaaS.
Repro: 2026-05-05 01:14 — hongming prod tenant, 7-workspace org
import, every workspace sat in 'provisioning' for 10 min until the
sweeper marked it failed with the misleading "container started but
never called /registry/register".
2. provisionWorkspaceAuto self-marks-failed on the no-backend path.
Defense in depth: even if a future caller bypasses HasProvisioner
gating or ignores the bool return (TeamHandler pre-#2367 did exactly
this), the workspace ends in a clean failed state with an actionable
error message instead of lingering until the 10-min sweep.
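The combined contract — routing plus the no-backend failure path in one place — can be sketched as below. This is illustrative only: the `handler` struct, bool fields, and `markFailed` map stand in for the real dispatcher and its UPDATE:

```go
package main

import "fmt"

type handler struct {
	cpWired, dockerWired bool
	failed               map[string]string // wsID -> last_sample_error
}

func (h *handler) markFailed(wsID, msg string) { h.failed[wsID] = msg }

// provisionWorkspaceAuto: even a caller that ignores the bool return
// leaves the workspace in a clean failed state with an actionable
// message, instead of lingering until the 10-min sweep.
func (h *handler) provisionWorkspaceAuto(wsID string) bool {
	switch {
	case h.cpWired:
		// dispatch to the CP provisioning goroutine
		return true
	case h.dockerWired:
		// dispatch to the Docker provisioning goroutine
		return true
	default:
		h.markFailed(wsID, fmt.Sprintf("no provisioning backend wired for %s", wsID))
		return false
	}
}

func main() {
	h := &handler{failed: map[string]string{}}
	if h.provisionWorkspaceAuto("ws-1") {
		panic("no backend must return false")
	}
	if h.failed["ws-1"] == "" {
		panic("no-backend path must self-mark failed with a message")
	}
}
```

Note the deliberate asymmetry with StopWorkspaceAuto: stop's no-backend case is a benign no-op, provision's is a failure that must be recorded.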
Auto becomes the single source of truth for "start a workspace" —
routing AND the no-backend failure path. Create's redundant
if-not-Auto-then-mark-failed block collapses (kept only the
workspace_config UPSERT, which is a Create-specific UI concern for
rendering runtime/model on the Config tab).
Tests:
- TestProvisionWorkspaceAuto_NoBackendMarksFailed pins the new contract
- TestHasProvisioner_TrueOnCPOnly catches the SaaS-only blind spot
- TestHasProvisioner_TrueOnDockerOnly preserves self-hosted shape
- TestHasProvisioner_FalseWhenNeitherWired pins the gate-out path
- TestOrgImportGate_UsesHasProvisionerNotBareField source-pins the gate
(verified: FAILS against the buggy `h.provisioner != nil` shape, PASSES
with `h.workspace.HasProvisioner()`)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 2026-05-04 redeploy-tenants-on-main run for sha 2b862f6 emitted
"HTTP 000000" and failed the deploy. Root cause: when curl exits non-
zero (connection reset → 56, --fail-with-body 4xx/5xx → 22), the
`-w '%{http_code}'` already wrote a status to stdout; the inline
`|| echo "000"` then fires AND appends another "000" to the captured
substitution stdout. Result: HTTP_CODE="<actual><000>" — fails string
comparisons against "200" while looking superficially right.
Same class of bug the synth-E2E §7c gate hit twice (PRs #2779/#2783
+ #2797). Memory feedback_curl_status_capture_pollution.md.
Mass fix in 8 workflows: route -w into a tempfile so curl's exit
code can't pollute stdout. Wrap with set +e/-e so the non-zero
curl exit doesn't trip the outer pipeline.
redeploy-tenants-on-main.yml (production-critical, caught the bug)
redeploy-tenants-on-staging.yml (sibling)
sweep-stale-e2e-orgs.yml (cleanup loop)
e2e-staging-sanity.yml (E2E safety-net teardown)
e2e-staging-saas.yml
e2e-staging-external.yml
e2e-staging-canvas.yml
canary-staging.yml
Plus a new lint workflow `lint-curl-status-capture.yml` that runs on
every PR/push touching `.github/workflows/**`. Multi-line aware:
collapses bash `\` continuations, then matches the buggy
$(curl ... -w '%{http_code}' ... || echo "000") subshell shape.
Distinguishes from the SAFE $(cat tempfile || echo "000") shape
(cat with missing file emits empty stdout, no pollution).
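The lint's matching core, sketched in Go (pattern and function name illustrative, not the shipped workflow): collapse backslash-newline continuations first so the subshell is visible on one line, then flag the curl shape while letting the cat fallback through.

```go
package main

import (
	"regexp"
	"strings"
)

// buggy matches $(curl ... %{http_code} ... || echo "000"): curl's
// -w output shares stdout with the fallback echo, so the captured
// value can become "<actual>000".
var buggy = regexp.MustCompile(`\$\(curl[^)]*%\{http_code\}[^)]*\|\|\s*echo "000"\)`)

func flagsCurlPollution(script string) bool {
	// Multi-line aware: join bash `\` line continuations before matching.
	collapsed := strings.ReplaceAll(script, "\\\n", " ")
	return buggy.MatchString(collapsed)
}

func main() {
	bad := "CODE=$(curl -sS \\\n  -w '%{http_code}' \"$URL\" || echo \"000\")"
	safe := "CODE=$(cat \"$TMP\" || echo \"000\")" // cat on a missing file emits empty stdout
	if !flagsCurlPollution(bad) {
		panic("multi-line buggy shape must be caught")
	}
	if flagsCurlPollution(safe) {
		panic("cat fallback is safe and must pass")
	}
}
```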
Verified:
- All 8 workflows pass the lint locally
- A known-bad injection is caught
- A known-safe cat-fallback passes through
- yaml.safe_load clean on all changed files
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the pattern hermes-channel-molecule uses (line 256). Drops
the broken `pip install codex-channel-molecule` which would 404.
PyPI publish workflow is a separate piece of work — until then,
git+https install is the path operators get.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The codex tab in the External Connect modal had an "outbound-tools-only
first cut" caveat — operators got the MCP wiring for codex calling
platform tools, but there was no documented inbound path. Canvas
messages couldn't wake an idle codex session.
That gap is now filled by codex-channel-molecule
(github.com/Molecule-AI/codex-channel-molecule), shipped today as the
codex counterpart to hermes-channel-molecule. The daemon long-polls
the platform inbox, runs `codex exec --resume <session>` per inbound
message, captures the assistant reply, routes it back via
send_message_to_user / delegate_task, and acks the inbox row.
Per-thread session continuity persisted to disk so daemon restarts
don't lose conversation context.
This commit:
- Updates externalCodexTemplate to include `pip install
codex-channel-molecule` (step 1) and a foreground `nohup
codex-channel-molecule` invocation (step 3) using the same env-var
contract as the MCP server (WORKSPACE_ID + PLATFORM_URL +
MOLECULE_WORKSPACE_TOKEN).
- Adds a "Canvas messages don't wake codex" common-issues entry to the
TAB_HELP codex section pointing at the bridge daemon log.
- Updates the doc comment to record the upstream deprecation path:
when openai/codex#17543 lands, the bridge becomes redundant and the
wired MCP server delivers push natively.
Verified: TestExternalTemplates_NoMoleculeOrgIDPlaceholder still
passes (no MOLECULE_ORG_ID re-introduction); full handlers suite
green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex / openclaw / hermes-channel snippets each instructed operators
to set `MOLECULE_ORG_ID = "<your org id>"`. The molecule_runtime MCP
subprocess these snippets spawn never reads MOLECULE_ORG_ID — that
env var is consumed only by workspace-server's TenantGuard
middleware, server-side, on the tenant box itself (set by the control
plane via user-data on provision).
External operator → tenant calls pass TenantGuard via the
isSameOriginCanvas path (Origin matches Host), with auth via Bearer
token + X-Workspace-ID. The universal_mcp snippet — which calls into
the same molecule_runtime — has always (correctly) omitted
MOLECULE_ORG_ID; this brings codex / openclaw / hermes-channel into
line.
Symptom that caught it: an external codex CLI session, after pasting
the codex-tab snippet, surfaced "MOLECULE_ORG_ID is still set to
'<your org id>'" as an unresolved blocker — the agent reasonably treated
the placeholder as required setup. The operator has no value to fill in.
Pinned with a structural test
(TestExternalTemplates_NoMoleculeOrgIDPlaceholder) so the placeholder
can't drift back across all six external-tab templates.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MEMORY_V2_CUTOVER=true gates the admin export/import path on the v2
plugin, but the cutoverActive() check in admin_memories.go silently
returns false when the plugin isn't wired:
func (h *AdminMemoriesHandler) cutoverActive() bool {
if os.Getenv(envMemoryV2Cutover) != "true" {
return false
}
return h.plugin != nil && h.resolver != nil
}
Two operator misconfigs hit the silent-fallback path:
1. MEMORY_V2_CUTOVER=true set, MEMORY_PLUGIN_URL unset
→ wiring.Build returns nil → handler stays on legacy SQL path
→ operator sees no error, assumes cutover is live, but every
request still writes the legacy table.
2. MEMORY_V2_CUTOVER=true set, MEMORY_PLUGIN_URL set, but plugin
unreachable at boot
→ wiring.Build still returns the bundle (intentional — circuit
breaker handles ongoing unavailability), but every cutover
write quietly falls back via the breaker.
→ only signal: legacy table keeps growing.
Both are exactly the "structurally invisible until prod" failure
mode; the only real-world detection today is "notice the legacy
table is still being written to," which no operator will check.
Add loud, distinctive WARN log lines at Build() time for both
shapes. Boot logs are operator-visible, so a half-config is
immediately obvious without needing dashboards.
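The two warning shapes can be sketched as a pure function over the config (message text and the `buildWarnings` helper are illustrative, not the shipped wiring.Build):

```go
package main

import "strings"

// buildWarnings returns the boot-time WARN lines for the two
// half-config shapes; an empty slice is the happy (or fully-off) path.
func buildWarnings(cutover bool, pluginURL string, probeOK bool) []string {
	var warns []string
	if cutover && pluginURL == "" {
		warns = append(warns,
			"WARN memory-v2: MEMORY_V2_CUTOVER=true but MEMORY_PLUGIN_URL unset — cutover INACTIVE, legacy SQL path in use")
	}
	if cutover && pluginURL != "" && !probeOK {
		warns = append(warns,
			"WARN memory-v2: plugin unreachable at boot — cutover writes will fall back via the circuit breaker")
	}
	return warns
}

func main() {
	if len(buildWarnings(true, "", true)) != 1 {
		panic("cutover without URL must warn")
	}
	if len(buildWarnings(false, "", true)) != 0 {
		panic("neither set must stay silent")
	}
	w := buildWarnings(true, "http://plugin:8080", false)
	if len(w) != 1 || !strings.Contains(w[0], "circuit breaker") {
		panic("probe failure under cutover must warn loudly")
	}
}
```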
Tests:
- 4 new (cutover+no-URL → warn, neither set → silent, cutover+probe-
fail → loud warn, probe-fail-without-cutover → quiet generic)
- 6 existing (still pass; pin no-warning-on-happy-path)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Org-import called h.workspace.provisionWorkspace directly — same
silent-drop bug that bit TeamHandler.Expand on 2026-05-04 (see the
workspace.go:121-125 comment + #2486). Symptom on SaaS: every
claude-code workspace
sat in "provisioning" until the 600s sweeper marked it failed with
"container started but never called /registry/register" — because no
container ever existed; the goroutine returned silently when the Docker
provisioner field was nil.
User reproduced 2026-05-04 ~22:30Z importing a 7-workspace template on
the hongming prod tenant. Tenant CP logs (queried live via SSM) showed
ZERO "Provisioner: goroutine entered" or "CPProvisioner: goroutine
entered" lines for any of the 7 failed workspace UUIDs in the 60min
window — confirming the goroutine never ran past line 384 of
org_import.go because provisionWorkspace returned early in SaaS mode.
The fix is one line: replace h.workspace.provisionWorkspace with
h.workspace.provisionWorkspaceAuto. Auto is the single source of
truth for backend selection (workspace.go:130) — picks CP-mode when
h.cpProv is wired, Docker-mode when h.provisioner is wired, returns
false when neither.
ALSO adds a generic source-level gate
(TestNoCallSiteCallsDirectProvisionerExceptAuto) so the next future
caller can't repeat the pattern. Walks every non-test .go file in
handlers/ and fails if any direct call to provisionWorkspace( or
provisionWorkspaceCP( appears outside the dispatcher's own definition
file.
The gate currently allows workspace_restart.go, which has its own
manual if-h.cpProv-else dispatch (functionally equivalent to Auto, so
not the bug class — but it is architectural duplication; a follow-up
is filed for proper de-dup).
Test plan:
- TestOrgImport_UsesAutoNotDirectDockerPath: pin the org_import.go
call site
- TestNoCallSiteCallsDirectProvisionerExceptAuto: generic gate against
future drift
- TestTeamExpand_UsesAutoNotDirectDockerPath (existing): symmetric for
team.go
All 3 + the rest of the handler suite pass.
Closes #2486
Pairs with: PR #2794 (configurable provision concurrency) which made
it possible to bisect concurrency-vs-routing as the cause
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The §9c "Memory KV Edit round-trip" gate (added in #2787) captured the
expected-409 status code via:
$(tenant_call ... -w "%{http_code}" || echo "000")
tenant_call uses CURL_COMMON which carries --fail-with-body. On the
expected 409, curl exits 22; the `|| echo "000"` then fires and
appends "000" to the captured stdout — yielding "409000" instead of
"409", failing the gate even though the contract was satisfied.
Caught on PR #2792's first E2E run (status got "409000"). Has been
silently failing the staging-SaaS E2E since #2787 merged earlier
today; nothing else surfaced it because the workflow is informational,
not required.
Fix: route -w into its own tempfile so curl's exit code can't pollute
the captured stdout. Wrap with set +e/-e so the 22 doesn't trip the
outer pipeline. Same shape as the §7c gate fix that PR #2779/#2783
landed for the same class of bug.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>