Fixes 14 of the 18 failing tests that have been reddening Platform (Go)
CI on main since the 2026-04-18 open-source restructure and the
2026-04-21 SSRF backport. Reduces the handlers package failure count
18 → 4 (the remaining 4 are unrelated schema/behavior drift — see
follow-ups).
Three root causes fixed:
1. httptest.NewServer binds to 127.0.0.1; isSafeURL rejects loopback.
Tests that stub workspace URLs via httptest therefore 502'd at
the SSRF guard before reaching the handler logic they wanted to
exercise.
Fix: add `testAllowLoopback` var to ssrf.go + `allowLoopbackForTest(t)`
helper in handlers_test.go. Only 127.0.0.0/8 and ::1 are relaxed;
169.254 metadata, RFC-1918, TEST-NET, CGNAT, and link-local
protections remain active. Flag is paired with t.Cleanup and is
never touched by production code.
2. ProxyA2A's checkWorkspaceBudget query (SELECT budget_limit,
COALESCE(monthly_spend, 0) FROM workspaces WHERE id = $1) was added
with the restructure, but the a2a_proxy_test.go sqlmock expectations
never caught up, producing "call to Query ... was not expected" on
every ProxyA2A-exercising test.
Fix: `expectBudgetCheck(mock, workspaceID)` helper that registers
an empty-rows expectation (checkWorkspaceBudget fails-open on
sql.ErrNoRows, so an empty result = "no budget limit"). Added to
each of the 8 affected TestProxyA2A_* tests in the correct
position relative to access-control + activity-log expectations.
3. TestAdminMemories_Import_Success + _RedactsSecretsBeforeDedup
mocked a 5-arg INSERT when the handler actually issues a 4-arg
INSERT (workspace_id, content, scope, namespace) unless the
payload carries a created_at override. Removed the spurious 5th
AnyArg from both tests; _PreservesCreatedAt is untouched since it
legitimately uses the 5-arg form.
Also: TestResolveAgentURL_CacheHit and _CacheMissDBHit used bogus
`cached.example` / `dbhit.example` hostnames that fail DNS resolution
inside isSafeURL (which happens BEFORE the loopback check). Swapped to
`127.0.0.1` variants preserving test intent (they never hit the network).
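The loopback relaxation from item 1 can be sketched minimally. Names are illustrative, not the real ssrf.go, and the real guard covers more ranges (TEST-NET, CGNAT, RFC-1918 explicitly); the shape to note is that the test flag only widens the loopback branch:

```go
package main

import (
	"fmt"
	"net"
)

// testAllowLoopback mirrors the flag described above; production code
// never sets it.
var testAllowLoopback bool

func blockedForSSRF(ipStr string) bool {
	ip := net.ParseIP(ipStr)
	if ip == nil {
		return true // unparseable → reject
	}
	if ip.IsLoopback() {
		return !testAllowLoopback // 127.0.0.0/8 and ::1 relax only under test
	}
	// These protections stay active even when the test flag is set.
	if ip.IsPrivate() || ip.IsLinkLocalUnicast() || ip.IsLinkLocalMulticast() {
		return true
	}
	return false
}

func main() {
	testAllowLoopback = true
	fmt.Println(blockedForSSRF("127.0.0.1"))       // false: allowed under test
	fmt.Println(blockedForSSRF("169.254.169.254")) // true: metadata stays blocked
}
```

Pairing the flag flip with t.Cleanup (as the commit does) guarantees the relaxation never leaks between tests.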
Remaining 4 failures — out of scope for this PR, tracked separately:
- TestGitHubToken_NoTokenProvider (handler behavior drift — 500 vs 404)
- TestWorkspaceList + TestWorkspaceList_WithData (Scan arg count —
workspaces table gained a column, mock not updated)
- TestRegister_ProvisionerURLPreserved (request body shape drift)
Closes the 4 wrong-target PRs (#1710, #1718, #1719, #1664) that all
tried to silence the symptom by disabling golangci-lint — which has
`continue-on-error: true` in ci.yml and was never the actual blocker.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause of the hermes 401 "Invalid API key" on SaaS workspaces:
1. CreateWorkspaceDialog never sent `model` in the /workspaces POST
2. Tenant/CP plumbed through a valid (provider, API key) but empty MODEL
3. Workspace install.sh ran with HERMES_DEFAULT_MODEL unset
4. derive-provider.sh saw no slug → PROVIDER="auto"
5. Hermes fell back to its compiled-in default (Anthropic via
OpenAI-compat adapter)
6. User's MINIMAX_API_KEY was present but irrelevant — hermes tried
Anthropic with it → 401
Fix:
- Extend HERMES_PROVIDERS with `defaultModel` + `models` (suggestion
list). Each provider ships with a known-good default so the trap
is physically impossible to hit with the new form.
- Add a required Model input to the Hermes panel, auto-populated
from the provider's defaultModel when the provider changes (only
if the user hasn't typed their own slug yet).
- Datalist surfaces additional model suggestions per provider so
users can pick a different size (e.g. M2.7-highspeed) without
typing the whole slug.
- handleCreate validates hermesModel is non-empty, sends as `model`
in the POST body alongside the secrets block.
- useEffect guard avoids clobbering a user-typed custom slug when
they toggle providers back and forth.
Existing 19 a11y tests still pass (non-SaaS path unchanged, four-tier
picker still renders, arrow-key nav still wraps).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Symptom (prod, hongmingwang tenant, 2026-04-22):
PUT /workspaces/:id/files/config.yaml → 500
{"error":"failed to write file: docker not available"}
Root cause: WriteFile + ReplaceFiles always reached for the tenant's
Docker client, but SaaS workspaces run as EC2 VMs (no Docker on the
tenant to cp into). There was no SaaS code path, so Save/Save&Restart
in the Config tab silently 500'd for every SaaS user.
Fix: add writeFileViaEIC — same ephemeral-keypair + EIC-tunnel dance
that the Terminal tab already uses (terminal.go). Flow:
1. ssh-keygen ephemeral ed25519 pair
2. aws ec2-instance-connect send-ssh-public-key (60s validity)
3. aws ec2-instance-connect open-tunnel (TLS → :22)
4. ssh ... "install -D -m 0644 /dev/stdin <abs path>"
install -D creates missing parent dirs atomically
5. Kill tunnel + wipe keydir
Runtime → base-path map (new table workspaceFilePathPrefix):
hermes → /home/ubuntu/.hermes
langgraph → /opt/configs
external → /opt/configs
unknown → /opt/configs
Both WriteFile (single file) and ReplaceFiles (bulk) detect
`workspaces.instance_id != ''` and route to EIC instead of Docker.
Local/self-hosted Docker path is unchanged.
Security: the only variable piece in the remote ssh command is the
absolute path, which is built via map lookup + filepath.Clean so
traversal is blocked. shellQuote() wraps it as defence-in-depth.
validateRelPath rejects absolute paths and surviving `..` segments
up-front; tests assert traversal rejection.
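Assuming the map-lookup + filepath.Clean shape described above, the path construction might look like this (names illustrative; the real table and validateRelPath live in the handler and may differ in detail):

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// Illustrative version of the runtime → base-path table.
var workspaceFilePathPrefix = map[string]string{
	"hermes":    "/home/ubuntu/.hermes",
	"langgraph": "/opt/configs",
	"external":  "/opt/configs",
}

func remoteAbsPath(runtime, rel string) (string, error) {
	// Reject absolute paths and traversal before any joining.
	if filepath.IsAbs(rel) {
		return "", fmt.Errorf("absolute path rejected: %q", rel)
	}
	clean := filepath.Clean(rel)
	if clean == ".." || strings.HasPrefix(clean, "../") {
		return "", fmt.Errorf("traversal rejected: %q", rel)
	}
	base, ok := workspaceFilePathPrefix[runtime]
	if !ok {
		base = "/opt/configs" // unknown runtimes share the default prefix
	}
	return filepath.Join(base, clean), nil
}

func main() {
	p, _ := remoteAbsPath("hermes", "config.yaml")
	fmt.Println(p) // /home/ubuntu/.hermes/config.yaml
}
```

Note that Clean normalizes embedded `a/../..` sequences before the `..` check, so traversal can't hide behind intermediate segments; shell quoting then sits on top as defence-in-depth.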
Follow-ups tracked separately:
- Reload hook after save (hermes gateway restart via SSH)
- Per-tunnel batching for ReplaceFiles with many files
- Runtime-specific base paths should be declared in the runtime
manifest, not hardcoded in the handler
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Following feedback that T4 — not T3 — is the full-access tier:
- Non-SaaS picker now shows all four tiers: T1 Sandboxed, T2 Standard,
T3 Privileged, T4 Full Access. Four-column grid.
- SaaS picker stays single-option but now locks to T4 (was T3). Every
SaaS workspace gets a dedicated EC2 VM, which is unambiguously the
"full host" case — T3 (privileged container) was a category mismatch.
- Default tier on SaaS is 4 (was 3). CP provisioner already supports
tier 4 (t3.large / 80 GB). TIER_CONFIG already has T4's amber color.
Tests updated for the four-tier picker: wrap tests now go T4 ↔ T1, and
the selection/tabIndex tests cover the fourth button.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
These files have been in public monorepo docs/ since the open-source
restructure on 2026-04-18, but are operational (outreach targets,
analytics tracking IDs, staged unpublished social copy) or strategic
(launch plans, SEO briefs, keyword targets, competitive research).
Per the internal documentation policy (2026-04-22), they belong in
the private internal repo. Pair PR: internal#27 receives the files.
Removed:
- docs/marketing/campaigns/* — 6 campaign packs with outreach + analytics
- docs/marketing/plans/phase-30-launch-plan.md — draft launch plan
- docs/marketing/briefs/* — 2 SEO content briefs
- docs/marketing/seo/keywords.md — keyword strategy
- docs/research/cognee-*.md — 2 architecture + isolation evals
What stays public:
- docs/marketing/blog/ — published blog posts
- docs/marketing/devrel/demos/ — dev-facing demo scripts + video
- docs/marketing/discord-adapter-day2/ — already-posted community copy
No external references to update — cross-references among these files
are now intact inside the internal repo; no public CLAUDE.md / README /
PLAN / docs/README referenced the moved paths.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Symptom (prod tenant hongmingwang):
GET /org/tokens → 500
orgtoken list: orgtoken: list: pq: invalid input syntax for type uuid: ""
Postgres rejects COALESCE(uuid_col, '') because it can't cast the
empty string to UUID. Cast to ::text first so the COALESCE operates
on matching types. OrgID on the Go side is already string, so no
scan changes needed.
sqlmock doesn't exercise pq type coercion — it accepts any AddRow
value for any column — which is why the existing tests pass while
prod 500s. Real-Postgres integration coverage is the systemic fix
(tracked separately), but this PR unblocks the Settings → Org Tokens
page today.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Symptom (prod tenant hongmingwang, 2026-04-22):
cp provisioner: console: unexpected 401
GET /workspaces/:id/console → 502 (View Logs broken)
Root cause: the tenant's CPProvisioner.authHeaders sent the
provision-gate shared secret as the Authorization bearer for every
outbound CP call, including /cp/admin/workspaces/:id/console. But CP
gates /cp/admin/* with CP_ADMIN_API_TOKEN — a distinct secret so a
compromised tenant's provision credentials can't read other tenants'
serial console output. Bearer mismatch → 401.
Fix: split authHeaders into two methods —
- provisionAuthHeaders(): Authorization: Bearer <MOLECULE_CP_SHARED_SECRET>
for /cp/workspaces/* (Start, Stop, IsRunning)
- adminAuthHeaders(): Authorization: Bearer <CP_ADMIN_API_TOKEN>
for /cp/admin/* (GetConsoleOutput and future admin reads)
Both still send X-Molecule-Admin-Token for per-tenant identity. When
CP_ADMIN_API_TOKEN is unset (dev / self-hosted single-secret setups),
cpAdminAPIKey falls back to sharedSecret so nothing regresses.
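The split might look roughly like this — field and struct shape are assumptions, but the two-method contract and the empty-token fallback follow the description above:

```go
package main

import (
	"fmt"
	"net/http"
)

// Hypothetical shape of the provisioner client; only the header
// contract below is taken from the commit description.
type cpProvisioner struct {
	sharedSecret  string // MOLECULE_CP_SHARED_SECRET
	cpAdminAPIKey string // CP_ADMIN_API_TOKEN (may be empty in dev)
	tenantToken   string // per-tenant X-Molecule-Admin-Token
}

// provisionAuthHeaders: for /cp/workspaces/* (Start, Stop, IsRunning).
func (p *cpProvisioner) provisionAuthHeaders() http.Header {
	h := http.Header{}
	h.Set("Authorization", "Bearer "+p.sharedSecret)
	h.Set("X-Molecule-Admin-Token", p.tenantToken)
	return h
}

// adminAuthHeaders: for /cp/admin/* (GetConsoleOutput and friends).
func (p *cpProvisioner) adminAuthHeaders() http.Header {
	key := p.cpAdminAPIKey
	if key == "" {
		key = p.sharedSecret // single-secret dev/self-hosted fallback
	}
	h := http.Header{}
	h.Set("Authorization", "Bearer "+key)
	h.Set("X-Molecule-Admin-Token", p.tenantToken)
	return h
}

func main() {
	p := &cpProvisioner{sharedSecret: "s1", cpAdminAPIKey: "a1", tenantToken: "t1"}
	fmt.Println(p.adminAuthHeaders().Get("Authorization")) // Bearer a1
}
```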
Rollout requirement: the tenant EC2 needs CP_ADMIN_API_TOKEN in its
env — this PR wires up the code, but CP's tenant-provision path must
inject the value. Filed as follow-up; until then, operators can set
it manually on existing tenants.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
On SaaS every workspace gets its own EC2 VM — the Docker-sandbox
distinction between T1 (sandboxed), T2 (standard Docker), and T3
(full host access) doesn't apply. A SaaS workspace is always a
dedicated VM, which is "full access" by construction. Showing T1/T2
in that UI is a category error: users pick a sandbox level that has
no effect on the actual EC2 machine they get.
Changes:
- tenant.ts: export isSaaSTenant() — returns true when canvas is
served at <slug>.moleculesai.app (SSR-safe: false on server)
- CreateWorkspaceDialog: when isSaaSTenant(), render only the T3
option, default tier=3, grid collapses to a single column. Label
gets a " — dedicated VM" hint so the user knows what they're
getting. On self-hosted the full T1/T2/T3 picker is unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Workspaces on SaaS register with their VPC-private IP (172.31.x.x on AWS
default VPCs). The SSRF guard in ssrf.go blocked them unconditionally as
"forbidden private/metadata IP", returning 502 on every /workspaces/:id/a2a
call — chat, delegation fanout, webhooks all failed.
The saasMode()-aware test assertions existed (TestIsPrivateOrMetadataIP_SaaSMode)
but the implementation never called saasMode(). Wire it up. In SaaS:
- RFC-1918 (10/8, 172.16/12, 192.168/16) and IPv6 ULA fd00::/8 are allowed
- 169.254/16 metadata, TEST-NET, 100.64/10 CGNAT, loopback, link-local
stay blocked in every mode
Also hardens IPv6: link-local multicast and interface-local multicast
are now rejected; DNS-resolved v6 addrs are checked too.
Symptom log (prod tenant hongmingwang):
ProxyA2A: unsafe URL for workspace a8af9d79-...: forbidden private/metadata
IP: 172.31.47.119
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Add-Key form used to open with a required Service dropdown
(GitHub / Anthropic / OpenRouter / Other) that gated everything
else. The dropdown did no persistent work — the secret store only
cares about (key_name, value); the Service label was never saved
anywhere. It also suffered registry drift: today we support ~22
hermes-dispatched providers (MiniMax, Gemini, DeepSeek, Kimi, Qwen,
NVIDIA, etc.); only 3 had entries. Everything else landed in "Other",
so the dropdown added nothing but a mandatory click.
Replaces it with:
1. Key-name <datalist> autocomplete sourced from new
KEY_NAME_SUGGESTIONS in lib/services.ts — 26 entries covering
common infra keys + every hermes-supported provider.
2. inferGroup(keyName) derives classification at render time,
matching what the store already does in getGrouped(). No
behaviour change for list grouping.
3. Provider docs link renders inline only when inferGroup
recognises the name. For 'custom' keys we stay quiet — no
false-structure prompt.
4. Test-connection button still available when the inferred group
supports it AND the value is format-valid. Same providers as
before.
SERVICES registry preserved for LIST rendering + test routing.
Result: two fields instead of three. One fewer decision.
Provider-agnostic by design — new providers work the moment someone
types their canonical env var name; no UI code change per provider.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tenant's workspace provisioner now forwards payload.Model (set by
canvas Config tab when a user picks a model) through to the
workspace's runtime env as HERMES_DEFAULT_MODEL, so install.sh /
start.sh in the template can seed the right ~/.hermes/config.yaml
without any post-provision manual step.
Helper applyRuntimeModelEnv() is runtime-switched so each template
owns its own env contract — hermes uses HERMES_DEFAULT_MODEL, future
runtimes with different config schemas register their own cases.
Runtimes that read model from /configs/config.yaml instead (langgraph,
claude-code, deepagents) are unaffected: the switch has no case for
them, so this is a no-op in those paths.
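A sketch of the runtime-switched helper, assuming an env-slice shape (the real provisioner's env plumbing may differ):

```go
package main

import "fmt"

// Illustrative version of applyRuntimeModelEnv: each runtime owns its
// env contract; runtimes that read model from /configs/config.yaml
// (langgraph, claude-code, deepagents) have no case, so it's a no-op.
func applyRuntimeModelEnv(runtime, model string, env []string) []string {
	if model == "" {
		return env
	}
	switch runtime {
	case "hermes":
		return append(env, "HERMES_DEFAULT_MODEL="+model)
	default:
		return env
	}
}

func main() {
	env := applyRuntimeModelEnv("hermes", "minimax/MiniMax-M2.7-highspeed", nil)
	fmt.Println(env)
}
```

Calling it from both the Docker and CP provision paths (as the commit does) is what keeps local dev and production behaving identically.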
Applied in both the Docker provisioner path (provisionWorkspaceOpts)
and the SaaS/CP path (provisionWorkspaceCP) so local dev and
production behave identically.
Combined with:
- molecule-controlplane#231 (/opt/adapter/install.sh hook)
- molecule-ai-workspace-template-hermes#8 (install.sh for bare-host)
- molecule-ai-workspace-template-hermes#9 (derive-provider.sh)
this completes the MVP flow: customer creates a hermes workspace
in canvas with model = minimax/MiniMax-M2.7-highspeed + secret
MINIMAX_API_KEY = sk-cp-…, clicks Save, workspace provisions with
the MiniMax Token Plan hermes-agent gateway up and ready for the
first chat — no ops touch.
Foundation this builds on:
- env injection works for every runtime
- secret passthrough is generic (already via workspace_secrets)
- per-runtime env-var contract encoded once (applyRuntimeModelEnv)
- canvas Save button for later-edit remains a Files-API-over-EIC
concern (tracked separately)
See internal/product/designs/workspace-backends.md for the broader
architectural direction this fits into.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rm received /configs and filePath as two separate arguments, deleting
the entire /configs dir on every call. The fix joins them so rm
targets only the intended file. validateRelPath already prevents
traversal, so this is a logic bug, not a security vulnerability.
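Reduced to argv shape, the before/after looks like this (function names and rm flags are illustrative, not the actual handler code):

```go
package main

import (
	"fmt"
	"path/filepath"
)

// Buggy shape: base dir and relative path passed as two rm arguments,
// so rm deletes the whole base directory as well.
func rmArgsBuggy(base, rel string) []string {
	return []string{"rm", "-rf", base, rel}
}

// Fixed shape: join first, so only the intended file is targeted.
func rmArgsFixed(base, rel string) []string {
	return []string{"rm", "-rf", filepath.Join(base, rel)}
}

func main() {
	fmt.Println(rmArgsBuggy("/configs", "config.yaml"))
	fmt.Println(rmArgsFixed("/configs", "config.yaml"))
}
```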
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two related workflow hygiene changes:
## (1) canary-verify: graceful-skip when canary secrets absent
Before: canary-verify hit `scripts/canary-smoke.sh` which exited
non-zero when CANARY_TENANT_URLS was empty. Every main publish
ran → canary-verify failed → red check on main CI signal (7/7 in
past 24h). Noise, no value.
After: smoke step detects the missing-secrets case, writes a
warning to the step summary, sets an output `smoke_ran=false`,
and exits 0. The workflow completes green without pretending to
have tested anything.
Gated downstream: `promote-to-latest` now requires BOTH
`needs.canary-smoke.result == success` AND
`needs.canary-smoke.outputs.smoke_ran == true`. A skip does NOT
auto-promote — manual `promote-latest.yml` remains the release
gate while Phase 2 canary is absent (see
molecule-controlplane/docs/canary-tenants.md for the fleet
stand-up plan + decision framework).
When the canary fleet is stood up and secrets populated: delete
the early-exit branch + the smoke_ran gate. The workflow goes back
to its original "smoke gates promotion" semantics.
## (2) auto-promote-staging.yml — draft
New workflow that fires after CI / E2E Staging Canvas / E2E API /
CodeQL complete on the staging branch, checks that ALL four are
green on the same SHA, and fast-forwards `main` to that SHA.
Shipped disabled: the promote step is gated behind repo variable
`AUTO_PROMOTE_ENABLED=true`. Until that's set, the workflow
dry-runs and logs what it would have done. Toggle via Settings →
Variables when staging CI has been reliably green for a few days.
Safety:
- workflow_run events only fire on push to staging (PRs into
staging don't promote).
- Every required gate must be `completed/success` on the same
head_sha. Pending / failed / skipped / cancelled → abort.
- `--ff-only` push. Refuses to advance main if it has diverged
from staging history (someone landed a direct-to-main commit
that's not on staging). Human resolves the fork.
- `workflow_dispatch` with `force=true` lets us test the flow
end-to-end before flipping the variable on.
Motivation: molecule-core#1496 has been open with 1172 commits
divergence between staging and main. Today that trapped PR #1526
(dynamic canvas runtime dropdown) on staging while prod users
hit the hardcoded-dropdown bug. Auto-promote retires the bulk
staging→main PR pattern once the staging CI it depends on is
reliable.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #1526 shipped the /templates registry + canvas dynamic Runtime /
Model / Required-Env fields on 2026-04-22 — but merged into the
staging branch, not main. The staging→main promotion PR #1496 has
been open unmerged for a while with 1172 commits divergence, so
prod (which builds from main) still carries the old hardcoded
dropdown.
Symptom seen on hongmingwang.moleculesai.app today:
- New Hermes Agent workspace (template declares runtime: hermes) loads
Config tab → Runtime dropdown shows "LangGraph (default)" because
there's no <option value="hermes"> in the hardcoded list; it falls
back to empty-value silently.
- Model field is a plain TextInput with static placeholder
"e.g. anthropic:claude-sonnet-4-6" — should be a combobox populated
from the selected runtime's models[].
- Required Env Vars is a TagList with static placeholder
"e.g. CLAUDE_CODE_OAUTH_TOKEN" — should auto-populate from the
selected model's required_env.
- Net effect: "Save & Deploy" sends empty model + empty env to the
provisioner → workspace instant-fails.
This PR cherry-picks the exact three files from PR #1526 (commit
359dc61 on staging) forward to main, without pulling the other 1171
commits:
- canvas/src/components/tabs/ConfigTab.tsx
- RuntimeOption interface + FALLBACK_RUNTIME_OPTIONS (hermes,
gemini-cli included)
- useEffect fetches /templates and populates runtimeOptions
dynamically
- dropdown renders from runtimeOptions (no hardcoded list)
- Model becomes a combobox with datalist of available models
per selected runtime
- Required Env Vars auto-populates from the selected model's
required_env on model change
- workspace-server/internal/handlers/templates.go
- /templates endpoint returns [{id, name, runtime, models}] with
per-template models registry (id, name, required_env)
- workspace-server/internal/handlers/templates_test.go
- Tests for runtime+models parsing and legacy top-level model
fallback
The canvas Runtime dropdown now resolves "hermes" correctly; the
Model dropdown shows the models[] from the hermes template; Env
auto-populates with HERMES_API_KEY (or whatever the selected
model requires).
Verified locally:
- workspace-server builds clean
- Template handler tests pass: TestTemplatesList_RuntimeAndModelsRegistry,
TestTemplatesList_LegacyTopLevelModel, TestTemplatesList_NonexistentDir
Follow-up: the staging→main promotion gap (#1496) is the
underlying process issue. Either merge that PR or adopt a policy
of landing fixes directly on main (as several PRs have today).
Files here were chosen minimally to avoid pulling unrelated staging
changes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three changes to stop ferrying sensitive content through our public
monorepo. All content already imported to Molecule-AI/internal (private)
— see linked PRs below.
## docs/incidents/INCIDENT_LOG.md — replaced with stub
Contained full security audit cycle records with CWE references,
file:line pointers to historical vulnerabilities, and severity
ratings. None of that belongs in a public repo.
→ Moved to Molecule-AI/internal/security/incident-log.md (PR #20).
Monorepo file becomes a 17-line stub pointing at the internal
location. Future incidents land in the internal file only.
## docs/architecture/canary-release.md — redacted identifiers
Had AWS account ID `004947743811` and IAM role name
`MoleculeStagingProvisioner` embedded. Even though the fleet
described isn't actually running (see state note), these
identifiers are account-specific and don't belong in public git.
→ Removed both values, replaced with generic references + a pointer
to Molecule-AI/internal/runbooks/canary-fleet.md (PR #21) where
the actual identifiers live. Any future rotation touches the
internal file, no public-git-history rewrite needed.
## docs/infra/workspace-terminal.md — reduced to public summary
Contained the full ops runbook: bootstrap script output, per-tenant
SG backfill loop with live SG IDs, customer slug names
(hongmingwang). Useful content but too specific for a public repo.
→ Moved to Molecule-AI/internal/runbooks/workspace-terminal.md
(PR #22). Monorepo file becomes a 30-line public summary of what
the feature does + pointers to code, so external readers /
self-hosters still get the design story.
## What's NOT in this PR (follow-up)
Marketing briefs, SEO plans, campaign copy, research dossiers, and
internal product designs (hermes-adapter-plan, medo-integration,
cognee-*) are the next batches. See docs policy doc coming next to
set team expectations.
Net removal: ~820 lines from public git going forward.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The canary-release.md doc describes the pipeline as if the fleet is
running — referring to AWS account 004947743811 and a configured
MoleculeStagingProvisioner role. Reality as of 2026-04-22: no canary
tenants are provisioned, the 3 GH Actions secrets are empty, and
canary-verify.yml has failed 7/7 times in a row.
Added a top-of-doc ⚠️ state note that:
1. Clarifies this is intended design, not deployed reality.
2. Notes the AWS account ID is historical / unverified.
3. Explains that merges currently rely on manual promote-latest.
4. Cross-links to molecule-controlplane/docs/canary-tenants.md for
the Phase 1 work that's shipped, the Phase 2 stand-up plan, and
the "should we even do this now?" decision framework.
5. Asks whoever lands Phase 2 to reconcile the two docs.
No behaviour change — doc-only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two latent bugs the self-hosted Mac mini had been hiding. Both caught
by the newer toolchains on ubuntu-latest runners after PR #1626.
1. workspace-server/internal/handlers/terminal.go:442
`fmt.Sprintf("%s:%d", host, port)` flagged by go vet as unsafe
for IPv6 (it omits the required [::] brackets). Replaced with
`net.JoinHostPort(host, strconv.Itoa(port))` which handles both
IPv4 and IPv6 correctly. No runtime behaviour change — the only
call site passes "127.0.0.1", so the bug would never trigger in
practice, but vet is right to flag it as a latent correctness
issue.
2. workspace/tests/test_a2a_executor.py::test_set_current_task_updates_heartbeat
`MagicMock()` auto-creates attributes on first access, so
`getattr(heartbeat, "active_tasks", 0)` in shared_runtime.py
returned a MagicMock rather than the default 0. Adding 1 to a
MagicMock returns another MagicMock, so the assertion
`heartbeat.active_tasks == 1` never held. Seeding
`heartbeat.active_tasks = 0` before the first call makes
getattr() return a real int, matching how the real HeartbeatLoop
class initialises itself.
Both pre-existed on main and were hidden by the older Python / Go
toolchains on the Mac mini runner. Verified locally (venv pytest
pass, `go vet ./...` + `go build ./...` clean on workspace-server).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
molecule-core is a public repo — GHA-hosted minutes are free. The
self-hosted Mac mini was only in play to dodge GHA rate limits
(memory feedback_selfhosted_runner), but for these specific
workflows it came with real costs:
- Docker-push workflows emulated linux/amd64 from arm64 via QEMU —
every canvas + platform image build ran ~2-3x slower than native.
- Six PRs worth of keychain-avoidance hacks in publish-* because
`docker login` on macOS writes to osxkeychain unconditionally,
and the Mac mini's launchd user-agent keychain is locked.
- Homebrew pin-down environment variables (HOMEBREW_NO_*) sprinkled
everywhere to work around the shared /opt/homebrew symlink mess
on the runner.
- Setup-python@v5 couldn't write to /Users/runner, so ci.yml
python-lint resorted to a hand-rolled Homebrew python3.11 dance.
- Single runner → fan-out contention; CodeQL's 45-min analysis
fought the canvas publish for the one slot.
Changes across the 7 workflows:
- runs-on: [self-hosted, macos, arm64] → ubuntu-latest (every job)
- publish-canvas-image + publish-workspace-server-image:
drop the hand-rolled auths-map step + QEMU setup + buildx v4
→ docker/login-action@v3 + setup-buildx@v3. Linux + amd64
target = native build.
- canary-verify + promote-latest: replace `brew install crane` +
HOMEBREW_NO_* incantations with imjasonh/setup-crane@v0.4.
- codeql.yml: drop `brew install jq` — jq is preinstalled on
ubuntu-latest.
- ci.yml shellcheck: drop the self-hosted existence check —
shellcheck is preinstalled via apt.
- ci.yml python-lint: replace the Homebrew python3.11 path dance
with actions/setup-python@v5 (which works fine on GHA-hosted),
add requirements.txt caching while we're there.
- Remove stale comments referencing "the self-hosted runner",
"Mac mini", keychain, osxkeychain etc.
The self-hosted Mac mini remains in service for private-repo
workflows only. Memory feedback_selfhosted_runner updated to
reflect the public-repo scope carve-out.
Net -96 lines across the 7 files.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Every standalone workspace-template repo now publishes to
ghcr.io/molecule-ai/workspace-template-<runtime>:latest via the
reusable publish-template-image workflow in molecule-ci (landed
today — one caller per template repo). This PR makes the
provisioner actually use those images:
- RuntimeImages map + DefaultImage switched from bare local tags
(workspace-template:<runtime>) to their GHCR equivalents.
- New ensureImageLocal step before ContainerCreate: if the image
isn't present locally, attempt `docker pull` and drain the
progress stream to completion. Best-effort — if the pull fails
(network, auth, rate limit) the subsequent ContainerCreate still
surfaces the actionable "No such image" error, now with a
GHCR-appropriate hint instead of the defunct
`bash workspace/build-all.sh <runtime>` advice.
- runtimeTagFromImage now handles both forms: legacy
`workspace-template:<runtime>` (local dev via build-all.sh /
rebuild-runtime-images.sh) and the current GHCR shape. Keeps
error hints sensible in both worlds.
- Tests cover the GHCR path for tag extraction and the new error
message shape. Legacy local tags still recognised.
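A sketch of the two-shape parsing, assuming exactly the legacy and GHCR forms named above (the real runtimeTagFromImage may accept more variants):

```go
package main

import (
	"fmt"
	"strings"
)

// runtimeTagFromImage extracts the runtime name from either image form:
//   legacy local dev:  workspace-template:<runtime>
//   GHCR:              ghcr.io/molecule-ai/workspace-template-<runtime>[:tag]
func runtimeTagFromImage(image string) (string, bool) {
	if rest, ok := strings.CutPrefix(image, "workspace-template:"); ok {
		return rest, true
	}
	if rest, ok := strings.CutPrefix(image, "ghcr.io/molecule-ai/workspace-template-"); ok {
		runtime, _, _ := strings.Cut(rest, ":") // drop the :latest / version tag
		return runtime, true
	}
	return "", false
}

func main() {
	r, _ := runtimeTagFromImage("ghcr.io/molecule-ai/workspace-template-hermes:latest")
	fmt.Println(r) // hermes
}
```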
Local dev path unchanged — scripts/build-images.sh and
workspace/rebuild-runtime-images.sh still produce locally-tagged
`workspace-template:<runtime>` images, and Docker's image
resolver matches them before any pull is attempted. So
contributors can keep iterating on a template repo without
round-tripping through GHCR.
Follow-on impact:
- hongmingwang.moleculesai.app (and any other tenant EC2) will
auto-pull `ghcr.io/molecule-ai/workspace-template-hermes:latest`
on the next hermes workspace provision — picking up the real
Nous hermes-agent behind the A2A bridge (template-hermes v2.1.0)
without any tenant-side rebuild step.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
publish-canvas-image has been failing on every main push since 2026-04-21
at `addgroup -g 1000 canvas` because node:20-alpine already ships a `node`
user/group at uid/gid 1000. Same collision workspace-server/Dockerfile.tenant
already fixes with `deluser --remove-home node` before `addgroup`.
Copying that pattern here so the workflow goes green again and canvas images
publish to ghcr. No runtime behaviour change — canvas still runs as non-root
uid 1000.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ContextMenu's children selector ran .filter() inside the Zustand
hook, returning a brand-new array reference on every render.
useSyncExternalStore under the hood compares snapshots with
Object.is — a new array always differs, so React kept scheduling
re-renders, hit the 50-update depth cap, and crashed with minified
error #185.
Observed as "Application error: a client-side exception" on every
SaaS tenant once a session cookie resolved. Caught in dev mode
where the build emits the clear warning:
The result of getSnapshot should be cached to avoid an infinite loop
at ContextMenu (src/components/ContextMenu.tsx:26:34)
Fix: select the stable nodes array once, derive children via
useMemo outside the store subscription. Same output, no new
reference per render.
Manually verified: dev bundle served through a cloudflared tunnel
to a live tenant, ContextMenu component mounts cleanly, remaining
console errors are all unrelated (localhost API 401s from the dev
server pointing at its own origin).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Review turned up two issues in the rollout runbook:
1. The tenant env-var list was missing — today's debugging burned 2
hours on hongmingwang where everything worked infra-side but
canvas 401'd because MOLECULE_ORG_SLUG and CP_UPSTREAM_URL weren't
set. Doc without this sends the next operator down the same hole.
Added a dedicated step-3 table covering CP_UPSTREAM_URL,
MOLECULE_ORG_SLUG, MOLECULE_ORG_ID, AWS_REGION with the exact
failure mode each one produces when missing.
2. Backfill loop used tab-separated aws-cli output directly, which
can concatenate all SG ids into one word and run the loop body
once with no iteration. Inserted `| tr '\t' '\n'` — no-op on
well-behaved output, fix on the concatenated case.
Renumbered subsequent sections.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Expanded the rollout section with the exact scripts + env vars
that landed to make Hermes workspace Terminal work on 2026-04-22.
Points at molecule-controlplane#227 (which adds bootstrap script +
EIC_ENDPOINT_SG_ID env var) so operators can reproduce the setup
on a new AWS account in one command.
Also documents the existing-workspace backfill for the instance_id
column — the CP only writes on new provisions, so pre-migration
workspaces need a manual UPDATE before Terminal routes to the
remote path.
Refs: #1528 (resolved)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Proven by end-to-end testing against a live Hermes workspace EC2:
CP-provisioned workspaces run the agent as a NATIVE process under
the ubuntu user, not inside a Docker container. The earlier
`aws ec2-instance-connect ssh -- docker exec -it ws-X bash` was
doubly wrong:
- aws-cli's `ssh` subcommand doesn't accept a trailing command
- Even if it did, there's no container to exec into
Replaced with a four-step pipeline that matches what actually
works when run by hand:
1. ssh-keygen — ephemeral ed25519 per session
2. aws ec2-instance-connect send-ssh-public-key --instance-os-user ubuntu
3. aws ec2-instance-connect open-tunnel --local-port N (runs in background)
4. ssh -p N -i <key> ubuntu@127.0.0.1
Infra prerequisites (verified in docs/infra/workspace-terminal.md):
- EIC service-linked role created
- EIC Endpoint in the workspace VPC (we created eice-08b035ec8789202f9)
- Workspace SG allows 22/tcp from the EIC Endpoint's SG
- molecule-cp IAM: ec2:DescribeInstances + ec2-instance-connect:*
Changes in this commit:
- eicSSHOptions struct carries session inputs between factories
- openTunnelCmd + sshCommandCmd + sendSSHPublicKey are package vars
so tests can stub them individually
- Default OS user is "ubuntu" (Ubuntu 24.04 CP AMI). Override via
WORKSPACE_EC2_OS_USER env var if the AMI changes
- AWS_REGION env var respected; default us-east-2 matches current CP
- pickFreePort + waitForPort helpers — no hardcoded ports, tolerates
multiple concurrent sessions
- Tests updated: two argv-shape regressions for open-tunnel + ssh
(SSH shape was the silent-drift case that caused the first failure)
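pickFreePort commonly follows the bind-:0 pattern; a sketch under that assumption (the real helper and its pairing with waitForPort may differ):

```go
package main

import (
	"fmt"
	"net"
)

// pickFreePort binds to port 0 so the kernel chooses a free port,
// releases the listener, and returns the number for the tunnel's
// --local-port. A small race window exists between Close and reuse,
// which waitForPort-style readiness polling absorbs in practice.
func pickFreePort() (int, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer l.Close()
	return l.Addr().(*net.TCPAddr).Port, nil
}

func main() {
	p, err := pickFreePort()
	if err != nil {
		panic(err)
	}
	fmt.Println(p > 0) // a concrete, currently-free local port was chosen
}
```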
Refs: #1528, #1531
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>