molecule-core

Author	SHA1	Message	Date
molecule-ai[bot]	45715aa8a5	fix(canvas/test): patch test regressions from PR #1243 + proximity hitbox fix (#1313 ) * fix(ci): revert cancel-in-progress to true — ubuntu-runner dispatch stalled With cancel-in-progress: false, pending CI runs accumulate in the ci-staging concurrency group. New pushes create queued runs, but GitHub dispatches multiple runs for the same SHA instead of replacing the pending one. All runs get stuck/cancelled before completing. Reverting to cancel-in-progress: true restores CI operation — runs that are superseded are cancelled, freeing the concurrency slot for the new run to proceed. Runner availability (ubuntu-latest dispatch stall) is a separate infra issue tracked independently. * fix(security): validate tar header names in copyFilesToContainer — CWE-22 path traversal (#1043) Tar header names were built from raw map keys without validation. A malicious server-side caller could embed "../" in a file name to escape the destPath volume mount (/configs) and write files outside the intended directory. Fix: validate each name with filepath.Clean + IsAbs + HasPrefix("..") checks before using it in the tar header, then join with destPath for the archive header. Also guard parent-directory creation against traversal. Closes #1043. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(canvas/test): patch regressed tests from PR #1243 orgs-page flakiness fix Two regressions introduced by PR #1243 (fix issue #1207): 1. ContextMenu.keyboard.test.tsx — `setPendingDelete` now receives `{id, name, hasChildren}` (cascade-delete UX, PR #1252), but the test expected only `{id, name}`. Added `hasChildren: false` to the assertion. 2. orgs-page.test.tsx — 10 tests awaited `vi.advanceTimersByTimeAsync(50)` without `act()`. With fake timers, `setState` (synchronous) is flushed by `advanceTimersByTimeAsync`, but the React state update it triggers is a microtask — so the test saw stale render. Wrapping in `act(async () => { await vi.advanceTimersByTimeAsync(50); })` ensures microtasks drain before assertions run. All 813 vitest tests pass. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(canvas): add 100px proximity threshold to drag-to-nest detection Fixes #1052 — previously, getIntersectingNodes() returned any node whose bounding box overlapped the dragged node, regardless of actual pixel distance. On a sparse canvas this triggered the "Nest Workspace" dialog even when the dragged node was nowhere near any target. The fix adds an on-node-drag proximity filter: only nodes within 100px (center-to-center) of the dragged node are eligible as nest targets. Distance is computed as squared Euclidean to avoid the sqrt overhead in the hot drag path. Added two tests to Canvas.pan-to-node.test.tsx covering the mock wiring and confirming the regression is addressed in Canvas.tsx. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: molecule-ai[bot] <276602405+molecule-ai[bot]@users.noreply.github.com> Co-authored-by: Molecule AI Core-FE <core-fe@agents.moleculesai.app> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-21 07:06:57 +00:00
molecule-ai[bot]	bcf7f93281	fix(ci): restore valid YAML in ci.yml — correct concurrency + ubuntu runner Root cause: commits `e6d48e6` and `e085621` stored ci.yml with JSON-escaped content (literal \n sequences, leading double-quote) instead of proper YAML with actual newlines. All CI runs failed with "workflow file issue" before any job could start. Fix: restore from pre-corruption base (`2517164`), apply intended changes: - concurrency.cancel-in-progress: true → false (queue rather than cancel) - changes job: runs-on ubuntu-latest (frees mac mini for real work) PR #1242 intent preserved, corruption from API commit removed.	2026-04-21 03:27:06 +00:00
molecule-ai[bot]	012f13ca46	fix(ci): remove garbage commit-SHA line from ci.yml — restore valid concurrency block Line 9 of ci.yml accidentally contained a bare string with the commit SHA instead of the intended concurrency: block, causing all CI runs to fail with a YAML parse error. Also restores the changes from the PR #1242 intent (workflow-level concurrency with cancel-in-progress: false). Fixes: CI failure on staging after PR #1242 merge.	2026-04-21 03:15:42 +00:00
molecule-ai[bot]	e6d48e6590	ci: add workflow-level concurrency to ci.yml and codeql.yml (#1242 ) cancel-in-progress: false queues new runs so the single mac mini runner doesn't fight itself when pushes stack during rebases or cross-PR contention. Existing e2e-api.yml already has this pattern. Fixes: 19 queued runs on single self-hosted runner (02:55 UTC snapshot) Co-authored-by: Molecule AI Fullstack (floater) <fullstack-floater@agents.moleculesai.app>	2026-04-21 03:07:31 +00:00
molecule-ai[bot]	eae762ec08	fix(ci): move changes job off self-hosted runner + add workflow concurrency Two changes to relieve macOS arm64 runner contention: 1. `changes` job: runs on `ubuntu-latest` instead of `[self-hosted, macos, arm64]`. This job does a plain `git diff` — it has zero macOS dependencies. Moving it off the runner frees the slot immediately on every workflow trigger. 2. Add workflow-level concurrency to `ci.yml`: `concurrency: group: ci-${{ github.ref }}; cancel-in-progress: true` Without this, every new push to a PR or main queues a full new workflow run, each competing for the same single runner. With `cancel-in-progress: true`, stale in-flight CI runs are cancelled when a newer commit arrives — the runner always runs the latest state, not a backlog of old ones. Context: the self-hosted macOS arm64 runner is shared by ci.yml, e2e-api.yml, canary-verify.yml, and publish-*.yml. The combination of (1) the `changes` job holding the runner during `fetch-depth: 0` checkout on every trigger, and (2) no workflow-level cancellation caused 100+ queued runs with 0 in-progress. Follow-up candidates (need verification before changing): - platform-build: Go build may work on ubuntu-latest (no macOS deps) - canvas-build: Next.js build may work on ubuntu-latest - python-lint: needs `setup-python` instead of Homebrew Python Co-authored-by: Molecule AI Infra-SRE <infra-sre@agents.moleculesai.app>	2026-04-21 01:44:27 +00:00
Hongming Wang	52235aeb27	feat(router): /cp/* reverse-proxy to CP + same-origin canvas fetches Canvas's browser bundle issues fetches to both CP endpoints (/cp/auth/me, /cp/orgs, ...) AND tenant-platform endpoints (/canvas/viewport, /approvals/pending, /org/templates). They share ONE build-time base URL. Baking api.moleculesai.app broke tenant calls with 404; baking the tenant subdomain broke auth. Tried both today and saw exactly one failure mode per attempt. Real fix: same-origin fetches + tenant-side split. Adds: internal/router/cp_proxy.go # /cp/* → CP_UPSTREAM_URL mounted before NoRoute(canvasProxy). Now a tenant serves: /cp/* → reverse-proxy to api.moleculesai.app /canvas/viewport, /approvals/pending, /workspaces/:id/*, /ws, /registry, → tenant platform (existing handlers) /metrics everything else → canvas UI (existing reverse-proxy) Canvas middleware reverts to `connect-src 'self' wss:` for the same-origin path (keeping explicit PLATFORM_URL whitelist as a self-hosted escape hatch when the build-arg is non-empty). CI build-arg flips to NEXT_PUBLIC_PLATFORM_URL="" so the bundle issues relative fetches. Security of cp_proxy: - Cookie + Authorization PRESERVED across the hop (opposite of canvas proxy) — they carry the WorkOS session, which is the whole point. - Host rewritten to upstream so CORS + cookie-domain on the CP side see their own hostname. - Upstream URL validated at construction: must parse, must be http(s), must have a host — misconfig fails closed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 13:01:40 -07:00
Hongming Wang	b9e1f1e88e	fix(ci): bake api.moleculesai.app into tenant canvas bundle Canvas's browser-side code (auth.ts, api.ts, billing.ts) all call fetch(PLATFORM_URL + /cp/). PLATFORM_URL comes from NEXT_PUBLIC_PLATFORM_URL at build time; with the build arg unset, it falls back to http://localhost:8080 in the compiled bundle. That means on a tenant like hongmingwang.moleculesai.app, the user's browser actually tried to fetch http://localhost:8080/cp/ auth/me — which resolves to the USER'S OWN machine, not the tenant. Login redirect loops 404. Every tenant canvas has been unable to complete a fresh login on this path; existing sessions only worked because the cookie was already set domain-wide. Fix: pass NEXT_PUBLIC_PLATFORM_URL=https://api.moleculesai.app as a build arg in the tenant-image workflow. CP already allows CORS from .moleculesai.app + credentials, and the session cookie is scoped to .moleculesai.app so tenant subdomains inherit it. Verified in prod by rebuilding canvas locally with the flag and hot-patching the hongmingwang instance via SSM. Baked chunks now contain api.moleculesai.app; browser auth redirects resolve cleanly to the CP. Self-hosted users override by rebuilding with their own URL — same pattern molecule-app uses with NEXT_PUBLIC_CP_ORIGIN. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 12:51:22 -07:00
rabbitblood	446f2d6e51	fix(ci): replace sleep 360 with health-check poll in canary-verify (#1013 ) The canary-verify workflow blocked the self-hosted runner for a fixed 6 minutes regardless of whether canaries had already updated. This wastes the runner slot when canaries update in 2-3 minutes. Fix: poll each canary's /health endpoint every 30s for up to 7 min. Exit early when all canaries report the expected SHA. Falls back to proceeding after timeout — the smoke suite validates regardless. Typical time saving: ~3-4 minutes per canary verify run. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-19 19:29:15 -07:00
Hongming Wang	07ec90a23c	ci(codeql): cover main + staging via workflow GitHub's UI-configured "Code quality" scan only fires on the default branch (staging), which leaves every staging→main promotion PR unscanned. The "On push and pull requests to" field in the UI has no dropdown; multi-branch scanning on private repos without GHAS isn't available there. Workflow file gives us the control we can't get in the UI: triggers on push + pull_request for both branches. Runs on the same self-hosted mac mini via [self-hosted, macos, arm64]. upload: never — GHAS isn't enabled on this repo so the SARIF upload API 403s. Keep results locally, filter to error+warning severity, fail the PR check on findings, publish SARIF as a workflow artifact. Flipping upload: never → always after GHAS is enabled (if ever) is a one-line change. Picks up the review-flagged improvements from the earlier closed PR: - jq install step (brew, no assumption it's present) - severity filter (error+warning only, drops noisy note-level) - set -euo pipefail - SARIF glob (file name doesn't match matrix language id) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 14:34:04 -07:00
Hongming Wang	53c55097f8	fix(ci): move canary-verify to self-hosted runner GitHub-hosted ubuntu-latest runs on this repo hit "recent account payments have failed or your spending limit needs to be increased" — same root cause as the publish + CodeQL + molecule-app workflow moves earlier this quarter. canary-verify was the last one still on ubuntu-latest. Switches both jobs to [self-hosted, macos, arm64]. crane install switched from Linux tarball to brew (matches promote-latest.yml's install pattern + avoids /usr/local/bin write perms on the shared mac mini). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 11:26:41 -07:00
Hongming Wang	7619e44802	ci(promote-latest): suppress brew cleanup that hits perm-denied on shared runner	2026-04-19 05:55:45 -07:00
Hongming Wang	fb2c126ed1	ci(promote-latest): run on self-hosted mac mini (GH-hosted quota blocked)	2026-04-19 05:53:39 -07:00
Hongming Wang	5a67c6be4a	ci(promote-latest): workflow_dispatch to retag :staging-<sha> → :latest Escape hatch for the initial rollout window (canary fleet not yet provisioned, so canary-verify.yml's automatic promotion doesn't fire) AND for manual rollback scenarios. Uses the default GITHUB_TOKEN which carries write:packages on repo- owned GHCR images, so no new secrets are needed. crane handles the remote retag without pulling or pushing layers. Validates the src tag exists before retagging + verifies the :latest digest post-retag so a typo can't silently promote the wrong image. Trigger from Actions → promote-latest → Run workflow → enter the short sha (e.g. "4c1d56e"). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 05:42:48 -07:00
Hongming Wang	ac85ee2a0d	fix(ci): clone sibling plugin repo so publish-workspace-server-image builds Publish has been failing since the 2026-04-18 open-source restructure (#964's merge) because workspace-server/Dockerfile still COPYs ./molecule-ai-plugin-github-app-auth/ but the restructure moved that code out to its own repo. Every main merge since has produced a "failed to compute cache key: /molecule-ai-plugin-github-app-auth: not found" error — prod images haven't moved. Fix: add an actions/checkout step that fetches the plugin repo into the build context before docker build runs. Private-repo safe: uses PLUGIN_REPO_PAT secret (fine-grained PAT with Contents:Read on Molecule-AI/molecule-ai-plugin-github-app-auth). Falls back to the default GITHUB_TOKEN if the plugin repo is public. Ops: set repo secret PLUGIN_REPO_PAT before the next main merge, or publish will fail with a 404 on the checkout step. Also gitignores the cloned dir so local dev builds don't accidentally commit it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 05:19:31 -07:00
Hongming Wang	7e3a6fd756	feat(canary): gate :latest tag promotion on canary verify green (Phase 3) Completes the canary release train. Before this, publish-workspace- server-image.yml pushed both :staging-<sha> and :latest on every main merge — meaning the prod tenant fleet auto-pulled every image immediately, before any post-deploy smoke test. A broken image (think: this morning's E2E current_task drift, but shipped at 3am instead of caught in CI) would have fanned out to every running tenant within 5 min. Now: - publish workflow pushes :staging-<sha> ONLY - canary tenants are configured to track :staging-<sha>; they pick up the new image on their next auto-update cycle - canary-verify.yml runs the smoke suite (Phase 2) after the sleep - on green: a new promote-to-latest job uses crane to remotely retag :staging-<sha> → :latest for both platform and tenant images - prod tenants auto-update to the newly-retagged :latest within their usual 5-min window - on red: :latest stays frozen on prior good digest; prod is untouched crane is pulled onto the runner (~4 MB, GitHub release) rather than docker-daemon retag so the workflow doesn't need a privileged runner. Rollback: if canary passed but something surfaces post-promotion, operator runs "crane tag ghcr.io/molecule-ai/platform:<prior-good-sha> latest" manually. A follow-up can wrap that in a Phase 4 admin endpoint / script. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 03:33:04 -07:00
Hongming Wang	c1ac32365d	feat(canary): smoke harness + GHA verification workflow (Phase 2) Post-deploy verification for staging tenant images. Runs against the canary fleet after each publish-workspace-server-image build — catches auto-update breakage (a la today's E2E current_task drift) before it propagates to the prod tenant fleet that auto-pulls :latest every 5 min. scripts/canary-smoke.sh iterates a space-sep list of canary base URLs (paired with their ADMIN_TOKENs) and checks: - /admin/liveness reachable with admin bearer (tenant boot OK) - /workspaces list responds (wsAuth + DB path OK) - /memories/commit + /memories/search round-trip (encryption + scrubber) - /events admin read (AdminAuth C4 path) - /admin/liveness without bearer returns 401 (C4 fail-closed regression) .github/workflows/canary-verify.yml runs after publish succeeds: - 6-min sleep (tenant auto-updater pulls every 5 min) - bash scripts/canary-smoke.sh with secrets pulled from repo settings - on failure: writes a Step Summary flagging that :latest should be rolled back to prior known-good digest Phase 3 follow-up will split the publish workflow so only :staging-<sha> ships initially, and canary-verify's green gate is what promotes :staging-<sha> → :latest. This commit lays the test gate alone so we have something running against tenants immediately. Secrets to set in GitHub repo settings before this workflow can run: - CANARY_TENANT_URLS (space-sep list) - CANARY_ADMIN_TOKENS (same order as URLs) - CANARY_CP_SHARED_SECRET (matches staging CP PROVISION_SHARED_SECRET) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 03:30:19 -07:00
Hongming Wang	64796838e0	ci: update GitHub Actions to current stable versions (closes #780 ) - golangci/golangci-lint-action@v4 → v9 - docker/setup-qemu-action@v3 → v4 - docker/setup-buildx-action@v3 → v4 - docker/build-push-action@v5 → v6 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-18 12:04:10 -07:00
Hongming Wang	04292f419c	fix(ci): update working-directory for workspace-server/ and workspace/ renames - platform-build: working-directory platform → workspace-server - golangci-lint: working-directory platform → workspace-server - python-lint: working-directory workspace-template → workspace - e2e-api: working-directory platform → workspace-server - canvas-deploy-reminder: fix duplicate if: key (merged into single condition) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-18 07:05:44 -07:00
Hongming Wang	a62ad0bd66	chore: update publish workflow name + document staging-first flow Default branch is now staging for both molecule-core and molecule-controlplane. PRs target staging, CEO merges staging → main to promote to production. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-18 07:02:02 -07:00
rabbitblood	525212c64d	Merge branch 'main' of https://github.com/Molecule-AI/molecule-core	2026-04-18 01:08:53 -07:00
rabbitblood	8562ef8f46	fix(ci): add staging branch to CI triggers PRs targeting staging got no CI because the workflow only triggered on main. Now runs on both main and staging pushes + PRs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-18 01:08:44 -07:00
Hongming Wang	eafc413a43	chore: rename publish-platform-image → publish-workspace-server-image Aligns CI workflow filename with the platform/ → workspace-server/ rename. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-18 01:05:09 -07:00
Hongming Wang	39074cc4ae	chore: final open-source cleanup — binary, stale paths, private refs - Remove compiled workspace-server/server binary from git - Fix .gitignore, .gitattributes, .githooks/pre-commit for renamed dirs - Fix CI workflow path filters (workspace-template → workspace) - Replace real EC2 IP and personal slug in test_saas_tenant.sh - Scrub molecule-controlplane references in docs - Fix stale workspace-template/ paths in provisioner, handlers, tests - Clean tracked Python cache files Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-18 00:38:55 -07:00
Hongming Wang	d8026347e5	chore: open-source restructure — rename dirs, remove internal files, scrub secrets Renames: - platform/ → workspace-server/ (Go module path stays as "platform" for external dep compat — will update after plugin module republish) - workspace-template/ → workspace/ Removed (moved to separate repos or deleted): - PLAN.md — internal roadmap (move to private project board) - HANDOFF.md, AGENTS.md — one-time internal session docs - .claude/ — gitignored entirely (local agent config) - infra/cloudflare-worker/ → Molecule-AI/molecule-tenant-proxy - org-templates/molecule-dev/ → standalone template repo - .mcp-eval/ → molecule-mcp-server repo - test-results/ — ephemeral, gitignored Security scrubbing: - Cloudflare account/zone/KV IDs → placeholders - Real EC2 IPs → <EC2_IP> in all docs - CF token prefix, Neon project ID, Fly app names → redacted - Langfuse dev credentials → parameterized - Personal runner username/machine name → generic Community files: - CONTRIBUTING.md — build, test, branch conventions - CODE_OF_CONDUCT.md — Contributor Covenant 2.1 All Dockerfiles, CI workflows, docker-compose, railway.toml, render.yaml, README, CLAUDE.md updated for new directory names. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-18 00:24:44 -07:00
Hongming Wang	295c4d930a	chore: open-source preparation — scrub secrets, add community files Security: - Replace hardcoded Cloudflare account/zone/KV IDs in wrangler.toml with placeholders; add wrangler.toml to .gitignore, ship .example - Replace real EC2 IPs in docs with <EC2_IP> placeholders - Redact partial CF API token prefix in retrospective - Parameterize Langfuse dev credentials in docker-compose.infra.yml - Replace Neon project ID in runbook with <neon-project-id> Community: - Add CONTRIBUTING.md (build, test, branch conventions, CI info) - Add CODE_OF_CONDUCT.md (Contributor Covenant 2.1) Cleanup: - Replace personal runner username/machine name in CI + PLAN.md - Replace personal tenant URL in MCP setup guide - Replace personal author field in bundle-system doc - Replace personal login in webhook test fixture - Rewrite cryptominer incident reference as generic security remediation - Remove private repo commit hashes from PLAN.md Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-18 00:10:56 -07:00
Hongming Wang	e093f121f0	fix(ci): use github.event.before for push diff, fetch-depth 0 HEAD~1 doesn't work for merge commits. Use github.event.before (the previous main tip) for push events and github.event.pull_request.base.sha for PRs. fetch-depth: 0 ensures both SHAs are available. Fallback: if BASE is empty (new branch), run all jobs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-17 20:23:28 -07:00
Hongming Wang	310fc56f96	fix(ci): replace dorny/paths-filter with git diff (macOS compat) dorny/paths-filter uses Docker internally which doesn't work on the self-hosted macOS arm64 runner — every CI run since the path filter change has failed with no jobs. Replace with a simple git diff against HEAD~1 that checks path prefixes. Same behavior, no Docker dependency. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-17 20:16:39 -07:00
Hongming Wang	945016d104	fix(ci): skip CI jobs for docs-only PRs using path filters CI now detects which paths changed and skips irrelevant jobs: - Platform (Go): only runs when platform/ changes - Canvas (Next.js): only runs when canvas/ changes - Python Lint: only runs when workspace-template/ changes - Shellcheck: only runs when tests/e2e/ or scripts/ change - E2E API: only runs when platform/ or tests/e2e/** change Docs-only PRs (.md, docs/*) skip all 5 jobs, saving ~15 min of runner time per PR. Uses dorny/paths-filter for the CI workflow and native paths: filter for the E2E workflow. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-17 10:09:39 -07:00
Hongming Wang	440e09b360	fix(ci): remove Fly registry from publish pipeline, push tenant to GHCR Fly.io was deleted — EC2 tenant instances now pull from GHCR. - Remove Fly registry push step (401 Unauthorized since Fly deleted) - Remove flyctl deploy step - Push tenant image to ghcr.io/molecule-ai/platform-tenant instead - Simplify GHCR auth config (remove Fly token) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 18:26:26 -07:00
Hongming Wang	c5ef9a71fc	fix(ci): use Dockerfile.tenant for Fly registry image (Go + Canvas) The publish workflow was pushing platform/Dockerfile (Go-only) to the Fly registry, but tenant machines run the combined image (Go + Canvas reverse proxy). This caused "canvas unavailable" after machine update. Changes: - Fly registry build: platform/Dockerfile → platform/Dockerfile.tenant - GHCR: keeps Go-only image (for self-hosted/dev use) - Path triggers: add canvas/** and manifest.json (tenant image includes both) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 09:31:51 -07:00
Hongming Wang	0c9fda559a	fix(ci): bypass docker login + macOS Keychain for image publish Six prior PRs (#273, #319, #322, #341, #484, #486) all kept calling `docker login` and tried to coerce credsStore via increasingly elaborate config tricks. None worked. The latest publish-canvas-image and publish-platform-image runs on main are still failing with: error storing credentials - err: exit status 1, out: `User interaction is not allowed. (-25308)` Verified locally on the runner host (2026-04-16): `docker login` on macOS unconditionally writes credentials to osxkeychain after a successful login, regardless of the config presented to it. # I wrote this: { "auths": {}, "credsStore": "", "credHelpers": {} } # After `docker login --config <dir> ghcr.io ...` succeeded: { "auths": { "ghcr.io": {} }, # empty — auth is in Keychain "credsStore": "osxkeychain" # Docker rewrote it back } So `--config` flag, DOCKER_CONFIG env var, credsStore="" etc. all share the same fate: Docker re-enables osxkeychain after every successful login. The Mac mini runner is a launchd user agent with a locked Keychain, so storage fails with -25308. This PR replaces the `docker login` invocation entirely. We write `base64(user:pat)` directly into the disposable DOCKER_CONFIG's `auths` map. `docker/build-push-action@v5` and the daemon honor the auths map for push without ever calling `docker login`, so the Keychain is never involved. Same shape in both workflows: - publish-canvas-image.yml — single registry (ghcr.io) - publish-platform-image.yml — two registries (ghcr.io + registry.fly.io) Fly username remains literal "x". Security: - Token env vars never echoed. Heredoc writes the auth blob via `umask 077` (file mode 600). The temp config dir lives under RUNNER_TEMP and is reaped at job end. - Diagnostics preserved (docker version + binary ls + registry keys only, no values) so future runner permission regressions remain visible without leaking secrets. Equivalent to closed PR #464 — re-opening because main is still broken (verified by inspecting the most recent failure). The closing comment on #464 stated the issue was already addressed by #341, but it isn't.	2026-04-16 09:25:20 -07:00
Hongming Wang	6f3c16eb78	fix(ci): use docker login CLI instead of login-action to bypass macOS Keychain docker/login-action@v3 ignores DOCKER_CONFIG and still tries the macOS system keychain on the self-hosted runner, producing: error storing credentials: User interaction is not allowed. (-25308) Switch to `docker login ... --password-stdin` which respects DOCKER_CONFIG and writes credentials to the per-run config.json we created in the isolate step. Applied to both GHCR and Fly registry logins in both publish workflows. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 08:45:20 -07:00
Hongming Wang	f93ec926cb	fix(ci): replace heredoc JSON with printf in publish workflows The heredoc block writing Docker config.json had unindented `{` at column 1, which GitHub Actions' YAML parser interpreted as a flow mapping start — causing every publish-platform-image and publish-canvas-image run to fail with 0 jobs (startup_failure). Replace `cat <<'JSON' ... JSON` with a single `printf` call that produces identical config.json content without confusing the parser. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 08:20:43 -07:00
Hongming Wang	d7161f5877	feat(ci): add Fly deploy step to publish-platform-image workflow After pushing the tenant image to registry.fly.io, the workflow now lists all running/stopped molecule-tenant machines and updates each to the newly pushed image tag. Gracefully skips if no machines exist (control plane provisions on demand). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 07:29:42 -07:00
Hongming Wang	77d6f5e7a0	fix(ci): heredoc indentation in publish workflows + add dev-start.sh Two fixes: 1. publish-canvas-image.yml + publish-platform-image.yml: the JSON heredoc for config.json had leading whitespace from YAML indentation, producing invalid JSON. Docker fell back to osxkeychain → -25308. Fixed by removing indentation inside the heredoc body. 2. Added scripts/dev-start.sh — one-command local dev environment. Starts infra (docker-compose), platform (Go), and canvas (Next.js) with proper health checks and cleanup on Ctrl-C.	2026-04-16 05:56:25 -07:00
Hongming Wang	104ae6ca68	fix(ci): remove molecli build step — CLI moved to standalone repo	2026-04-16 05:28:10 -07:00
Hongming Wang	61d97e9a34	Merge pull request #468 from Molecule-AI/fix/issue-458-e2e-cancel-protection ci: extract e2e-api into dedicated workflow with run-level cancel protection (#458)	2026-04-16 05:16:45 -07:00
DevOps Engineer	8ba6e18c0a	ci: extract e2e-api into dedicated workflow with run-level cancel protection (#458 ) Job-level `concurrency.cancel-in-progress: false` only prevents sibling jobs from killing each other — it does not protect the parent workflow run from being cancelled when a new push arrives. Every PR push was cancelling the in-progress E2E run, forcing manual `gh run rerun` across 7+ active PRs. Fix: move e2e-api into `.github/workflows/e2e-api.yml` with a workflow-level concurrency group (`e2e-api-${{ github.ref }}`, cancel-in-progress: false). New pushes now queue behind the running E2E job instead of cancelling it. Fast jobs (platform-build, canvas-build, shellcheck, python-lint) stay in ci.yml and retain normal run-level cancellation for quick iteration feedback. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-16 11:15:13 +00:00
Hongming Wang	d424bd947f	chore: remove extracted directories, add manifest-driven Docker builds Remove plugins/, workspace-configs-templates/, org-templates/ dirs (now in standalone repos). Add manifest.json listing all 33 repos and scripts/clone-manifest.sh to clone them. Both Dockerfiles now use the manifest script instead of 33 hardcoded git-clone lines. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 04:13:29 -07:00
Hongming Wang	586cd87ab6	Merge pull request #415 from Molecule-AI/fix/issue-399-canvas-image-publish feat(ci): auto-publish canvas Docker image to GHCR on canvas/** merges	2026-04-16 03:08:27 -07:00
Canvas Agent	c928b4cbe8	feat(ci): auto-publish canvas Docker image to GHCR on canvas/ merges Closes #399. ## Root cause `publish-platform-image.yml` existed for the Go platform image but there was no equivalent for the canvas. After every canvas PR merged, CI ran `npm run build` and passed — but the live container at :3000 was never updated. The `canvas-deploy-reminder` job only posted a comment asking operators to manually rebuild, which was consistently missed. ## What this adds - `.github/workflows/publish-canvas-image.yml`: triggers on `canvas/` changes to main (and `workflow_dispatch`). Mirrors the platform workflow: macOS Keychain isolation, QEMU for linux/amd64, Buildx, GHCR push with `:latest` + `:sha-<7>` tags. - `NEXT_PUBLIC_PLATFORM_URL` / `NEXT_PUBLIC_WS_URL` resolve from `workflow_dispatch` inputs → `CANVAS_PLATFORM_URL` / `CANVAS_WS_URL` repo secrets → `localhost:8080` defaults (safe for self-hosted dev). - Inputs are passed via env vars (not direct `${{ }}` interpolation) to prevent shell injection from string inputs. - `docker-compose.yml`: adds `image: ghcr.io/molecule-ai/canvas:latest` to the canvas service so `docker compose pull canvas && docker compose up -d canvas` applies the new image. `build:` is retained for local development. Adds a comment clarifying that `NEXT_PUBLIC_*` runtime env vars are ignored by the standalone bundle (build-time only). - `ci.yml`: updates `canvas-deploy-reminder` commit comment to reference `docker compose pull` as the fast path, with `docker compose build` as the local-source fallback. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-16 09:23:26 +00:00
Hongming Wang	111c59da68	fix(ops): bake workspace-configs-templates into platform Docker image Tenant machines were booting with no templates because the Dockerfile only shipped the Go binary + migrations. The canvas showed "0 templates" with an empty picker. Changes: - platform/Dockerfile: build context changed from ./platform to repo root so COPY can reach workspace-configs-templates/ alongside the Go source. COPY paths updated for platform/{go.mod,go.sum,*.go} and platform/migrations/. - .github/workflows/publish-platform-image.yml: context: . (was ./platform), paths trigger now includes workspace-configs-templates/ so template changes rebuild the image. Phase A of the template-registry plan. Phase B adds a DB registry + on-demand fetch for community templates (user pastes GitHub URL at workspace creation time). The baked defaults always ship in the image for zero-config tenant boot. Verified: `docker build -f platform/Dockerfile -t test .` succeeds, `docker run --rm test ls /workspace-configs-templates/` shows all 8 templates (autogen, claude-code-default, crewai, deepagents, gemini-cli, hermes, langgraph, openclaw). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 01:54:47 -07:00
Hongming Wang	aa2a283835	fix(ci): explicitly disable osxkeychain credsStore for self-hosted runner #273 tried to fix the macOS Keychain -25308 error by pointing DOCKER_CONFIG at a per-run temp dir with `{"auths": {}}`. That was necessary but not sufficient: Docker on macOS inherits `osxkeychain` as the default credsStore even when config.json doesn't declare one (comes from Docker Desktop's bundled binding), so the login-action still tried to call /usr/local/bin/docker-credential-osxkeychain which fails with -25308 from the non-interactive launchd session. Evidence: after #273, publish-platform-image still failed on every main merge with: error saving credentials: error storing credentials - err: exit status 1, out: `User interaction is not allowed. (-25308)` Fix: write a config.json that explicitly sets `credsStore: ""` and clears `credHelpers`, forcing Docker to store creds in the inline `auths` map of this disposable config.json instead of reaching for the keychain. Also print config.json at diagnostic time so a future regression surfaces in the log instead of at login. No runtime / test impact — this only changes what the runner writes to the workflow's temp DOCKER_CONFIG directory. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 21:20:06 -07:00
Hongming Wang	3d0e093b11	chore(ci): serialize e2e-api across runs to prevent docker collision Now that the Molecule-AI org has two self-hosted Apple-silicon runners (`hongming-m1-mini` + `hongming-m1-mini-2`) servicing the same label set, two CI runs could execute the e2e-api job concurrently. Each run starts fixed-name docker containers (`molecule-ci-postgres`, `molecule-ci-redis`) bound to host ports 15432/16379 — a collision means the second run fails with "container name already in use" or "port already in use". Adds a workflow-level `concurrency: e2e-api` group to the job so GitHub Actions serializes e2e-api executions globally regardless of which runner picks them up. `cancel-in-progress: false` ensures later runs queue rather than cancelling the in-flight one (we want every PR's e2e check to actually execute, not get skipped by a newer push). Tradeoff: e2e-api is now effectively single-threaded across the whole org. Measured duration is ~1-2 min per run, so the added serialization latency is small relative to total CI wall time. All other jobs still parallelize across both runners.	2026-04-15 17:06:41 -07:00
Hongming Wang	63934ab487	fix(ci): publish-platform-image keychain + path diagnostics Every publish-platform-image run since the `3ff40c4` self-hosted runner migration has been failing with two runner-level issues that the workflow now works around (keychain) or surfaces clearly (path): 1. "error storing credentials - err: exit status 1, out: 'User interaction is not allowed. (-25308)'" docker/login-action tries to persist the GHCR + Fly tokens in the macOS Keychain, but the Mac mini runner runs as a non-interactive launchd service without an unlocked desktop session — keychain access raises -25308. Fix: set DOCKER_CONFIG to a per-run temp dir containing a plain config.json before the login step so credentials land in a file, not the keychain. This is the same trick the GitHub-hosted macos runners use in docker action examples. 2. "Unexpected error attempting to determine if executable file exists '/usr/local/bin/docker': Error: EACCES: permission denied, stat '/usr/local/bin/docker'" Not a workflow bug — the runner literally can't read the Docker binary path. Adds a diagnostic step before QEMU/buildx setup that prints: PATH, `command -v docker`, `docker --version`, and `ls -la` on both /usr/local/bin/docker and /opt/homebrew/bin/docker. Surfacing these in the log means the next failure (if any) shows the actual problem instead of hiding behind a cryptic buildx error. Does NOT fix the root cause of #2 — that needs the user to SSH into the Mac mini runner and reinstall / re-permission Docker Desktop (or switch to Colima/OrbStack). The diagnostic output will tell us exactly which path is broken. The 20+ queued CI runs from `ci.yml` are unrelated to this PR — they are stuck because the self-hosted runner has severely degraded queue throughput (runs wait 2+ hours before being picked up). That's a separate runner-health issue tracked as a user action in the triage report. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 16:06:28 -07:00
Hongming Wang	dde97433ed	fix(ci): apply user's bypass-setup-python to main (missed in #186 squash-merge) #186's squash-merge commit (`3ff40c4b`) took 15e15a21 (AGENT_TOOLSDIRECTORY override) but missed a6cfc5f (bypass setup-python entirely) which was pushed to the PR branch after the merge was initiated. The merge commit still has the old setup-python@v5 job config. Applies a6cfc5f's ci.yml verbatim via git checkout. Restores the Homebrew-python3.11 bypass path that the user prototyped. No other changes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 10:58:22 -07:00
Hongming Wang	3ff40c4b68	chore(ci): migrate all jobs to self-hosted macOS arm64 runner * chore(ci): migrate all jobs to self-hosted macOS arm64 runner Switches every job in `ci.yml` and `publish-platform-image.yml` from `ubuntu-latest` to `[self-hosted, macos, arm64]` to avoid GitHub-hosted minute rate limits. All jobs run on a single Apple-silicon self-hosted runner registered at the Molecule-AI org level. Notable non-trivial adaptations (macOS runners can't use `services:` and some GHA marketplace actions are Linux-only): - e2e-api: `services: postgres/redis` replaced with inline `docker run` steps. Ports remapped to 15432/16379 to avoid collision with anything the host may already expose on the standard ports. Containers are named (`molecule-ci-postgres` / `molecule-ci-redis`) and torn down in an `if: always()` step. Postgres readiness is still gated on pg_isready via `docker exec`. - shellcheck: `ludeeus/action-shellcheck` is a Docker action, Linux-only. Replaced with a direct `shellcheck` invocation (pre-installed on the runner) that scans `tests/e2e/.sh` with `--severity=warning`. - publish-platform-image: added `docker/setup-qemu-action@v3` and an explicit `platforms: linux/amd64` on both `docker/build-push-action` invocations. The runner is arm64 but Fly tenant machines pull amd64, so QEMU-emulated cross-arch builds are required. GHA cache-from/cache-to behavior is unchanged. Runner prereqs (one-time host setup): - Docker Desktop installed and running (for e2e-api + image publish) - `shellcheck` on PATH - `docker` on PATH - Go / Node / gh / Python are installed via setup- actions per job * fix(ci): set AGENT_TOOLSDIRECTORY for python-lint on self-hosted runner setup-python@v5 defaults to /Users/runner/hostedtoolcache which doesn't exist on the hongming-claw self-hosted runner. AGENT_TOOLSDIRECTORY tells the action to use a writable path under the runner user's home directory. Fixes the only failing job in CI run 24469156329 on PR #186. --------- Co-authored-by: Hongming Wang <HongmingWang-Rabbit@users.noreply.github.com>	2026-04-15 10:48:27 -07:00
Hongming Wang	6f785f0b5a	fix(ci): revert Fly registry username to 'x' — 'molecule-ai' gets 401 Post-mortem on the failed publish-platform-image run on main (PR #82): Fly's Docker registry requires username EXACTLY equal to "x". My code-review "readability fix" changing it to "molecule-ai" caused every push to return 401 Unauthorized. Verified locally: echo $FLY_API_TOKEN \| docker login registry.fly.io -u x --password-stdin → Login Succeeded echo $FLY_API_TOKEN \| docker login registry.fly.io -u molecule-ai --password-stdin → 401 Unauthorized Lesson: don't second-guess docs that specify a literal value. Comment now says "MUST be literal 'x'" with a 2026-04-15 verification note to prevent future regressions. Code-review process improvement: when reviewing a change against a vendor API, prefer "preserve exact doc-specified values" over readability suggestions. Logged as a cron-learning. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 17:21:53 -07:00
Hongming Wang	855d423f6c	review: split push steps, runbook for secret rotation, username clarity Addresses PR #82 code review: 🟡×3 + 🔵×5. - Fly registry login username: 'x' → 'molecule-ai' + explanatory comment. - Build & push split into two steps (GHCR / Fly registry) so a single- registry outage can't fail the other. Second step uses 'if: always()' to ensure Fly mirror runs even if GHCR push flakes. - docs/runbooks/saas-secrets.md: full secret map + rotation procedures for every SaaS credential, with danger-case callouts. Documents the coupled FLY_API_TOKEN (lives in GHA secret AND fly secrets — must be rotated in both). - CLAUDE.md: new 'SaaS ops' section linking to the runbook.	2026-04-14 17:09:11 -07:00
Hongming Wang	b811b47334	feat(ci): mirror platform image to registry.fly.io/molecule-tenant Keeps ghcr.io/molecule-ai/platform private (per CEO direction — open- source when full SaaS ships) while still letting the private control plane's Fly provisioner boot tenant machines: Fly auto-authenticates same-org machines against registry.fly.io, no per-tenant pull credentials to wire. Workflow now logs into both GHCR (using built-in GITHUB_TOKEN) and Fly registry (using FLY_API_TOKEN secret) and pushes the same image to four tags total: - ghcr.io/molecule-ai/platform:latest - ghcr.io/molecule-ai/platform:sha-<short> - registry.fly.io/molecule-tenant:latest - registry.fly.io/molecule-tenant:sha-<short> Secret added via `gh secret set FLY_API_TOKEN` on the public repo. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 17:05:36 -07:00

1 2

57 Commits