molecule-core

Author	SHA1	Message	Date
Hongming Wang	c77a88c247	chore(security): pin Actions to SHAs + enable Dependabot auto-bumps Supply-chain hardening for the CI pipeline. 23 workflow files modified, 59 mutable-tag refs replaced with commit SHAs. The risk Every `uses:` reference in .github/workflows/*.yml was pinned to a mutable tag (e.g., `actions/checkout@v4`). A maintainer of an action — or a compromised maintainer account — can repoint that tag to malicious code, and our pipelines silently pull it on the next run. The tj-actions/changed-files compromise of March 2025 is the canonical example: maintainer credential leak, attacker repointed several `@v<N>` tags to a payload that exfiltrated repository secrets. Repos that pinned to SHAs were unaffected. The fix Replace each `@v<N>` with `@<commit-sha> # v<N>`. The trailing comment preserves human readability ("ah, this is v4"); the SHA makes the reference immutable. Actions covered (10 distinct): actions/{checkout,setup-go,setup-python,setup-node,upload-artifact,github-script} docker/{login-action,setup-buildx-action,build-push-action} github/codeql-action/{init,autobuild,analyze} dorny/paths-filter imjasonh/setup-crane pnpm/action-setup (already pinned in molecule-app, listed here for completeness) Excluded: Molecule-AI/molecule-ci/.github/workflows/disable-auto-merge-on-push.yml@main — internal org reusable workflow; we control its repo, threat model is different from third-party actions. Conventional to pin to @main rather than SHA for internal reusables. The maintenance cost SHA pinning means upstream fixes require manual SHA bumps. Without automation, pinned SHAs go stale. So this PR also enables Dependabot across four ecosystems: - github-actions (workflows) - gomod (workspace-server) - npm (canvas) - pip (workspace runtime requirements) Weekly cadence — the supply-chain attack window is "minutes between repoint and pull"; weekly auto-bumps don't help with zero-days regardless. 
The point is to pull in non-zero-day fixes without operator effort. Aligns with user-stated principle: "long-term, robust, fully-automated, eliminate human error." Companion PR: Molecule-AI/molecule-controlplane#308 (same pattern, smaller surface). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 15:37:06 -07:00
Hongming Wang	24bfced630	ci(publish-image): also tag :staging-latest so CP auto-picks up new builds Root cause of the 2026-04-24 all-day E2E failure chain: Railway staging CP had TENANT_IMAGE pinned to :staging-a14cf86 — a static SHA that had silently drifted 10+ days stale. Every new tenant (including every E2E run's fresh tenant) was spawned with that stale image, which predated applyRuntimeModelEnv. Without applyRuntimeModelEnv, HERMES_DEFAULT_MODEL never reached the workspace EC2 user-data, so install.sh fell back to nousresearch/hermes-4-70b → openrouter → 401 "Missing Authentication header" in every A2A reply. Four correct fixes shipped today all got shadowed by this single stale pin: • template-hermes#19 (provider priority for openai/) • template-hermes#20 (decouple prefix-strip from bridge guard) • molecule-controlplane#247 (force fresh /opt/adapter clone) • molecule-core#1987 (E2E pins HERMES_CUSTOM_ as workaround) Fix: publish each main build under both :staging-<sha> AND :staging-latest. Change Railway staging CP's TENANT_IMAGE env to :staging-latest (done via `railway variables --set` as part of this incident). Future main builds then auto-propagate to new tenant provisions without any human in the loop. Safety: :staging-latest is the "most recent main build" — NOT a canary-verified promotion. That distinction is preserved: • Prod tenants still pull :latest (canary-verified, retagged by canary-verify.yml only after the canary fleet green-lights a digest) • Staging tenants now pull :staging-latest (every main build, pre-canary) So staging becomes the canary: if a :staging-latest build regresses, the staging canary fleet catches it before it can be promoted to :latest for prod. This is what the canary design intended; the missing :staging-latest tag was the hole. Zero impact on image size / build time: Docker tags point at the same digest, no duplicate push. 
Follow-up: filed an issue tracking the need for CP's TENANT_IMAGE to NEVER be pinned to a SHA in any environment — it must always float on a named tag (:staging-latest for staging, :latest for prod). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 00:29:55 -07:00
Hongming Wang	e298393df5	perf(ci): move all public-repo workflows to ubuntu-latest molecule-core is a public repo — GHA-hosted minutes are free. The self-hosted Mac mini was only in play to dodge GHA rate limits (memory feedback_selfhosted_runner), but for these specific workflows it came with real costs: - Docker-push workflows emulated linux/amd64 from arm64 via QEMU — every canvas + platform image build ran ~2-3x slower than native. - Six PRs worth of keychain-avoidance hacks in publish-* because `docker login` on macOS writes to osxkeychain unconditionally, and the Mac mini's launchd user-agent keychain is locked. - Homebrew pin-down environment variables (HOMEBREW_NO_) sprinkled everywhere to work around the shared /opt/homebrew symlink mess on the runner. - Setup-python@v5 couldn't write to /Users/runner, so ci.yml python-lint resorted to a hand-rolled Homebrew python3.11 dance. - Single runner → fan-out contention; CodeQL's 45-min analysis fought the canvas publish for the one slot. Changes across the 7 workflows: - runs-on: [self-hosted, macos, arm64] → ubuntu-latest (every job) - publish-canvas-image + publish-workspace-server-image: drop the hand-rolled auths-map step + QEMU setup + buildx v4 → docker/login-action@v3 + setup-buildx@v3. Linux + amd64 target = native build. - canary-verify + promote-latest: replace `brew install crane` + HOMEBREW_NO_ incantations with imjasonh/setup-crane@v0.4. - codeql.yml: drop `brew install jq` — jq is preinstalled on ubuntu-latest. - ci.yml shellcheck: drop the self-hosted existence check — shellcheck is preinstalled via apt. - ci.yml python-lint: replace the Homebrew python3.11 path dance with actions/setup-python@v5 (which works fine on GHA-hosted), add requirements.txt caching while we're there. - Remove stale comments referencing "the self-hosted runner", "Mac mini", keychain, osxkeychain etc. The self-hosted Mac mini remains in service for private-repo workflows only. 
Memory feedback_selfhosted_runner updated to reflect the public-repo scope carve-out. Net -96 lines across the 7 files. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 12:56:49 -07:00
Hongming Wang	52235aeb27	feat(router): /cp/* reverse-proxy to CP + same-origin canvas fetches Canvas's browser bundle issues fetches to both CP endpoints (/cp/auth/me, /cp/orgs, ...) AND tenant-platform endpoints (/canvas/viewport, /approvals/pending, /org/templates). They share ONE build-time base URL. Baking api.moleculesai.app broke tenant calls with 404; baking the tenant subdomain broke auth. Tried both today and saw exactly one failure mode per attempt. Real fix: same-origin fetches + tenant-side split. Adds: internal/router/cp_proxy.go # /cp/* → CP_UPSTREAM_URL mounted before NoRoute(canvasProxy). Now a tenant serves: /cp/* → reverse-proxy to api.moleculesai.app; /canvas/viewport, /approvals/pending, /workspaces/:id/*, /ws, /registry, /metrics → tenant platform (existing handlers); everything else → canvas UI (existing reverse-proxy). Canvas middleware reverts to `connect-src 'self' wss:` for the same-origin path (keeping explicit PLATFORM_URL whitelist as a self-hosted escape hatch when the build-arg is non-empty). CI build-arg flips to NEXT_PUBLIC_PLATFORM_URL="" so the bundle issues relative fetches. Security of cp_proxy: - Cookie + Authorization PRESERVED across the hop (opposite of canvas proxy); they carry the WorkOS session, which is the whole point. - Host rewritten to upstream so CORS + cookie-domain on the CP side see their own hostname. - Upstream URL validated at construction: must parse, must be http(s), must have a host; misconfig fails closed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 13:01:40 -07:00
Hongming Wang	b9e1f1e88e	fix(ci): bake api.moleculesai.app into tenant canvas bundle Canvas's browser-side code (auth.ts, api.ts, billing.ts) all call fetch(PLATFORM_URL + /cp/). PLATFORM_URL comes from NEXT_PUBLIC_PLATFORM_URL at build time; with the build arg unset, it falls back to http://localhost:8080 in the compiled bundle. That means on a tenant like hongmingwang.moleculesai.app, the user's browser actually tried to fetch http://localhost:8080/cp/auth/me, which resolves to the USER'S OWN machine, not the tenant. Login redirect loops 404. Every tenant canvas has been unable to complete a fresh login on this path; existing sessions only worked because the cookie was already set domain-wide. Fix: pass NEXT_PUBLIC_PLATFORM_URL=https://api.moleculesai.app as a build arg in the tenant-image workflow. CP already allows CORS from .moleculesai.app + credentials, and the session cookie is scoped to .moleculesai.app so tenant subdomains inherit it. Verified in prod by rebuilding canvas locally with the flag and hot-patching the hongmingwang instance via SSM. Baked chunks now contain api.moleculesai.app; browser auth redirects resolve cleanly to the CP. Self-hosted users override by rebuilding with their own URL, the same pattern molecule-app uses with NEXT_PUBLIC_CP_ORIGIN. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 12:51:22 -07:00
Hongming Wang	ac85ee2a0d	fix(ci): clone sibling plugin repo so publish-workspace-server-image builds Publish has been failing since the 2026-04-18 open-source restructure (#964's merge) because workspace-server/Dockerfile still COPYs ./molecule-ai-plugin-github-app-auth/ but the restructure moved that code out to its own repo. Every main merge since has produced a "failed to compute cache key: /molecule-ai-plugin-github-app-auth: not found" error — prod images haven't moved. Fix: add an actions/checkout step that fetches the plugin repo into the build context before docker build runs. Private-repo safe: uses PLUGIN_REPO_PAT secret (fine-grained PAT with Contents:Read on Molecule-AI/molecule-ai-plugin-github-app-auth). Falls back to the default GITHUB_TOKEN if the plugin repo is public. Ops: set repo secret PLUGIN_REPO_PAT before the next main merge, or publish will fail with a 404 on the checkout step. Also gitignores the cloned dir so local dev builds don't accidentally commit it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 05:19:31 -07:00
Hongming Wang	7e3a6fd756	feat(canary): gate :latest tag promotion on canary verify green (Phase 3) Completes the canary release train. Before this, publish-workspace-server-image.yml pushed both :staging-<sha> and :latest on every main merge, meaning the prod tenant fleet auto-pulled every image immediately, before any post-deploy smoke test. A broken image (think: this morning's E2E current_task drift, but shipped at 3am instead of caught in CI) would have fanned out to every running tenant within 5 min. Now: - publish workflow pushes :staging-<sha> ONLY - canary tenants are configured to track :staging-<sha>; they pick up the new image on their next auto-update cycle - canary-verify.yml runs the smoke suite (Phase 2) after the sleep - on green: a new promote-to-latest job uses crane to remotely retag :staging-<sha> → :latest for both platform and tenant images - prod tenants auto-update to the newly-retagged :latest within their usual 5-min window - on red: :latest stays frozen on prior good digest; prod is untouched crane is pulled onto the runner (~4 MB, GitHub release) rather than docker-daemon retag so the workflow doesn't need a privileged runner. Rollback: if canary passed but something surfaces post-promotion, operator runs "crane tag ghcr.io/molecule-ai/platform:<prior-good-sha> latest" manually. A follow-up can wrap that in a Phase 4 admin endpoint / script. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 03:33:04 -07:00
Hongming Wang	64796838e0	ci: update GitHub Actions to current stable versions (closes #780) - golangci/golangci-lint-action@v4 → v9 - docker/setup-qemu-action@v3 → v4 - docker/setup-buildx-action@v3 → v4 - docker/build-push-action@v5 → v6 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-18 12:04:10 -07:00
Hongming Wang	a62ad0bd66	chore: update publish workflow name + document staging-first flow Default branch is now staging for both molecule-core and molecule-controlplane. PRs target staging, CEO merges staging → main to promote to production. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-18 07:02:02 -07:00
Hongming Wang	eafc413a43	chore: rename publish-platform-image → publish-workspace-server-image Aligns CI workflow filename with the platform/ → workspace-server/ rename. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-18 01:05:09 -07:00

10 Commits