The publish workflow was pushing platform/Dockerfile (Go-only) to the
Fly registry, but tenant machines run the combined image (Go + Canvas
reverse proxy). This caused "canvas unavailable" after machine update.
Changes:
- Fly registry build: platform/Dockerfile → platform/Dockerfile.tenant
- GHCR: keeps Go-only image (for self-hosted/dev use)
- Path triggers: add canvas/** and manifest.json (tenant image includes both)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Six prior PRs (#273, #319, #322, #341, #484, #486) all kept calling
`docker login` and tried to coerce credsStore via increasingly elaborate
config tricks. None worked. The latest publish-canvas-image and
publish-platform-image runs on main are still failing with:
error storing credentials - err: exit status 1,
out: `User interaction is not allowed. (-25308)`
Verified locally on the runner host (2026-04-16): `docker login` on
macOS unconditionally writes credentials to osxkeychain after a
successful login, regardless of the config presented to it.
# I wrote this:
{ "auths": {}, "credsStore": "", "credHelpers": {} }
# After `docker login --config <dir> ghcr.io ...` succeeded:
{
"auths": { "ghcr.io": {} }, # empty — auth is in Keychain
"credsStore": "osxkeychain" # Docker rewrote it back
}
So `--config` flag, DOCKER_CONFIG env var, credsStore="" etc. all share
the same fate: Docker re-enables osxkeychain after every successful
login. The Mac mini runner is a launchd user agent with a locked
Keychain, so storage fails with -25308.
This PR replaces the `docker login` invocation entirely. We write
`base64(user:pat)` directly into the disposable DOCKER_CONFIG's `auths`
map. `docker/build-push-action@v5` and the daemon honor the auths map
for push without ever calling `docker login`, so the Keychain is never
involved.
Same shape in both workflows:
- publish-canvas-image.yml — single registry (ghcr.io)
- publish-platform-image.yml — two registries (ghcr.io + registry.fly.io)
Fly username remains literal "x".
Security:
- Token env vars never echoed. Heredoc writes the auth blob via
`umask 077` (file mode 600). The temp config dir lives under
RUNNER_TEMP and is reaped at job end.
- Diagnostics preserved (docker version + binary ls + registry keys
only, no values) so future runner permission regressions remain
visible without leaking secrets.
Equivalent to closed PR #464 — re-opening because main is still
broken (verified by inspecting the most recent failure). The closing
comment on #464 stated the issue was already addressed by #341, but
it isn't.
docker/login-action@v3 ignores DOCKER_CONFIG and still tries the
macOS system keychain on the self-hosted runner, producing:
error storing credentials: User interaction is not allowed. (-25308)
Switch to `docker login ... --password-stdin` which respects
DOCKER_CONFIG and writes credentials to the per-run config.json
we created in the isolate step. Applied to both GHCR and Fly
registry logins in both publish workflows.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The heredoc block writing Docker config.json had unindented `{` at
column 1, which GitHub Actions' YAML parser interpreted as a flow
mapping start — causing every publish-platform-image and
publish-canvas-image run to fail with 0 jobs (startup_failure).
Replace `cat <<'JSON' ... JSON` with a single `printf` call that
produces identical config.json content without confusing the parser.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
After pushing the tenant image to registry.fly.io, the workflow now
lists all running/stopped molecule-tenant machines and updates each
to the newly pushed image tag. Gracefully skips if no machines exist
(control plane provisions on demand).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two fixes:
1. publish-canvas-image.yml + publish-platform-image.yml: the JSON
heredoc for config.json had leading whitespace from YAML indentation,
producing invalid JSON. Docker fell back to osxkeychain → -25308.
Fixed by removing indentation inside the heredoc body.
2. Added scripts/dev-start.sh — one-command local dev environment.
Starts infra (docker-compose), platform (Go), and canvas (Next.js)
with proper health checks and cleanup on Ctrl-C.
Remove plugins/, workspace-configs-templates/, org-templates/ dirs (now
in standalone repos). Add manifest.json listing all 33 repos and
scripts/clone-manifest.sh to clone them. Both Dockerfiles now use the
manifest script instead of 33 hardcoded git-clone lines.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tenant machines were booting with no templates because the Dockerfile
only shipped the Go binary + migrations. The canvas showed "0 templates"
with an empty picker.
Changes:
- platform/Dockerfile: build context changed from ./platform to repo
root so COPY can reach workspace-configs-templates/ alongside the
Go source. COPY paths updated for platform/{go.mod,go.sum,*.go} and
platform/migrations/.
- .github/workflows/publish-platform-image.yml: context: . (was
./platform), paths trigger now includes workspace-configs-templates/
so template changes rebuild the image.
Phase A of the template-registry plan. Phase B adds a DB registry +
on-demand fetch for community templates (user pastes GitHub URL at
workspace creation time). The baked defaults always ship in the image
for zero-config tenant boot.
Verified: `docker build -f platform/Dockerfile -t test .` succeeds,
`docker run --rm test ls /workspace-configs-templates/` shows all 8
templates (autogen, claude-code-default, crewai, deepagents, gemini-cli,
hermes, langgraph, openclaw).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
#273 tried to fix the macOS Keychain -25308 error by pointing
DOCKER_CONFIG at a per-run temp dir with `{"auths": {}}`. That was
necessary but not sufficient: Docker on macOS inherits `osxkeychain` as
the default credsStore even when config.json doesn't declare one
(comes from Docker Desktop's bundled binding), so the login-action
still tried to call /usr/local/bin/docker-credential-osxkeychain which
fails with -25308 from the non-interactive launchd session.
Evidence: after #273, publish-platform-image still failed on every
main merge with:
error saving credentials: error storing credentials - err: exit
status 1, out: `User interaction is not allowed. (-25308)`
Fix: write a config.json that explicitly sets `credsStore: ""` and
clears `credHelpers`, forcing Docker to store creds in the inline
`auths` map of this disposable config.json instead of reaching for
the keychain. Also print config.json at diagnostic time so a future
regression surfaces in the log instead of at login.
No runtime / test impact — this only changes what the runner writes
to the workflow's temp DOCKER_CONFIG directory.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Every publish-platform-image run since the aa41947 self-hosted runner
migration has been failing with two runner-level issues that the
workflow now works around (keychain) or surfaces clearly (path):
1. "error storing credentials - err: exit status 1, out:
'User interaction is not allowed. (-25308)'"
docker/login-action tries to persist the GHCR + Fly tokens in the
macOS Keychain, but the Mac mini runner runs as a non-interactive
launchd service without an unlocked desktop session — keychain
access raises -25308. Fix: set DOCKER_CONFIG to a per-run temp dir
containing a plain config.json before the login step so credentials
land in a file, not the keychain. This is the same trick the
GitHub-hosted macos runners use in docker action examples.
2. "Unexpected error attempting to determine if executable file
exists '/usr/local/bin/docker': Error: EACCES: permission denied,
stat '/usr/local/bin/docker'"
Not a workflow bug — the runner literally can't read the Docker
binary path. Adds a diagnostic step before QEMU/buildx setup that
prints: PATH, `command -v docker`, `docker --version`, and
`ls -la` on both /usr/local/bin/docker and /opt/homebrew/bin/docker.
Surfacing these in the log means the next failure (if any) shows
the actual problem instead of hiding behind a cryptic buildx error.
Does NOT fix the root cause of #2 — that needs the user to SSH into
the Mac mini runner and reinstall / re-permission Docker Desktop
(or switch to Colima/OrbStack). The diagnostic output will tell us
exactly which path is broken.
The 20+ queued CI runs from `ci.yml` are unrelated to this PR — they
are stuck because the self-hosted runner has severely degraded queue
throughput (runs wait 2+ hours before being picked up). That's a
separate runner-health issue tracked as a user action in the triage
report.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore(ci): migrate all jobs to self-hosted macOS arm64 runner
Switches every job in `ci.yml` and `publish-platform-image.yml` from
`ubuntu-latest` to `[self-hosted, macos, arm64]` to avoid GitHub-hosted
minute rate limits. All jobs run on a single Apple-silicon self-hosted
runner registered at the Molecule-AI org level.
Notable non-trivial adaptations (macOS runners can't use `services:` and
some GHA marketplace actions are Linux-only):
- e2e-api: `services: postgres/redis` replaced with inline `docker run`
steps. Ports remapped to 15432/16379 to avoid collision with anything
the host may already expose on the standard ports. Containers are named
(`molecule-ci-postgres` / `molecule-ci-redis`) and torn down in an
`if: always()` step. Postgres readiness is still gated on pg_isready
via `docker exec`.
- shellcheck: `ludeeus/action-shellcheck` is a Docker action, Linux-only.
Replaced with a direct `shellcheck` invocation (pre-installed on the
runner) that scans `tests/e2e/*.sh` with `--severity=warning`.
- publish-platform-image: added `docker/setup-qemu-action@v3` and an
explicit `platforms: linux/amd64` on both `docker/build-push-action`
invocations. The runner is arm64 but Fly tenant machines pull amd64,
so QEMU-emulated cross-arch builds are required. GHA cache-from/cache-to
behavior is unchanged.
Runner prereqs (one-time host setup):
- Docker Desktop installed and running (for e2e-api + image publish)
- `shellcheck` on PATH
- `docker` on PATH
- Go / Node / gh / Python are installed via setup-* actions per job
* fix(ci): set AGENT_TOOLSDIRECTORY for python-lint on self-hosted runner
setup-python@v5 defaults to /Users/runner/hostedtoolcache which doesn't
exist on the hongming-claw self-hosted runner. AGENT_TOOLSDIRECTORY tells
the action to use a writable path under the runner user's home directory.
Fixes the only failing job in CI run 24469156329 on PR #186.
---------
Co-authored-by: Hongming Wang <HongmingWang-Rabbit@users.noreply.github.com>
Post-mortem on the failed publish-platform-image run on main (PR #82):
Fly's Docker registry requires username EXACTLY equal to "x". My
code-review "readability fix" changing it to "molecule-ai" caused
every push to return 401 Unauthorized. Verified locally:
echo $FLY_API_TOKEN | docker login registry.fly.io -u x --password-stdin
→ Login Succeeded
echo $FLY_API_TOKEN | docker login registry.fly.io -u molecule-ai --password-stdin
→ 401 Unauthorized
Lesson: don't second-guess docs that specify a literal value. Comment
now says "MUST be literal 'x'" with a 2026-04-15 verification note to
prevent future regressions.
Code-review process improvement: when reviewing a change against a
vendor API, prefer "preserve exact doc-specified values" over readability
suggestions. Logged as a cron-learning.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Addresses PR #82 code review: 🟡×3 + 🔵×5.
- Fly registry login username: 'x' → 'molecule-ai' + explanatory comment.
- Build & push split into two steps (GHCR / Fly registry) so a single-
registry outage can't fail the other. Second step uses 'if: always()'
to ensure Fly mirror runs even if GHCR push flakes.
- docs/runbooks/saas-secrets.md: full secret map + rotation procedures
for every SaaS credential, with danger-case callouts. Documents the
coupled FLY_API_TOKEN (lives in GHA secret AND fly secrets — must be
rotated in both).
- CLAUDE.md: new 'SaaS ops' section linking to the runbook.
Keeps ghcr.io/molecule-ai/platform private (per CEO direction — open-
source when full SaaS ships) while still letting the private control
plane's Fly provisioner boot tenant machines: Fly auto-authenticates
same-org machines against registry.fly.io, no per-tenant pull
credentials to wire.
Workflow now logs into both GHCR (using built-in GITHUB_TOKEN) and
Fly registry (using FLY_API_TOKEN secret) and pushes the same image to
four tags total:
- ghcr.io/molecule-ai/platform:latest
- ghcr.io/molecule-ai/platform:sha-<short>
- registry.fly.io/molecule-tenant:latest
- registry.fly.io/molecule-tenant:sha-<short>
Secret added via `gh secret set FLY_API_TOKEN` on the public repo.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phase B.2 companion to the private molecule-controlplane provisioner PR.
On every push to main that touches platform/**, builds platform/Dockerfile
and pushes to GHCR with two tags:
- :latest (floating, always main's tip)
- :sha-<short-commit> (immutable, pin-friendly)
Cache via GitHub Actions cache (cache-from: type=gha). Workflow_dispatch
trigger so we can re-publish after a docs-only merge if needed.
The private molecule-controlplane sets TENANT_IMAGE=ghcr.io/molecule-ai/platform:<tag>
and the provisioner creates each tenant Fly Machine from this image. Staying
on the same base image across tenants keeps upgrades atomic.
CLAUDE.md updated to document the new workflow in the CI pipeline section.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>