Extracted from the now-closed PR #1664 (Molecule-AI/molecule-core).
- New scripts/molecule-gh-token-refresh.sh background daemon — every
45 min (TOKEN_REFRESH_INTERVAL_SEC) calls the credential helper's
_refresh_gh action to keep both gh CLI auth and the on-disk cache
fresh through the GitHub App installation token's ~60 min TTL.
- scripts/molecule-git-token-helper.sh rewritten with a ~50 min
on-disk cache (${CACHE_DIR}/gh_installation_token + _expiry
companion file), a cache > API > env-var fallback chain, a new
_refresh_gh action (invoked by the daemon above), a _invalidate_cache
action, and path references flipped from /workspace/scripts/... to
/app/scripts/... to match the runtime image layout.
- Dockerfile copies the new refresh daemon and extends mkdir to
create /home/agent/.molecule-token-cache at build time.
- entrypoint.sh configures the git credential helper for github.com
while still root (so the global gitconfig is written before the
gosu handoff), creates + chowns the token cache dir, then as agent
starts the refresh daemon in the background and does an initial
gh auth login from GITHUB_TOKEN/GH_TOKEN so gh works before the
first refresh fires.
Dropped from PR #1664: cosmetic em-dash -> ASCII hyphen rewrites
(charset-normalizer noise) that would conflict with the repo's
existing em-dash convention used elsewhere in workspace/.
Root cause of the hermes 401 "Invalid API key" on SaaS workspaces:
1. CreateWorkspaceDialog never sent `model` in the /workspaces POST
2. Tenant/CP plumbed through a valid (provider, API key) but empty MODEL
3. Workspace install.sh ran with HERMES_DEFAULT_MODEL unset
4. derive-provider.sh saw no slug → PROVIDER="auto"
5. Hermes fell back to its compiled-in default (Anthropic via
OpenAI-compat adapter)
6. User's MINIMAX_API_KEY was present but irrelevant — hermes tried
Anthropic with it → 401
Fix:
- Extend HERMES_PROVIDERS with `defaultModel` + `models` (suggestion
list). Each provider ships with a known-good default so the trap
is physically impossible to hit with the new form.
- Add a required Model input to the Hermes panel, auto-populated
from the provider's defaultModel when the provider changes (only
if the user hasn't typed their own slug yet).
- Datalist surfaces additional model suggestions per provider so
users can pick a different size (e.g. M2.7-highspeed) without
typing the whole slug.
- handleCreate validates hermesModel is non-empty, sends as `model`
in the POST body alongside the secrets block.
- useEffect guard avoids clobbering a user-typed custom slug when
they toggle providers back and forth.
Existing 19 a11y tests still pass (non-SaaS path unchanged, four-tier
picker still renders, arrow-key nav still wraps).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Symptom (prod, hongmingwang tenant, 2026-04-22):
PUT /workspaces/:id/files/config.yaml → 500
{"error":"failed to write file: docker not available"}
Root cause: WriteFile + ReplaceFiles always reached for the tenant's
Docker client, but SaaS workspaces run as EC2 VMs (no Docker on the
tenant to cp into). There was no SaaS code path, so Save/Save&Restart
in the Config tab silently 500'd for every SaaS user.
Fix: add writeFileViaEIC — same ephemeral-keypair + EIC-tunnel dance
that the Terminal tab already uses (terminal.go). Flow:
1. ssh-keygen ephemeral ed25519 pair
2. aws ec2-instance-connect send-ssh-public-key (60s validity)
3. aws ec2-instance-connect open-tunnel (TLS → :22)
4. ssh ... "install -D -m 0644 /dev/stdin <abs path>"
install -D creates missing parent dirs atomically
5. Kill tunnel + wipe keydir
Runtime → base-path map (new table workspaceFilePathPrefix):
hermes → /home/ubuntu/.hermes
langgraph → /opt/configs
external → /opt/configs
unknown → /opt/configs
Both WriteFile (single file) and ReplaceFiles (bulk) detect
`workspaces.instance_id != ''` and route to EIC instead of Docker.
Local/self-hosted Docker path is unchanged.
Security: the only variable piece in the remote ssh command is the
absolute path, which is built via map lookup + filepath.Clean so
traversal is blocked. shellQuote() wraps it as defence-in-depth.
validateRelPath rejects absolute paths and surviving `..` segments
up-front; tests assert traversal rejection.
Follow-ups tracked separately:
- Reload hook after save (hermes gateway restart via SSH)
- Per-tunnel batching for ReplaceFiles with many files
- Runtime-specific base paths should be declared in the runtime
manifest, not hardcoded in the handler
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Following feedback that T4 — not T3 — is the full-access tier:
- Non-SaaS picker now shows all four tiers: T1 Sandboxed, T2 Standard,
T3 Privileged, T4 Full Access. Four-column grid.
- SaaS picker stays single-option but now locks to T4 (was T3). Every
SaaS workspace gets a dedicated EC2 VM, which is unambiguously the
"full host" case — T3 (privileged container) was a category mismatch.
- Default tier on SaaS is 4 (was 3). CP provisioner already supports
tier 4 (t3.large / 80 GB). TIER_CONFIG already has T4's amber color.
Tests updated for the four-tier picker: wrap tests now go T4 ↔ T1, and
the selection/tabIndex tests cover the fourth button.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
These files have been in public monorepo docs/ since the open-source
restructure on 2026-04-18, but are operational (outreach targets,
analytics tracking IDs, staged unpublished social copy) or strategic
(launch plans, SEO briefs, keyword targets, competitive research).
Per the internal documentation policy (2026-04-22), they belong in
the private internal repo. Pair PR: internal#27 receives the files.
Removed:
- docs/marketing/campaigns/* — 6 campaign packs with outreach + analytics
- docs/marketing/plans/phase-30-launch-plan.md — draft launch plan
- docs/marketing/briefs/* — 2 SEO content briefs
- docs/marketing/seo/keywords.md — keyword strategy
- docs/research/cognee-*.md — 2 architecture + isolation evals
What stays public:
- docs/marketing/blog/ — published blog posts
- docs/marketing/devrel/demos/ — dev-facing demo scripts + video
- docs/marketing/discord-adapter-day2/ — already-posted community copy
No external references to update — cross-references among these files
are now intact inside the internal repo; no public CLAUDE.md / README /
PLAN / docs/README referenced the moved paths.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Symptom (prod tenant hongmingwang):
GET /org/tokens → 500
orgtoken list: orgtoken: list: pq: invalid input syntax for type uuid: ""
Postgres rejects COALESCE(uuid_col, '') because it can't cast the
empty string to UUID. Cast to ::text first so the COALESCE operates
on matching types. OrgID on the Go side is already string, so no
scan changes needed.
sqlmock doesn't exercise pq type coercion — it accepts any AddRow
value for any column — which is why the existing tests pass while
prod 500s. Real-Postgres integration coverage is the systemic fix
(tracked separately), but this PR unblocks the Settings → Org Tokens
page today.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Symptom (prod tenant hongmingwang, 2026-04-22):
cp provisioner: console: unexpected 401
GET /workspaces/:id/console → 502 (View Logs broken)
Root cause: the tenant's CPProvisioner.authHeaders sent the provision-
gate shared secret as the Authorization bearer for every outbound CP
call, including /cp/admin/workspaces/:id/console. But CP gates
/cp/admin/* with CP_ADMIN_API_TOKEN — a distinct secret so a
compromised tenant's provision credentials can't read other tenants'
serial console output. Bearer mismatch → 401.
Fix: split authHeaders into two methods —
- provisionAuthHeaders(): Authorization: Bearer <MOLECULE_CP_SHARED_SECRET>
for /cp/workspaces/* (Start, Stop, IsRunning)
- adminAuthHeaders(): Authorization: Bearer <CP_ADMIN_API_TOKEN>
for /cp/admin/* (GetConsoleOutput and future admin reads)
Both still send X-Molecule-Admin-Token for per-tenant identity. When
CP_ADMIN_API_TOKEN is unset (dev / self-hosted single-secret setups),
cpAdminAPIKey falls back to sharedSecret so nothing regresses.
Rollout requirement: the tenant EC2 needs CP_ADMIN_API_TOKEN in its
env — this PR wires up the code, but CP's tenant-provision path must
inject the value. Filed as follow-up; until then, operators can set
it manually on existing tenants.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
On SaaS every workspace gets its own EC2 VM — the Docker-sandbox
distinction between T1 (sandboxed), T2 (standard Docker), and T3
(full host access) doesn't apply. A SaaS workspace is always a
dedicated VM, which is "full access" by construction. Showing T1/T2
in that UI is a category error: users pick a sandbox level that has
no effect on the actual EC2 machine they get.
Changes:
- tenant.ts: export isSaaSTenant() — returns true when canvas is
served at <slug>.moleculesai.app (SSR-safe: false on server)
- CreateWorkspaceDialog: when isSaaSTenant(), render only the T3
option, default tier=3, grid collapses to a single column. Label
gets a " — dedicated VM" hint so the user knows what they're
getting. On self-hosted the full T1/T2/T3 picker is unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Workspaces on SaaS register with their VPC-private IP (172.31.x.x on AWS
default VPCs). The SSRF guard in ssrf.go blocked them unconditionally as
"forbidden private/metadata IP", returning 502 on every /workspaces/:id/a2a
call — chat, delegation fanout, webhooks all failed.
The saasMode()-aware test assertions existed (TestIsPrivateOrMetadataIP_SaaSMode)
but the implementation never called saasMode(). Wire it up. In SaaS:
- RFC-1918 (10/8, 172.16/12, 192.168/16) and IPv6 ULA fd00::/8 are allowed
- 169.254/16 metadata, TEST-NET, 100.64/10 CGNAT, loopback, link-local
stay blocked in every mode
Also hardens IPv6: link-local multicast and interface-local multicast
are now rejected; DNS-resolved v6 addrs are checked too.
Symptom log (prod tenant hongmingwang):
ProxyA2A: unsafe URL for workspace a8af9d79-...: forbidden private/metadata
IP: 172.31.47.119
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Add-Key form used to open with a required Service dropdown
(GitHub / Anthropic / OpenRouter / Other) that gated everything
else. The dropdown did no persistent work — the secret store only
cares about (key_name, value); the Service label was never saved
anywhere. It also suffered registry drift: today we support ~22
hermes-dispatched providers (MiniMax, Gemini, DeepSeek, Kimi, Qwen,
NVIDIA, etc.); only 3 had entries. Everyone else landed in "Other"
with no downside beyond the mandatory click.
Replaces it with:
1. Key-name <datalist> autocomplete sourced from new
KEY_NAME_SUGGESTIONS in lib/services.ts — 26 entries covering
common infra keys + every hermes-supported provider.
2. inferGroup(keyName) derives classification at render time,
matching what the store already does in getGrouped(). No
behaviour change for list grouping.
3. Provider docs link renders inline only when inferGroup
recognises the name. For 'custom' keys we stay quiet — no
false-structure prompt.
4. Test-connection button still available when the inferred group
supports it AND the value is format-valid. Same providers as
before.
SERVICES registry preserved for LIST rendering + test routing.
Result: two fields instead of three. One fewer decision. Provider-
agnostic by design — new providers work the moment someone types
their canonical env var name; no UI code change per provider.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
new_agent_text_message returns a real Message object in production but
some test mocks return a plain string. Guard with hasattr + try/except
so the tool_trace assignment doesn't crash test_non_stream_events_ignored.
BLOCKERS fixed:
- instructions.go: Drop team-scope queries (teams/team_members tables don't
exist in any migration). Schema column kept for future. Restored Resolve
to /workspaces/:id/instructions/resolve under wsAuth — closes auth gap
that allowed cross-workspace enumeration of operator policy.
- migration 040: Add CHECK constraints on title (<=200) and content (<=8192)
to prevent token-budget DoS via oversized instructions.
- a2a_executor.py: Pair on_tool_start/on_tool_end via run_id instead of
list-position so parallel tool calls don't drop or clobber outputs. Cap
tool_trace at 200 entries to prevent runaway loops bloating JSONB.
HIGH fixes:
- instructions.go: Add length validation in Create + Update handlers.
Removed dead rows_ shadow variable. Replaced string concatenation in
Resolve with strings.Builder.
- prompt.py: Drop httpx timeout 10s -> 3s (boot hot path). Switch print
to logger.warning. Add Authorization bearer header from
MOLECULE_WORKSPACE_TOKEN env var.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tenant's workspace provisioner now forwards payload.Model (set by
canvas Config tab when a user picks a model) through to the
workspace's runtime env as HERMES_DEFAULT_MODEL, so install.sh /
start.sh in the template can seed the right ~/.hermes/config.yaml
without any post-provision manual step.
Helper applyRuntimeModelEnv() is runtime-switched so each template
owns its own env contract — hermes uses HERMES_DEFAULT_MODEL, future
runtimes with different config schemas register their own cases.
Runtimes that read model from /configs/config.yaml instead (langgraph,
claude-code, deepagents) are unaffected: the switch has no case for
them, so this is a no-op in those paths.
Applied in both the Docker provisioner path (provisionWorkspaceOpts)
and the SaaS/CP path (provisionWorkspaceCP) so local dev and
production behave identically.
Combined with:
- molecule-controlplane#231 (/opt/adapter/install.sh hook)
- molecule-ai-workspace-template-hermes#8 (install.sh for bare-host)
- molecule-ai-workspace-template-hermes#9 (derive-provider.sh)
this completes the MVP flow: customer creates a hermes workspace
in canvas with model = minimax/MiniMax-M2.7-highspeed + secret
MINIMAX_API_KEY = sk-cp-…, clicks Save, workspace provisions with
the MiniMax Token Plan hermes-agent gateway up and ready for the
first chat — no ops touch.
Foundation this builds on:
- env injection works for every runtime
- secret passthrough is generic (already via workspace_secrets)
- per-runtime env-var contract encoded once (applyRuntimeModelEnv)
- canvas Save button for later-edit remains a Files-API-over-EIC
concern (tracked separately)
See internal/product/designs/workspace-backends.md for the broader
architectural direction this fits into.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rm received /configs and filePath as two separate arguments, deleting
the entire /configs dir on every call. Concatenate to target only the
intended file. validateRelPath already prevents traversal, so this is
a logic bug not a security vulnerability.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds a configurable instruction injection system that prepends rules to
every agent's system prompt. Instructions are stored in the DB and fetched
at workspace startup, supporting three scopes:
- Global: applies to all agents (e.g., "verify with tools before reporting")
- Team: applies to agents in a specific team
- Workspace: applies to a single agent (role-specific rules)
Components:
- Migration 040: platform_instructions table with scope hierarchy
- Go API: CRUD endpoints + resolve endpoint that merges scopes
- Python runtime: fetches instructions at startup via /instructions/resolve
and prepends them to the system prompt as highest-priority context
Initial global instructions seeded:
1. Verify Before Acting (check issues/PRs/docs first)
2. Verify Output Before Reporting (second signal before reporting done)
3. Tool Usage Requirements (claims must include tool output)
4. No Hallucinated Emergencies (CRITICAL needs proof)
5. Staging-First Workflow (never push to main directly)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Every A2A response now includes a tool_trace — the list of tools/commands
the agent actually invoked during execution. This enables verifying agent
claims against what they actually did, catches hallucinated "I checked X"
responses, and provides an audit trail for the CEO to control hundreds of
agents by checking the top-level PM's trace.
Changes:
- Python runtime: collect tool name/input/output_preview on every
on_tool_start/on_tool_end event, embed in Message.metadata.tool_trace
- Go platform: extract tool_trace from A2A response metadata, store in
new activity_logs.tool_trace JSONB column with GIN index
- Activity API: expose tool_trace in List and broadcast endpoints
- Migration 039: adds tool_trace column + GIN index
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two related workflow hygiene changes:
## (1) canary-verify: graceful-skip when canary secrets absent
Before: canary-verify hit `scripts/canary-smoke.sh` which exited
non-zero when CANARY_TENANT_URLS was empty. Every main publish
ran → canary-verify failed → red check on main CI signal (7/7 in
past 24h). Noise, no value.
After: smoke step detects the missing-secrets case, writes a
warning to the step summary, sets an output `smoke_ran=false`,
and exits 0. The workflow completes green without pretending to
have tested anything.
Gated downstream: `promote-to-latest` now requires BOTH
`needs.canary-smoke.result == success` AND
`needs.canary-smoke.outputs.smoke_ran == true`. A skip does NOT
auto-promote — manual `promote-latest.yml` remains the release
gate while Phase 2 canary is absent (see
molecule-controlplane/docs/canary-tenants.md for the fleet
stand-up plan + decision framework).
When the canary fleet is stood up and secrets populated: delete
the early-exit branch + the smoke_ran gate. The workflow goes back
to its original "smoke gates promotion" semantics.
## (2) auto-promote-staging.yml — draft
New workflow that fires after CI / E2E Staging Canvas / E2E API /
CodeQL complete on the staging branch, checks that ALL four are
green on the same SHA, and fast-forwards `main` to that SHA.
Shipped disabled: the promote step is gated behind repo variable
`AUTO_PROMOTE_ENABLED=true`. Until that's set, the workflow
dry-runs and logs what it would have done. Toggle via Settings →
Variables when staging CI has been reliably green for a few days.
Safety:
- workflow_run events only fire on push to staging (PRs into
staging don't promote).
- Every required gate must be `completed/success` on the same
head_sha. Pending / failed / skipped / cancelled → abort.
- `--ff-only` push. Refuses to advance main if it has diverged
from staging history (someone landed a direct-to-main commit
that's not on staging). Human resolves the fork.
- `workflow_dispatch` with `force=true` lets us test the flow
end-to-end before flipping the variable on.
Motivation: molecule-core#1496 has been open with 1172 commits
divergence between staging and main. Today that trapped PR #1526
(dynamic canvas runtime dropdown) on staging while prod users
hit the hardcoded-dropdown bug. Auto-promote retires the bulk
staging→main PR pattern once the staging CI it depends on is
reliable.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #1526 shipped the /templates registry + canvas dynamic Runtime /
Model / Required-Env fields on 2026-04-22 — but merged into the
staging branch, not main. The staging→main promotion PR #1496 has
been open unmerged for a while with 1172 commits divergence, so
prod (which builds from main) still carries the old hardcoded
dropdown.
Symptom seen on hongmingwang.moleculesai.app today:
- New Hermes Agent workspace (template declares runtime: hermes) loads
Config tab → Runtime dropdown shows "LangGraph (default)" because
there's no <option value="hermes"> in the hardcoded list; it falls
back to empty-value silently.
- Model field is a plain TextInput with static placeholder
"e.g. anthropic:claude-sonnet-4-6" — should be a combobox populated
from the selected runtime's models[].
- Required Env Vars is a TagList with static placeholder
"e.g. CLAUDE_CODE_OAUTH_TOKEN" — should auto-populate from the
selected model's required_env.
- Net effect: "Save & Deploy" sends empty model + empty env to the
provisioner → workspace instant-fails.
This PR cherry-picks the exact three files from PR #1526 (#359dc61
on staging) forward to main, without pulling the other 1171
commits:
- canvas/src/components/tabs/ConfigTab.tsx
- RuntimeOption interface + FALLBACK_RUNTIME_OPTIONS (hermes,
gemini-cli included)
- useEffect fetches /templates and populates runtimeOptions
dynamically
- dropdown renders from runtimeOptions (no hardcoded list)
- Model becomes a combobox with datalist of available models
per selected runtime
- Required Env Vars auto-populates from the selected model's
required_env on model change
- workspace-server/internal/handlers/templates.go
- /templates endpoint returns [{id, name, runtime, models}] with
per-template models registry (id, name, required_env)
- workspace-server/internal/handlers/templates_test.go
- Tests for runtime+models parsing and legacy top-level model
fallback
The canvas Runtime dropdown now resolves "hermes" correctly;
Model dropdown shows the models[] from the hermes template; Env
auto-populates with HERMES_API_KEY (or whichever model selected).
Verified locally:
- workspace-server builds clean
- Template handler tests pass: TestTemplatesList_RuntimeAndModelsRegistry,
TestTemplatesList_LegacyTopLevelModel, TestTemplatesList_NonexistentDir
Follow-up: the staging→main promotion gap (#1496) is the
underlying process issue. Either merge that PR or adopt a policy
of landing fixes directly on main (as several PRs have today).
Files here were chosen minimally to avoid pulling unrelated staging
changes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three changes to stop ferrying sensitive content through our public
monorepo. All content already imported to Molecule-AI/internal (private)
— see linked PRs below.
## docs/incidents/INCIDENT_LOG.md — replaced with stub
Contained full security audit cycle records with CWE references,
file:line pointers to historical vulnerabilities, and severity
ratings. None of that belongs in a public repo.
→ Moved to Molecule-AI/internal/security/incident-log.md (PR #20).
Monorepo file becomes a 17-line stub pointing at the internal
location. Future incidents land in the internal file only.
## docs/architecture/canary-release.md — redacted identifiers
Had AWS account ID `004947743811` and IAM role name
`MoleculeStagingProvisioner` embedded. Even though the fleet
described isn't actually running (see state note), these
identifiers are account-specific and don't belong in public git.
→ Removed both values, replaced with generic references + a pointer
to Molecule-AI/internal/runbooks/canary-fleet.md (PR #21) where
the actual identifiers live. Any future rotation touches the
internal file, no public-git-history rewrite needed.
## docs/infra/workspace-terminal.md — reduced to public summary
Contained the full ops runbook: bootstrap script output, per-tenant
SG backfill loop with live SG IDs, customer slug names
(hongmingwang). Useful content but too specific for a public repo.
→ Moved to Molecule-AI/internal/runbooks/workspace-terminal.md
(PR #22). Monorepo file becomes a 30-line public summary of what
the feature does + pointers to code, so external readers /
self-hosters still get the design story.
## What's NOT in this PR (follow-up)
Marketing briefs, SEO plans, campaign copy, research dossiers, and
internal product designs (hermes-adapter-plan, medo-integration,
cognee-*) are the next batches. See docs policy doc coming next to
set team expectations.
Net removal: ~820 lines from public git going forward.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The canary-release.md doc describes the pipeline as if the fleet is
running — referring to AWS account 004947743811 and a configured
MoleculeStagingProvisioner role. Reality as of 2026-04-22: no canary
tenants are provisioned, the 3 GH Actions secrets are empty, and
canary-verify.yml has failed 7/7 times in a row.
Added a top-of-doc ⚠️ state note that:
1. Clarifies this is intended design, not deployed reality.
2. Notes the AWS account ID is historical / unverified.
3. Explains that merges currently rely on manual promote-latest.
4. Cross-links to molecule-controlplane/docs/canary-tenants.md for
the Phase 1 work that's shipped, the Phase 2 stand-up plan, and
the "should we even do this now?" decision framework.
5. Asks whoever lands Phase 2 to reconcile the two docs.
No behaviour change — doc-only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two latent bugs the self-hosted Mac mini had been hiding. Both caught
by the newer toolchain on ubuntu-latest runners after PR #1626.
1. workspace-server/internal/handlers/terminal.go:442
`fmt.Sprintf("%s:%d", host, port)` flagged by go vet as unsafe
for IPv6 (it omits the required [::] brackets). Replaced with
`net.JoinHostPort(host, strconv.Itoa(port))` which handles both
IPv4 and IPv6 correctly. No runtime behaviour change — the only
call site passes "127.0.0.1", so the bug would never trigger in
practice, but vet is right to flag it as a latent correctness
issue.
2. workspace/tests/test_a2a_executor.py::test_set_current_task_updates_heartbeat
`MagicMock()` auto-creates attributes on first access, so
`getattr(heartbeat, "active_tasks", 0)` in shared_runtime.py
returned a MagicMock rather than the default 0. Adding 1 to a
MagicMock returns another MagicMock, so the assertion
`heartbeat.active_tasks == 1` never held. Seeding
`heartbeat.active_tasks = 0` before the first call makes
getattr() return a real int, matching how the real HeartbeatLoop
class initialises itself.
Both pre-existed on main and were hidden by the older Python / Go
toolchains on the Mac mini runner. Verified locally (venv pytest
pass, `go vet ./...` + `go build ./...` clean on workspace-server).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
molecule-core is a public repo — GHA-hosted minutes are free. The
self-hosted Mac mini was only in play to dodge GHA rate limits
(memory feedback_selfhosted_runner), but for these specific
workflows it came with real costs:
- Docker-push workflows emulated linux/amd64 from arm64 via QEMU —
every canvas + platform image build ran ~2-3x slower than native.
- Six PRs worth of keychain-avoidance hacks in publish-* because
`docker login` on macOS writes to osxkeychain unconditionally,
and the Mac mini's launchd user-agent keychain is locked.
- Homebrew pin-down environment variables (HOMEBREW_NO_*) sprinkled
everywhere to work around the shared /opt/homebrew symlink mess
on the runner.
- Setup-python@v5 couldn't write to /Users/runner, so ci.yml
python-lint resorted to a hand-rolled Homebrew python3.11 dance.
- Single runner → fan-out contention; CodeQL's 45-min analysis
fought the canvas publish for the one slot.
Changes across the 7 workflows:
- runs-on: [self-hosted, macos, arm64] → ubuntu-latest (every job)
- publish-canvas-image + publish-workspace-server-image:
drop the hand-rolled auths-map step + QEMU setup + buildx v4
→ docker/login-action@v3 + setup-buildx@v3. Linux + amd64
target = native build.
- canary-verify + promote-latest: replace `brew install crane` +
HOMEBREW_NO_* incantations with imjasonh/setup-crane@v0.4.
- codeql.yml: drop `brew install jq` — jq is preinstalled on
ubuntu-latest.
- ci.yml shellcheck: drop the self-hosted existence check —
shellcheck is preinstalled via apt.
- ci.yml python-lint: replace the Homebrew python3.11 path dance
with actions/setup-python@v5 (which works fine on GHA-hosted),
add requirements.txt caching while we're there.
- Remove stale comments referencing "the self-hosted runner",
"Mac mini", keychain, osxkeychain etc.
The self-hosted Mac mini remains in service for private-repo
workflows only. Memory feedback_selfhosted_runner updated to
reflect the public-repo scope carve-out.
Net -96 lines across the 7 files.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Every standalone workspace-template repo now publishes to
ghcr.io/molecule-ai/workspace-template-<runtime>:latest via the
reusable publish-template-image workflow in molecule-ci (landed
today — one caller per template repo). This PR makes the
provisioner actually use those images:
- RuntimeImages map + DefaultImage switched from bare local tags
(workspace-template:<runtime>) to their GHCR equivalents.
- New ensureImageLocal step before ContainerCreate: if the image
isn't present locally, attempt `docker pull` and drain the
progress stream to completion. Best-effort — if the pull fails
(network, auth, rate limit) the subsequent ContainerCreate still
surfaces the actionable "No such image" error, now with a
GHCR-appropriate hint instead of the defunct
`bash workspace/build-all.sh <runtime>` advice.
- runtimeTagFromImage now handles both forms: legacy
`workspace-template:<runtime>` (local dev via build-all.sh /
rebuild-runtime-images.sh) and the current GHCR shape. Keeps
error hints sensible in both worlds.
- Tests cover the GHCR path for tag extraction and the new error
message shape. Legacy local tags still recognised.
Local dev path unchanged — scripts/build-images.sh and
workspace/rebuild-runtime-images.sh still produce locally-tagged
`workspace-template:<runtime>` images, and Docker's image
resolver matches them before any pull is attempted. So
contributors can keep iterating on a template repo without
round-tripping through GHCR.
Follow-on impact:
- hongmingwang.moleculesai.app (and any other tenant EC2) will
auto-pull `ghcr.io/molecule-ai/workspace-template-hermes:latest`
on the next hermes workspace provision — picking up the real
Nous hermes-agent behind the A2A bridge (template-hermes v2.1.0)
without any tenant-side rebuild step.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
publish-canvas-image has been failing on every main push since 2026-04-21
at `addgroup -g 1000 canvas` because node:20-alpine already ships a `node`
user/group at uid/gid 1000. Same collision workspace-server/Dockerfile.tenant
already fixes with `deluser --remove-home node` before `addgroup`.
Copying that pattern here so the workflow goes green again and canvas images
publish to ghcr. No runtime behaviour change — canvas still runs as non-root
uid 1000.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>