CI run #946 (post-#43) confirmed `driver: docker` doesn't fix the ECR
push 401 either: buildx CLI inside the runner container talks to the
operator-host docker daemon (mounted socket), but the daemon doesn't
see the runner's ECR auth state, and the runner's buildx CLI doesn't
attach the auth header in a way the daemon accepts.
Drop buildx + build-push-action entirely. Plain `docker build` +
`docker push` from the runner container works because both use the
SAME docker socket + the SAME runner-container config.json (populated
by `aws ecr get-login-password | docker login` from amazon-ecr-login).
Trade-off: lose multi-arch support. We only ship linux/amd64 tenant
images today, so this is fine. If multi-arch becomes a requirement
later, we can revisit (likely with `docker buildx create
--driver=remote` pointing at an external buildkit, but that's
substantial infra work; not worth it for a single-arch shop).
Closes #173 (fourth piece, and hopefully the last; this matches the
operator-host manual approach exactly).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #38 + #41 fixed the Dockerfile-side clone issue. CI run #893 then
revealed two Gitea-Actions-specific issues with the unchanged buildx
config:
1. `failed to push: 401 Unauthorized` to ECR. Root cause: default
buildx driver `docker-container` spawns a buildkit container that
doesn't share the host's `~/.docker/config.json`, so the ECR auth
set up by amazon-ecr-login doesn't reach the push. Fix: pin
`driver: docker` so buildx delegates to the host daemon, which
already has the ECR creds.
2. `dial tcp ...:41939: i/o timeout` on `_apis/artifactcache/cache`.
Root cause: `cache-from/cache-to: type=gha` is GitHub-specific;
Gitea Actions has no compatible artifact-cache backend, so every
cache lookup fails after a 30s timeout. Fix: remove the cache-*
options. Cold-build cost is <10min for 37-repo clone + Go/Node
compile, acceptable. Could revisit with type=registry inline cache
later if rebuilds get painful.
With this + #38/#41, the workflow should run end-to-end on Gitea
Actions: pre-clone -> docker build (host daemon) -> ECR push.
Closes #173 (third and final piece).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The first PR (#38) only patched Dockerfile.tenant — but the workflow
also builds the platform image from workspace-server/Dockerfile, which
had the SAME in-image `git clone` stage. Build run #794 caught this:
"process clone-manifest.sh ... exit code 128" on the platform image.
Apply the same pre-clone shape to the platform Dockerfile: drop the
`templates` stage, COPY from .tenant-bundle-deps/ instead. The
workflow's existing "Pre-clone manifest deps" step (added in #38)
already populates .tenant-bundle-deps/ before either build runs, so no
workflow change needed.
Self-review note: missing the platform Dockerfile was a Phase 1
quality miss: I read both files but only registered the tenant one as
in-scope. Saved memory `feedback_orchestrator_must_verify_before_declaring_fixed`
applies: should have grepped the whole workspace-server/ for "templates"
stages before claiming Task #173 done. CI run #794 caught it within
~6 minutes; net cost: one followup commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TestPooledWithEICTunnel_PreservesFnErr (and any sqlmock-using neighbour
test) was at risk of inheriting stale INSERT calls from a previous
test's coalesceRestart goroutine that survived its t.Cleanup boundary.
The production callsite shape is `go h.RestartByID(...)` from
a2a_proxy.go, a2a_proxy_helpers.go and main.go. When that goroutine's
runRestartCycle panics, coalesceRestart's deferred recover swallows it
to keep the platform process alive — but in tests, nothing waits for
the goroutine to fully exit. If it's still draining LogActivity-shaped
work after the test returns, those INSERTs land in the next test's
sqlmock connection as kind=DELEGATION_FAILED /
kind=WORKSPACE_PROVISION_FAILED, surfacing as "INSERT-not-expected".
Fix: introduce drainCoalesceGoroutine(t, wsID, cycle) test helper that
spawns coalesceRestart on a goroutine (matching production) and
registers a t.Cleanup with sync.WaitGroup.Wait so the test can't
declare itself done while a goroutine is still alive.
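A minimal sketch of that helper's shape (it would live in the handlers
test package; coalesceRestart's real signature may differ in detail):

```go
import (
	"sync"
	"testing"
)

// drainCoalesceGoroutine: spawn coalesceRestart exactly like the
// production `go ...` callsites do, and pin test teardown to the
// goroutine's exit so stray INSERTs can't leak into the next test.
func drainCoalesceGoroutine(t *testing.T, wsID string, cycle func()) {
	t.Helper()
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		coalesceRestart(wsID, cycle) // its deferred recover swallows panics
	}()
	t.Cleanup(wg.Wait) // the test cannot return while the goroutine lives
}
```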
Convert TestCoalesceRestart_PanicInCycleClearsState to use the helper
(previously it called coalesceRestart synchronously, which never
exercised the production goroutine-survival contract).
Add TestCoalesceRestart_DrainHelperWaitsForGoroutineExit as the
regression guard: cycle blocks 150ms then panics; the test asserts
t.Run elapsed >= 150ms (proving the Wait barrier engaged) AND the
deferred close ran (proving the panic-recovery defer chain executed)
AND state.running was cleared. Verified the assertion is real by
mutation-testing: removing t.Cleanup(wg.Wait) makes this test FAIL
deterministically with elapsed <300µs.
Per saved memory feedback_assert_exact_not_substring: the regression
test asserts an exact-shape contract (elapsed >= blockFor) rather than
a substring-in-output, so it discriminates between "drain works" and
"drain skipped".
Per Phase 3: 10/10 race-detector runs pass for all TestCoalesceRestart_*
tests. Full ./internal/handlers/... suite green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
publish-workspace-server-image.yml could not run on Gitea Actions because
Dockerfile.tenant's stage 3 ran `git clone` against private Gitea repos
from inside the Docker build context, where no auth path exists. Every
workspace-server rebuild required a manual operator-host push.
Move cloning to the trusted CI context (where AUTO_SYNC_TOKEN — the
devops-engineer persona PAT — is naturally available). Dockerfile.tenant
now COPYs from .tenant-bundle-deps/, populated by the workflow's new
"Pre-clone manifest deps" step. The Gitea token never enters the image.
- scripts/clone-manifest.sh: optional MOLECULE_GITEA_TOKEN env embeds
basic-auth in the clone URL; redacted in log output. Anonymous fallback
preserved for future public-repo path.
- .github/workflows/publish-workspace-server-image.yml: new pre-clone
step before docker build; injects AUTO_SYNC_TOKEN. Fail-fast if the
secret is empty.
- workspace-server/Dockerfile.tenant: drop stage 3 (templates), COPY
from .tenant-bundle-deps/ instead. Header documents the prereq.
- .gitignore: ignore /.tenant-bundle-deps/ so a local build can't
accidentally commit cloned repos.
Verified locally: clone-manifest.sh with the devops-engineer persona
token cloned all 37 repos (9 ws + 7 org + 21 plugins, 4.9MB after
.git strip).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same SSOT-divergence shape as #10 (fixed in #12), but on the a2a-proxy
code path. The plugin handler was routed through `provisioner.RunningContainerName`;
a2a-proxy was forwarding optimistically and only catching missing containers
REACTIVELY via `maybeMarkContainerDead` after the network call timed out.
Result on tenants whose agent containers had been recycled (e.g. post-EC2
replace from molecule-controlplane#20): canvas waits 2-30s for the network
forward to fail before getting a 503, and the workspace-server logs only
"ProxyA2A forward error" without the "container is dead" signal.
This PR adds a proactive `Provisioner.IsRunning` check in `proxyA2ARequest`
between `resolveAgentURL` and `dispatchA2A`, gated on the conditions where
we know we're talking to a sibling Docker container we own (`h.provisioner
!= nil` AND `platformInDocker` AND the URL was rewritten to Docker-DNS form).
Three outcomes via the SSOT helper:
(true, nil) → forward as today
(false, nil) → fast-503 with `error="workspace container not running —
restart triggered"`, `restarting=true`, `preflight=true`,
plus the same offline-flip + WORKSPACE_OFFLINE broadcast +
async restart that `maybeMarkContainerDead` produces
(true, err) → fall through to optimistic forward (matches IsRunning's
"fail-soft as alive" contract — flaky daemon must not
trigger a restart cascade)
The `preflight=true` flag in the response distinguishes the proactive
short-circuit from the reactive `maybeMarkContainerDead` path so canvas
or downstream callers can render distinct messages later.
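A minimal sketch of that decision table in code shape; IsRunning's
exact signature is an assumption, and markContainerOffline stands in
for the maybeMarkContainerDead side-effects:

```go
// preflightContainerHealth: the three outcomes above. Returns whether
// the caller should proceed with the optimistic forward.
func (h *Handler) preflightContainerHealth(ctx context.Context, wsID string) bool {
	running, err := h.provisioner.IsRunning(ctx, wsID)
	switch {
	case err != nil:
		return true // (true, err): fail-soft as alive, forward optimistically
	case running:
		return true // (true, nil): forward as today
	default:
		h.markContainerOffline(ctx, wsID) // offline flip + broadcast + async restart
		return false // caller emits the fast 503 with preflight=true
	}
}
```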
* `internal/handlers/a2a_proxy.go` — preflight call site between
resolveAgentURL and dispatchA2A; gated on `h.provisioner != nil &&
platformInDocker && url == http://<ContainerName(id)>:port`.
* `internal/handlers/a2a_proxy_helpers.go` — `preflightContainerHealth`
helper. Routes through `h.provisioner.IsRunning` (which itself wraps
`RunningContainerName`). Identical offline-flip side-effects as
`maybeMarkContainerDead` for the dead-container case.
* `internal/handlers/a2a_proxy_preflight_test.go` — 4 tests: running →
nil; not-running → structured 503 + sqlmock expectations on the
offline-flip + structure_events insert; transient error → nil
(fail-soft); AST gate pinning the SSOT routing (mirror of #12's gate).
Mutation-tested: removing the `if running { return nil }` guard makes
the production code fail to compile (unused var). A subtler mutation
(replacing the !running branch with `return nil`) would make
TestPreflight_ContainerNotRunning_StructuredFastFail fail at runtime
with sqlmock's "expected DB call did not occur."
Refs: molecule-core#36. Companion to #12 (issue #10).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a 'mock' runtime: virtual workspaces with no container, no EC2,
no LLM. Every A2A reply is synthesised from a small canned-variant
pool ('On it!', 'Got it, on it now.', etc.) deterministically seeded
by (workspace_id, request_id).
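A minimal sketch of the deterministic pick; the FNV hash and separator
byte are illustrative choices, not necessarily what mock_runtime.go
ships:

```go
import "hash/fnv"

// pickVariant: deterministic canned-reply choice. The same
// (workspaceID, requestID) pair always yields the same pool entry.
func pickVariant(pool []string, workspaceID, requestID string) string {
	h := fnv.New32a()
	h.Write([]byte(workspaceID))
	h.Write([]byte{0}) // separator: ("ab","c") must differ from ("a","bc")
	h.Write([]byte(requestID))
	return pool[int(h.Sum32())%len(pool)]
}
```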
Built for funding-demo "200-workspace mock org" — renders an
enterprise-scale org chart on the canvas (CEO/VPs/Managers/ICs)
without burning real LLM credits or provisioning 200 EC2 instances.
Surfaces:
- workspace-server/internal/handlers/mock_runtime.go: A2A proxy
short-circuit, canned-reply pool, deterministic variant pick.
- workspace-server/internal/handlers/a2a_proxy.go: gate the
short-circuit before resolveAgentURL (mock has no URL).
- workspace-server/internal/handlers/org_import.go: skip Docker
provisioning for mock workspaces, set status='online' directly,
drop the per-sibling 2s pacing for mock children (collapses
a 200-workspace import from ~7min → ~1s).
- workspace-server/internal/handlers/runtime_registry.go: register
'mock' in the runtime allowlist (manifest + fallback set).
- workspace-server/internal/registry/healthsweep.go +
orphan_sweeper.go: skip mock workspaces in container-health and
stale-token sweeps (no container by design).
- workspace-server/internal/handlers/workspace_restart.go: mirror
the 'external' Restart no-op for mock.
- manifest.json: register the new
Molecule-AI/molecule-ai-org-template-mock-bigorg repo.
Tests: 5 new in mock_runtime_test.go covering happy-path, non-mock
regression guard, determinism, IsMockRuntime trim/case, JSON-RPC
id echo. All existing handler + registry tests still pass.
Local-verified: imported the 200-workspace template against a fresh
postgres+redis, confirmed all 200 land in 'online' and stay there
through the 30s health-sweep window, exercised A2A on CEO + VPs +
Managers + ICs and saw the variant pool rotate.
Org template lives at
Molecule-AI/molecule-ai-org-template-mock-bigorg (created today)
and is imported via the existing /org/import flow on the canvas
Template Palette.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Funding-demo Mock #1: when the canvas loads with `?purchase_success=1`,
show a centred success modal in the warm-paper theme. Auto-dismisses
after 5s; Close button + Esc + backdrop click also dismiss; URL params
are stripped on first paint so a refresh after dismiss does not
re-trigger.
Mounted in `app/layout.tsx` (not `app/page.tsx`) so the modal persists
across the canvas page-state transitions (loading → hydrated → error)
without unmounting and losing its open state.
No real billing logic — the marketplace "Purchase" button on the
landing page redirects here with the flag; this modal is the only
thing the user sees of the "transaction".
Local-verified end-to-end via playwright (5/5 tests pass): redirect
URL shape, modal visibility, URL cleanup, close button, refresh-after-
dismiss behaviour, 5s auto-dismiss.
Pairs with the Purchase button added to landingpage Marketplace
section.
scripts/clone-manifest.sh runs inside the platform Dockerfile build,
so a change to that script needs to retrigger publish. Because
scripts/ wasn't in the path filter, the prior fix (clone via Gitea +
lowercase org) never triggered this workflow.
Also serves as the file change to satisfy the path filter for THIS
push, retriggering publish-workspace-server-image now.
Post-2026-05-06 GitHub-org suspension: scripts/clone-manifest.sh
was still pointing at https://github.com/${repo}.git, so the
Docker build for workspace-server's platform image fails at:
fatal: could not read Username for 'https://github.com':
No such device or address
with no credentials available in the build container.
Fix: clone from https://git.moleculesai.app/${repo}.git instead.
manifest.json's repo paths still read 'Molecule-AI/...' (the
historic GitHub slug, mixed-case); Gitea lowercases the org
component to 'molecule-ai/...'. Lowercase the org segment on
the fly with awk so we don't need to rewrite every manifest
entry.
Local verify: bash -n passes, lowercase transform produces correct
Gitea paths, anonymous git clone of one of the manifest plugins
over HTTPS to git.moleculesai.app succeeds.
Class G in the prod-ship CI sweep: the same shape as the github.com
refs that Harness Replays hits; this is the second instance found.
Two coupled cleanups for the post-2026-05-06 stack:
1) Drop the github-app-auth plugin
============================================
The plugin injected GITHUB_TOKEN/GH_TOKEN via the App's
installation-access flow (~hourly rotation). Per-agent Gitea
identities replaced this approach after the 2026-05-06 suspension —
workspaces now provision with a per-persona Gitea PAT from .env
instead of an App-rotated token. The plugin code itself lived on
github.com/Molecule-AI/molecule-ai-plugin-github-app-auth which is
also unreachable post-suspension; checking it out at CI build time
was already failing.
Removed:
- workspace-server/cmd/server/main.go: githubappauth import + the
`if os.Getenv("GITHUB_APP_ID") != ""` block that called
BuildRegistry. gh-identity remains as the active mutator.
- workspace-server/Dockerfile + Dockerfile.tenant: COPY of the
sibling repo + the `replace github.com/Molecule-AI/molecule-ai-
plugin-github-app-auth => /plugin` directive injection.
- workspace-server/go.mod + go.sum: github-app-auth dep entry
(cleaned up by `go mod tidy`).
- 3 workflows: actions/checkout steps for the sibling plugin repo:
- .github/workflows/codeql.yml (Go matrix path)
- .github/workflows/harness-replays.yml
- .github/workflows/publish-workspace-server-image.yml
Verified `go build ./cmd/server` + `go vet ./...` pass post-removal.
2) Repoint publish-workspace-server-image from ghcr.io to ECR
=======================================================
This workflow used to push to ghcr.io/molecule-ai/platform +
platform-tenant. ghcr.io/molecule-ai is gone post-suspension. The
operator's ECR org (153263036946.dkr.ecr.us-east-2.amazonaws.com/
molecule-ai/) already hosts platform-tenant + workspace-template-*
+ runner-base images and is the post-suspension SSOT for container
images. This PR aligns publish-workspace-server-image with that
stack.
- env.IMAGE_NAME + env.TENANT_IMAGE_NAME repointed to ECR URL.
- docker/login-action swapped for aws-actions/configure-aws-
credentials@v4 + aws-actions/amazon-ecr-login@v2 chain (the
standard ECR auth pattern; uses AWS_ACCESS_KEY_ID/SECRET secrets
bound to the molecule-cp IAM user).
The :staging-<sha> + :staging-latest tag policy is unchanged —
staging-CP's TENANT_IMAGE pin still points at :staging-latest, just
with the new registry prefix.
Refs molecule-core#157, #161; parallel to org-wide CI-green sweep.
Gitea is case-sensitive on owner slugs; canonical is lowercase
`molecule-ai/...`. Mixed-case `Molecule-AI/...` refs fail-at-0s
when the runner tries to resolve the cross-repo workflow / checkout.
Same fix as molecule-controlplane#12. Mechanical case-correction;
no behavior change beyond making CI resolve again.
Refs: internal#46
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two coupled fixes for molecule-core#10 (plugin install 503 vs
status=online split-state):
1. SSOT for "is this workspace's container running" — `findRunningContainer`
in plugins.go used to carry its own copy of `cli.ContainerInspect`, which
collapsed transient daemon errors into the same `""` return as a
genuinely-stopped container. Healthsweep's `Provisioner.IsRunning`
handled the same input correctly (defensive). Promote the inspect logic
to `provisioner.RunningContainerName`, route both consumers through it.
Transient errors get a distinct log line on the plugins side so triage
doesn't confuse a flaky daemon with a stopped container.
2. Runtime-aware Install/Uninstall — `runtime='external'` workspaces have
no local container; push-install via docker exec is meaningless. They
pull plugins via the download endpoint instead (Phase 30.3). Without a
guard they fell through to `findRunningContainer` and 503'd with a
misleading "container not running." Add an early 422 with a hint
pointing at the download endpoint.
The two fixes are independent: (1) preserves correctness when the SSOT
helper is later modified; (2) eliminates the persistent split-state on
the 5 external persona-agent workspaces in this DB (and on tenant
deployments hitting the same shape).
* `internal/provisioner/provisioner.go` — new `RunningContainerName(ctx,
cli, id) (string, error)` with three documented outcomes (running /
stopped / transient). `Provisioner.IsRunning` now wraps it; behavior
preserved.
* `internal/handlers/plugins.go` — `findRunningContainer` shimmed onto
`RunningContainerName`; new `isExternalRuntime(id)` predicate.
* `internal/handlers/plugins_install.go` — Install + Uninstall reject
external runtimes with 422 + hint, before the source-fetch step.
* `internal/handlers/plugins_install_external_test.go` — 5 cases:
external→422, uninstall-external→422, container-backed-falls-through,
no-runtime-lookup-fails-open, lookup-error-fails-open.
* `internal/handlers/plugins_findrunning_ssot_test.go` — two AST gates
pin the SSOT routing so future PRs can't silently re-introduce the
parallel impl. Mutation-tested: reverting either consumer to a direct
`ContainerInspect` makes the gate fail.
Refs: molecule-core#10
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
In dev mode (`MOLECULE_ENV=dev|development`, `ADMIN_TOKEN` unset) the
AdminAuth chain fails open by design so canvas at :3000 can call
workspace-server at :8080 without a bearer token. Combined with the
existing wildcard bind on `:8080`, that exposed unauthenticated
`POST /workspaces` to any same-LAN peer (S-8 in the audit RFC v1).
Couple the bind narrowness to the same signal that drives the auth
fail-open: when `middleware.IsDevModeFailOpen()` returns true, default
the listener to `127.0.0.1`. Production (`ADMIN_TOKEN` set) keeps
binding to all interfaces — its auth chain is doing the work. Operators
who need LAN exposure set `BIND_ADDR=<host>` explicitly.
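A minimal sketch of that precedence (the real resolveBindHost in
cmd/server/main.go and the middleware import path may differ in
detail):

```go
// resolveBindHost: explicit BIND_ADDR > dev-mode loopback > all interfaces.
func resolveBindHost() string {
	if addr := os.Getenv("BIND_ADDR"); addr != "" {
		return addr // operator explicitly chose the exposure
	}
	if middleware.IsDevModeFailOpen() {
		return "127.0.0.1" // auth fails open, so the bind narrows to match
	}
	return "" // production: all interfaces; the ADMIN_TOKEN chain does the work
}
```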
* `cmd/server/main.go` — `resolveBindHost()` precedence: BIND_ADDR
explicit > IsDevModeFailOpen() loopback > "" (all interfaces).
Startup log line now includes the resolved bind + dev-mode-fail-open
state for post-deploy auditing.
* `cmd/server/bind_test.go` — 8 t.Setenv table cases covering
precedence, explicit overrides, dev/prod env words. Mutation-tested:
removing the `IsDevModeFailOpen()` branch makes the dev-mode cases
fail with "" vs "127.0.0.1".
Refs: molecule-core#7
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The five mock.ExpectQuery(`SELECT id FROM workspaces`) sites used a
loose substring regex that silently passed three regression shapes
#2872 called out:
1. `WHERE parent_id = $2` (drops `IS NOT DISTINCT FROM` — breaks
NULL-parent root matching)
2. `WHERE name = $1` only (drops parent_id check entirely — hijacks
siblings of the same name across different parents)
3. Drops `AND status != 'removed'` (blocks re-import after Collapse)
Extracts a `lookupChildSQLRE` const that anchors all four load-bearing
tokens (the SELECT/FROM, the name predicate, the IS NOT DISTINCT FROM
predicate, and the status filter). All five ExpectQuery sites now use
the same const so a future schema/predicate change fails one place.
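A minimal sketch of the const's shape; the PR's actual pattern may
differ, but the point is that all four load-bearing tokens anchor in
one place:

```go
// lookupChildSQLRE pins the SELECT/FROM, the name predicate, the
// NULL-safe parent predicate, and the status filter in one regex.
const lookupChildSQLRE = `SELECT id FROM workspaces\s+` +
	`WHERE name = \$1\s+` +
	`AND parent_id IS NOT DISTINCT FROM \$2\s+` +
	`AND status != 'removed'`

// Every site then shares it:
//   mock.ExpectQuery(lookupChildSQLRE).WithArgs(name, parentID)
```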
Mutation-tested per memory feedback_assert_exact_not_substring.md:
- Replacing `IS NOT DISTINCT FROM` with `=` fails
TestLookupExistingChild_NilParent_MatchesRoot.
- Dropping `AND status != 'removed'` fails
TestLookupExistingChild_Found_ReturnsIDAndTrue.
Note: #2872 PR-A (AST gate strengthening) is already addressed inline —
findWorkspacesInsertSQL + TestCreateWorkspaceTree_InsertUsesOnConflictDoNothing
pin the ON CONFLICT DO NOTHING shape, which is a strictly stronger
gate than the original lookup-before-insert ordering check.
Allows MOLECULE_IMAGE_REGISTRY env override on the tenant workspace-server. Used to flip from ghcr.io/molecule-ai → private ECR mirror after the GitHub org suspension on 2026-05-06. Default unchanged for OSS users.
Closes #6.
Add MOLECULE_IMAGE_REGISTRY env var to override the registry prefix used
by all workspace-template image references. Defaults to ghcr.io/molecule-ai
(unchanged for OSS users); set to an ECR URI in production tenants when
mirroring to AWS.
Why this matters: GitHub suspended the Molecule-AI org on 2026-05-06 with
no warning. Production tenants kept running because they had images cached
locally, but any tenant restart (AWS health event, redeploy, OS reboot)
would have failed at `docker pull ghcr.io/molecule-ai/...` because GHCR
returned 401. This change introduces the seam needed to point new pulls at
a registry we control (AWS ECR) by flipping a single env var on Railway.
Design (RFC: molecule-ai/internal#6; a sketch follows the list):
- New `RegistryPrefix()` function in `provisioner/registry.go` reads
MOLECULE_IMAGE_REGISTRY, falls back to "ghcr.io/molecule-ai".
- New `RuntimeImage(runtime)` returns the canonical ref using the prefix.
- `RuntimeImages` map computed at init via `computeRuntimeImages()` so
existing callers that range over it still work.
- `DefaultImage` likewise computed via `RuntimeImage(defaultRuntime)`.
- `handlers.TemplateImageRef()` switched from hardcoded format string to
`provisioner.RegistryPrefix()`.
- `runtime_image_pin.go::resolveRuntimeImage()` automatically inherits
the prefix change because it reads from `provisioner.RuntimeImages[]`
and only re-formats the tag suffix to a digest pin.
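A minimal sketch of the seam described in the list above; the env var
and GHCR default are from this PR, while the image-name format string
is an assumption:

```go
package provisioner

import (
	"fmt"
	"os"
)

// RegistryPrefix: MOLECULE_IMAGE_REGISTRY wins; GHCR is the default.
func RegistryPrefix() string {
	if v := os.Getenv("MOLECULE_IMAGE_REGISTRY"); v != "" {
		return v
	}
	return "ghcr.io/molecule-ai"
}

// RuntimeImage: canonical ref for a runtime. The name format here is
// illustrative; the real map is computed once at init.
func RuntimeImage(runtime string) string {
	return fmt.Sprintf("%s/workspace-template-%s:latest", RegistryPrefix(), runtime)
}
```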
Alternatives rejected (see RFC):
- Multi-registry fallback chain (try ECR, fall back to GHCR): GHCR is
locked from outbound for our org, so the fallback never works for us.
Adds code complexity for no benefit.
- Hardcoded ECR-only switch: couples production code to a specific
deployment environment. OSS users self-hosting Molecule would need
the upstream GHCR.
- Self-hosted Harbor / registry-on-Hetzner: adds a component to operate.
Not justified at 3-tenant scale; AWS ECR is mature and IAM-integrated.
Auth — deliberately NOT changed in this commit:
- For GHCR, the existing `ghcrAuthHeader()` reads GHCR_USER/GHCR_TOKEN.
- For ECR, EC2 user-data installs `amazon-ecr-credential-helper` and adds
a `credHelpers` entry in `~/.docker/config.json` so the daemon resolves
ECR credentials via the EC2 instance role on every pull. The Go code
needs no auth change. This keeps the diff minimal.
Backwards compatibility:
- Additive: env unset → identical behavior to today (GHCR).
- Existing tests reference literal `ghcr.io/molecule-ai/...` strings;
they continue to pass under the default prefix.
- `RuntimeImages` map preserved for callers that iterate it.
- No interface, schema, API, or migration version bump needed.
Security review:
- No untrusted input: MOLECULE_IMAGE_REGISTRY is set at deploy time
(Railway env, EC2 user-data), not by users.
- No expanded data collection or logging changes.
- No new permissions: ECR pull permission is a future user-data + IAM
role change, separate from this code change.
- Worst-case: an attacker who already compromises Railway can swap the
registry prefix to a malicious URI — same blast radius as compromising
Railway today, no expansion.
Tests:
- 9 new unit tests in `registry_test.go` covering: default fallback,
env override, empty env, all 9 known runtimes, unknown runtime,
override-applies-to-all, computeRuntimeImages map population, env
reflection, alphabetical ordering pin.
- All existing provisioner + handlers tests continue to pass.
- Mutation-tested mentally: deleting `if v := os.Getenv(...)` makes
TestRegistryPrefix_RespectsEnv fail. Deleting `for _, r := range
knownRuntimes` makes TestRuntimeImage_AllKnownRuntimes fail. The test
suite would catch a regression of the original failure mode.
Rollout plan: this PR is safe to merge with no env change. Production
cutover happens by setting MOLECULE_IMAGE_REGISTRY on Railway after
the AWS ECR mirror is populated (separate ops change, tracked in
issue #6 phases 3b–3f).
Tracking:
- RFC: molecule-ai/internal#6
- Tasks: #97 (ECR setup), #98 (CP fallback)
- Tech debt: runbooks/hetzner-rollout-tech-debt-2026-05-06.md item 7
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #3026. Final piece of RFC #2945.
## What's new
New package internal/messagestore/ holds:
- MessageStore interface — single read-side contract operators
implement to plug in alternative chat-history backends.
- ChatMessage / ChatAttachment / ListOptions types — canonical data
shapes returned by any impl, mirroring canvas's TS ChatMessage.
- PostgresMessageStore — platform-default impl wrapping the
activity_logs query + A2A-envelope parser ported in PR-C.
Behavior is byte-identical to the pre-PR-D handler.
## What moves
The activity_logs query, the parser (activityRowToChatMessages,
extractRequestText, extractChatResponseText, extractFilesFromTask,
etc.), and the internal-self-message predicate all migrate from
internal/handlers/chat_history.go into the new package. handlers/
chat_history.go becomes a thin HTTP-shape adapter:
parse query params → store.List(ctx, workspaceID, opts) → emit JSON
Compile-time interface assertion in postgres_store.go catches future
drift if the interface evolves and the impl falls behind.
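A minimal sketch of the contract's shape; names follow the PR text and
the wire format, exact field types are assumptions:

```go
package messagestore

import (
	"context"
	"time"
)

type ChatAttachment struct {
	URL      string `json:"url"`
	Filename string `json:"filename"`
}

type ChatMessage struct {
	ID          string           `json:"id"`
	Role        string           `json:"role"` // "user" | "agent" | "system"
	Content     string           `json:"content"`
	Attachments []ChatAttachment `json:"attachments"`
	Timestamp   time.Time        `json:"timestamp"`
}

// ListOptions mirrors the handler's query params.
type ListOptions struct {
	Limit     int
	BeforeTS  time.Time
	HasBefore bool
}

// MessageStore is the single read-side contract operators implement.
type MessageStore interface {
	List(ctx context.Context, workspaceID string, opts ListOptions) ([]ChatMessage, error)
}
```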
## Why this PR
OSS operators wanting to:
- Tier hot/warm/cold storage (recent in Postgres, archival in S3)
- Use a vector store with hybrid search (Pinecone, Weaviate)
- Run an in-memory store for ephemeral test environments
- Federate history across regions
…had no extension point — they'd have to fork the handler. This PR
makes that a constructor swap at router.go.
## Tests
Parser-level (22 tests, MOVED to
internal/messagestore/postgres_store_test.go): every TS test case in
canvas/src/components/tabs/chat/__tests__/historyHydration.test.ts
has a Go counterpart. Timestamp preservation, user/agent extraction,
internal-self filter, role decision (status=error vs agent-error
prefix), v0/v1 file shapes, malformed JSON resilience.
Handler-level (9 NEW tests in internal/handlers/chat_history_test.go):
thin adapter coverage using a fake MessageStore. UUID validation,
before_ts RFC3339 validation, default limit, max-limit clamp,
invalid-limit fallback, before_ts passthrough, empty-array (not
null) JSON shape, attachment shape preservation, store-error → 502
mapping.
Compile-time interface conformance: PostgresMessageStore satisfies
MessageStore, fakeStore (test fake) satisfies MessageStore.
Mutation-tested. Removed UUID validation in the handler; confirmed
TestChatHistoryHandler_RejectsNonUUIDWorkspaceID fires red (status
200 instead of 400, non-UUID reaches the store). Restored, all
green.
Full handlers + messagestore + router test runs green; full repo
go test ./... green.
## SSOT decision
ChatMessage / ChatAttachment / parser / DB query all live in
internal/messagestore/ ONLY. handlers/chat_history.go imports the
package and uses the types via messagestore.ChatMessage etc. — no
re-declaration anywhere.
## Three weakest spots (hostile-reviewer self-pass)
1. The internal-self prefix list (Delegation results are ready...) is
a package var in messagestore/postgres_store.go. A future impl
that wants to override the predicate must reach into the package
to use IsInternalSelfMessage or define its own. Acceptable: the
predicate is part of the contract; if an impl wants different
semantics it owns that decision explicitly.
2. ListOptions has Limit + BeforeTS + HasBefore; future paging needs
(after_ts, peer_id filter, role filter) require additive struct
field additions, which is a soft API break for any impl that
handles ListOptions positionally. Mitigated by Go's struct-literal
convention (named fields by default); also flagged in the
interface comment for impl authors.
3. The handler does NOT log when a store returns an error — it just
maps to 502. An impl that wants to surface its error class up the
stack can't, today. If/when an impl needs that, the interface can
add a typed-error contract in a follow-up. Today's coverage is
sufficient: most ops issues land in the store impl's own logs.
## Security review
- Untrusted input? Same as PR-C — agent-emitted JSON parsed
defensively. New fakeStore in tests can't reach production.
- Trust boundary? Same. Interface lives BEHIND wsAuth; impls only
see workspace IDs already authenticated.
- Auth/authz? Inherited from handler; the interface doesn't
authenticate.
- PII / secrets in logs? Documented in the interface contract:
impls MUST NOT log full message bodies / attachment URIs. The
Postgres impl logs nothing on the happy path.
- Output sanitization? Same plain-text + opaque-URI surface as
PR-C. Canvas validates attachment-URI schemes.
No security-relevant changes beyond what /chat-history already
exposes via PR-C. Considered, not skipped.
## Versioning / backwards compat
- New internal package. Zero public API change.
- Single caller site in router.go updated (one-line constructor
change). NewChatHistoryHandler() → NewChatHistoryHandler(store).
- No schema change, no migration.
- Existing /chat-history endpoint unchanged on the wire — clients
don't notice the refactor.
## Phasing
This is the final RFC #2945 piece. Follow-ups parked:
- PR-C-2 (canvas migration): swap canvas loadMessagesFromDB to call
/chat-history instead of /activity. Independent of this PR;
blocked only by canvas team's calendar.
- Sample alternative impls (S3, in-memory) for OSS docs: separate
PR when the first OSS consumer materializes; demonstration code
untested against a real workload is an anti-pattern.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Closes the SSOT gap for chat-history hydration: today every consumer
(canvas TS) re-implements an A2A-envelope walk to map activity_logs
rows into rendered ChatMessage objects. This PR moves that walk into
the server.
## What's added
GET /workspaces/:id/chat-history?limit=N&before_ts=T
Returns:
{
"messages": [
{"id": "<uuid>", "role": "user"|"agent"|"system",
"content": "...", "attachments": [...], "timestamp": "<RFC3339>"}
],
"reached_end": false
}
Auth chain: same wsAuth as /workspaces/:id/activity (tenant ADMIN_TOKEN
+ X-Molecule-Org-Id). No new trust boundary.
Filter: a2a_receive rows with source_id IS NULL — same canvas-source
filter the canvas applies via /activity?type=a2a_receive&source=canvas,
centralized so future API consumers don't need to know it.
## What's mirrored from canvas TS
Direct port of canvas/src/components/tabs/chat/historyHydration.ts
+ message-parser.ts:
- extractRequestText / extractFilesFromUserMessage — user-side parts
walk through request_body.params.message.parts[]
- extractChatResponseText — agent-side response_body collector across
the four shapes (string, A2A JSON-RPC parts, older nested
parts.root.text, task artifacts) joined with "\n" (matches canvas
multi-source collector — claude-code emits multiple text parts;
hermes emits summary+artifacts)
- extractFilesFromResponse / extractFilesFromTask — file walk across
parts[] + artifacts[].parts[] + status.message.parts[] +
message.parts[]
- v0 hot path ({kind:"file", file:{...}}) AND v1 protobuf flat shape
({url, filename, mediaType}) both supported
- Role decision: status='error' OR text starts with "agent error"
(case-insensitive) → "system", else "agent" (sketched after this
list)
- isInternalSelfMessage prefix filter (Delegation results are
ready...)
- Timestamp pinned to row.created_at (regression cover for
2026-04-25 bubble-collapse bug)
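A minimal sketch of that role decision, as referenced in the list
above (helper name is illustrative):

```go
import "strings"

// roleFor: status=error or an "agent error" prefix renders as system.
func roleFor(status, text string) string {
	if status == "error" ||
		strings.HasPrefix(strings.ToLower(text), "agent error") {
		return "system"
	}
	return "agent"
}
```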
## Tests
22 unit tests in chat_history_test.go, every TS test case in
historyHydration.test.ts has a Go counterpart:
Timestamp preservation (3): user/agent pin to created_at, two-rows
produce two distinct timestamps.
User-message extraction (5): text-only, internal-self skip,
null body, attachments hydrated, attachments-only-when-text-empty,
internal-self suppresses even with attachments.
Agent-message extraction (4): result-string, status=error→system,
agent-error-prefix→system, response_body.parts attachments,
null body, no-text-no-files-no-bubble.
End-to-end (1): paired user+agent same timestamp.
Go-specific (5): malformed JSON returns empty (no panic), v1
protobuf flat shape extraction, task-artifacts extraction, older
nested root.text shape, basename helper edge cases.
isInternalSelfMessage predicate (1): prefix match, non-prefix non-
match, empty-text non-match.
Mutation-tested. Removed the role-promotion branch (status=error +
agent-error prefix → system); confirmed both
TestChatHistory_RoleSystemWhenStatusError and
TestChatHistory_RoleSystemWhenAgentErrorPrefix fire red. Restored.
Both green.
Full handlers test suite (4.3s) green; full repo `go test ./...` green.
## SSOT decision
Parsing logic lives in workspace-server/internal/handlers/chat_history.go
ONLY. Canvas keeps historyHydration.ts + message-parser.ts during the
transition because:
- PR-C-2 (follow-up): canvas loadMessagesFromDB swaps to new
endpoint. Today's canvas still calls /activity for backward
compatibility.
- The TS parsers are still load-bearing for LIVE message handling
(WebSocket A2A_RESPONSE events) until RFC #2945 PR-B-2 mirrors
the typed event payloads to canvas consumers.
Canvas's TS path will be deleted in a separate PR after a one-week
observation window confirms no live-message consumers depend on it.
## Security review
- Untrusted input? YES — request_body and response_body come from
agents (potentially OSS / third-party). Defensive: any malformed
JSON returns empty content + no attachments, no panic. Tested
via TestChatHistory_MalformedJSONInRequestBodyReturnsEmpty.
- Trust boundary? Same as today: agent → workspace-server.
No new boundary; reuses existing wsAuth middleware.
- Auth/authz? Inherits wsAuth chain. Cross-workspace access blocked
by existing TenantGuard middleware.
- PII / secrets in logs? None. The handler logs nothing on the
happy path; errors log 502 without body content.
- Output sanitization? ChatMessage.content is plain text returned
as-is; canvas already sanitizes via ReactMarkdown. Attachment
URIs are agent-provided (workspace: / platform-pending: /
https:); canvas's existing scheme allow-list still applies.
## Versioning / backwards compatibility
- New endpoint /chat-history. /activity unchanged.
- Canvas historyHydration.ts + message-parser.ts intact during
transition (will be removed in PR-C-2 follow-up).
- No public API consumer of /activity is broken — added route is
additive.
- No semver bump (server is internal versioning).
## Three weakest spots (hostile-reviewer self-pass)
1. extractRequestText returns ONLY parts[0].text. If a user message
contains multiple text parts (uncommon — canvas only ever emits
one), we lose later parts. Matches canvas exactly today, but a
future change that emits multi-text user messages needs both
parsers updated. Documented in code; covered by test if/when
added.
2. activityRowToChatMessages rebuilds ChatMessage IDs every call (no
caching). Each chat reload mints fresh UUIDs. This is fine because
canvas dedupes by (role, content, timestamp window) not id, but a
future API consumer that DID rely on id stability would break.
Documented in the ChatMessage struct comment.
3. The handler scopes to source_id IS NULL only (canvas-source rows).
A future "show all messages, including agent-to-agent" mode would
need a new endpoint or a parameter. Out of scope for PR-C; canvas's
/activity?source=canvas already enforces the same filter.
Closes #3017. Unblocks RFC #2945 PR-D (MessageStore interface) which
returns []ChatMessage typed values.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #2962.
## Why
Six per-package `truncate` helpers had drifted into independent
re-implementations of the same idea. Three of them (delegation.go,
memory/client/client.go, memory-backfill/verify.go) used
`s[:max] + "…"` byte-slice form, which on a multi-byte codepoint at
byte `max` produces invalid UTF-8 → Postgres `text`/`jsonb` rejects
the INSERT silently → `delegation` / `activity_logs` row never lands
→ audit gap.
Three other helpers (delegation_ledger.go #2962, agent_message_writer.go
#2959, scheduler.go #2026) had each been fixed in isolation with three
slightly different rune-safe shapes — confirming this is a class of
bug, not a single instance.
## What
New package `internal/textutil` with three rune-safe functions:
- `TruncateBytes(s, maxBytes)` — byte-cap, "…" marker. Used by 5
callers writing into byte-bounded columns / log lines.
- `TruncateBytesNoMarker(s, maxBytes)` — byte-cap, no marker. Used by
delegation_ledger.go where the storage already conveys "preview"
and an extra ellipsis would push the result over the column cap.
- `TruncateRunes(s, maxRunes)` — rune-cap, "…" marker. Used by
agent_message_writer.go where the cap is in display chars (UI
summary), not bytes.
All three guarantee `utf8.ValidString(out)` for any `utf8.ValidString(in)`.
Inputs already invalid go through `sanitizeUTF8` at the call site
boundary (scheduler.go preserved this defense-in-depth).
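A minimal sketch of the rune-safe byte-cap shape, assuming the
3-byte "…" marker counts against the cap (the shipped code may
differ):

```go
package textutil

import "unicode/utf8"

// TruncateBytes: cap at maxBytes without splitting a codepoint, so
// utf8.ValidString(out) holds for any valid input.
func TruncateBytes(s string, maxBytes int) string {
	const marker = "…" // 3 bytes in UTF-8
	if len(s) <= maxBytes {
		return s
	}
	cut := maxBytes - len(marker)
	if cut < 0 {
		cut = 0
	}
	for cut > 0 && !utf8.RuneStart(s[cut]) {
		cut-- // back up to the previous rune boundary
	}
	return s[:cut] + marker
}
```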
## Migration map
| Old | New | Behavior change |
|---|---|---|
| `delegation_ledger.truncatePreview` | `textutil.TruncateBytesNoMarker(s, 4096)` | none |
| `agent_message_writer.truncatePreviewRunes` | `textutil.TruncateRunes(s, n)` | none |
| `scheduler.truncate` | `textutil.TruncateBytes(s, n)` | "..." → "…" (3 bytes either way; single-glyph display) |
| `delegation.truncate` | `textutil.TruncateBytes(s, n)` | bug fix + ellipsis swap |
| `memory/client.truncate` | `textutil.TruncateBytes(s, n)` | bug fix |
| `memory-backfill.truncate` | `textutil.TruncateBytes(s, n)` | bug fix |
Five separate `truncate*` helpers + their per-package tests removed.
Net: 12 files / +427 / -255.
## Tests
- `internal/textutil/truncate_test.go` — 27 table-test cases + 145
fuzz-invariant cases asserting `utf8.ValidString` and byte-cap
invariants on every output.
- `delegation_ledger_test.go TestLedgerInsert_TruncatesOversizedPreview`
strengthened with `capValidUTF8Matcher` so the SQL-write argument
is asserted to be valid UTF-8 + within cap (not just `AnyArg()`).
Mutation-tested: replacing the SSOT call with byte-slice form makes
this test fail loud.
## Compatibility
- All callers internal; no external API surface change.
- Ellipsis swap "..." → "…": same byte budget (3 bytes), single-glyph
display. No alerting/grep on either marker in this codebase
(verified). Canvas renders both correctly.
- DB column widths unchanged (4096 / 80 / 200 / 256 / 300 — all
preserved in the migrations).
## Security
Fixes a silent INSERT-failure mode that hid `activity_logs` /
`delegations` rows containing peer-controlled text. The class of input
that triggered it (CJK, emoji, accented Latin) is normal user content,
not malicious — but the symptom (audit gap) makes incident
reconstruction harder. Helper is pure-function over `string`; no
secrets / PII / auth handling involved. Untrusted input is handled
identically to before, just rune-aligned now.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The migration-replay step globbed only *.up.sql, silently skipping
the older flat-naming migrations (001_workspaces.sql,
009_activity_logs.sql, etc.). Fine while no integration test
depended on those tables; broke when the #149 cross-table
atomicity test came in needing both workspaces (FK target for
activity_logs) and activity_logs themselves.
Switch to globbing *.sql + sorted lex-order, excluding *.down.sql
so up/down pairs don't undo themselves mid-run. Add a sanity check
for workspaces + activity_logs + pending_uploads alongside the
existing delegations gate so a future migration drift fails loud
instead of silently skipping the regressed test.
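A minimal sketch of the new replay shape; the actual step may be a
shell loop rather than Go, and applyMigration is a hypothetical
helper:

```go
import (
	"database/sql"
	"fmt"
	"path/filepath"
	"sort"
	"strings"
)

// replayMigrations: glob every *.sql, apply in lexical order, skip downs.
func replayMigrations(db *sql.DB, dir string) error {
	files, err := filepath.Glob(filepath.Join(dir, "*.sql"))
	if err != nil {
		return err
	}
	sort.Strings(files) // 001_workspaces.sql sorts before 009_..., etc.
	for _, f := range files {
		if strings.HasSuffix(f, ".down.sql") {
			continue // a down must not undo its up mid-run
		}
		if err := applyMigration(db, f); err != nil {
			return fmt.Errorf("replay %s: %w", f, err)
		}
	}
	return nil
}
```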
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>