Closes the gap between the merged Memory v2 code (PR #2757 wired the
client into main.go) and operator activation. Without this PR an
operator wanting to flip MEMORY_V2_CUTOVER=true had to provision a
separate memory-plugin service and point MEMORY_PLUGIN_URL at it —
extra ops surface for what the design intends to be a built-in.
What ships:
* Both Dockerfile + Dockerfile.tenant build the
cmd/memory-plugin-postgres binary into /memory-plugin.
* Entrypoints spawn the plugin in the background on :9100 BEFORE
  starting the main server, wait up to 30s for /v1/health to return
  200, and abort boot loudly if it doesn't (better to crash-loop than
  to silently route cutover traffic against a dead plugin). The gate
  is sketched in Go after this list.
* Default env: MEMORY_PLUGIN_DATABASE_URL=$DATABASE_URL (share the
  existing tenant Postgres; the plugin's `memory_namespaces` /
  `memory_records` tables coexist with the platform schema without
  conflicts) and MEMORY_PLUGIN_LISTEN_ADDR=:9100.
* MEMORY_PLUGIN_DISABLE=1 escape hatch for operators running the
plugin externally on a separate host.
* Platform image: the plugin runs as the `platform` user (not root)
  via su-exec, matching the privilege boundary the main server already
  drops to. The tenant image already starts as `canvas`, so the plugin
  inherits non-root automatically.
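The entrypoints themselves are shell, but the health gate reduces to
the following logic, sketched here in Go for illustration; the
endpoint, port, and 30s budget come from the list above, and the
function name is hypothetical:

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

// waitForSidecar polls the plugin's health endpoint until it returns
// 200 or the deadline passes, mirroring the entrypoint's 30s gate.
func waitForSidecar(url string, deadline time.Duration) error {
	stop := time.Now().Add(deadline)
	for time.Now().Before(stop) {
		resp, err := http.Get(url)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil
			}
		}
		time.Sleep(time.Second)
	}
	return fmt.Errorf("memory-plugin not healthy after %s", deadline)
}

func main() {
	if err := waitForSidecar("http://localhost:9100/v1/health", 30*time.Second); err != nil {
		// Abort boot loudly: crash-loop beats silently routing
		// cutover traffic against a dead plugin.
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```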
What stays operator-controlled:
* MEMORY_V2_CUTOVER is NOT auto-set. Behavior change for existing
  deployments: zero. The wiring at workspace-server/internal/memory/
  wiring/wiring.go skips building the plugin client until the
  operator opts in (sketched after this list), so the running sidecar
  is a no-op for traffic until then.
* MEMORY_PLUGIN_URL is NOT auto-set either, for the same reason:
  setting it implies cutover-active intent. Operators set both on
  staging first, verify a live commit/recall round-trip (closes
  pending task #292), then promote to production.
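A hypothetical reduction of that opt-in gate; the real wiring.go shape
is not reproduced here and the function name is illustrative:

```go
package wiring

import "os"

// memoryV2Target reports whether the plugin client should be built.
// No client is constructed unless the operator sets both variables,
// so the running sidecar receives no traffic by default.
func memoryV2Target() (url string, enabled bool) {
	url = os.Getenv("MEMORY_PLUGIN_URL")
	if os.Getenv("MEMORY_V2_CUTOVER") != "true" || url == "" {
		return "", false // legacy memory path; sidecar stays a no-op
	}
	return url, true
}
```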
Operator activation steps after this PR ships:
1. Verify the pgvector extension is available on the target Postgres
   (the plugin's first migration runs CREATE EXTENSION IF NOT
   EXISTS vector); see the query sketch after these steps. Railway's
   managed Postgres ships with pgvector available; some self-hosted
   operators may need to enable it.
2. Redeploy the workspace-server with this image.
3. Set MEMORY_PLUGIN_URL=http://localhost:9100 + MEMORY_V2_CUTOVER=true
in the environment (staging first).
4. Watch boot logs for "memory-plugin: ✅ sidecar healthy" and the
wiring.go cutover messages; do a live commit_memory + recall_memory
round-trip via the canvas Memory tab to verify.
5. Promote to production once staging holds for a sweep window.
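A minimal sketch of the step-1 check, assuming the lib/pq driver and a
DATABASE_URL env var; `pg_available_extensions` is a standard Postgres
catalog view, the rest is illustrative:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"
	"os"

	_ "github.com/lib/pq" // assumed driver; any Postgres driver works
)

// Checks whether pgvector can be installed on the target database,
// i.e. whether CREATE EXTENSION vector will succeed.
func main() {
	db, err := sql.Open("postgres", os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	var available bool
	err = db.QueryRow(
		`SELECT EXISTS (SELECT 1 FROM pg_available_extensions WHERE name = 'vector')`,
	).Scan(&available)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("pgvector available:", available)
}
```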
Refs RFC #2728. Closes the dormant-plugin gap noted in task #294.
Closes the gap that allowed issue #2395: redeploy-fleet workflows
reported ssm_status=Success based on the SSM RPC return code alone,
while EC2 tenants silently kept serving the previous :latest digest,
because `docker compose up` without an explicit pull is a no-op when
the local tag already exists.
Wiring:
- new buildinfo package exposes GitSHA, set at link time via -ldflags
  from the GIT_SHA build-arg; the default is "dev", so a binary built
  without ldflags fails verification closed instead of masquerading
  as a deployed SHA (sketched after this list)
- router exposes GET /buildinfo returning {git_sha} — public, no auth,
cheap enough to curl from CI for every tenant
- both Dockerfiles thread GIT_SHA into the Go build
- publish-workspace-server-image.yml passes GIT_SHA=github.sha for both
images
- redeploy-tenants-on-main.yml + redeploy-tenants-on-staging.yml curl each
tenant's /buildinfo after the redeploy SSM RPC and fail the workflow on
digest mismatch; staging treats both :latest and :staging-latest as
moving tags; verification is skipped only when an operator pinned a
specific tag via workflow_dispatch
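A single-file sketch of the GitSHA plumbing and wire shape described
above; the real code splits this into the buildinfo package plus the
router, and the listen address here is made up:

```go
package main

import (
	"encoding/json"
	"net/http"
)

// GitSHA defaults to "dev" so a binary built without ldflags is
// visibly a non-deploy build. Overridden at link time, e.g.:
//
//	go build -ldflags "-X main.GitSHA=$GIT_SHA"
var GitSHA = "dev"

func main() {
	// GET /buildinfo: public, no auth, cheap for CI to curl per tenant.
	http.HandleFunc("/buildinfo", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		// {"git_sha": ...} is the wire shape the workflow's jq lookup pins.
		json.NewEncoder(w).Encode(map[string]string{"git_sha": GitSHA})
	})
	http.ListenAndServe(":8080", nil) // listen address is illustrative
}
```

The per-tenant verification then amounts to curling /buildinfo and
comparing `.git_sha` against github.sha.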
Tests:
- TestGitSHA_DefaultDevSentinel pins the dev default
- TestBuildInfoEndpoint_ReturnsGitSHA pins the wire shape that the
workflow's jq lookup depends on
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the last CP-provisioned-workspace gap: the Terminal tab now
works for workspaces running on separate EC2 instances. Follow-up to
#1531, which added instance_id persistence.
How it works:
- HandleConnect checks workspaces.instance_id
- Empty → existing local Docker path (unchanged)
- Set → spawn `aws ec2-instance-connect ssh --connection-type eice
  --instance-id X --os-user ec2-user -- docker exec -it ws-Y
  /bin/bash` under creack/pty, bridge pty ↔ canvas WebSocket (see the
  sketch below)
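A sketch of that pty ↔ WebSocket bridge, assuming gorilla/websocket on
the canvas side (the actual WS library isn't named above); creack/pty's
`pty.Start` is the real API, everything else is illustrative, and
resize events plus error handling are elided:

```go
package terminal

import (
	"os/exec"

	"github.com/creack/pty"
	"github.com/gorilla/websocket" // assumed WS library for this sketch
)

// bridge runs the remote branch after HandleConnect has built the
// aws ec2-instance-connect command quoted above: start it under a
// pty and pump bytes both ways.
func bridge(ws *websocket.Conn, cmd *exec.Cmd) error {
	ptmx, err := pty.Start(cmd) // pty so the remote shell sees a terminal
	if err != nil {
		return err
	}
	defer ptmx.Close()

	go func() { // WebSocket -> pty: keystrokes from the Terminal tab
		for {
			_, data, err := ws.ReadMessage()
			if err != nil {
				cmd.Process.Kill() // WS close -> kill ssh
				return
			}
			ptmx.Write(data)
		}
	}()

	buf := make([]byte, 4096)
	for { // pty -> WebSocket: terminal output; loop ends when ssh exits
		n, err := ptmx.Read(buf)
		if n > 0 {
			ws.WriteMessage(websocket.BinaryMessage, buf[:n])
		}
		if err != nil {
			return ws.Close() // ssh exit -> close WS
		}
	}
}
```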
Why subprocess AWS CLI instead of native AWS SDK:
- EIC Endpoint tunnel needs a signed WebSocket with specific framing
- aws-cli v2 implements it correctly; reimplementing in Go is ~500
lines of crypto + WS protocol work for zero user-visible benefit
- Tenant image picks up 1MB of aws-cli + openssh-client via apk
Handler design:
- sshCommandFactory is a var so tests can stub it (no real aws calls);
  see the sketch after this list
- Context cancellation propagates both ways (WS close → kill ssh;
ssh exit → close WS)
- User-visible error points at docs/infra/workspace-terminal.md when
EIC wiring is incomplete (common bootstrap failure)
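A hypothetical shape for that stub seam and the cancellation tie-in;
the signature and the context parameter are assumptions, not the
actual handler API:

```go
package terminal

import (
	"context"
	"os/exec"
)

// sshCommandFactory is a package-level var: production code calls it,
// tests overwrite it. CommandContext ties the subprocess lifetime to
// the connection context, so cancelling it kills the ssh tunnel.
var sshCommandFactory = func(ctx context.Context, instanceID, workspaceID string) *exec.Cmd {
	return exec.CommandContext(ctx, "aws", "ec2-instance-connect", "ssh",
		"--connection-type", "eice", "--instance-id", instanceID,
		"--os-user", "ec2-user",
		"--", "docker", "exec", "-it", "ws-"+workspaceID, "/bin/bash")
}

// A test like TestSshCommandFactory_BuildsEICCommand can assert the
// argv shape, while routing tests stub the var entirely:
//
//	sshCommandFactory = func(ctx context.Context, _, _ string) *exec.Cmd {
//		return exec.CommandContext(ctx, "true") // no real aws call
//	}
```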
Tests:
- TestHandleConnect_RoutesToRemote — instance_id in DB → CP branch
- TestHandleConnect_RoutesToLocal — empty instance_id → local branch
- TestSshCommandFactory_BuildsEICCommand — argv shape regression guard
Dockerfile.tenant: + openssh-client + aws-cli (Alpine main repo)
Refs: #1528, #1531
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
node:20-alpine ships with a `node` user at uid/gid 1000. The Dockerfile
tried `addgroup -g 1000 canvas`, which fails with exit code 1 because
gid 1000 is already taken. The publish-workspace-server-image workflow
has been red for hours: the tenant image's :latest is stuck on a digest
that predates the X-Molecule-Admin-Token CPProvisioner fix, and staging
workspace provisioning 401'd because the stale tenant binary never sent
the admin header.
Delete the `node` user and group first (tolerant of future base-image
changes that might not ship them), then create `canvas` at 1000/1000 as
before. Mounted volumes continue to expect uid 1000.
Repro: publish-workspace-server-image workflow run 24731870797:
"process addgroup -g 1000 canvas && adduser... exit code: 1".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes: #177 (CRITICAL — Dockerfile runs as root)
Dockerfiles changed:
- workspace-server/Dockerfile (platform-only): addgroup/adduser + USER platform
- workspace-server/Dockerfile.tenant (combined Go+Canvas): addgroup/adduser + USER canvas
+ chown canvas:canvas on canvas dir so non-root node process can read it
- canvas/Dockerfile (canvas standalone): addgroup/adduser + USER canvas
- workspace-server/entrypoint-tenant.sh: update header comment (no longer starts
as root; both processes now start non-root)
The entrypoint no longer needs a root→non-root handoff since both the
Go platform server and the Canvas Node process run as non-root by
default. The 'canvas' user owns /app and /platform, so volume mounts
owned by the host's canvas user work without a root init step.
Co-authored-by: Molecule AI CP-BE <cp-be@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>