Two scripts:
- nuke-and-rebuild.sh: docker down -v, clean orphans, rebuild, setup
- post-rebuild-setup.sh: insert global secrets (MiniMax + GH PAT),
import org template, wait for platform health
Global secrets ensure every provisioned container gets MiniMax API
config and GitHub PAT injected as env vars automatically — no manual
settings.json deployment needed.
Usage: bash scripts/nuke-and-rebuild.sh
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes the canary loop with the escape hatch and a single place to
read about the whole flow.
scripts/rollback-latest.sh <sha>
uses crane to retag :latest ← :staging-<sha> for BOTH the platform
and tenant images. Pre-checks the target tag exists and verifies
the :latest digest after the move so a bad ops typo doesn't
silently promote the wrong thing. Prod tenants auto-update to the
rolled-back digest within their 5-min cycle. Exit codes: 0 = both
retagged, 1 = registry/tag error, 2 = usage error.
docs/architecture/canary-release.md
The one-page map of the pipeline: how PR → main → staging-<sha> →
canary smoke → :latest promotion works end-to-end, how to add a
canary tenant, how to roll back, and what this gate explicitly does
NOT catch (prod-only data, config drift, cross-tenant bugs).
No code changes in the CP or workspace-server — this PR is shell
+ docs only, so it's safe to land independently of the other Phase
{1,1.5,2,3} PRs still in review.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Post-deploy verification for staging tenant images. Runs against the
canary fleet after each publish-workspace-server-image build — catches
auto-update breakage (a la today's E2E current_task drift) before it
propagates to the prod tenant fleet that auto-pulls :latest every 5 min.
scripts/canary-smoke.sh iterates a space-sep list of canary base URLs
(paired with their ADMIN_TOKENs) and checks:
- /admin/liveness reachable with admin bearer (tenant boot OK)
- /workspaces list responds (wsAuth + DB path OK)
- /memories/commit + /memories/search round-trip (encryption + scrubber)
- /events admin read (AdminAuth C4 path)
- /admin/liveness without bearer returns 401 (C4 fail-closed regression)
.github/workflows/canary-verify.yml runs after publish succeeds:
- 6-min sleep (tenant auto-updater pulls every 5 min)
- bash scripts/canary-smoke.sh with secrets pulled from repo settings
- on failure: writes a Step Summary flagging that :latest should be
rolled back to prior known-good digest
Phase 3 follow-up will split the publish workflow so only
:staging-<sha> ships initially, and canary-verify's green gate is
what promotes :staging-<sha> → :latest. This commit lays the test
gate alone so we have something running against tenants immediately.
Secrets to set in GitHub repo settings before this workflow can run:
- CANARY_TENANT_URLS (space-sep list)
- CANARY_ADMIN_TOKENS (same order as URLs)
- CANARY_CP_SHARED_SECRET (matches staging CP PROVISION_SHARED_SECRET)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
molecule_runtime's _deep_merge_hooks() uses unconditional list.extend()
when merging plugin settings-fragment.json files. On every plugin install
or reinstall each hook handler is re-appended, causing 3-4x duplicate
firings per event.
scripts/dedup_settings_hooks.py — idempotent live fix (reads via
/proc/*/root, no docker CLI required). Safe to re-run.
scripts/verify_settings_hooks.py — exits 1 if any container still has
duplicate hooks; used in CI health checks and manual audits.
Upstream fix needed in molecule_runtime._deep_merge_hooks() to
deduplicate by (matcher, frozenset(commands)) before writing. Track
separately.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two fixes:
1. publish-canvas-image.yml + publish-platform-image.yml: the JSON
heredoc for config.json had leading whitespace from YAML indentation,
producing invalid JSON. Docker fell back to osxkeychain → -25308.
Fixed by removing indentation inside the heredoc body.
2. Added scripts/dev-start.sh — one-command local dev environment.
Starts infra (docker-compose), platform (Go), and canvas (Next.js)
with proper health checks and cleanup on Ctrl-C.
Code review fixes:
- 🟡#1: Replace python3 with jq in Dockerfile template stages (~50MB → ~2MB)
- 🟡#2: Add clone count verification to scripts/clone-manifest.sh
(set -e + expected vs actual count check — fails build if any clone fails)
- 🟡#3: Drop 'unsafe-eval' from CSP (not needed for Next.js production
standalone builds, only dev mode). Updated test assertion.
- 🟡#4: Remove broken pyproject.toml from workspace-template/ (it claimed
to package as molecule-ai-workspace-runtime but the directory structure
didn't match — the real package ships from the standalone repo)
- 🔵#1: Add version-pinning TODO comment to manifest.json
- 🔵#3: Add full repo URLs + test counts for SDK/MCP/CLI/runtime in CLAUDE.md
Security (GitGuardian alert):
- Removed Telegram bot token (8633739353:AA...) from template-molecule-dev
pm/.env — replaced with ${TELEGRAM_BOT_TOKEN} placeholder
- Removed Claude OAuth token (sk-ant-oat01-...) from template-molecule-dev
root .env — replaced with ${CLAUDE_CODE_OAUTH_TOKEN} placeholder
- Both tokens need immediate rotation by the operator
Tests: Platform middleware tests updated + all pass.
Remove plugins/, workspace-configs-templates/, org-templates/ dirs (now
in standalone repos). Add manifest.json listing all 33 repos and
scripts/clone-manifest.sh to clone them. Both Dockerfiles now use the
manifest script instead of 33 hardcoded git-clone lines.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolves#17.
Part A: scripts/cleanup-rogue-workspaces.sh deletes workspaces whose id
or name starts with known test placeholder prefixes (aaaaaaaa-, etc.)
and force-removes the paired Docker container. Documented in
tests/README.md.
Part B: add a pre-flight check in provisionWorkspace() — when neither a
template path nor in-memory configFiles supplies config.yaml, probe the
existing named volume via a throwaway alpine container. If the volume
lacks config.yaml, mark the workspace status='failed' with a clear
last_sample_error instead of handing it to Docker's unless-stopped
restart policy (which otherwise loops forever on FileNotFoundError).
New pure helper provisioner.ValidateConfigSource + unit tests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>