The runner was speculatively calling `/workspaces/:id/heartbeat-history` —
that endpoint doesn't exist on workspace-server. On local dev it 404'd;
on tenant builds the platform's :8080 canvas-proxy fallback intercepted
it and returned 28KB of Next.js HTML which then landed in the JSON event
log. Neither outcome was useful trace data.
`GET /workspaces/:id/activity` is the existing endpoint that reads
activity_logs. That table already records the events the RFC §V1.0
step 6 'platform-side transition' check needs (a2a_send / a2a_receive /
task_update / agent_log / error, plus duration_ms + status). Rename
the runner's fetch + emitted event accordingly.
Verified: GET /workspaces/<uuid>/activity?since_secs=60 returns 200
with `[]` against the local platform; no SaaS skip needed since the
endpoint exists in both environments.
Refs: molecule-core#2256 (V1.0 gate #1 measurement comment).
Three review-driven fixes to the runner before #2261 merges:
1. `WAIT_ONLINE_SECS / 3` used truncating integer division; an operator
   passing 200 actually waited only 198s (66 polls × 3s). Round up so
   200 → 67 polls × 3s = 201s ≥ requested (see the sketch after this list).
2. The heartbeat-history endpoint isn't on tenant workspace-servers —
the platform's :8080 fallback proxies unmatched paths to the
canvas Next.js, so the SaaS run captured 28KB of HTML in the
`heartbeat_trace` event log. Skip the fetch in MODE=saas; emit an
explicit `<skipped: ...>` placeholder. Local mode behaviour
unchanged.
3. ORG_ID and ORG_SLUG had no client-side format check, so a typo'd
value got swallowed by TenantGuard's intentionally-opaque 404
(which doesn't tell the operator whether slug, UUID, or auth was
   wrong). Validate UUID and slug shape up front (also sketched below);
   matching errors are actionable.
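A rough sketch of fixes 1 and 3 (the UUID/slug regexes and exit codes here
are illustrative, not necessarily the runner's exact ones):

  POLL_INTERVAL=3
  # Ceiling division: WAIT_ONLINE_SECS=200 -> 67 polls (201s >= requested)
  # instead of the truncating 200 / 3 = 66 polls (198s).
  MAX_POLLS=$(( (WAIT_ONLINE_SECS + POLL_INTERVAL - 1) / POLL_INTERVAL ))

  # Fail fast on malformed tenant identifiers instead of letting
  # TenantGuard's intentionally-opaque 404 swallow the typo.
  uuid_re='^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$'
  slug_re='^[a-z0-9]([a-z0-9-]*[a-z0-9])?$'
  if [[ -n "${ORG_ID:-}" && ! "$ORG_ID" =~ $uuid_re ]]; then
    echo "ORG_ID does not look like a UUID: $ORG_ID" >&2; exit 2
  fi
  if [[ -n "${ORG_SLUG:-}" && ! "$ORG_SLUG" =~ $slug_re ]]; then
    echo "ORG_SLUG does not look like a slug: $ORG_SLUG" >&2; exit 2
  fi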
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two docs covering load-bearing patterns from today's work that
weren't previously discoverable:
1. workspace/platform_tools/README.md — explains the ToolSpec
single-source-of-truth pattern (#2240), the CLI-block alignment
gap that hand-maintained generation can't close (#2258), the
snapshot golden files + LF-pinning (#2260), and the add/rename/
remove playbook. The next reader who lands in
workspace/platform_tools/ now has the design rationale + the
safe-edit procedure colocated with the code.
2. scripts/README.md — disambiguates the three measure-coordinator-task-bounds*
   scripts that now exist across two repos:
- scripts/measure-coordinator-task-bounds.sh (canonical OSS, this repo)
- scripts/measure-coordinator-task-bounds-runner.sh (Hermes/MiniMax variant, this repo)
- scripts/measure-coordinator-task-bounds.sh (production-shape, in molecule-controlplane)
Cross-references reference_harness_pair_pattern (auto-memory) for
the cross-repo design rationale. Documents the common safety
pattern (cleanup trap, DRY_RUN, non-target guard,
cleanup_*_failed events) and the heartbeat-trace caveat.
Refs: #2240, #2254, #2257, #2258, #2259, #2260; molecule-controlplane#321.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The original measure-coordinator-task-bounds.sh was hardcoded for
local-dev (workspace-server on :8080) with claude-code/langgraph
templates and OPENROUTER_API_KEY. Running it against staging requires
both auth-chain plumbing (per-tenant ADMIN_TOKEN + X-Molecule-Org-Id
TenantGuard header + tenant subdomain routing) and template/secret
flexibility (e.g. Hermes/MiniMax for Token Plan keys).
This adds:
* `measure-coordinator-task-bounds-runner.sh` — separate runner that
wraps the same workspace-server API calls but takes everything as
env-var inputs. Two MODE values:
- `local` → direct workspace-server (no auth/tenant scoping)
- `saas` → tenant subdomain + per-tenant ADMIN_TOKEN bearer +
X-Molecule-Org-Id TenantGuard header. Auto-fetches
tenant token via CP /cp/admin/orgs/<slug>/admin-token
given ORG_SLUG + CP_ADMIN_API_TOKEN, OR accepts a
pre-resolved TENANT_ADMIN_TOKEN.
* Configurable PM_TEMPLATE / CHILD_TEMPLATE / MODEL / SECRET_NAME /
SECRET_VALUE — defaults match the original (claude-code-default +
langgraph + OpenRouter). Hermes/MiniMax example documented in the
header.
* Per-poll status_change events during wait_online, so a workspace
that never reaches online surfaces its last status (provisioning,
failed, etc.) instead of a bare timeout.
* WAIT_ONLINE_SECS knob (default 180s; SaaS cold-start needs ~420s
for first hermes-image pull on a freshly-provisioned EC2 tenant).
* `${args[@]+...}` guard on the api() helper — avoids `set -u`
exploding on an empty header array (the local-dev hot-path).
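A minimal sketch of the api() guard (function body simplified; header
assembly in the real runner differs):

  api() {
    local method=$1 path=$2
    local args=()
    if [[ "${MODE:-local}" == "saas" ]]; then
      args+=(-H "Authorization: Bearer ${TENANT_ADMIN_TOKEN}")
      args+=(-H "X-Molecule-Org-Id: ${ORG_ID}")
    fi
    # ${args[@]+...} expands to nothing when the array is empty, so the
    # local-mode hot-path doesn't trip `set -u` on the empty header array.
    curl -sS -X "$method" ${args[@]+"${args[@]}"} "${PLATFORM}${path}"
  }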
The original script also gained a SECRET_VALUE block earlier in the
session — that change (separately staged) makes the secret-name
configurable without forcing every operator through the new runner.
V1.0 gate #1 (RFC #2251, Issue 4 repro) measurement results posted
as a separate comment on molecule-core#2256.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-review follow-ups on #2257:
- Drop `local exit_code=$?` from cleanup(). `trap`-handler return values
are ignored, so capturing $? only misled a future reader into thinking
exit-code preservation was happening.
- Replace silenced `>/dev/null 2>&1` DELETE with `-w '%{http_code}'`
capture. ADMIN_TOKEN expiring mid-run was the realistic failure mode
here — previously we swallowed it under the silenced redirect, leaving
workspaces leaked with no signal. Now a 401/403/5xx surfaces as a
`cleanup_failed` JSON event with a remediation hint pointing at
cleanup-rogue-workspaces.sh; 404 is treated as success (the
post-condition — workspace absent — holds).
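A sketch of the shape of the change (emit_event and CREATED_WORKSPACES are
stand-ins for the script's actual JSON-event helper and bookkeeping):

  cleanup() {
    for ws_id in ${CREATED_WORKSPACES[@]+"${CREATED_WORKSPACES[@]}"}; do
      # -w '%{http_code}' replaces the old >/dev/null 2>&1 silencing, so an
      # expired ADMIN_TOKEN (401/403) or a 5xx is visible instead of swallowed.
      code=$(curl -sS -o /dev/null -w '%{http_code}' -X DELETE \
              -H "Authorization: Bearer ${ADMIN_TOKEN:-}" \
              "${PLATFORM}/workspaces/${ws_id}") || code=000
      case "$code" in
        2??|404) ;;  # 404: workspace already absent -- post-condition holds
        *) emit_event cleanup_failed "$ws_id" "$code" \
             "hint: cleanup-rogue-workspaces.sh" ;;
      esac
    done
  }
  trap cleanup EXIT INT TERM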
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three follow-ups from #2254 code review before the harness is safe to
run against staging:
1. Cleanup trap. Workspaces are now auto-deleted on EXIT/INT/TERM. A
Ctrl-C mid-run no longer leaks the PM + Researcher pair against
shared infra. KEEP_WORKSPACES=1 opts out for post-run inspection.
2. Tenant scoping + admin auth. Non-localhost PLATFORM values now
require both ADMIN_TOKEN and TENANT_ID; the script refuses to run
without them. The previous version sent unauthenticated POSTs that,
on staging, would either 401 every request or — worse — provision
into the wrong tenant. Memory `feedback_never_run_cluster_cleanup_
tests_on_live_platform` calls out the same hazard class.
3. DRY_RUN=1 mode. Prints platform target, tenant id, auth fingerprint,
   and the planned actions, then exits before any state mutation. The
   intended pre-flight before running against staging (sketched below).
Also tightened OR_KEY check (the chained default silently accepted an
empty OPENROUTER_API_KEY) and added a heartbeat-trace caveat to the
interpretation guide explaining what `<endpoint_unavailable>` means
for the bound question.
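Roughly what the gating looks like (PLATFORM / ADMIN_TOKEN / TENANT_ID /
DRY_RUN / OPENROUTER_API_KEY are the script's env vars; the fingerprint
format is illustrative):

  if [[ "$PLATFORM" != *localhost* && "$PLATFORM" != *127.0.0.1* ]]; then
    # Refuse to touch a non-local platform without tenant scoping + admin auth.
    : "${ADMIN_TOKEN:?ADMIN_TOKEN required for non-localhost PLATFORM}"
    : "${TENANT_ID:?TENANT_ID required for non-localhost PLATFORM}"
  fi

  # A chained default accepts an empty OPENROUTER_API_KEY silently;
  # :? requires it to be set AND non-empty.
  OR_KEY="${OPENROUTER_API_KEY:?OPENROUTER_API_KEY must be set and non-empty}"

  if [[ "${DRY_RUN:-0}" == "1" ]]; then
    echo "target:  $PLATFORM"
    echo "tenant:  ${TENANT_ID:-<none>}"
    echo "auth:    ${ADMIN_TOKEN:0:6}... (fingerprint)"
    echo "actions: provision PM + Researcher, A2A kickoff, delete on exit"
    exit 0
  fi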
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a reproduction harness for Issue 4 of the 2026-04-28 CP review,
referenced in RFC molecule-core#2251. The RFC review (issue #2251
comment) flagged that Issue 4 was hypothesized but not reproduced
before V1.0 implementation begins — this script closes that gap.
What it does:
- Provisions a coordinator (PM, claude-code-default) + 1 child
(Researcher, langgraph) via the platform API.
- Sends an A2A kickoff with a synthesis-heavy task that requires
SYNTHESIS_DEPTH (default 3) sequential delegations followed by a
600-word post-delegation synthesis.
- Times the coordinator's full A2A round-trip with millisecond
precision and emits one JSON event per phase (machine-readable).
- Pulls the coordinator's heartbeat trace post-run so the team can
see whether any platform-side state transition fired during the
long synthesis (the V1.0 RFC's MAX_TASK_EXECUTION_SECS would
surface as such a transition; absence of one in this trace
confirms the RFC's premise).
Why a measurement harness, not a pass/fail test:
Issue 4's claim is "absence of platform-side bound", which is hard
to assert in a single CI run. Outputting structured measurement
data lets the team interpret across multiple runs / staging vs
prod / different SYNTHESIS_DEPTH values rather than relying on one
reproduction snapshot.
The script's header has the full interpretation guide:
- ELAPSED < 60s → not informative (LLM was just fast)
- 60–300s → within DELEGATION_TIMEOUT, ambiguous
- >= 300s without trace transitions → BUG CONFIRMED
- curl_failed → coordinator hung past A2A_TIMEOUT or genuinely
slow (disambiguate by querying status separately)
Doesn't run in CI by default — invoked manually against staging or a
local platform with PLATFORM=... and OPENROUTER_API_KEY=... env vars.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The PR-built wheel + import smoke gate refused the platform_tools
package because it's a new subdirectory under workspace/ that wasn't
in scripts/build_runtime_package.py:SUBPACKAGES. The drift gate (which
exists for exactly this reason) caught it cleanly:
error: SUBPACKAGES drifted from workspace/ subdirectories:
in workspace/ but NOT in SUBPACKAGES (will ship un-rewritten or
be excluded): ['platform_tools']
Adding platform_tools to SUBPACKAGES wires the package into the
runtime wheel + applies the canonical
from platform_tools.<x> -> from molecule_runtime.platform_tools.<x>
import-rewrite step that every other subpackage uses.
Verified locally: scripts/build_runtime_package.py succeeds, the
rewritten a2a_mcp_server.py reads
from molecule_runtime.platform_tools.registry import TOOLS
which matches the package layout in the wheel.
Manual fresh-user clean-slate test surfaced three friction points in
the existing dev-start.sh:
1. The script ran docker compose -f docker-compose.infra.yml
directly, bypassing infra/scripts/setup.sh — so the workspace
template registry was never populated and the canvas template
palette came up empty (the "Template palette is empty"
troubleshooting hit).
2. ADMIN_TOKEN was not handled at all. Without it, the AdminAuth
fail-open gate worked initially but slammed shut the moment the
first workspace registered a token — at which point the canvas
could no longer call /workspaces or /templates. New users hit
401s with no obvious next step.
3. The script wasn't mentioned in docs/quickstart.md. New users
followed the documented 4-step manual flow and never discovered
the single command existed.
Fixes:
- dev-start.sh now calls infra/scripts/setup.sh, which brings up
full infra (postgres + redis + langfuse + clickhouse + temporal)
AND populates the template/plugin registry from manifest.json.
- On first run, dev-start.sh writes MOLECULE_ENV=development to
.env. This activates middleware.isDevModeFailOpen() which lets
the canvas keep calling admin endpoints without a bearer (the
intended local-dev escape hatch). The .env is preserved on
re-runs and sourced before the platform launches.
- The script intentionally does NOT auto-generate an ADMIN_TOKEN.
A first attempt did, and broke the canvas because isDevModeFailOpen
requires ADMIN_TOKEN empty AND MOLECULE_ENV=development together.
Setting ADMIN_TOKEN in dev would close the hatch and the canvas
has no way to read that token in a dev build (no
NEXT_PUBLIC_ADMIN_TOKEN bake step here). The .env comment block
explicitly warns future contributors not to add it.
- Both processes' logs go to /tmp/molecule-{platform,canvas}.log
instead of stdout-mixed so the readiness banner stays clean.
- Health-poll loops cap at 30s with a clear timeout error pointing
  to the log file, instead of hanging forever (sketched after this list).
- The readiness banner now lists the log paths AND tells the user
the next step is "open localhost:3000 → add API key in Config →
Secrets & API Keys → Global", instead of just listing service
URLs.
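The health-poll shape, roughly (URLs and log paths as used elsewhere in this
repo; the script's exact endpoints may differ):

  wait_healthy() {
    local name=$1 url=$2 log=$3 deadline=$((SECONDS + 30))
    until curl -fsS "$url" >/dev/null 2>&1; do
      if (( SECONDS >= deadline )); then
        echo "ERROR: $name not healthy after 30s -- see $log" >&2
        return 1
      fi
      sleep 1
    done
  }

  wait_healthy platform http://localhost:8080/health /tmp/molecule-platform.log
  wait_healthy canvas   http://localhost:3000        /tmp/molecule-canvas.log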
Quickstart doc rewrite leads with:
git clone ...
cd molecule-monorepo
./scripts/dev-start.sh
The 4-step manual flow is preserved as "Manual setup (advanced)"
for contributors who want per-component logs.
Verified end-to-end from clean Docker (no containers, no volumes,
no .env) three times: total wall-clock ~12s for a re-run with
cached npm/docker layers. Platform's HTTP 200 on /workspaces
without a bearer confirms the dev-mode auth hatch is active.
Two compounding bugs that bit hermes (and any other workspace that
reaches main.py:142):
1. workspace/lib/ was in EXCLUDE_DIRS so the published wheel didn't
contain the directory at all. main.py imports `from lib.pre_stop
import read_snapshot` (and `build_snapshot`, `write_snapshot`) so
every workspace startup that reaches the snapshot path crashed
with `ModuleNotFoundError: No module named 'lib'`.
2. Even if lib/ had shipped, `lib` wasn't in SUBPACKAGES so the
import-rewriter would have left the bare `from lib.pre_stop`
unqualified — it would still fail because the package would only
be reachable as `molecule_runtime.lib`.
Fix: move `lib` from EXCLUDE_DIRS to SUBPACKAGES (one entry each).
Drift gate extension: the existing gate I added in #2163 only
asserted TOP_LEVEL_MODULES against workspace/*.py. This change adds
the symmetric assertion for SUBPACKAGES against workspace/<dir>/
(filtered by EXCLUDE_DIRS + presence of __init__.py). Catches:
- Subpackage added to workspace/ but missed in SUBPACKAGES
- Subpackage missing from workspace/ but lingering in SUBPACKAGES
- Subpackage wrongly in EXCLUDE_DIRS while also referenced by
rewritten imports (the lib case)
Tested locally: build of 0.1.99 now ships lib/ and main.py contains
`from molecule_runtime.lib.pre_stop import ...` correctly rewritten.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#2000 fixed one symptom — TENANT_IMAGE pinned to `staging-a14cf86`
(10 days stale) silently no-op'd four upstream fixes on 2026-04-24.
This adds the audit pattern as a re-runnable script so the broader
class is observable on demand without new CI infrastructure.
Audit results today (2026-04-27):
controlplane / production: 54 vars audited, 0 drift-prone pins
controlplane / staging: 52 vars audited, 0 drift-prone pins
So the immediate audit deliverable is clean — TENANT_IMAGE is the only
known violation and #2000 already fixed it. The script makes the
ongoing audit a 5-second command instead of a manual one.
Detection regex catches:
* branch-SHA suffixes (`staging|main|prod|production-<6+ hex>`)
— the exact 2026-04-24 incident shape
* version pins after `:` or `=` (`:v1.2.3`, `=v0.1.16`)
— same drift class, just rendered differently
Anchoring on `:` or `=` keeps prose like "version 1.2.3 of the api"
out of the false-positive set. UUIDs, ARNs, AMI IDs, secrets, and
floating tags (`:staging-latest`, `:main`) pass through untouched.
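An approximation of the pattern (the script's exact regex may differ
slightly):

  pin_regex='(staging|main|prod|production)-[0-9a-f]{6,}|[:=]v?[0-9]+\.[0-9]+\.[0-9]+'

  echo 'TENANT_IMAGE=ghcr.io/acme/tenant:staging-a14cf86' | grep -Eq "$pin_regex" && echo flagged
  echo 'API_IMAGE=registry/app:v1.2.3'                    | grep -Eq "$pin_regex" && echo flagged
  echo 'TENANT_IMAGE=ghcr.io/acme/tenant:staging-latest'  | grep -Eq "$pin_regex" || echo passes
  echo 'NOTE: version 1.2.3 of the api'                   | grep -Eq "$pin_regex" || echo passes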
Regression test (tests/ops/test_audit_railway_sha_pins.sh) pins 20
representative cases — 9 should-flag (covering all four branch
prefixes + semver variants + middle-of-value matches) and 11
should-pass (the false-positive guards). Same regex inlined in both
files so a future tweak that weakens detection fails the test in
lockstep with weakening the audit.
Both files shellcheck clean.
CI gate (acceptance criterion's "regression: add a CI check") is
deliberately scoped out — querying Railway from CI requires plumbing
RAILWAY_TOKEN as a repo secret, which is multi-step setup. The
re-runnable script + test cover the same surface today; the CI
workflow is a small follow-up once the token is provisioned.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two compounding bugs surfaced when 0.1.16 hit production today:
1. scripts/build_runtime_package.py had a hand-curated TOP_LEVEL_MODULES
set listing every workspace/*.py that should get its bare imports
rewritten to `molecule_runtime.X`. The set silently went stale:
- Missing: transcript_auth (added since #87 phase 1c), runtime_wedge,
watcher → unrewritten imports shipped, every workspace startup
died with ModuleNotFoundError.
- Stale: claude_sdk_executor, cli_executor (both removed in #87),
hermes_executor (never existed) → harmless but misleading.
2. publish-runtime.yml's wheel-smoke step asserted on stable invariants
(BaseAdapter, AdapterConfig, a2a_client error sentinel) but never
imported main. So even though main.py held the broken bare
`from transcript_auth import ...`, the smoke check passed.
Fixes:
- Build script now derives the on-disk module set from workspace/*.py
and asserts it matches TOP_LEVEL_MODULES exactly. Drift in either
direction fails the build with a specific diff message instead of
shipping a broken wheel. Closed-list typo guard preserved (we still
edit the set explicitly when a module is added/removed) — the gate
just makes drift impossible to ignore.
- TOP_LEVEL_MODULES updated to current reality: drop the 3 stale,
add the 3 missing.
- publish-runtime.yml wheel-smoke now runs `import molecule_runtime.main`
before the invariant asserts. main is the entry point and
transitively imports every module — any bare-import bug surfaces
as ModuleNotFoundError before PyPI accepts the upload.
Tested locally: `python3 scripts/build_runtime_package.py
--version 0.1.99 --out /tmp/build-test` succeeds, and
/tmp/build-test/molecule_runtime/main.py contains the rewritten
`from molecule_runtime.transcript_auth import ...`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two paper cuts the fix addresses:
1. nuke-and-rebuild.sh wipes the compose stack but never re-populates
workspace-configs-templates/, org-templates/, or plugins/. Those dirs
are .gitignored — the curated set lives in manifest.json as external
repos cloned via clone-manifest.sh (idempotent). Without that step,
a fresh checkout or a post-deletion run leaves the dirs empty, which
silently hides the entire template palette in Canvas + falls back to
bare default workspace provisioning. Symptom: "Deploy your first
agent" shows zero templates.
2. The existing ws-* container reap was already in the script (good),
but it only fires when this script runs. Folks running `docker compose
down -v` directly leave orphan ws-* containers behind. Documented
that explicitly in the script comment so future readers understand
why those lines are critical.
The fix is just `bash clone-manifest.sh` added to the script.
clone-manifest.sh is idempotent — populated dirs short-circuit, so a
re-nuke on a healthy machine pays only a few stat calls.
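The short-circuit, roughly (helper name illustrative):

  clone_if_missing() {
    local repo_url=$1 dest=$2
    # Populated dir -> skip; keeps re-runs (and re-nukes) cheap and idempotent.
    if [[ -d "$dest" && -n "$(ls -A "$dest" 2>/dev/null)" ]]; then
      echo "skip: $dest already populated"
      return 0
    fi
    git clone "$repo_url" "$dest"
  }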
scripts/test-nuke-and-rebuild.sh exercises the canonical workflow end-
to-end:
- plants a fake orphan ws-* container, then asserts it gets reaped
- renames the manifest dirs to simulate a fresh checkout, then
asserts they get repopulated
- waits for /health and asserts the platform sees the same template
count on disk as via /configs in the container (catches bind-mount
drift)
- asserts the image-auto-refresh watcher (PR #2114) starts, since
that's load-bearing for the CD chain users now rely on
The test pre-flights ports 5432/6379/8080 and exits 0 with a SKIP
message if a non-target compose project is holding them — common when
parallel monorepo checkouts coexist on one Docker daemon.
scripts/ is intentionally outside CI shellcheck per ci.yml comment, but
both files pass `shellcheck --severity=warning` anyway.
Defers but does not solve the runtime root-cause for orphan ws-* after
plain `docker compose down -v`: the orphan-sweeper in the platform only
reaps containers whose workspace row says status='removed', so a wiped
DB → no row → sweeper ignores them. Proper fix needs container labels
keyed to a per-platform-instance UUID so the sweeper can confidently
reap "containers I provisioned that aren't in my DB anymore" without
nuking a sibling platform's containers on a shared daemon. Tracked as
task #109's follow-up; out of scope for this PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The production-side end of the runtime CD chain. Operators (or the post-
publish CI workflow) hit this after a runtime release to pull the latest
workspace-template-* images from GHCR and recreate any running ws-* containers
so they adopt the new image. Without this, freshly-published runtime sat in
the registry but containers kept the old image until naturally cycled.
Implementation notes:
- Uses Docker SDK ImagePull rather than shelling out to docker CLI — the
alpine platform container has no docker CLI installed.
- ghcrAuthHeader() reads GHCR_USER + GHCR_TOKEN env, builds the base64-
encoded JSON payload Docker engine expects in PullOptions.RegistryAuth.
Both empty → public/cached images only; both set → private GHCR pulls.
- Container matching uses ContainerInspect (NOT ContainerList) because
ContainerList returns the resolved digest in .Image, not the human tag.
Inspect surfaces .Config.Image which is what we need.
- Provisioner.DefaultImagePlatform() exported so admin handler picks the
same Apple-Silicon-needs-amd64 platform as the provisioner — single
source of truth for the multi-arch override.
Local-dev companion: scripts/refresh-workspace-images.sh runs on the
host and inherits the host's docker keychain auth — alternate path for
when GHCR_USER/TOKEN aren't set in the platform env.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Code-quality + efficiency review of PR #2079:
- Hoist all_slugs = prod_slugs | staging_slugs out of decide() into the
caller (was rebuilt on every record — 1k records × ~50-slug union per
call). decide() signature now (r, all_slugs, ec2_names).
- Compile regexes at module scope (_WS_RE, _E2E_RE, _TENANT_RE) +
hoist platform-core literal set (_PLATFORM_CORE_NAMES). Same change
mirrored in the bash heredoc.
- Drop decorative # Rule N: comments (numbering was out of order, 3
before 2 — actively confusing).
- Move the "edits must mirror" reminder OUTSIDE the CANONICAL DECIDE
block in the .sh file, eliminating the .replace() comment-skip hack
in TestParityWithBashScript.
- Drop per-line .strip() in _slice_canonical (would mask a real
indentation bug; both blocks already at column 0).
- subTest() in TestPlatformCore loops so a single failure no longer
short-circuits the rest of the items.
- merge_group + concurrency on test-ops-scripts.yml (parity with
ci.yml gate behaviour).
- Fix the apostrophe in a "don't" inline comment that closed the python
  heredoc's single-quote and broke bash -n.
All 25 tests still pass. bash -n clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #2027.
The CF orphan sweep deletes DNS records — a misclassification could nuke
a live workspace's tunnel. The decision function had MAX_DELETE_PCT
percentage gating but no automated test of category → action mapping.
Approach: extract the decide() function to scripts/ops/sweep_cf_decide.py
as a verbatim copy bracketed by `# CANONICAL DECIDE BEGIN/END` markers.
The shell script keeps its inline heredoc (so the operational path is
untouched) but bracketed by the same markers. A parity test
(TestParityWithBashScript) reads both files and asserts the bracketed
blocks match line-for-line — drift fails CI loudly.
Coverage (25 tests, 1 file, stdlib unittest only):
- Rule 1 platform-core: apex, _vercel, _domainkey, www/api/app/doc/send/status/staging-api
- Rule 3 ws-*: live (matches EC2 prefix) on prod + staging; orphan on prod + staging
- Rule 4 e2e-*: live + orphan on staging; orphan on prod
- Rule 2 generic tenant: live prod + staging; unknown subdomain kept-for-safety
- Rule 5 fallthrough: external domain + unrelated apex
- Rule priority: api.moleculesai.app stays platform-core (not tenant); _vercel stays verification
- Safety gate: under/at/over default 50% threshold; zero-total no-divide; custom threshold
- Empty live-sets: documents that decide() alone classifies as orphan, gate is the defense
CI: new .github/workflows/test-ops-scripts.yml runs `python -m unittest
discover` against scripts/ops/ on every PR/push that touches the
directory. Lightweight — no requirements file, stdlib only.
Local: `cd scripts/ops && python -m unittest test_sweep_cf_decide -v` →
25 tests, all OK.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The script's own help text documents `MAX_DELETE_PCT=62 ./sweep-cf-orphans.sh`
as the way to relax the safety gate, but the in-script assignment on line 35
was unconditional and overwrote any env value — so the override never worked.
During today's staging tenant-provision recovery (CP #255 context), the sweep
hit a 57% delete share and needed the documented override to clear 64 orphan
records. The one-line change to `${MAX_DELETE_PCT:-50}` honors the env
while keeping the 50% default when no caller overrides.
Ran with MAX_DELETE_PCT=62 after the fix — deleted 64 records, CF zone 111→47.
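For reference, the before/after of that assignment:

  # before: clobbers any caller-supplied value
  MAX_DELETE_PCT=50
  # after: 50 stays the default, but MAX_DELETE_PCT=62 ./sweep-cf-orphans.sh works
  MAX_DELETE_PCT="${MAX_DELETE_PCT:-50}"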
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the "panic-button at >65 records" manual sweep that nukes
every pattern-match unconditionally (would delete live workspaces
along with orphans).
This version:
- Queries CP prod + staging /admin/orgs for live tenant slugs
- Queries AWS EC2 describe-instances for live workspace Name tags
- Only deletes CF records whose slug/ws-id has no live counterpart
- Dry-run by default (--execute to actually delete)
- Safety gate refuses to delete >50% of records (configurable via
  MAX_DELETE_PCT env var) — catches the "API returned zero orgs, every
  tenant looks orphan" failure mode before it nukes production (sketched
  after this list)
- Per-category accounting: orphan-ws / orphan-e2e-tenant / etc.
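A sketch of the gate, using the counts from the sweep run described in the
previous entry as example inputs (the real script derives them from the CF
zone listing):

  MAX_DELETE_PCT="${MAX_DELETE_PCT:-50}"
  total_records=111   # records in the CF zone
  to_delete=64        # records classified orphan

  if (( total_records == 0 )); then
    echo "zone listing returned zero records; refusing to classify" >&2
    exit 1
  fi
  delete_pct=$(( to_delete * 100 / total_records ))   # 57 here
  if (( delete_pct > MAX_DELETE_PCT )); then
    echo "refusing: would delete ${delete_pct}% of records (> ${MAX_DELETE_PCT}%);" \
         "likely the org/EC2 listing came back empty -- override with MAX_DELETE_PCT=<n>" >&2
    exit 1
  fi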
Usage:
CF_API_TOKEN=... CF_ZONE_ID=... \
CP_PROD_ADMIN_TOKEN=... CP_STAGING_ADMIN_TOKEN=... \
bash scripts/ops/sweep-cf-orphans.sh # dry-run
bash scripts/ops/sweep-cf-orphans.sh --execute # actually delete
Ref: #1976 (root-cause: tenant.Delete + workspace.Delete don't clean
their CF records — until that's fixed, this script is the maintenance
path)
Co-authored-by: Hongming Wang <hongmingwang.rabbit@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: molecule-ai[bot] <276602405+molecule-ai[bot]@users.noreply.github.com>
The Canvas template palette was empty on a fresh clone because
`workspace-configs-templates/`, `org-templates/`, and `plugins/` are
gitignored and nothing populated them. The registry already exists —
`manifest.json` at repo root lists every curated
`workspace-template-*`, `org-template-*`, and `plugin-*` repo, and
`scripts/clone-manifest.sh` clones them — but the step was absent
from the README and setup.sh, so new users never ran it.
### What this commit does
**1. `setup.sh` runs `clone-manifest.sh` automatically** (once).
After starting the Docker network but before booting infra, iterate
`manifest.json` and clone any workspace_templates / org_templates /
plugins that aren't already populated. Idempotent — subsequent
runs skip dirs that have content. Requires `jq`; when jq is missing
the step prints a clear install hint and skips (doesn't fail).
**2. `clone-manifest.sh` is idempotent.** Before running `git clone`,
check whether the target directory already exists and is non-empty —
skip if so. Lets `setup.sh` rerun safely without forcing the operator
to delete already-cloned template repos.
**3. `ListTemplates` logs the reason it skips a template.** The
handler previously swallowed `resolveYAMLIncludes` errors with
`continue`, so a broken template showed up as an empty palette with
no log trail. Now the include-expansion and yaml.Unmarshal failure
paths both emit a descriptive `log.Printf` — the exact message that
made the stale `org-templates/molecule-dev/` snapshot debuggable:
ListTemplates: skipping molecule-dev — !include expansion failed:
!include "core-platform.yaml" at line 25: open .../teams/
core-platform.yaml: no such file or directory
**4. Remove the in-tree `org-templates/molecule-dev/` snapshot** (170
files). Matches the explicit intent of prior commit
`bfec9e53` — "remove org-templates/molecule-dev/ — standalone repo
is source of truth". A later "full staging snapshot" re-added a
partial copy that had `!include` references to 7 role files that
never existed in the snapshot (`core-platform.yaml`,
`controlplane.yaml`, `app-docs.yaml`, `infra.yaml`, `sdk.yaml`,
`release-manager/workspace.yaml`, `integration-tester/workspace.yaml`).
`clone-manifest.sh` repopulates it fresh from
`Molecule-AI/molecule-ai-org-template-molecule-dev`.
.gitignore exception for `molecule-dev/` is dropped accordingly
— the whole `/org-templates/*` tree is now gitignored, symmetric
with `/plugins/` and `/workspace-configs-templates/`.
**5. Doc updates** (README, README.zh-CN, CONTRIBUTING) mention `jq`
as a prerequisite and describe what setup.sh now does.
### Verification
On a fresh-nuked DB with the updated branch:
1. `bash infra/scripts/setup.sh` — cleanly clones 33/33 manifest
repos (20 plugins, 8 workspace_templates, 5 org_templates), then
boots infra. Second run skips all 33 (idempotent).
2. `go run ./cmd/server` — "Applied 41 migrations", :8080 healthy.
3. `curl http://localhost:8080/org/templates` returns 4 templates
(was `[]`):
- Free Beats All
- MeDo Smoke Test
- Molecule AI Worker Team (Gemini)
- Reno Stars Agent Team
4. `bash tests/e2e/test_api.sh` — 61/61 pass.
5. `npx vitest run` in canvas — 902/902 pass.
6. `shellcheck infra/scripts/setup.sh` — clean.
### SaaS parity
All changes are local-dev surface. `setup.sh`, `clone-manifest.sh`,
and the local `org-templates/` directory aren't part of the CP
provisioner path — SaaS tenant machines get their templates via
Dockerfile layers or CP-side provisioning, not `clone-manifest.sh`.
The `ListTemplates` log addition is harmless either way (replaces a
silent `continue` with a `log.Printf + continue`).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- dev-start.sh: $ROOT/platform → $ROOT/workspace-server (Go server
lives in workspace-server/, not platform/; any developer running
this script would get "no such directory" immediately)
- nuke-and-rebuild.sh: add ROOT variable and -f "$ROOT/docker-compose.yml"
so docker compose works from any CWD; fix post-rebuild-setup.sh path
- rollback-latest.sh: add 'local' to src_digest and new_digest vars
inside roll() function to prevent global-scope leakage
Co-authored-by: Molecule AI Core-DevOps <core-devops@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
GitGuardian detected exposed MiniMax API key and GitHub PAT in the
script's default values. Replaced with env var reads from .env file
(which is gitignored). Script now validates required secrets exist
before proceeding.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two scripts:
- nuke-and-rebuild.sh: docker down -v, clean orphans, rebuild, setup
- post-rebuild-setup.sh: insert global secrets (MiniMax + GH PAT),
import org template, wait for platform health
Global secrets ensure every provisioned container gets MiniMax API
config and GitHub PAT injected as env vars automatically — no manual
settings.json deployment needed.
Usage: bash scripts/nuke-and-rebuild.sh
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes the canary loop with the escape hatch and a single place to
read about the whole flow.
scripts/rollback-latest.sh <sha>
uses crane to retag :latest ← :staging-<sha> for BOTH the platform
and tenant images. Pre-checks the target tag exists and verifies
the :latest digest after the move so a bad ops typo doesn't
silently promote the wrong thing. Prod tenants auto-update to the
rolled-back digest within their 5-min cycle. Exit codes: 0 = both
retagged, 1 = registry/tag error, 2 = usage error.
docs/architecture/canary-release.md
The one-page map of the pipeline: how PR → main → staging-<sha> →
canary smoke → :latest promotion works end-to-end, how to add a
canary tenant, how to roll back, and what this gate explicitly does
NOT catch (prod-only data, config drift, cross-tenant bugs).
No code changes in the CP or workspace-server — this PR is shell
+ docs only, so it's safe to land independently of the other Phase
{1,1.5,2,3} PRs still in review.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Post-deploy verification for staging tenant images. Runs against the
canary fleet after each publish-workspace-server-image build — catches
auto-update breakage (a la today's E2E current_task drift) before it
propagates to the prod tenant fleet that auto-pulls :latest every 5 min.
scripts/canary-smoke.sh iterates a space-sep list of canary base URLs
(paired with their ADMIN_TOKENs) and checks:
- /admin/liveness reachable with admin bearer (tenant boot OK)
- /workspaces list responds (wsAuth + DB path OK)
- /memories/commit + /memories/search round-trip (encryption + scrubber)
- /events admin read (AdminAuth C4 path)
- /admin/liveness without bearer returns 401 (C4 fail-closed regression)
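Roughly what one tenant's pass looks like (check_tenant is a stand-in; the
memories round-trip and /events read follow the same curl pattern):

  check_tenant() {
    local base=$1 token=$2
    curl -fsS -H "Authorization: Bearer $token" "$base/admin/liveness" >/dev/null \
      || { echo "FAIL $base: liveness with bearer" >&2; return 1; }
    curl -fsS -H "Authorization: Bearer $token" "$base/workspaces" >/dev/null \
      || { echo "FAIL $base: /workspaces list" >&2; return 1; }
    # C4 fail-closed regression: missing bearer must come back 401.
    local code
    code=$(curl -s -o /dev/null -w '%{http_code}' "$base/admin/liveness")
    [[ "$code" == "401" ]] \
      || { echo "FAIL $base: expected 401 without bearer, got $code" >&2; return 1; }
  }

  read -ra urls   <<< "$CANARY_TENANT_URLS"
  read -ra tokens <<< "$CANARY_ADMIN_TOKENS"
  for i in "${!urls[@]}"; do
    check_tenant "${urls[$i]}" "${tokens[$i]}"
  done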
.github/workflows/canary-verify.yml runs after publish succeeds:
- 6-min sleep (tenant auto-updater pulls every 5 min)
- bash scripts/canary-smoke.sh with secrets pulled from repo settings
- on failure: writes a Step Summary flagging that :latest should be
rolled back to prior known-good digest
Phase 3 follow-up will split the publish workflow so only
:staging-<sha> ships initially, and canary-verify's green gate is
what promotes :staging-<sha> → :latest. This commit lays the test
gate alone so we have something running against tenants immediately.
Secrets to set in GitHub repo settings before this workflow can run:
- CANARY_TENANT_URLS (space-sep list)
- CANARY_ADMIN_TOKENS (same order as URLs)
- CANARY_CP_SHARED_SECRET (matches staging CP PROVISION_SHARED_SECRET)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
molecule_runtime's _deep_merge_hooks() uses unconditional list.extend()
when merging plugin settings-fragment.json files. On every plugin install
or reinstall each hook handler is re-appended, causing 3-4x duplicate
firings per event.
scripts/dedup_settings_hooks.py — idempotent live fix (reads via
/proc/*/root, no docker CLI required). Safe to re-run.
scripts/verify_settings_hooks.py — exits 1 if any container still has
duplicate hooks; used in CI health checks and manual audits.
Upstream fix needed in molecule_runtime._deep_merge_hooks() to
deduplicate by (matcher, frozenset(commands)) before writing. Track
separately.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two fixes:
1. publish-canvas-image.yml + publish-platform-image.yml: the JSON
   heredoc for config.json had leading whitespace from YAML indentation,
   producing invalid JSON. Docker fell back to osxkeychain → -25308.
   Fixed by removing indentation inside the heredoc body (sketched after
   this list).
2. Added scripts/dev-start.sh — one-command local dev environment.
Starts infra (docker-compose), platform (Go), and canvas (Next.js)
with proper health checks and cleanup on Ctrl-C.
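The failure mode in fix 1, for the record: a heredoc body indented to match
the surrounding YAML carries those leading spaces into the written file, so
Docker reads invalid JSON and falls back to osxkeychain. Keeping the body
and terminator at column 0 emits clean JSON (target path and content here
are illustrative):

cat > "$HOME/.docker/config.json" <<EOF
{
  "auths": {}
}
EOF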
Code review fixes:
- 🟡#1: Replace python3 with jq in Dockerfile template stages (~50MB → ~2MB)
- 🟡#2: Add clone count verification to scripts/clone-manifest.sh
(set -e + expected vs actual count check — fails build if any clone fails)
- 🟡#3: Drop 'unsafe-eval' from CSP (not needed for Next.js production
standalone builds, only dev mode). Updated test assertion.
- 🟡#4: Remove broken pyproject.toml from workspace-template/ (it claimed
to package as molecule-ai-workspace-runtime but the directory structure
didn't match — the real package ships from the standalone repo)
- 🔵#1: Add version-pinning TODO comment to manifest.json
- 🔵#3: Add full repo URLs + test counts for SDK/MCP/CLI/runtime in CLAUDE.md
Security (GitGuardian alert):
- Removed Telegram bot token (8633739353:AA...) from template-molecule-dev
pm/.env — replaced with ${TELEGRAM_BOT_TOKEN} placeholder
- Removed Claude OAuth token (sk-ant-oat01-...) from template-molecule-dev
root .env — replaced with ${CLAUDE_CODE_OAUTH_TOKEN} placeholder
- Both tokens need immediate rotation by the operator
Tests: Platform middleware tests updated + all pass.
Remove plugins/, workspace-configs-templates/, org-templates/ dirs (now
in standalone repos). Add manifest.json listing all 33 repos and
scripts/clone-manifest.sh to clone them. Both Dockerfiles now use the
manifest script instead of 33 hardcoded git-clone lines.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolves #17.
Part A: scripts/cleanup-rogue-workspaces.sh deletes workspaces whose id
or name starts with known test placeholder prefixes (aaaaaaaa-, etc.)
and force-removes the paired Docker container. Documented in
tests/README.md.
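Part A's core loop, roughly (prefix list, response shape, and container
naming are illustrative; see the script for the real ones):

  PLACEHOLDER_PREFIXES=(aaaaaaaa-)   # extend with the other known test prefixes

  curl -fsS http://localhost:8080/workspaces |
    jq -r '.[] | "\(.id) \(.name)"' |
    while read -r id name; do
      for prefix in "${PLACEHOLDER_PREFIXES[@]}"; do
        if [[ "$id" == "$prefix"* || "$name" == "$prefix"* ]]; then
          curl -fsS -X DELETE "http://localhost:8080/workspaces/$id" || true
          docker rm -f "ws-$id" 2>/dev/null || true
          break
        fi
      done
    done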
Part B: add a pre-flight check in provisionWorkspace() — when neither a
template path nor in-memory configFiles supplies config.yaml, probe the
existing named volume via a throwaway alpine container. If the volume
lacks config.yaml, mark the workspace status='failed' with a clear
last_sample_error instead of handing it to Docker's unless-stopped
restart policy (which otherwise loops forever on FileNotFoundError).
New pure helper provisioner.ValidateConfigSource + unit tests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>