Commit Graph

25 Commits

Author SHA1 Message Date
Hongming Wang
f2c3594abc feat(dev-start): true single-command spinup — infra + templates + auth posture
Manual fresh-user clean-slate test surfaced three friction points in
the existing dev-start.sh:

  1. The script ran docker compose -f docker-compose.infra.yml
     directly, bypassing infra/scripts/setup.sh — so the workspace
     template registry was never populated and the canvas template
     palette came up empty (the "Template palette is empty"
     troubleshooting hit).
  2. ADMIN_TOKEN was not handled at all. Without it, the AdminAuth
     fail-open gate worked initially but slammed shut the moment the
     first workspace registered a token — at which point the canvas
     could no longer call /workspaces or /templates. New users hit
     401s with no obvious next step.
  3. The script wasn't mentioned in docs/quickstart.md. New users
     followed the documented 4-step manual flow and never discovered
     the single command existed.

Fixes:

  - dev-start.sh now calls infra/scripts/setup.sh, which brings up
    full infra (postgres + redis + langfuse + clickhouse + temporal)
    AND populates the template/plugin registry from manifest.json.
  - On first run, dev-start.sh writes MOLECULE_ENV=development to
    .env. This activates middleware.isDevModeFailOpen() which lets
    the canvas keep calling admin endpoints without a bearer (the
    intended local-dev escape hatch). The .env is preserved on
    re-runs and sourced before the platform launches.
  - The script intentionally does NOT auto-generate an ADMIN_TOKEN.
    A first attempt did, and broke the canvas because isDevModeFailOpen
    requires ADMIN_TOKEN empty AND MOLECULE_ENV=development together.
    Setting ADMIN_TOKEN in dev would close the hatch and the canvas
    has no way to read that token in a dev build (no
    NEXT_PUBLIC_ADMIN_TOKEN bake step here). The .env comment block
    explicitly warns future contributors not to add it.
  - Both processes' logs go to /tmp/molecule-{platform,canvas}.log
    instead of stdout-mixed so the readiness banner stays clean.
  - Health-poll loops cap at 30s with a clear timeout error pointing
    to the log file, instead of hanging forever.
  - The readiness banner now lists the log paths AND tells the user
    the next step is "open localhost:3000 → add API key in Config →
    Secrets & API Keys → Global", instead of just listing service
    URLs.

Quickstart doc rewrite leads with:

    git clone ...
    cd molecule-monorepo
    ./scripts/dev-start.sh

The 4-step manual flow is preserved as "Manual setup (advanced)"
for contributors who want per-component logs.

Verified end-to-end from clean Docker (no containers, no volumes,
no .env) three times: total wall-clock ~12s for a re-run with
cached npm/docker layers. Platform's HTTP 200 on /workspaces
without a bearer confirms the dev-mode auth hatch is active.
2026-04-27 16:29:37 -07:00
Hongming Wang
6e732ab714 fix(build): ship lib/ subpackage + extend drift gate to SUBPACKAGES
Two compounding bugs that bit hermes (and any other workspace that
reaches main.py:142):

1. workspace/lib/ was in EXCLUDE_DIRS so the published wheel didn't
   contain the directory at all. main.py imports `from lib.pre_stop
   import read_snapshot` (and `build_snapshot`, `write_snapshot`) so
   every workspace startup that reaches the snapshot path crashed
   with `ModuleNotFoundError: No module named 'lib'`.

2. Even if lib/ had shipped, `lib` wasn't in SUBPACKAGES so the
   import-rewriter would have left the bare `from lib.pre_stop`
   unqualified — it would still fail because the package would only
   be reachable as `molecule_runtime.lib`.

Fix: move `lib` from EXCLUDE_DIRS to SUBPACKAGES (one entry each).

Drift gate extension: the existing gate I added in #2163 only
asserted TOP_LEVEL_MODULES against workspace/*.py. This change adds
the symmetric assertion for SUBPACKAGES against workspace/<dir>/
(filtered by EXCLUDE_DIRS + presence of __init__.py). Catches both:
- Subpackage added to workspace/ but missed in SUBPACKAGES
- Subpackage missing from workspace/ but lingering in SUBPACKAGES
- Subpackage wrongly in EXCLUDE_DIRS while also referenced by
  rewritten imports (the lib case)

Tested locally: build of 0.1.99 now ships lib/ and main.py contains
`from molecule_runtime.lib.pre_stop import ...` correctly rewritten.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 06:03:46 -07:00
Hongming Wang
026f5e51d9 ops: add Railway SHA-pin drift audit script + regression test (#2001)
#2000 fixed one symptom — TENANT_IMAGE pinned to `staging-a14cf86`
(10 days stale) silently no-op'd four upstream fixes on 2026-04-24.
This adds the audit pattern as a re-runnable script so the broader
class is observable on demand without new CI infrastructure.

Audit results today (2026-04-27):
  controlplane / production: 54 vars audited, 0 drift-prone pins
  controlplane / staging:    52 vars audited, 0 drift-prone pins

So the immediate audit deliverable is clean — TENANT_IMAGE is the only
known violation and #2000 already fixed it. The script makes the
ongoing audit a 5-second command instead of a manual one.

Detection regex catches:
  * branch-SHA suffixes (`staging|main|prod|production-<6+ hex>`)
    — the exact 2026-04-24 incident shape
  * version pins after `:` or `=`  (`:v1.2.3`, `=v0.1.16`)
    — same drift class, just rendered differently

Anchoring on `:` or `=` keeps prose like "version 1.2.3 of the api"
out of the false-positive set. UUIDs, ARNs, AMI IDs, secrets, and
floating tags (`:staging-latest`, `:main`) pass through untouched.

Regression test (tests/ops/test_audit_railway_sha_pins.sh) pins 20
representative cases — 9 should-flag (covering all four branch
prefixes + semver variants + middle-of-value matches) and 11
should-pass (the false-positive guards).  Same regex inlined in both
files so a future tweak that weakens detection fails the test in
lockstep with weakening the audit.

Both files shellcheck clean.

CI gate (acceptance criterion's "regression: add a CI check") is
deliberately scoped out — querying Railway from CI requires plumbing
RAILWAY_TOKEN as a repo secret, which is multi-step setup. The
re-runnable script + test cover the same surface today; the CI
workflow is a small follow-up once the token is provisioned.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 05:01:23 -07:00
Hongming Wang
c68dc1877f fix(release): drift-gate TOP_LEVEL_MODULES + smoke-import main in publish
Two compounding bugs surfaced when 0.1.16 hit production today:

1. scripts/build_runtime_package.py had a hand-curated TOP_LEVEL_MODULES
   set listing every workspace/*.py that should get its bare imports
   rewritten to `molecule_runtime.X`. The set silently went stale:
   - Missing: transcript_auth (added since #87 phase 1c), runtime_wedge,
     watcher → unrewritten imports shipped, every workspace startup
     died with ModuleNotFoundError.
   - Stale: claude_sdk_executor, cli_executor (both removed in #87),
     hermes_executor (never existed) → harmless but misleading.

2. publish-runtime.yml's wheel-smoke step asserted on stable invariants
   (BaseAdapter, AdapterConfig, a2a_client error sentinel) but never
   imported main. So even though main.py held the broken bare
   `from transcript_auth import ...`, the smoke check passed.

Fixes:

- Build script now derives the on-disk module set from workspace/*.py
  and asserts it matches TOP_LEVEL_MODULES exactly. Drift in either
  direction fails the build with a specific diff message instead of
  shipping a broken wheel. Closed-list typo guard preserved (we still
  edit the set explicitly when a module is added/removed) — the gate
  just makes drift impossible to ignore.

- TOP_LEVEL_MODULES updated to current reality: drop the 3 stale,
  add the 3 missing.

- publish-runtime.yml wheel-smoke now `import molecule_runtime.main`
  before the invariant asserts. main is the entry point and
  transitively imports every module — any bare-import bug surfaces
  as ModuleNotFoundError before PyPI accepts the upload.

Tested locally: `python3 scripts/build_runtime_package.py
--version 0.1.99 --out /tmp/build-test` succeeds, and
/tmp/build-test/molecule_runtime/main.py contains the rewritten
`from molecule_runtime.transcript_auth import ...`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 03:19:17 -07:00
Hongming Wang
44d0444aae fix(scripts): nuke-and-rebuild self-bootstraps templates; add E2E test
Two paper cuts the fix addresses:

1. nuke-and-rebuild.sh wipes the compose stack but never re-populates
   workspace-configs-templates/, org-templates/, or plugins/. Those dirs
   are .gitignored — the curated set lives in manifest.json as external
   repos cloned via clone-manifest.sh (idempotent). Without that step,
   a fresh checkout or a post-deletion run leaves the dirs empty, which
   silently hides the entire template palette in Canvas + falls back to
   bare default workspace provisioning. Symptom: "Deploy your first
   agent" shows zero templates.

2. The existing ws-* container reap was already in the script (good),
   but it only fires when this script runs. Folks running `docker compose
   down -v` directly leave orphan ws-* containers behind. Documented
   that explicitly in the script comment so future readers understand
   why those lines are critical.

The fix is just `bash clone-manifest.sh` added to the script. clone-
manifest.sh is idempotent — populated dirs short-circuit, so a re-nuke
on a healthy machine pays only a few stat calls.

scripts/test-nuke-and-rebuild.sh exercises the canonical workflow end-
to-end:
  - plants a fake orphan ws-* container, then asserts it gets reaped
  - renames the manifest dirs to simulate a fresh checkout, then
    asserts they get repopulated
  - waits for /health and asserts the platform sees the same template
    count on disk as via /configs in the container (catches bind-mount
    drift)
  - asserts the image-auto-refresh watcher (PR #2114) starts, since
    that's load-bearing for the CD chain users now rely on

The test pre-flights port 5432/6379/8080 and exits 0 with a SKIP
message if a non-target compose project is holding them — common when
parallel monorepo checkouts coexist on one Docker daemon.

scripts/ is intentionally outside CI shellcheck per ci.yml comment, but
both files pass `shellcheck --severity=warning` anyway.

Defers but does not solve the runtime root-cause for orphan ws-* after
plain `docker compose down -v`: the orphan-sweeper in the platform only
reaps containers whose workspace row says status='removed', so a wiped
DB → no row → sweeper ignores them. Proper fix needs container labels
keyed to a per-platform-instance UUID so the sweeper can confidently
reap "containers I provisioned that aren't in my DB anymore" without
nuking a sibling platform's containers on a shared daemon. Tracked as
task #109's follow-up; out of scope for this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 14:37:04 -07:00
Hongming Wang
0de67cd379 feat(platform/admin): /admin/workspace-images/refresh + Docker SDK + GHCR auth
The production-side end of the runtime CD chain. Operators (or the post-
publish CI workflow) hit this after a runtime release to pull the latest
workspace-template-* images from GHCR and recreate any running ws-* containers
so they adopt the new image. Without this, freshly-published runtime sat in
the registry but containers kept the old image until naturally cycled.

Implementation notes:
- Uses Docker SDK ImagePull rather than shelling out to docker CLI — the
  alpine platform container has no docker CLI installed.
- ghcrAuthHeader() reads GHCR_USER + GHCR_TOKEN env, builds the base64-
  encoded JSON payload Docker engine expects in PullOptions.RegistryAuth.
  Both empty → public/cached images only; both set → private GHCR pulls.
- Container matching uses ContainerInspect (NOT ContainerList) because
  ContainerList returns the resolved digest in .Image, not the human tag.
  Inspect surfaces .Config.Image which is what we need.
- Provisioner.DefaultImagePlatform() exported so admin handler picks the
  same Apple-Silicon-needs-amd64 platform as the provisioner — single
  source of truth for the multi-arch override.

Local-dev companion: scripts/refresh-workspace-images.sh runs on the
host and inherits the host's docker keychain auth — alternate path for
when GHCR_USER/TOKEN aren't set in the platform env.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-04-26 10:17:21 -07:00
rabbitblood
6494e9192b refactor(ops): apply simplify findings on #2027 PR
Code-quality + efficiency review of PR #2079:

- Hoist all_slugs = prod_slugs | staging_slugs out of decide() into the
  caller (was rebuilt on every record — 1k records × ~50-slug union per
  call). decide() signature now (r, all_slugs, ec2_names).
- Compile regexes at module scope (_WS_RE, _E2E_RE, _TENANT_RE) +
  hoist platform-core literal set (_PLATFORM_CORE_NAMES). Same change
  mirrored in the bash heredoc.
- Drop decorative # Rule N: comments (numbering was out of order, 3
  before 2 — actively confusing).
- Move the "edits must mirror" reminder OUTSIDE the CANONICAL DECIDE
  block in the .sh file, eliminating the .replace() comment-skip hack
  in TestParityWithBashScript.
- Drop per-line .strip() in _slice_canonical (would mask a real
  indentation bug; both blocks already at column 0).
- subTest() in TestPlatformCore loops so a single failure no longer
  short-circuits the rest of the items.
- merge_group + concurrency on test-ops-scripts.yml (parity with
  ci.yml gate behaviour).
- Fix don't apostrophe in inline comment that closed the python
  heredoc's single-quote and broke bash -n.

All 25 tests still pass. bash -n clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 00:28:15 -07:00
rabbitblood
ba78a5c00d test(ops): unit tests for sweep-cf-orphans decide() (#2027)
Closes #2027.

The CF orphan sweep deletes DNS records — a misclassification could nuke
a live workspace's tunnel. The decision function had MAX_DELETE_PCT
percentage gating but no automated test of category → action mapping.

Approach: extract the decide() function to scripts/ops/sweep_cf_decide.py
as a verbatim copy bracketed by `# CANONICAL DECIDE BEGIN/END` markers.
The shell script keeps its inline heredoc (so the operational path is
untouched) but bracketed by the same markers. A parity test
(TestParityWithBashScript) reads both files and asserts the bracketed
blocks match line-for-line — drift fails CI loudly.

Coverage (25 tests, 1 file, stdlib unittest only):
- Rule 1 platform-core: apex, _vercel, _domainkey, www/api/app/doc/send/status/staging-api
- Rule 3 ws-*: live (matches EC2 prefix) on prod + staging; orphan on prod + staging
- Rule 4 e2e-*: live + orphan on staging; orphan on prod
- Rule 2 generic tenant: live prod + staging; unknown subdomain kept-for-safety
- Rule 5 fallthrough: external domain + unrelated apex
- Rule priority: api.moleculesai.app stays platform-core (not tenant); _vercel stays verification
- Safety gate: under/at/over default 50% threshold; zero-total no-divide; custom threshold
- Empty live-sets: documents that decide() alone classifies as orphan, gate is the defense

CI: new .github/workflows/test-ops-scripts.yml runs `python -m unittest
discover` against scripts/ops/ on every PR/push that touches the
directory. Lightweight — no requirements file, stdlib only.

Local: `cd scripts/ops && python -m unittest test_sweep_cf_decide -v` →
25 tests, all OK.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 00:22:30 -07:00
Hongming Wang
817b8b0307 fix(scripts): make MAX_DELETE_PCT actually honor env override
The script's own help text documents \`MAX_DELETE_PCT=62 ./sweep-cf-orphans.sh\`
as the way to relax the safety gate, but the in-script assignment on line 35
was unconditional and overwrote any env value — so the override never worked.

During today's staging tenant-provision recovery (CP #255 context), hit the
57%-delete threshold and needed the documented override to clear 64 orphan
records. The one-char change to \`\${MAX_DELETE_PCT:-50}\` honors the env
while keeping the 50% default when no caller overrides.

Ran with MAX_DELETE_PCT=62 after the fix — deleted 64 records, CF zone 111→47.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 14:14:55 -07:00
Hongming Wang
0576e341b9
ops(#1976): add smart-sweep script for orphan Cloudflare DNS records (#1978)
Replaces the "panic-button at >65 records" manual sweep that nukes
every pattern-match unconditionally (would delete live workspaces
along with orphans).

This version:
- Queries CP prod + staging /admin/orgs for live tenant slugs
- Queries AWS EC2 describe-instances for live workspace Name tags
- Only deletes CF records whose slug/ws-id has no live counterpart
- Dry-run by default (--execute to actually delete)
- Safety gate refuses to delete >50% of records (configurable via
  MAX_DELETE_PCT env var) — catches the "API returned zero orgs, every
  tenant looks orphan" failure mode before it nukes production
- Per-category accounting: orphan-ws / orphan-e2e-tenant / etc.

Usage:
  CF_API_TOKEN=... CF_ZONE_ID=... \
    CP_PROD_ADMIN_TOKEN=... CP_STAGING_ADMIN_TOKEN=... \
    bash scripts/ops/sweep-cf-orphans.sh           # dry-run
  bash scripts/ops/sweep-cf-orphans.sh --execute   # actually delete

Ref: #1976 (root-cause: tenant.Delete + workspace.Delete don't clean
their CF records — until that's fixed, this script is the maintenance
path)

Co-authored-by: Hongming Wang <hongmingwang.rabbit@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: molecule-ai[bot] <276602405+molecule-ai[bot]@users.noreply.github.com>
2026-04-24 04:19:49 +00:00
Hongming Wang
96cc4b0c42 fix(quickstart): wire up template/plugin registry via manifest.json
The Canvas template palette was empty on a fresh clone because
`workspace-configs-templates/`, `org-templates/`, and `plugins/` are
gitignored and nothing populated them. The registry already exists —
`manifest.json` at repo root lists every curated
`workspace-template-*`, `org-template-*`, and `plugin-*` repo, and
`scripts/clone-manifest.sh` clones them — but the step was absent
from the README and setup.sh, so new users never ran it.

### What this commit does

**1. `setup.sh` runs `clone-manifest.sh` automatically** (once).
After starting the Docker network but before booting infra, iterate
`manifest.json` and clone any workspace_templates / org_templates /
plugins that aren't already populated. Idempotent — subsequent
runs skip dirs that have content. Requires `jq`; when jq is missing
the step prints a clear install hint and skips (doesn't fail).

**2. `clone-manifest.sh` is idempotent.** Before running `git clone`,
check whether the target directory already exists and is non-empty —
skip if so. Lets `setup.sh` rerun safely without forcing the operator
to delete already-cloned template repos.

**3. `ListTemplates` logs the reason it skips a template.** The
handler previously swallowed `resolveYAMLIncludes` errors with
`continue`, so a broken template showed up as an empty palette with
no log trail. Now the include-expansion and yaml.Unmarshal failure
paths both emit a descriptive `log.Printf` — the exact message that
made the stale `org-templates/molecule-dev/` snapshot debuggable:

    ListTemplates: skipping molecule-dev — !include expansion failed:
      !include "core-platform.yaml" at line 25: open .../teams/
      core-platform.yaml: no such file or directory

**4. Remove the in-tree `org-templates/molecule-dev/` snapshot** (170
files). Matches the explicit intent of prior commit
`bfec9e53` — "remove org-templates/molecule-dev/ — standalone repo
is source of truth". A later "full staging snapshot" re-added a
partial copy that had `!include` references to 7 role files that
never existed in the snapshot (`core-platform.yaml`,
`controlplane.yaml`, `app-docs.yaml`, `infra.yaml`, `sdk.yaml`,
`release-manager/workspace.yaml`, `integration-tester/workspace.yaml`).
`clone-manifest.sh` repopulates it fresh from
`Molecule-AI/molecule-ai-org-template-molecule-dev`.

.gitignore exception for `molecule-dev/` is dropped accordingly
— the whole `/org-templates/*` tree is now gitignored, symmetric
with `/plugins/` and `/workspace-configs-templates/`.

**5. Doc updates** (README, README.zh-CN, CONTRIBUTING) mention `jq`
as a prerequisite and describe what setup.sh now does.

### Verification

On a fresh-nuked DB with the updated branch:

1. `bash infra/scripts/setup.sh` — cleanly clones 33/33 manifest
   repos (20 plugins, 8 workspace_templates, 5 org_templates), then
   boots infra. Second run skips all 33 (idempotent).
2. `go run ./cmd/server` — "Applied 41 migrations", :8080 healthy.
3. `curl http://localhost:8080/org/templates` returns 4 templates
   (was `[]`):

       - Free Beats All
       - MeDo Smoke Test
       - Molecule AI Worker Team (Gemini)
       - Reno Stars Agent Team

4. `bash tests/e2e/test_api.sh` — 61/61 pass.
5. `npx vitest run` in canvas — 902/902 pass.
6. `shellcheck infra/scripts/setup.sh` — clean.

### SaaS parity

All changes are local-dev surface. `setup.sh`, `clone-manifest.sh`,
and the local `org-templates/` directory aren't part of the CP
provisioner path — SaaS tenant machines get their templates via
Dockerfile layers or CP-side provisioning, not `clone-manifest.sh`.
The `ListTemplates` log addition is harmless either way (replaces a
silent `continue` with a `log.Printf + continue`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 14:55:34 -07:00
molecule-ai[bot]
4a03b89e91
fix(scripts): correct platform dir path + add ROOT isolation (shellcheck clean)
- dev-start.sh: $ROOT/platform → $ROOT/workspace-server (Go server
  lives in workspace-server/, not platform/; any developer running
  this script would get "no such directory" immediately)
- nuke-and-rebuild.sh: add ROOT variable and -f "$ROOT/docker-compose.yml"
  so docker compose works from any CWD; fix post-rebuild-setup.sh path
- rollback-latest.sh: add 'local' to src_digest and new_digest vars
  inside roll() function to prevent global-scope leakage

Co-authored-by: Molecule AI Core-DevOps <core-devops@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 15:42:24 +00:00
rabbitblood
d513a0ced5 security: remove hardcoded API keys from post-rebuild-setup.sh
GitGuardian detected exposed MiniMax API key and GitHub PAT in the
script's default values. Replaced with env var reads from .env file
(which is gitignored). Script now validates required secrets exist
before proceeding.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-20 13:02:52 -07:00
rabbitblood
f787873698 feat: nuke-and-rebuild.sh — one-command fleet reset
Two scripts:
- nuke-and-rebuild.sh: docker down -v, clean orphans, rebuild, setup
- post-rebuild-setup.sh: insert global secrets (MiniMax + GH PAT),
  import org template, wait for platform health

Global secrets ensure every provisioned container gets MiniMax API
config and GitHub PAT injected as env vars automatically — no manual
settings.json deployment needed.

Usage: bash scripts/nuke-and-rebuild.sh

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-20 12:53:30 -07:00
Hongming Wang
eecce56c13 feat(canary): rollback-latest script + release-pipeline doc (Phase 4)
Closes the canary loop with the escape hatch and a single place to
read about the whole flow.

scripts/rollback-latest.sh <sha>
  uses crane to retag :latest ← :staging-<sha> for BOTH the platform
  and tenant images. Pre-checks the target tag exists and verifies
  the :latest digest after the move so a bad ops typo doesn't
  silently promote the wrong thing. Prod tenants auto-update to the
  rolled-back digest within their 5-min cycle. Exit codes: 0 = both
  retagged, 1 = registry/tag error, 2 = usage error.

docs/architecture/canary-release.md
  The one-page map of the pipeline: how PR → main → staging-<sha> →
  canary smoke → :latest promotion works end-to-end, how to add a
  canary tenant, how to roll back, and what this gate explicitly does
  NOT catch (prod-only data, config drift, cross-tenant bugs).

No code changes in the CP or workspace-server — this PR is shell
+ docs only, so it's safe to land independently of the other Phase
{1,1.5,2,3} PRs still in review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 03:37:42 -07:00
Hongming Wang
9662590360 feat(canary): smoke harness + GHA verification workflow (Phase 2)
Post-deploy verification for staging tenant images. Runs against the
canary fleet after each publish-workspace-server-image build — catches
auto-update breakage (a la today's E2E current_task drift) before it
propagates to the prod tenant fleet that auto-pulls :latest every 5 min.

scripts/canary-smoke.sh iterates a space-sep list of canary base URLs
(paired with their ADMIN_TOKENs) and checks:
- /admin/liveness reachable with admin bearer (tenant boot OK)
- /workspaces list responds (wsAuth + DB path OK)
- /memories/commit + /memories/search round-trip (encryption + scrubber)
- /events admin read (AdminAuth C4 path)
- /admin/liveness without bearer returns 401 (C4 fail-closed regression)

.github/workflows/canary-verify.yml runs after publish succeeds:
- 6-min sleep (tenant auto-updater pulls every 5 min)
- bash scripts/canary-smoke.sh with secrets pulled from repo settings
- on failure: writes a Step Summary flagging that :latest should be
  rolled back to prior known-good digest

Phase 3 follow-up will split the publish workflow so only
:staging-<sha> ships initially, and canary-verify's green gate is
what promotes :staging-<sha> → :latest. This commit lays the test
gate alone so we have something running against tenants immediately.

Secrets to set in GitHub repo settings before this workflow can run:
- CANARY_TENANT_URLS (space-sep list)
- CANARY_ADMIN_TOKENS (same order as URLs)
- CANARY_CP_SHARED_SECRET (matches staging CP PROVISION_SHARED_SECRET)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 03:30:19 -07:00
Hongming Wang
e1d65607cf feat(security): Phase 35.1 — SG lockdown script for tenant EC2 instances
Restricts tenant EC2 port 8080 ingress to Cloudflare IP ranges only,
blocking direct-IP access. Supports two modes:

1. Lock to CF IPs (Worker deployment): 14 IPv4 CIDR rules
2. Close ingress entirely (Tunnel deployment): removes 0.0.0.0/0 only

Usage:
  bash scripts/lockdown-tenant-sg.sh --sg-id sg-xxxxx
  bash scripts/lockdown-tenant-sg.sh --sg-id sg-xxxxx --close-ingress
  bash scripts/lockdown-tenant-sg.sh --sg-id sg-xxxxx --dry-run

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 12:01:41 -07:00
Hongming Wang
fccf15681b chore: final cleanup — remove internal tooling, gitignore local config
Removed:
- docs/.vitepress/ + package.json — docs site config belongs in Molecule-AI/docs
- scripts/bridge/ — internal Claude Code bridge server
- scripts/claude-code-bridge.py — internal agent bridge
- scripts/dedup_settings_hooks.py, verify_settings_hooks.py — internal maintenance

Gitignored:
- .mcp.json → .mcp.json.example (local MCP config, users create their own)
- test-results/ — ephemeral build artifacts

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 00:52:30 -07:00
b69e50d98c fix(scripts): add dedup_settings_hooks + verify utilities
molecule_runtime's _deep_merge_hooks() uses unconditional list.extend()
when merging plugin settings-fragment.json files. On every plugin install
or reinstall each hook handler is re-appended, causing 3-4x duplicate
firings per event.

scripts/dedup_settings_hooks.py — idempotent live fix (reads via
/proc/*/root, no docker CLI required). Safe to re-run.
scripts/verify_settings_hooks.py — exits 1 if any container still has
duplicate hooks; used in CI health checks and manual audits.

Upstream fix needed in molecule_runtime._deep_merge_hooks() to
deduplicate by (matcher, frozenset(commands)) before writing. Track
separately.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 00:12:07 +00:00
Hongming Wang
0071b66a59 fix(ci): heredoc indentation in publish workflows + add dev-start.sh
Two fixes:
1. publish-canvas-image.yml + publish-platform-image.yml: the JSON
   heredoc for config.json had leading whitespace from YAML indentation,
   producing invalid JSON. Docker fell back to osxkeychain → -25308.
   Fixed by removing indentation inside the heredoc body.

2. Added scripts/dev-start.sh — one-command local dev environment.
   Starts infra (docker-compose), platform (Go), and canvas (Next.js)
   with proper health checks and cleanup on Ctrl-C.
2026-04-16 05:56:25 -07:00
Hongming Wang
fd719f4d36 fix: use /bin/sh not bash in clone-manifest (Alpine has no bash) 2026-04-16 05:42:49 -07:00
Hongming Wang
74e4f30216 fix: address all code review findings + remove exposed secrets
Code review fixes:
- 🟡 #1: Replace python3 with jq in Dockerfile template stages (~50MB → ~2MB)
- 🟡 #2: Add clone count verification to scripts/clone-manifest.sh
  (set -e + expected vs actual count check — fails build if any clone fails)
- 🟡 #3: Drop 'unsafe-eval' from CSP (not needed for Next.js production
  standalone builds, only dev mode). Updated test assertion.
- 🟡 #4: Remove broken pyproject.toml from workspace-template/ (it claimed
  to package as molecule-ai-workspace-runtime but the directory structure
  didn't match — the real package ships from the standalone repo)
- 🔵 #1: Add version-pinning TODO comment to manifest.json
- 🔵 #3: Add full repo URLs + test counts for SDK/MCP/CLI/runtime in CLAUDE.md

Security (GitGuardian alert):
- Removed Telegram bot token (8633739353:AA...) from template-molecule-dev
  pm/.env — replaced with ${TELEGRAM_BOT_TOKEN} placeholder
- Removed Claude OAuth token (sk-ant-oat01-...) from template-molecule-dev
  root .env — replaced with ${CLAUDE_CODE_OAUTH_TOKEN} placeholder
- Both tokens need immediate rotation by the operator

Tests: Platform middleware tests updated + all pass.
2026-04-16 05:05:49 -07:00
Hongming Wang
8e304e69e8 chore: remove extracted directories, add manifest-driven Docker builds
Remove plugins/, workspace-configs-templates/, org-templates/ dirs (now
in standalone repos). Add manifest.json listing all 33 repos and
scripts/clone-manifest.sh to clone them. Both Dockerfiles now use the
manifest script instead of 33 hardcoded git-clone lines.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 04:13:29 -07:00
Hongming Wang
f7683e3adf fix(provisioner): stop rogue config-missing restart loop (#17)
Resolves #17.

Part A: scripts/cleanup-rogue-workspaces.sh deletes workspaces whose id
or name starts with known test placeholder prefixes (aaaaaaaa-, etc.)
and force-removes the paired Docker container. Documented in
tests/README.md.

Part B: add a pre-flight check in provisionWorkspace() — when neither a
template path nor in-memory configFiles supplies config.yaml, probe the
existing named volume via a throwaway alpine container. If the volume
lacks config.yaml, mark the workspace status='failed' with a clear
last_sample_error instead of handing it to Docker's unless-stopped
restart policy (which otherwise loops forever on FileNotFoundError).

New pure helper provisioner.ValidateConfigSource + unit tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 07:32:58 -07:00
Hongming Wang
24fec62d7f initial commit — Molecule AI platform
Forked clean from public hackathon repo (Starfire-AgentTeam, BSL 1.1)
with full rebrand to Molecule AI under github.com/Molecule-AI/molecule-monorepo.

Brand: Starfire → Molecule AI.
Slug: starfire / agent-molecule → molecule.
Env vars: STARFIRE_* → MOLECULE_*.
Go module: github.com/agent-molecule/platform → github.com/Molecule-AI/molecule-monorepo/platform.
Python packages: starfire_plugin → molecule_plugin, starfire_agent → molecule_agent.
DB: agentmolecule → molecule.

History truncated; see public repo for prior commits and contributor
attribution. Verified green: go test -race ./... (platform), pytest
(workspace-template 1129 + sdk 132), vitest (canvas 352), build (mcp).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 11:55:37 -07:00