Commit Graph

1009 Commits

Author SHA1 Message Date
Hongming Wang
de2a4cb50e
Merge pull request #986 from Molecule-AI/feat/tenant-cp-env-refresh
feat(ws-server): pull env from CP on startup
2026-04-19 03:27:14 -07:00
Hongming Wang
01e19e9243
Merge pull request #985 from Molecule-AI/docs/saas-migration-notes-prod
docs: 2026-04-19 SaaS prod migration notes
2026-04-19 03:27:12 -07:00
Hongming Wang
3e448c2569
Merge pull request #982 from Molecule-AI/fix/canvas-api-fetch-timeout
fix(canvas): add 15s fetch timeout on API calls
2026-04-19 03:27:09 -07:00
Hongming Wang
48ec5b2dc8 feat(ws-server): pull env from CP on startup
Paired with molecule-controlplane PR #55 (GET /cp/tenants/config). Lets
existing tenants heal themselves when we rotate or add a CP-side env
var (e.g. MOLECULE_CP_SHARED_SECRET landing earlier today) without any
ssh or re-provision.

Flow: main() calls refreshEnvFromCP() before any other os.Getenv read.
The helper reads MOLECULE_ORG_ID + ADMIN_TOKEN from the baked-in
user-data env, GETs {MOLECULE_CP_URL}/cp/tenants/config with those
credentials, and applies the returned string map via os.Setenv so
downstream code (CPProvisioner, etc.) sees the fresh values.

Best-effort semantics:
- self-hosted / no MOLECULE_ORG_ID → no-op (return nil)
- CP unreachable / non-200 → log + return error (main keeps booting)
- oversized values (>4 KiB each) rejected to avoid env pollution
- body read capped at 64 KiB

Once this image hits GHCR, the 5-minute tenant auto-updater picks it
up, the container restarts, refresh runs, and every tenant has
MOLECULE_CP_SHARED_SECRET within ~5 minutes — no operator toil.

Also fixes workspace-server/.gitignore so `server` no longer matches
the cmd/server package dir — it only ignored the compiled binary but
pattern was too broad. Anchored to `/server`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 02:41:15 -07:00
Hongming Wang
96535c30cc docs: 2026-04-19 SaaS prod migration notes
Captures the 10-PR staging→main cutover: what shipped, the three new
Railway prod env vars (PROVISION_SHARED_SECRET / EC2_VPC_ID /
CP_BASE_URL), and the sharp edge for existing tenants — their
containers pre-date PR #53 so they still need MOLECULE_CP_SHARED_SECRET
added manually (or a re-provision) before the new CPProvisioner's
outbound bearer works.

Also includes a post-deploy verification checklist and rollback plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 02:29:31 -07:00
Hongming Wang
7a41b0b243
Merge pull request #983 from Molecule-AI/staging
promote: staging → main (security hardening + Phase 35.1)
2026-04-19 02:28:05 -07:00
Hongming Wang
dcc4ec035d
Merge pull request #984 from Molecule-AI/fix/e2e-current-task-public-get
fix(e2e): stop asserting current_task on public workspace GET
2026-04-19 02:21:08 -07:00
Hongming Wang
0c1d56ebbf fix(e2e): stop asserting current_task on public workspace GET (#966)
PR #966 intentionally stripped current_task, last_sample_error, and
workspace_dir from the public GET /workspaces/:id response to avoid
leaking task bodies to anyone with a workspace bearer. The E2E smoke
test hadn't caught up — it was still asserting "current_task":"..."
on the single-workspace GET, which made every post-#966 CI run fail
with '60 passed, 2 failed'.

Swap the per-workspace asserts to check active_tasks (still exposed,
canonical busy signal) and keep the list-endpoint check that proves
admin-auth'd callers still see current_task end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 02:19:15 -07:00
Hongming Wang
206856ad3a fix(canvas): add 15s fetch timeout on API calls
Pre-launch audit flagged api.ts as missing a timeout on every fetch.
A slow or hung CP response would leave the UI spinning indefinitely
with no way for the user to abort — effectively a client-side DoS.

15s is long enough for real CP queries (slowest observed is Stripe
portal redirect at ~3s) and short enough that a stalled backend
surfaces as a clear error with a retry affordance.

Uses AbortSignal.timeout (widely supported since 2023) so the
abort propagates through React Query / SWR consumers cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 02:12:47 -07:00
Hongming Wang
ea5cb88183
Merge pull request #981 from Molecule-AI/fix/security-tenant-cpprovisioner-bearer
fix(security): tenant CPProvisioner sends CP bearer on provision / stop / status
2026-04-19 01:55:20 -07:00
Hongming Wang
d8cbe51c82 fix(security): tenant CPProvisioner attaches CP bearer on all calls
Completes the C1 integration (PR #50 on molecule-controlplane). The CP
now requires Authorization: Bearer <PROVISION_SHARED_SECRET> on all
three /cp/workspaces/* endpoints; without this change the tenant-side
Start/Stop/IsRunning calls would all 401 (or 404 when the CP's routes
refused to mount) and every workspace provision from a SaaS tenant
would silently fail.

Reads MOLECULE_CP_SHARED_SECRET, falling back to PROVISION_SHARED_SECRET
so operators can use one env-var name on both sides of the wire. Empty
value is a no-op: self-hosted deployments with no CP or a CP that
doesn't gate /cp/workspaces/* keep working as before.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 01:53:12 -07:00
Hongming Wang
c062e653ad
Merge pull request #980 from Molecule-AI/fix/security-log-scrubbing
fix(security): scrub workspace-server token + upstream error logs
2026-04-19 01:39:39 -07:00
Hongming Wang
7318ead8a4 fix(security): scrub workspace-server token + upstream error logs
Two findings from the pre-launch log-scrub audit:

1. handlers/workspace_provision.go:548 logged `token[:8]` — the exact
   H1 pattern that panicked on short keys. Even with a length guard,
   leaking 8 chars of an auth token into centralized logs shortens the
   search space for anyone who gets log-read access. Now logs only
   `len(token)` as a liveness signal.

2. provisioner/cp_provisioner.go:101 fell back to logging the raw
   control-plane response body when the structured {"error":"..."}
   field was absent. If the CP ever echoed request headers (Authorization)
   or a portion of user-data back in an error path, the bearer token
   would end up in our tenant-instance logs. Now logs the byte count
   only; the structured error remains in place for the happy path.
   Also caps the read at 64 KiB via io.LimitReader to prevent
   log-flood DoS from a compromised upstream.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 01:33:47 -07:00
Hongming Wang
cb16e55447
Merge pull request #979 from Molecule-AI/fix/security-adminauth-c4
fix(security): C4 — close AdminAuth fail-open race on hosted-SaaS fresh install
2026-04-19 01:29:54 -07:00
Hongming Wang
13992478ec
Merge pull request #978 from Molecule-AI/fix/security-discord-config-limitreader
fix(security): cap Discord webhook + config PATCH bodies (H3/H4)
2026-04-19 01:28:46 -07:00
Hongming Wang
0e917ef6b8 fix(security): C4 — close AdminAuth fail-open race on hosted-SaaS fresh install
Pre-launch review blocker. AdminAuth's Tier-1 fail-open fired whenever
the workspace_auth_tokens table was empty — including the window between
a hosted tenant EC2 booting and the first workspace being created. In
that window, every admin-gated route (POST /org/import, POST /workspaces,
POST /bundles/import, etc.) was reachable without a bearer, letting an
attacker pre-empt the first real user by importing a hostile workspace
into a freshly provisioned instance.

Fix: fail-open is now ONLY applied when ADMIN_TOKEN is unset (self-
hosted dev with zero auth configured). Hosted SaaS always sets
ADMIN_TOKEN at provision time, so the branch never fires in prod and
requests with no bearer get 401 even before the first token is minted.

Tier-2 / Tier-3 paths unchanged.

The old TestAdminAuth_684_FailOpen_AdminTokenSet_NoGlobalTokens test
was codifying exactly this bug (asserting 200 on fresh install with
ADMIN_TOKEN set). Renamed and flipped to
TestAdminAuth_C4_AdminTokenSet_FreshInstall_FailsClosed asserting 401.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 01:28:13 -07:00
Hongming Wang
60c4801a13 fix(security): cap webhook + config PATCH bodies (H3/H4)
Two HIGH-severity DoS surfaces: both handlers read the entire HTTP
body with io.ReadAll(r.Body) and no upper bound, so a caller streaming
a multi-gigabyte request could exhaust memory on the tenant instance
before we even validated the JSON.

H3 (Discord webhook): wrap Body in io.LimitReader with a 1 MiB cap.
Discord Interactions payloads are well under 10 KiB in practice.

H4 (workspace config PATCH): wrap Body in http.MaxBytesReader with a
256 KiB cap. Real configs are <10 KiB; jsonb handles the cap
comfortably. Returns 413 Request Entity Too Large on overflow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 01:23:03 -07:00
Hongming Wang
b367f18e95
Merge pull request #977 from Molecule-AI/feat/workspace-snapshot-scrubber-823
feat(workspace): snapshot secret scrubber (closes #823)
2026-04-19 00:33:14 -07:00
Hongming Wang
e7b9b7df71 feat(workspace): snapshot secret scrubber (closes #823)
Sub-issue of #799, security condition C4. Standalone module in
workspace/lib/snapshot_scrub.py with three public functions:

- scrub_content(str) → str: regex-based redaction of secret patterns
- is_sandbox_content(str) → bool: detect run_code tool output markers
- scrub_snapshot(dict) → dict: walk memories, scrub each, drop sandbox entries

Patterns covered: sk-ant-/sk-proj-, ghp_/ghs_/github_pat_, AKIA,
cfut_, mol_pk_, ctx7_, Bearer, env-var assignments, base64 blobs ≥33 chars.

21 unit tests, 100% coverage on new code.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-19 00:32:42 -07:00
Hongming Wang
aec64a6a63
Merge pull request #972 from Molecule-AI/chore/ci-action-versions
ci: update GitHub Actions to current stable versions (closes #780)
2026-04-19 00:31:17 -07:00
Hongming Wang
04e10fb19d
Merge pull request #975 from Molecule-AI/fix/hibernate-409-guard-active-tasks
feat(platform): 409 guard on /hibernate when active_tasks > 0 (closes #822)
2026-04-19 00:30:24 -07:00
Hongming Wang
e2c270600c
Merge pull request #976 from Molecule-AI/feat/last-outbound-at-817
feat(platform): track last_outbound_at for silent detection (closes #817)
2026-04-19 00:30:01 -07:00
Hongming Wang
eef8949b65
Merge pull request #974 from Molecule-AI/fix/canvas-a11y-degraded-badge
fix(canvas): degraded badge WCAG AA contrast (closes #885 p1)
2026-04-19 00:28:39 -07:00
Hongming Wang
4c9d0d683f
Merge pull request #968 from Molecule-AI/fix/security-memory-delimiter-npm-pin
fix(security): GLOBAL memory delimiter spoofing + pin MCP version (closes #807, #805)
2026-04-19 00:28:08 -07:00
Hongming Wang
acb67c75b8
Merge pull request #964 from Molecule-AI/feat/schema-migrations-tracking
feat(db): schema_migrations tracking — run each migration only once
2026-04-19 00:27:27 -07:00
Hongming Wang
9b49024ce4
Merge pull request #967 from Molecule-AI/chore/shadcn-init
chore(canvas): initialize shadcn/ui CLI
2026-04-19 00:27:07 -07:00
Hongming Wang
ff4962e20f
Merge pull request #966 from Molecule-AI/fix/strip-current-task-public-get
fix(security): strip current_task from public GET response (closes #955)
2026-04-19 00:26:27 -07:00
Hongming Wang
0519327179
Merge pull request #973 from Molecule-AI/docs/rfc2119-opencode-must-not
docs(opencode): 'should not' → 'must not' for SAFE-T1201 (closes #861)
2026-04-19 00:26:05 -07:00
Hongming Wang
0111a882ab
Merge pull request #965 from Molecule-AI/fix/crlf-cron-prompts
fix(scheduler): strip CRLF from cron prompts (closes #958)
2026-04-19 00:25:14 -07:00
Hongming Wang
60ab365d81
Merge pull request #963 from Molecule-AI/chore/turbopack-dev
chore(canvas): enable Turbopack for dev server
2026-04-19 00:24:37 -07:00
Hongming Wang
beccd02519
Merge pull request #971 from Molecule-AI/chore/phase35-sg-lockdown-script
feat(security): Phase 35.1 — SG lockdown script for tenant EC2
2026-04-19 00:24:11 -07:00
Hongming Wang
a00d0dc602
Merge pull request #962 from Molecule-AI/chore/secret-scanner-mol-pk
chore: add mol_pk_ and cfut_ to pre-commit secret scanner
2026-04-19 00:22:44 -07:00
Hongming Wang
2f36bb9a7f feat(platform): track last_outbound_at for silent-workspace detection (closes #817)
Sub of #795 (phantom-busy post-mortem). Adds last_outbound_at TIMESTAMPTZ
column to workspaces. Bumped async on every successful outbound A2A call
from a real workspace (skip canvas + system callers). Exposed in
GET /workspaces/:id response as "last_outbound_at".

PM/Dev Lead orchestrators can now detect workspaces that have gone silent
despite being online (> 2h + active cron = phantom-busy warning).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 13:04:54 -07:00
Hongming Wang
a8897c5f17 feat(platform): 409 guard on /hibernate when active_tasks > 0 (closes #822)
Phase 35.1 / #799 security condition C3 — prevents operator from
accidentally killing a mid-task agent.

Behavior:
- active_tasks == 0 → proceed as before
- active_tasks > 0 && ?force=true → log [WARN] + proceed
- active_tasks > 0 && no force → 409 with {error, active_tasks}

2 new tests: TestHibernateHandler_ActiveTasks_Returns409,
TestHibernateHandler_ActiveTasks_ForceTrue_Returns200.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 12:09:52 -07:00
Hongming Wang
e74d41bbaa fix(canvas): degraded badge WCAG AA contrast — amber-400 → amber-300 (closes #885)
amber-400 on zinc-900 is 5.4:1 (AA pass). amber-300 is 6.9:1 (AA+AAA pass)
and matches the rest of the amber usage in WorkspaceNode (currentTask,
error detail, badge chip).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 12:05:38 -07:00
Hongming Wang
90236c4d23 docs(opencode): RFC 2119 — 'should not' → 'must not' for SAFE-T1201 warning (closes #861)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 12:04:49 -07:00
Hongming Wang
755c6952c9 ci: update GitHub Actions to current stable versions (closes #780)
- golangci/golangci-lint-action@v4 → v9
- docker/setup-qemu-action@v3 → v4
- docker/setup-buildx-action@v3 → v4
- docker/build-push-action@v5 → v6

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 12:04:10 -07:00
Hongming Wang
e1d65607cf feat(security): Phase 35.1 — SG lockdown script for tenant EC2 instances
Restricts tenant EC2 port 8080 ingress to Cloudflare IP ranges only,
blocking direct-IP access. Supports two modes:

1. Lock to CF IPs (Worker deployment): 14 IPv4 CIDR rules
2. Close ingress entirely (Tunnel deployment): removes 0.0.0.0/0 only

Usage:
  bash scripts/lockdown-tenant-sg.sh --sg-id sg-xxxxx
  bash scripts/lockdown-tenant-sg.sh --sg-id sg-xxxxx --close-ingress
  bash scripts/lockdown-tenant-sg.sh --sg-id sg-xxxxx --dry-run

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 12:01:41 -07:00
Hongming Wang
8da05e9f24 test: GLOBAL memory delimiter spoofing escape + LOCAL scope untouched
- TestCommitMemory_GlobalScope_DelimiterSpoofingEscaped: verifies [MEMORY prefix
  is escaped to [_MEMORY before DB insert (SAFE-T1201, #807)
- TestCommitMemory_LocalScope_NoDelimiterEscape: LOCAL scope stored verbatim

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 11:54:52 -07:00
Hongming Wang
a61a14d2fd test: verify current_task + last_sample_error + workspace_dir stripped from public GET
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 11:53:45 -07:00
Hongming Wang
64cf74bdb2 test: schema_migrations tracking — 4 cases (first boot, re-boot, mixed, down.sql filter)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 11:52:27 -07:00
Hongming Wang
a61dadde43 fix(security): GLOBAL memory delimiter spoofing + pin MCP npm version
SAFE-T1201 (#807): Escape [MEMORY prefix in GLOBAL memory content on
write to prevent delimiter-spoofing prompt injection. Content stored
as "[_MEMORY " so it renders as text, not structure, when wrapped with
the real delimiter on read.

SAFE-T1102 (#805): Pin @molecule-ai/mcp-server@1.0.0 in .mcp.json.example.
Prevents supply-chain attacks via unpinned npx -y.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 11:09:24 -07:00
Hongming Wang
1663c1bddb chore(canvas): initialize shadcn/ui — components.json + cn utility
Sets up shadcn/ui CLI so new components can be added with
`npx shadcn add <component>`. Uses new-york style, zinc base color,
no CSS variables (matches existing Tailwind-only approach).

Adds clsx + tailwind-merge for the cn() utility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 07:57:17 -07:00
Hongming Wang
d5ab81dfd3 fix(security): strip current_task from public GET /workspaces/:id (closes #955)
current_task exposes live agent instructions to any caller with a
valid workspace UUID. Also strips last_sample_error and workspace_dir
from the public endpoint. These fields remain available through
authenticated workspace-specific endpoints.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 07:48:59 -07:00
Hongming Wang
1dcdd01378 fix(scheduler): strip CRLF from cron prompts on insert/update (closes #958)
Windows CRLF in org-template prompt text caused empty agent responses
and phantom-producing detection. Strips \r at the handler level before
DB persist, plus a one-time migration to clean existing rows.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 07:45:14 -07:00
Hongming Wang
55ceb39520 feat(db): schema_migrations tracking — migrations only run once
Adds a schema_migrations table that records which migration files
have been applied. On boot, only new migrations execute — previously
applied ones are skipped. This eliminates:

- Re-running all 33 migrations on every restart
- Risk of non-idempotent DDL failing on restart
- Unnecessary log noise from re-applying unchanged schema

First boot auto-populates the tracking table with all existing
migrations. Subsequent boots only apply new ones.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 07:39:20 -07:00
Hongming Wang
93568cbada chore(canvas): enable Turbopack for dev server — faster HMR
next dev --turbopack for significantly faster dev server startup
and hot module replacement. Build script unchanged (Turbopack for
next build is still experimental).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 07:39:03 -07:00
Hongming Wang
8869e3b5fa chore: add mol_pk_ and cfut_ to pre-commit secret scanner
Partner API keys (mol_pk_*) and Cloudflare tokens (cfut_*) now
caught by the pre-commit hook alongside sk-ant-, ghp_, AKIA.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 07:38:48 -07:00
Hongming Wang
0d538ab27a fix(ci): update working-directory for workspace-server/ and workspace/ renames
- platform-build: working-directory platform → workspace-server
- golangci-lint: working-directory platform → workspace-server
- python-lint: working-directory workspace-template → workspace
- e2e-api: working-directory platform → workspace-server
- canvas-deploy-reminder: fix duplicate if: key (merged into single condition)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 07:05:44 -07:00
Hongming Wang
5f452e377a chore: update publish workflow name + document staging-first flow
Default branch is now staging for both molecule-core and
molecule-controlplane. PRs target staging, CEO merges staging → main
to promote to production.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 07:02:02 -07:00