Commit Graph

26 Commits

Author SHA1 Message Date
16868c4ec1 fix(plugins): SaaS (EC2-per-workspace) install/uninstall via EIC SSH
Some checks failed
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Successful in 5s
Check merge_group trigger on required workflows / Required workflows have merge_group trigger (pull_request) Successful in 15s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Successful in 5s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 17s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Successful in 6s
CI / Detect changes (pull_request) Successful in 19s
E2E API Smoke Test / detect-changes (pull_request) Successful in 15s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 16s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 17s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 13s
Harness Replays / detect-changes (pull_request) Successful in 18s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 17s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 15s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 10s
CI / Canvas (Next.js) (pull_request) Successful in 19s
CI / Python Lint & Test (pull_request) Successful in 12s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 17s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 15s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
Harness Replays / Harness Replays (pull_request) Failing after 2m4s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2m53s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 5m14s
CI / Platform (Go) (pull_request) Failing after 8m5s
Closes the 🔴 docker-only row in docs/architecture/backends.md. Plugin
install on every SaaS tenant currently 503s with "workspace container
not running" because the handler is hardcoded to Docker exec but SaaS
workspaces live on per-workspace EC2s. Caught on hongming.moleculesai.app
when canvas POST /workspaces/<id>/plugins surfaced the error.

Mirrors the Files API PR #1702 pattern: dispatch on workspaces.instance_id
in deliverToContainer (and Uninstall). When set, push the staged plugin
tarball to the EC2 over the existing withEICTunnel primitive
(template_files_eic.go) and unpack into the runtime's bind-mounted config
dir (/configs for claude-code, /home/ubuntu/.hermes for hermes — see
workspaceFilePathPrefix). chown 1000:1000 to match the docker path's
agent-uid contract; restart via the existing dispatcher.

Direct host write rather than docker-cp via SSH because the runtime's
config dir is already bind-mounted into the workspace container — the
runtime sees the files on next start with no additional plumbing.

Adds InstanceIDLookup (parallel to RuntimeLookup) so unit tests don't
need a DB; production wires it in router.go like templates.go does.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 15:42:51 -07:00
documentation-specialist
26afbbfdf4 docs(internal): bulk-sed molecule-core .md docs → Gitea (#37 final molecule-core sweep)
Some checks failed
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 5s
CI / Detect changes (pull_request) Successful in 5s
E2E API Smoke Test / detect-changes (pull_request) Successful in 6s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 6s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 6s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 5s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 5s
CI / Platform (Go) (pull_request) Successful in 3s
CI / Canvas (Next.js) (pull_request) Successful in 5s
CI / Python Lint & Test (pull_request) Successful in 3s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 4s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 8s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 4s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 4s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 3s
Ops Scripts Tests / Ops scripts (unittest) (pull_request) Failing after 12s
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Failing after 51s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Failing after 1m20s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Failing after 1m20s
Mass-sed across 17 files / 38 active refs in molecule-core .md docs
(README + CONTRIBUTING + docs/architecture/ + docs/blog/ + docs/guides/
+ docs/integrations/ + docs/quickstart.md + scripts/README.md).

Driver: /tmp/sweep_core.py — same pattern set as the
internal-marketing bulk-sed (PR #50). 4 url-substitution patterns +
SKIP_PATTERN preserves /pull/<n> /issues/<n> /commit/<sha>
/releases/... historical refs.

Files NOT touched in this PR:
- docs/workspace-runtime-package.md — owned by molecule-core#15
  (workspace-runtime source-edit per #41). Reverted my bulk-sed of
  that file to avoid merge conflict.
- 2 Go-import-path refs in docs/memory-plugins/testing-your-plugin.md
  (github.com/Molecule-AI/molecule-monorepo/platform/internal/...) —
  Q5 cross-repo Go-module migration territory.
- 1 GitHub Gist link in docs/guides/external-workspace-quickstart.md
  (gist.github.com/molecule-ai/...) — no Gitea equivalent;
  consistent with the same handling in docs#1.

Manual fixes (2):
- docs/blog/2026-04-20-chrome-devtools-mcp-seo/index.md:306 —
  GitHub Discussions (no Gitea equivalent) → issue tracker link
- docs/guides/external-workspace-quickstart.md:218 — tracking-issue
  ?q= query-string url (regex didn't catch) → reformulated text +
  Gitea search-by-query approach

Pattern matches my docs#1 (public docs site) PR + internal#50
(internal/marketing bulk-sed). Standard substitutions:
- https://github.com/Molecule-AI/<repo> → https://git.moleculesai.app/molecule-ai/<repo>
- /blob/<branch>/ + /tree/<branch>/ → /src/branch/<branch>/

Refs: molecule-ai/internal#37, molecule-ai/internal#38
2026-05-07 01:27:50 -07:00
Hongming Wang
eec4ea2e7d chore: delete TeamHandler.Collapse + docs cleanup (closes #2864)
Multi-model retrospective review of #2856 (Phase 1 Expand removal)
flagged that TeamHandler.Collapse is unreachable from the canvas UI:
the "Collapse Team" button calls PATCH /workspaces/:id { collapsed }
(visual flag toggle on canvas_layouts), NOT POST /workspaces/:id/collapse.
The destructive POST route — which stops EC2s, marks children removed,
and deletes layouts — has zero UI callers (verified via grep across
canvas/, scripts/, and the MCP tool registry; only docs referenced it).

Two semantically different operations had been sharing the word
"Collapse":

- Visual collapse (canvas) → PATCH { collapsed: true }. Hides
  children visually. Reversible. UI-only.
- Destructive collapse (POST /collapse) → Stops + marks removed.
  Irreversible. No caller.

Deleting the destructive one + its supporting machinery:

- workspace-server/internal/handlers/team.go (entirely)
- workspace-server/internal/handlers/team_test.go (entirely)
- POST /collapse route + teamh init in router.go
- findTemplateDirByName helper (zero non-test callers after Expand
  was deleted in #2856; package-private so no out-of-package consumers)
- NewTeamHandler constructor (no callers after route removed)

Plus stale doc references (the most dangerous was the MCP wrapper
mapping in mcp-server-setup.md — anyone generating MCP tool wrappers
from that table was wiring a 404):

- docs/agent-runtime/team-expansion.md (deleted entirely — whole
  guide taught the deleted flow)
- docs/api-reference.md (dropped two team.go rows)
- docs/api-protocol/platform-api.md (dropped /expand + /collapse
  rows)
- docs/architecture/molecule-technical-doc.md (dropped /expand +
  /collapse rows)
- docs/guides/mcp-server-setup.md (dropped expand_team +
  collapse_team MCP wrapper mappings)
- docs/glossary.md (dropped "(org template expand_team)"
  parenthetical)
- docs/frontend/canvas.md (dropped broken link to deleted
  team-expansion.md)

Kept: docs/architecture/backends.md mention of "TeamHandler.Expand
(#2367) bypassed routing on Start" — correct historical context for
the AST gate's existence, no live route reference.

Visual-collapse path unaffected:
  canvas/src/components/ContextMenu.tsx:227 → api.patch — unchanged
  canvas/src/components/WorkspaceNode.tsx:128 → api.patch — unchanged

go vet ./... clean. go test ./internal/handlers/ -count 1 — all green
(4.3s, no regression).

Net: -388/+10 = ~378 lines removed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 11:59:43 -07:00
Hongming Wang
62fc25757c docs(backends): document Auto-dispatcher SoT pattern + source-level pins
Closes #10.

The 2026-05-05 hongming silent-drop incident shipped because the
backends.md parity matrix didn't enforce a "go through the dispatcher"
rule — three handlers (TeamHandler.Expand, OrgHandler.createWorkspaceTree,
workspace_crud.go's stopAndRemove) silently bypassed routing on
SaaS for ~6 months across two distinct verbs.

This doc pass:

- Adds a "How to dispatch" section that's the canonical answer to
  "where do I call Start / Stop / Has from?". Names the three
  dispatchers (provisionWorkspaceAuto, StopWorkspaceAuto,
  HasProvisioner), their fallbacks, and the allowed exceptions.
- Updates the matrix lifecycle rows so every dispatched operation
  points at the dispatcher source, not the per-backend bodies.
- Adds Org-import + Team-collapse rows so the bulk paths are visible
  to anyone scanning for parity gaps.
- Lists the source-level pins (4 of them) under Enforcement so
  future contributors see them as load-bearing tests, not noise.
- Adds a "When you add a NEW dispatch site" section so the next verb
  (Pause / Hibernate / Snapshot) lands as a dispatcher mirror, not
  as another bespoke handler that drifts from the existing two.
- Refreshes Last audit to 2026-05-05.

No code change; doc-only. The SoT abstractions described here landed
in PRs #2811 + #2824.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 20:25:10 -07:00
Hongming Wang
2f7beb9bce feat: drop shared_context — use memory v2 team namespace instead
Parent → child knowledge sharing previously lived behind a `shared_context`
list in config.yaml: at boot, every child workspace HTTP-fetched its parent's
listed files via GET /workspaces/:id/shared-context and prepended them as
a "## Parent Context" block. That paid the full transfer cost on every
boot regardless of whether the agent needed it, single-parent SPOF, no team
or org scope, and broken if the parent was unreachable.

Replace with memory v2's team:<id> namespace: agents call recall_memory
on demand. For large blob-shaped artefacts see RFC #2789 (platform-owned
shared file storage).

Removed:
- workspace/coordinator.py: get_parent_context()
- workspace/prompt.py: parent_context arg + injection block
- workspace/adapter_base.py: import + call + arg pass
- workspace/config.py: shared_context field + parser entry
- workspace-server/internal/handlers/templates.go: SharedContext handler
- workspace-server/internal/router/router.go: GET /shared-context route
- canvas/src/components/tabs/ConfigTab.tsx: Shared Context tag input
- canvas/src/components/tabs/config/form-inputs.tsx: schema field + default
- canvas/src/components/tabs/config/yaml-utils.ts: serializer entry
- 6 tests pinning the removed behavior; 5 doc references

Added regression gates so any reintroduction is loud:
- workspace/tests/test_prompt.py: build_system_prompt must NOT emit
  "## Parent Context"
- workspace/tests/test_config.py: legacy YAML key loads cleanly but
  shared_context attr must NOT exist on WorkspaceConfig
- tests/e2e/test_staging_full_saas.sh §9d: GET /shared-context must NOT
  return 200 against a live tenant

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 16:30:26 -07:00
Hongming Wang
b54968878a docs(internal): refresh runtime-package mirror policy + parity matrix + dead-link fix
- workspace-runtime-package.md: add explicit "Where to make changes"
  section documenting the mirror-only policy on
  Molecule-AI/molecule-ai-workspace-runtime — direct PRs are auto-rejected
  by mirror-guard CI; staging push regenerates both the mirror and the
  PyPI wheel via .github/workflows/publish-runtime.yml.
- infra/workspace-terminal.md: replace dead molecule-core#1528 reference
  (repo renamed to molecule-monorepo, no longer accepting issues at the
  old name) with a forward-pointer to monorepo + molecule-controlplane
  issue trackers.
- architecture/backends.md: bump audit date to 2026-05-02 and add rows
  for channel envelope enrichment (#2471), chat_history MCP tool
  (#2474), /activity before_ts paging (#2476), /activity peer_id filter
  (#2472), runtime_wedge smoke gate (#2473 + #2475), and the canvas-E2E
  state-file requirement (#2327).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 20:06:06 -07:00
rabbitblood
262a52a32c docs(security): document the KMS-rooted custody chain for SECRETS_ENCRYPTION_KEY
External architecture review flagged the SECRETS_ENCRYPTION_KEY env var
on the platform as encryption-at-rest theater. The reviewer read only
the platform repo and missed that the master key actually lives in AWS
KMS at the control plane layer, with envelope encryption wrapping each
tenant secret blob.

Adds docs/architecture/secrets-key-custody.md as the canonical source
of truth for the full chain:

- Two-mode envelope (KMS_KEY_ARN vs static-key fallback)
- Per-blob AES-256-GCM with KMS-wrapped DEKs
- Where each key actually lives (KMS, CP env, tenant env)
- Threat model per attacker capability
- Rotation story (annual KMS CMK rotation, manual DEK rotation on incident)
- Audit posture (SOC2 / ISO 27001 questionnaire bullets)

Patches three downstream docs that previously stopped at the env-var
level and link them to the new custody doc:

- development/constraints-and-rules.md (Rule 11)
- architecture/database-schema.md (workspace_secrets paragraph)
- architecture/molecule-technical-doc.md (env-vars table)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 11:29:16 -07:00
Hongming Wang
a56b765b2d
docs: testing strategy + PR hygiene + backend parity matrix + boot-event postmortem (#1824)
Bundles the documentation and lightweight tooling landed during the
2026-04-23 ops/triage session. Pure additions — no behavior changes.

## Added

### docs/architecture/backends.md
Parity matrix for Docker vs EC2 (SaaS) workspace backends. 18 features
tabulated with current status; 6 ranked drift risks; enforcement
hooks (parity-lint + contract tests). Living document — owners are
workspace-server + controlplane teams.

### docs/engineering/testing-strategy.md
Tiered test-coverage floors instead of a blanket 100% target. Seven
tiers by code class (auth/crypto → generated DTOs). Per-package
current-state snapshot + targets. Tracks the 3 biggest coverage gaps
(tokens.go 0%, workspace_provision.go 0%, wsauth ~48%) against their
tier-1/2 floors.

### docs/engineering/pr-hygiene.md
Captures the patterns that keep diffs reviewable. Motivated by the
2026-04-23 backlog audit where 8 of 23 open PRs had 70-380-file bloat
from stale branch drift. Covers: small-PR sizing, rebase-not-merge,
cherry-pick-onto-fresh-base for recovery, targeting staging first,
describing why-not-what.

### docs/engineering/postmortem-2026-04-23-boot-event-401.md
Postmortem for the /cp/tenants/boot-event 401 race. Root cause (DB
INSERT ordered AFTER readiness check), detection path (E2E + manual
log inspection), lessons (write-before-read pattern, integration
tests needed, E2E alerting gap, invariants-as-comments).

### tools/check-template-parity.sh
CI lint for template repos — diffs the `${VAR:+VAR=${VAR}}` provider-
key forwarders between install.sh (bare-host / EC2 path) and start.sh
(Docker path). Catches the #5 drift risk from backends.md before it
ships.

### workspace-server/internal/provisioner/backend_contract_test.go
Shared behavioral contract scaffold for Provisioner + CPProvisioner.
Compile-time assertions catch method-signature drift today; scenario-
level runs are t.Skip'd pending backend nil-hardening (drift risk #6,
see backends.md).

## Updated

### README.md
Links the new engineering docs + backends parity matrix into the
Documentation Map so agents and humans can actually find them.

## Related issues

- #1814 — unblock workspace_provision_test.go (broadcaster interface)
- #1813 — nil-client panic hardening (drift risk #6)
- #1815 — Canvas vitest coverage instrumentation
- #1816 — tokens.go 0% → 85%
- #1817 — 5 sqlmock column-drift failures
- #1818 — Python pytest-cov setup
- #1819 — wsauth middleware coverage gap
- #1821 — tiered coverage policy (meta)
- #1822 — backend parity drift tracker

Co-authored-by: Hongming Wang <hongmingwang.rabbit@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: molecule-ai[bot] <276602405+molecule-ai[bot]@users.noreply.github.com>
2026-04-23 19:59:38 +00:00
Hongming Wang
28bf11fb85 docs(security): move sensitive runbooks to private internal repo
Three changes to stop ferrying sensitive content through our public
monorepo. All content already imported to Molecule-AI/internal (private)
— see linked PRs below.

Contained full security audit cycle records with CWE references,
file:line pointers to historical vulnerabilities, and severity
ratings. None of that belongs in a public repo.

→ Moved to Molecule-AI/internal/security/incident-log.md (PR #20).
  Monorepo file becomes a 17-line stub pointing at the internal
  location. Future incidents land in the internal file only.

Had AWS account ID `004947743811` and IAM role name
`MoleculeStagingProvisioner` embedded. Even though the fleet
described isn't actually running (see state note), these
identifiers are account-specific and don't belong in public git.

→ Removed both values, replaced with generic references + a pointer
  to Molecule-AI/internal/runbooks/canary-fleet.md (PR #21) where
  the actual identifiers live. Any future rotation touches the
  internal file, no public-git-history rewrite needed.

Contained the full ops runbook: bootstrap script output, per-tenant
SG backfill loop with live SG IDs, customer slug names
(hongmingwang). Useful content but too specific for a public repo.

→ Moved to Molecule-AI/internal/runbooks/workspace-terminal.md
  (PR #22). Monorepo file becomes a 30-line public summary of what
  the feature does + pointers to code, so external readers /
  self-hosters still get the design story.

Marketing briefs, SEO plans, campaign copy, research dossiers, and
internal product designs (hermes-adapter-plan, medo-integration,
cognee-*) are the next batches. See docs policy doc coming next to
set team expectations.

Net removal: ~820 lines from public git going forward.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 22:39:23 +00:00
airenostars
7a89704b6e
fix(build): add missing fmt import + fix canvas Dockerfile GID (#1487)
* docs(canary-release): flag as aspirational; link to current state

The canary-release.md doc describes the pipeline as if the fleet is
running — referring to AWS account 004947743811 and a configured
MoleculeStagingProvisioner role. Reality as of 2026-04-22: no canary
tenants are provisioned, the 3 GH Actions secrets are empty, and
canary-verify.yml has failed 7/7 times in a row.

Added a top-of-doc ⚠️ state note that:

1. Clarifies this is intended design, not deployed reality.
2. Notes the AWS account ID is historical / unverified.
3. Explains that merges currently rely on manual promote-latest.
4. Cross-links to molecule-controlplane/docs/canary-tenants.md for
   the Phase 1 work that's shipped, the Phase 2 stand-up plan, and
   the "should we even do this now?" decision framework.
5. Asks whoever lands Phase 2 to reconcile the two docs.

No behaviour change — doc-only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(build): add missing fmt import in a2a_proxy.go, fix canvas Dockerfile GID

- a2a_proxy.go: missing "fmt" import caused build failure (8 undefined
  references at lines 743-775). Likely dropped during a recent merge.
- canvas/Dockerfile: GID 1000 already in use in node base image.
  Changed to dynamic group/user creation with fallback.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Hongming Wang <hongmingwang.rabbit@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Hongming Wang <hongmingwangrabbit@gmail.com>
2026-04-22 21:10:58 +00:00
Hongming Wang
a49e828588 docs: strip internal roadmap/followups from public org-api-keys docs
The monorepo docs/ tree is ecosystem + user-facing. Internal
roadmap ("what we'll build next", priorities, effort estimates)
doesn't belong there — customers reading our docs don't need our
backlog in their face, and we shouldn't signal "feature X is
coming" contractually when it's just a P2 item in internal
tracking.

Removes:
  - docs/architecture/org-api-keys-followups.md (the whole
    prioritized roadmap). Moved to the internal repo at
    runbooks/org-api-keys-followups.md where it belongs.
  - "Follow-up roadmap" section in docs/architecture/org-api-
    keys.md, replaced with a shorter "Known limitations" section
    that names the current constraints (full-admin only, no
    expiry, no user_id in session-minted audit) without
    speculating on when they change.
  - "What's coming" section in docs/guides/org-api-keys.md,
    replaced with "Current limits" that names the same
    constraints from the user's POV.

Public docs now describe the feature as it exists TODAY. Internal
tracking of what comes next lives in Molecule-AI/internal (private).
2026-04-20 14:31:46 -07:00
Hongming Wang
3982a5da52 feat(auth): org tokens reach /workspaces/:id/* subroutes + docs
Extends WorkspaceAuth to accept org API tokens as a valid
credential for any workspace sub-route in the org. Previously a
user minting an org token could hit admin-surface endpoints
(/workspaces, /org/import, etc.) but couldn't reach per-workspace
routes like /workspaces/:id/channels — those were gated by
WorkspaceAuth which only knew about workspace-scoped tokens.

Scope matches the explicit product spec: one org API key can
manipulate every workspace in the org. AI agents given a key can
read/write channels, tokens, schedules, secrets, tasks across all
workspaces.

## WorkspaceAuth tier order

  1. ADMIN_TOKEN exact match (break-glass / bootstrap)
  2. Org API token (Validate against org_api_tokens)           NEW
  3. Workspace-scoped token (ValidateToken with :id binding)
  4. Same-origin canvas referer

Org token tier sits above the per-workspace check so a presenter
of an org key doesn't hit the narrower ValidateToken failure path
first. Checked with isSameOriginCanvas path unchanged.

## End-to-end verified

Minted test token via ADMIN_TOKEN, then with that org token:
  - GET /workspaces             → 200 (list all)
  - GET /workspaces/<id>        → 200 (detail, admin-only route)
  - GET /workspaces/<id>/channels → 200 (workspace sub-route)
  - GET /workspaces/<id>/tokens   → 200 (workspace tokens list)
  - GET /workspaces/<bad-uuid>    → 404 workspace not found
                                    (routing still scoped correctly)

## Documentation

  - docs/architecture/org-api-keys.md — design, data model, threat
    model, security properties
  - docs/architecture/org-api-keys-followups.md — 10 tracked
    follow-ups prioritized (role scoping P1, per-workspace binding
    P1, expiry P2, usage metrics P2, WorkOS user_id capture P2,
    rotation webhooks P3, mint-rate limit P3, audit log P2, CLI
    P3, migrate ADMIN_TOKEN to the same table P4)
  - docs/guides/org-api-keys.md — end-user guide (mint via UI,
    use in curl/Python/TS/AI agents, session-vs-key comparison)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 14:11:45 -07:00
Hongming Wang
eecce56c13 feat(canary): rollback-latest script + release-pipeline doc (Phase 4)
Closes the canary loop with the escape hatch and a single place to
read about the whole flow.

scripts/rollback-latest.sh <sha>
  uses crane to retag :latest ← :staging-<sha> for BOTH the platform
  and tenant images. Pre-checks the target tag exists and verifies
  the :latest digest after the move so a bad ops typo doesn't
  silently promote the wrong thing. Prod tenants auto-update to the
  rolled-back digest within their 5-min cycle. Exit codes: 0 = both
  retagged, 1 = registry/tag error, 2 = usage error.

docs/architecture/canary-release.md
  The one-page map of the pipeline: how PR → main → staging-<sha> →
  canary smoke → :latest promotion works end-to-end, how to add a
  canary tenant, how to roll back, and what this gate explicitly does
  NOT catch (prod-only data, config drift, cross-tenant bugs).

No code changes in the CP or workspace-server — this PR is shell
+ docs only, so it's safe to land independently of the other Phase
{1,1.5,2,3} PRs still in review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 03:37:42 -07:00
Hongming Wang
96535c30cc docs: 2026-04-19 SaaS prod migration notes
Captures the 10-PR staging→main cutover: what shipped, the three new
Railway prod env vars (PROVISION_SHARED_SECRET / EC2_VPC_ID /
CP_BASE_URL), and the sharp edge for existing tenants — their
containers pre-date PR #53 so they still need MOLECULE_CP_SHARED_SECRET
added manually (or a re-provision) before the new CPProvisioner's
outbound bearer works.

Also includes a post-deploy verification checklist and rollback plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 02:29:31 -07:00
Hongming Wang
af2670cc53 fix(docs): update architecture + API reference paths for workspace-server rename
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 01:25:21 -07:00
Hongming Wang
a9036aec04 chore: gitignore CLAUDE.md, extract content to proper docs
CLAUDE.md was a 44KB catch-all mixing architecture docs (useful for
everyone) with agent operating instructions (internal). Split:

- docs/architecture/overview.md — system architecture, component
  descriptions, 13 key patterns (import cycles, health detection,
  communication rules, WebSocket flow, lifecycle, etc.)
- docs/api-reference.md — full REST API route table + database schema
- CLAUDE.md → gitignored (stays local for agent tooling)

All internal PR/issue references stripped from the new docs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 00:43:33 -07:00
Hongming Wang
92c60c313c chore: final open-source cleanup — binary, stale paths, private refs
- Remove compiled workspace-server/server binary from git
- Fix .gitignore, .gitattributes, .githooks/pre-commit for renamed dirs
- Fix CI workflow path filters (workspace-template → workspace)
- Replace real EC2 IP and personal slug in test_saas_tenant.sh
- Scrub molecule-controlplane references in docs
- Fix stale workspace-template/ paths in provisioner, handlers, tests
- Clean tracked Python cache files

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 00:38:55 -07:00
Hongming Wang
479a027e4b chore: open-source restructure — rename dirs, remove internal files, scrub secrets
Renames:
- platform/ → workspace-server/ (Go module path stays as "platform" for
  external dep compat — will update after plugin module republish)
- workspace-template/ → workspace/

Removed (moved to separate repos or deleted):
- PLAN.md — internal roadmap (move to private project board)
- HANDOFF.md, AGENTS.md — one-time internal session docs
- .claude/ — gitignored entirely (local agent config)
- infra/cloudflare-worker/ → Molecule-AI/molecule-tenant-proxy
- org-templates/molecule-dev/ → standalone template repo
- .mcp-eval/ → molecule-mcp-server repo
- test-results/ — ephemeral, gitignored

Security scrubbing:
- Cloudflare account/zone/KV IDs → placeholders
- Real EC2 IPs → <EC2_IP> in all docs
- CF token prefix, Neon project ID, Fly app names → redacted
- Langfuse dev credentials → parameterized
- Personal runner username/machine name → generic

Community files:
- CONTRIBUTING.md — build, test, branch conventions
- CODE_OF_CONDUCT.md — Contributor Covenant 2.1

All Dockerfiles, CI workflows, docker-compose, railway.toml, render.yaml,
README, CLAUDE.md updated for new directory names.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 00:24:44 -07:00
Hongming Wang
e906f49ec0 chore: open-source preparation — scrub secrets, add community files
Security:
- Replace hardcoded Cloudflare account/zone/KV IDs in wrangler.toml
  with placeholders; add wrangler.toml to .gitignore, ship .example
- Replace real EC2 IPs in docs with <EC2_IP> placeholders
- Redact partial CF API token prefix in retrospective
- Parameterize Langfuse dev credentials in docker-compose.infra.yml
- Replace Neon project ID in runbook with <neon-project-id>

Community:
- Add CONTRIBUTING.md (build, test, branch conventions, CI info)
- Add CODE_OF_CONDUCT.md (Contributor Covenant 2.1)

Cleanup:
- Replace personal runner username/machine name in CI + PLAN.md
- Replace personal tenant URL in MCP setup guide
- Replace personal author field in bundle-system doc
- Replace personal login in webhook test fixture
- Rewrite cryptominer incident reference as generic security remediation
- Remove private repo commit hashes from PLAN.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 00:10:56 -07:00
Hongming Wang
76d3b32ab9 fix: resolve PLAN.md merge conflict — keep both Phase 34 and Phase 36
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 21:41:32 -07:00
Hongming Wang
2dbb59cb35 docs: staging environment design + Phase 36 plan
Full staging environment that mirrors production. Every infra change
ships to staging first before promotion. Gates Phase 33 (Tunnel) and
Phase 35 (security hardening).

Components: Railway staging env, Neon branch, staging DNS, tagged
Docker images, promotion workflow, automated smoke tests.

Also marks Phase 33 as migrating from Worker to Cloudflare Tunnel
(issue #933), prerequisite: staging.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 20:37:11 -07:00
Hongming Wang
ecbcf02904 docs: Partner API Keys architecture + Phase 34 plan
Adds programmatic org management for partner platforms, CI/CD, and
automation. Partners authenticate with mol_pk_* API keys (SHA-256
hashed, scoped, rate-limited, revocable) alongside existing WorkOS
browser auth.

- Full architecture doc with schema, scopes, middleware integration,
  security considerations, and use cases
- Phase 34 in PLAN.md (4 sub-phases)
- CLAUDE.md cross-reference

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 14:07:50 -07:00
Hongming Wang
192f29e754 docs: tenant image upgrade strategies (Options A/B/C)
Documents three upgrade strategies for keeping tenant EC2 instances
current with platform-tenant:latest:
- Option A: Rolling restart via CP admin endpoint (coordinated)
- Option B: Sidecar auto-updater cron (implemented, 5 min interval)
- Option C: Blue-green via Worker (zero downtime, future)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 11:59:15 -07:00
Hongming Wang
49bd2e8f56 docs(wildcard-dns): address CEO review — KV cache, WebSocket, proxy trust
Addresses all 4 review points from PR #786:
1. Worker resilience: 3-tier cache (in-memory → KV → CP API) with stale
   fallback so CP outages are invisible to tenants
2. WebSocket proxying: documented upgradeHeader handling, fallback to
   keep Caddy for WS-only if Workers WS is unreliable
3. SG automation: note to auto-update Cloudflare IP ranges, don't hardcode
4. Trusted proxy: X-Forwarded-For / CF-Connecting-IP trust chain documented

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 10:17:43 -07:00
Hongming Wang
72285fb03e docs: wildcard DNS + Cloudflare Worker proxy architecture
Adds Phase 33 plan and architecture doc for replacing per-tenant DNS
records with a wildcard DNS + Cloudflare Worker proxy pattern.

Eliminates: DNS propagation delays, NXDOMAIN caching, per-instance
Let's Encrypt, Caddy on EC2. Same pattern used by Vercel, Railway,
Fly.io, WordPress, n8n.

4-phase migration: deploy Worker → stop creating DNS records →
remove Caddy from EC2 → cleanup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 10:02:32 -07:00
Hongming Wang
24fec62d7f initial commit — Molecule AI platform
Forked clean from public hackathon repo (Starfire-AgentTeam, BSL 1.1)
with full rebrand to Molecule AI under github.com/Molecule-AI/molecule-monorepo.

Brand: Starfire → Molecule AI.
Slug: starfire / agent-molecule → molecule.
Env vars: STARFIRE_* → MOLECULE_*.
Go module: github.com/agent-molecule/platform → github.com/Molecule-AI/molecule-monorepo/platform.
Python packages: starfire_plugin → molecule_plugin, starfire_agent → molecule_agent.
DB: agentmolecule → molecule.

History truncated; see public repo for prior commits and contributor
attribution. Verified green: go test -race ./... (platform), pytest
(workspace-template 1129 + sdk 132), vitest (canvas 352), build (mcp).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 11:55:37 -07:00