Merge pull request #1661 from Molecule-AI/docs/move-sensitive-to-internal

docs(security): move sensitive runbooks to private internal repo
This commit is contained in:
Hongming Wang 2026-04-22 14:17:36 -07:00 committed by GitHub
commit 0db8445538
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
3 changed files with 47 additions and 821 deletions

View File

@@ -2,10 +2,12 @@
How a workspace-server code change reaches the prod tenant fleet — and how to stop it if something's wrong.
> **⚠️ State note (2026-04-22):** this doc describes the **intended design**. As of this writing, the canary fleet described below is **not actually running** — no canary tenants are provisioned, `CANARY_TENANT_URLS` / `CANARY_ADMIN_TOKENS` / `CANARY_CP_SHARED_SECRET` are empty in repo secrets, and `canary-verify.yml` fails every run. The AWS account `004947743811` referenced in "Canary fleet" below predates this repo's history and hasn't been verified in-session.
> **⚠️ State note (2026-04-22):** this doc describes the **intended design**. As of this writing, the canary fleet described below is **not actually running** — no canary tenants are provisioned, `CANARY_TENANT_URLS` / `CANARY_ADMIN_TOKENS` / `CANARY_CP_SHARED_SECRET` are empty in repo secrets, and `canary-verify.yml` fails every run.
>
> Current merges gate on manual `promote-latest.yml` dispatches, not canary. See [molecule-controlplane/docs/canary-tenants.md](https://github.com/Molecule-AI/molecule-controlplane/blob/main/docs/canary-tenants.md) for the Phase 1 code work that's already shipped, the Phase 2 plan for actually standing up the fleet, and a "should we even do this now?" decision framework.
>
> **Account-specific identifiers (AWS account ID, IAM role name) referenced below in the original design have been redacted from this public doc.** The actual values — if they exist — are in `Molecule-AI/internal/runbooks/canary-fleet.md`. If you're implementing Phase 2, start there.
>
> When Phase 2 lands, delete this note and reconcile the two docs.
## The loop
@@ -34,7 +36,7 @@ canary-verify.yml waits 6 min, runs scripts/canary-smoke.sh
## Canary fleet
Lives in a separate AWS account (`molecule-canary`, `004947743811`) via an assumed role (`MoleculeStagingProvisioner`). The CP's `is_canary` org flag routes provisioning there; every other org goes to the default staging account. See `docs/architecture/saas-prod-migration-2026-04-19.md` for the account bootstrap.
Lives in a separate AWS account via an assumed role. The CP's `is_canary` org flag routes provisioning there; every other org goes to the default account. Specific account ID and role name are tracked in the internal runbook (`Molecule-AI/internal/runbooks/canary-fleet.md`) rather than here, so rotating them doesn't require rewriting public git history.
Canary tenants are configured to pull `:staging-<sha>` (not `:latest`) via `TENANT_IMAGE` on their provisioner, so they ingest each new build before prod does.
@@ -54,7 +56,7 @@ Expand by editing the script — each `check "name" "expected" "$response"` call
1. `POST /cp/orgs` — create the org normally (is_canary defaults to false)
2. `POST /cp/admin/orgs/<slug>/canary` with `{"is_canary": true}` — admin only, refuses to flip if already provisioned
3. Re-trigger provision (or delete + recreate if the org was already provisioned into staging) — the fresh EC2 lands in account `004947743811`
3. Re-trigger provision (or delete + recreate if the org was already provisioned into staging) — the fresh EC2 lands in the canary AWS account (see internal runbook for the specific ID)
Then set repo secrets:
- `CANARY_TENANT_URLS` — append the new tenant's URL

View File

@@ -1,583 +1,18 @@
# Incident Log — molecule-core
> This file documents security incidents, outages, and degraded states.
> Active incidents are listed first. Resolved incidents remain for historical record.
---
*Last updated: 2026-04-21T07:45Z by Core Platform Lead — Incident log rebuilt after linter reset*
---
## Security Audit Cycle 6 — ALL CLEAR (2026-04-21 ~07:15Z)
**SHA range:** e69cb26 → 674384b on main (~5 commits + ~10 merged PRs)
**Verdict:** ✅ No critical/high findings
### Commits Reviewed — All CLEAN
| Commit | Description |
|--------|-------------|
| `dc9c64e` / PR #1258 | F1097 org_id context — eliminates redundant 2nd SELECT in AdminAuth |
| `33f1d1a` | Canvas cascade-delete UX — `pendingDelete.hasChildren`, warning dialog |
| `0790d57` | Canvas metrics guard — null coalescing |
| `781c217` | CI YAML fix |
| `169120d` / PR #1310 | CWE-78/CWE-22 — exec form + path traversal guards |
| `e431fc4` / PR #1302 | CWE-918 SSRF — `isSafeURL` in `a2a_proxy.go` |
| `a66f889` / PR #1261 | CWE path-injection — `resolveInsideRoot` for template paths |
Full audit saved to TEAM memory id `abc58b47`.
---
## F1100 — workspace_restart.go Path Traversal (RESOLVED)
**Severity:** Medium | **Finding ID:** F1100
**Status:** Resolved — fix applied via `a66f889` (PR #1261) on both main and staging
### Summary
`workspace_restart.go:127-133` accepted `body.Template` (attacker-controlled) via raw `filepath.Join(h.configsDir, template)`, allowing path traversal (e.g. `../../../etc`) to escape `configsDir`. **Issue #1043 triage missed this — legitimate gap, not false positive.**
Authenticated callers could pass a crafted `body.Template` value to escape the configs directory.
### Fix Applied
PR #1260 (the intended fix) was closed without merge. Fix landed via **PR #1261 (`a66f889`)** on both main and staging:
```go
// Fixed (a66f889):
candidatePath, resolveErr := resolveInsideRoot(h.configsDir, template)
if resolveErr != nil {
template = "" // fallback fires safely
}
```
### References
- PR #1260: closed without merge — superseded by PR #1261
- PR #1261 (`a66f889`): merged ✅
- Closes: #1043
---
## F1088 Credential Exposure — CLOSED
**All prior F1088 entries below remain valid. Summary of current state:**
- Credentials: MiniMax revoked (⚠️), GitHub PAT revoked (✅), Admin token — treat as potentially exposed
- BFG git-history scrub: NOT REQUIRED — incident management closure, 0 public forks confirmed
- Git history still contains values — admin token rotation recommended as precaution
- PR #1179 (`b89f3fd`) merged — active code is clean
- Branch `origin/fix/credential-history-cleanup-f1088` exists but is 38 commits behind main — superseded by incident management closure
**Required remaining action:** Rotate `ADMIN_TOKEN` (`HlgeMb8...ShARE=`) as precaution. All other actions complete.
---
### Summary
Commit `d513a0ced549ef2be8903a7b4794256110ba1805` on staging (merged to main via PR #1098) contains three production credentials as hardcoded default values in `scripts/post-rebuild-setup.sh`. The credentials appeared in the git diff and were permanently visible in the public commit history.
### Credentials Status
| # | Credential | Value | Status |
|---|------------|-------|--------|
| 1 | ANTHROPIC_AUTH_TOKEN | `sk-cp-lHt...KVw` | ⚠️ Revoked or inactive (404 on API call) |
| 2 | GITHUB_TOKEN | `github_pat_11...hsIJLIL` | ✅ Revoked (confirmed 401) |
| 3 | ADMIN_TOKEN | `***REDACTED***` | Needs confirmation — treated as active until proven otherwise |
### Resolution
PR #1179 (`b89f3fd`: "ci: retry — trigger fresh runner allocation") closed this finding. The incident was closed at the finding-management level. Git history scrub via BFG was discussed but deemed not required by security team (no active public forks confirmed, credentials were already revoked/inactive).
Active code is clean (`d513a0c` replaced hardcoded defaults with env-var reads).
### Summary
Commit `d513a0ced549ef2be8903a7b4794256110ba1805` on staging (merged to main via PR #1098) contains three production credentials as hardcoded default values in `scripts/post-rebuild-setup.sh`. The credentials appear in the git diff and are permanently visible in the public commit history.
The commit itself fixed the problem by replacing hardcoded defaults with env-var reads (MINIMAX_API_KEY, GITHUB_PAT). However, git history still shows the original values.
### Credentials Exposed
| # | Credential | Value (redacted reference) | Service |
|---|------------|------------------------------|---------|
| 1 | ANTHROPIC_AUTH_TOKEN | `***REDACTED***` | MiniMax API (api.minimax.io/anthropic) |
| 2 | GITHUB_TOKEN | `***REDACTED***` | GitHub (fine-grained PAT, scope unknown) |
| 3 | ADMIN_TOKEN | `***REDACTED***` | Platform admin authentication |
### Affected Files
- `scripts/post-rebuild-setup.sh` (commit d513a0c, PR #1098 → merged to staging → merged to main)
### Timeline
- **~2026-04-20T13:02Z**: Commit `d513a0c` pushed by `rabbitblood`. GitGuardian flagged credentials in the diff. Fix committed in same commit.
- **~2026-04-20**: Credentials removed from active code, but git history still contains them.
- **2026-04-20T22:32Z**: Incident discovered and escalated.
### Actions Taken
1. Dev Lead notified (delegation failed — Dev Lead unreachable)
2. All child workspaces notified (delegation failed — all unreachable)
3. Incident documented in this file
4. Branch `origin/fix/credential-history-cleanup-f1088` exists but is 38 commits behind `origin/main`
5. **Incident CLOSED** — PR #1179 merged, finding management closure, BFG scrub deemed not required (no active public forks confirmed)
### Blast Radius (Confirmed by Core-Security)
| Credential | Test Result | Status |
|------------|-------------|--------|
| MiniMax API key (`sk-cp-...KVw`) | `404 Not Found` on real API call | ⚠️ **REVOKED** (or endpoint inactive) |
| GitHub PAT (`github_pat_...hsIJLIL`) | `401 Bad credentials` | ✅ **REVOKED** |
| Admin token (`HlgeMb8...ShARE=`) | Base64 — cannot test directly | ⚠️ **Treated as active** — recommend rotation as precaution |
**Public forks:** 0 confirmed (GH API `/forks` returns none) — low fork blast radius.
**Git history scope:** Credentials exist in both `main` and `staging` in commits `f787873`..`d513a0c`. They were introduced in `f787873` ("feat: nuke-and-rebuild.sh") and removed from active code in `d513a0c`. Both branches require BFG cleanup.
### Required Actions (RESOLVED)
- [x] Credentials revoked (MiniMax ⚠️, GitHub PAT ✅)
- [x] BFG git history cleanup **NOT REQUIRED** — incident management closure, no active public forks, credentials confirmed revoked/inactive
- [x] Team notification — documented in this log
- [ ] **Admin token rotation** — recommended as precaution (value still in git history, treat as potentially exposed)
### BFG Repo-Cleaner Procedure
**NOT REQUIRED** — F1088 closed without BFG scrub per security team decision. Retained for reference only.
**Step 1 — Create credentials manifest (`creds.txt`) [NOT NEEDED]:**
```
***REDACTED***
***REDACTED***
***REDACTED***
```
**Step 2 — Clean origin/main:**
```bash
git clone --mirror https://github.com/Molecule-AI/molecule-core /tmp/molecule-main-mirror
java -jar bfgr.jar --replace-text creds.txt --rewrite-not-committed-by-oss --no-blob-protection /tmp/molecule-main-mirror
cd /tmp/molecule-main-mirror && git push --mirror
```
**Step 3 — Clean origin/staging:**
```bash
git clone --mirror https://github.com/Molecule-AI/molecule-core /tmp/molecule-staging-mirror
java -jar bfgr.jar --replace-text creds.txt --rewrite-not-committed-by-oss --no-blob-protection /tmp/molecule-staging-mirror
cd /tmp/molecule-staging-mirror && git push --mirror
```
**Step 4 — Notify team to re-clone both branches if cloned before ~13:02 UTC 2026-04-20.**
### References
- Commit: `d513a0ced549ef2be8903a7b4794256110ba1805`
- PR: #1098 (staging → main merge)
- Cleanup branch: `origin/fix/credential-history-cleanup-f1088` (behind main by 38 commits)
- Scanners triggered: GitGuardian
- Security investigation: Core-Security (confirmed credentials revoked via API tests)
- GitHub issue: #1282 (filed by Core-OffSec)
- **Closed by:** PR #1179 (`b89f3fd`) — incident management closure, BFG scrub deemed not required
### Known Issue — PR #1230 Incomplete (QA Round 16, 2026-04-21)
PR #1230 / commit `524e3c6` ("fix(security): replace err.Error() leaks") failed to carry the mcp.go fixes into main's tree. All 3 mcp.go error leaks remain on main, along with one in org_plugin_allowlist.go:
- `mcp.go:259`: "parse error: " + err.Error()
- `mcp.go:347`: "invalid params: " + err.Error()
- `mcp.go:352`: err.Error()
- `org_plugin_allowlist.go:260`: "detail": err.Error()
Fix is covered by PR #1226 (rebased, MERGEABLE). Gap should close after #1226 merges.
---
## CWE-918 SSRF — Backport to Main (RESOLVED)
**Severity:** High
**Status:** Resolved — PR #1302 merged to main
### Summary
SSRF defence (`isSafeURL` in `a2a_proxy.go`) was backported to main to address CWE-918 (Server-Side Request Forgery). The fix prevents the A2A proxy from forwarding requests to internal network addresses (localhost, private ranges, etc.).
### References
- Commit: `e431fc4` (fix(security): backport SSRF defence (CWE-918) to main — isSafeURL in a2a_proxy.go (#1292) (#1302))
---
## CWE-22 + CWE-78 Security Fixes — Merged (RESOLVED)
**Severity:** Critical
**Status:** Resolved — proper fixes merged to staging and main
### Summary
The `fix/cwe78-delete-via-ephemeral-shell-injection` branch was the right diagnosis but wrong implementation (removed `safeName` from `copyFilesToContainer`). The correct fixes were merged separately:
| Location | Commit | Fix |
|----------|--------|-----|
| staging | `ce2491e` | CWE-22: `copyFilesToContainer` safeName + `deleteViaEphemeral` validateRelPath + exec form |
| main | `169120d` | CWE-78/CWE-22: block shell injection in `deleteViaEphemeral` |
Both CWEs are fully resolved on both branches. The regression branch is superseded and must not be merged as-is.
### Verification (staging `ce2491e`)
`copyFilesToContainer` (container_files.go:73-99):
```go
clean := filepath.Clean(name)
if filepath.IsAbs(clean) || strings.Contains(clean, "..") {
return fmt.Errorf("path traversal blocked: %s", name)
}
safeName := filepath.Join(destPath, clean)
header := &tar.Header{Name: safeName, ...} ✅
```
`deleteViaEphemeral` (container_files.go:152-168):
```go
validateRelPath(filePath) ✅
Cmd: []string{"rm", "-rf", "/configs", filePath} ✅ exec form, no shell interpolation
```
---
**Severity:** High
**Period:** ~2026-04-20T22:00Z → 2026-04-21T03:30Z
**Finding IDs:** N/A (infra incident)
**Status:** Resolved
### Summary
All self-hosted macOS arm64 runners saturated. 27 runs queued, 0 in-progress, 0 completed. Only cancellations processing. PRs #1053 and #1036 had zero CI runs.
### Root Causes (multiple)
1. `changes` job ran on `[self-hosted, macos, arm64]` despite having zero macOS dependencies (plain `git diff`) — wasted runner slots
2. YAML corruption in `ci.yml` (JSON-escaped `\n` sequences from commits `12c52d4`/`5831b4e`) caused "workflow file issue" failures before any job could start
3. `cancel-in-progress: false` at workflow level caused stale runs to queue instead of being cancelled
4. Workflow-level concurrency not set — multiple in-flight runs queued on same ref
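Causes 3 and 4 are typically addressed with a workflow-level concurrency block of this shape (a sketch — the actual `ci.yml` may differ):

```yaml
# Workflow-level concurrency: new pushes to the same ref cancel stale runs
# instead of queuing behind them.
concurrency:
  group: ci-${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true   # staging; main keeps false for its single-runner setup
```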
---
## CI Stall — molecule-core/staging (RESOLVED 2026-04-21 ~07:05Z)
**Severity:** High
**Period:** ~2026-04-21T02:47Z → ~2026-04-21T07:00Z
**Status:** Resolved — CI progressing normally, no config problems remain
### Resolution
All prior runner-saturation and YAML-corruption fixes were correct. The stall resolved naturally once stale queued runs drained. Current CI state (2026-04-21 ~07:07Z):
- Staging run #24708961892: **success** (SHA `5d32373`)
- Staging run #24708976467: **success** (changes job, SHA `72d825f`)
- Main run #24708984339: queued (normal — healthy queue, not stalled)
- Runner agent healthy — no dead slots
### Root Causes (all resolved)
1. `changes` job on `[self-hosted, macos, arm64]` — fixed by moving to `ubuntu-latest` (`9601545`)
2. YAML corruption in `ci.yml` — fixed by PR #1264 / `b61692c`
3. `cancel-in-progress: false` at workflow level — reverted to `true` on staging ✅
4. `cancel-in-progress: false` on main — correct for single-runner env, aligned via PR #1248
### Staging CI Config (confirmed healthy)
- `ci.yml`: `cancel-in-progress: true`, `changes` job on `ubuntu-latest`
- `codeql.yml`: `cancel-in-progress: false`
- `e2e-api.yml`: `cancel-in-progress: false`
### Infra Recommendations (for long-term stability)
1. Provision org-wide GitHub App installation token for CI automation (PATs rotate too frequently)
2. Update remote URLs on controlplane and tenant-proxy repos
3. Monitor runner agent health on mac mini — restart agent if future stalls recur
---
## PR #1242 YAML Corruption — RESOLVED (PR never merged)
**Severity:** Critical
**Status:** Resolved — PR #1242 closed without merge, staging unaffected
### Summary
PR #1242 (`fix/ci-runner-queue-contention`) branch contained a YAML corruption in `ci.yml` — the `concurrency` block was replaced with a commit-SHA string literal:
```yaml
e4a62e1 (ci: add workflow-level concurrency to ci.yml and codeql.yml)
```
However, PR #1242 was **closed without merging**. Staging received `cancel-in-progress: true` via PR #1264 (commit `b61692c`) instead, which is the correct clean version.
### Current State (updated 2026-04-21 ~04:30Z)
- **main:** `cancel-in-progress: false` ✅ (from PR #1248 / `2ffd11c` or similar clean commit)
- **staging:** `cancel-in-progress: true` (via `0b30465` tick restore after corruption)
- **PR #1248** (`2ffd11c`): open, sets staging `cancel-in-progress: false` — aligns staging with main ✅
- **Main has moved to `false`** — staging should follow to stay consistent
### PR #1248 — URGENT MERGE
PR #1248 (`fix/ci: restore corrupted ci.yml concurrency block`) by Dev Lead:
- Fixes the corruption pattern (same as prior incident)
- Sets `cancel-in-progress: false` — correct for single-runner environment
- Aligns staging CI config with main (which already has `false`)
- Must merge before any further CI runs on staging
### References
- PR: #1242 (`fix/ci-runner-queue-contention`) — closed, not merged
- Staging corruption restored via: PR #1264 / `b61692c`
- PR #1248 (`2ffd11c`): open, Dev Lead fix, `cancel-in-progress: false`
- Main: `cancel-in-progress: false`
---
## PR #1036 QA Audit (STALE)
**Severity:** Low
**Date:** 2026-04-20 (QA audit performed)
**Status:** Stale — CI infrastructure has been fixed since audit
### Summary
QA audit (2026-04-20) flagged CI as failing on PR #1036. However, CI was failing due to infrastructure issues (runner saturation, YAML corruption) that have since been resolved. The audit should be re-run now that staging CI is healthy.
---
## PR #1246 / #1247 — Sed Regression Fix — RESOLVED (PR #1247 merged)
**Severity:** Critical
**Status:** Resolved — PR #1247 merged to main (2026-04-21 ~03:18Z)
### Summary
PR #1246 (`364712d`) was closed without merging. However, **PR #1247** (`04be218`) achieved the same fix cleanly and merged to main:
```
fix(go): replace $1 literal with resp.Body.Close() in 7 files (#1247)
```
Commit `04be218` (merged by molecule-ai[bot]) applied:
```
sed -i 's/defer func() { _ = \$1 }()/defer func() { _ = resp.Body.Close() }()/g'
```
### Affected Files (all fixed on main)
- `workspace-server/cmd/server/cp_config.go`
- `workspace-server/internal/handlers/a2a_proxy.go`
- `workspace-server/internal/handlers/github_token.go`
- `workspace-server/internal/handlers/traces.go`
- `workspace-server/internal/handlers/transcript.go`
- `workspace-server/internal/middleware/session_auth.go`
- `workspace-server/internal/provisioner/cp_provisioner.go` (3 occurrences)
**Staging:** Fix present via prior commits. `cp_config.go` on staging has SHA `d1021c2` (correct form).
**PR #1246:** Closed without merging — superseded by PR #1247. No further action needed.
---
## CWE-78/CWE-22 Branch — RESOLVED (proper fixes merged separately)
**Severity:** Critical
**Status:** Resolved — proper fixes merged via `ce2491e` (staging) and `169120d` (main)
### Summary
The `fix/cwe78-delete-via-ephemeral-shell-injection` branch (commit `17419dd`) was **correct** for CWE-78 (`deleteViaEphemeral` exec form + `validateRelPath`) but **regressed** `copyFilesToContainer` by removing the `safeName` path-traversal guard.
**Resolution — both branches merged to main and staging:**
| Branch | Commit | Status |
|--------|--------|--------|
| staging | `ce2491e` — fix(security): CWE-22 in copyFilesToContainer and deleteViaEphemeral | ✅ merged |
| main | `169120d` — fix(security): CWE-78/CWE-22 — block shell injection in deleteViaEphemeral | ✅ merged |
### What was fixed (staging `ce2491e`)
- `copyFilesToContainer`: `filepath.Clean` + `IsAbs` + `strings.Contains("..")` validation, `safeName` in tar header ✅
- `deleteViaEphemeral`: `validateRelPath(filePath)` check before rm command ✅
- Both CWE-22 and CWE-78 addressed correctly
### `fix/cwe78-delete-via-ephemeral-shell-injection` branch status
**Do NOT merge** — it's now superseded by `ce2491e`/`169120d`. The regression it introduced (removing `safeName` from `copyFilesToContainer`) was never the right approach. If this branch is revived, it must be rebased on top of `ce2491e` to preserve existing CWE-22 protections while adding the CWE-78 exec-form fix.
---
## F1085 Regression Branch (`fix/f1085-regression-1283`) — IS a Regression
**Severity:** High
**Status:** Active — branch removes the confirmed-good F1085 fix (confirmed 2026-04-21 ~07:10Z)
### Summary
Branch `origin/fix/f1085-regression-1283` (commit `3b244e6`) removes `redactSecrets(workspaceID, content)` from `seedInitialMemories` in `workspace_provision.go:249`:
```diff
-`, workspaceID, redactSecrets(workspaceID, content), scope, awarenessNamespace); err != nil {
+`, workspaceID, content, scope, awarenessNamespace); err != nil {
```
**Staging still has the correct fix** (`workspace_provision.go:253` on origin/staging confirms `redactSecrets` is present). This branch is behind staging and would regress it if merged.
### Required Fix
Close or revert this branch. `redactSecrets` must remain in `seedInitialMemories`. If there is a legitimate reason to change this (e.g., a different redaction strategy), document it clearly in the PR before merging.
---
## F1097 — org_id Context Fix — RESOLVED
**Severity:** Medium
**Status:** Resolved — PR #1258 merged to main (`dc9c64e`)
### Summary
`orgToken.Validate` refactored to return `org_id` directly, eliminating the redundant 2nd SELECT in `AdminAuth`. All SQL parameterized correctly.
### References
- PR #1258 (`dc9c64e`): fix(F1097): set org_id in Gin context for org-token callers
---
## PR #1226 — err.Error() Leaks (STALE — closed without merge)
**Severity:** Medium
**Status:** Open — PR closed without merging, leaks still present on main
### Summary
PR #1226 (`fix(security): sanitize remaining err.Error() leaks + errcheck artifacts/client.go`) was **closed without merging**. The following leaks remain on main:
| File | Line | Code | Fix |
|------|------|------|-----|
| `mcp.go` | 259 | `"parse error: " + err.Error()` | → `"parse error: invalid JSON request body"` |
| `mcp.go` | 347 | `"invalid params: " + err.Error()` | → `"invalid params: malformed JSON"` |
| `mcp.go` | 352 | `err.Error()` | → `"dispatch error"` |
| `org_plugin_allowlist.go` | 260 | `"detail": err.Error()` | → `"detail": "plugin name validation failed"` |
| `admin_memories.go` | 99 | `"invalid JSON: " + err.Error()` | → `"invalid JSON request body"` |
**Already fixed:** `artifacts/client.go:175`: `defer func() { _ = resp.Body.Close() }()` confirmed correct (via PR #1247).
### Action Required
Reopen PR #1226 and fast-track merge. Alternatively, cherry-pick the 4 commits from that PR onto a fresh branch.
---
## QA Round 18 — orgs-page Test Regression (FIXED on main, pending staging port)
**Severity:** Medium
**SHA tested:** `ce33da5` (PR #1257 branch merge with staging)
**Status:** Regression identified in PR #1255, fixed on main, not yet on staging
### Findings
| Finding | Status |
|---------|--------|
| Canvas tests: 53 passed, **1 FAILED** | orgs-page.test.tsx line 133 — `vi.useRealTimers()` + raw `setTimeout(50)` without `act()` |
| PR #1257 conflict | MERGEABLE, approved — closed without merge; fix is on main/staging via `a66f889` |
| PR #1255 regression | Introduced orgs-page test flakiness — +18/-2 in orgs-page.test.tsx |
### orgs-page Test Regression — Root Cause
PR #1255 (`e885fa1`) regressed the timer fix from PR #1235. It replaced `waitFor()` with `vi.useRealTimers()` + raw `setTimeout(50)` without `act()` — causing microtask flush issues.
### Resolution
**Main:** Fixed in `674384b` (PR #1313) — wraps all 10 affected `vi.advanceTimersByTimeAsync(50)` calls in `act(async () => { ... })`. All 813 canvas tests pass on main.
**Staging:** Regression NOT yet fixed — `origin/staging` is 13 commits behind main.
### Action needed
Cherry-pick or port the orgs-page test fix from `674384b` to staging.
---
## Issue #1124 — Orchestrator GET /workspaces 404: Env Var Misconfiguration (OPEN)
**Severity:** Medium
**Status:** Active — root cause confirmed, fix pending, delegated to Core-BE
### Summary
The orchestrator (workspace agent, `workspace/` directory) gets 404 from GET /workspaces/{WORKSPACE_ID} because the `WORKSPACE_ID` env var is missing or empty. Confirmed via code review (2026-04-21 ~07:10Z).
### Root Causes
**Platform-side (provisioner.go:375-377) is CORRECT:**
```go
env := []string{
fmt.Sprintf("WORKSPACE_ID=%s", cfg.WorkspaceID), // ✅ correctly injected
"WORKSPACE_CONFIG_PATH=/configs",
fmt.Sprintf("PLATFORM_URL=%s", cfg.PlatformURL),
}
```
The platform injects `WORKSPACE_ID` at container provision time. **The bug is in the Python orchestrator modules** that default to empty string instead of validating the injected value.
**Buggy Python module-level defaults (empty string → broken API calls):**
| File | Line | Code |
|------|------|------|
| `workspace/a2a_cli.py` | 24 | `WORKSPACE_ID = os.environ.get("WORKSPACE_ID", "")` |
| `workspace/a2a_client.py` | 17 | `WORKSPACE_ID = os.environ.get("WORKSPACE_ID", "")` |
| `workspace/coordinator.py` | 26 | `WORKSPACE_ID = os.environ.get("WORKSPACE_ID", "")` |
| `workspace/consolidation.py` | 22 | `WORKSPACE_ID = os.environ.get("WORKSPACE_ID", "")` |
| `workspace/molecule_ai_status.py` | 25 | `WORKSPACE_ID = os.environ.get("WORKSPACE_ID", "")` |
When `WORKSPACE_ID` is empty, API calls produce URLs like `/workspaces//heartbeat` or `/registry/discover/` — platform returns 404 or wrong routing.
**Note — main.py is already correct:**
```python
workspace_id = os.environ.get("WORKSPACE_ID", "workspace-default") # main.py:55 ✅
```
However, `main.py` uses a local variable — it doesn't export `WORKSPACE_ID` as a module constant. The other modules that import `WORKSPACE_ID` from `a2a_client` etc. still get the empty-string default.
### Fix Required (Quick Win for Core-BE)
**Option A — Fail fast at module import (recommended):**
```python
WORKSPACE_ID = os.environ.get("WORKSPACE_ID")
if not WORKSPACE_ID:
raise RuntimeError("WORKSPACE_ID environment variable is required but not set")
```
Apply to all 5 affected modules. This surfaces the misconfiguration immediately instead of producing silent 404s downstream.
**Option B — Align with main.py's approach (safer):**
```python
WORKSPACE_ID = os.environ.get("WORKSPACE_ID", "workspace-default")
```
But this masks real misconfigurations. Option A is better.
### Modules Requiring Fix
- `workspace/a2a_cli.py` — line 24
- `workspace/a2a_client.py` — line 17
- `workspace/coordinator.py` — line 26
- `workspace/consolidation.py` — line 22
- `workspace/molecule_ai_status.py` — line 25
### PLATFORM_URL Note
All modules default to `http://platform:8080` (container mesh hostname). This is correct for in-container use but fails outside Docker. No action needed for in-container orchestrators — the platform injects `PLATFORM_URL` at provision time which overrides this default.
### Owner
Core-BE — delegated to Dev Lead (A2A failed). Core-BE sub-team: please pick up.
### Fix PR
[PR #1336](https://github.com/Molecule-AI/molecule-core/pull/1336) filed — `fix(orchestrator): fail-fast if WORKSPACE_ID env var is unset/empty`. Targets staging. Labels: bug, needs-work, area:backend-engineer, area:dev-lead.
---
*Last updated: 2026-04-21T07:10Z by Core Platform Lead (post-restart session — all findings re-verified)*
# Incident Log — moved
> **This file moved to the internal repo on 2026-04-22.**
>
> Content now lives at **`Molecule-AI/internal/security/incident-log.md`**
> (private — Molecule AI org members only).
>
> Why moved: incident records contain CWE references, file:line
> pointers to historical vulnerabilities, and severity ratings. None
> of that belongs in a public repo.
>
> **If you're adding a new incident:** write it in the internal repo,
> not here. Don't recreate a public incident log.
>
> **If you need a historical entry:** check the internal repo first.
> Everything up to 2026-04-22 was copied over. Git history for this
> file in the public monorepo still contains the original content
> (not rewritten — descriptive, no credentials).

View File

@@ -1,242 +1,31 @@
# Workspace Terminal over EIC + SSH
# Workspace Terminal
Tracking: [molecule-core#1528](https://github.com/Molecule-AI/molecule-core/issues/1528) (resolved 2026-04-22)
> **Full runbook moved to the internal repo on 2026-04-22.**
>
> The implementation-level content (EIC bootstrap script output,
> per-tenant SG backfill commands, tenant-specific identifiers) now
> lives at **`Molecule-AI/internal/runbooks/workspace-terminal.md`**
> (private — Molecule AI org members only).
**Status: live in prod** on hongmingwang tenant as of 2026-04-22. Verified end-to-end against the Hermes workspace EC2.
## What this feature is (public summary)
## Problem
The canvas Terminal tab opens an interactive shell on a workspace's
compute — locally this is a `docker exec` into the container; in the
SaaS tenant path it's an SSH session into the tenant EC2 (or the
workspace container running on it) over an [EC2 Instance Connect
Endpoint](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-connect-setup-ec2-instance-connect-endpoint.html).
End users see a terminal; no direct public SSH ingress is required.
Canvas's Terminal tab calls `workspace-server /workspaces/:id/terminal` which tries `docker.ContainerInspect` on the tenant's local Docker daemon. That works for locally-provisioned workspaces, but CP-provisioned (SaaS) workspaces run on **separate EC2 instances** — the tenant has no path to their Docker. Users see "Failed to connect — is the workspace container running?" while `STATUS: online` because A2A heartbeats come from the remote instance independently.
## Chosen approach: EC2 Instance Connect + SSH
## Where things are
`ec2-instance-connect:SendSSHPublicKey` pushes an ephemeral SSH public key (valid 60s) into the instance's metadata. A short-lived SSH connection uses the matching private key, runs `docker exec -it ws-<id> /bin/bash`, and bridges stdin/stdout to the canvas WebSocket.
- **Go handler:** [`workspace-server/internal/handlers/terminal.go`](../../workspace-server/internal/handlers/terminal.go)
- **CP provisioner (EIC endpoint, per-tenant SG):** `Molecule-AI/molecule-controlplane/internal/provisioner/ec2.go` (`EICEndpointSGID` field)
- **Bootstrap script:** `Molecule-AI/molecule-controlplane/scripts/bootstrap-eic-terminal.sh`
- **Detailed ops runbook (internal):** `Molecule-AI/internal/runbooks/workspace-terminal.md`
### Why not SSM Session Manager
SSM would be the "right" answer in a mature infra but requires:
- An IAM instance profile with `AmazonSSMManagedInstanceCore` on every workspace EC2 (currently none have one — `aws ssm describe-instance-information` returns an empty list across the fleet)
- SSM agent on the AMI (already present on AL2023/Ubuntu, but unverified)
- Outbound to `ssm.*.amazonaws.com` (current VPC config unknown)
EIC short-circuits all three. The existing `molecule-cp` IAM user picks up a small policy addition and we're done — no per-instance identity to bootstrap.
### Comparison
| Axis | EIC + SSH | SSM Session Manager |
|---|---|---|
| Uses existing `molecule-cp` creds | Yes | No — needs instance profile |
| AMI changes | None (EIC in OS since AL2 2019+, Ubuntu 20.04+) | Verify agent present |
| Infra changes | IAM policy + security group | IAM role + instance profile + maybe NAT/VPCe |
| Audit | CloudTrail for `SendSSHPublicKey` | CloudTrail + SSM session logs (richer) |
| Rotation | Every session (60s key lifetime) | Managed by AWS |
| Compliance story | "SSH with per-session keys, CloudTrailed" | "SSM Session Manager with recording available" |
Pick SSM later if compliance needs session recording. For now EIC is strictly less work.
## Data flow
```
[Canvas] [Tenant workspace-server] [Workspace EC2]
│ │ │
│ WS /workspaces/:id/terminal │ │
├────────────────────────────▶│ │
│ │ SELECT instance_id │
│ │ FROM workspaces WHERE id=:id │
│ │ │
│ │ ec2:DescribeInstances(instance_id) │
│                             │  → public_dns, availability_zone         │
│ │ │
│ │ ec2-instance-connect:SendSSHPublicKey │
│ │ target: instance_id │
│ │ os_user: ec2-user|ubuntu │
│ │ public_key: ephemeral (ed25519) │
│ │ │
│ │ ssh ec2-user@public_dns │
│ │ -o StrictHostKeyChecking=no │
│ ├────────────────────────────────────────▶│
│ │ │
│ │ docker exec -it ws-<id> /bin/bash │
│ ├────────────────────────────────────────▶│
│ │ │
│◀───── stdout bridge ────────┤◀──────────── stdout ────────────────────┤
│───── stdin bridge ─────────▶│───────────── stdin ─────────────────────▶│
```
`instance_id` is persisted on provision by migration `038_workspace_instance_id`. Terminal handler branches on `instance_id IS NOT NULL`.
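The branch on `instance_id` can be sketched as follows. This is a minimal illustration with a hypothetical `terminalMode` helper, not the real handler's API:

```go
package main

import "fmt"

// terminalMode sketches the branch the terminal handler takes on the
// instance_id column persisted by migration 038_workspace_instance_id.
// Name and signature are illustrative, not the handler's actual shape.
func terminalMode(instanceID *string) string {
	if instanceID == nil || *instanceID == "" {
		// Locally-provisioned workspace: existing docker.ContainerInspect
		// path against the tenant's local Docker daemon.
		return "local-docker"
	}
	// CP-provisioned workspace on a remote EC2:
	// SendSSHPublicKey, then ssh + docker exec, bridged to the WebSocket.
	return "eic-ssh"
}

func main() {
	remote := "i-0123456789abcdef0"
	fmt.Println(terminalMode(nil), terminalMode(&remote)) // local-docker eic-ssh
}
```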
## Topology (verified from molecule-controlplane code)
- Workspaces launch in a **shared workspace VPC** (`p.VPCID`), not the tenant's VPC
- Each workspace gets its own SG created by `createPerTenantSG("workspace", <ws-short>, workspaceIngressRules())`
- Current `workspaceIngressRules()` opens only `8000/tcp` from `0.0.0.0/0` — no port 22
- CP already tags every workspace instance with `Role=workspace` (+ `WorkspaceID`, `Runtime`, `SGID`, `ManagedBy=molecule-cp`)
Because tenant EC2 and workspace EC2 are in **different VPCs**, a direct SG CIDR rule for port 22 is awkward (would require VPC peering + tenant-CIDR bookkeeping). **EIC Endpoint** is the natural fit — it's a VPC resource that acts as a TLS tunnel to any instance in its VPC, keyed on IAM permissions rather than source CIDR.
## IAM policy addition for `molecule-cp`
```json
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "DescribeInstancesForTerminalResolution",
"Effect": "Allow",
"Action": ["ec2:DescribeInstances"],
"Resource": "*"
},
{
"Sid": "PushEphemeralSSHKeyToWorkspaceInstances",
"Effect": "Allow",
"Action": [
"ec2-instance-connect:SendSSHPublicKey",
"ec2-instance-connect:OpenTunnel"
],
"Resource": "arn:aws:ec2:*:*:instance/*",
"Condition": {
"StringEquals": {
"aws:ResourceTag/Role": "workspace"
}
}
}
]
}
```
Tag key is **`Role`** (capitalized) — CP already sets this at launch in `ec2.go:1126`. No CP change needed for the policy's scoping to work fleet-wide.
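A toy model of how that `Condition` scopes the two EIC actions (illustrative only; `allowSendKey` is a hypothetical helper, IAM itself does the real evaluation):

```go
package main

import "fmt"

// allowSendKey mirrors the StringEquals condition on
// aws:ResourceTag/Role from the policy above: SendSSHPublicKey and
// OpenTunnel are permitted only against instances tagged Role=workspace.
func allowSendKey(instanceTags map[string]string) bool {
	return instanceTags["Role"] == "workspace"
}

func main() {
	ws := map[string]string{"Role": "workspace", "ManagedBy": "molecule-cp"}
	tenant := map[string]string{"Role": "tenant"}
	fmt.Println(allowSendKey(ws), allowSendKey(tenant)) // true false
}
```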
## EIC Endpoint (one-time setup in the workspace VPC)
```bash
aws ec2 create-instance-connect-endpoint \
--subnet-id <any-subnet-in-workspace-vpc> \
--security-group-ids <sg-id-allowing-egress-only> \
--tag-specifications 'ResourceType=instance-connect-endpoint,Tags=[{Key=Name,Value=molecule-workspace-eic}]'
```
One endpoint per workspace VPC. Free for the resource (pay only for data transferred). Replaces both "open port 22 in every SG" and "establish VPC peering for tenant→workspace SSH" — no change to `workspaceIngressRules()` needed, no change to tenant VPC routing needed.
## Alternative: direct SG rule (not recommended)
If you really want direct SSH instead of EIC Endpoint:
1. Add `22/tcp` to `workspaceIngressRules()` in `molecule-controlplane`, sourced from the tenant VPC's CIDR
2. Establish VPC peering between tenant VPC and workspace VPC
3. Update the route tables on both sides
Three more failure modes + ongoing bookkeeping per tenant. Skip unless you have a specific reason EIC Endpoint doesn't fit.
## Key lifetime
- ed25519 keypair generated per-session in the terminal handler
- Public half pushed via `SendSSHPublicKey` (valid 60s)
- Private half held in-memory only, discarded when the WS closes
- No keys on disk, no rotation cron, no secrets rotation debt
## Failure modes + their user-visible messages
| Condition | Message | Actionable? |
|---|---|---|
| `instance_id IS NULL` (local workspace) | Falls through to current local-Docker handler | n/a — existing behavior |
| `instance_id` set, DescribeInstances returns nothing | "workspace instance no longer exists — recreate the workspace" | Yes |
| `SendSSHPublicKey` 403 | "tenant lacks EIC permission — contact your admin" | Yes (requires IAM fix) |
| SSH connect timeout | "tenant cannot reach workspace instance — check security group" | Yes (SG fix) |
| `docker exec` fails (no container) | "workspace container is not running — try restart" | Yes (normal ops) |
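The remote-branch conditions above boil down to a simple mapping. `terminalError` and `userMessage` are hypothetical names for this sketch; the message strings are the ones from the table:

```go
package main

import "fmt"

// terminalError enumerates the remote-branch failure conditions.
type terminalError int

const (
	errInstanceGone terminalError = iota // DescribeInstances returned nothing
	errSendKeyDenied                     // SendSSHPublicKey 403
	errSSHTimeout                        // SSH connect timeout
	errNoContainer                       // docker exec found no container
)

// userMessage returns the user-visible message for each condition.
func userMessage(e terminalError) string {
	switch e {
	case errInstanceGone:
		return "workspace instance no longer exists — recreate the workspace"
	case errSendKeyDenied:
		return "tenant lacks EIC permission — contact your admin"
	case errSSHTimeout:
		return "tenant cannot reach workspace instance — check security group"
	case errNoContainer:
		return "workspace container is not running — try restart"
	}
	return "unknown terminal error"
}

func main() {
	fmt.Println(userMessage(errSSHTimeout))
}
```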
## Rollout (verified recipe)
Each AWS account (staging + prod, etc.) needs this once. The CP repo ships `scripts/bootstrap-eic-terminal.sh`, which automates everything below — what's here is what the script does, in case you want to run the steps by hand or audit it.
### 1. Infra (one-shot)
```bash
# From molecule-controlplane checkout (needs IAM admin creds):
./scripts/bootstrap-eic-terminal.sh <workspace-vpc-id> <region>
```
Creates (idempotent):
- EC2 Instance Connect **service-linked role** (`AWSServiceRoleForEC2InstanceConnect`)
- **Managed IAM policy** `MoleculeEICTerminal` (DescribeInstances + SendSSHPublicKey + OpenTunnel + CreateInstanceConnectEndpoint + DescribeInstanceConnectEndpoints)
- **IAM role + instance profile** `MoleculeTenantEICRole` / `MoleculeTenantEICProfile` (attach the managed policy) — this replaces env-var AWS creds on tenant EC2s
- **EIC Endpoint** in the workspace VPC (uses the default VPC SG for egress, which is all EIC Endpoint needs)
Script prints the endpoint SG id + profile name to set on the CP:
```
EIC_ENDPOINT_SG_ID=sg-xxxxxx
EC2_TENANT_IAM_PROFILE=MoleculeTenantEICProfile
```
### 2. CP config + redeploy
Set those two env vars on the CP service (Railway dashboard or equivalent). On redeploy, [molecule-controlplane#227](https://github.com/Molecule-AI/molecule-controlplane/pull/227) ensures every **newly-provisioned** workspace + tenant SG auto-carries a `22/tcp` ingress rule sourced from the EIC Endpoint SG.
### 3. Tenant env vars (every tenant EC2)
The tenant workspace-server container needs these env vars to verify session cookies and reach the CP. Missing any of these produces a working-looking tenant whose canvas cold-loads with `401 admin auth required` on every call — which is what broke the hongmingwang tenant on 2026-04-22 before these were set.
| Env var | Value | What breaks if missing |
|---|---|---|
| `CP_UPSTREAM_URL` | `https://api.moleculesai.app` (or your CP) | `/cp/*` paths fall through to Next.js 404 → canvas `AuthGate` infinite-redirects on login, hits browser's 431 header-limit |
| `MOLECULE_ORG_SLUG` | tenant slug, e.g. `hongmingwang` | `verifiedCPSession` returns false — session cookie never validates, every API call 401s with "admin auth required" |
| `MOLECULE_ORG_ID` | UUID of the tenant org | `tenant_guard` middleware 404s all non-`/cp/*` routes |
| `AWS_REGION` | e.g. `us-east-2` | `aws ec2-instance-connect` subprocesses default to `us-east-1` and can't find instances |
Tenants launched by CP should have `MOLECULE_ORG_ID` + `MOLECULE_ORG_SLUG` injected from the `organizations` row at provision time. If you find a tenant where these are missing, that's a CP provisioner bug, not operator error.
AWS creds are NOT on this list because the instance profile (`MoleculeTenantEICProfile` from step 1) provides them via IMDSv2 — aws-cli inside the tenant container picks them up automatically. If you still see `AWS_ACCESS_KEY_ID` env vars on a tenant, strip them and rely on the profile.
### 4. Backfill existing instances
Pre-existing SGs need one-time ingress added. The bootstrap script's final output includes this loop with the real SG id substituted; shown here for visibility — **replace `<EIC_ENDPOINT_SG_ID>` with the `sg-…` value step 1 printed**:
```bash
EIC_SG=<EIC_ENDPOINT_SG_ID> # from step 1 output
for sg in $(aws ec2 describe-security-groups --region us-east-2 \
--filters 'Name=tag:ManagedBy,Values=molecule-cp' \
--query 'SecurityGroups[].GroupId' --output text | tr '\t' '\n'); do
aws ec2 authorize-security-group-ingress --region us-east-2 \
--group-id "$sg" --protocol tcp --port 22 --source-group "$EIC_SG" \
2>&1 | grep -v DuplicatePermission || true
done
```
Note the `| tr '\t' '\n'` — aws-cli `--output text` tab-separates values within a row, which can concatenate all SG ids into a single word that breaks the for loop. Splitting to newlines is a no-op on well-behaved output and a fix on the concatenated case.
### 5. Tenant code (this monorepo)
Already merged:
- [#1531](https://github.com/Molecule-AI/molecule-core/pull/1531) — migration `038_workspace_instance_id` + persist on CP provision
- [#1533](https://github.com/Molecule-AI/molecule-core/pull/1533) — terminal handler remote branch (EIC open-tunnel + ssh + pty)
Tenant image (`ghcr.io/molecule-ai/platform-tenant:latest`) ships with `aws-cli` + `openssh-client` as of 2026-04-22.
### 6. Verification (how to confirm after deploy)
- Provision a fresh CP workspace → `SELECT instance_id FROM workspaces WHERE id = ?` is non-null
- Open canvas Terminal on that workspace → bash prompt (`ubuntu@ip-...`)
- Terminate the workspace EC2 manually → Terminal shows "EIC tunnel didn't come up"
- Temporarily remove `ec2-instance-connect:OpenTunnel` from `MoleculeEICTerminal` → Terminal shows "failed to push session key"
### Existing-workspace backfill of `instance_id`
Migrations run on tenant boot, but pre-existing workspace rows have NULL `instance_id`. The CP provisioner only writes `instance_id` on NEW provisions; old workspaces need:
```sql
-- Inside the tenant DB
UPDATE workspaces SET instance_id = '<i-xxx from DescribeInstances by tag WorkspaceID>', updated_at = now()
WHERE id = '<workspace-uuid>';
```
For a whole fleet, join CP's workspace table with the DescribeInstances result by `WorkspaceID` tag and batch-UPDATE.
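The fleet-wide batch could be sketched like this, assuming a hypothetical `backfillStatements` helper fed by a `WorkspaceID`-tag → instance-id map built from the DescribeInstances result:

```go
package main

import "fmt"

// backfillStatements joins workspace row IDs against an index of
// instance IDs keyed by their WorkspaceID tag and emits one UPDATE
// per matched row. Workspaces with no live instance are skipped so
// their instance_id stays NULL (and the local-Docker path keeps working).
func backfillStatements(workspaceIDs []string, instanceByTag map[string]string) []string {
	var stmts []string
	for _, ws := range workspaceIDs {
		inst, ok := instanceByTag[ws]
		if !ok {
			continue // no matching instance for this workspace
		}
		stmts = append(stmts, fmt.Sprintf(
			"UPDATE workspaces SET instance_id = '%s', updated_at = now() WHERE id = '%s';",
			inst, ws))
	}
	return stmts
}

func main() {
	for _, s := range backfillStatements(
		[]string{"ws-aaaa", "ws-bbbb"},
		map[string]string{"ws-aaaa": "i-0123456789abcdef0"},
	) {
		fmt.Println(s)
	}
}
```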
## Future work (not in scope)
- Session recording for compliance → SSM migration with instance profile
- Multi-user concurrent terminals → connection pooling per workspace
- Terminal for workspaces behind a private NAT with no EIC route → fall back to SSM
## Why this doc is split
The bootstrap-script output + per-tenant SG ingress backfill commands include AWS resource IDs and tenant slugs that don't belong in a public repo, but the high-level design is useful for external readers + self-hosters. The operational details live in `Molecule-AI/internal/runbooks/workspace-terminal.md`.