Commit Graph

3087 Commits

Hongming Wang
e02fedec99
Merge pull request #2120 from Molecule-AI/fix/secret-scan-merge-group
fix(ci): handle merge_group + shallow-clone BASE in secret-scan
2026-04-26 21:11:54 +00:00
228106db84
Merge pull request #2119 from Molecule-AI/refactor/provisioning-timeout-use-prune-helper
refactor(canvas): ProvisioningTimeout uses pruneStaleKeys helper (follow-up to #2110)
2026-04-26 21:09:53 +00:00
Hongming Wang
0ce537750c fix(ci): handle merge_group + shallow-clone BASE in secret-scan
[Molecule-Platform-Evolvement-Manager]

## What was breaking

Two distinct failure modes in `.github/workflows/secret-scan.yml`,
both visible after PR #2115 / #2117 hit the merge queue:

1. **`merge_group` events**: the script reads `github.event.before /
   after` to determine BASE/HEAD. Those properties only exist on
   `push` events. On `merge_group` events both came back empty, the
   script fell through to "no BASE → scan entire tree" mode, and
   false-positived on `canvas/src/lib/validation/__tests__/secret-formats.test.ts`
   which contains a `ghp_xxxx…` literal as a masking-function fixture.
   (Run 24966890424 — exit 1, "matched: ghp_[A-Za-z0-9]{36,}".)

2. **`push` events with shallow clone**: `fetch-depth: 2` doesn't
   always cover BASE across true merge commits. When BASE is in the
   payload but absent from the local object DB, `git diff` errors
   out with `fatal: bad object <sha>` and the job exits 128.
   (Run 24966796278 — push at 20:53Z merging #2115.)

## Fixes

- Add a dedicated fetch step for `merge_group.base_sha` (mirrors
  the existing pull_request base fetch) so the diff base is in the
  object DB before `git diff` runs.
- Move event-specific SHAs into a step `env:` block so the script
  uses a clean `case` over `${{ github.event_name }}` instead of
  a single `if pull_request / else push` that left merge_group on
  the empty branch.
- Add an on-demand fetch for the push-event BASE when it isn't in
  the shallow clone, plus a `git cat-file -e` guard before the
  diff so we fall through cleanly to the "scan entire tree" path
  if the fetch fails (correct, just slower) instead of exiting 128.
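The event dispatch plus existence guard described above can be sketched as follows (a TypeScript stand-in for the workflow's shell `case`; every name here is illustrative, and `baseExists` plays the role of the `git cat-file -e` guard):

```typescript
// Illustrative sketch of the BASE-resolution logic (the real thing is shell
// inside secret-scan.yml). A null return means "scan the entire tree".
type ScanEvent = {
  name: string;         // github.event_name
  beforeSha?: string;   // push: github.event.before
  baseSha?: string;     // merge_group: github.event.merge_group.base_sha
  prBaseSha?: string;   // pull_request: base SHA
};

function resolveScanBase(
  ev: ScanEvent,
  baseExists: (sha: string) => boolean, // stand-in for `git cat-file -e`
): string | null {
  let base: string | undefined;
  switch (ev.name) {
    case 'pull_request': base = ev.prBaseSha; break;
    case 'merge_group':  base = ev.baseSha;   break;
    case 'push':         base = ev.beforeSha; break;
  }
  // Missing payload field OR object absent from the shallow clone:
  // fall through cleanly to tree-scan mode instead of exiting 128.
  return base && baseExists(base) ? base : null;
}
```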

## Defense-in-depth

`secret-formats.test.ts` had two literal continuous-string fixtures
(`'ghp_xxxx…'`, `'github_pat_xxxx…'`). The ghp_ one matched the
secret-scan regex. Switched both to the `'prefix_' + 'x'.repeat(N)`
pattern already used elsewhere in the same file — runtime value is
the same, but the literal source text no longer matches the regex
even if the BASE detection ever falls back to tree-scan mode again.
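The fixture rewrite works like this (minimal sketch; the regex literal is taken from the run log quoted above, the variable names are illustrative):

```typescript
// The pattern the secret scan reported (from run 24966890424's log).
const GHP_REGEX = /ghp_[A-Za-z0-9]{36,}/;

// Concatenation keeps the *source text* regex-clean while the *runtime
// value* is identical to the old continuous-string literal.
const fixture = 'ghp_' + 'x'.repeat(36);
```

The runtime value still matches (so the masking tests keep exercising the real shape), but grepping the source file for the scanner's pattern now comes up empty.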

## Test plan

- [x] No remaining regex matches in the secret-formats.test.ts source
- [x] YAML structure preserved
- [ ] CI passes on this PR's pull_request scan (was already passing)
- [ ] CI passes on this PR's merge_group scan (the new path)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 14:08:19 -07:00
rabbitblood
5d888abc41 refactor(canvas): ProvisioningTimeout uses pruneStaleKeys helper
Follow-up to #2110 (which generalised pruneStaleKeys to Map<string, T>).
Identified by the simplify reviewer on that PR as the only other
in-tree caller of the same shape: `for (const id of map.keys()) { if
(!liveIds.has(id)) map.delete(id); }`.

Net: -3 lines, one less hand-rolled GC loop. No behaviour change —
the helper does exactly what the inline block did.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 14:05:28 -07:00
Hongming Wang
84c3206e39
Merge pull request #2117 from Molecule-AI/fix/canvas-hydrate-delete-tombstones-2069
fix(canvas): tombstone deleted ids so in-flight hydrate can't resurrect them (#2069)
2026-04-26 20:57:51 +00:00
rabbitblood
8c69a98da2 chore(simplify): share FALLBACK_POLL_MS as the tombstone TTL + trim verbose comments
Simplify pass on top of #2069 fix:

- Export FALLBACK_POLL_MS from canvas/src/store/socket.ts and import
  it as TOMBSTONE_TTL_MS in deleteTombstones.ts. Single source of
  truth — tuning one without the other would silently re-open the
  hydrate-races-delete window. Required-fix per simplify reviewer.
- Compress deleteTombstones.ts docstring from 30 lines to 10 — keep
  the "what + why module-level"; drop the long-form problem
  description (issue #2069 carries it).
- Compress canvas.ts call-site comments at removeSubtree (4 lines →
  2) and hydrate (2 lines → 2 but tighter).
- Don't reassign the workspaces parameter inside hydrate — use a
  const `live` and thread it through the two downstream calls
  (computeAutoLayout, buildNodesAndEdges). Same effect, no lint
  smell.
- Trim the canvas.test.ts integration-test preamble.

No behaviour change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 13:52:49 -07:00
rabbitblood
7bb0bc39a2 fix(canvas): tombstone deleted ids so in-flight hydrate can't resurrect them (#2069)
Closes #2069. removeSubtree dropped a parent + descendants locally
after DELETE returned 200, but a GET /workspaces request that was
IN-FLIGHT before the DELETE completed could land AFTER and hydrate
the store with a stale snapshot — re-introducing the deleted nodes
on the canvas until the next 10s fallback poll corrected it.

New module canvas/src/store/deleteTombstones.ts holds a transient
process-lifetime Map<id, deletedAt>. removeSubtree calls
markDeleted(removedIds); hydrate calls wasRecentlyDeleted(id) to
filter the incoming workspaces. TTL is 10s — matches the WS-fallback
poll cadence so a single round-trip is covered, after which a
legitimately re-imported id flows through normally.

GC happens lazily at every read AND at write time so the map stays
bounded — no separate timer / interval / unmount plumbing.
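The mechanism reads as a small sketch (function names follow the commit text; the `now` parameter is added here for testability and is an assumption about the real module):

```typescript
// Transient process-lifetime tombstone map: id -> deletedAt (ms epoch).
const TOMBSTONE_TTL_MS = 10_000; // matches the 10s WS-fallback poll cadence
const tombstones = new Map<string, number>();

// Lazy GC: called from both the read and the write path, so the map stays
// bounded with no timer / interval / unmount plumbing.
function gc(now: number): void {
  for (const [id, deletedAt] of tombstones) {
    if (now - deletedAt > TOMBSTONE_TTL_MS) tombstones.delete(id);
  }
}

function markDeleted(ids: Iterable<string>, now = Date.now()): void {
  gc(now);
  for (const id of ids) tombstones.set(id, now);
}

function wasRecentlyDeleted(id: string, now = Date.now()): boolean {
  gc(now);
  return tombstones.has(id);
}
```

Hydrate filters incoming workspaces through `wasRecentlyDeleted`; after the TTL a legitimately re-imported id flows through normally.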

Tests:
- canvas/src/store/__tests__/deleteTombstones.test.ts: 7 cases
  covering immediate flag, never-marked, TTL boundary (9999ms vs
  10001ms), GC-on-read, GC-on-write, re-mark resets timestamp,
  iterable input.
- canvas/src/store/__tests__/canvas.test.ts: end-to-end "hydrate
  cannot resurrect ids that removeSubtree just dropped (#2069)"
  exercises the full chain at the store level.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 13:48:15 -07:00
Hongming Wang
b007d8ac73
Merge pull request #2110 from Molecule-AI/fix/canvas-prune-stale-subtree-ids-2070
fix(canvas): prune lastFitSubtreeIdsRef on stale roots (#2070)
2026-04-26 20:46:24 +00:00
Hongming Wang
a25ed57613
Merge pull request #2115 from Molecule-AI/chore/codeowners-personal-review-routing
chore: add CODEOWNERS to auto-route agent PRs to your personal review account
2026-04-26 20:45:30 +00:00
Hongming Wang
1c38c78f5e
feat(compose): IMAGE_AUTO_REFRESH=true by default in local dev (#2116)
Picks up the GHCR digest watcher added in PR #2114 with no operator
action: just `docker compose up` and the platform self-heals to the
latest workspace-template image within 5 minutes of publish.

Default ON for local dev because that's where the runtime → workspace
iteration loop is tightest. .env.example documents the override knob
for the rare "running a long test that shouldn't be disturbed by a
publish" case.

Co-authored-by: Hongming Wang <hongmingwangalt@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 13:49:08 -07:00
Hongming Wang
dac55f3b42 chore: add CODEOWNERS to auto-route agent PRs to personal review account
After landing the 1-required-review gate on staging in cycle 24, every
agent-authored PR sits with `REVIEW_REQUIRED` until someone notices.
CODEOWNERS solves the routing half: every changed path matches `*`, so
GitHub auto-requests review from @hongmingwang-moleculeai (the
personal account, separate from the HongmingWang-Rabbit agent
identity). PRs land in the personal account's notification queue
automatically.

The `*  @hongmingwang-moleculeai` line is informational (route the
request) rather than enforced — branch protection's
require_code_owner_reviews flag is off, so any approving review still
satisfies the 1-review gate. Flip that on later if you want CODEOWNERS
approval to be the *required* review type.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 13:40:13 -07:00
Hongming Wang
263012249c
Merge pull request #2109 from Molecule-AI/feat/org-wide-secret-scan-workflow
feat(ci): add secret-scan workflow + reusable entry point for org-wide enrollment
2026-04-26 20:37:16 +00:00
Hongming Wang
9375e3d4ee
feat(workspace-server): GHCR digest watcher closes runtime CD chain (#2114)
Adds an opt-in goroutine that polls GHCR every 5 minutes for digest
changes on each workspace-template-*:latest tag and invokes the same
refresh logic /admin/workspace-images/refresh exposes. With this, the
chain from "merge runtime PR" to "containers running new code" is fully
hands-off — no operator step between auto-tag → publish-runtime →
cascade → template image rebuild → host pull + recreate.

Opt-in via IMAGE_AUTO_REFRESH=true. SaaS deploys whose pipeline already
pulls every release should leave it off (would be redundant work);
self-hosters get true zero-touch.

Why a refactor of admin_workspace_images.go is in this PR:
The HTTP handler held all the refresh logic inline. To share it with
the new watcher without HTTP loopback, extracted WorkspaceImageService
with a Refresh(ctx, runtimes, recreate) (RefreshResult, error) shape.
HTTP handler is now a thin wrapper; behavior is preserved (same JSON
response, same 500-on-list-failure, same per-runtime soft-fail).

Watcher design notes:
- Last-observed digest tracked in memory (not persisted). On boot the
  first observation per runtime is seed-only — no spurious refresh
  fires on every restart.
- On Refresh error, the seen digest rolls back so the next tick retries.
  Without this rollback a transient Docker glitch would convince the
  watcher the work was done.
- Per-runtime fetch errors don't block other runtimes (one template's
  brief 500 doesn't pause the others).
- digestFetcher injection seam in tick() lets unit tests cover all
  bookkeeping branches without standing up an httptest GHCR server.
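The seed-only and rollback-on-error bookkeeping can be sketched like this (the real watcher is Go; this is a language-agnostic TypeScript sketch, with `fetchDigest`/`refresh` as injected seams mirroring the digestFetcher injection mentioned above):

```typescript
type FetchDigest = (runtime: string) => string; // current GHCR digest; throws on fetch error
type Refresh = (runtime: string) => void;       // throws on refresh failure

// runtime -> last-observed digest, in memory only (not persisted).
const seen = new Map<string, string>();

function tick(runtimes: string[], fetchDigest: FetchDigest, refresh: Refresh): void {
  for (const rt of runtimes) {
    let digest: string;
    try {
      digest = fetchDigest(rt);
    } catch {
      continue; // per-runtime fetch errors don't block the other runtimes
    }
    const prev = seen.get(rt);
    seen.set(rt, digest);
    if (prev === undefined || prev === digest) continue; // seed-only / no change
    try {
      refresh(rt);
    } catch {
      seen.set(rt, prev); // roll back so the next tick retries
    }
  }
}
```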

Verified live: probed GHCR's /token + manifest HEAD against
workspace-template-claude-code; got HTTP 200 + a real
Docker-Content-Digest. Same calls the watcher makes.

Co-authored-by: Hongming Wang <hongmingwangalt@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 13:36:26 -07:00
Hongming Wang
168d6ec8d9
docs: point new-runtime-template flow at the GitHub template repo (#2111)
* docs: point new-runtime-template flow at the GitHub template repo

The 'Writing a new adapter' section was a 6-step manual checklist that
re-derived the canonical shape every time. Now that
Molecule-AI/molecule-ai-workspace-template-starter exists as a GitHub
template, the flow collapses to:

  gh repo create ... --template Molecule-AI/molecule-ai-workspace-template-starter

Plus a fill-in-the-TODO-markers table.

Why this matters: the starter ships with the
'repository_dispatch: [runtime-published]' cascade receiver pre-wired,
which means new templates pick up runtime PyPI publishes automatically
without the one-time setup PR each existing template needed (PRs #6-#22
across the 8 template repos that we just opened to retrofit). At
'hundreds of runtimes' scale this is the difference between linear PR-
toil and zero PR-toil per template addition.

Also adds: 'When the starter itself needs to evolve' — explicit pattern
for keeping the canonical shape in one place when it changes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

* docs(workspace-runtime): drop PYPI_TOKEN refs — OIDC is the new auth

Reflects PR #2113 (PyPI Trusted Publisher / OIDC migration). No static
PyPI token exists in the repo anymore, so the docs shouldn't claim one
does. Replaces the PYPI_TOKEN row in the Required Secrets table with an
"Auth" section pointing at the OIDC config; TEMPLATE_DISPATCH_TOKEN is
still the only repo secret the cascade needs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Hongming Wang <hongmingwangalt@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 13:15:13 -07:00
Hongming Wang
f3a204347c
fix(publish-runtime): use PyPI Trusted Publisher (OIDC) instead of PYPI_TOKEN (#2113)
Drops the static PYPI_TOKEN secret in favor of OIDC trusted publishing.
PyPI now mints a short-lived upload credential after verifying the
workflow's OIDC claim against the trusted-publisher config registered
for molecule-ai-workspace-runtime (Molecule-AI/molecule-core,
publish-runtime.yml, environment pypi-publish).

Why:
- A leaked PYPI_TOKEN would let any holder publish arbitrary versions of
  molecule-ai-workspace-runtime to PyPI from anywhere — bypassing the
  monorepo's review and CI gates entirely. The 8 template repos pull
  this package; a malicious publish poisons all of them.
- Trusted Publisher (OIDC) makes that exfil path moot: no long-lived
  credential exists to leak. Only this exact workflow, on this repo,
  in the pypi-publish environment, can upload.

After this lands and the first OIDC publish succeeds, the PYPI_TOKEN
repo secret should be deleted (it becomes dead weight + a leak surface
with no purpose).

Belt-and-suspenders companion to PR #56 in molecule-ai-workspace-runtime
(sibling repo lockdown). Without OIDC, the sibling lockdown alone
doesn't prevent local `python -m build && twine upload` from a laptop
with a personal PyPI maintainer credential.

Co-authored-by: Hongming Wang <hongmingwangalt@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 13:14:47 -07:00
Hongming Wang
199630908d
fix(publish-runtime): smoke test asserts stable invariants, not feature flags (#2112)
The original smoke step had `assert a2a_client._A2A_QUEUED_PREFIX`
which is a feature-flag-style check — it fires a false positive every
time staging is mid-release of that feature. Caught when the dry-run
publish (run 24965411618) failed because _A2A_QUEUED_PREFIX hadn't
landed on staging yet (it lives in PR #2061's series, separate from the
PR #2103 chain that shipped this workflow).

Replaced with checks for stable invariants of the package contract:

  - a2a_client._A2A_ERROR_PREFIX exists (always has, since the
    [A2A_ERROR] sentinel is the foundational error-tagging primitive)
  - adapters.get_adapter is callable
  - BaseAdapter has the .name() static method (interface anchor)
  - AdapterConfig has __init__ (dataclass present)

These four cover the cases the smoke test actually needs to catch:
import-path rewrites broken by build_runtime_package.py, missing
modules, dataclass shape regressions. They don't fire when a specific
feature is mid-merge.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Hongming Wang <hongmingwangalt@gmail.com>
2026-04-26 13:14:15 -07:00
rabbitblood
570890dab6 chore(simplify): generalize prune helper + add value-identity test
Simplify pass on top of #2070 fix:

- Rename pruneStaleSubtreeIds → pruneStaleKeys, generalize to
  Map<string, T> so the same shape can absorb other keyed-by-node-id
  caches (ProvisioningTimeout.tsx tracking map is the obvious next
  caller — left as a follow-up to keep this PR scoped).
- Trim the helper docstring to remove implementation-detail rot
  (O(map_size), cadence claims). The ref-block comment carries the
  rationale where it actually matters (at the call site).
- Add identity-preservation test: survivors must keep their original
  Set reference. Guards against a future "rebuild instead of delete"
  regression that would silently invalidate downstream === checks.

No behaviour change.
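The generalized shape described above amounts to (sketch only; the in-tree `pruneStaleKeys` may differ in detail):

```typescript
// Generalized to Map<string, T>: drop entries whose key is no longer live.
// Survivors keep their original value references — the property the new
// identity-preservation test pins.
function pruneStaleKeys<T>(map: Map<string, T>, liveIds: Set<string>): void {
  for (const key of [...map.keys()]) {
    if (!liveIds.has(key)) map.delete(key);
  }
}
```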

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 12:31:35 -07:00
rabbitblood
69edc0bf92 fix(canvas): prune lastFitSubtreeIdsRef on stale roots (#2070)
Closes #2070. The Map<rootId, Set<nodeId>> in useCanvasViewport.ts
accumulated entries indefinitely — adds on every successful auto-fit,
never deletes when a root left state.nodes (cascade delete or manual
remove). Operationally invisible until thousands of imports, but the
fix is cheap.

Adds pruneStaleSubtreeIds(map, liveNodeIds) — a pure helper exported
alongside the existing shouldFitGrowing helper, called at the top of
runFit before any read or write to the map. Bounds the map to "roots
present right now" instead of "every root ever auto-fitted in this
session." O(map_size) per fit; runs only at user-driven cadence.

Tests in __tests__/useCanvasViewport.test.ts cover the four cases:
delete-some / no-op / clear-all / never-add.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 12:27:48 -07:00
rabbitblood
8edbd12980 feat(ci): add secret-scan workflow + reusable entry point for org-wide enrollment
Defense-in-depth for the #2090-class incident (2026-04-24): GitHub's
hosted Copilot Coding Agent leaked a ghs_* installation token into
tenant-proxy/package.json via npm init slurping the URL from a
token-embedded origin remote. We can't fix upstream's clone hygiene,
so we gate at the PR layer.

Single workflow, dual purpose:

1. PR / push / merge_group gate on this repo (molecule-monorepo).
   Refuses any change whose diff additions contain a credential-shaped
   string. Same shape as Block forbidden paths — error message tells
   the agent how to recover without echoing the secret value.

2. Reusable workflow entry point (workflow_call) for the rest of the
   org. Other Molecule-AI repos enroll with a 3-line workflow:

     jobs:
       secret-scan:
         uses: Molecule-AI/molecule-monorepo/.github/workflows/secret-scan.yml@main

   This makes molecule-monorepo the single source of truth for the
   regex set; consumer repos pick up new patterns without per-repo PRs.

Pattern set covers GitHub family (ghp_, ghs_, gho_, ghu_, ghr_,
github_pat_), Anthropic / OpenAI / Slack / AWS. Mirror of the
runtime's bundled pre-commit hook (molecule-ai-workspace-runtime:
molecule_runtime/scripts/pre-commit-checks.sh) — keep aligned when
either side adds a pattern.

Self-exclude on .github/workflows/secret-scan.yml so the file's own
regex literals don't block its merge.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 12:05:18 -07:00
Hongming Wang
b0a33d9ebf
Merge pull request #2106 from Molecule-AI/docs/secrets-key-custody
docs(security): document KMS-rooted custody chain for SECRETS_ENCRYPTION_KEY
2026-04-26 18:51:16 +00:00
Hongming Wang
cecb2600d7
Merge pull request #2107 from Molecule-AI/fix/canary-tls-timeout-diagnostics
fix(e2e): bump tenant TLS timeout to 15m + diagnostic burst on failure (#2090)
2026-04-26 18:51:14 +00:00
rabbitblood
b87befdabe chore(simplify): trim SHA-rot comments + harden TENANT_HOST scheme/port stripping
Simplify pass on top of the canary fix:

- Drop the three CP commit SHAs from comments — issue #2090 covers
  the audit trail, SHAs would rot.
- Pull the inline `900` into TLS_TIMEOUT_SEC=$((15 * 60)) so the
  bash mirrors the TS side (15 min) at a glance.
- TENANT_HOST extraction now strips http(s) AND any port suffix, so
  getent doesn't silently fail on a ws://host:443 style URL.
- sed-redact Authorization/Cookie out of the curl -v dump, defensive
  against future callers adding an auth header to this probe.

Pure cleanup; no behaviour change to the happy path.
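The scheme + port stripping can be sketched as (the real extraction is bash/sed; this TypeScript stand-in shows the intended behaviour on a ws://host:443 style URL):

```typescript
// Reduce a URL-ish value to a bare host that getent can resolve:
// drop any scheme, any path, and any trailing port suffix.
function extractHost(url: string): string {
  return url
    .replace(/^[a-z+]+:\/\//i, '') // strip scheme (http, https, ws, ...)
    .replace(/\/.*$/, '')          // strip path
    .replace(/:\d+$/, '');         // strip port suffix
}
```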

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 11:44:54 -07:00
rabbitblood
af89d3fcbd fix(e2e): bump tenant TLS timeout to 15m + diagnostic burst on failure (#2090)
Canary #2090 has been red for 6 consecutive runs over 4+ hours, all
timing out at the TLS-readiness step exactly at the 10-min cap. Time
window correlates with three CP commits that landed today/yesterday
and changed EC2 boot behaviour:

- molecule-controlplane@a3eb8be — fix(ec2): force fresh clone of /opt/adapter
- molecule-controlplane@ed70405 — feat(sweep): wire up healthcheck loop
- molecule-controlplane@4ab339e — fix(provisioner): aggregate cleanup errors

Two changes here, both surgical:

1. Bump the bash-side TLS deadline from 600s to 900s, and the canvas TS
   mirror from 10m to 15m. Stays below the 20-min provision envelope
   (so a genuinely-stuck tenant still fails loud at the earlier
   provision step instead of masquerading as TLS).

2. On TLS-timeout, dump a diagnostic burst before exiting:
   - getent hosts $TENANT_HOST  (DNS resolution state)
   - curl -kv $TENANT_URL/health (TLS handshake + HTTP layer)
   The previous failure log was just "no 2xx in N min" with no signal
   for which layer was actually broken. After this, the next timeout
   tells us whether DNS, TLS handshake, or HTTP layer is the culprit
   so the CP root cause can be isolated without speculation.

This is the unblock; a separate molecule-controlplane issue tracks the
underlying regression suspicion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 11:39:28 -07:00
rabbitblood
262a52a32c docs(security): document the KMS-rooted custody chain for SECRETS_ENCRYPTION_KEY
External architecture review flagged the SECRETS_ENCRYPTION_KEY env var
on the platform as encryption-at-rest theater. The reviewer read only
the platform repo and missed that the master key actually lives in AWS
KMS at the control plane layer, with envelope encryption wrapping each
tenant secret blob.

Adds docs/architecture/secrets-key-custody.md as the canonical source
of truth for the full chain:

- Two-mode envelope (KMS_KEY_ARN vs static-key fallback)
- Per-blob AES-256-GCM with KMS-wrapped DEKs
- Where each key actually lives (KMS, CP env, tenant env)
- Threat model per attacker capability
- Rotation story (annual KMS CMK rotation, manual DEK rotation on incident)
- Audit posture (SOC2 / ISO 27001 questionnaire bullets)

Patches three downstream docs that previously stopped at the env-var
level and link them to the new custody doc:

- development/constraints-and-rules.md (Rule 11)
- architecture/database-schema.md (workspace_secrets paragraph)
- architecture/molecule-technical-doc.md (env-vars table)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 11:29:16 -07:00
Hongming Wang
9b26144386
Merge pull request #2105 from Molecule-AI/feat/wire-max-concurrent-from-template-1408
feat(workspaces): wire max_concurrent_tasks from template config.yaml (#1408)
2026-04-26 18:21:24 +00:00
rabbitblood
ca9a034bbe test(handlers): add 11th INSERT arg (max_concurrent_tasks) to remaining Create-handler mocks
CI on PR #2105 caught 7 Create-handler tests still mocking the
pre-#1408 10-arg INSERT signature. With the column now wired
unconditionally into the INSERT, every WithArgs that pinned
budget_limit as the 10th arg needed an 11th slot for the resolved
max_concurrent_tasks value.

Files:
- workspace_test.go: 6 tests (DBInsertError, DefaultsApplied,
  WithSecrets_Persists, TemplateDefaultsMissingRuntimeAndModel,
  TemplateDefaultsLegacyTopLevelModel, CallerModelOverridesTemplateDefault)
- workspace_budget_test.go: 1 test (Budget_Create_WithLimit)

All resolved values are the schema-default mirror, so the test
expectation reads as the same models.DefaultMaxConcurrentTasks
const that the handler writes. New imports added to both files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 11:14:02 -07:00
rabbitblood
4e6f6bf0f3 merge: sync staging into feat/wire-max-concurrent-from-template-1408 2026-04-26 11:11:30 -07:00
rabbitblood
4bcfc64e25 chore(simplify): drop verbose comments + introduce DefaultMaxConcurrentTasks const
Simplify pass on top of the wire-up commit:

- New const models.DefaultMaxConcurrentTasks = 1; handlers and tests
  reference the symbol so the schema-default mirror lives in one place.
- Strip 5 multi-line comments that narrated what the code does.
- Drop the duplicate field-rationale on OrgWorkspace; the one on
  CreateWorkspacePayload is canonical.
- Drop test-side positional comments that would silently lie if columns
  get reordered.

Pure cleanup; no behaviour change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 11:07:00 -07:00
Hongming Wang
a8a7aa54b6
Merge pull request #2061 from Molecule-AI/fix/canvas-multilevel-layout-ux
Canvas + platform UX hardening: env preflight, optimistic plugins, dotenv autoload, WS resilience
2026-04-26 18:03:10 +00:00
rabbitblood
ad5295cd8a feat(workspaces): wire max_concurrent_tasks from template config.yaml (#1408)
Phase 4 of #1408 (active_tasks counter). Runtime increment/decrement,
schema column (037), and scheduler enforcement (scheduler.go:312)
already shipped — but the write path from template config.yaml +
direct API was missing, so every workspace silently fell through to
the schema default of 1. Leaders that set max_concurrent_tasks: 3 in
their org template were getting 1 anyway, defeating the entire
feature for the use case it was built for (cron-vs-A2A contention on
PM/lead workspaces).

- OrgWorkspace gains MaxConcurrentTasks (yaml + json tags)
- CreateWorkspacePayload gains MaxConcurrentTasks (json tag)
- Both INSERTs now write the column unconditionally; 0/omitted
  payload value falls back to 1 (schema default mirror) so the wire
  stays single-shape — no forked column list / goto.
- Existing Create-handler test mocks updated to expect the 11th arg.
- New TestWorkspaceCreate_MaxConcurrentTasksOverride locks the
  payload→DB propagation for the leader case (value=3).
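The fallback rule in the third bullet, as a sketch (illustrative TypeScript; the real handlers are Go, and only the constant name comes from the commit text):

```typescript
// Mirror of the schema default, so the INSERT stays single-shape.
const DefaultMaxConcurrentTasks = 1;

function resolveMaxConcurrentTasks(payload?: number): number {
  // 0 or omitted means "not set" on the wire; write the default instead
  // of forking the column list.
  return payload && payload > 0 ? payload : DefaultMaxConcurrentTasks;
}
```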

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 11:03:01 -07:00
Hongming Wang
09bfd9bdce fix(tests): hoist _executor_mod alias so async wedge tests pass under --cov
The Copilot Auto-fix in 5a8f42b4 addressed the duplicate-import lint by
removing 'import claude_sdk_executor as _executor_mod' entirely, but the
async wedge tests (test_execute_marks_wedge_*, test_execute_clears_wedge_*)
still call _executor_mod._reset_sdk_wedge_for_test() etc. — so they failed
with NameError once that line was removed.

Restore the alias, but at the top of the file (alongside the other module-
level imports) rather than at line 1248. The late-file binding was the
proximate cause of the original CI failure: with --cov enabled (#1817),
sys.settrace + the @pytest.mark.asyncio wrapper combination caused the
late module-level binding to not be visible from inside the async test
bodies, even though the binding existed at module-load time. Hoisting
fixes that scope-resolution issue.

Verified locally with the exact CI config (--cov-fail-under=86):
  1280 passed, 2 xfailed — total coverage 90.25%

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-04-26 10:57:21 -07:00
Hongming Wang
5a8f42b405
Potential fix for pull request finding 'Module is imported with 'import' and 'import from''
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
2026-04-26 10:45:37 -07:00
Hongming Wang
3b09bcc589
Merge branch 'staging' into fix/canvas-multilevel-layout-ux 2026-04-26 10:44:02 -07:00
Hongming Wang
d0f198b24f merge: resolve staging conflicts (a2a_proxy + workspace_crud)
Three files conflicted with staging changes that landed while this PR
sat open. Resolved each by combining both intents (not picking one side):

- a2a_proxy.go: keep the branch's idle-timeout signature
  (workspaceID parameter + comment) AND apply staging's #1483 SSRF
  defense-in-depth check at the top of dispatchA2A. Type-assert
  h.broadcaster (now an EventEmitter interface per staging) back to
  *Broadcaster for applyIdleTimeout's SubscribeSSE call; falls through
  to no-op when the assertion fails (test-mock case).

- a2a_proxy_test.go: keep both new test suites — branch's
  TestApplyIdleTimeout_* (3 cases for the idle-timeout helper) AND
  staging's TestDispatchA2A_RejectsUnsafeURL (#1483 regression). Updated
  the staging test's dispatchA2A call to pass the workspaceID arg
  introduced by the branch's signature change.

- workspace_crud.go: combine both Delete-cleanup intents:
  * Branch's cleanupCtx detachment (WithoutCancel + 30s) so canvas
    hang-up doesn't cancel mid-Docker-call (the container-leak fix)
  * Branch's stopAndRemove helper that skips RemoveVolume when Stop
    fails (orphan sweeper handles)
  * Staging's #1843 stopErrs aggregation so Stop failures bubble up
    as 500 to the client (the EC2 orphan-instance prevention)
  Both concerns satisfied: cleanup runs to completion past canvas
  hangup AND failed Stop calls surface to caller.

Build clean, all platform tests pass.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-04-26 10:43:22 -07:00
Hongming Wang
5b346ab3e7
Merge pull request #2104 from Molecule-AI/test/ssrf-devmode-rfc1918-followup
test(ssrf): pin dev-mode RFC-1918 allow contract (follow-up to #2103)
2026-04-26 17:35:05 +00:00
Hongming Wang
762d3b8b2c test(ssrf): pin dev-mode RFC-1918 allow contract (follow-up to #2103)
PR #2103 widened the SSRF saasMode branch to also relax RFC-1918 + ULA
under MOLECULE_ENV=development (so the docker-compose dev pattern stops
rejecting workspace registrations on 172.18.x.x bridge IPs). The
existing TestIsSafeURL_DevMode_StillBlocksOtherRanges covered the
security floor (metadata / TEST-NET / CGNAT stay blocked), but no
test asserted the positive side — that 10.x / 172.x / 192.168.x / fd00::
ARE now allowed under dev mode.

Without this test, a future refactor that quietly drops the
`|| devModeAllowsLoopback()` from isPrivateOrMetadataIP wouldn't trip
any assertion, and the docker-compose dev loop would silently re-break.

Adds TestIsSafeURL_DevMode_AllowsRFC1918 — table of 4 URLs covering
the three RFC-1918 IPv4 ranges + IPv6 ULA fd00::/8. Sets
MOLECULE_DEPLOY_MODE=self-hosted explicitly so the test exercises the
devMode branch, not a SaaS-mode pass.
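The contract the test pins can be sketched as (hedged TypeScript stand-in — the real guard is Go, and the exact range set here is illustrative):

```typescript
// Security floor: link-local/metadata (169.254/16, incl. 169.254.169.254)
// and CGNAT (100.64/10) stay blocked regardless of mode.
const ALWAYS_BLOCKED = [/^169\.254\./, /^100\.(6[4-9]|[7-9]\d|1[01]\d|12[0-7])\./];
// RFC-1918 IPv4 ranges; IPv6 ULA (fd00::/8) handled by prefix check below.
const RFC1918 = [/^10\./, /^192\.168\./, /^172\.(1[6-9]|2\d|3[01])\./];

function isBlockedIP(ip: string, devMode: boolean): boolean {
  if (ALWAYS_BLOCKED.some((r) => r.test(ip))) return true; // never relaxed
  const isPrivate =
    RFC1918.some((r) => r.test(ip)) || ip.toLowerCase().startsWith('fd');
  return isPrivate && !devMode; // dev mode relaxes private ranges only
}
```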

Closes the Optional finding I left on PR #2103.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 10:32:33 -07:00
Hongming Wang
61c16fe657
Merge pull request #2103 from Molecule-AI/runtime/cd-chain
feat: runtime CD chain + queued/drain classification + reload-safe agent messages
2026-04-26 17:21:54 +00:00
Hongming Wang
0de67cd379 feat(platform/admin): /admin/workspace-images/refresh + Docker SDK + GHCR auth
The production-side end of the runtime CD chain. Operators (or the post-
publish CI workflow) hit this after a runtime release to pull the latest
workspace-template-* images from GHCR and recreate any running ws-* containers
so they adopt the new image. Without this, freshly-published runtime sat in
the registry but containers kept the old image until naturally cycled.

Implementation notes:
- Uses Docker SDK ImagePull rather than shelling out to docker CLI — the
  alpine platform container has no docker CLI installed.
- ghcrAuthHeader() reads GHCR_USER + GHCR_TOKEN env, builds the base64-
  encoded JSON payload Docker engine expects in PullOptions.RegistryAuth.
  Both empty → public/cached images only; both set → private GHCR pulls.
- Container matching uses ContainerInspect (NOT ContainerList) because
  ContainerList returns the resolved digest in .Image, not the human tag.
  Inspect surfaces .Config.Image which is what we need.
- Provisioner.DefaultImagePlatform() exported so admin handler picks the
  same Apple-Silicon-needs-amd64 platform as the provisioner — single
  source of truth for the multi-arch override.

Local-dev companion: scripts/refresh-workspace-images.sh runs on the
host and inherits the host's docker keychain auth — alternate path for
when GHCR_USER/TOKEN aren't set in the platform env.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-04-26 10:17:21 -07:00
Hongming Wang
50decfd326 chore(compose): wire MOLECULE_ENV, GHCR_USER/TOKEN, MOLECULE_IMAGE_PLATFORM
Three env vars the platform now reads:

- MOLECULE_ENV=development (default) — activates the WorkspaceAuth /
  AdminAuth dev fail-open path so the canvas's bearer-less requests pass
  through. Also unlocks RFC-1918 relaxation in the SSRF guard so docker-
  bridge IPs work. Override to 'production' for staged deploys.

- GHCR_USER + GHCR_TOKEN — feed POST /admin/workspace-images/refresh's
  ImagePull auth payload. Both empty → endpoint can pull cached/public
  images only. Set with a fine-grained PAT (read:packages on Molecule-AI
  org) to pull private GHCR images.

- MOLECULE_IMAGE_PLATFORM=linux/amd64 (default) — workspace-template-*
  images ship single-arch amd64. On Apple Silicon hosts, the daemon's
  native linux/arm64/v8 request misses the manifest and pulls fail.
  Forcing amd64 makes Docker Desktop run them under Rosetta — slower
  (~2-3×) but functional.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-04-26 10:14:47 -07:00
Hongming Wang
09972486e8 fix(platform/notify): persist agent send_message_to_user pushes
Pre-fix, POST /workspaces/:id/notify (the side-channel agents use to push
interim updates and follow-up results) only broadcast via WebSocket — no
DB write. When the user refreshed the page, the chat-history loader
(which queries activity_logs) couldn't restore those messages and they
vanished from the chat.

Hits the most common path: when the platform's POST /a2a times out (idle),
the runtime keeps working and eventually pushes its reply via
send_message_to_user. The reply rendered live but disappeared on reload.

Fix: also INSERT an activity_logs row with a shape the existing loader
already understands (type=a2a_receive, source_id=NULL, response_body=
{result: text}). Persistence is best-effort — a DB hiccup doesn't block
the WebSocket push (which the user is already seeing).

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-04-26 10:14:47 -07:00
Hongming Wang
7ed50824b6 fix(platform/ssrf): allow RFC-1918 in MOLECULE_ENV=development
The docker-compose dev pattern puts platform and workspace containers on
the same docker bridge network (172.18.0.0/16, RFC-1918). The runtime
registers via its docker-internal hostname which DNS-resolves to a
172.18.x.x IP. The SSRF defence's isPrivateOrMetadataIP rejected those,
so every workspace POST through the platform proxy returned
'workspace URL is not publicly routable' — breaking the entire docker-
compose dev loop.

Fix: in isPrivateOrMetadataIP, treat MOLECULE_ENV=development the same
as SaaS mode for RFC-1918 relaxation. Both share the 'trusted intra-
network routing' property — SaaS is sibling EC2s in the same VPC, dev
is sibling containers on the same docker bridge. Always-blocked
categories (metadata link-local, TEST-NET, CGNAT) stay blocked.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-04-26 10:14:47 -07:00
Hongming Wang
d97d7d4768 fix(platform/delegation): classify queued response + stitch drain result back
When proxyA2A returns 202+{queued:true} (target busy → enqueued for drain
on next heartbeat), executeDelegation previously treated it as a successful
completion and ran extractResponseText on the queued JSON. The result was
'Delegation completed (workspace agent busy — request queued, will dispatch...)'
landing in activity_logs.summary, which the LLM then echoed to the user
chat as garbage.

Two fixes:
1. delegation.go: detect queued shape via new isQueuedProxyResponse helper,
   write status='queued' with clean summary 'Delegation queued — target at
   capacity', store delegation_id in response_body so the drain can stitch
   back later. Also embed delegation_id in params.message.metadata + use it
   as messageId so the proxy's idempotency-key path keys off the same id.

2. a2a_queue.go: when DrainQueueForWorkspace successfully drains a queued
   item, extract delegation_id from the body's metadata and UPDATE the
   originating delegate_result row (queued → completed with real
   response_body). Broadcast DELEGATION_COMPLETE so the canvas chat feed
   flips the queued line to completed in real time.

Closes the loop so check_task_status reflects ground truth instead of
perpetual 'queued' even after the queued request eventually drained.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-04-26 10:14:19 -07:00
Hongming Wang
4dd9e2b846
Merge pull request #2102 from Molecule-AI/test/e2e-invalid-api-key-pattern-1900
test(e2e): add 'Invalid API key' regression assertion to staging A2A check (#1900)
2026-04-26 17:06:03 +00:00
Hongming Wang
1ae051ec95 test(e2e): add 'Invalid API key' regression assertion to staging A2A check (#1900)
The staging E2E suite already greps for 5 known regression patterns
in the A2A response (hermes-agent 401, model_not_found, Encrypted
content, Unknown provider, hermes-agent unreachable). The comment
block at lines 386-395 lists "Invalid API key" as the signal for the
CP #238 boot-event 401 race + stale OPENAI_API_KEY paths, but the
explicit grep was never added — meaning a regression in that class
would slip through the generic `error|exception` catch-all.

Closes the gap with one specific-pattern check that fails loud with
the relevant bug references in the message.

Verified `bash -n` clean; pre-existing shellcheck SC2015 at line 88
is unrelated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 10:03:46 -07:00
Hongming Wang
d949b5b323
Merge pull request #2101 from Molecule-AI/test/broadcaster-interface-1814
test(handlers): introduce events.EventEmitter interface (#1814 partial)
2026-04-26 16:08:25 +00:00
Hongming Wang
7d48f24fef test(handlers): introduce events.EventEmitter interface (#1814 partial)
The 3 skipped tests in workspace_provision_test.go (#1206 regression
tests) were blocked because captureBroadcaster's struct-embed wouldn't
type-check against WorkspaceHandler.broadcaster's concrete
*events.Broadcaster field. This PR fixes the interface blocker for
the 2 broadcaster-related tests; the 3rd (plugins.Registry resolver)
is a separate blocker tracked elsewhere.

Changes:

- internal/events/broadcaster.go: define `EventEmitter` interface with
  RecordAndBroadcast + BroadcastOnly. *Broadcaster satisfies it via
  its existing methods (compile-time assertion guards future drift).
  SubscribeSSE / Subscribe stay off the interface because only sse.go
  + cmd/server/main.go call them, and both still hold the concrete
  *Broadcaster.

- internal/handlers/workspace.go: WorkspaceHandler.broadcaster type
  changes from *events.Broadcaster to events.EventEmitter.
  NewWorkspaceHandler signature updated to match. Production callers
  unchanged — they pass *events.Broadcaster, which the interface
  accepts.

- internal/handlers/activity.go: LogActivity takes events.EventEmitter
  for the same reason — tests passing a stub no longer need to
  construct the full broadcaster.

- internal/handlers/workspace_provision_test.go: captureBroadcaster
  drops the struct embed (no more zero-value Broadcaster underlying
  the SSE+hub fields), implements RecordAndBroadcast directly, and
  adds a no-op BroadcastOnly to satisfy the interface. Skip messages
  on the 2 empty broadcaster-blocked tests updated to reflect the
  new "interface unblocked, test body still needed" state.

Verified `go build ./...`, `go test ./internal/handlers/`, and
`go vet ./...` all clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 09:05:52 -07:00
Hongming Wang
d86eabdd58
Merge pull request #2100 from Molecule-AI/fix/token-cache-toctou-1552
fix(git-token-helper): close TOCTOU window + stop swallowing chmod errors (closes #1552)
2026-04-26 15:24:34 +00:00
Hongming Wang
dafe08450b
Merge pull request #2099 from Molecule-AI/fix/staging-e2e-tls-timeout
fix(e2e): bump staging tenant TLS-readiness timeout 3min → 10min
2026-04-26 15:24:01 +00:00
Hongming Wang
fc2720c1fe fix(git-token-helper): close TOCTOU window + stop swallowing chmod errors (closes #1552)
The token-cache helper had three #1552 findings, all in the
mode-600-after-the-fact pattern:

1. _write_cache writes .tmp with the default umask (typically 022 → 644
   on disk) and only chmods it to 600 after the mv. A concurrent reader
   in that microsecond-wide window sees the token at mode 644.
2. Each chmod was swallowed via `|| true` — if it ever fails, the
   tokens stay world-readable with no operator signal.
3. _refresh_gh's gh_token_file write has the same shape and same
   two issues.

Hardening:

- Wrap the .tmp creates in a `umask 077` block so the files are 600
  from creation. Restore the previous umask before return so callers
  aren't perturbed.
- Replace `chmod ... 2>/dev/null || true` with `if ! chmod ...; then
  echo WARN ...; fi`. A chmod failure is a real signal worth grep'ing.
- Apply the same pattern to the _refresh_gh gh_token_file path.
  `local` is illegal in a top-level case branch, so use a uniquely-
  named global (_gh_prev_umask) and unset it after.

Verified `bash -n` clean and `shellcheck` clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 08:22:29 -07:00
rabbitblood
f9b1b34956 fix(e2e): bump staging tenant TLS-readiness timeout 3min → 10min
Closes a 4+ cycle Canvas tabs E2E flake pattern that has been blocking
staging→main PRs since at least 2026-04-24 (#2096, #2094, #2055, #2079, ...).

Root cause: TLS_TIMEOUT_MS=180s (3 min) is too tight for the layered
realities of staging tenant TLS readiness:

1. Cloudflare DNS propagation through the edge (1-2 min typical)
2. Tenant CF Tunnel registering the new hostname (1-2 min)
3. CF edge ACME cert provisioning + cache (1-3 min)

Each layer can add 1-3 min on its own under heavy staging load — the
realistic worst case is well past the 3-min cap.

Provision and workspace-online timeouts were already raised to 20 min
(staging-setup.ts:42-46 history). The TLS gate was the remaining
under-budgeted step. Bumping to 10 min keeps it inside the 20-min
PROVISION envelope so a genuinely-stuck tenant still fails loud at
the earlier provision step rather than masquerading as a TLS issue.

Both call sites raised together:
- canvas/e2e/staging-setup.ts: TLS_TIMEOUT_MS = 10 * 60 * 1000
- tests/e2e/test_staging_full_saas.sh: TLS_DEADLINE += 600

Each carries an inline rationale comment so the next reviewer sees
the layer-by-layer decomposition without re-reading the issue thread.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 08:21:18 -07:00