Commit Graph

3075 Commits

Author SHA1 Message Date
Hongming Wang
1c38c78f5e
feat(compose): IMAGE_AUTO_REFRESH=true by default in local dev (#2116)
Picks up the GHCR digest watcher added in PR #2114 with no operator
action: just `docker compose up` and the platform self-heals to the
latest workspace-template image within 5 minutes of publish.

Default ON for local dev because that's where the runtime → workspace
iteration loop is tightest. .env.example documents the override knob
for the rare "running a long test that shouldn't be disturbed by a
publish" case.

Co-authored-by: Hongming Wang <hongmingwangalt@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 13:49:08 -07:00
Hongming Wang
263012249c
Merge pull request #2109 from Molecule-AI/feat/org-wide-secret-scan-workflow
feat(ci): add secret-scan workflow + reusable entry point for org-wide enrollment
2026-04-26 20:37:16 +00:00
Hongming Wang
9375e3d4ee
feat(workspace-server): GHCR digest watcher closes runtime CD chain (#2114)
Adds an opt-in goroutine that polls GHCR every 5 minutes for digest
changes on each workspace-template-*:latest tag and invokes the same
refresh logic /admin/workspace-images/refresh exposes. With this, the
chain from "merge runtime PR" to "containers running new code" is fully
hands-off — no operator step between auto-tag → publish-runtime →
cascade → template image rebuild → host pull + recreate.

Opt-in via IMAGE_AUTO_REFRESH=true. SaaS deploys whose pipeline already
pulls every release should leave it off (would be redundant work);
self-hosters get true zero-touch.

Why a refactor of admin_workspace_images.go is in this PR:
The HTTP handler held all the refresh logic inline. To share it with
the new watcher without HTTP loopback, extracted WorkspaceImageService
with a Refresh(ctx, runtimes, recreate) (RefreshResult, error) shape.
HTTP handler is now a thin wrapper; behavior is preserved (same JSON
response, same 500-on-list-failure, same per-runtime soft-fail).
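
For reference, a minimal sketch of the extracted seam — everything here
except Refresh and RefreshResult is an illustrative guess, not the real
code:

  package images

  import "context"

  // RefreshResult mirrors the JSON shape the HTTP handler already
  // returned; field names are guesses.
  type RefreshResult struct {
      Refreshed []string         // runtimes now running the new image
      Errors    map[string]error // per-runtime soft-fails, non-fatal
  }

  type WorkspaceImageService struct {
      // docker client, GHCR auth, platform override, ... (elided)
  }

  // Refresh pulls the latest template image for each runtime and, when
  // recreate is true, recreates any running ws-* containers on it.
  func (s *WorkspaceImageService) Refresh(ctx context.Context, runtimes []string, recreate bool) (RefreshResult, error) {
      res := RefreshResult{Errors: map[string]error{}}
      for _, rt := range runtimes {
          if err := s.refreshOne(ctx, rt, recreate); err != nil {
              res.Errors[rt] = err // one runtime's failure doesn't abort the rest
              continue
          }
          res.Refreshed = append(res.Refreshed, rt)
      }
      return res, nil
  }

  func (s *WorkspaceImageService) refreshOne(ctx context.Context, runtime string, recreate bool) error {
      return nil // image pull + container recreate elided
  }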

Watcher design notes (sketched below):
- Last-observed digest tracked in memory (not persisted). On boot the
  first observation per runtime is seed-only — no spurious refresh
  fires on every restart.
- On Refresh error, the seen digest rolls back so the next tick retries.
  Without this rollback a transient Docker glitch would convince the
  watcher the work was done.
- Per-runtime fetch errors don't block other runtimes (one template's
  brief 500 doesn't pause the others).
- digestFetcher injection seam in tick() lets unit tests cover all
  bookkeeping branches without standing up an httptest GHCR server.
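
A compact sketch of that bookkeeping — digestFetcher is the seam named
above; the other names are illustrative:

  package images

  import "context"

  type digestFetcher func(ctx context.Context, runtime string) (string, error)

  type watcher struct {
      runtimes []string
      fetch    digestFetcher // injection seam for unit tests
      refresh  func(ctx context.Context, runtime string) error
      seen     map[string]string // runtime → last-observed digest, memory only (initialised by the constructor, not shown)
  }

  func (w *watcher) tick(ctx context.Context) {
      for _, rt := range w.runtimes {
          digest, err := w.fetch(ctx, rt)
          if err != nil {
              continue // per-runtime fetch error never blocks the others
          }
          prev, known := w.seen[rt]
          w.seen[rt] = digest
          if !known || prev == digest {
              continue // first observation is seed-only; same digest is a no-op
          }
          if err := w.refresh(ctx, rt); err != nil {
              w.seen[rt] = prev // roll back so the next tick retries
          }
      }
  }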

Verified live: probed GHCR's /token + manifest HEAD against
workspace-template-claude-code; got HTTP 200 + a real
Docker-Content-Digest. Same calls the watcher makes.
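
Those two calls, sketched in Go (the standard GHCR token + manifest-HEAD
flow; the repo path is assumed):

  package main

  import (
      "encoding/json"
      "fmt"
      "net/http"
  )

  func main() {
      const repo = "molecule-ai/workspace-template-claude-code" // assumed GHCR path
      // 1. anonymous pull token for the repository scope
      resp, err := http.Get("https://ghcr.io/token?scope=repository:" + repo + ":pull")
      if err != nil {
          panic(err)
      }
      defer resp.Body.Close()
      var tok struct {
          Token string `json:"token"`
      }
      if err := json.NewDecoder(resp.Body).Decode(&tok); err != nil {
          panic(err)
      }

      // 2. HEAD the manifest; the digest arrives as a response header
      req, _ := http.NewRequest(http.MethodHead, "https://ghcr.io/v2/"+repo+"/manifests/latest", nil)
      req.Header.Set("Authorization", "Bearer "+tok.Token)
      req.Header.Set("Accept", "application/vnd.oci.image.index.v1+json, application/vnd.docker.distribution.manifest.v2+json")
      head, err := http.DefaultClient.Do(req)
      if err != nil {
          panic(err)
      }
      head.Body.Close()
      fmt.Println(head.StatusCode, head.Header.Get("Docker-Content-Digest"))
  }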

Co-authored-by: Hongming Wang <hongmingwangalt@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 13:36:26 -07:00
Hongming Wang
168d6ec8d9
docs: point new-runtime-template flow at the GitHub template repo (#2111)
* docs: point new-runtime-template flow at the GitHub template repo

The 'Writing a new adapter' section was a 6-step manual checklist that
re-derived the canonical shape every time. Now that
Molecule-AI/molecule-ai-workspace-template-starter exists as a GitHub
template, the flow collapses to:

  gh repo create ... --template Molecule-AI/molecule-ai-workspace-template-starter

Plus a fill-in-the-TODO-markers table.

Why this matters: the starter ships with the
'repository_dispatch: [runtime-published]' cascade receiver pre-wired,
which means new templates pick up runtime PyPI publishes automatically
without the one-time setup PR each existing template needed (PRs #6-#22,
which we just opened across the 8 template repos to retrofit). At
'hundreds of runtimes' scale this is the difference between linear
PR-toil and zero PR-toil per template addition.

Also adds: 'When the starter itself needs to evolve' — explicit pattern
for keeping the canonical shape in one place when it changes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

* docs(workspace-runtime): drop PYPI_TOKEN refs — OIDC is the new auth

Reflects PR #2113 (PyPI Trusted Publisher / OIDC migration). No static
PyPI token exists in the repo anymore, so the docs shouldn't claim one
does. Replaces the PYPI_TOKEN row in the Required Secrets table with an
"Auth" section pointing at the OIDC config; TEMPLATE_DISPATCH_TOKEN is
still the only repo secret the cascade needs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Hongming Wang <hongmingwangalt@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 13:15:13 -07:00
Hongming Wang
f3a204347c
fix(publish-runtime): use PyPI Trusted Publisher (OIDC) instead of PYPI_TOKEN (#2113)
Drops the static PYPI_TOKEN secret in favor of OIDC trusted publishing.
PyPI now mints a short-lived upload credential after verifying the
workflow's OIDC claim against the trusted-publisher config registered
for molecule-ai-workspace-runtime (Molecule-AI/molecule-core,
publish-runtime.yml, environment pypi-publish).

Why:
- A leaked PYPI_TOKEN would let any holder publish arbitrary versions of
  molecule-ai-workspace-runtime to PyPI from anywhere — bypassing the
  monorepo's review and CI gates entirely. The 8 template repos pull
  this package; a malicious publish poisons all of them.
- Trusted Publisher (OIDC) makes that exfil path moot: no long-lived
  credential exists to leak. Only this exact workflow, on this repo,
  in the pypi-publish environment, can upload.

After this lands and the first OIDC publish succeeds, the PYPI_TOKEN
repo secret should be deleted (it becomes dead weight + a leak surface
with no purpose).

Belt-and-suspenders companion to PR #56 in molecule-ai-workspace-runtime
(sibling repo lockdown). Without OIDC, the sibling lockdown alone
doesn't prevent local `python -m build && twine upload` from a laptop
with a personal PyPI maintainer credential.

Co-authored-by: Hongming Wang <hongmingwangalt@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 13:14:47 -07:00
Hongming Wang
199630908d
fix(publish-runtime): smoke test asserts stable invariants, not feature flags (#2112)
The original smoke step had `assert a2a_client._A2A_QUEUED_PREFIX`
which is a feature-flag-style check — it fires a false positive every
time staging is mid-release of that specific feature. We caught this
when the dry-run
publish (run 24965411618) failed because _A2A_QUEUED_PREFIX hadn't
landed on staging yet (it lives in PR #2061's series, separate from the
PR #2103 chain that shipped this workflow).

Replaced with checks for stable invariants of the package contract:

  - a2a_client._A2A_ERROR_PREFIX exists (always has, since the
    [A2A_ERROR] sentinel is the foundational error-tagging primitive)
  - adapters.get_adapter is callable
  - BaseAdapter has the .name() static method (interface anchor)
  - AdapterConfig has __init__ (dataclass present)

These four cover the cases the smoke test actually needs to catch:
import-path rewrites broken by build_runtime_package.py, missing
modules, dataclass shape regressions. They don't fire when a specific
feature is mid-merge.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Hongming Wang <hongmingwangalt@gmail.com>
2026-04-26 13:14:15 -07:00
rabbitblood
8edbd12980 feat(ci): add secret-scan workflow + reusable entry point for org-wide enrollment
Defense-in-depth for the #2090-class incident (2026-04-24): GitHub's
hosted Copilot Coding Agent leaked a ghs_* installation token into
tenant-proxy/package.json via npm init slurping the URL from a
token-embedded origin remote. We can't fix upstream's clone hygiene,
so we gate at the PR layer.

Single workflow, dual purpose:

1. PR / push / merge_group gate on this repo (molecule-monorepo).
   Refuses any change whose diff additions contain a credential-shaped
   string. Same shape as the 'Block forbidden paths' check — the error
   message tells the agent how to recover without echoing the secret value.

2. Reusable workflow entry point (workflow_call) for the rest of the
   org. Other Molecule-AI repos enroll with a 3-line workflow:

     jobs:
       secret-scan:
         uses: Molecule-AI/molecule-monorepo/.github/workflows/secret-scan.yml@main

   This makes molecule-monorepo the single source of truth for the
   regex set; consumer repos pick up new patterns without per-repo PRs.

Pattern set covers GitHub family (ghp_, ghs_, gho_, ghu_, ghr_,
github_pat_), Anthropic / OpenAI / Slack / AWS. Mirror of the
runtime's bundled pre-commit hook (molecule-ai-workspace-runtime:
molecule_runtime/scripts/pre-commit-checks.sh) — keep aligned when
either side adds a pattern.

Self-exclude on .github/workflows/secret-scan.yml so the file's own
regex literals don't block its merge.
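
For illustration only, the pattern family rendered as Go regexes — these
are representative token shapes, not the workflow's exact set:

  package main

  import (
      "fmt"
      "regexp"
  )

  var secretPatterns = []*regexp.Regexp{
      regexp.MustCompile(`\bgh[psour]_[A-Za-z0-9]{36,}\b`),   // ghp_/ghs_/gho_/ghu_/ghr_
      regexp.MustCompile(`\bgithub_pat_[A-Za-z0-9_]{22,}\b`), // fine-grained PATs
      regexp.MustCompile(`\bsk-ant-[A-Za-z0-9_-]{20,}\b`),    // Anthropic-style
      regexp.MustCompile(`\bsk-[A-Za-z0-9]{20,}\b`),          // OpenAI-style
      regexp.MustCompile(`\bxox[baprs]-[A-Za-z0-9-]{10,}\b`), // Slack
      regexp.MustCompile(`\bAKIA[0-9A-Z]{16}\b`),             // AWS access key ID
  }

  // scanAdditions flags credential-shaped strings in a diff's added lines,
  // reporting the line and pattern but never the matched value.
  func scanAdditions(added []string) (found bool) {
      for i, line := range added {
          for _, p := range secretPatterns {
              if p.MatchString(line) {
                  fmt.Printf("possible secret on added line %d (%s) — value withheld\n", i+1, p)
                  found = true
              }
          }
      }
      return found
  }

  func main() {
      fmt.Println(scanAdditions([]string{`"token": "ghs_000000000000000000000000000000000000"`}))
  }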

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 12:05:18 -07:00
Hongming Wang
b0a33d9ebf
Merge pull request #2106 from Molecule-AI/docs/secrets-key-custody
docs(security): document KMS-rooted custody chain for SECRETS_ENCRYPTION_KEY
2026-04-26 18:51:16 +00:00
Hongming Wang
cecb2600d7
Merge pull request #2107 from Molecule-AI/fix/canary-tls-timeout-diagnostics
fix(e2e): bump tenant TLS timeout to 15m + diagnostic burst on failure (#2090)
2026-04-26 18:51:14 +00:00
rabbitblood
b87befdabe chore(simplify): trim SHA-rot comments + harden TENANT_HOST scheme/port stripping
Simplify pass on top of the canary fix:

- Drop the three CP commit SHAs from comments — issue #2090 covers
  the audit trail, SHAs would rot.
- Pull the inline `900` into TLS_TIMEOUT_SEC=$((15 * 60)) so the
  bash mirrors the TS side (15 min) at a glance.
- TENANT_HOST extraction now strips http(s) AND any port suffix, so
  getent doesn't silently fail on a ws://host:443 style URL.
- sed-redact Authorization/Cookie out of the curl -v dump, defensive
  against future callers adding an auth header to this probe.

Pure cleanup; no behaviour change to the happy path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 11:44:54 -07:00
rabbitblood
af89d3fcbd fix(e2e): bump tenant TLS timeout to 15m + diagnostic burst on failure (#2090)
Canary #2090 has been red for 6 consecutive runs over 4+ hours, all
timing out at the TLS-readiness step exactly at the 10-min cap. Time
window correlates with three CP commits that landed today/yesterday
and changed EC2 boot behaviour:

- molecule-controlplane@a3eb8be — fix(ec2): force fresh clone of /opt/adapter
- molecule-controlplane@ed70405 — feat(sweep): wire up healthcheck loop
- molecule-controlplane@4ab339e — fix(provisioner): aggregate cleanup errors

Two changes here, both surgical:

1. Bump the bash-side TLS deadline from 600s to 900s, and the canvas TS
   mirror from 10m to 15m. Stays below the 20-min provision envelope
   (so a genuinely-stuck tenant still fails loud at the earlier
   provision step instead of masquerading as TLS).

2. On TLS-timeout, dump a diagnostic burst before exiting:
   - getent hosts $TENANT_HOST  (DNS resolution state)
   - curl -kv $TENANT_URL/health (TLS handshake + HTTP layer)
   The previous failure log was just "no 2xx in N min" with no signal
   for which layer was actually broken. After this, the next timeout
   tells us whether DNS, TLS handshake, or HTTP layer is the culprit
   so the CP root cause can be isolated without speculation.

This is the unblock; a separate molecule-controlplane issue tracks the
underlying regression suspicion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 11:39:28 -07:00
rabbitblood
262a52a32c docs(security): document the KMS-rooted custody chain for SECRETS_ENCRYPTION_KEY
External architecture review flagged the SECRETS_ENCRYPTION_KEY env var
on the platform as encryption-at-rest theater. The reviewer read only
the platform repo and missed that the master key actually lives in AWS
KMS at the control plane layer, with envelope encryption wrapping each
tenant secret blob.

Adds docs/architecture/secrets-key-custody.md as the canonical source
of truth for the full chain:

- Two-mode envelope (KMS_KEY_ARN vs static-key fallback)
- Per-blob AES-256-GCM with KMS-wrapped DEKs
- Where each key actually lives (KMS, CP env, tenant env)
- Threat model per attacker capability
- Rotation story (annual KMS CMK rotation, manual DEK rotation on incident)
- Audit posture (SOC2 / ISO 27001 questionnaire bullets)
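
The per-blob envelope, sketched with the AWS SDK's GenerateDataKey — an
illustrative shape only; the custody doc above is canonical:

  package secrets

  import (
      "context"
      "crypto/aes"
      "crypto/cipher"
      "crypto/rand"

      "github.com/aws/aws-sdk-go-v2/service/kms"
      "github.com/aws/aws-sdk-go-v2/service/kms/types"
  )

  // EncryptedBlob keeps the KMS-wrapped DEK next to the ciphertext, so
  // decryption needs KMS access and the plaintext DEK never persists.
  type EncryptedBlob struct {
      WrappedDEK []byte // CiphertextBlob from GenerateDataKey
      Nonce      []byte
      Ciphertext []byte // AES-256-GCM-sealed tenant secret
  }

  func encryptTenantSecret(ctx context.Context, client *kms.Client, keyARN string, plaintext []byte) (*EncryptedBlob, error) {
      out, err := client.GenerateDataKey(ctx, &kms.GenerateDataKeyInput{
          KeyId:   &keyARN,
          KeySpec: types.DataKeySpecAes256,
      })
      if err != nil {
          return nil, err
      }
      block, err := aes.NewCipher(out.Plaintext) // 32-byte DEK → AES-256
      if err != nil {
          return nil, err
      }
      gcm, err := cipher.NewGCM(block)
      if err != nil {
          return nil, err
      }
      nonce := make([]byte, gcm.NonceSize())
      if _, err := rand.Read(nonce); err != nil {
          return nil, err
      }
      return &EncryptedBlob{
          WrappedDEK: out.CiphertextBlob,
          Nonce:      nonce,
          Ciphertext: gcm.Seal(nil, nonce, plaintext, nil),
      }, nil
  }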

Patches three downstream docs that previously stopped at the env-var
level and link them to the new custody doc:

- development/constraints-and-rules.md (Rule 11)
- architecture/database-schema.md (workspace_secrets paragraph)
- architecture/molecule-technical-doc.md (env-vars table)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 11:29:16 -07:00
Hongming Wang
9b26144386
Merge pull request #2105 from Molecule-AI/feat/wire-max-concurrent-from-template-1408
feat(workspaces): wire max_concurrent_tasks from template config.yaml (#1408)
2026-04-26 18:21:24 +00:00
rabbitblood
ca9a034bbe test(handlers): add 11th INSERT arg (max_concurrent_tasks) to remaining Create-handler mocks
CI on PR #2105 caught 7 Create-handler tests still mocking the
pre-#1408 10-arg INSERT signature. With the column now wired
unconditionally into the INSERT, every WithArgs that pinned
budget_limit as the 10th arg needed a 11th slot for the resolved
max_concurrent_tasks value.

Files:
- workspace_test.go: 6 tests (DBInsertError, DefaultsApplied,
  WithSecrets_Persists, TemplateDefaultsMissingRuntimeAndModel,
  TemplateDefaultsLegacyTopLevelModel, CallerModelOverridesTemplateDefault)
- workspace_budget_test.go: 1 test (Budget_Create_WithLimit)

All resolved values are the schema-default mirror, so the test
expectation reads as the same models.DefaultMaxConcurrentTasks
const that the handler writes. New imports added to both files.
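
A reduced analogue of the mock change (three columns instead of eleven,
table and values invented — the shape of the change is the point):

  package handlers_test

  import (
      "testing"

      sqlmock "github.com/DATA-DOG/go-sqlmock"
  )

  // int64 so the mock's argument comparison sees exactly what
  // database/sql hands the driver.
  const defaultMaxConcurrentTasks int64 = 1

  func TestCreate_MockGainsMaxConcurrentTasksArg(t *testing.T) {
      db, mock, err := sqlmock.New()
      if err != nil {
          t.Fatal(err)
      }
      defer db.Close()

      mock.ExpectExec("INSERT INTO workspaces").
          WithArgs("ws-a", 100.0, defaultMaxConcurrentTasks). // old tail arg + new slot
          WillReturnResult(sqlmock.NewResult(1, 1))

      _, err = db.Exec(
          "INSERT INTO workspaces (name, budget_limit, max_concurrent_tasks) VALUES ($1, $2, $3)",
          "ws-a", 100.0, defaultMaxConcurrentTasks)
      if err != nil {
          t.Fatal(err)
      }
      if err := mock.ExpectationsWereMet(); err != nil {
          t.Fatal(err)
      }
  }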

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 11:14:02 -07:00
rabbitblood
4e6f6bf0f3 merge: sync staging into feat/wire-max-concurrent-from-template-1408 2026-04-26 11:11:30 -07:00
rabbitblood
4bcfc64e25 chore(simplify): drop verbose comments + introduce DefaultMaxConcurrentTasks const
Simplify pass on top of the wire-up commit:

- New const models.DefaultMaxConcurrentTasks = 1; handlers and tests
  reference the symbol so the schema-default mirror lives in one place.
- Strip 5 multi-line comments that narrated what the code does.
- Drop the duplicate field-rationale on OrgWorkspace; the one on
  CreateWorkspacePayload is canonical.
- Drop test-side positional comments that would silently lie if columns
  get reordered.

Pure cleanup; no behaviour change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 11:07:00 -07:00
Hongming Wang
a8a7aa54b6
Merge pull request #2061 from Molecule-AI/fix/canvas-multilevel-layout-ux
Canvas + platform UX hardening: env preflight, optimistic plugins, dotenv autoload, WS resilience
2026-04-26 18:03:10 +00:00
rabbitblood
ad5295cd8a feat(workspaces): wire max_concurrent_tasks from template config.yaml (#1408)
Phase 4 of #1408 (active_tasks counter). Runtime increment/decrement,
schema column (037), and scheduler enforcement (scheduler.go:312)
already shipped — but the write path from template config.yaml +
direct API was missing, so every workspace silently fell through to
the schema default of 1. Leaders that set max_concurrent_tasks: 3 in
their org template were getting 1 anyway, defeating the entire
feature for the use case it was built for (cron-vs-A2A contention on
PM/lead workspaces).

- OrgWorkspace gains MaxConcurrentTasks (yaml + json tags)
- CreateWorkspacePayload gains MaxConcurrentTasks (json tag)
- Both INSERTs now write the column unconditionally; a 0/omitted
  payload value falls back to 1 (the schema-default mirror) so the wire
  stays single-shape — no forked column list or goto (sketched below).
- Existing Create-handler test mocks updated to expect the 11th arg.
- New TestWorkspaceCreate_MaxConcurrentTasksOverride locks the
  payload→DB propagation for the leader case (value=3).
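
The fallback, sketched (the helper name is invented; the handlers inline
the equivalent):

  package models

  // DefaultMaxConcurrentTasks mirrors the schema default from column
  // migration 037.
  const DefaultMaxConcurrentTasks = 1

  // resolveMaxConcurrentTasks keeps the INSERT single-shape: the column
  // is always written, and a 0/omitted payload value resolves to the
  // schema-default mirror.
  func resolveMaxConcurrentTasks(requested int) int {
      if requested <= 0 {
          return DefaultMaxConcurrentTasks
      }
      return requested
  }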

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 11:03:01 -07:00
Hongming Wang
09bfd9bdce fix(tests): hoist _executor_mod alias so async wedge tests pass under --cov
The Copilot Auto-fix in 5a8f42b4 addressed the duplicate-import lint by
removing 'import claude_sdk_executor as _executor_mod' entirely, but the
async wedge tests (test_execute_marks_wedge_*, test_execute_clears_wedge_*)
still call _executor_mod._reset_sdk_wedge_for_test() etc. — so they failed
with NameError once that line was removed.

Restore the alias, but at the top of the file (alongside the other module-
level imports) rather than at line 1248. The late-file binding was the
proximate cause of the original CI failure: with --cov enabled (#1817),
the combination of sys.settrace and the @pytest.mark.asyncio wrapper
meant the late module-level binding was not visible from inside the
async test bodies, even though it existed at module-load time. Hoisting
fixes that scope-resolution issue.

Verified locally with the exact CI config (--cov-fail-under=86):
  1280 passed, 2 xfailed — total coverage 90.25%

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-04-26 10:57:21 -07:00
Hongming Wang
5a8f42b405
Potential fix for pull request finding 'Module is imported with 'import' and 'import from''
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
2026-04-26 10:45:37 -07:00
Hongming Wang
3b09bcc589
Merge branch 'staging' into fix/canvas-multilevel-layout-ux 2026-04-26 10:44:02 -07:00
Hongming Wang
d0f198b24f merge: resolve staging conflicts (a2a_proxy + workspace_crud)
Three files conflicted with staging changes that landed while this PR
sat open. Resolved each by combining both intents (not picking one side):

- a2a_proxy.go: keep the branch's idle-timeout signature
  (workspaceID parameter + comment) AND apply staging's #1483 SSRF
  defense-in-depth check at the top of dispatchA2A. Type-assert
  h.broadcaster (now an EventEmitter interface per staging) back to
  *Broadcaster for applyIdleTimeout's SubscribeSSE call; falls through
  to no-op when the assertion fails (test-mock case).

- a2a_proxy_test.go: keep both new test suites — branch's
  TestApplyIdleTimeout_* (3 cases for the idle-timeout helper) AND
  staging's TestDispatchA2A_RejectsUnsafeURL (#1483 regression). Updated
  the staging test's dispatchA2A call to pass the workspaceID arg
  introduced by the branch's signature change.

- workspace_crud.go: combine both Delete-cleanup intents:
  * Branch's cleanupCtx detachment (WithoutCancel + 30s) so canvas
    hang-up doesn't cancel mid-Docker-call (the container-leak fix)
  * Branch's stopAndRemove helper that skips RemoveVolume when Stop
    fails (orphan sweeper handles)
  * Staging's #1843 stopErrs aggregation so Stop failures bubble up
    as 500 to the client (the EC2 orphan-instance prevention)
  Both concerns satisfied: cleanup runs to completion past canvas
  hangup AND failed Stop calls surface to caller.

Build clean, all platform tests pass.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-04-26 10:43:22 -07:00
Hongming Wang
5b346ab3e7
Merge pull request #2104 from Molecule-AI/test/ssrf-devmode-rfc1918-followup
test(ssrf): pin dev-mode RFC-1918 allow contract (follow-up to #2103)
2026-04-26 17:35:05 +00:00
Hongming Wang
762d3b8b2c test(ssrf): pin dev-mode RFC-1918 allow contract (follow-up to #2103)
PR #2103 widened the SSRF saasMode branch to also relax RFC-1918 + ULA
under MOLECULE_ENV=development (so the docker-compose dev pattern stops
rejecting workspace registrations on 172.18.x.x bridge IPs). The
existing TestIsSafeURL_DevMode_StillBlocksOtherRanges covered the
security floor (metadata / TEST-NET / CGNAT stay blocked), but no
test asserted the positive side — that 10.x / 172.x / 192.168.x / fd00::
ARE now allowed under dev mode.

Without this test, a future refactor that quietly drops the
`|| devModeAllowsLoopback()` from isPrivateOrMetadataIP wouldn't trip
any assertion, and the docker-compose dev loop would silently re-break.

Adds TestIsSafeURL_DevMode_AllowsRFC1918 — table of 4 URLs covering
the three RFC-1918 IPv4 ranges + IPv6 ULA fd00::/8. Sets
MOLECULE_DEPLOY_MODE=self-hosted explicitly so the test exercises the
devMode branch, not a SaaS-mode pass.

Closes the Optional finding I left on PR #2103.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 10:32:33 -07:00
Hongming Wang
61c16fe657
Merge pull request #2103 from Molecule-AI/runtime/cd-chain
feat: runtime CD chain + queued/drain classification + reload-safe agent messages
2026-04-26 17:21:54 +00:00
Hongming Wang
0de67cd379 feat(platform/admin): /admin/workspace-images/refresh + Docker SDK + GHCR auth
The production-side end of the runtime CD chain. Operators (or the post-
publish CI workflow) hit this after a runtime release to pull the latest
workspace-template-* images from GHCR and recreate any running ws-* containers
so they adopt the new image. Without this, freshly-published runtime sat in
the registry but containers kept the old image until naturally cycled.

Implementation notes:
- Uses Docker SDK ImagePull rather than shelling out to docker CLI — the
  alpine platform container has no docker CLI installed.
- ghcrAuthHeader() reads GHCR_USER + GHCR_TOKEN env, builds the base64-
  encoded JSON payload Docker engine expects in PullOptions.RegistryAuth.
  Both empty → public/cached images only; both set → private GHCR pulls
  (sketched below).
- Container matching uses ContainerInspect (NOT ContainerList) because
  ContainerList returns the resolved digest in .Image, not the human tag.
  Inspect surfaces .Config.Image which is what we need.
- Provisioner.DefaultImagePlatform() exported so admin handler picks the
  same Apple-Silicon-needs-amd64 platform as the provisioner — single
  source of truth for the multi-arch override.
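
ghcrAuthHeader's shape, sketched (base64url-encoded JSON is the
registry-auth format the engine API expects; details simplified):

  package admin

  import (
      "encoding/base64"
      "encoding/json"
      "os"
  )

  // ghcrAuthHeader builds the RegistryAuth value for the Docker SDK's
  // image pull options. Empty creds → empty header → public/cached only.
  func ghcrAuthHeader() string {
      user, token := os.Getenv("GHCR_USER"), os.Getenv("GHCR_TOKEN")
      if user == "" || token == "" {
          return ""
      }
      payload, _ := json.Marshal(map[string]string{
          "username":      user,
          "password":      token,
          "serveraddress": "ghcr.io",
      })
      return base64.URLEncoding.EncodeToString(payload)
  }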

Local-dev companion: scripts/refresh-workspace-images.sh runs on the
host and inherits the host's docker keychain auth — alternate path for
when GHCR_USER/TOKEN aren't set in the platform env.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-04-26 10:17:21 -07:00
Hongming Wang
50decfd326 chore(compose): wire MOLECULE_ENV, GHCR_USER/TOKEN, MOLECULE_IMAGE_PLATFORM
Three env vars the platform now reads:

- MOLECULE_ENV=development (default) — activates the WorkspaceAuth /
  AdminAuth dev fail-open path so the canvas's bearer-less requests pass
  through. Also unlocks RFC-1918 relaxation in the SSRF guard so docker-
  bridge IPs work. Override to 'production' for staged deploys.

- GHCR_USER + GHCR_TOKEN — feed POST /admin/workspace-images/refresh's
  ImagePull auth payload. Both empty → endpoint can pull cached/public
  images only. Set with a fine-grained PAT (read:packages on Molecule-AI
  org) to pull private GHCR images.

- MOLECULE_IMAGE_PLATFORM=linux/amd64 (default) — workspace-template-*
  images ship single-arch amd64. On Apple Silicon hosts, the daemon's
  native linux/arm64/v8 request misses the manifest and pulls fail.
  Forcing amd64 makes Docker Desktop run them under Rosetta — slower
  (~2-3×) but functional.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-04-26 10:14:47 -07:00
Hongming Wang
09972486e8 fix(platform/notify): persist agent send_message_to_user pushes
Pre-fix, POST /workspaces/:id/notify (the side-channel agents use to push
interim updates and follow-up results) only broadcast via WebSocket — no
DB write. When the user refreshed the page, the chat-history loader
(which queries activity_logs) couldn't restore those messages and they
vanished from the chat.

Hits the most common path: when the platform's POST /a2a times out (idle),
the runtime keeps working and eventually pushes its reply via
send_message_to_user. The reply rendered live but disappeared on reload.

Fix: also INSERT an activity_logs row with shape the existing loader
already understands (type=a2a_receive, source_id=NULL, response_body=
{result: text}). Persistence is best-effort — a DB hiccup doesn't block
the WebSocket push (which the user is already seeing).
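
Shape of the fix, sketched (handler and field names assumed):

  package handlers

  import (
      "context"
      "database/sql"
      "encoding/json"
      "log"
  )

  type notifyHandler struct {
      db        *sql.DB
      broadcast func(workspaceID, text string) // existing WebSocket push
  }

  func (h *notifyHandler) handleNotify(ctx context.Context, workspaceID, text string) {
      h.broadcast(workspaceID, text) // live push first — the user may be watching

      // Persist in the shape the chat-history loader already reads.
      // Best-effort: a DB hiccup is logged, never blocks the push above.
      body, _ := json.Marshal(map[string]string{"result": text})
      if _, err := h.db.ExecContext(ctx,
          `INSERT INTO activity_logs (workspace_id, type, source_id, response_body)
           VALUES ($1, 'a2a_receive', NULL, $2)`,
          workspaceID, body); err != nil {
          log.Printf("notify persist failed (non-fatal): %v", err)
      }
  }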

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-04-26 10:14:47 -07:00
Hongming Wang
7ed50824b6 fix(platform/ssrf): allow RFC-1918 in MOLECULE_ENV=development
The docker-compose dev pattern puts platform and workspace containers on
the same docker bridge network (172.18.0.0/16, RFC-1918). The runtime
registers via its docker-internal hostname which DNS-resolves to a
172.18.x.x IP. The SSRF defence's isPrivateOrMetadataIP rejected those,
so every workspace POST through the platform proxy returned
'workspace URL is not publicly routable' — breaking the entire docker-
compose dev loop.

Fix: in isPrivateOrMetadataIP, treat MOLECULE_ENV=development the same
as SaaS mode for RFC-1918 relaxation. Both share the 'trusted intra-
network routing' property — SaaS is sibling EC2s in the same VPC, dev
is sibling containers on the same docker bridge. Always-blocked
categories (metadata link-local, TEST-NET, CGNAT) stay blocked.
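
The resulting classification, sketched (helper names and exact CIDR sets
are illustrative; the always-blocked categories match the list above):

  package ssrf

  import (
      "net"
      "os"
  )

  var (
      // metadata link-local, TEST-NET-1, CGNAT — blocked in every mode
      alwaysBlocked = mustCIDRs("169.254.0.0/16", "192.0.2.0/24", "100.64.0.0/10")
      privateRanges = mustCIDRs("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "fd00::/8")
  )

  func isPrivateOrMetadataIP(ip net.IP, saasMode bool) bool {
      if inAny(alwaysBlocked, ip) {
          return true
      }
      if inAny(privateRanges, ip) {
          // SaaS (sibling EC2s in one VPC) and dev (sibling containers on
          // one docker bridge) share trusted intra-network routing.
          relaxed := saasMode || os.Getenv("MOLECULE_ENV") == "development"
          return !relaxed
      }
      return false
  }

  func mustCIDRs(specs ...string) []*net.IPNet {
      nets := make([]*net.IPNet, len(specs))
      for i, s := range specs {
          _, n, err := net.ParseCIDR(s)
          if err != nil {
              panic(err)
          }
          nets[i] = n
      }
      return nets
  }

  func inAny(nets []*net.IPNet, ip net.IP) bool {
      for _, n := range nets {
          if n.Contains(ip) {
              return true
          }
      }
      return false
  }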

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-04-26 10:14:47 -07:00
Hongming Wang
d97d7d4768 fix(platform/delegation): classify queued response + stitch drain result back
When proxyA2A returns 202+{queued:true} (target busy → enqueued for drain
on next heartbeat), executeDelegation previously treated it as a successful
completion and ran extractResponseText on the queued JSON. The result was
'Delegation completed (workspace agent busy — request queued, will dispatch...)'
landing in activity_logs.summary, which the LLM then echoed to the user
chat as garbage.

Two fixes:
1. delegation.go: detect queued shape via new isQueuedProxyResponse helper,
   write status='queued' with clean summary 'Delegation queued — target at
   capacity', store delegation_id in response_body so the drain can stitch
   back later. Also embed delegation_id in params.message.metadata + use it
   as messageId so the proxy's idempotency-key path keys off the same id.

2. a2a_queue.go: when DrainQueueForWorkspace successfully drains a queued
   item, extract delegation_id from the body's metadata and UPDATE the
   originating delegate_result row (queued → completed with real
   response_body). Broadcast DELEGATION_COMPLETE so the canvas chat feed
   flips the queued line to completed in real time.

Closes the loop so check_task_status reflects ground truth instead of
perpetual 'queued' even after the queued request eventually drained.
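
The detection helper, sketched (JSON field names are assumed from the
description above):

  package handlers

  import (
      "encoding/json"
      "net/http"
  )

  // isQueuedProxyResponse reports whether the proxy answered
  // 202 + {queued:true}, returning the delegation id the drain will
  // stitch back later.
  func isQueuedProxyResponse(status int, body []byte) (delegationID string, queued bool) {
      if status != http.StatusAccepted {
          return "", false
      }
      var r struct {
          Queued       bool   `json:"queued"`
          DelegationID string `json:"delegation_id"` // assumed field name
      }
      if err := json.Unmarshal(body, &r); err != nil || !r.Queued {
          return "", false
      }
      return r.DelegationID, true
  }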

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-04-26 10:14:19 -07:00
Hongming Wang
4dd9e2b846
Merge pull request #2102 from Molecule-AI/test/e2e-invalid-api-key-pattern-1900
test(e2e): add 'Invalid API key' regression assertion to staging A2A check (#1900)
2026-04-26 17:06:03 +00:00
Hongming Wang
1ae051ec95 test(e2e): add 'Invalid API key' regression assertion to staging A2A check (#1900)
The staging E2E suite already greps for 5 known regression patterns
in the A2A response (hermes-agent 401, model_not_found, Encrypted
content, Unknown provider, hermes-agent unreachable). The comment
block at lines 386-395 lists "Invalid API key" as the signal for the
CP #238 boot-event 401 race + stale OPENAI_API_KEY paths, but the
explicit grep was never added — meaning a regression in that class
would slip through the generic `error|exception` catch-all.

Closes the gap with one specific-pattern check that fails loud with
the relevant bug references in the message.

Verified `bash -n` clean; pre-existing shellcheck SC2015 at line 88
is unrelated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 10:03:46 -07:00
Hongming Wang
d949b5b323
Merge pull request #2101 from Molecule-AI/test/broadcaster-interface-1814
test(handlers): introduce events.EventEmitter interface (#1814 partial)
2026-04-26 16:08:25 +00:00
Hongming Wang
7d48f24fef test(handlers): introduce events.EventEmitter interface (#1814 partial)
The 3 skipped tests in workspace_provision_test.go (#1206 regression
tests) were blocked because captureBroadcaster's struct-embed wouldn't
type-check against WorkspaceHandler.broadcaster's concrete
*events.Broadcaster field. This PR fixes the interface blocker for
the 2 broadcaster-related tests; the 3rd (plugins.Registry resolver)
is a separate blocker tracked elsewhere.

Changes:

- internal/events/broadcaster.go: define `EventEmitter` interface with
  RecordAndBroadcast + BroadcastOnly. *Broadcaster satisfies it via
  its existing methods (compile-time assertion guards future drift).
  SubscribeSSE / Subscribe stay off the interface because only sse.go
  + cmd/server/main.go call them, and both still hold the concrete
  *Broadcaster.

- internal/handlers/workspace.go: WorkspaceHandler.broadcaster type
  changes from *events.Broadcaster to events.EventEmitter.
  NewWorkspaceHandler signature updated to match. Production callers
  unchanged — they pass *events.Broadcaster, which the interface
  accepts.

- internal/handlers/activity.go: LogActivity takes events.EventEmitter
  for the same reason — tests passing a stub no longer need to
  construct the full broadcaster.

- internal/handlers/workspace_provision_test.go: captureBroadcaster
  drops the struct embed (no more zero-value Broadcaster underlying
  the SSE+hub fields), implements RecordAndBroadcast directly, and
  adds a no-op BroadcastOnly to satisfy the interface. Skip messages
  on the 2 empty broadcaster-blocked tests updated to reflect the
  new "interface unblocked, test body still needed" state.

Verified `go build ./...`, `go test ./internal/handlers/`, and
`go vet ./...` all clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 09:05:52 -07:00
Hongming Wang
d86eabdd58
Merge pull request #2100 from Molecule-AI/fix/token-cache-toctou-1552
fix(git-token-helper): close TOCTOU window + stop swallowing chmod errors (closes #1552)
2026-04-26 15:24:34 +00:00
Hongming Wang
dafe08450b
Merge pull request #2099 from Molecule-AI/fix/staging-e2e-tls-timeout
fix(e2e): bump staging tenant TLS-readiness timeout 3min → 10min
2026-04-26 15:24:01 +00:00
Hongming Wang
fc2720c1fe fix(git-token-helper): close TOCTOU window + stop swallowing chmod errors (closes #1552)
The token-cache helper had three #1552 findings, all in the
mode-600-after-the-fact pattern:

1. _write_cache writes .tmp with default umask (typically 022 → 644
   on disk) and then chmod 600's after the mv. A concurrent reader
   in that microsecond-wide window sees the token at mode 644.
2. Each chmod was swallowed via `|| true` — if it ever fails, the
   tokens stay world-readable with no operator signal.
3. _refresh_gh's gh_token_file write has the same shape and same
   two issues.

Hardening:

- Wrap the .tmp creates in a `umask 077` block so the files are 600
  from creation. Restore the previous umask before return so callers
  aren't perturbed.
- Replace `chmod ... 2>/dev/null || true` with `if ! chmod ...; then
  echo WARN ...; fi`. A chmod failure is a real signal worth grep'ing.
- Apply the same pattern to the _refresh_gh gh_token_file path.
  `local` is illegal in a top-level case branch, so use a uniquely-
  named global (_gh_prev_umask) and unset it after.

Verified `bash -n` clean and `shellcheck` clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 08:22:29 -07:00
rabbitblood
f9b1b34956 fix(e2e): bump staging tenant TLS-readiness timeout 3min → 10min
Closes a Canvas-tabs E2E flake pattern, 4+ cycles running, that's been
blocking staging→main PRs since 2026-04-24 (#2096, #2094, #2055, #2079, ...).

Root cause: TLS_TIMEOUT_MS=180s (3 min) is too tight for the layered
realities of staging tenant TLS readiness:

1. Cloudflare DNS propagation through the edge (1-2 min typical)
2. Tenant CF Tunnel registering the new hostname (1-2 min)
3. CF edge ACME cert provisioning + cache (1-3 min)

Each layer can add 1-3 min on its own under heavy staging load — the
realistic worst case is well past the 3-min cap.

Provision and workspace-online timeouts were already raised to 20 min
(staging-setup.ts:42-46 history). The TLS gate was the remaining
under-budgeted step. Bumping to 10 min keeps it inside the 20-min
PROVISION envelope so a genuinely-stuck tenant still fails loud at
the earlier provision step rather than masquerading as a TLS issue.

Both call sites raised together:
- canvas/e2e/staging-setup.ts: TLS_TIMEOUT_MS = 10 * 60 * 1000
- tests/e2e/test_staging_full_saas.sh: TLS_DEADLINE += 600

Each carries an inline rationale comment so the next reviewer sees
the layer-by-layer decomposition without re-reading the issue thread.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 08:21:18 -07:00
Hongming Wang
7c8be5cac2
Merge pull request #2098 from Molecule-AI/fix/sweep-cf-orphans-noise
fix(ci): stop sweep-cf-orphans noise — drop merge_group + soft-skip when secrets unset
2026-04-26 15:08:35 +00:00
Hongming Wang
f1792e1f7a fix(ci): stop sweep-cf-orphans noise — drop merge_group + soft-skip when secrets unset
The sweep-cf-orphans workflow shipped in #2088 was noisier than
intended in two ways. This PR fixes both; the work was filed under the
Optional finding I left on the original review and matters now because
the noise is observably hitting the merge queue.

1) `merge_group: types: [checks_requested]` was firing the entire
   sweep job on every PR through the merge queue. The original intent
   ("future required-check support without a workflow edit") never
   materialized, and meanwhile every recent merge-queue eval (#2091,
   #2092, #2093, #2094, #2095, #2097) generated a red `Sweep CF
   orphans (merge_group)` run.

   Drop the trigger. Comment in the workflow explains the re-add path
   if/when the workflow IS wired as a required check (re-add the
   trigger AND gate the actual sweep step with
   `if: github.event_name != 'merge_group'` so merge-queue evals are
   no-op success).

2) The `Verify required secrets present` step exits 2 when the 6
   secrets aren't configured yet (the PR body's post-merge step,
   still pending). That turns the hourly schedule into an hourly red
   CI run for as long as the secrets stay unset.

   Convert to a soft skip: emit a `::warning::` annotation listing the
   missing secrets and set a `skip=true` step output, then gate the sweep
   step with `if: steps.verify.outputs.skip != 'true'`. Workflow
   reports green and ops still sees the warning when they review
   recent runs.

Net effect:
- merge-queue evals stop generating spurious red runs
- the schedule reports green-with-warning until secrets land
- once secrets land, behavior is identical to today's (real sweep
  runs, hard-fails if a secret is later removed)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 08:05:53 -07:00
Hongming Wang
0a2c8e25bf
Merge pull request #2097 from Molecule-AI/fix/ssrf-dispatch-a2a-1483
fix(a2a): isSafeURL guard inside dispatchA2A (closes #1483)
2026-04-26 14:21:26 +00:00
Hongming Wang
fd891a147e fix(a2a): isSafeURL guard inside dispatchA2A (closes #1483)
#1483 flagged that dispatchA2A() doesn't call isSafeURL internally —
the guard exists only at the caller level (resolveAgentURL at
a2a_proxy.go:424). The primary call path through proxyA2ARequest
is safe today, but if any future code path ever calls dispatchA2A
directly without going through resolveAgentURL, the SSRF check
would be silently bypassed.

This adds the one-line defense-in-depth guard the issue prescribed:

  if err := isSafeURL(agentURL); err != nil {
      return nil, nil, &proxyDispatchBuildError{err: err}
  }

Wrapping as *proxyDispatchBuildError preserves the existing caller
error-classification path — the same shape that maps to 500 elsewhere.

Adds TestDispatchA2A_RejectsUnsafeURL pinning the contract:
re-enables SSRF for the test (setupTestDB disables it for normal
unit tests), passes a metadata IP, asserts the build error returns
and cancel is nil so no resource is leaked.

The 4 existing dispatchA2A unit tests use setupTestDB → SSRF
disabled, so they continue passing unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 07:18:58 -07:00
Hongming Wang
a8c9644618
Merge pull request #2094 from Molecule-AI/feat/server-side-provision-timeout-2054-phase2
feat(workspace-server): surface provision_timeout_ms in workspace API (#2054 phase 2)
2026-04-26 13:53:18 +00:00
Hongming Wang
6c72b8ec68
Merge pull request #2095 from Molecule-AI/fix/ssrf-discoverhostpeer-1484
fix(discovery): isSafeURL guard on registered URLs (closes #1484)
2026-04-26 13:53:06 +00:00
Hongming Wang
2b76f7dfcb fix(discovery): isSafeURL guard on registered URLs (closes #1484)
#1484 flagged that discoverHostPeer() and writeExternalWorkspaceURL()
return URLs sourced from the workspaces table without an isSafeURL
check. Workspace runtimes register their own URLs via /registry/register
— a misbehaving / compromised runtime could register a metadata-IP URL.
Today both functions are gated by Phase 30.6 bearer-required Discover,
so exposure is theoretical. The fix makes them safe regardless of
upstream auth shape.

Changes:
- discoverHostPeer: isSafeURL on resolved URL before responding;
  503 + log on rejection.
- writeExternalWorkspaceURL: same guard applied to the post-rewrite
  outURL (so a host.docker.internal rewrite is checked AND a
  metadata-IP that survived the rewrite untouched is rejected).
- 3 new regression tests:
  * RejectsMetadataIPURL on host-peer path (169.254.169.254 → 503)
  * AcceptsPublicURL on host-peer path (8.8.8.8 → 200; positive
    counterpart so the rejection test can't pass via universal-fail)
  * RejectsMetadataIPURL on external-workspace path

setupTestDB already disables SSRF checks via setSSRFCheckForTest,
so the 16+ existing discovery tests remain untouched. Only the new
tests opt in to enabled SSRF.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 06:50:36 -07:00
rabbitblood
f1ad012024 refactor(handlers): apply simplify findings on PR #2094
- Extract walkTemplateConfigs(configsDir, fn) shared helper. Both
  templates.List and loadRuntimeProvisionTimeouts walked configsDir
  + parsed config.yaml — same boilerplate twice. Now centralised so
  a future template-discovery rule (subdir naming, README sentinel,
  etc.) lands in one place (walker sketched below).
- templates.List uses the walker — net -10 lines.
- loadRuntimeProvisionTimeouts uses the walker — net -10 lines.
- Document runtimeProvisionTimeoutsCache as 'NOT SAFE for
  package-level reuse' so a future change doesn't accidentally
  promote it to a singleton (sync.Once can't be reset → tests
  would lock out other fixtures).
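
The walker's contract, sketched (signature from the first bullet;
parsing details assumed, yaml.v3 for illustration):

  package handlers

  import (
      "os"
      "path/filepath"

      "gopkg.in/yaml.v3"
  )

  // walkTemplateConfigs calls fn once per immediate subdir of configsDir
  // holding a parseable config.yaml; everything else is silently skipped,
  // so both callers keep their skip-bad-inputs behaviour for free.
  func walkTemplateConfigs(configsDir string, fn func(dir string, cfg map[string]any) error) error {
      entries, err := os.ReadDir(configsDir)
      if err != nil {
          return nil // missing configs dir → nothing to walk, no crash
      }
      for _, e := range entries {
          if !e.IsDir() {
              continue // loose top-level files
          }
          raw, err := os.ReadFile(filepath.Join(configsDir, e.Name(), "config.yaml"))
          if err != nil {
              continue // subdir without a config.yaml
          }
          var cfg map[string]any
          if yaml.Unmarshal(raw, &cfg) != nil {
              continue // malformed yaml
          }
          if err := fn(e.Name(), cfg); err != nil {
              return err
          }
      }
      return nil
  }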

Skipped (review finding): atomic.Pointer[map[string]int] for
future hot-reload. The doc comment already documents the
limitation; YAGNI-promoting the primitive now would buy a
not-yet-built feature at the cost of more code today.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 06:40:15 -07:00
rabbitblood
27396d992c feat(workspace-server): surface provision_timeout_ms in workspace API (#2054 phase 2)
Phase 2 of #2054 — workspace-server reads runtime-level
provision_timeout_seconds from template config.yaml manifests and
includes provision_timeout_ms in the workspace List/Get response.
Phase 1 (canvas, #2092) already plumbs the field through socket →
node-data → ProvisioningTimeout's resolver, so the moment a
template declares the field the per-runtime banner threshold
adjusts without a canvas release.

Implementation:

- templates.go: parse runtime_config.provision_timeout_seconds in
  the templateSummary marshaller. The /templates API now surfaces
  the field too — useful for ops dashboards and future tooling.
- runtime_provision_timeouts.go (new): loadRuntimeProvisionTimeouts
  scans configsDir, parses every immediate subdir's config.yaml,
  returns runtime → seconds. Multiple templates with the same
  runtime: max wins (so a slow template's threshold doesn't get
  cut by a fast template's). Bad/empty inputs are silently
  skipped — workspace-server starts cleanly with no templates.
- runtimeProvisionTimeoutsCache: sync.Once-backed lazy cache.
  First workspace API request after process start pays the read
  cost (~few KB across ~50 templates); every subsequent request is
  a map lookup. Cache lifetime = process lifetime; invalidates on
  workspace-server restart, which is the normal template-change
  cadence.
- WorkspaceHandler gets a provisionTimeouts field (zero-value struct
  is valid — the cache lazy-inits on first get()).
- addProvisionTimeoutMs decorates the response map with
  provision_timeout_ms (seconds × 1000) when the runtime has a
  declared timeout. Absent = no key in the response, canvas falls
  through to its runtime-profile default. Wired into both List
  (per-row decoration in the loop) and Get.
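
The cache + decoration shape, sketched (the loader body is stubbed;
names follow the bullets above):

  package handlers

  import "sync"

  type runtimeProvisionTimeoutsCache struct {
      once sync.Once
      m    map[string]int // runtime → seconds, max-wins merged
  }

  // get lazy-inits on first call; cache lifetime == process lifetime.
  // NOT safe for package-level reuse (sync.Once can't be reset in tests).
  func (c *runtimeProvisionTimeoutsCache) get(configsDir, runtime string) int {
      c.once.Do(func() { c.m = loadRuntimeProvisionTimeouts(configsDir) })
      return c.m[runtime] // unknown runtime → 0
  }

  // addProvisionTimeoutMs decorates a workspace response map; no declared
  // timeout → no key, so the canvas falls back to its profile default.
  func addProvisionTimeoutMs(resp map[string]any, runtime, configsDir string, c *runtimeProvisionTimeoutsCache) {
      if secs := c.get(configsDir, runtime); secs > 0 {
          resp["provision_timeout_ms"] = secs * 1000
      }
  }

  // loadRuntimeProvisionTimeouts scans configsDir as described above
  // (per-subdir config.yaml, max wins on duplicate runtimes); elided here.
  func loadRuntimeProvisionTimeouts(configsDir string) map[string]int {
      return map[string]int{}
  }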

Tests (5 new in runtime_provision_timeouts_test.go):
- happy path: hermes declares 720, claude-code doesn't, only
  hermes appears in the map
- max-on-duplicate: same runtime in two templates → max wins
- skip-bad-inputs: missing runtime, zero timeout, malformed yaml,
  loose top-level files all silently ignored
- missing-dir: returns empty map, no crash
- cache: lazy-init on first get; subsequent gets hit cache even
  after underlying file changes (sync.Once contract); unknown
  runtime returns zero

Phase 3 (separate template-repo PR): template-hermes config.yaml
declares provision_timeout_seconds: 720 under runtime_config.
canvas RUNTIME_PROFILES.hermes becomes redundant + removable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 06:37:45 -07:00
Hongming Wang
f4cbb50ddf
Merge pull request #2093 from Molecule-AI/test/python-coverage-floor-1817
test(workspace): centralize pytest-cov config + 92% floor (closes #1817)
2026-04-26 13:27:05 +00:00
Hongming Wang
5d294936b3 fixup: lower coverage floor 92→86 to match post-omit measurement
The 97% number from CI run 24956647701 was measured WITHOUT a
.coveragerc omit list. Once this PR's prescribed omit set is in
effect (`*/__init__.py`, `*/tests/*`, `plugins_registry/*` — files
that don't carry behavior), the actual measurement of behavior-bearing
code on the same staging snapshot is 91.11% (run 24957664272).

86% sits at the issue's prescribed `current − 5pp` margin and
unblocks CI without lowering the bar in real terms.
2026-04-26 06:24:36 -07:00
Hongming Wang
e8c87e9f72
Merge pull request #2092 from Molecule-AI/feat/per-node-provision-timeout-2054
feat(canvas): per-workspace provision_timeout_ms override (#2054 phase 1)
2026-04-26 13:22:48 +00:00