Commit Graph

517 Commits

Author SHA1 Message Date
Hongming Wang
be271aef8b fix(orphan-sweeper): exclude runtime='external' from stale-token revoke
The Docker-mode orphan sweeper was incorrectly targeting external runtime
workspaces, revoking their auth tokens ~6 minutes after creation (one
sweep cycle past the 5-min grace).

External workspaces have NO local container by design — their agent runs
off-host. The "no live container" predicate the sweep uses to detect
wiped-volume orphans matches every external workspace unconditionally,
which was killing the only auth credential the off-host agent has.

Reproducer: create runtime=external workspace, paste the auth token into
molecule-mcp / curl, wait 5 minutes. Next request returns
`HTTP 401 — token may be revoked`. Platform log shows
`Orphan sweeper: revoking stale tokens for workspace <id> (no live
container; volume likely wiped)`.

Fix: add `AND w.runtime != 'external'` to the sweep's SELECT. The
existing test regexes (third-pass query expectations + the shared
expectStaleTokenSweepNoOp helper) are tightened to require the new
predicate, so a regression that drops it fails CI immediately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 00:49:37 -07:00
Hongming Wang
384edb4af0
Merge branch 'staging' into perf/cache-platform-inbound-secret 2026-05-03 00:08:43 -07:00
Hongming Wang
b040171fa1 perf(wsauth): in-process cache for platform_inbound_secret reads
Heartbeats fire every 60s per workspace and were the dominant caller
of ReadPlatformInboundSecret — one DB SELECT each, purely to redeliver
the same value. For an N-workspace fleet that's N SELECTs/minute of
pure overhead, growing linearly with the fleet (#189).

This adds a sync.Map cache keyed by workspaceID with a 5-minute TTL:

- **Read-through**: cache miss → DB SELECT → populate → return.
- **Write-through**: every IssuePlatformInboundSecret call refreshes
  the cache with the new value before returning, so the lazy-heal mint
  path (readOrLazyHealInboundSecret) doesn't see a stale read of the
  value it just wrote.
- **TTL eviction**: 5 minutes — generous enough that the heartbeat
  hot path hits cache for ~5 reads in a row before re-validating, short
  enough that an out-of-band rotation (operator running
  `UPDATE workspaces SET platform_inbound_secret=...` directly)
  propagates within minutes without requiring a redeploy.
- **Absence not cached**: ErrNoInboundSecret skips the cache write so
  the lazy-heal recovery contract for the column-NULL case
  (readOrLazyHealInboundSecret in workspace_provision_shared.go) keeps
  working.

Memory footprint is bounded by the active workspace fleet (~200 bytes
per entry); deleted workspaces leave dead entries until process restart,
acceptable given workspace-deletion is operator-rare.

Why in-process instead of Redis: workspace-server runs as a single
Railway service today (per memory project_controlplane_ownership);
adding Redis for this single column read would be over-engineering.
The cache is a self-contained, Redis-free upgrade that keeps the same
semantic surface (read returns the latest secret) while collapsing
the heartbeat read storm. If the deployment ever fans out across
replicas, an operator-side rotation propagates per-replica TTL-bounded
without needing a shared write log.

Tests: 5 new cases covering cache hit within TTL, refresh after TTL
(simulating an operator rotation via SQL), write-through on Issue,
absence-not-cached, and Reset clearing all entries. The setupMock
helper in wsauth and setupTestDB helper in handlers both call
ResetInboundSecretCacheForTesting() at start + cleanup so write-through
state from one test doesn't shadow SELECT expectations in the next.
SetInboundSecretCacheNowForTesting() exposes a deterministic clock
override so the TTL test doesn't sleep.

Task: #189.
2026-05-03 00:04:38 -07:00
Hongming Wang
c4f64a11a8
Merge pull request #2546 from Molecule-AI/fix/provisioner-repull-moving-tags
fix(provisioner): force re-pull of moving image tags on workspace start
2026-05-03 06:59:36 +00:00
Hongming Wang
552602e462 fix(provisioner): force re-pull of moving image tags on workspace start
Previously Start() only pulled when the image was missing locally
(imgErr != nil). Once a tenant's Docker daemon had `:latest` cached,
it stuck on that snapshot forever even after publish-runtime pushed
a newer image with the same tag — the same image-cache class that
sibling task #232 closed on the controlplane redeploy path.

Now Start() additionally re-pulls when the tag is "moving"
(`:latest`, no tag, `:staging`, `:main`, `:dev`, `:edge`, `:nightly`,
`:rolling`). Pinned tags (semver, sha-prefixed, date-stamped, build-id)
and digest-pinned references (`@sha256:...`) skip the pull because
their contents are by definition immutable.

The classifier (imageTagIsMoving) is deliberately conservative on the
"moving" side — only the well-known moving tags trip it. Misclassifying
a pinned tag as moving wastes bandwidth on every provision; misclassifying
moving as pinned silently bricks the fleet on stale snapshots, which
is exactly the bug class this fix closes.

Edge cases handled:
- Registry hostname with port (`localhost:5000/foo`) — the `:5000` is
  not mistaken for a tag.
- Digest pinning (`image@sha256:...`) — never re-pulled even if a
  moving-looking tag is also present.
- Legacy local-build tags (`workspace-template:hermes`) — treated as
  pinned (no registry to move from).

Test coverage: 22 cases across all classifier shapes. No changes to
the pull-failure path (still best-effort, ContainerCreate still
surfaces the actionable "image not found" error if the pull failed
and the cache is also empty).

Task: #215. Companion to #232.
2026-05-02 23:56:32 -07:00
Hongming Wang
dfeefb0acc fix(workspace-server): vendor upstream derive-provider.sh + close 12-prefix drift
The drift gate's monorepoRoot walk-up looked for workspace-configs-templates/
which is gitignored locally and doesn't exist in this repo at all (the
canonical script lives in molecule-ai-workspace-template-hermes). Test
failed on CI from day one with "could not find monorepo root".

Two layered fixes in one PR:

1. Vendor upstream derive-provider.sh as testdata/ + drop monorepoRoot.
   The vendored copy has a header pointing operators at the upstream
   source and a one-line cp command for refresh. Test now reads two
   files (vendored shell + workspace_provision.go) via package-relative
   paths — Go test sets cwd to the package dir, so this is hermetic
   without any walk-up gymnastics.

2. Update the case-statement regex to match upstream's renamed variable
   (${_HERMES_MODEL} since v0.12.0, the resolved value of
   HERMES_INFERENCE_MODEL with a HERMES_DEFAULT_MODEL legacy fallback).
   Regex now accepts either spelling so a future rename fails loudly
   on the parser-sanity check rather than silently returning empty.

Vendoring upstream surfaced real drift the gate was designed to catch:
upstream v0.12.0 added 12 provider prefixes that deriveProviderFromModelSlug
didn't handle (xai/grok, bedrock/aws, tencent/tencent-tokenhub, gmi,
qwen-oauth, lmstudio/lm-studio, minimax-oauth, alibaba-coding-plan,
google-gemini-cli, openai-codex, copilot-acp, copilot). Without these,
Save+Restart on a workspace using one of those prefixes would persist
LLM_PROVIDER="" and the next boot would fall back to derive-provider.sh's
runtime *=auto branch — losing the user's explicit choice on every restart.

Added all 12 case clauses + 16 new table-driven test cases (covering
both canonical and aliased forms). Drift gate now passes; future
upstream additions will fail loudly with a "DRIFT: ..." message
pointing the engineer at the missing case.

Task: #242
2026-05-02 23:51:23 -07:00
Hongming Wang
284012a768 test(workspace-server): AST drift gate for derive-provider.sh ↔ Go port
PR #2535 added a Go port of derive-provider.sh
(deriveProviderFromModelSlug) so workspace-server can persist
LLM_PROVIDER into workspace_secrets at provision time. This created
two sources of truth — if a future PR adds a provider prefix to one
without the other, the platform's persisted LLM_PROVIDER silently
disagrees with what the container's derive-provider.sh produces at
boot, with no test going red.

This adds a hermetic drift gate that:

  1. Parses workspace-configs-templates/hermes/scripts/derive-provider.sh
     with regex (handling both single-line `pat/*) PROVIDER="x" ;;`
     clauses and multi-line conditional clauses) to build a
     map[prefix]provider.
  2. Walks workspace_provision.go's AST with go/ast, finds
     deriveProviderFromModelSlug, and extracts every case-clause
     prefix → return-string-literal pair.
  3. Cross-checks both directions and accepts only the two documented
     divergences (nousresearch/* and openai/* both → "openrouter" at
     provision time because derive-provider.sh's runtime-env checks
     aren't loaded yet) via a hardcoded acceptedDivergences map.
  4. Fails with an actionable message that names both files and
     suggests the exact fix (add the case OR add to divergence list
     with a comment).

Pattern: behavior-based AST gate from PR #2367 / memory feedback —
pin the invariant by what the function maps, not by what it's named.
Stdlib-only (go/ast, go/parser, go/token, regexp); no network, no DB,
no docker — reads two monorepo files in-process.

A second sanity-check test pins anchor prefixes the regex must find,
so a future shell-syntax change can't silently produce an empty map
and trivially pass the main gate.

Closes task #242.
2026-05-02 23:51:23 -07:00
Hongming Wang
586d567a48 fix(workspace-server): log silent yaml.Unmarshal + coexistence test (#256, #257)
Two follow-ups from PR #2543's multi-model code review (audit #253).

1. **Log silent yaml.Unmarshal errors (#256).** When a malformed
   config.yaml made `yaml.Unmarshal(data, &raw)` fail, the affected
   template silently disappeared from /templates with no trace —
   operator could not distinguish "excluded due to parse error" from
   "never existed." That widened a real foot-gun once PR #2543 added
   structured top-level `providers:` (a string-shaped top-level
   `providers:` decoded into `[]providerRegistryEntry` would fail and
   drop the whole entry). Now logs `templates list: skip <id>:
   yaml.Unmarshal: <err>` and continues with the rest.

2. **Coexistence test (#257 part 1).** PR #2543 covered the structured
   registry and slug list in isolation. claude-code-default in
   production ships BOTH: top-level `providers:` (structured registry,
   2 entries) AND `runtime_config.providers:` (slug list, 3 entries).
   New `TestTemplatesList_BothProviderShapesCoexist` mirrors that
   layout, asserts both shapes surface independently with no
   cross-talk (e.g. a slug-only entry like `anthropic-api` does NOT
   synthesize a stub in the structured registry), and pins the JSON
   wire-shape for both fields side-by-side.

3. **`base_url: null` decoding assertion (#257 part 3).** Adds an
   explicit `got[0].BaseURL == ""` check in the existing
   `TestTemplatesList_SurfacesProviderRegistry` test, locking in the
   `string` (not `*string`) type. A future change to `*string` would
   surface as JSON `null` and break canvas's "no base_url = use
   provider defaults" branch — caught loudly by this assertion.

Tests: 11 TestTemplatesList_* now green, including the new
MalformedYAMLLogsAndSkips and BothProviderShapesCoexist.

The remaining piece of #257 — renaming `Providers []string` JSON tag
to `provider_slugs` — requires coordinated canvas updates across 4
files and is intentionally deferred to a separate PR (no canvas
churn while user is mid-test).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 23:01:59 -07:00
Hongming Wang
992a0c6860 fix(workspace-server): surface structured provider registry on /templates (#235)
Closes the contract drift caught by audit #253. Task #235 ("Server:
enrich /templates payload with structured providers") was marked
completed, but `templates.go` only ever emitted the
`runtime_config.providers []string` slug list — the structured
ProviderEntry shape (auth_env, model_prefixes, model_aliases, base_url)
the description promised was never plumbed.

Templates ship the structured registry under a TOP-LEVEL `providers:`
block (claude-code carries 6+ entries today; hermes still uses the
slug list). Both shapes coexist and are independent — surface them as
two separate fields:

  - `providers`           → existing []string slug list (unchanged)
  - `provider_registry`   → new []providerRegistryEntry (structured)

The canvas's ProviderModelSelector comment block already anticipates
this ("Templates that ship explicit vendor metadata (future) should
override the heuristic."). With this field in place, the canvas can
optionally drop its prefix-inference fallback for templates that ship
an explicit registry — separate PR. Today's change is purely additive
on the server side; no canvas change required.

Tests:
- TestTemplatesList_SurfacesProviderRegistry: order preservation +
  field plumbing on a claude-code-shaped fixture (oauth + minimax)
  + JSON wire-shape gate to catch struct-tag renames.
- TestTemplatesList_OmitsProviderRegistryWhenAbsent: omitempty so
  legacy templates (hermes, langgraph) don't emit `null` and break
  Array.isArray on the canvas side.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 22:42:42 -07:00
Hongming Wang
8a86b66159 fix(workspace-server): set universal MODEL env on every templated provision
Bug B fix, server-side complement to molecule-runtime PR #2538.
The runtime PR taught `workspace/config.py` to honour
`MODEL_PROVIDER` over `runtime_config.model` from the template's
verbatim YAML. This PR is the upstream half: workspace-server's
`applyRuntimeModelEnv` now sets `MODEL=<picked>` for **every**
runtime, not just hermes (which got `HERMES_DEFAULT_MODEL` already).

Pre-fix: applyRuntimeModelEnv's per-runtime switch only emitted
HERMES_DEFAULT_MODEL for hermes; every other runtime got nothing,
so the adapter read its template's default model from
/configs/config.yaml. Surfaced 2026-05-02 — picking MiniMax-M2 in
canvas → workspace booted with model=sonnet (claude-code template
default) and demanded CLAUDE_CODE_OAUTH_TOKEN.

Post-fix: MODEL is set unconditionally before the per-runtime switch.
HERMES_DEFAULT_MODEL stays for backwards compat. Adapters opt in by
reading os.environ["MODEL"] in their executor (claude-code adapter
already does this since the same Bug B fix; see
workspace-configs-templates/claude-code-default/adapter.py).

Tests
=====
- `TestApplyRuntimeModelEnv_SetsUniversalMODELForAllRuntimes`:
  table-driven across claude-code/hermes/langgraph/crewai + empty-model
  fallback + MODEL_PROVIDER-secret-fallback path. Adding a new
  runtime = adding a row, not writing a new test.
- All 6 sub-cases pass + existing
  `TestWorkspaceCreate_FirstDeploy_UnknownModel_OnlyMintModelProvider`
  pin still green.

Why now
=======
This was authored alongside the runtime PR but stashed (not committed)
during a session-handoff cleanup. The molecule-runtime side shipped at
SHA 16ac895a and is live on PyPI as molecule-ai-workspace-runtime
0.1.84, but until the workspace-server side ships, the canvas-picked
MODEL env never reaches non-hermes adapters.

Caught by the systematic stash audit triggered by the user's
discovery that ProviderModelSelector had been similarly stashed.

Closes the workspace-server side of #246. Builds on merged #2538.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 22:10:51 -07:00
Hongming Wang
d95877c88d
Merge pull request #2535 from Molecule-AI/fix/hermes-first-deploy-model-provider-persistence
Persist canvas-selected model+provider on first deploy
2026-05-03 02:25:03 +00:00
Hongming Wang
1b75fddb8e
Merge pull request #2536 from Molecule-AI/chore/prune-manifest-to-4-runtimes
chore(manifest): prune to 4 actively-supported runtimes
2026-05-03 02:24:50 +00:00
Hongming Wang
f33e59ba8c chore(manifest): prune to 4 actively-supported runtimes
Deletes the 5 unsupported workspace_templates from manifest.json
(langgraph, crewai, autogen, deepagents, gemini-cli). The runtime
matrix is now claude-code / hermes / openclaw / codex — the four
templates with shipping images, working A2A integration, and active
CI publish-image cascades.

Mirrors the prune in:
  - workspace-server/internal/handlers/runtime_registry.go
    (fallbackRuntimes for dev/test contexts that boot without the
    manifest mounted)
  - workspace-server/internal/handlers/workspace_provision.go
    (sanitizeRuntime: empty/unknown → "claude-code", was "langgraph";
    removes the langgraph/deepagents-specific runtime_config skip
    branch — they're no longer supported, so the block is dead)
  - tests for both: rename TestEnsureDefaultConfig_LangGraph →
    _Hermes, TestEnsureDefaultConfig_EmptyRuntimeDefaultsToLangGraph
    → _ClaudeCode, drop TestEnsureDefaultConfig_DeepAgents,
    update TestSanitizeRuntime_Allowlist + the two
    TestResolveRestartTemplate_* cases that pinned langgraph-default
    as the safe-default name

Why this is safe: production reads manifest.json at boot and uses it
as the authoritative allowlist; the 5 removed runtimes have not
shipped working images for ≥1 release cycle. Any provision request
naming one will now coerce to claude-code (with a log line) instead
of returning a runtime that has no functioning template repo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 19:21:47 -07:00
Hongming Wang
a1de71dd53 fix(workspace-server): persist canvas-selected model + provider on first deploy
When the canvas POSTs /workspaces with {model: "minimax/MiniMax-M2.7"},
the model slug was never written to workspace_secrets. The workspace
booted hermes once with HERMES_DEFAULT_MODEL set from payload.Model, but
on every subsequent restart applyRuntimeModelEnv's fallback chain found
nothing in envVars["MODEL_PROVIDER"] (because nothing wrote it) and
hermes silently fell through to the template default
(nousresearch/hermes-4-70b) — wrong provider keys → hermes gateway
401'd → /health poll failed → molecule-runtime never registered →
"container started but never called /registry/register".

Worse, LLM_PROVIDER was never written either (the canvas doesn't send
provider), so CP user-data wrote no provider: field to
/configs/config.yaml and derive-provider.sh fell through to PROVIDER=auto
on every custom-prefix slug.

Fix: after the workspace row commits, persist MODEL_PROVIDER (verbatim
slug) and LLM_PROVIDER (derived from slug prefix) to workspace_secrets.
LLM_PROVIDER is gating-only — derive-provider.sh remains the runtime
source of truth and can override at boot. Reuses extracted
setModelSecret / setProviderSecret helpers (refactored out of SetModel /
SetProvider gin handlers) so SQL stays in one place.

Symptom: failed-workspace 95ed3ff2 (2026-05-02).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 19:21:01 -07:00
dependabot[bot]
82d0655fe9
chore(deps)(deps): bump github.com/creack/pty in /workspace-server
Bumps [github.com/creack/pty](https://github.com/creack/pty) from 1.1.18 to 1.1.24.
- [Release notes](https://github.com/creack/pty/releases)
- [Commits](https://github.com/creack/pty/compare/v1.1.18...v1.1.24)

---
updated-dependencies:
- dependency-name: github.com/creack/pty
  dependency-version: 1.1.24
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-05-02 19:22:48 +00:00
Hongming Wang
f18ee8598a fix(restart): retry cpProv.Stop with backoff + flag exhaustion as LEAK-SUSPECT
Both restart paths (interactive Restart handler + auto-restart's
stopForRestart) used to log-and-continue on cpProv.Stop failure. After
PR #2500 made CPProvisioner.Stop surface CP non-2xx as an error, those
paths became the actual leak generator: every transient CP/AWS hiccup =
one orphan EC2 alongside the freshly provisioned one. The 13 zombie
workspace EC2s on demo-prep staging traced to this exact path.

Adds cpStopWithRetry helper with bounded exponential backoff (3 attempts,
1s/2s/4s). Different policy from workspace_crud.go's Delete handler:
Delete returns 500 to the client on Stop failure (loud-fail-and-block —
user asked to destroy, silent leak unacceptable), whereas Restart's
contract is "make the workspace alive again" — refusing to reprovision
strands the user with a dead workspace. So this helper retries to absorb
transient failures, then on exhaustion emits a structured `LEAK-SUSPECT`
log line for the (forthcoming) CP-side workspace orphan reconciler to
correlate. Caller proceeds to reprovision regardless.

ctx-cancel exits the retry early without sleeping the backoff (matters
during shutdown drain); the cancel path emits a distinct log line and
deliberately does NOT emit LEAK-SUSPECT — operator-cancel and
retry-exhaustion are different signals and conflating them would noise
up the orphan-reconciler queue with workspaces we never had a chance to
retry.

Tests: 5 behavior tests covering every branch (no-op, first-try success,
eventual success, exhaustion, ctx-cancel) + 1 AST gate that pins the
helper-only invariant (any future inline `h.cpProv.Stop(...)` in
workspace_restart.go fires the gate, mutation-tested).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 23:36:38 -07:00
Hongming Wang
5167e482d0 fix(cp-provisioner): surface CP non-2xx on Stop to plug EC2 leak
http.Client.Do only errors on transport failure — a CP 5xx (AWS
hiccup, missing IAM, transient outage) was silently treated as
success. Workspace row then flipped to status='removed' and the EC2
stayed alive forever with no DB pointer (the "orphan EC2 on a
0-customer account" scenario flagged in workspace_crud.go #1843).
Found while triaging 13 zombie workspace EC2s on demo-prep staging.

Adds a status-code check that returns an error tagged with the
workspace ID + status + bounded body excerpt, so the existing
loud-fail path in workspace_crud.go's Delete handler can populate
stop_failures and surface a 500. Body read is io.LimitReader-capped
at 512 bytes to keep error logs sane during a CP outage.

Tests: 4 new (5xx surfaces, 4xx surfaces, 2xx variants 200/202/204
all succeed, long body is truncated). Test-first verified — the
first three fail on the buggy code and all four pass on the fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 22:59:01 -07:00
Hongming Wang
0064f02c00 test(sweeper): integration coverage for manifest-override + accessor consolidation
Two follow-ups from PR #2494's review:

1. Two new sweep tests exercise the lookup path through
   sweepStuckProvisioning end-to-end:
     - ManifestOverrideSparesRow: claude-code 11min old, manifest=20min
       → no UPDATE, no broadcast (sparing works through the sweeper)
     - ManifestOverrideStillFlipsPastDeadline: claude-code 21min old,
       manifest=20min → flipped + payload.timeout_secs=1200
   Closes the gap that the unit-test on provisioningTimeoutFor alone
   left open: a future refactor could drop the lookup arg from the
   sweeper's call and only the unit test caught it. Verified by
   regression-injecting `lookup→nil` in sweepStuckProvisioning — both
   new tests fail, the old ones still pass.

2. addProvisionTimeoutMs now goes through ProvisionTimeoutSecondsForRuntime
   instead of calling provisionTimeouts.get directly. Single accessor
   path for the same data — the canvas response and the sweeper now
   resolve identically by construction.

No production behavior change; tests + accessor cleanup only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 22:00:36 -07:00
Hongming Wang
18edf88d59 fix(sweeper): honour template-manifest provision_timeout_seconds
Real wiring gap discovered while investigating issue #2486 cluster of
prod claude-code workspaces failed at exactly 10m. The
runtimeProvisionTimeoutsCache (#2054 phase 2) reads
runtime_config.provision_timeout_seconds from each template's
config.yaml so the **canvas** spinner respects per-template timeouts —
but the **sweeper** in registry/provisiontimeout.go hardcoded 10 min
(claude-code) / 30 min (hermes) and never consulted the manifest. So a
template that declared a longer window had a UI that waited correctly
but a sweeper that killed the row at the hardcoded floor anyway.

Resolution order pinned by new TestProvisioningTimeout_ManifestOverride:

  1. PROVISION_TIMEOUT_SECONDS env (ops-debug global override)
  2. Template manifest lookup (per-runtime, beats hermes default too)
  3. Hermes default (30 min — CP bootstrap-watcher 25 min + 5 min slack)
  4. DefaultProvisioningTimeout (10 min)

Wiring:
  - registry: new RuntimeTimeoutLookup function type, threaded through
    StartProvisioningTimeoutSweep + sweepStuckProvisioning + the
    pre-existing provisioningTimeoutFor.
  - handlers: ProvisionTimeoutSecondsForRuntime exposes the cache's
    lookup as a method so main.go can pass it without breaking the
    handlers→registry import direction.
  - cmd/server/main.go: wire wh.ProvisionTimeoutSecondsForRuntime into
    the sweep boot.

Verified:
  - go test -race ./... passes (every workspace-server package).
  - Regression-injected the lookup arm: 3 manifest-override subcases
    fail with the actual-vs-expected gap, confirming the new test is
    load-bearing.
  - The original two timeout tests (env-override, hermes default) keep
    passing — `lookup=nil` argument preserves their semantics.

Operator action enabled: a template wanting a 15-min window can now
just set `runtime_config.provision_timeout_seconds: 900` in its
config.yaml and the sweeper honours it on the next workspace-server
restart.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 21:44:42 -07:00
Hongming Wang
955755ce1e test(provision): tighten Assertion 4 message to name both failure modes
Per review nit on PR #2491: the previous message ("a goroutine reached
cpProv.Start but never broadcast its failure") could mislead an
operator if Assertion 2 and 4 both fire — Assertion 4 also catches
"goroutine exited via an earlier path before reaching Start." Spell
both modes out and cross-reference Assertion 2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 20:14:39 -07:00
Hongming Wang
82cc331517 test(provision): harden panic tests with re-raise guard + assert broadcast count
Post-merge follow-up to PR #2487 review feedback:

1. guardAgainstReraise(fn) helper around every panic-test exercise. The
   original RecoversAndMarksFailed had its own outer recover() to detect
   re-raise; NoOpWhenNoPanic and PersistFailureLogged didn't. If a future
   regression makes logProvisionPanic re-raise, those two would have
   crashed the test process (taking sibling tests down) instead of
   reporting a clean failure. Now all three use the shared guard.

2. Concurrent repro now asserts bcast.count == 7 — the new
   concurrentSafeBroadcaster's count field was added in the race fix
   but not actually consumed. Cross-checks the existing recorder-set
   assertion from a different angle: a goroutine could in principle
   reach cpProv.Start (recorder hits) but then lose its
   WORKSPACE_PROVISION_FAILED broadcast on the failure path. Pinning
   both rules out that silent-drop variant for the canvas-broadcast
   contract specifically.

3. Comment on captureLog noting log.SetOutput is process-global and
   incompatible with t.Parallel() — preempts a future footgun if
   someone parallelizes the panic suite.

Verified: all four tests pass under -race; full handlers + db packages
green under -race.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 20:11:11 -07:00
Hongming Wang
4f64c4366f test(provision): swap to concurrent-safe broadcaster in 7-burst harness
CI Platform (Go) ran with -race and the concurrent test tripped the
detector: captureBroadcaster (sequential-test stub) writes lastData
unguarded; 7 fan-out goroutines call markProvisionFailed → that stub
concurrently. Local non-race run had hidden it.

Introduce concurrentSafeBroadcaster (mutex-counted) for this single
fan-out test. Sequential tests keep using captureBroadcaster — the
fix is local to the test that creates the goroutines.

Verified ./internal/handlers passes with -race.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 20:03:11 -07:00
Hongming Wang
7a19724194 fix(provision): route panic recovery through markProvisionFailed + fix log capture
Three fixes addressing review of the issue #2486 observability PR:

1. CI failure: original inline UPDATE in logProvisionPanic used a hard-coded
   `status='failed'` literal, which trips workspace_status_enum_drift_test
   (the post-PR-#2396 gate that requires every status write to flow through
   models.Status* via parameterized $N). Refactor to call
   h.markProvisionFailed which uses StatusFailed parameterized.

2. Canvas-broadcast gap (review finding): inline UPDATE skipped
   RecordAndBroadcast, so panic recovery marked the row failed in DB but
   the canvas spinner stayed on "provisioning" until the next poll.
   markProvisionFailed fires WORKSPACE_PROVISION_FAILED, so canvas now
   flips to a failure card immediately.

3. Critical test bug (review finding): `defer log.SetOutput(log.Writer())`
   in three test sites evaluated log.Writer() at defer-fire time AFTER the
   SetOutput swap — restoring the buffer to itself, never restoring
   os.Stderr. Subsequent tests in the package were running with the panic
   tests' captured buffer as their writer. Extracted captureLog(t) helper
   that captures `prev` BEFORE the swap and uses t.Cleanup.

Plus: softened the "goroutine never started" comment in the concurrent
repro harness — the harness atomic-counts BEFORE the entry log fires, so
"never started" was misleading; the real failure mode is "entry log
renamed/removed or writer hijacked."

Verified: full handlers suite passes; drift gate passes (Platform Go CI
failure root-caused). Regression-injected the recover body again — both
panic tests still fail as expected, confirming the contract is gated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 19:56:34 -07:00
Hongming Wang
fe92194584 test(provision): concurrent 7-burst repro harness for #2486 silent-drop
Goal: a deterministic, in-process reproduction of the prod incident
where 7 simultaneous claude-code provisions on the hongming tenant
produced ZERO log lines from any of the four documented exit paths.

Approach: stub CPProvisioner that records every Start() call,
sqlmock for the prepare flow, fire 7 goroutines concurrently against
provisionWorkspaceCP, then assert:

  1. Entry log fired exactly 7 times (one per goroutine).
  2. Stub Start() recorded all 7 distinct workspace IDs.
  3. Each goroutine's entry log names its own workspace ID.

Result on staging head as of 2026-05-02: PASSES — meaning the
silent-drop class isn't reproducible against current head with stub
CP. Tenant hongming runs sha 76c604fb (725 commits behind staging),
so the bug is most likely already fixed upstream — hongming needs
a redeploy.

The test stays as a regression gate: any future refactor that
re-introduces silent goroutine swallow in the CP provision path
(rate-limit drop, channel-send-without-receiver, panic without
recover, etc.) trips it.

A safeWriter wraps the captured log buffer because raw
bytes.Buffer.Write isn't safe for concurrent goroutines — without
serialization the 7 entry-log lines interleave at byte boundaries
and the strings.Count assertion gets unreliable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 19:19:05 -07:00
Hongming Wang
46daae1ffb fix(provision): entry log + panic recovery on workspace provision goroutines
Issue #2486: 7 claude-code workspaces stuck in provisioning produced
NONE of the four documented exit-path log lines in
provisionWorkspaceCP — neither prepare-failed, nor start-failed, nor
persist-instance-id-failed, nor success. Operators couldn't tell
whether the goroutine ran at all.

Add an entry log at the top of provisionWorkspaceOpts +
provisionWorkspaceCP so a missing entry distinguishes "goroutine
never started" from "started but exited via an unlogged path."

Add logProvisionPanic at the same defer site so a panic inside
either provisioner doesn't (a) crash the whole workspace-server
process, taking every other tenant workspace with it, and (b)
silently leave the row in `provisioning` until the 10-min sweeper
fires. The recover persists status='failed' with a sanitized
panic-class message via a fresh 10s context (the goroutine's own
ctx may have been the one panicking).

Tests pin three contracts:
- no-op when no panic (otherwise every successful provision
  emits a spurious log line)
- recovers + persists failed status on panic, with stack trace
- defense-in-depth: if the persist itself fails, log it instead
  of leaving the operator with a recovered-panic log but no row

Regression-injected by neutering the recover() body — all three
tests fail until the recover + UPDATE path is restored.

This is observability + resilience only, not a root-cause fix
for #2486. The actual silent-drop class still needs reproduction
once the tenant is on a build that includes this entry log.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 19:14:20 -07:00
Hongming Wang
15e1ea36de feat(activity): add before_ts paging knob to /activity route
The wheel-side chat_history MCP tool advertises a `before_ts`
parameter for backward paging through long histories, and the docs
describe it as the canonical pagination knob — but the server
silently ignored it until now. Without this fix, an agent passing
before_ts to chat_history would always get the most-recent N rows
and pagination would be broken end-to-end.

Add `before_ts` query param parsed as RFC3339 at the trust boundary
and translated into a `created_at < $X` clause on the existing
builder. Mirrors the strict-inequality shape since_id uses for
forward paging (`created_at > cursorTime`) so paging across both
directions has consistent semantics.

Tests: 3 new branches (positive filter, composition with peer_id
into the canonical chat_history paging shape, RFC3339 rejection
across 4 malformed inputs including URL-encoded SQL injection).
Mutation-verified pre-commit; existing 9 activity tests still pass.

Reported by self-review on PR #2474.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 18:04:31 -07:00
Hongming Wang
c85fac4663 feat(activity): add peer_id filter to /workspaces/:id/activity
Surfaces the conversation history with one specific peer for the
wheel-side chat_history MCP tool. The filter joins
(source_id = $X OR target_id = $X) so both inbound (peer was sender)
and outbound (peer was recipient) turns appear in the same view,
ordered by created_at, and composes with existing type/source/
since_secs/since_id/limit filters.

Validates peer_id as a UUID at the trust boundary so a malformed
caller can't smuggle SQL fragments via the parameter — the args are
bound but the explicit rejection gives the wheel a cleaner 400
signal than an empty list, and defends against any future code path
that might interpolate the value into a URL or another query.

Tests: 3 new branches (positive filter, composition with
type+source, UUID-shape rejection across 5 malformed inputs).
Mutation-verified: reverting activity.go fails all peer_id tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 17:46:15 -07:00
Hongming Wang
517bd0efc5 feat(canvas+workspace-server): data-driven Provider dropdown (#199)
Option B PR-5. Canvas Config tab now exposes a Provider override input
that's adapter-driven from each runtime's template — no hardcoded
provider list in the canvas. PUT /workspaces/:id/provider on Save
when dirty; auto-restart suppression to avoid double-restart with
the model handler's own restart.

The dropdown's suggestion list comes from /templates →
runtime_config.providers (the field added in
molecule-ai-workspace-template-hermes PR #31). For templates that
haven't migrated to the explicit providers list yet, suggestions
derive from model[].id slug prefixes — still adapter-driven, just
inferred. This keeps existing templates working while platform team
migrates them one at a time.

workspace-server changes:
- Add Providers []string field to templateSummary JSON
- Parse runtime_config.providers in /templates handler
- 2 new tests pin the surfacing + omitempty behavior

canvas changes:
- Remove hardcoded PROVIDER_SUGGESTIONS constant
- Add provider/originalProvider state + PUT-on-save logic
- Add deriveProvidersFromModels() fallback helper
- Wire RuntimeOption.providers from /templates response
- 8 new tests pin the behavior end-to-end

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 11:19:17 -07:00
Hongming Wang
1a1285171c
Merge pull request #2453 from Molecule-AI/feat/workspace-server-provider-endpoint
feat(workspace-server): PUT /provider endpoint (#196 — Option B PR-2)
2026-05-01 05:37:15 +00:00
Hongming Wang
258c6bea44 feat(workspace-server): PUT /provider endpoint for explicit LLM provider (#196)
Mirror of PUT /model. Stores the provider slug as the LLM_PROVIDER
workspace secret so the canvas can update model + provider
independently — a user might keep the same model alias and switch
providers (route through a different gateway), or vice versa.
Forcing both into one endpoint imposes a single Save+Restart per
change; two endpoints let canvas update each as the user picks.

Plumbs through the existing chain: secret-load → envVars → CP
req.Env → user-data env exports → /configs/config.yaml (after
controlplane PR #364 lands the heredoc append).

Tests: 5 new cases mirroring SetModel/GetModel exactly — default
empty response, DB error, upsert with restart trigger, empty-clears,
invalid-UUID rejection.

Part of: Option B PR-2 (#196) — workspace-server plumbs LLM_PROVIDER
Stack:   PR-1 schema (#2441 merged)
         PR-2 (this)  ws-server endpoint
         PR-3 (#364 open) CP user-data persistence
         PR-4 (pending) hermes adapter consume
         PR-5 (pending) canvas Provider dropdown
2026-04-30 22:25:48 -07:00
Hongming Wang
364c70fc71 fix(workspace-server): emit null removed_at when timestamp fetch fails
#2429 review finding. The 410-Gone path issues a follow-up
`SELECT updated_at` after detecting status='removed'. If that query
fails (workspace row deleted between the two queries, transient DB
error, etc.), `removedAt` stays as Go's zero time and the JSON body
emits `"removed_at": "0001-01-01T00:00:00Z"` — a misleading timestamp
the client has to know to ignore.

Now we branch on `removedAt.IsZero()` and emit `null` for the failed
path. The actionable signal (the 410 + hint) is unchanged; only the
timestamp shape gets cleaner.

Pinned by `TestWorkspaceGet_RemovedReturns410WithNullRemovedAtOnTimestampFetchFailure`,
which simulates the row vanishing via `sqlmock`'s `WillReturnError(sql.ErrNoRows)`.
The original `_RemovedReturns410` test now also asserts that the
happy-path timestamp is a non-null value (was just checking the key
existed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 22:24:59 -07:00
Hongming Wang
72f0079c10 feat(workspace-server): GET /workspaces/:id returns 410 Gone when status='removed' (#2429)
Defense-in-depth at the endpoint level. Previously, GET /workspaces/:id
returned 200 OK with `status:"removed"` in the body for deleted
workspaces — silent-fail UX hit on the hongmingwang tenant 2026-04-30:
the channel bridge / molecule-mcp wheel had a dead workspace_id + token
in .env, get_workspace_info returned 200 → caller assumed everything
was fine, then every subsequent /registry/* call 401d because tokens
were revoked, and operators had no idea their workspace was gone.

#2425 fixed the steady-state heartbeat path (escalate to ERROR after
3 consecutive 401s). This change is the startup-time defense — fail
loud when the operator first probes the workspace instead of waiting
for the heartbeat to sour.

The 410 body includes:
  {error: "workspace removed", id, removed_at, hint: "Regenerate ..."}

Audit-trail consumers that need the body shape of a removed workspace
(admin views, "show me deleted workspaces" tooling) opt into the
legacy 200 + body via ?include_removed=true. Without this opt-in path
the audit trail becomes invisible at the API layer.

Two new tests pinned:
  - TestWorkspaceGet_RemovedReturns410
  - TestWorkspaceGet_RemovedWithIncludeQueryReturns200

Follow-ups in separate PRs:
  - Update workspace/a2a_client.py get_workspace_info to surface
    "removed" specifically rather than collapsing into "not found"
  - Update channel bridge getWorkspaceInfo (server.ts) to detect 410
    → log clear "workspace was deleted, re-onboard" error
  - Audit canvas/* + admin tooling consumers that may rely on the
    legacy 200 + status:"removed" shape; switch them to the
    ?include_removed=true opt-in if needed
  - Update docs (runtime-mcp.mdx Troubleshooting + external-agents.mdx
    lifecycle table)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 21:55:24 -07:00
Hongming Wang
b9311134cf fix(terminal-diagnose): KI-005 hierarchy check + race-free stderr capture
Two fixes from /code-review-and-quality on PR #2445:

1. **KI-005 hierarchy check parity with /terminal**

   HandleConnect runs the KI-005 cross-workspace guard before dispatch
   (terminal.go:85-106): when X-Workspace-ID is set and != :id, validate
   the bearer's workspace binding then call canCommunicateCheck. Without
   this, an org-level token holder in tenant Foo can probe any
   workspace's diagnostic state by guessing the UUID — same enumeration
   vector KI-005 closed for /terminal in #1609. Per-workspace bearer
   tokens are URL-bound by WorkspaceAuth, so the gap is org tokens
   within the same tenant.

   Fix: copy the same gate into HandleDiagnose, before the
   instance_id SELECT.

   Test: TestHandleDiagnose_KI005_RejectsCrossWorkspace stubs
   canCommunicateCheck=false and confirms 403 fires before the DB
   lookup (sqlmock's ExpectationsWereMet pins that we never reached
   the SELECT COALESCE). Mirrors the existing
   TestTerminalConnect_KI005_RejectsUnauthorizedCrossWorkspace.

2. **Race-free tunnel stderr capture (syncBuf)**

   strings.Builder isn't goroutine-safe. os/exec spawns a background
   goroutine that copies the subprocess's stderr fd to cmd.Stderr's
   Write, so reading the buffer's String() from the request goroutine
   on wait-for-port timeout while the tunnel may still be writing is
   a data race that `go test -race` flags. Worst-case impact in
   production is a garbled Detail string (not a crash), but the fix
   is small.

   Fix: wrap bytes.Buffer in a sync.Mutex (syncBuf type). Same
   io.Writer interface, no API changes elsewhere.

3. **Nit cleanup**

   - read-pubkey failure now reports as its own step name instead of
     a duplicated "ssh-keygen" entry — disambiguates two different
     failure modes that previously shared a name.
   - Replaced numToString hand-rolled int-to-string with strconv.Itoa
     in the test (no import savings reason existed).

Suite: 4 diagnose tests pass with -race; full handlers suite passes
in 3.95s. go vet clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 21:19:18 -07:00
Hongming Wang
d012a803e4 feat(terminal): add diagnose endpoint for SSH probe stages
GET /workspaces/:id/terminal/diagnose runs the same per-stage pipeline as
/terminal (ssh-keygen → EIC send-key → tunnel → ssh) but non-interactively
and returns JSON. Each stage reports {name, ok, duration_ms, error,
detail}, plus a top-level first_failure naming the broken stage.

Why: when the canvas terminal silently disconnects ("Session ended" with
no error frame — the user-reported failure mode on hongmingwang's hermes
workspace), there is no remote-readable signal of WHICH stage failed.
The ssh client's stderr lives only in the workspace-server's stdout on
the tenant CP EC2 — invisible without shell access. /terminal can't
expose stderr cleanly because it has already upgraded to WebSocket
binary frames by the time ssh runs. /terminal/diagnose stays pure
HTTP/JSON, so the same auth (WorkspaceAuth + ADMIN_TOKEN fallback) gives
operators a one-call probe that splits "IAM broke" (send-ssh-public-key
fails) from "tunnel/SG broke" (wait-for-port fails) from "sshd auth
broke" (ssh-probe gets Permission denied) from "shell broke" (probe
exits non-zero with stderr).

Stages mirrored from handleRemoteConnect in terminal.go:

  1. ssh-keygen          ephemeral session keypair
  2. send-ssh-public-key AWS EIC API push, IAM-gated
  3. pick-free-port      local port for the tunnel
  4. open-tunnel         aws ec2-instance-connect open-tunnel start
  5. wait-for-port       the tunnel actually listens (folds tunnel
                         stderr into Detail when it doesn't)
  6. ssh-probe           non-interactive `ssh ... 'echo MARKER'` that
                         confirms auth + bash + the marker round-trip
                         (CombinedOutput captures stderr verbatim —
                         this is the whole reason the endpoint exists)

Local Docker workspaces (no instance_id) get a smaller probe:
container-found + container-running. Same response shape so callers
don't need to branch.

Tests stub sendSSHPublicKey / openTunnelCmd / sshProbeCmd via the
existing package-level vars (same pattern as TestSSHCommandCmd_*) so
the test suite stays hermetic — no AWS, no network. The three new
tests pin: (a) routing to remote on instance_id present,
(b) routing to local on empty instance_id, (c) the operationally
critical case — full success through wait-for-port then a probe
failure surfaces ssh stderr in the ssh-probe step's Error/Detail
with first_failure="ssh-probe".

Auth: rides on existing WorkspaceAuth middleware. Operators with the
tenant ADMIN_TOKEN (fetched via /cp/admin/orgs/:slug/admin-token) can
probe any workspace without per-workspace token; same admin path as
the canvas dashboard reads workspace activity.

Response always returns HTTP 200 (success or step failure are both in
the JSON body) so callers don't need to branch on status code — the
endpoint either reports a first_failure or doesn't.

Resolves task #200, supports task #193 (workspace EC2 sshd
unresponsive — without this endpoint we couldn't pin the failure
stage from outside the tenant CP EC2).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 21:10:20 -07:00
Hongming Wang
cda93e3c52 test(terminal): update exact-argv snapshot to include ConnectTimeout
The pre-existing TestSSHCommandCmd_BuildsArgv asserts the literal argv
slice. Adding `-o ConnectTimeout=10` shifted the slice — this commit
tracks the snapshot to match. The new behavior-based
TestSSHCommandCmd_ConnectTimeoutPresent (added in the prior commit)
keeps the invariant pinned without depending on argv ordering, so
future tweaks land in only one place even if more options are added.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 20:23:48 -07:00
Hongming Wang
f30b3d4476 fix(terminal): cap ssh handshake at 10s so hung sshd surfaces fast
When the workspace EC2's sshd is unresponsive (mid-restart, SG drop,
AMI without ec2-instance-connect), the canvas's xterm shows the user's
typed bytes echoed back by the workspace-server's *local* PTY (cooked +
echo mode before ssh sets it raw post-handshake) and then closes
silently when Cloudflare's idle WebSocket timer fires (~100s) — with no
"Connection refused" or "Permission denied" output ever reaching the
user. This is what hongmingwang's hermes terminal looked like 2026-04-30
right after the heartbeat-fix redeploy: status="online" but the shell
appeared dead.

Caught reproducibly by holding a fresh /workspaces/<id>/terminal
WebSocket open for 60s — server sent zero frames except the local-PTY
echo of one keystroke typed at t=8s. ssh was hung at handshake; bash
never saw the byte.

Fix: add `-o ConnectTimeout=10` to ssh args. Now the failure surfaces
as a real ssh error message in the terminal within 10s, instead of
masquerading as a silently dead shell over the next ~100s. Doesn't
diagnose *why* sshd isn't responding (separate investigation), but
it does mean the user gets actionable feedback within seconds.

Behavior-based regression test asserts `-o ConnectTimeout=N` is in the
ssh argv — pins presence, not the literal value, so operators can tune
without breaking the gate. Verified to FAIL on pre-fix code (matched
the literal arg pair) and PASS on fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 20:16:41 -07:00
Hongming Wang
f6ddcf66ab Move /restart Stop into the async goroutine
Pre-fix Restart called provisioner.Stop / cpProv.Stop synchronously before
returning the HTTP response. CPProvisioner.Stop is DELETE /cp/workspaces/:id
→ CP → AWS EC2 terminate, which can exceed the canvas's 15s HTTP timeout,
especially right after a platform-wide redeploy when every tenant queues a
CP request at once. The user sees a misleading "signal timed out" red banner
on Save & Restart even though the async re-provision goroutine continues
and the workspace ends up online.

Caught 2026-04-30 on hongmingwang hermes workspace 32993ee7-…cb9d75d112a5
right after the heartbeat-fix platform redeploy at 02:11Z. The workspace
came back online correctly; only the canvas response timed out.

Fix moves Stop into the same goroutine as provisionWorkspaceCP /
provisionWorkspaceOpts. The handler now responds in <500ms (DB lookup +
status UPDATE only). Stop and provision keep their existing ordering
inside the goroutine. Uses context.Background() to detach from the request
lifecycle so an aborted client connection doesn't cancel the in-flight
Stop/provision pair.

Pinned by a behavior-based AST gate (workspace_restart_async_test.go):
the test parses workspace_restart.go and walks the Restart function body,
flagging any <recv>.{provisioner,cpProv}.Stop call that isn't nested in a
*ast.FuncLit. Same family as callsProvisionStart in
workspace_provision_shared_test.go. Verified the gate fails on the
pre-fix shape (flags lines 151 and 153 — the original sync Stop calls).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 19:35:29 -07:00
Hongming Wang
a5c5139e3a fix(workspace): deliver platform_inbound_secret on every heartbeat
Heartbeat now echoes the workspace's platform_inbound_secret on every
beat (mirroring /registry/register), and the molecule-mcp client
persists it to /configs/.platform_inbound_secret on receipt.

Symptom (2026-04-30, hongmingwang tenant): chat upload returned 503
"workspace will pick it up on its next heartbeat" and then 401 on
retry — permanent until workspace restart. The 503 message was a lie:
heartbeat used to discard the platform_inbound_secret entirely; only
register delivered it, and register fires once at startup.

Server (Go):
  - Heartbeat handler reuses readOrLazyHealInboundSecret (the same
    helper chat_files + register use), so heartbeat-time recovery
    covers the rotate / mid-life NULL-column case the existing
    register-time heal can't reach.
  - Failure is non-fatal: liveness contract trumps secret delivery,
    chat_files retries lazy-heal on its own next request.

Client (Python):
  - _persist_inbound_secret_from_heartbeat parses the heartbeat 200
    response and persists via platform_inbound_auth.save_inbound_secret.
  - All exceptions swallowed — heartbeat liveness > secret persistence;
    next tick (≤20s) retries.

Tests:
  - Server: pin secret-present, lazy-heal-mint-on-NULL, and heal-
    failure-omits-field branches.
  - Client: pin persist-on-200, skip-on-empty, skip-on-non-dict-body,
    skip-on-401, swallow-save-OSError.
2026-04-30 17:36:33 -07:00
Hongming Wang
876c0bfcd4 docs(canvas): update Universal MCP snippet — molecule-mcp now standalone
The canvas tab snippet for the Universal MCP path was written before
this PR added the built-in register + heartbeat thread. Earlier wording
described it as "outbound-only — pair with the Claude Code or Python SDK
tab for heartbeat + inbound messages" — that's stale. molecule-mcp now
handles register + heartbeat itself; the only thing it doesn't yet do is
inbound A2A delivery.

Updated:
- externalUniversalMcpTemplate header comment + body — describes
  standalone behavior, points operators at SDK/channel only when they
  need INBOUND (not heartbeat).
- Drops the now-redundant curl-register step from the snippet — the
  binary registers itself on startup.
- Canvas modal label likewise updated.

No runtime / behavior change; pure docs polish so a copy-pasting
operator's mental model matches what the binary actually does.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 15:52:15 -07:00
Hongming Wang
427300f3a4 feat: make molecule-mcp standalone (built-in register + heartbeat) + recover awaiting_agent on heartbeat
Two paired fixes that together let an external operator run a single
process (molecule-mcp) and see their workspace come up online in the
canvas — the bug surfaced live when status stuck at "awaiting_agent /
OFFLINE" despite an active MCP server.

Platform side (workspace-server/internal/handlers/registry.go):
  Heartbeat handler already auto-recovers offline → online and
  provisioning → online, but NOT awaiting_agent → online. Healthsweep
  flips stale-heartbeat external workspaces TO awaiting_agent, and
  with no recovery path the workspace stays "OFFLINE — Restart" in the
  canvas forever. Add the symmetric branch: if currentStatus ==
  "awaiting_agent" and a heartbeat arrives, flip to online + broadcast
  WORKSPACE_ONLINE. Mirrors the existing offline/provisioning patterns
  exactly. Test: TestHeartbeatHandler_AwaitingAgentToOnline asserts
  the SQL UPDATE fires with the awaiting_agent guard clause.

Wheel side (workspace/mcp_cli.py):
  molecule-mcp was outbound-only — operators had to run a separate
  SDK process to register + heartbeat. Now mcp_cli.main():
    1. Calls /registry/register at startup (idempotent upsert flips
       status awaiting_agent → online via the existing register path).
    2. Spawns a daemon thread that POSTs /registry/heartbeat every
       20s. 20s is comfortably under the healthsweep stale window so
       a single missed beat doesn't cause status churn.
    3. Runs the MCP stdio loop in the foreground.

  Both calls set Origin: ${PLATFORM_URL} so the SaaS edge WAF accepts
  them. Threaded heartbeat (not asyncio) chosen because it doesn't
  need to share an event loop with the MCP stdio server — daemon=True
  cleanly dies when the operator's runtime exits.

  MOLECULE_MCP_DISABLE_HEARTBEAT=1 escape hatch lets in-container
  callers (which have heartbeat.py running already) reuse the entry
  point without double-heartbeating. Default is enabled.

End-to-end verification (live, against
hongmingwang.moleculesai.app, workspace 8dad3e29-...):
  pre-fix:  status=awaiting_agent → canvas shows OFFLINE forever
  post-fix: ran `molecule-mcp` for 5s standalone → canvas state:
            status=online runtime=external agent=molecule-mcp-8dad3e29

Test coverage: 7 new mcp_cli tests (register-at-startup, heartbeat-
thread-spawned, disable-env-skips-both, env-and-file token resolution,
register payload shape, heartbeat endpoint + headers); 1 new platform
test (awaiting_agent → online recovery). Full workspace + handlers
suites green: 1355 Python, full Go handlers passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 15:42:44 -07:00
Hongming Wang
716589742c feat(canvas): add Universal MCP tab to external-agent connect modal
The "Connect your external agent" dialog already covered Claude Code,
Python SDK, curl, and raw fields. This adds a Universal MCP tab that
documents the new \`molecule-mcp\` console script — the runtime-
agnostic baseline shipped by PR #2413's workspace-runtime changes.

Surface area:
- New \`externalUniversalMcpTemplate\` constant in workspace-server.
  Three-step snippet: pip install runtime → one-shot register via curl
  → wire molecule-mcp into agent's MCP config (Claude Code example,
  notes that hermes/codex/etc. take the same env-var contract).
- Workspace create response now includes \`universal_mcp_snippet\`
  alongside the existing curl/python/channel snippets.
- Canvas modal renders the tab when \`universal_mcp_snippet\` is
  present; backward-compatible with older platform builds (tab hides
  when empty).

Origin/WAF coverage (the user explicitly asked for this):
- The runtime wheel handles Origin automatically (this PR's earlier
  commit on platform_auth.auth_headers).
- The curl tab now sets \`Origin: {{PLATFORM_URL}}\` preemptively
  with an explanatory comment; \`/registry/register\` is currently
  WAF-allowed without it but adding now keeps the snippet working
  if WAF rules expand. The comment also explains why
  \`/workspaces/*\` paths return empty 404 without Origin — the
  exact failure mode I hit while smoke-testing this PR live.
- The MCP snippet's footer notes that the wheel auto-handles
  Origin so operators don't think about it.

End-to-end verification (against live tenant
hongmingwang.moleculesai.app, freshly registered workspace):
- get_workspace_info → full JSON
- list_peers → "Claude Code Agent (ID: 97ac32e9..., status: online)"
- recall_memory → "No memories found."
all returned by the molecule-mcp binary speaking MCP stdio to
this Claude Code session.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 15:34:27 -07:00
Hongming Wang
36e263a07d fix(workspace-server): skip provision pipeline on Restart for runtime=external
POST /workspaces/:id/restart on a runtime=external workspace ran the full
re-provision pipeline (Stop → provisionWorkspace*), which calls
issueAndInjectToken → RevokeAllForWorkspace. For external workspaces
(operator-driven, no container/EC2) that silently destroyed the operator's
local bearer token on every "Restart" click in the canvas — the local
poller would then 401-spam against /activity until the operator manually
regenerated from the Tokens tab.

The auto-restart path (runRestartCycle, line 436) already short-circuits
runtime=external. This patch mirrors that for the manual handler so the
two paths agree, and surfaces a 200 OK with a clear message so the
canvas can tell the operator the fix is on their side rather than
silently no-op'ing.

Test coverage: TestRestartHandler_ExternalRuntimeNoOps asserts the
short-circuit fires *before* any DB write or provision call. sqlmock's
"unexpected query" failure mode would catch a regression that
re-introduced the token revoke or the status=provisioning UPDATE.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 15:08:48 -07:00
Hongming Wang
b5c7b349d8
Merge pull request #2408 from Molecule-AI/auto/fix-healthsweep-gate-saas
fix(boot): always start health-sweep goroutine — SaaS tenants need it for external-runtime liveness
2026-04-30 20:12:28 +00:00
Hongming Wang
8516a8f9c6 fix(tenant-guard): allowlist /buildinfo so redeploy verifier can reach it
The /buildinfo route added in #2398 to verify each tenant runs the
published SHA was 404'd by TenantGuard on every production tenant —
the allowlist had /health, /metrics, /registry/register,
/registry/heartbeat, but not /buildinfo. The redeploy workflows
curl /buildinfo from a CI runner with no X-Molecule-Org-Id header,
TenantGuard 404'd them, gin's NoRoute proxied to canvas, canvas
returned its HTML 404 page, jq read empty git_sha, and the verifier
silently soft-warned every tenant as "unreachable" — which the
workflow doesn't fail on.

Confirmed externally:
  curl https://hongmingwang.moleculesai.app/buildinfo
  → HTTP 404 + Content-Type: text/html (Next.js "404: This page
    could not be found.") even though /health on the same host
    returns {"status":"ok"} from gin.

The buildinfo package's own doc already declares /buildinfo public
by design ("Public is intentional: it's a build identifier, not
operational state. The same string is already published as
org.opencontainers.image.revision on the container image, so no new
info is exposed.") — the allowlist just missed it.

Pin the alignment in tenant_guard_test.go:
TestTenantGuard_AllowlistBypassesCheck now asserts /buildinfo
returns 200 without an org header alongside /health and /metrics,
so a future allowlist edit can't silently regress the verifier
again.

Closes the silent-success failure mode: stale tenants will now
show up as STALE (hard-fail) rather than UNREACHABLE (soft-warn).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 12:54:51 -07:00
Hongming Wang
235aca9908 fix(boot): always start health-sweep goroutine — SaaS tenants need it for external-runtime liveness
Pre-fix, cmd/server/main.go gated the entire health-sweep goroutine on
`prov != nil`. On SaaS tenants (`MOLECULE_ORG_ID` set) the local Docker
provisioner is never initialized — only `cpProv`. So the goroutine
never started, and `sweepStaleRemoteWorkspaces` (which transitions
runtime='external' workspaces from 'online' to 'awaiting_agent' when
their last_heartbeat_at goes stale) never ran.

Net effect on production: every external-runtime workspace on SaaS
that lost its agent stayed 'online' indefinitely instead of falling
back to 'awaiting_agent' (re-registrable). The drift gate (#2388)
caught the migration side and #2382 fixed the SQL writes, but this
orchestration-side gate slipped through both because there was no
SaaS-mode E2E coverage on the heartbeat-loss → awaiting_agent
transition.

Caught by #2392 (live staging external-runtime regression E2E)
failing at step 6 — 180s with no heartbeat, expected
status=awaiting_agent, got online.

Fix: drop the `if prov != nil` gate. `StartHealthSweep` already
handles nil checker correctly (healthsweep.go:50-71): the Docker
sweep is gated inside the loop, the remote sweep always runs. Test
coverage already exists at TestStartHealthSweep_NilCheckerRunsRemoteSweep.

After this lands and tenants redeploy, #2392 step 6 passes and the
regression coverage closes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 12:05:40 -07:00
Hongming Wang
c06e2fec5e
Merge pull request #2396 from Molecule-AI/auto/typed-workspace-status
refactor(workspace-status): typed constants + AST-based drift gate
2026-04-30 18:03:30 +00:00
Hongming Wang
998e13c4bd feat(deploy): verify each tenant /buildinfo matches published SHA after redeploy
Closes the gap that let issue #2395 ship: redeploy-fleet workflows reported
ssm_status=Success based on SSM RPC return code alone, while EC2 tenants
silently kept serving the previous :latest digest because docker compose up
without an explicit pull is a no-op when the local tag already exists.

Wire:
  - new buildinfo package exposes GitSHA, set at link time via -ldflags from
    the GIT_SHA build-arg (default "dev" so test runs without ldflags fail
    closed against an unset deploy)
  - router exposes GET /buildinfo returning {git_sha} — public, no auth,
    cheap enough to curl from CI for every tenant
  - both Dockerfiles thread GIT_SHA into the Go build
  - publish-workspace-server-image.yml passes GIT_SHA=github.sha for both
    images
  - redeploy-tenants-on-main.yml + redeploy-tenants-on-staging.yml curl each
    tenant's /buildinfo after the redeploy SSM RPC and fail the workflow on
    digest mismatch; staging treats both :latest and :staging-latest as
    moving tags; verification is skipped only when an operator pinned a
    specific tag via workflow_dispatch

Tests:
  - TestGitSHA_DefaultDevSentinel pins the dev default
  - TestBuildInfoEndpoint_ReturnsGitSHA pins the wire shape that the
    workflow's jq lookup depends on

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:08 -07:00
Hongming Wang
188db33794 refactor(workspace-status): catch missed literal in workspace_bootstrap.go + add literal-drift gate
Two related fixes after self-review of #2396:

1. workspace_bootstrap.go:62 — `SET status = 'failed'` was missed in the
   initial sweep. Now parameterized as $3 with models.StatusFailed.
   Test fixed with the additional WithArgs sentinel.

2. Drift gate now scans production .go AST for hard-coded
   `UPDATE workspaces … SET status = '<literal>'` and fails with
   file:line. This catches the kind of miss the first commit just
   fixed — the original migration-vs-codebase axis only verified
   AllWorkspaceStatuses ⊆ enum, not "no raw literals in writes."

Verified the gate fires: dropped a synthetic 'failed' literal into
internal/handlers/_drift_sanity.go and confirmed the gate flagged
"internal/handlers/_drift_sanity.go:6 → SET status = 'failed'".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:51:01 -07:00
Hongming Wang
fdf1b5d76a refactor(workspace-status): typed constants + AST-based drift gate
Eliminate raw 'awaiting_agent'/'hibernating'/'failed'/etc string literals
from production status writes. Adds models.WorkspaceStatus typed alias and
models.AllWorkspaceStatuses canonical slice; every UPDATE workspaces SET
status = ... now passes a parameterized $N typed value rather than a
hard-coded SQL literal.

Defense-in-depth follow-up to migration 046 (#2388): the Postgres enum
type was missing 'awaiting_agent' + 'hibernating' for ~5 days because
sqlmock regex matching cannot enforce live enum constraints. The drift
gate is now a proper Go AST + SQL parser (no regex), asserting the
codebase ⊆ migration enum and every const appears in the canonical
slice. With status as a parameterized typed value, future enum mismatches
fail at the SQL layer in tests, not silently in prod.

Test coverage: full suite passes with -race; drift gate green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:41:41 -07:00
Hongming Wang
e081c8335f refactor(handlers): widen WorkspaceHandler.provisioner to LocalProvisionerAPI interface (#2369)
Symmetric with the existing CPProvisionerAPI interface. Closes the
asymmetry where the SaaS provisioner field was an interface (mockable
in tests) but the Docker provisioner field was a concrete pointer
(not).

## Changes

- New ``provisioner.LocalProvisionerAPI`` interface — the 7 methods
  WorkspaceHandler / TeamHandler call on h.provisioner today: Start,
  Stop, IsRunning, ExecRead, RemoveVolume, VolumeHasFile,
  WriteAuthTokenToVolume. Compile-time assertion confirms *Provisioner
  satisfies it. Mirror of cp_provisioner.go's CPProvisionerAPI block.
- ``WorkspaceHandler.provisioner`` and ``TeamHandler.provisioner``
  re-typed from ``*provisioner.Provisioner`` to
  ``provisioner.LocalProvisionerAPI``. Constructor parameter type is
  unchanged — the assignment widens to the interface, so the 200+
  callers of ``NewWorkspaceHandler`` / ``NewTeamHandler`` are
  unaffected.
- Constructors gain a ``if p != nil`` guard before assigning to the
  interface field. Without this, ``NewWorkspaceHandler(..., nil, ...)``
  (the test fixture pattern across 200+ tests) yields a typed-nil
  interface value where ``h.provisioner != nil`` evaluates *true*,
  and the SaaS-vs-Docker fork incorrectly routes nil-fixture tests
  into the Docker code path. Documented inline with reference to
  the Go FAQ.
- Hardened the 5 Provisioner methods that lacked nil-receiver guards
  (Start, ExecRead, WriteAuthTokenToVolume, RemoveVolume,
  VolumeHasFile) — return ErrNoBackend on nil receiver instead of
  panicking on p.cli dereference. Symmetric with Stop/IsRunning
  (already hardened in #1813). Defensive cleanup so a future caller
  that bypasses the constructor's nil-elision still degrades
  cleanly.
- Extended TestZeroValuedBackends_NoPanic with 5 new sub-tests
  covering the newly-hardened nil-receiver paths. Defense-in-depth:
  a future refactor that drops one of the nil-checks fails red here
  before reaching production.

## Why now

- Provisioner orchestration has been touched in #2366 / #2368 — the
  interface symmetry is the natural follow-up captured in #2369.
- Future work (CP fleet redeploy endpoint, multi-backend
  provisioners) wants this in place. Memory note
  ``project_provisioner_abstraction.md`` calls out pluggable
  backends as a north-star.
- Memory note ``feedback_long_term_robust_automated.md`` —
  compile-time gates + ErrNoBackend symmetry > runtime panics.

## Verification

- ``go build ./...`` clean.
- ``go test ./...`` clean — 1300+ tests pass, including the
  previously-flaky Create-with-nil-provisioner paths that now
  exercise the constructor's nil-elision correctly.
- ``go test ./internal/provisioner/ -run TestZeroValuedBackends_NoPanic
  -v`` — all 11 nil-receiver subtests green (was 6, +5 for the
  newly-hardened methods).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 09:18:16 -07:00