Commit Graph

21 Commits

Author SHA1 Message Date
eaf7dbb7c4 fix(handlers): auto-restart workspace after file write/delete/replace
Some checks failed
sop-tier-check / tier-check (pull_request) Failing after 5s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 5s
PUT /workspaces/:id/files and DELETE /workspaces/:id/files updated the
config volume but never restarted the container, so the running agent
continued serving stale file content from its in-memory cache. The
SecretsHandler already had this pattern (issue #15); TemplatesHandler
was missing it.

Fix: after every successful write/delete in WriteFile, DeleteFile, and
ReplaceFiles, call h.wh.RestartByID(workspaceID) asynchronously, guarded
by h.wh != nil (nil-tolerant for callers that only use read-only
surfaces). The RestartByID coalescing gate prevents thundering-herd on
concurrent requests.

Fixes #151.
Fixes #87 (duplicate effort closed — core-be also filed #183).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-09 22:43:27 +00:00
Hongming Wang
f39b595a9c fix(workspace files API): EIC parity for ListFiles + DeleteFile (closes #2999 PR-A)
## User-visible bug

Canvas Files tab returns "0 files / No config files yet" for every
SaaS workspace, every root (/configs, /home, /workspace, /plugins).
Reported by user (canvas screenshot, hongming.moleculesai.app,
Hongming Personal Brand Agent — claude-code, T4, online).

## Root cause

`ListFiles` (templates.go) was missing the SSH-via-EIC branch that
ReadFile (PR #2785) and WriteFile (PR #1702) already have. On SaaS,
dockerCli is nil → findContainer returns "" → falls through to
host-side resolveTemplateDir which only matches baked-in template
names. For a user-named workspace it matches nothing, so the handler
silently returns []fileEntry{}.

DeleteFile had the same gap — right-click delete (introduced in PR-C
of this issue) would silently no-op once #1 was fixed.

## Fix

1. Extracted shared EIC plumbing into `withEICTunnel` (closure-based,
   single SSOT for keypair → key push → tunnel → port-wait → cleanup).
   Refactored writeFileViaEIC + readFileViaEIC to use it. Added
   listFilesViaEIC + deleteFileViaEIC on the same scaffold. The
   `LogLevel=ERROR` shim from PR #2822 now lives in one
   `eicSSHSession.sshArgs()` helper instead of being duplicated per
   helper — the next time we need to tweak ssh options, one place.

2. Factored remote shell strings into pure functions
   (buildInstallShell / buildCatShell / buildRmShell / buildFindShell
   + parseFindOutput) so the wire shape can be pinned without booting
   a real EIC tunnel.

3. Refactored `resolveWorkspaceFilePath(runtime, root, relPath)` to
   honor `?root=`. New rule: `/configs` (or empty / unrecognized) →
   runtime managed-config dir via workspaceFilePathPrefix (preserves
   the v1 ReadFile/WriteFile behaviour where canvas's Config tab
   GETs/PUTs config.yaml without specifying a root and lands in the
   right per-runtime dir); `/home`, `/workspace`, `/plugins` →
   literal absolute path on the EC2 host. List/Read/Write/Delete now
   agree on what file a tree row points to — pre-fix List would say
   "/home contents" but Read/Write would route to /configs.

4. ListFiles + DeleteFile dispatch on instance_id != "" → EIC helper.
   Errors from the EIC path produce 500 (not silent fall-through to
   local-Docker, which would mask the failure as "0 files" — the
   exact user-visible symptom).

5. Added ?root= validation gate to WriteFile + DeleteFile so an
   out-of-allowlist root is rejected before the resolver runs.

## Test coverage

- TestResolveWorkspaceFilePath_RuntimeIndirection — pins the
  /configs → runtime prefix translation per-runtime (hermes,
  claude-code, langgraph, external, unknown). Catches the regression
  where a future edit accidentally drops the runtime indirection.

- TestResolveWorkspaceFilePath_LiteralRoots — pins /home,
  /workspace, /plugins as literal pass-through regardless of
  runtime. Catches the symmetric regression where the literal roots
  start getting rewritten to the runtime prefix (which would mean
  the FilesTab "/home" selector silently routes to /configs on
  hermes).

- TestResolveWorkspaceRootPath — directory-only translation used
  by listFilesViaEIC, same indirection rules.

- TestSSHArgs_HardenedFlags — pins the centralised ssh option set
  (LogLevel=ERROR + hardening). Catches drift in the
  one-place-where-ssh-flags-live.

- TestEicSSHSessionSingleSourceForSSHFlags — behaviour-based AST
  gate (per memory). Counts s.sshArgs() callers (must be ≥4 —
  list/read/write/delete) and asserts LogLevel=ERROR appears
  exactly once in the source. Fires if anyone copy-pastes a raw
  ssh args slice instead of going through the helper.

- TestBuildInstallShell / TestBuildCatShell / TestBuildRmShell /
  TestBuildFindShell — pure-function tests pinning the remote
  command shape. Catches regression like "rm -f silently becomes
  rm -rf" or "find loses node_modules pruning" without needing a
  real EC2.

- TestBuildFindShell_DepthForwarding — catches a regression where
  the helper hard-codes a depth instead of using the caller's value.

- TestParseFindOutput / TestParseFindOutput_EmptyInput — pin the
  TYPE|SIZE|REL parser. Empty-input case explicitly returns []
  not nil so the JSON wire shape stays a list.

- TestListFiles_EICDispatch_Success / Error — sqlmock-driven
  handler test. Verifies instance_id != "" routes to listFilesViaEIC
  and surfaces errors as 500 (does NOT silently fall through to
  local-Docker, which is the exact regression-mode of the original
  bug).

- TestListFiles_EICBranch_NotTakenForSelfHosted — back-compat
  guard: instance_id == "" must NOT enter the EIC branch (would
  break self-hosted operators).

- TestDeleteFile_EICDispatch_Success / Error — same shape for
  DeleteFile.

- TestListFiles_RootValidation / TestDeleteFile_RootValidation —
  ?root=/etc must 400 before any DB query or EIC call.

## Verification

- `go build ./...` clean
- `go test ./...` clean (full workspace-server suite)
- Will be live-verified against staging on hongming.moleculesai.app
  after merge: open Files tab → expect populated /home + /configs +
  /workspace listings (not "0 files"); right-click delete on
  /configs/old.yaml → expect file removed on the EC2 host.

## Three weakest spots (hostile self-review)

1. The LogLevel=ERROR drift gate counts source occurrences. A
   future refactor that intentionally moves the literal somewhere
   else (e.g. into a constant) would trigger a false positive. The
   gate's failure message points to the load-bearing constraint
   (must appear in sshArgs); operator can adjust.

2. `eicFileWriteTimeout` constant kept as an alias for back-compat
   with prior tests. Documented as intentional + safe to remove on
   the next pass.

3. The resolver tests pin the runtime → prefix map values
   (`/home/ubuntu/.hermes`, `/configs`, etc.). A future runtime
   addition that ships a new prefix needs the test updated. This
   is intentional — silent prefix changes orphan saved files, so a
   test failure on map edit IS the right signal.

## Follow-up (RFC #2312 subtask 2)

Long-term the right fix is to drop EIC entirely and HTTP-forward to
the workspace's own URL (RFC #2312). That's a substantially larger
refactor across 5 surfaces (chat upload, files, templates, plugins,
terminal) and out of scope for this bug-fix PR. Tracked separately
under that RFC.

Refs #2999.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 20:18:05 -07:00
Hongming Wang
9f551319d2 feat(saas): close 4th default-tier site + lift org_import asymmetry + tests (#2910)
Multi-model retrospective review of #2901 found three Critical gaps:

1. (#2910 PR-B) template_import.go:79 wrote `tier: 3` hardcoded into
   generated config.yaml. On SaaS this defeated the T4 default at the
   create-handler layer — a config-less template import landed at T3
   regardless of POST /workspaces' computed default. The 4th
   default-tier site #2901 missed.

2. (#2910 PR-A) #2901 claimed `go test ... all green` but added zero
   new tests. Existing structural-pin tests caught dispatch-layer
   drift but said nothing about tier-default drift. A future refactor
   that flips DefaultTier() to always return 3 would ship green.

3. (#2910 PR-E) org_import.go fallback returned T2 on self-hosted
   while workspace.go returned T3. Internally consistent ("bulk vs
   interactive defaults") but undocumented same-name-different-value
   drift.

Fix:

- TemplatesHandler.NewTemplatesHandler now takes `wh *WorkspaceHandler`
  (nil-tolerant for read-only callers). Import + ReplaceFiles compute
  tier via h.wh.DefaultTier() and pass it to generateDefaultConfig.
  generateDefaultConfig gets a `tier int` parameter (bounds-checked,
  invalid input falls back to T3).

- org_import.go fallback lifts to h.workspace.DefaultTier() — single
  source of truth shared with Create + Templates so a future
  tier-default change sweeps every entry point at once.

- New saas_default_tier_test.go pinning:
    TestIsSaaS_TrueWhenCPProvWired
    TestIsSaaS_FalseWhenOnlyDocker
    TestDefaultTier_SaaS_IsT4
    TestDefaultTier_SelfHosted_IsT3
    TestGenerateDefaultConfig_RespectsTierParam
    TestGenerateDefaultConfig_SelfHostedTierT3
    TestGenerateDefaultConfig_OutOfRangeFallsBackToT3

- Existing template_import_test.go tests + chat_files_test.go +
  security_regression_test.go updated to thread the new tier param /
  wh constructor arg through their NewTemplatesHandler calls. Their
  pre-#2910 assertion of `tier: 3` is preserved (now passes because
  the test caller passes `3` explicitly), so no regression.

go vet ./... clean. go test ./internal/handlers/ -count 1 — all
green (4.2s).

Deferred to separate follow-ups (per #2910 plan):
- PR-C: MOLECULE_DEPLOYMENT_MODE explicit deployment-mode signal
  (closes the IsSaaS()=cpProv!=nil structural fragility)
- PR-D: Host iptables IMDS block + IMDSv2 hop-limit (paired with
  molecule-controlplane EC2-IAM-scope audit)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 11:38:22 -07:00
Hongming Wang
2f7beb9bce feat: drop shared_context — use memory v2 team namespace instead
Parent → child knowledge sharing previously lived behind a `shared_context`
list in config.yaml: at boot, every child workspace HTTP-fetched its parent's
listed files via GET /workspaces/:id/shared-context and prepended them as
a "## Parent Context" block. That paid the full transfer cost on every
boot regardless of whether the agent needed it, single-parent SPOF, no team
or org scope, and broken if the parent was unreachable.

Replace with memory v2's team:<id> namespace: agents call recall_memory
on demand. For large blob-shaped artefacts see RFC #2789 (platform-owned
shared file storage).

Removed:
- workspace/coordinator.py: get_parent_context()
- workspace/prompt.py: parent_context arg + injection block
- workspace/adapter_base.py: import + call + arg pass
- workspace/config.py: shared_context field + parser entry
- workspace-server/internal/handlers/templates.go: SharedContext handler
- workspace-server/internal/router/router.go: GET /shared-context route
- canvas/src/components/tabs/ConfigTab.tsx: Shared Context tag input
- canvas/src/components/tabs/config/form-inputs.tsx: schema field + default
- canvas/src/components/tabs/config/yaml-utils.ts: serializer entry
- 6 tests pinning the removed behavior; 5 doc references

Added regression gates so any reintroduction is loud:
- workspace/tests/test_prompt.py: build_system_prompt must NOT emit
  "## Parent Context"
- workspace/tests/test_config.py: legacy YAML key loads cleanly but
  shared_context attr must NOT exist on WorkspaceConfig
- tests/e2e/test_staging_full_saas.sh §9d: GET /shared-context must NOT
  return 200 against a live tenant

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 16:30:26 -07:00
Hongming Wang
9c7b34cb7f fix(workspace files API): GET ReadFile via SSH-EIC for SaaS workspaces
Pre-fix WriteFile (templates.go:436) had an `instance_id != ""` branch
that dispatched to writeFileViaEIC (SSH through EC2 Instance Connect),
but ReadFile (templates.go:362) skipped that branch entirely. ReadFile
always tried `findContainer` (which only works for local-Docker
workspaces, not SaaS EC2-per-workspace ones) and fell through to
`resolveTemplateDir` (which returns the seed template, not the
persisted workspace state).

Net effect on production: every Canvas Config tab open against a
SaaS workspace returned 404 "No config.yaml found" because GET
couldn't see what PUT had written. Visible to users after PR #2781
("show-misconfigured-state") surfaced the 404 as an error UX.

Caught by the synth-E2E 7c gate's GET-back assertion, but
misdiagnosed as a "test bug" and the GET assertion was dropped in
PR #2783 (rather than fixed at the source). This PR closes the loop:

1. New `readFileViaEIC` helper in template_files_eic.go that mirrors
   writeFileViaEIC's SSH-via-EIC dance and runs `sudo -n cat <path>`.
   Returns os.ErrNotExist on missing file (cat exits 1 with empty
   stdout under `2>/dev/null`) so the handler maps it cleanly to 404.

2. ReadFile dispatch now mirrors WriteFile's: when `instance_id` is
   non-empty, use readFileViaEIC; otherwise fall through to the
   local-Docker / template-dir path.

3. ReadFile's DB query expanded to also select instance_id + runtime
   (was just name). Three sqlmock-based tests updated to match the
   new column shape; the existing local-Docker fallback path stays
   green by passing instance_id="" in the mock rows.

Follow-up (separate PR): the synth-E2E 7c gate should restore the
GET-back marker assertion now that the read/write paths are unified.
That'll also catch any future Files API regression in the round-trip.
This PR doesn't touch the gate to keep the scope tight.

Verification:
- go build ./... clean
- full handlers test suite green (0.4s for ReadFile subset; 5.8s
  full)
- The 3 ReadFile sqlmock tests still cover the local-Docker fallback
  (instance_id=""); SaaS EIC dispatch is covered by the upcoming
  re-enabled synth-E2E 7c GET assertion (deferred to follow-up)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 16:02:26 -07:00
Hongming Wang
586d567a48 fix(workspace-server): log silent yaml.Unmarshal + coexistence test (#256, #257)
Two follow-ups from PR #2543's multi-model code review (audit #253).

1. **Log silent yaml.Unmarshal errors (#256).** When a malformed
   config.yaml made `yaml.Unmarshal(data, &raw)` fail, the affected
   template silently disappeared from /templates with no trace —
   operator could not distinguish "excluded due to parse error" from
   "never existed." That widened a real foot-gun once PR #2543 added
   structured top-level `providers:` (a string-shaped top-level
   `providers:` decoded into `[]providerRegistryEntry` would fail and
   drop the whole entry). Now logs `templates list: skip <id>:
   yaml.Unmarshal: <err>` and continues with the rest.

2. **Coexistence test (#257 part 1).** PR #2543 covered the structured
   registry and slug list in isolation. claude-code-default in
   production ships BOTH: top-level `providers:` (structured registry,
   2 entries) AND `runtime_config.providers:` (slug list, 3 entries).
   New `TestTemplatesList_BothProviderShapesCoexist` mirrors that
   layout, asserts both shapes surface independently with no
   cross-talk (e.g. a slug-only entry like `anthropic-api` does NOT
   synthesize a stub in the structured registry), and pins the JSON
   wire-shape for both fields side-by-side.

3. **`base_url: null` decoding assertion (#257 part 3).** Adds an
   explicit `got[0].BaseURL == ""` check in the existing
   `TestTemplatesList_SurfacesProviderRegistry` test, locking in the
   `string` (not `*string`) type. A future change to `*string` would
   surface as JSON `null` and break canvas's "no base_url = use
   provider defaults" branch — caught loudly by this assertion.

Tests: 11 TestTemplatesList_* now green, including the new
MalformedYAMLLogsAndSkips and BothProviderShapesCoexist.

The remaining piece of #257 — renaming `Providers []string` JSON tag
to `provider_slugs` — requires coordinated canvas updates across 4
files and is intentionally deferred to a separate PR (no canvas
churn while user is mid-test).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 23:01:59 -07:00
Hongming Wang
992a0c6860 fix(workspace-server): surface structured provider registry on /templates (#235)
Closes the contract drift caught by audit #253. Task #235 ("Server:
enrich /templates payload with structured providers") was marked
completed, but `templates.go` only ever emitted the
`runtime_config.providers []string` slug list — the structured
ProviderEntry shape (auth_env, model_prefixes, model_aliases, base_url)
the description promised was never plumbed.

Templates ship the structured registry under a TOP-LEVEL `providers:`
block (claude-code carries 6+ entries today; hermes still uses the
slug list). Both shapes coexist and are independent — surface them as
two separate fields:

  - `providers`           → existing []string slug list (unchanged)
  - `provider_registry`   → new []providerRegistryEntry (structured)

The canvas's ProviderModelSelector comment block already anticipates
this ("Templates that ship explicit vendor metadata (future) should
override the heuristic."). With this field in place, the canvas can
optionally drop its prefix-inference fallback for templates that ship
an explicit registry — separate PR. Today's change is purely additive
on the server side; no canvas change required.

Tests:
- TestTemplatesList_SurfacesProviderRegistry: order preservation +
  field plumbing on a claude-code-shaped fixture (oauth + minimax)
  + JSON wire-shape gate to catch struct-tag renames.
- TestTemplatesList_OmitsProviderRegistryWhenAbsent: omitempty so
  legacy templates (hermes, langgraph) don't emit `null` and break
  Array.isArray on the canvas side.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 22:42:42 -07:00
Hongming Wang
517bd0efc5 feat(canvas+workspace-server): data-driven Provider dropdown (#199)
Option B PR-5. Canvas Config tab now exposes a Provider override input
that's adapter-driven from each runtime's template — no hardcoded
provider list in the canvas. PUT /workspaces/:id/provider on Save
when dirty; auto-restart suppression to avoid double-restart with
the model handler's own restart.

The dropdown's suggestion list comes from /templates →
runtime_config.providers (the field added in
molecule-ai-workspace-template-hermes PR #31). For templates that
haven't migrated to the explicit providers list yet, suggestions
derive from model[].id slug prefixes — still adapter-driven, just
inferred. This keeps existing templates working while platform team
migrates them one at a time.

workspace-server changes:
- Add Providers []string field to templateSummary JSON
- Parse runtime_config.providers in /templates handler
- 2 new tests pin the surfacing + omitempty behavior

canvas changes:
- Remove hardcoded PROVIDER_SUGGESTIONS constant
- Add provider/originalProvider state + PUT-on-save logic
- Add deriveProvidersFromModels() fallback helper
- Wire RuntimeOption.providers from /templates response
- 8 new tests pin the behavior end-to-end

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 11:19:17 -07:00
rabbitblood
f1ad012024 refactor(handlers): apply simplify findings on PR #2094
- Extract walkTemplateConfigs(configsDir, fn) shared helper. Both
  templates.List and loadRuntimeProvisionTimeouts walked configsDir
  + parsed config.yaml — same boilerplate twice. Now centralised so
  a future template-discovery rule (subdir naming, README sentinel,
  etc.) lands in one place.
- templates.List uses the walker — net -10 lines.
- loadRuntimeProvisionTimeouts uses the walker — net -10 lines.
- Document runtimeProvisionTimeoutsCache as 'NOT SAFE for
  package-level reuse' so a future change doesn't accidentally
  promote it to a singleton (sync.Once can't be reset → tests
  would lock out other fixtures).

Skipped (review finding): atomic.Pointer[map[string]int] for
future hot-reload. The doc comment already documents the
limitation; YAGNI-promoting the primitive now would buy a
not-yet-built feature at the cost of more code today.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 06:40:15 -07:00
rabbitblood
27396d992c feat(workspace-server): surface provision_timeout_ms in workspace API (#2054 phase 2)
Phase 2 of #2054 — workspace-server reads runtime-level
provision_timeout_seconds from template config.yaml manifests and
includes provision_timeout_ms in the workspace List/Get response.
Phase 1 (canvas, #2092) already plumbs the field through socket →
node-data → ProvisioningTimeout's resolver, so the moment a
template declares the field the per-runtime banner threshold
adjusts without a canvas release.

Implementation:

- templates.go: parse runtime_config.provision_timeout_seconds in
  the templateSummary marshaller. The /templates API now surfaces
  the field too — useful for ops dashboards and future tooling.
- runtime_provision_timeouts.go (new): loadRuntimeProvisionTimeouts
  scans configsDir, parses every immediate subdir's config.yaml,
  returns runtime → seconds. Multiple templates with the same
  runtime: max wins (so a slow template's threshold doesn't get
  cut by a fast template's). Bad/empty inputs are silently
  skipped — workspace-server starts cleanly with no templates.
- runtimeProvisionTimeoutsCache: sync.Once-backed lazy cache.
  First workspace API request after process start pays the read
  cost (~few KB across ~50 templates); every subsequent request is
  a map lookup. Cache lifetime = process lifetime; invalidates on
  workspace-server restart, which is the normal template-change
  cadence.
- WorkspaceHandler gets a provisionTimeouts field (zero-value struct
  is valid — the cache lazy-inits on first get()).
- addProvisionTimeoutMs decorates the response map with
  provision_timeout_ms (seconds × 1000) when the runtime has a
  declared timeout. Absent = no key in the response, canvas falls
  through to its runtime-profile default. Wired into both List
  (per-row decoration in the loop) and Get.

Tests (5 new in runtime_provision_timeouts_test.go):
- happy path: hermes declares 720, claude-code doesn't, only
  hermes appears in the map
- max-on-duplicate: same runtime in two templates → max wins
- skip-bad-inputs: missing runtime, zero timeout, malformed yaml,
  loose top-level files all silently ignored
- missing-dir: returns empty map, no crash
- cache: lazy-init on first get; subsequent gets hit cache even
  after underlying file changes (sync.Once contract); unknown
  runtime returns zero

Phase 3 (separate template-repo PR): template-hermes config.yaml
declares provision_timeout_seconds: 720 under runtime_config.
canvas RUNTIME_PROFILES.hermes becomes redundant + removable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 06:37:45 -07:00
c5da3f1be9 fix(handlers): CWE-78 — reject absolute paths before strip in DeleteFile; drop null_byte test
- Add filepath.IsAbs guard in DeleteFile BEFORE the leading-slash strip so that
  absolute paths like "/etc/passwd" are rejected with 400 rather than silently
  accepted after the prefix is stripped.
- Remove the null_byte sub-case from TestCWE78_DeleteFile_TraversalVariants —
  httptest.NewRequest panics on \x00 in URLs (URL-layer concern, not handler).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 12:38:28 +00:00
Molecule AI Core Platform Lead
7d837dec74 fix(handlers): CWE-78 hardening for DeleteFile and SharedContext (#2011)
Replace string concatenation with safe exec-form path construction in
two remaining locations in templates.go:

1. DeleteFile (container-running path):
   - Before: `containerPath := "/configs/" + filePath` → `rm -rf containerPath`
   - After:  `rm -f filepath.Join("/configs", filePath)`
   - Also tightens rm flag from -rf to -f (no recursive delete on a file endpoint)

2. SharedContext (container-running path, per-file cat loop):
   - Before: `[]string{"cat", "/configs/" + relPath}`
   - After:  `[]string{"cat", "/configs", relPath}` (separate args, no shell join)

In both cases validateRelPath is already the primary guard (rejects traversal
inputs before reaching exec). filepath.Join / separate args is defence-in-depth
so that a bypass of validateRelPath cannot produce a dangerous concatenated path
in the exec argument list.

ReadFile was already fixed (PR #1885, merged to main at 12:08Z).

Regression tests added:
- TestCWE78_DeleteFile_TraversalVariants: 7 traversal patterns all → 400
- TestCWE78_SharedContext_SkipsTraversalPaths: traversal paths in
  shared_context config are silently skipped, only safe files returned

Fixes: #2011

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 12:29:57 +00:00
Hongming Wang
18ebb1d7bf fix(server): remove 60s A2A client timeout + correct file-read cat args
Two bugs surfaced while testing Claude Code + OAuth deploys:

1. A2A proxy: a2aClient had a 60s Client.Timeout "safety net" that
   defeated the per-request context deadlines the code otherwise sets
   (canvas = 5m, agent-to-agent = 30m). Claude Code's first-token cold
   start over OAuth takes 30-60s, so every first "hi" into a fresh
   claude-code workspace returned 503 at exactly the 1m mark. Removed
   the Client.Timeout — the context deadline now governs as documented
   in the adjacent comment.

2. Files tab: ReadFile ran `cat <rootPath> <filePath>` as two args to
   cat. `cat /home agent/turtle_draw.py` tries to read the rootPath
   directory (errors "Is a directory") and then resolves the filePath
   relative to the container cwd, which is not guaranteed to equal
   rootPath. Result: the file-content pane stayed blank even though
   the file listed fine. Join into a single path before exec.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 17:25:53 -07:00
Hongming Wang
dc50a1c775 refactor(canvas): data-drive provider picker from template config.yaml
The MissingKeysModal's provider list was hardcoded in deploy-preflight.ts
as RUNTIME_PROVIDERS — a per-runtime map that duplicated what each
template repo already declares in its config.yaml. That meant adding a
new provider required changes in two places, and the UI could drift out
of sync with the actual template (e.g. when a template adds a MiniMax or
Kimi model, the picker wouldn't know).

The single source of truth for "which env vars does this workspace need"
is each template's config.yaml:

  * `runtime_config.models[].required_env` — per-model key list
  * `runtime_config.required_env`          — runtime-level AND list

Go /templates already returned `models`. This change:

  * Adds `required_env` alongside `models` on templateSummary so the
    canvas receives the full picture.
  * Rewrites deploy-preflight.ts to derive ProviderChoice[] from a
    template object via `providersFromTemplate(template)`:
      - groups `models[]` by unique required_env tuple
      - falls back to runtime_config.required_env when models is empty
      - decorates labels with model counts (e.g. "OpenRouter (14 models)")
  * `checkDeploySecrets(template, workspaceId?)` now takes a template
    object instead of a runtime string. Any-provider satisfaction still
    short-circuits preflight to ok=true.
  * MissingKeysModal receives `providers` directly; no more lookups.
  * TemplatePalette threads `template.models` + `template.required_env`
    into the preflight.

Side effects:
  * Claude Code's dual-auth (OAuth token OR Anthropic API key) now
    surfaces as two picker options — its config.yaml already declared
    both, the UI just wasn't reading them.
  * Hermes picker now shows 8 provider options (Nous, OpenRouter,
    Anthropic, Gemini, DeepSeek, GLM, Kimi, Kilocode) instead of the
    hand-picked 3, matching its 35-model reality.

Removed the legacy RUNTIME_PROVIDERS / RUNTIME_REQUIRED_KEYS /
getRequiredKeys / findMissingKeys exports; MissingKeysModal.test.tsx
deleted (its coverage is subsumed by the new template-driven
deploy-preflight.test.ts). 58 modal-adjacent tests pass; full canvas
suite 919 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 17:07:15 -07:00
e12d8d12d3 fix(security): P0 — F1085/KI-005/CWE-78 security fixes rebased clean onto staging
Supersedes PRs #1882 + #1883 (both had merge conflicts / missing callerID decl).
Applied directly onto current staging HEAD (26c4565).

Changes:
- terminal.go: upgrade KI-005 guard ValidateAnyToken → ValidateToken (GH#756/#1609)
  Binds bearer token to claimed X-Workspace-ID; prevents cross-workspace terminal forge.
  Fixes missing `callerID` declaration that broke compilation in PR #1882.
- ssrf.go: add ssrfCheckEnabled flag + setSSRFCheckForTest helper for test isolation
- ssrf.go validateRelPath: harden to reject empty/"." paths; check both raw+cleaned for ..
- templates.go: ReadFile — exec form cat ["cat", rootPath, filePath] (was shell concat)
- orgtoken/tokens_test.go: fix regex (remove optional LIMIT $1 group)
- wsauth_middleware_test.go: add deprecated orgTokenOrgIDQuery const; update comments
- wsauth_middleware_org_id_test.go: use real org_id UUID in DBRowScanError test row

Security classification:
  F1085 (CWE-78) path traversal + exec form — P0 Fixed
  KI-005 terminal auth bypass (ValidateToken upgrade) — P0 Fixed
  CWE-22 SSRF test isolation — P0 Fixed

Co-Authored-By: Molecule AI Core-BE <core-be@agents.moleculesai.app>
Co-Authored-By: Core Platform Lead <core-platform@agents.moleculesai.app>
2026-04-23 20:52:49 +00:00
Hongming Wang
03741d1110 feat(files-api): SSH-backed write for SaaS workspaces (fixes 500 docker not available)
Symptom (prod, hongmingwang tenant, 2026-04-22):
  PUT /workspaces/:id/files/config.yaml → 500
  {"error":"failed to write file: docker not available"}

Root cause: WriteFile + ReplaceFiles always reached for the tenant's
Docker client, but SaaS workspaces run as EC2 VMs (no Docker on the
tenant to cp into). There was no SaaS code path, so Save/Save&Restart
in the Config tab silently 500'd for every SaaS user.

Fix: add writeFileViaEIC — same ephemeral-keypair + EIC-tunnel dance
that the Terminal tab already uses (terminal.go). Flow:

  1. ssh-keygen ephemeral ed25519 pair
  2. aws ec2-instance-connect send-ssh-public-key  (60s validity)
  3. aws ec2-instance-connect open-tunnel          (TLS → :22)
  4. ssh ... "install -D -m 0644 /dev/stdin <abs path>"
     install -D creates missing parent dirs atomically
  5. Kill tunnel + wipe keydir

Runtime → base-path map (new table workspaceFilePathPrefix):
  hermes     → /home/ubuntu/.hermes
  langgraph  → /opt/configs
  external   → /opt/configs
  unknown    → /opt/configs

Both WriteFile (single file) and ReplaceFiles (bulk) detect
`workspaces.instance_id != ''` and route to EIC instead of Docker.
Local/self-hosted Docker path is unchanged.

Security: the only variable piece in the remote ssh command is the
absolute path, which is built via map lookup + filepath.Clean so
traversal is blocked. shellQuote() wraps it as defence-in-depth.
validateRelPath rejects absolute paths and surviving `..` segments
up-front; tests assert traversal rejection.

Follow-ups tracked separately:
  - Reload hook after save (hermes gateway restart via SSH)
  - Per-tunnel batching for ReplaceFiles with many files
  - Runtime-specific base paths should be declared in the runtime
    manifest, not hardcoded in the handler

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 18:27:12 -07:00
Hongming Wang
f6e6a64ba9 fix(canvas): forward-port dynamic runtime dropdown from staging (PR #1526)
PR #1526 shipped the /templates registry + canvas dynamic Runtime /
Model / Required-Env fields on 2026-04-22 — but merged into the
staging branch, not main. The staging→main promotion PR #1496 has
been open unmerged for a while with 1172 commits divergence, so
prod (which builds from main) still carries the old hardcoded
dropdown.

Symptom seen on hongmingwang.moleculesai.app today:

- New Hermes Agent workspace (template declares runtime: hermes) loads
  Config tab → Runtime dropdown shows "LangGraph (default)" because
  there's no <option value="hermes"> in the hardcoded list; it falls
  back to empty-value silently.
- Model field is a plain TextInput with static placeholder
  "e.g. anthropic:claude-sonnet-4-6" — should be a combobox populated
  from the selected runtime's models[].
- Required Env Vars is a TagList with static placeholder
  "e.g. CLAUDE_CODE_OAUTH_TOKEN" — should auto-populate from the
  selected model's required_env.
- Net effect: "Save & Deploy" sends empty model + empty env to the
  provisioner → workspace instant-fails.

This PR cherry-picks the exact three files from PR #1526 (#359dc61
on staging) forward to main, without pulling the other 1171
commits:

- canvas/src/components/tabs/ConfigTab.tsx
  - RuntimeOption interface + FALLBACK_RUNTIME_OPTIONS (hermes,
    gemini-cli included)
  - useEffect fetches /templates and populates runtimeOptions
    dynamically
  - dropdown renders from runtimeOptions (no hardcoded list)
  - Model becomes a combobox with datalist of available models
    per selected runtime
  - Required Env Vars auto-populates from the selected model's
    required_env on model change

- workspace-server/internal/handlers/templates.go
  - /templates endpoint returns [{id, name, runtime, models}] with
    per-template models registry (id, name, required_env)

- workspace-server/internal/handlers/templates_test.go
  - Tests for runtime+models parsing and legacy top-level model
    fallback

The canvas Runtime dropdown now resolves "hermes" correctly;
Model dropdown shows the models[] from the hermes template; Env
auto-populates with HERMES_API_KEY (or whichever model selected).

Verified locally:
  - workspace-server builds clean
  - Template handler tests pass: TestTemplatesList_RuntimeAndModelsRegistry,
    TestTemplatesList_LegacyTopLevelModel, TestTemplatesList_NonexistentDir

Follow-up: the staging→main promotion gap (#1496) is the
underlying process issue. Either merge that PR or adopt a policy
of landing fixes directly on main (as several PRs have today).
Files here were chosen minimally to avoid pulling unrelated staging
changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 14:28:38 -07:00
Hongming Wang
73464a21dd
fix(restart): support SaaS control-plane provisioner (unblocks Platform Go build too) (#1512)
Squash-merge fix/restart (PR #1512): remove SSRF helpers from a2a_proxy_helpers.go since ssrf.go on main now owns these functions, resolving duplicate symbol build failures. Author: HongmingWang-Rabbit. Approved by molecule-ai. Mergeable, UNSTABLE (likely due to pending head branch changes).
2026-04-21 22:56:01 +00:00
molecule-ai[bot]
0bd2bf2b7f fix(security): CWE path-injection — resolveInsideRoot for Restart + ReadFile template paths (PR #1261)
workspace_restart.go:127-133 accepted body.Template (attacker-controlled)
via raw filepath.Join(h.configsDir, template), allowing path traversal
(e.g. "../../../etc") to escape configsDir.

Fix: replace raw filepath.Join with resolveInsideRoot, same pattern as
workspace.go:102 (already fixed) and workspace.go:249 (already fixed).
Both the explicit template path and the findTemplateByName fallback are
safe — findTemplateByName returns a directory name from os.ReadDir which
is inherently bounded and cannot contain "/".

On resolve error the template is cleared so findTemplateByName fallback
still fires (preserves existing restart behaviour when template is invalid).

Closes: #1043

Co-authored-by: Molecule AI Core-BE <core-be@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-21 03:38:39 +00:00
molecule-ai[bot]
35ccda1091 fix(security): replace err.Error() with generic messages in handler responses (#1193)
Replace all c.JSON(http.StatusBadRequest, gin.H{"error": err.Error()})
calls across 22 handler files with context-appropriate generic messages
to prevent internal error strings (DB details, validation messages,
file paths) leaking into API responses.

Pattern established:
- ShouldBindJSON failures → "invalid request body" (or "invalid delegation request")
- Validation failures → "invalid workspace ID", "invalid path", etc.
- Server-side errors still logged, only generic message returned to client

References: Security finding from Audit #125 (Stripe key leak via err.Error())

Co-authored-by: Molecule AI Fullstack (floater) <fullstack-floater@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-21 00:56:03 +00:00
Hongming Wang
d8026347e5 chore: open-source restructure — rename dirs, remove internal files, scrub secrets
Renames:
- platform/ → workspace-server/ (Go module path stays as "platform" for
  external dep compat — will update after plugin module republish)
- workspace-template/ → workspace/

Removed (moved to separate repos or deleted):
- PLAN.md — internal roadmap (move to private project board)
- HANDOFF.md, AGENTS.md — one-time internal session docs
- .claude/ — gitignored entirely (local agent config)
- infra/cloudflare-worker/ → Molecule-AI/molecule-tenant-proxy
- org-templates/molecule-dev/ → standalone template repo
- .mcp-eval/ → molecule-mcp-server repo
- test-results/ — ephemeral, gitignored

Security scrubbing:
- Cloudflare account/zone/KV IDs → placeholders
- Real EC2 IPs → <EC2_IP> in all docs
- CF token prefix, Neon project ID, Fly app names → redacted
- Langfuse dev credentials → parameterized
- Personal runner username/machine name → generic

Community files:
- CONTRIBUTING.md — build, test, branch conventions
- CODE_OF_CONDUCT.md — Contributor Covenant 2.1

All Dockerfiles, CI workflows, docker-compose, railway.toml, render.yaml,
README, CLAUDE.md updated for new directory names.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 00:24:44 -07:00