408e308ce5
26 Commits
258c6bea44
feat(workspace-server): PUT /provider endpoint for explicit LLM provider (#196)
Mirror of PUT /model. Stores the provider slug as the LLM_PROVIDER workspace secret so the canvas can update model + provider independently — a user might keep the same model alias and switch providers (route through a different gateway), or vice versa. Forcing both into one endpoint imposes a single Save+Restart per change; two endpoints let canvas update each as the user picks.

Plumbs through the existing chain: secret-load → envVars → CP req.Env → user-data env exports → /configs/config.yaml (after controlplane PR #364 lands the heredoc append).

Tests: 5 new cases mirroring SetModel/GetModel exactly — default empty response, DB error, upsert with restart trigger, empty-clears, invalid-UUID rejection.

Part of: Option B PR-2 (#196) — workspace-server plumbs LLM_PROVIDER.

Stack:
- PR-1 schema (#2441, merged)
- PR-2 (this) ws-server endpoint
- PR-3 (#364, open) CP user-data persistence
- PR-4 (pending) hermes adapter consume
- PR-5 (pending) canvas Provider dropdown
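The upsert / empty-clears / restart-trigger behavior the tests pin can be modeled in a few lines. This is an illustrative Python sketch only (the real handler is Go); `SecretStore` and `set_provider` are hypothetical names, not the actual API.

```python
# Hypothetical model of the PUT /provider handler's core behavior.
# Real code is a Go handler writing to workspace_secrets.
import uuid

class SecretStore:
    """Stands in for the workspace_secrets table + restart plumbing."""
    def __init__(self):
        self.secrets = {}   # (workspace_id, key) -> value
        self.restarts = []  # workspaces restarted after a change

    def set_provider(self, workspace_id: str, provider: str) -> int:
        try:
            uuid.UUID(workspace_id)          # invalid-UUID rejection
        except ValueError:
            return 400
        key = (workspace_id, "LLM_PROVIDER")
        if provider == "":
            self.secrets.pop(key, None)      # empty clears the secret
        else:
            self.secrets[key] = provider     # upsert
        self.restarts.append(workspace_id)   # change triggers a restart
        return 200
```

The sketch shows why two endpoints stay cheap: each call is a single secret upsert plus one restart, with no coupling to the model secret.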
d012a803e4
feat(terminal): add diagnose endpoint for SSH probe stages
GET /workspaces/:id/terminal/diagnose runs the same per-stage pipeline as
/terminal (ssh-keygen → EIC send-key → tunnel → ssh) but non-interactively
and returns JSON. Each stage reports {name, ok, duration_ms, error,
detail}, plus a top-level first_failure naming the broken stage.
Why: when the canvas terminal silently disconnects ("Session ended" with
no error frame — the user-reported failure mode on hongmingwang's hermes
workspace), there is no remote-readable signal of WHICH stage failed.
The ssh client's stderr lives only in the workspace-server's stdout on
the tenant CP EC2 — invisible without shell access. /terminal can't
expose stderr cleanly because it has already upgraded to WebSocket
binary frames by the time ssh runs. /terminal/diagnose stays pure
HTTP/JSON, so the same auth (WorkspaceAuth + ADMIN_TOKEN fallback) gives
operators a one-call probe that splits "IAM broke" (send-ssh-public-key
fails) from "tunnel/SG broke" (wait-for-port fails) from "sshd auth
broke" (ssh-probe gets Permission denied) from "shell broke" (probe
exits non-zero with stderr).
Stages mirrored from handleRemoteConnect in terminal.go:
1. ssh-keygen ephemeral session keypair
2. send-ssh-public-key AWS EIC API push, IAM-gated
3. pick-free-port local port for the tunnel
4. open-tunnel aws ec2-instance-connect open-tunnel start
5. wait-for-port the tunnel actually listens (folds tunnel
stderr into Detail when it doesn't)
6. ssh-probe non-interactive `ssh ... 'echo MARKER'` that
confirms auth + bash + the marker round-trip
(CombinedOutput captures stderr verbatim —
this is the whole reason the endpoint exists)
Local Docker workspaces (no instance_id) get a smaller probe:
container-found + container-running. Same response shape so callers
don't need to branch.
Tests stub sendSSHPublicKey / openTunnelCmd / sshProbeCmd via the
existing package-level vars (same pattern as TestSSHCommandCmd_*) so
the test suite stays hermetic — no AWS, no network. The three new
tests pin: (a) routing to remote on instance_id present,
(b) routing to local on empty instance_id, (c) the operationally
critical case — full success through wait-for-port then a probe
failure surfaces ssh stderr in the ssh-probe step's Error/Detail
with first_failure="ssh-probe".
Auth: rides on existing WorkspaceAuth middleware. Operators with the
tenant ADMIN_TOKEN (fetched via /cp/admin/orgs/:slug/admin-token) can
probe any workspace without per-workspace token; same admin path as
the canvas dashboard reads workspace activity.
Response always returns HTTP 200 (success or step failure are both in
the JSON body) so callers don't need to branch on status code — the
endpoint either reports a first_failure or doesn't.
Resolves task #200, supports task #193 (workspace EC2 sshd
unresponsive — without this endpoint we couldn't pin the failure
stage from outside the tenant CP EC2).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
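The per-stage pipeline above reduces to a small run-until-first-failure loop. A Python sketch for illustration (the real code is Go; `run_stages` is a hypothetical helper, and error/detail are collapsed into one `error` field here):

```python
# Minimal model of the diagnose pipeline: run stages in order, stop at
# the first failure, always return one JSON-shaped report.
import time

def run_stages(stages):
    """stages: list of (name, fn); fn returns (ok, error_or_none)."""
    report, first_failure = [], None
    for name, fn in stages:
        start = time.monotonic()
        ok, err = fn()
        report.append({
            "name": name,
            "ok": ok,
            "duration_ms": int((time.monotonic() - start) * 1000),
            "error": err,
        })
        if not ok:
            first_failure = name   # name the broken stage, then stop
            break
    # HTTP-200-shaped either way: success and failure both live in the body.
    return {"stages": report, "first_failure": first_failure}
```

With the stage list mirroring ssh-keygen → send-ssh-public-key → pick-free-port → open-tunnel → wait-for-port → ssh-probe, a single call splits "IAM broke" from "tunnel broke" from "sshd broke" by which name lands in `first_failure`.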
998e13c4bd
feat(deploy): verify each tenant /buildinfo matches published SHA after redeploy
Closes the gap that let issue #2395 ship: redeploy-fleet workflows reported ssm_status=Success based on the SSM RPC return code alone, while EC2 tenants silently kept serving the previous :latest digest because docker compose up without an explicit pull is a no-op when the local tag already exists.

Wire:
- new buildinfo package exposes GitSHA, set at link time via -ldflags from the GIT_SHA build-arg (default "dev" so test runs without ldflags fail closed against an unset deploy)
- router exposes GET /buildinfo returning {git_sha} — public, no auth, cheap enough to curl from CI for every tenant
- both Dockerfiles thread GIT_SHA into the Go build
- publish-workspace-server-image.yml passes GIT_SHA=github.sha for both images
- redeploy-tenants-on-main.yml + redeploy-tenants-on-staging.yml curl each tenant's /buildinfo after the redeploy SSM RPC and fail the workflow on digest mismatch; staging treats both :latest and :staging-latest as moving tags; verification is skipped only when an operator pinned a specific tag via workflow_dispatch

Tests:
- TestGitSHA_DefaultDevSentinel pins the dev default
- TestBuildInfoEndpoint_ReturnsGitSHA pins the wire shape that the workflow's jq lookup depends on

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
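The verification loop the workflows run can be sketched as below, assuming only what the commit states: each tenant serves GET /buildinfo returning {"git_sha": ...} and "dev" is the fail-closed sentinel. `verify_fleet` and the `fetch` callback are hypothetical names standing in for the workflow's curl + jq steps.

```python
# Sketch of the post-redeploy check: every tenant must report the
# published SHA, otherwise the workflow fails.
def verify_fleet(expected_sha, tenants, fetch):
    """fetch(tenant) returns the tenant's /buildinfo JSON as a dict."""
    mismatches = {}
    for tenant in tenants:
        got = fetch(tenant).get("git_sha", "")
        # A binary built without -ldflags reports "dev" and can never
        # match a published SHA, so an unset deploy fails closed.
        if got != expected_sha:
            mismatches[tenant] = got
    return mismatches   # non-empty -> fail the workflow
```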
bb52a1a365
fix(team): delegate Expand child-provisioning to shared mint pipeline (#2367)
Closes #2367. TeamHandler.Expand provisioned child workspaces by directly calling h.provisioner.Start, skipping mintWorkspaceSecrets and every other preflight (secrets load, env mutators, identity injection, missing-env, empty-config-volume auto-recover). Children shipped with NULL platform_inbound_secret + never-issued auth_token — same drift class as the SaaS bug just fixed in PR #2366, found while exercising a stronger gate against this package.

Fix:
- TeamHandler now holds *WorkspaceHandler. Expand delegates each child provision to wh.provisionWorkspace, picking up the shared prepare/mint/preflight pipeline automatically. Future provision-time steps go in ONE place and team-expand inherits them.
- prepareProvisionContext gains PARENT_ID env injection sourced from payload.ParentID (which Expand now populates). This preserves the signal workspace/coordinator.py reads on startup, without threading env through provisioner.WorkspaceConfig manually.
- NewTeamHandler signature gains *WorkspaceHandler; router passes it.

Gate upgrade:
- TestProvisionFunctions_AllCallMintWorkspaceSecrets is now behavior-based: it walks every FuncDecl in the package and flags any function that calls h.provisioner.Start or h.cpProv.Start without also calling mintWorkspaceSecrets. Drift-resistant by construction — a future provision function with any name still trips the gate.
- Replaces the name-list version from PR #2366. The name list missed Expand precisely because Expand wasn't named provision*; the behavior-based detector caught it spontaneously when prototyped.

Tests: full workspace-server module green; gate previously verified to fire red on Expand pre-fix and on deliberate mintWorkspaceSecrets removal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
b6d223cd0a
feat(a2a): per-queue-id status endpoint + per-message TTL (RFC #2331 Tier 1)
Closes the observability gap surfaced in #2329 item 5: callers received queue_id in the 202 enqueue response but had no public lookup. The only existing observability path was check_task_status (delegation-flavored A2A only — joins via request_body->>'delegation_id'). Cross-workspace peer-direct A2A had no observability after enqueue.

This PR ships RFC #2331's Tier 1: minimum viable observability + caller-specified TTL. No schema migration — the expires_at column already exists (migration 042); only DequeueNext was honoring it, with no caller path to populate it.

Two changes:
1. extractExpiresInSeconds(body) — new helper mirroring extractIdempotencyKey/extractDelegationIDFromBody. Pulls params.expires_in_seconds from the JSON-RPC body. Zero (the unset default) preserves today's infinite-TTL semantics. EnqueueA2A grew an expiresAt *time.Time parameter; the proxy callsite computes *time.Time from the extracted seconds and threads it through to the INSERT.
2. GET /workspaces/:id/a2a/queue/:queue_id — new public handler. Auth: caller's workspace token must match queue.caller_id OR queue.workspace_id, OR be an org-level token. 404 (not 403) on auth failure to avoid leaking queue_id existence. Response includes status/attempts/last_error/timestamps/expires_at; embeds response_body via LEFT JOIN against activity_logs when status=completed for delegation-flavored items.

What this does NOT change:
- Drain semantics (heartbeat-driven dispatch).
- Native-session bypass (claude-agent-sdk, hermes still skip queue).
- Schema (column already exists).
- MCP tools (delegate_task_async / check_task_status keep their contract; this is a parallel queue-id surface).

Tests:
- 7 cases on extractExpiresInSeconds covering absent/positive/zero/negative/invalid-JSON/wrong-type/empty-params.
- go vet + go build clean.
- Full handlers test suite passes (no regressions from the EnqueueA2A signature change — only one production caller).

Tier 2 (cross-workspace stitch + webhook callback) and Tier 3 (controllerized lifecycle) deferred per RFC #2331.
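The helper's contract (absent/zero/negative/invalid all mean "no TTL") is easy to state concretely. A Python rendering for illustration; the real helper is Go, and the snake_case name plus the None-for-no-TTL convention are this sketch's choices:

```python
# Model of extractExpiresInSeconds: pull params.expires_in_seconds from
# a JSON-RPC body; anything but a positive int preserves infinite TTL.
import json

def extract_expires_in_seconds(body: bytes):
    try:
        params = json.loads(body).get("params") or {}
    except (ValueError, AttributeError):
        return None    # invalid JSON: keep today's behavior
    secs = params.get("expires_in_seconds") if isinstance(params, dict) else None
    if isinstance(secs, int) and not isinstance(secs, bool) and secs > 0:
        return secs
    return None        # absent / zero / negative / wrong type
```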
d0f198b24f
merge: resolve staging conflicts (a2a_proxy + workspace_crud)
Three files conflicted with staging changes that landed while this PR sat open. Resolved each by combining both intents (not picking one side):

- a2a_proxy.go: keep the branch's idle-timeout signature (workspaceID parameter + comment) AND apply staging's #1483 SSRF defense-in-depth check at the top of dispatchA2A. Type-assert h.broadcaster (now an EventEmitter interface per staging) back to *Broadcaster for applyIdleTimeout's SubscribeSSE call; falls through to no-op when the assertion fails (test-mock case).
- a2a_proxy_test.go: keep both new test suites — the branch's TestApplyIdleTimeout_* (3 cases for the idle-timeout helper) AND staging's TestDispatchA2A_RejectsUnsafeURL (#1483 regression). Updated the staging test's dispatchA2A call to pass the workspaceID arg introduced by the branch's signature change.
- workspace_crud.go: combine both Delete-cleanup intents:
  * the branch's cleanupCtx detachment (WithoutCancel + 30s) so canvas hang-up doesn't cancel mid-Docker-call (the container-leak fix)
  * the branch's stopAndRemove helper that skips RemoveVolume when Stop fails (the orphan sweeper handles it)
  * staging's #1843 stopErrs aggregation so Stop failures bubble up as 500 to the client (the EC2 orphan-instance prevention)

Both concerns satisfied: cleanup runs to completion past canvas hang-up AND failed Stop calls surface to the caller. Build clean, all platform tests pass.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
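The combined workspace_crud.go intents (skip the volume when Stop fails, aggregate Stop errors and surface them) can be sketched independently of Docker. Illustrative Python only; the real code is Go and also detaches the context with WithoutCancel, which this sketch omits. `stop_and_remove` is a hypothetical name:

```python
# Sketch of the merged Delete-cleanup policy: volumes are only removed
# after a successful Stop, and Stop errors are aggregated for the caller.
def stop_and_remove(containers, stop, remove_volume):
    """stop(c) returns an error string or None; remove_volume(c) cleans up."""
    stop_errs = []
    for c in containers:
        err = stop(c)
        if err:
            stop_errs.append(f"{c}: {err}")
            continue           # leave the volume for the orphan sweeper
        remove_volume(c)
    # Aggregated errors bubble up so the handler can answer 500.
    return "; ".join(stop_errs) if stop_errs else None
```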
0de67cd379
feat(platform/admin): /admin/workspace-images/refresh + Docker SDK + GHCR auth
The production-side end of the runtime CD chain. Operators (or the post-publish CI workflow) hit this after a runtime release to pull the latest workspace-template-* images from GHCR and recreate any running ws-* containers so they adopt the new image. Without this, freshly-published runtime sat in the registry but containers kept the old image until naturally cycled.

Implementation notes:
- Uses Docker SDK ImagePull rather than shelling out to the docker CLI — the alpine platform container has no docker CLI installed.
- ghcrAuthHeader() reads GHCR_USER + GHCR_TOKEN env, builds the base64-encoded JSON payload the Docker engine expects in PullOptions.RegistryAuth. Both empty → public/cached images only; both set → private GHCR pulls.
- Container matching uses ContainerInspect (NOT ContainerList) because ContainerList returns the resolved digest in .Image, not the human tag. Inspect surfaces .Config.Image, which is what we need.
- Provisioner.DefaultImagePlatform() exported so the admin handler picks the same Apple-Silicon-needs-amd64 platform as the provisioner — single source of truth for the multi-arch override.

Local-dev companion: scripts/refresh-workspace-images.sh runs on the host and inherits the host's docker keychain auth — an alternate path for when GHCR_USER/TOKEN aren't set in the platform env.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
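The ghcrAuthHeader() shape is concrete enough to sketch: the Docker engine expects RegistryAuth to be a base64 (URL-safe) encoding of a JSON credential object. Python rendering for illustration; the real helper is Go, and the exact JSON field set here ({username, password}) is the standard engine auth-config subset, assumed rather than quoted from the commit:

```python
# Sketch of ghcrAuthHeader: both env vars set -> encoded credentials,
# either missing -> empty string (public/cached images only).
import base64, json

def ghcr_auth_header(env):
    user, token = env.get("GHCR_USER", ""), env.get("GHCR_TOKEN", "")
    if not user or not token:
        return ""
    payload = json.dumps({"username": user, "password": token})
    return base64.urlsafe_b64encode(payload.encode()).decode()
```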
425df5e5a9
merge(staging): resolve conflicts + fix 7 test regressions on top of #2061
- Merge origin/staging into fix/canvas-multilevel-layout-ux. 18 files
  auto-merged (mostly canvas/tabs/chat and workspace-server handlers;
  the earlier DIRTY marker was stale relative to current staging).
- Fix 7 test failures surfaced by the merge:
1. Canvas.pan-to-node.test.tsx — mockGetIntersectingNodes was
inferred as vi.fn(() => never[]); mockReturnValueOnce of a node
object failed type check. Explicit return-type annotation.
2. Canvas.pan-to-node.test.tsx + Canvas.a11y.test.tsx — Canvas.tsx
reads deletingIds.size (new multilevel-layout state). Both mock
stores lacked deletingIds; added new Set<string>() to each.
3. canvas-batch-partial-failure.test.ts — makeWS() built a wire-
format WorkspaceData (snake_case, with x/y/uptime_seconds). The
store's node.data is now WorkspaceNodeData (camelCase, no wire-
only fields). Rewrote makeWS to produce WorkspaceNodeData and
updated 5 call-site casts. No assertions changed.
4. ConfigTab.hermes.test.tsx — two tests pinned pre-#2061 behavior
that the PR intentionally inverts:
a. "shows hermes-specific info banner" — RUNTIMES_WITH_OWN_CONFIG
now contains only {"external"}, so the banner is no longer
shown for hermes. Inverted assertion: now pins ABSENCE of
the banner, with a comment noting the inversion.
b. "config.yaml runtime wins over DB" — priority reversed:
DB is now authoritative so the tier-on-node badge matches
the form. Inverted scenario: DB=hermes + yaml=crewai →
form shows hermes. Switched test's DB runtime off langgraph
because the dropdown collapses langgraph into an empty-
valued "default" option that would hide the win signal.
- No production code changed — this commit is staging merge + test
realignment only. 953/953 canvas tests pass. tsc --noEmit clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
94d9331c76
feat(canvas+platform): chat attachments, model selection, deploy/delete UX
Session's accumulated UX work across frontend and platform. Reviewable in four logical sections — the diff is large but internally cohesive (each section fixes a gap the next one depends on).

## Chat attachments — user ↔ agent file round trip
- New POST /workspaces/:id/chat/uploads (multipart, 50 MB total / 25 MB per file, UUID-prefixed storage under /workspace/.molecule/chat-uploads/).
- New GET /workspaces/:id/chat/download with RFC 6266 filename escaping and binary-safe io.CopyN streaming.
- Canvas: drag-and-drop onto the chat pane, pending-file pills, per-message attachment chips with fetch+blob download (anchor navigation can't carry auth headers).
- A2A flow carries FileParts end-to-end; the hermes template executor now consumes attachments via platform helpers.

## Platform attachment helpers (workspace/executor_helpers.py)
Every runtime's executor routes through the same helpers so future runtimes inherit attachment awareness for free:
- extract_attached_files — resolve workspace:/file:///bare URIs, reject traversal, skip non-existent.
- build_user_content_with_files — manifest for non-image files, multi-modal list (text + image_url) for images. Respects MOLECULE_DISABLE_IMAGE_INLINING for providers whose vision adapter hangs on base64 payloads (MiniMax M2.7).
- collect_outbound_files — scans the agent reply for /workspace/... paths, stages each into chat-uploads/ (download endpoint whitelist), emits as FileParts in the A2A response.
- ensure_workspace_writable — called at molecule-runtime startup so non-root agents can write /workspace without each template having to chmod in its Dockerfile.

Hermes template executor + langgraph (a2a_executor.py) + claude-code (claude_sdk_executor.py) all adopt the helpers.

## Model selection & related platform fixes
- PUT /workspaces/:id/model — was 404'ing, so canvas "Save" silently lost the model choice. Stores into workspace_secrets (MODEL_PROVIDER), auto-restarts via RestartByID.
- applyRuntimeModelEnv falls back to envVars["MODEL_PROVIDER"] so Restart propagates the stored model to HERMES_DEFAULT_MODEL without needing the caller to rehydrate payload.Model.
- ConfigTab Tier dropdown now reads from the workspaces row, not the (stale) config.yaml — fixes "badge shows T3, form shows T2".

## ChatTab & WebSocket UX fixes
- Send button no longer locks after a dropped TASK_COMPLETE — `sending` no longer initializes from data.currentTask.
- A2A POST timeout 15 s → 120 s. LLM turns routinely exceed 15 s; the previous default aborted fetches while the server was still replying, producing "agent may be unreachable" on success.
- socket.ts: disposed flag + reconnectTimer cancellation + handler detachment fix zombie WebSockets in React StrictMode.
- Hermes Config tab: RUNTIMES_WITH_OWN_CONFIG drops 'hermes' — the adaptor's purpose IS the form, the banner was contradictory.
- workspace_provision.go auto-recovery: try <runtime>-default AND bare <runtime> for the template path (hermes lives at the bare name).

## Org deploy/delete animation (theme-ready CSS)
- styles/theme-tokens.css — design tokens (durations, easings, colors). Light theme overrides by setting only the deltas.
- styles/org-deploy.css — animation classes + keyframes, every value references a token. prefers-reduced-motion respected.
- Canvas projects node.draggable=false onto locked workspaces (deploying children AND actively-deleting ids) — RF's authoritative drag lock; useDragHandlers retains a belt-and-braces check.
- Organ cancel button (red pulse pill on root during deploy) cascades via the existing DELETE /workspaces/:id?confirm=true.
- Auto fit-view after each arrival, debounced 500 ms so rapid sibling arrivals coalesce into one fit (the previous per-event fit made the viewport lurch continuously).
- Auto-fit respects user pan — onMoveEnd stamps a user-pan timestamp only when event !== null (ignores programmatic fitView) so auto-fits don't self-cancel.
- deletingIds store slice + useOrgDeployState merge gives the delete flow the same dim + non-draggable treatment as deploy.
- Platform-level classNames.ts shared by canvas-events + useCanvasViewport (DRY'd 3 copies of split/filter/join).

## Server payload change
- org_import.go WORKSPACE_PROVISIONING broadcast now includes parent_id + parent-RELATIVE x/y (slotX/slotY) so the canvas renders the child at the right parent-nested slot without doing any absolute-position walk. createWorkspaceTree signature gains relX, relY alongside absX, absY; both call sites updated.

## Tests
- workspace/tests/test_executor_helpers.py — 11 new cases covering URI resolution (including traversal rejection), attached-file extraction (both Part shapes), manifest-only vs multi-modal content, large-image skip, outbound staging, dedup, and ensure_workspace_writable (chmod 777 + non-root tolerance).
- workspace-server chat_files_test.go — upload validation, Content-Disposition escaping, filename sanitisation.
- workspace-server secrets_test.go — SetModel upsert, empty clears, invalid UUID rejection.
- tests/e2e/test_chat_attachments_e2e.sh — round-trip against a live hermes workspace.
- tests/e2e/test_chat_attachments_multiruntime_e2e.sh — static plumbing check + round-trip across hermes/langgraph/claude-code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
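The "RFC 6266 filename escaping" the download endpoint needs is worth one concrete sketch: an ASCII fallback in `filename=""` plus a UTF-8 percent-encoded `filename*` parameter. Illustrative Python only (the real handler is Go; `content_disposition` is a hypothetical helper, and the exact escaping rules here are a simplified reading of RFC 6266/5987):

```python
# Sketch of RFC 6266-style Content-Disposition building: quoted ASCII
# fallback plus UTF-8 filename* for non-ASCII names.
from urllib.parse import quote

def content_disposition(filename: str) -> str:
    fallback = (filename.encode("ascii", "replace").decode()
                .replace('"', r'\"'))
    return (f'attachment; filename="{fallback}"; '
            f"filename*=UTF-8''{quote(filename)}")
```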
a1b803ca7a
fix(admin/a2a_queue): add drop-stale endpoint for post-incident queue cleanup
Issue #1947: after incidents, PM agents inherit hour-old TASK-priority queue items from ICs that were correctly reporting "X is broken" while X was actually broken. Once X is fixed those items are stale noise — PMs spend ~5 min each writing "thanks, the issue is resolved".

Adds:
- DropStaleQueueItems() in a2a_queue.go: UPDATE ... SET status='dropped' for queued items older than maxAgeMinutes. Uses FOR UPDATE SKIP LOCKED to stay concurrency-safe with concurrent drain calls.
- AdminQueueHandler in admin_queue.go: POST /admin/a2a-queue/drop-stale (AdminAuth, ?max_age_minutes=N, &workspace_id=<id>). Returns {dropped: N}.
- admin_queue_test.go: HTTP-level tests for param validation and response shape.
- Router registration for the new endpoint.

Usage during incident recovery:
  curl -X POST /admin/a2a-queue/drop-stale?max_age_minutes=120
  # scoped to one workspace:
  curl -X POST "/admin/a2a-queue/drop-stale?max_age_minutes=120&workspace_id=<uuid>"

Closes #1947.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
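The param validation the HTTP tests pin, together with the shape of the UPDATE, can be sketched as follows. The SQL is a plausible reconstruction from the commit's description (status='dropped', age cutoff, FOR UPDATE SKIP LOCKED), not the literal query; `parse_drop_stale_params` is a hypothetical name:

```python
# Sketch of drop-stale: validate query params, then issue an UPDATE
# that skips rows a concurrent drain has already locked.
DROP_STALE_SQL = """
UPDATE a2a_queue SET status = 'dropped'
WHERE id IN (
    SELECT id FROM a2a_queue
    WHERE status = 'queued'
      AND created_at < now() - make_interval(mins => $1)
    FOR UPDATE SKIP LOCKED   -- concurrency-safe next to drains
)
"""

def parse_drop_stale_params(query: dict):
    """Returns (max_age_minutes, workspace_id) or raises ValueError."""
    raw = query.get("max_age_minutes")
    if raw is None or not str(raw).isdigit() or int(raw) <= 0:
        raise ValueError("max_age_minutes must be a positive integer")
    return int(raw), query.get("workspace_id")
```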
19cd5c9f4b
test(router): set ADMIN_TOKEN in TestTestTokenRoute_RequiresAdminAuth_WhenTokensExist
The test asserts that AdminAuth rejects an unauthenticated request to the test-token route once any workspace token exists in the DB. It sets MOLECULE_ENV=development to enable the handler's gate.

After this branch's AdminAuth Tier-1b hatch (middleware/devmode.go), MOLECULE_ENV=development + empty ADMIN_TOKEN becomes the explicit fail-open signal for local dev — so the request correctly passes AdminAuth and falls through to the handler, which then 500s on an unmocked DB lookup instead of the expected 401.

The security property the test is protecting (no bearer → 401 when tokens exist) corresponds to the SaaS configuration where ADMIN_TOKEN is always set. Setting ADMIN_TOKEN in the test suppresses the dev-mode hatch and reaches AdminAuth's Tier-2 bearer check, which correctly aborts 401 with "admin auth required".

No production behaviour change — the test is now verifying the path that actually runs in production (MOLECULE_ENV=production + ADMIN_TOKEN set).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
87a97846cd
feat(a2a): queue-on-busy — Phase 1 of priority queue (#1870)
## Problem
When a lead delegates to a worker that's mid-synthesis, the proxy returns
503 "workspace agent busy" and the caller records the delegation as
failed. On fan-out storms from leads this hits ~70% drop rate — today's
observed numbers in the cycle reports.
## Fix — Phase 1 TASK-level queue-on-busy
When `handleA2ADispatchError` determines the target is busy, instead of
returning 503, enqueue the request as priority=TASK and return 202
Accepted with `{queued: true, queue_id, queue_depth}`. The workspace's
next heartbeat (≤30s) drains one item if it reports spare capacity.
Files:
- migrations/042_a2a_queue.{up,down}.sql — `a2a_queue` table with
partial indexes on status='queued' + idempotency_key. Schema
supports PriorityCritical/Task/Info from day one so Phase 2/3 ship
without migration churn.
- internal/handlers/a2a_queue.go — EnqueueA2A / DequeueNext /
Mark*-helpers plus WorkspaceHandler.DrainQueueForWorkspace. Uses
`SELECT ... FOR UPDATE SKIP LOCKED` so concurrent drains can't
double-claim the same row. Max 5 attempts before marking 'failed'
so a stuck item doesn't wedge the queue forever.
- internal/handlers/a2a_proxy_helpers.go — isUpstreamBusyError branch
calls EnqueueA2A and returns 202 on success. Falls through to the
legacy 503 on enqueue error (DB hiccup shouldn't silently drop).
- internal/handlers/registry.go — RegistryHandler gets a QueueDrainFunc
injection hook (SetQueueDrainFunc). When Heartbeat sees
active_tasks < max_concurrent_tasks, spawns a goroutine that calls
the drain hook. context.WithoutCancel ensures the drain outlives
the heartbeat handler's ctx.
- internal/router/router.go — wires wh.DrainQueueForWorkspace into
rh.SetQueueDrainFunc after both are constructed.
## Not in this PR (Phase 2/3/4 follow-ups)
- INFO priority + TTL (Phase 2)
- CRITICAL priority + soft preemption between tool calls (Phase 3)
- Age-based promotion so TASK doesn't starve (Phase 4)
- `GET /workspaces/:id/queue` observability endpoint
Schema already supports all of these; only the dispatch + policy code
remains.
## Tests
- TestExtractIdempotencyKey (5 cases): messageId parsing is robust
- TestPriorityConstants: ordering invariant + 50=TASK default
alignment with migration DEFAULT
Full DB-touching tests (FIFO order, retry bound, idempotency conflict)
intentionally deferred to the CI migration-enabled path — sqlmock
ceremony would duplicate the existing test infrastructure 3× over and
the behaviour is directly expressible in SQL constraints (FOR UPDATE
SKIP LOCKED, partial unique index).
## Expected impact once deployed
- a2a_receive error with "busy" flavor drops from ~69/10min observed
today to ~0
- delegation_failed rate drops from ~50% to <5%
- real_output metric rises from ~30/15min back toward the pre-
throttle baseline
Closes #1870 Phase 1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
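The busy-path decision in a2a_proxy_helpers.go is small enough to model directly: enqueue at TASK priority and answer 202, or fall back to the legacy 503 when the enqueue itself fails. Illustrative Python sketch; the real code is Go, and `dispatch_on_busy`/`InMemoryQueue` are hypothetical names (the in-memory queue stands in for the a2a_queue table):

```python
# Sketch of Phase-1 queue-on-busy: 202 {queued, queue_id, queue_depth}
# on success, legacy 503 when the enqueue errors (never drop silently).
PRIORITY_TASK = 50   # aligns with the migration DEFAULT described above

class InMemoryQueue:
    """Test double standing in for the a2a_queue table."""
    def __init__(self, fail=False):
        self.items, self.fail = [], fail
    def enqueue(self, request, priority):
        if self.fail:
            raise RuntimeError("db hiccup")
        self.items.append((priority, request))
        return len(self.items)            # queue_id stand-in
    def depth(self):
        return len(self.items)

def dispatch_on_busy(queue, request):
    try:
        qid = queue.enqueue(request, priority=PRIORITY_TASK)
    except Exception:
        return 503, {"error": "workspace agent busy"}
    return 202, {"queued": True, "queue_id": qid,
                 "queue_depth": queue.depth()}
```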
ed26f2733a
fix(review): address code review blockers on tool-trace + instructions
BLOCKERS fixed:
- instructions.go: Drop team-scope queries (teams/team_members tables don't exist in any migration). Schema column kept for future use. Restored Resolve to /workspaces/:id/instructions/resolve under wsAuth — closes an auth gap that allowed cross-workspace enumeration of operator policy.
- migration 040: Add CHECK constraints on title (<=200) and content (<=8192) to prevent token-budget DoS via oversized instructions.
- a2a_executor.py: Pair on_tool_start/on_tool_end via run_id instead of list position so parallel tool calls don't drop or clobber outputs. Cap tool_trace at 200 entries to prevent runaway loops bloating JSONB.

HIGH fixes:
- instructions.go: Add length validation in Create + Update handlers. Removed dead rows_ shadow variable. Replaced string concatenation in Resolve with strings.Builder.
- prompt.py: Drop httpx timeout 10s -> 3s (boot hot path). Switch print to logger.warning. Add Authorization bearer header from the MOLECULE_WORKSPACE_TOKEN env var.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
d7afd15e59
feat: platform instructions system with global/team/workspace scope
Adds a configurable instruction-injection system that prepends rules to every agent's system prompt. Instructions are stored in the DB and fetched at workspace startup, supporting three scopes:
- Global: applies to all agents (e.g., "verify with tools before reporting")
- Team: applies to agents in a specific team
- Workspace: applies to a single agent (role-specific rules)

Components:
- Migration 040: platform_instructions table with scope hierarchy
- Go API: CRUD endpoints + a resolve endpoint that merges scopes
- Python runtime: fetches instructions at startup via /instructions/resolve and prepends them to the system prompt as highest-priority context

Initial global instructions seeded:
1. Verify Before Acting (check issues/PRs/docs first)
2. Verify Output Before Reporting (second signal before reporting done)
3. Tool Usage Requirements (claims must include tool output)
4. No Hallucinated Emergencies (CRITICAL needs proof)
5. Staging-First Workflow (never push to main directly)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
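The scope merge the resolve endpoint performs can be sketched briefly. Assumptions labeled here: the global → team → workspace ordering and the "prepend as highest-priority context" step are from the commit; the row shape, function names, and double-newline joining are this sketch's choices (and the real merge is Go, not Python):

```python
# Sketch of resolve + prompt injection: filter rows to the caller's
# scopes, order global -> team -> workspace, prepend to the prompt.
def resolve_instructions(rows, team_id, workspace_id):
    """rows: dicts with scope ('global'|'team'|'workspace'),
    scope_id, and content."""
    order = {"global": 0, "team": 1, "workspace": 2}
    picked = [r for r in rows
              if r["scope"] == "global"
              or (r["scope"] == "team" and r["scope_id"] == team_id)
              or (r["scope"] == "workspace" and r["scope_id"] == workspace_id)]
    picked.sort(key=lambda r: order[r["scope"]])
    return "\n\n".join(r["content"] for r in picked)

def build_system_prompt(base_prompt, resolved):
    # Resolved instructions ride in front as highest-priority context.
    return f"{resolved}\n\n{base_prompt}" if resolved else base_prompt
```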
c1593dd328
Merge remote-tracking branch 'origin/staging' into feat/bootstrap-failed-and-console-proxy
# Conflicts:
#	workspace-server/internal/handlers/admin_memories_test.go
4641151b09
Merge remote-tracking branch 'origin/staging' into feat/bootstrap-failed-and-console-proxy
# Conflicts:
#	workspace-server/internal/router/router.go
70d47e2730
fix(security): SSRF URL validation (#1130) + redactSecrets on memory admin endpoints (#1131, #1132)
URLs returned from the DB and Redis cache (db.GetCachedURL, workspaces.url column) are now validated via validateAgentURL() before any HTTP request is made:
- mcpResolveURL (mcp.go): added validateAgentURL() calls on all three return paths (internal cache, Redis cache, DB fallback).
- resolveAgentURL (a2a_proxy.go): added a validateAgentURL() call before returning agentURL to the A2A dispatcher.

validateAgentURL() was extended (registry.go) to resolve DNS hostnames and check each returned IP against the blocklist (private ranges, loopback, cloud-metadata 169.254.0.0/16). "localhost" is allowed by name for local dev.

GET /admin/memories/export now applies redactSecrets() to each content field before including it in the JSON response. Pre-SAFE-T1201 memories (stored before redactSecrets was mandatory on writes) no longer leak credentials.

POST /admin/memories/import now calls redactSecrets() on content before both the deduplication check and the INSERT. Imported memories with embedded credentials cannot bypass SAFE-T1201 (#838).

- admin_memories.go: GET /admin/memories/export + POST /admin/memories/import handlers (from PR #1051, with security fixes applied).
- admin_memories_test.go: 6 tests covering redactSecrets parity on both endpoints.
- registry_test.go: added DNS-lookup test cases for validateAgentURL (F1083). "localhost" allowed by name (preserves existing test); nxdomain blocked.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
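The IP checks behind validateAgentURL() are easy to state precisely. Python sketch for illustration (the real function is Go; `validate_agent_url` and the injectable `resolve` callback are hypothetical, standing in for the DNS lookup):

```python
# Sketch of the SSRF blocklist: private ranges, loopback, and the
# 169.254.0.0/16 cloud-metadata block are rejected; "localhost" is
# allowed by name for local dev.
import ipaddress
from urllib.parse import urlparse

def validate_agent_url(url: str, resolve=None) -> bool:
    host = urlparse(url).hostname or ""
    if host == "localhost":
        return True                      # allowed by name for local dev
    ips = resolve(host) if resolve else [host]
    for raw in ips:
        try:
            ip = ipaddress.ip_address(raw)
        except ValueError:
            return False                 # not an IP and no resolver: block
        if ip.is_private or ip.is_loopback or ip.is_link_local:
            return False                 # covers 169.254.0.0/16 metadata
    return True
```

The injectable resolver is what makes the nxdomain and DNS-rebinding-style cases testable without network access.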
731a9aef6e
feat(platform): bootstrap-failed + console endpoints for CP watcher
Workspaces stuck in provisioning used to sit in "starting" for 10min
until the sweeper flipped them. The real signal — a runtime crash at
EC2 boot — lands on the serial console within seconds but nothing
listened. These endpoints close the loop.
1. POST /admin/workspaces/:id/bootstrap-failed
The control plane's bootstrap watcher posts here when it spots
"RUNTIME CRASHED" in ec2:GetConsoleOutput. Handler:
- UPDATEs workspaces SET status='failed' only when status was
'provisioning' (idempotent — a raced online/failed stays put)
- Stores the error + log_tail in last_sample_error so the canvas
can render the real stack trace, not a generic "timeout" string
- Broadcasts WORKSPACE_PROVISION_FAILED with source='bootstrap_watcher'
2. GET /workspaces/:id/console
Proxies to CP's new /cp/admin/workspaces/:id/console endpoint so
the tenant platform can surface EC2 serial console output without
holding AWS credentials. CPProvisioner.GetConsoleOutput is the
client; returns 501 in non-CP deployments (docker-compose dev).
Both gated by AdminAuth — CP holds the tenant ADMIN_TOKEN that the
middleware accepts on its tier 2b branch.
Tests cover: happy-path fail, already-transitioned no-op, empty id,
log_tail truncation, and the 501 fallback when no CP is wired.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
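The idempotent transition in point 1 can be sketched as a guard plus a status flip. Illustrative Python only (the real handler is a Go UPDATE with a WHERE status='provisioning' clause; `mark_bootstrap_failed` and the dict row are hypothetical):

```python
# Sketch of the bootstrap-failed handler's state rule: only
# 'provisioning' flips to 'failed'; anything else is a no-op.
def mark_bootstrap_failed(row, error, log_tail, events):
    if row["status"] != "provisioning":
        return False                     # raced online/failed stays put
    row["status"] = "failed"
    # Keep the real stack trace for the canvas, not a generic timeout.
    row["last_sample_error"] = f"{error}\n{log_tail}"
    events.append({"type": "WORKSPACE_PROVISION_FAILED",
                   "source": "bootstrap_watcher"})
    return True
```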
517c2f869c
Merge pull request #1053 from Molecule-AI/fix/memory-backup-restore-1051
feat(platform): memory backup/restore for nuke-safe development (#1051)
ad28e10bf4
fix(org-tokens): rate-limit mint, bound list, correct audit provenance
Addresses the Critical + Important findings from today's code
review of the org API keys feature (PRs #1105-1108).
## Critical-1: rate-limit mint endpoint
Previously POST /org/tokens had no mint-rate limit. A compromised
WorkOS session or leaked bearer could mint thousands of tokens in
seconds, forcing a painful manual cleanup of each one.
Fix: dedicated per-IP token bucket, 10 mints/hour/IP. Legitimate
bursts fit under the ceiling; abuse bounces. List + Delete stay
on the global limiter — they can't be used to generate new
secret material.
## Important-1: HTTP handler integration tests
internal/orgtoken had 9 unit tests; the HTTP layer (org_tokens.go)
had none. Adds org_tokens_test.go covering:
- List happy path + DB error → 500
- Create actor="admin-token" (bootstrap), actor="org-token:<prefix>"
(chained mint), actor="session" (canvas browser path)
- Create name>100 chars → 400
- Create with empty body mints with no name
- Revoke happy path 200, missing id 404, empty id 400
- Plaintext returned in response body and prefix matches first 8 chars
- Warning text present
A regression that breaks the tier-ordering, drops the createdBy
field, or accepts oversized names now fails at CI not prod.
## Important-2: bound List output
List() had no LIMIT — a mint-storm bug or abuse could make the
admin UI slow to render and allocate proportionally. Adds
LIMIT 500 at the SQL layer. 10x realistic ceiling, guardrail
against pathological cases.
## Important-3: audit provenance uses plaintext prefix, not UUID
orgTokenActor() was logging "org-token:<first-8-of-uuid>" which
couldn't be cross-referenced with the UI (which shows first-8
of the plaintext). Users could not correlate "who minted this"
audit entries with the revoke button they're looking at.
Fix: Validate() now returns (id, prefix, error). Middleware
stashes both on the gin context. Handler reads prefix for the
actor string. Audit rows now match UI prefixes exactly.
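The prefix-based actor string can be sketched as below. Function names here are guesses from the commit text (`orgTokenActor` is named in it; `displayPrefix` is a hypothetical helper for the first-8-of-plaintext value the UI shows).

```go
package main

import "fmt"

// orgTokenActor builds the audit actor from the plaintext display
// prefix, so audit rows match what the UI renders next to Revoke.
func orgTokenActor(prefix string) string {
	return "org-token:" + prefix
}

// displayPrefix is the first 8 chars of the plaintext, the same value
// shown in the admin UI.
func displayPrefix(plaintext string) string {
	if len(plaintext) < 8 {
		return plaintext
	}
	return plaintext[:8]
}

func main() {
	fmt.Println(orgTokenActor(displayPrefix("mk_4f9a2c71d0e5"))) // org-token:mk_4f9a2
}
```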
## Nit: named constants for audit labels
actorOrgTokenPrefix / actorSession / actorAdminToken replace
the hardcoded strings scattered across the handler. Greppable
across log pipelines + audit queries; one place to change if
the format evolves.
## Tests
- internal/orgtoken: 9 existing + 0 new, all still green (updated
signatures for Validate returning prefix).
- internal/handlers/org_tokens_test.go: new — 9 HTTP-layer tests
above. Full gin.Context + sqlmock harness.
- Full `go test ./...` green except one pre-existing
TestGitHubToken_NoTokenProvider flake unrelated to this change
(expects 404, gets 500 — tracked separately).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
91187342b4
feat(auth): organization-scoped API keys for admin access
Adds user-facing API keys with full-org admin scope. Replaces the
single ADMIN_TOKEN env var with named, revocable, audited tokens
that users can mint/rotate from the canvas UI without ops
intervention.
Designed for the beta growth phase — one token tier (full admin).
Future work will split into scoped roles (admin / workspace-write
/ read-only) and per-workspace bindings. See docs/architecture/
org-api-keys.md for the design + follow-up roadmap.
## Surface
POST /org/tokens mint (plaintext returned once)
GET /org/tokens list live keys (prefix-only)
DELETE /org/tokens/:id revoke (idempotent)
All AdminAuth-gated. Bootstrap path: mint the first token via
ADMIN_TOKEN or canvas session; tokens can mint more tokens after.
## Validation as a new AdminAuth tier (2a)
AdminAuth evaluation order:
Tier 0 lazy-bootstrap fail-open (only when no live tokens AND
no ADMIN_TOKEN env)
Tier 1 verified WorkOS session via /cp/auth/tenant-member
Tier 2a org_api_tokens SELECT — NEW
Tier 2b ADMIN_TOKEN env (bootstrap / CLI break-glass)
Tier 3 any live workspace token (deprecated, only when ADMIN_TOKEN
unset)
Tier 2a runs ONE indexed lookup (partial index on
token_hash WHERE revoked_at IS NULL) + an async last_used_at
bump. No measurable latency cost on the hot path.
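The tier ordering can be sketched as a single switch over the signals AdminAuth inspects. This is an illustrative model only: `authInputs`, `authTier`, and the field names are assumptions, not the real middleware, and the real tier-2a branch is a DB lookup rather than a boolean.

```go
package main

import "fmt"

// authInputs models the signals the middleware inspects per request.
type authInputs struct {
	anyLiveOrgToken   bool // org_api_tokens has live (unrevoked) rows
	adminTokenEnvSet  bool // ADMIN_TOKEN env present
	validSession      bool // /cp/auth/tenant-member says member:true
	bearerIsOrgToken  bool // tier-2a indexed lookup hit
	bearerIsAdminEnv  bool // bearer equals ADMIN_TOKEN
	bearerIsWorkspace bool // bearer is a live workspace token
}

// authTier returns the first tier (in evaluation order) that admits
// the request, mirroring the list above.
func authTier(in authInputs) (string, bool) {
	switch {
	case !in.anyLiveOrgToken && !in.adminTokenEnvSet:
		return "tier0-bootstrap-fail-open", true
	case in.validSession:
		return "tier1-workos-session", true
	case in.bearerIsOrgToken:
		return "tier2a-org-api-token", true
	case in.bearerIsAdminEnv:
		return "tier2b-admin-token-env", true
	case in.bearerIsWorkspace && !in.adminTokenEnvSet:
		return "tier3-workspace-token", true // deprecated path
	}
	return "", false
}

func main() {
	tier, ok := authTier(authInputs{anyLiveOrgToken: true, bearerIsOrgToken: true})
	fmt.Println(tier, ok)
}
```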
## UI
New "Org API Keys" tab in the settings panel. Label field for
human-readable naming. Plaintext shown once + clipboard copy.
Revoke with confirm dialog. Mirrors the existing workspace-TokensTab
flow so users who've used one get the other for free.
## Security properties
- Plaintext never stored. sha256 hash + 8-char display prefix.
- Revocation is immediate: partial index on revoked_at IS NULL
means the next request validates or fails in microseconds.
- created_by audit field captures provenance: "org-token:<short>"
when a token mints another, "session" for browser-UI mints,
"admin-token" for the ADMIN_TOKEN bootstrap path.
- Validate() collapses all failure shapes into ErrInvalidToken
so response-shape can't distinguish "never existed" from
"revoked".
## Tests
- internal/orgtoken: 9 unit tests (hash storage, empty field
null-ing, validation happy path, empty plaintext, unknown hash,
revoked filtering, list ordering, revoke idempotency, has-any-live
short-circuit).
- AdminAuth tier-2a integration covered by existing middleware
tests unchanged (fail-open + bearer paths).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
d03f2d47e0
fix: close cross-tenant authz + cp_proxy admin-traversal gaps
Addresses three Critical findings from today's code review of the
SaaS-canvas routing stack.
## Critical-1: session verification scoped to the current tenant
session_auth.go previously verified via GET /cp/auth/me, which
only answers "is someone logged in" — NOT "is this user in the
org they're targeting." Every WorkOS-authed user (including folks
who only signed up via app.moleculesai.app with no tenant
relationship) could call /workspaces, /approvals/pending,
/bundles/import, /org/import etc. on ANY tenant they could reach.
Cross-tenant read: user at acme.moleculesai.app could hit
bob.moleculesai.app/workspaces with their cookie and get Bob's
workspaces.
Fix:
- CP gains GET /cp/auth/tenant-member?slug=<slug> which joins
org_members × organizations and only returns member:true when
the authenticated user is actually in that org.
- Tenant sets MOLECULE_ORG_SLUG at boot via user-data.
- session_auth now calls tenant-member (not /me), passing its
own slug. Cache key includes slug so one tenant's cached
positive never satisfies another's check.
## Critical-2: cp_proxy path allowlist (lateral-movement fix)
cp_proxy.go forwarded any /cp/* path upstream with the cookie
and bearer attached. Since /cp/admin/* accepts sessions as one
of its auth tiers, a tenant-authed user could curl
/cp/admin/tenants/other-slug/diagnostics through their tenant
and the CP would honor it — turning any tenant into a lateral
hop into admin surface.
Fix: explicit allowlist of paths the canvas browser bundle
actually needs (/cp/auth, /cp/orgs, /cp/billing, /cp/templates,
/cp/legal). Everything else 404s at the tenant before cookies
leave. Fail-closed: future UI paths require explicit entries.
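A fail-closed matcher for the allowlist above might look like this. The prefixes are taken from the commit text; the matching logic (exact match or a full path segment under the prefix, so `/cp/authz` does not ride on `/cp/auth`) is an illustrative guess at `isCPProxyAllowedPath`.

```go
package main

import (
	"fmt"
	"strings"
)

// cpAllowedPrefixes are the /cp/* paths the canvas browser bundle
// actually needs; everything else is rejected at the tenant.
var cpAllowedPrefixes = []string{
	"/cp/auth", "/cp/orgs", "/cp/billing", "/cp/templates", "/cp/legal",
}

// isCPProxyAllowedPath is fail-closed: unknown paths return false and
// 404 before cookies or bearers leave the tenant.
func isCPProxyAllowedPath(path string) bool {
	for _, p := range cpAllowedPrefixes {
		if path == p || strings.HasPrefix(path, p+"/") {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isCPProxyAllowedPath("/cp/auth/me"))                      // allowed
	fmt.Println(isCPProxyAllowedPath("/cp/admin/tenants/x/diagnostics")) // blocked: no lateral hop
}
```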
## Important-1,2: bounded session cache + split positive/negative TTL
Previous sync.Map cache grew unbounded (one entry per unique
Cookie header for process lifetime) and cached failures for 30s,
meaning a 3s CP blip locked users out for the full window.
Fix:
- Bounded map with batch random eviction at cap (10k entries ×
~100 bytes = 1 MB ceiling). Random eviction is O(1)
expected; we don't need precise LRU.
- Periodic sweeper goroutine (2 min) reclaims expired entries
even when they're not re-hit.
- Positive TTL 30s, negative TTL 5s — short negative so CP
flakes self-heal fast.
- Transport errors NOT cached (would otherwise trap every
user during a multi-second upstream outage).
- Cache key = sha256(slug + cookie) so raw session tokens
don't sit in process memory, and cross-tenant isolation is
structural not policy.
## Important-3: TenantGuard /cp/* bypass documented
Added a security note to the bypass explaining why it's safe
only under the current setup (cp_proxy allowlist + tunnel-only
ingress), and what would require revisiting (SG opens :8080
inbound to the VPC).
## Tests
- session_auth_test.go: 12 new tests — empty cookie, missing
slug, no CP, member:true happy path with cache hit, member:
false, 401 upstream, malformed JSON, transport error not
cached, cross-tenant isolation (same cookie different
tenants hit upstream separately), bounded eviction, expired
entries, cache key collision resistance.
- cp_proxy_test.go: new — isCPProxyAllowedPath covers 17
allow/block cases, forwarding preserves Cookie+Auth, Host
rewritten, blocked paths 404 without calling upstream.
All platform tests pass. CP provisioner tests pass after
threading cfg.OrgSlug into the container env.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
52235aeb27
feat(router): /cp/* reverse-proxy to CP + same-origin canvas fetches
Canvas's browser bundle issues fetches to both CP endpoints
(/cp/auth/me, /cp/orgs, ...) AND tenant-platform endpoints
(/canvas/viewport, /approvals/pending, /org/templates). They
share ONE build-time base URL. Baking api.moleculesai.app
broke tenant calls with 404; baking the tenant subdomain broke
auth. Tried both today and saw exactly one failure mode per
attempt.
Real fix: same-origin fetches + tenant-side split. Adds:
internal/router/cp_proxy.go # /cp/* → CP_UPSTREAM_URL
mounted before NoRoute(canvasProxy). Now a tenant serves:
/cp/* → reverse-proxy to api.moleculesai.app
/canvas/viewport,
/approvals/pending,
/workspaces/:id/*,
/ws, /registry, → tenant platform (existing handlers)
/metrics
everything else → canvas UI (existing reverse-proxy)
Canvas middleware reverts to `connect-src 'self' wss:` for the
same-origin path (keeping explicit PLATFORM_URL whitelist as a
self-hosted escape hatch when the build-arg is non-empty).
CI build-arg flips to NEXT_PUBLIC_PLATFORM_URL="" so the bundle
issues relative fetches.
Security of cp_proxy:
- Cookie + Authorization PRESERVED across the hop (opposite of
canvas proxy) — they carry the WorkOS session, which is the
whole point.
- Host rewritten to upstream so CORS + cookie-domain on the CP
side see their own hostname.
- Upstream URL validated at construction: must parse, must be
http(s), must have a host — misconfig fails closed.
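The construction-time check can be sketched straight from those three conditions. `validateUpstream` is an illustrative name, not the actual constructor.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// validateUpstream enforces the fail-closed rules for CP_UPSTREAM_URL:
// must parse, must be http(s), must carry a host.
func validateUpstream(raw string) (*url.URL, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return nil, err
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return nil, errors.New("cp_proxy: upstream must be http(s)")
	}
	if u.Host == "" {
		return nil, errors.New("cp_proxy: upstream missing host")
	}
	return u, nil
}

func main() {
	// a bare hostname has no scheme, so misconfig fails closed at boot
	if _, err := validateUpstream("api.moleculesai.app"); err != nil {
		fmt.Println("rejected:", err)
	}
}
```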
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
b1bb5f838a
fix: GitHub token refresh — add WorkspaceAuth path for credential helper (#1068)
PR #729 tightened AdminAuth to require ADMIN_TOKEN, breaking the
workspace credential helper, which called
/admin/github-installation-token with a workspace bearer token.
Tokens expired after 60 min with no refresh.
Fix: Add /workspaces/:id/github-installation-token under
WorkspaceAuth so any authenticated workspace can refresh its
GitHub token. Keep the admin path as backward-compatible alias.
Update molecule-git-token-helper.sh to use the workspace-scoped
path when WORKSPACE_ID is set.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
c9e4e349b2
Add memory backup/restore endpoints for safe Docker rebuilds (#1051)
GET /admin/memories/export returns all agent memories with
workspace-name mapping. POST /admin/memories/import accepts the
same format, resolves workspaces by name, and deduplicates on
content+scope. Both endpoints are AdminAuth-gated.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
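The content+scope dedup on import can be sketched as a set keyed on both fields. The `memory` struct and field names are illustrative, not the real types.

```go
package main

import "fmt"

// memory models one exported row; Workspace is resolved by name on
// import, and duplicates are identified by (Content, Scope).
type memory struct {
	Workspace string
	Scope     string
	Content   string
}

// dedupe drops incoming rows whose content+scope already exists,
// including duplicates within the incoming batch itself.
func dedupe(existing, incoming []memory) []memory {
	seen := map[[2]string]bool{}
	for _, m := range existing {
		seen[[2]string{m.Content, m.Scope}] = true
	}
	var fresh []memory
	for _, m := range incoming {
		k := [2]string{m.Content, m.Scope}
		if seen[k] {
			continue
		}
		seen[k] = true
		fresh = append(fresh, m)
	}
	return fresh
}

func main() {
	have := []memory{{Scope: "workspace", Content: "prefers tabs"}}
	got := dedupe(have, []memory{
		{Scope: "workspace", Content: "prefers tabs"}, // duplicate, dropped
		{Scope: "global", Content: "prefers tabs"},    // same content, new scope, kept
	})
	fmt.Println(len(got)) // 1
}
```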
d8026347e5
chore: open-source restructure — rename dirs, remove internal files, scrub secrets
Renames:
- platform/ → workspace-server/ (Go module path stays as "platform"
  for external dep compat — will update after plugin module republish)
- workspace-template/ → workspace/
Removed (moved to separate repos or deleted):
- PLAN.md — internal roadmap (move to private project board)
- HANDOFF.md, AGENTS.md — one-time internal session docs
- .claude/ — gitignored entirely (local agent config)
- infra/cloudflare-worker/ → Molecule-AI/molecule-tenant-proxy
- org-templates/molecule-dev/ → standalone template repo
- .mcp-eval/ → molecule-mcp-server repo
- test-results/ — ephemeral, gitignored
Security scrubbing:
- Cloudflare account/zone/KV IDs → placeholders
- Real EC2 IPs → <EC2_IP> in all docs
- CF token prefix, Neon project ID, Fly app names → redacted
- Langfuse dev credentials → parameterized
- Personal runner username/machine name → generic
Community files:
- CONTRIBUTING.md — build, test, branch conventions
- CODE_OF_CONDUCT.md — Contributor Covenant 2.1
All Dockerfiles, CI workflows, docker-compose, railway.toml,
render.yaml, README, CLAUDE.md updated for new directory names.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Renames: - platform/ → workspace-server/ (Go module path stays as "platform" for external dep compat — will update after plugin module republish) - workspace-template/ → workspace/ Removed (moved to separate repos or deleted): - PLAN.md — internal roadmap (move to private project board) - HANDOFF.md, AGENTS.md — one-time internal session docs - .claude/ — gitignored entirely (local agent config) - infra/cloudflare-worker/ → Molecule-AI/molecule-tenant-proxy - org-templates/molecule-dev/ → standalone template repo - .mcp-eval/ → molecule-mcp-server repo - test-results/ — ephemeral, gitignored Security scrubbing: - Cloudflare account/zone/KV IDs → placeholders - Real EC2 IPs → <EC2_IP> in all docs - CF token prefix, Neon project ID, Fly app names → redacted - Langfuse dev credentials → parameterized - Personal runner username/machine name → generic Community files: - CONTRIBUTING.md — build, test, branch conventions - CODE_OF_CONDUCT.md — Contributor Covenant 2.1 All Dockerfiles, CI workflows, docker-compose, railway.toml, render.yaml, README, CLAUDE.md updated for new directory names. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |