Adds pre-test DELETE + defensive index creation to
TestIntegration_BroadcastOrgRoot_NonRootSenderResolvesToRoot so a
prior crashed run (or stale shared DB) does not leave rows that
collide on workspaces_parent_name_uniq.
Does not touch production logic.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ci-arm64-advisory / fast-checks (push) Compensated by status-reaper (push run was cancelled/superseded; Gitea 1.22.6 reports cancelled runs as failure statuses)
Extracted clean from bundled #1985 (which mixed these tests with a tracker
rename + cancel-in-progress flips that are being handled separately). Two
test files only; reuse existing withMockDB/makeReq/wsUUID* harness from
tokens_sqlmock_test.go; no production code changed.
mc#774 reached its 14-day renewal cap (19 days old), failing
lint-continue-on-error-tracking on every workflow-touching PR. This
renames the tracker reference to the fresh renewal tracker mc#1982
(open, filed 2026-05-28) across all continue-on-error mask comments.
Comment-only; ZERO continue-on-error masks flipped, zero behavior
change. Pure unblock. A real per-mask triage (which of these can flip
to continue-on-error: false) is tracked separately and is due before the
2026-06-11 mc#1982 due date; this PR does not do that triage, it only
renews the tracker so the workflow-PR batch can merge.
CI / Canvas Deploy Reminder (push) Compensated by status-reaper (push run was cancelled/superseded; Gitea 1.22.6 reports cancelled runs as failure statuses)
The workspace_status enum (migrations 043/046) has no 'running' value;
valid alive state is 'online'. Seed INSERTs used 'running' -> pq rejects
it at setup, failing TestIntegration_BroadcastOrgRoot_NonRootSenderResolvesToRoot.
Masked until now because Handlers Postgres Integration kept failing at the
runner node/checkout step (ded docker-host:host). Status is irrelevant to the
org-root CTE (it walks parent_id); 'online' is the correct alive value.
Two handlers used %v for error values in fmt.Errorf, preventing
callers from using errors.Is/As. Switch to %w.
- ssrf.go: DNS resolution error
- org_plugin_allowlist.go: requireCallerOwnsOrg error
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
readWorkspaceDeriveInputs (llm_billing_mode.go) and scanAuditRows (audit.go)
both iterated rows.Next() without checking rows.Err() after the loop.
Add the check so iteration errors are not silently swallowed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three handlers ignored errors from Result.RowsAffected():
- admin_schedules_health.go: ReapOrphans repointedN / disabledN
- org_import.go: migrateRuntimeSchedulesFromRemovedPredecessor
- llm_billing_mode.go: SetWorkspaceLLMBillingMode (clear + set paths)
All now log/return the error instead of silently discarding it.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The org-root recursive CTE in workspace_broadcast.go pinned `id AS root_id`
to the SENDER's own id at the anchor and carried it unchanged up the
parent_id chain. The final `SELECT root_id ... WHERE parent_id IS NULL`
therefore returned the sender's id, not the actual org root — so a
NON-root sender resolved ITSELF as the org root, scoping the broadcast to
the wrong subtree (the OFFSEC-015 org-isolation guarantee was correct for
root senders but wrong for any child workspace).
Fix: drop the bogus carried `root_id` column and select the id of the
row whose parent_id IS NULL (the true topmost ancestor). The walk
direction (JOIN org_chain c ON w.id = c.parent_id) was already correct.
Trace (leaf->mid->root): now resolves leaf and mid to root, root to
itself.
Adds a REAL Postgres integration test (build tag `integration`,
Handlers Postgres Integration CI) that seeds a 3-level chain and asserts
every node resolves to the true root — sqlmock cannot execute the CTE so
the existing unit tests could not catch this. Original staging reference:
closed PR #2090 (verified + applied cleanly, org-root hunk only).
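
The intended resolution can be modelled in memory as below. This is a sketch of the semantics only, not the SQL in workspace_broadcast.go: parentOf is a hypothetical map from workspace id to parent id, with the empty string standing for parent_id IS NULL.

```go
package main

import "fmt"

// orgRoot follows parent links from sender until it reaches the row with
// no parent, and returns that row's id. It reports false for an unknown
// id or a cycle (the iteration bound stops runaway walks).
func orgRoot(parentOf map[string]string, sender string) (string, bool) {
	cur := sender
	for i := 0; i <= len(parentOf); i++ {
		parent, ok := parentOf[cur]
		if !ok {
			return "", false
		}
		if parent == "" {
			return cur, true // topmost ancestor, not the sender
		}
		cur = parent
	}
	return "", false
}

func main() {
	chain := map[string]string{"root": "", "mid": "root", "leaf": "mid"}
	for _, id := range []string{"leaf", "mid", "root"} {
		r, _ := orgRoot(chain, id)
		fmt.Println(id, "->", r)
	}
}
```

The pre-fix behavior corresponds to returning sender unconditionally, which is only right when the sender is already the root.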
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Byte-syncs workspace-server/internal/providers/providers.yaml to the
controlplane canonical after cp#432 (kimi-coding base_url /v1 proxy-404
fix + google gemini OpenAI-compat base_url). Repins
canonicalProvidersYAMLSHA256. registry_gen unchanged (base_url is not in
the model-id projection).
Adds the authoritative OpenAPI 3.1 management contract (management.yaml) + README — the SSOT the management MCP/CLI/API-docs derive from (RFC#1706); closes the (c) OpenAPI gap in PLATFORM-MANAGEMENT-API.md §5. redocly-lint clean; source-grounded against router+handler. SOP merge ceremony complete: 7/7 sop-acks (engineers), qa+security APPROVE, 4 approvals; 3 BP-required CI contexts green (E2E no-op no-paths-change success).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Root cause (live RCA): the Gitea Actions run-scheduler is throughput-
starved by workflow fan-out. A single PR-head commit triggers ~65 runs;
the `all-required` sentinel was a status-POLLING loop that held a
`ci-meta` executor slot (only 2 in the lane) for up to 40 min per PR;
and several cheap meta-lints fired as separate runs on every commit.
Two fixes, both branch-protection-preserving:
1. all-required: poll-gate → plain `needs:` aggregator (ci.yml).
Was: detect-changes + a 40-min `GET /commits/{sha}/statuses` poll
loop on the ci-meta lane (confirmed slot-squat in the RCA — two
concurrent JOB-all-required containers pinning the 2-slot lane).
Now: `needs: [changes, platform-build, canvas-build, shellcheck,
python-lint]` + a sub-second inline result-check (no API, no poll,
no checkout). Frees the slot immediately.
Safe because every aggregated job now gates real work PER-STEP
(`if: needs.changes.outputs.* != 'true'`), so it always reaches a
terminal SUCCESS and is never `skipped`. Plain `needs:` (WITHOUT
`if: always()`) works on Gitea 1.22.6 / act_runner v0.6.1 — only
`needs:` + `if: always()` is broken
(feedback_gitea_needs_works_only_ifalways_broken). canvas-deploy-
reminder is event-gated (`if: github.ref...`) so it is intentionally
excluded. The needs: set equals ci-required-drift.py's ci_job_names()
so F1 stays clean (verified + now unit-pinned).
The required context name `CI / all-required (<event>)` is UNCHANGED.
2. Cut fan-out:
- Consolidated lint-no-tenant-gitea-token.yml INTO
lint-forbidden-env-keys.yml as a second job (scan-tenant-token-
write). Two sub-second Go-source greps that fired as two separate
workflow runs per PR → one run, one checkout. Both still fire on
every PR (no paths filter; RFC#523 threat model preserved). The
moved job keeps its exact `name:` + `# bp-exempt:` directive
(Tier 2g); the old `Lint no tenant GITEA…` context is retired.
- Added a `paths:` filter to verify-providers-gen.yml (Go toolchain,
~8min) scoped to the codegen surface. SAFE: it is NOT a branch-
protection required context, so lint-required-no-paths permits it.
Branch-protection required contexts are unchanged (CI / all-required,
E2E API Smoke Test, Handlers Postgres Integration, sop-checklist /
all-items-acked). No paths filter was added to any required emitter.
Tests: updated test_ci_workflow_bookkeeping.py to pin the new needs:
aggregator shape + the no-if:always() hazard + the F1-lockstep
invariant (watched the old assertions fail, then pass on the new shape).
Full .gitea/scripts/tests suite (192) + affected tests/ lints green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Verified each against the authoritative handler source (molecule-core
workspace-server + molecule-controlplane) before editing:
1. tenantAdminToken: http/bearer -> apiKey header X-Molecule-Admin-Token.
authenticateTenant (controlplane workspace_provision.go) reads that
header, NOT Authorization, and derives org from the token
(SELECT org_id ... WHERE admin_token=$1). Removed orgRoutingHeaderId
from the DELETE /api/v1/workspaces/{workspace_id} security — no
X-Molecule-Org-Id is read on deprovision.
2. ProvisionStatus.stage: added `failed` (emitted by orgs.go on
failed/deprovisioning/deprovisioned). Existing launching/installing/
starting/configuring_https/ready all confirmed emitted by
orgs_progress.go + estimateBootProgress — none trimmed.
3. GET /workspaces/{id}: set security: [] — router.go registers it
outside every auth group (intentionally open for canvas-node self-
polling). Dropped the now-inapplicable 401.
4. Multi-period budget shape: added `budget_limits` (canonical) + legacy
`budget_limit` to PatchBudgetRequest, and `periods` (+ PeriodBudget)
to BudgetResponse, matching budget.go budgetResponse/PatchBudget.
5. GET tenant llm-billing-mode already modeled (handler serves GET+PUT) —
no change needed; verified.
6. Added prune=true destructive note (only literal "true" permanently
deletes, internal#734) and the CP-admin
/api/v1/admin/workspaces/{id}/llm-billing-mode GET+PUT pair
(cpAdminBearer, requires ?org_slug=).
redocly lint clean under both recommended and recommended-strict.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Author workspace-server/docs/openapi/management.yaml — the hand-authored,
authoritative OpenAPI 3.1 contract for the Molecule platform MANAGEMENT
surface, spanning both services in one spec:
- CP (api.moleculesai.app, /api/v1/*): orgs create/get/list/delete/export/
provision-status, public instance lookup, billing (invoices/checkout/
portal/topup), admin (admin-create-org w/ dry_run, tenant delete +
scrub w/ confirm guard, diagnostics, redeploy + fleet, workspace env
w/ force guard, ListOrgWorkspaces, admin-token, thin-ami + runtime-image
pins), provisioning (provision w/ 422 RUNTIME_PIN_MISSING, deprovision,
status).
- Tenant workspace-server: /workspaces[/:id] CRUD + restart/pause/resume,
budget, llm-billing-mode, /workspaces/:id/secrets, /settings/secrets,
/org/import, /org/templates, /org/tokens (Org API Key mint/revoke),
/templates[/import], /bundles export/import.
Defines the five security tiers as securitySchemes (workosSession cookie,
cpAdminBearer, provisionSecret [+ tenantAdminToken on deprovision], orgApiKey
+ org routing header, workspaceToken) and applies the correct scheme(s)
per-route. Dry-run / confirm / force guards modelled per-operation.
Grounded in the router + handler sources (controlplane + workspace-server),
not just the synthesis doc — notably llm-billing-mode is modelled on the
real tenant route (/admin/workspaces/:id/llm-billing-mode, AdminAuth), with
the divergence from the synthesis doc noted in the README.
Adds README.md documenting the two-service split + the security-scheme→
surface tier matrix. This is the SSOT the management MCP + CLI + docs derive
from (PLATFORM-MANAGEMENT-API.md §5c / RFC #1706). Supersedes the swaggo
/schedules stub for the management surface; runtime surface stays out of scope.
Per dev-sop Phase 1-4 + Five-Axis self-review (in PR body).
Lints clean: npx @redocly/cli lint management.yaml (0 errors, 0 warnings).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
When a workspace NAMES a runtime but the config.yaml about to be seeded
declares a different top-level runtime, refuse to launch and surface
WORKSPACE_PROVISION_FAILED — the symmetric counterpart to selectImage's
ErrUnresolvableRuntime guard, on the config/template side.
Pre-fix: if a runtime's workspace template wasn't in the tenant cache at
provision time (or sanitizeRuntime coerced an unknown runtime), config
seeding silently fell back to claude-code-default. The image+env said
e.g. google-adk but the seeded config said claude-code, so the agent
booted mislabeled and personaless yet looked 'online' and returned canned
non-answers (hit the molecule-adk-demo hackathon org: 4 google-adk agents).
The guard is in prepareProvisionContext (shared by Docker + SaaS paths).
Empty requested runtime (org-template default path) and an indeterminate
seeded runtime (CP mode, no local config bytes) are both allowed — it only
fails on a concrete, contradictory signal.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Continuous synthetic E2E (staging) / Synthetic E2E against staging (push) Compensated by status-reaper (workflow has no push: trigger; Gitea 1.22.6 hardcoded-suffix bug — see .gitea/scripts/status-reaper.py)
molecule-core's synced copy of the provider registry was stale relative to
controlplane cp#423/#426, which split `openai`→`openai-subscription`
(auth_env CODEX_AUTH_JSON, IsPlatform false) / `openai-api` (OPENAI_API_KEY).
The stale copy derived codex→`openai` (and got band-aided to platform_managed),
producing "OpenAI requires OPENAI_API_KEY" + "codex adapter: no platform
provider" RuntimeError.
Sync to CP SSOT (CP HEAD fa44dc8), verbatim:
- providers.yaml, derive_provider.go, providers.go, and the
derive/providers/runtimes tests copied byte-exact from controlplane.
- regenerated gen/registry_gen.go via `go generate` (now carries the
openai-subscription entry: AuthEnv CODEX_AUTH_JSON, IsPlatform false).
- bumped canonicalProvidersYAMLSHA256 to the new synced-copy sha
(dedbb8cc…f76187) so the hermetic drift gate stays green.
Core-only manual edit (CP has no such map):
- secrets.go: add CODEX_AUTH_JSON to platformManagedDirectLLMBypassKeys so the
byok credential check counts the global CODEX_AUTH_JSON (codex byok now
provisions with the shared subscription token) and strips it under
platform-managed.
With the synced derive, codex+CODEX_AUTH_JSON → openai-subscription →
IsPlatform false → byok automatically via the existing billing resolver;
no derive logic was hand-edited and llm_billing_mode.go is untouched.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Multiple codex workspaces share ONE ChatGPT-Pro OAuth token (global_secrets
key CODEX_AUTH_JSON). OpenAI's refresh_token is single-use, so letting each
per-agent codex app-server refresh on its own 401 burned the shared seed within
seconds (a refresh storm → token_invalidated + "refresh token already used").
This adds a single platform-side owner of the refresh:
- internal/codexauth/refresher.go: one background goroutine, structurally
single-flight (one goroutine + package mutex). Reads the global
CODEX_AUTH_JSON, decodes the access_token JWT exp, and only within a safety
margin of expiry POSTs the refresh_token ONCE per due cycle, then re-encrypts
and writes the rotated blob back to global_secrets. Inert when the secret is
absent; on a permanent failure (invalid_grant / "already used") it logs once
and does NOT hot-loop. Billing-mode resolution + byok are untouched.
- cmd/server/main.go: wired under supervised.RunWithRecover like the other
background sweeps.
Pairs with the codex template's codex_auth_sync.sh (GET-only re-sync; per-agent
OAuth POST disabled) so workspaces only consume the current token and never
rotate it themselves.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Railway pin audit (drift detection) / Audit Railway env vars for drift-prone pins (push) Compensated by status-reaper (workflow has no push: trigger; Gitea 1.22.6 hardcoded-suffix bug — see .gitea/scripts/status-reaper.py)
Empty commit on the PR branch to get a clean CI run; the prior run's
tasks were orphaned by the 2026-05-31 08:30 gitea restart (task-not-found).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The ConfigTab runtime dropdown filtered GET /templates through a hardcoded
SUPPORTED_RUNTIME_VALUES allowlist (claude-code/codex/openclaw/hermes).
google-adk shipped in manifest.json + the workspace-server knownRuntimes
registry but was dropped by this frontend Set, so a google-adk workspace's
Config tab rendered the wrong runtime option and a Save would clobber the
runtime to the wrong value.
Make the picker trust the backend SSOT: /templates is already gated to the
manifest-maintained set by loadRuntimesFromManifest. Remove the allowlist;
hide a runtime only when its template declares displayable:false (new
optional flag plumbed manifest config.yaml -> templateSummary -> /templates).
- canvas/ConfigTab.tsx: drop SUPPORTED_RUNTIME_VALUES; filter on
r.displayable===false; add google-adk to offline FALLBACK list.
- workspace-server templates.go: add Displayable *bool (yaml+json,
omitempty) so a template can opt out of the picker declaratively.
- tests: ConfigTab.googleAdk.test.tsx (google-adk selected + displayable
hidden) + TestTemplatesList_DisplayableFlag (nil/true/false + JSON contract).
Refs project_canvas_runtime_dropdown_ssot_fix.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The user-facing choice for the prune/persistence backend:
- ContainerConfigTab: a 'Saved data' selector (Auto / Always keep / Don't keep)
→ compute.data_persistence (omitted when Auto = unchanged wire/default).
- DetailsTab delete: an 'also erase saved data' checkbox → DELETE
?erase_data=true (default off keeps it for the orphan-sweeper grace).
- WorkspaceCompute.data_persistence type.
+test: erase checkbox sends erase_data=true; default delete unchanged. The 37
ContainerConfigTab+DetailsTab tests pass; my files typecheck clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The caller side of the recreate-safe prune (cp#415 Five-Axis F1): the prune
signal reaches CP ONLY on a permanent user-delete-with-erase, NEVER on
restart/recreate/reconcile.
- CPProvisionerAPI.StopAndPrune (CPProvisioner builds DELETE with &prune=true;
Stop never does — shared stopInternal).
- cpStopWithRetryErr(...prune): restart/hibernate pass false; delete passes the
user choice.
- stopWorkspaceForDelete(...erase) → CascadeDelete(...erase): HTTP Delete reads
?erase_data=true (opt-in; default keeps data for the orphan-sweeper grace);
org-import reconcile passes false.
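
The recreate-safety property can be sketched as a URL builder in which prune=true appears only when explicitly requested. Everything named here is hypothetical (stopURL, eraseRequested, the base URL and path layout); treating only the literal "true" as opt-in for erase_data is also an assumption of this sketch.

```go
package main

import (
	"fmt"
	"net/url"
)

// stopURL builds the DELETE target. The prune flag is added only when the
// caller asks for it, so the plain Stop path can never carry it.
func stopURL(base, workspaceID string, prune bool) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", fmt.Errorf("parse base: %w", err)
	}
	u = u.JoinPath("workspaces", workspaceID)
	q := u.Query()
	if prune {
		q.Set("prune", "true")
	}
	u.RawQuery = q.Encode()
	return u.String(), nil
}

// eraseRequested reports the user's opt-in from a raw query string.
// Anything other than erase_data=true keeps the data.
func eraseRequested(rawQuery string) bool {
	q, err := url.ParseQuery(rawQuery)
	if err != nil {
		return false
	}
	return q.Get("erase_data") == "true"
}

func main() {
	stop, _ := stopURL("https://cp.example/api/v1", "ws-1", false)
	pruned, _ := stopURL("https://cp.example/api/v1", "ws-1", eraseRequested("erase_data=true"))
	fmt.Println(stop)
	fmt.Println(pruned)
}
```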
Discriminating test: Stop sends NO prune=true (recreate-safety), StopAndPrune
sends it. All CPProvisionerAPI mocks gain StopAndPrune. Full handlers+provisioner
suite + vet + gofmt green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Threads the user's durable-data choice from the workspace Compute config
through to CP's provision request, so a user can pick persist vs ephemeral
per workspace (the caller side of cp#410's data_persistence support).
- models.WorkspaceCompute.DataPersistence (persisted in the compute JSONB)
- validateWorkspaceCompute: enum guard (persist|ephemeral|"") → clear 400
before the CP round-trip; CP re-validates at its edge (defense in depth)
- WorkspaceConfig.DataPersistence + workspace_provision build site
- cpProvisionRequest.data_persistence (omitempty → ""=auto omitted on wire)
Empty/auto = today's behavior; forward-compatible (inert until CP deploys
cp#410). +validator enum test. build/vet/test/gofmt green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
First cold boot of a google-adk workspace pulls a large fresh ADK image;
the default 300s online wait can read a slow first pull as "failed".
Bump google-adk's wait to 180 iters (900s), matching the rationale for
hermes' extended window. No behavior change for other runtimes.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
google-adk was registered (manifest, provisioner, canvas, CP pin +
allowlist) but had no e2e coverage. Add it everywhere the other
runtimes sit so it is exercised "like other runtimes":
- scripts/test-all-runtimes-a2a-e2e.sh: provision + provider-key +
online + A2A round-trip + session-continuity loops now include
google-adk (5 runtimes). AI-Studio key via GOOGLE_API_KEY → workspace
secret; SKIP_GOOGLE_ADK guard mirrors the other SKIP_* flags.
- e2e-staging-saas.yml + continuous-synth-e2e.yml: add the
`google-adk)` per-runtime LLM-key case (expects
MOLECULE_STAGING_GOOGLE_API_KEY) + E2E_GOOGLE_API_KEY env + the
gemini model slug. Same dispatch-gated shape as codex/hermes/langgraph
(Gitea drops workflow_dispatch.inputs, so E2E_RUNTIME-driven).
Auth note: PROD disallows API keys (Vertex+ADC there); CI uses the
keyed AI-Studio path (config model google_genai:gemini-2.5-pro). Vertex
stays the supported prod path. The MOLECULE_STAGING_GOOGLE_API_KEY
secret must be set for a green google-adk run (documented in-file).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
RUNTIME_OPTIONS gained 'Google ADK' but the test's hardcoded expected array
(separate-selectors test) still listed 4 → Canvas (Next.js) CI red (5 vs 4).
Add it in component order (after OpenAI Codex CLI). Caught by comprehensive
pre-merge review — a real regression from this PR's own diff, not the
staging-E2E infra flake.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Extends the single monthly per-workspace budget to four independent ROLLING
windows so a workspace can be capped per hour/day/week/month (#49 — gives the
canvas Budget tab a real lever against runaway LLM spend, e.g. the reno-stars
opus drain). SSOT design:
- budget_periods.go = single source of truth: the period set + rolling windows,
one FILTERed per-period spend query over the ledger, and the PURE
parse/encode/exceededPeriods logic. Add a period = one line here.
- migration: workspaces.budget_limits jsonb (canonical config, backfilled from
the legacy monthly budget_limit) + workspace_spend_events ledger.
- heartbeat (registry.go): derive the spend INCREMENT from the agent's existing
cumulative report (delta vs prev; reset-aware) → ledger row. Server owns
windowing; NO runtime change.
- budget.go GET/PATCH: per-period limit/spend/remaining; accepts the new
{budget_limits:{...}} shape AND the legacy {budget_limit} (→ monthly); legacy
response fields still emitted + budget_limit kept synced (rollout back-compat).
A limit of 0 = block-all (preserved); null/absent = no limit.
- a2a_proxy.go checkWorkspaceBudget: 402 if ANY configured period's rolling
window spend >= its limit; fail-open on DB error.
- canvas BudgetSection: four period rows (USD limit input + spend/limit + bar).
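
The two pure pieces described above can be sketched as follows. The period keys, the use of integer cents, and the function bodies are assumptions for illustration; budget_periods.go and registry.go are not reproduced here. In particular, treating a cumulative report lower than the previous one as a counter reset (so the whole new value counts as the increment) is this sketch's reading of "reset-aware".

```go
package main

import "fmt"

// periods is the illustrative period set, in reporting order.
var periods = []string{"hour", "day", "week", "month"}

// exceededPeriods returns every configured period whose rolling-window
// spend has reached its limit. A nil limit means "no limit"; a limit of 0
// blocks everything because any spend, including 0, is >= 0.
func exceededPeriods(limits map[string]*int64, spend map[string]int64) []string {
	var out []string
	for _, p := range periods {
		lim := limits[p]
		if lim == nil {
			continue
		}
		if spend[p] >= *lim {
			out = append(out, p)
		}
	}
	return out
}

// spendIncrement derives the ledger delta from two cumulative reports.
func spendIncrement(prev, cur int64) int64 {
	if cur < prev {
		return cur // counter restarted: count the new total from zero
	}
	return cur - prev
}

func main() {
	hourly, monthly := int64(500), int64(10000)
	limits := map[string]*int64{"hour": &hourly, "month": &monthly}
	spend := map[string]int64{"hour": 620, "day": 900, "month": 4000}
	fmt.Println(exceededPeriods(limits, spend)) // [hour]
	fmt.Println(spendIncrement(1200, 1500))     // 300
	fmt.Println(spendIncrement(1500, 40))       // 40
}
```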
Tests: pure SSOT (parse/encode/exceededPeriods); GET/PATCH + multi-period +
A2A enforcement (sqlmock, migrated to the new two-query flow); shared
expectBudgetCheck helpers updated; canvas behavioral + per-period progress/aria.
go build + vet + full handlers suite + migrations + canvas vitest all green.
NOTE: the duplicate components/__tests__/BudgetSection.test.tsx (old single-limit
UI) was repurposed to a focused per-period progress/aria suite — behavioral
coverage now lives in tabs/__tests__/BudgetSection.test.tsx (one component, no
parallel identical suites).
Refs #49.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
publish-workspace-server-image / Production auto-deploy (push) Compensated by status-reaper (push run was cancelled/superseded; Gitea 1.22.6 reports cancelled runs as failure statuses)
internal#2006 — backstops for the recreate-orphans-schedules class. The
primary fix is migration-on-recreate (separate PR); these are defense-in-depth
so a future regression is detected + recoverable instead of silent.
GET /admin/schedules/health reports only LIVE workspaces' schedules
(JOIN … WHERE status != 'removed'), so a schedule stranded on a
removed/recreated workspace silently stops firing and never shows there —
which is exactly why tonight's orphans went unnoticed.
- GET /admin/schedules/orphans (Orphans): the monitor surface — lists every
schedule bound to a removed OR missing workspace (id, name, source, enabled,
ws_status). A monitor polls this and pages on non-empty.
- POST /admin/schedules/reap-orphans (ReapOrphans): the cleaner — re-points
runtime schedules onto the live successor agent (matched by role+parent),
then disables any remaining dead-bound schedules so the scheduler stops
firing into removed workspaces. Idempotent; returns {repointed, disabled}.
Health() is unchanged (no churn to its tests). +2 tests, +2 routes. Build +
handler tests green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
internal#2006 — recreating an agent orphans its schedules.
Root cause: createWorkspaceTree's INSERT … ON CONFLICT (parent_id,name)
WHERE status != 'removed' only matches NON-removed rows, so when an agent
is recreated after its prior workspace was marked removed, a brand-new
workspace id is minted. Reconcile then re-derives template-sourced state
(MODEL, template schedules via the upsert loop), but schedules a user added
at runtime (source='runtime', via the canvas/API) bind to the ephemeral
workspace_id and are abandoned on the removed row — they silently stop
firing (the 2026-05-29 agents-team incident: all 5 *-autonomous-tick
schedules, source=runtime, orphaned on removed ids; canvas showed
"missing schedulers").
Fix: after a fresh insert, migrate runtime-created schedules from the
most-recent removed predecessor of the same agent onto the new workspace.
The predecessor is matched by the stable `role` (survives the name
auto-suffixing that yields "Agent (2)"), falling back to name+parent.
Template-sourced schedules are NOT migrated (reconcile re-derives those);
runs before the template upsert loop so a same-named template schedule
still wins; skips names already present on the new workspace; best-effort
(logs, never errors the import).
Tests: predecessor-found re-points; no-predecessor (first create) does NOT
run the UPDATE; name-fallback branch.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Both tutorials cited misattributed PRs and claimed shipped runtimes that
didn't exist (RFC internal#730 finding):
- google-adk-runtime.md: cited 'PR #550' (actually a MemoryTab test suite) +
'already first-class'. Rewritten to the REAL implementation — ADK engine-only
(google-adk[mcp]==2.1.0, no [a2a]), Vertex AI via ADC (keyless), a2a-1.x
bridge — with correct PR refs (template PR #1, core #2003, ci #26) + a
landing-status banner.
- gemini-cli-runtime.md: cited 'PR #379' (actually CI cleanup); no gemini-cli
runtime exists in manifest/knownRuntimes. Added a correction banner pointing
to the real google-adk runtime.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
internal#718 retired the org-level LLM billing rung (billing is resolved
per-workspace now). SetGlobal still called the legacy org-env guard
rejectPlatformManagedDirectLLMBypass, which reads MOLECULE_LLM_BILLING_MODE and
400s any vendor/oauth key write when the (legacy) org default is
platform_managed. That blocked setting a tenant's own MINIMAX_API_KEY (or any
custom-provider key) at global scope on a byok tenant — agents-team hit "direct
Hermes custom provider secrets are blocked for platform-managed LLM workspaces".
A global secret is the tenant's OWN shared credential. The provision-time
provider-matched strip (workspace_provision, core#2000) already removes any
global cred a given workspace's resolved provider does not accept, and the
platform-managed path strips bypass keys at provision too — so a platform-managed
workspace can never USE a non-matching global vendor/oauth key. The SetGlobal
org-env gate was redundant belt-and-suspenders keyed off the retired rung.
- SetGlobal: remove the org-level guard call.
- Delete the now-dead legacy helpers platformManagedLLMMode +
rejectPlatformManagedDirectLLMBypass (org-env shims; the per-workspace
successors rejectPlatformManagedDirectLLMBypassForWorkspace /
platformManagedLLMModeForWorkspace remain and still gate per-workspace writes).
- Tests: convert the obsolete platform-managed rejection test into
TestSetGlobal_AllowsTenantOwnedVendorKeyDespiteLegacyOrgEnv (asserts the global
write SUCCEEDS even with the legacy env still set to platform_managed).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
#1995 removed the blanket global-LLM-cred strip on the byok branch (correct for
the platform-key co-mingling it targeted), but left EVERY claude-code workspace
inheriting the tenant-global CLAUDE_CODE_OAUTH_TOKEN. The claude-code runtime
greedily prefers that oauth (llm-auth: detected oauth -> api.anthropic.com), so
a workspace whose RESOLVED provider is NOT anthropic-oauth (minimax, kimi-byok)
routes its non-Anthropic model to Anthropic -> "Claude Code returned an error
result" (agents-team Dev Engineer B, MiniMax-M2.7; live-confirmed 2026-05-28 via
SSM container logs, internal#728 comment 52493).
Fix: provider-AWARE replacement for the over-removed strip. On the byok/disabled
branch, keep ONLY the global-origin LLM bypass creds whose env-var name is in
the RESOLVED provider's auth_env; strip the rest.
- minimax auth_env MINIMAX_API_KEY/ANTHROPIC_AUTH_TOKEN/ANTHROPIC_API_KEY ->
stray global CLAUDE_CODE_OAUTH_TOKEN is non-matching -> stripped (fixes DevB).
- anthropic-oauth auth_env CLAUDE_CODE_OAUTH_TOKEN -> matches -> kept (PM opus +
reno opus-byok NOT regressed; #1994 ByokGlobalScopeOAuthSurvives guard holds).
NOT a return to the blanket strip (which would re-break the byok-anthropic-oauth
case #1994 fixed) — keyed off DeriveProvider's resolved provider.
Provenance-scoped: only operator-store (global_secrets) origin keys are
provider-gated. User-authored workspace_secrets (provenance flag cleared by
loadWorkspaceSecrets) are NEVER stripped — JRS kimi workspace-key, reno's own
oauth are exempt. Fail-OPEN: an underivable provider / unavailable registry
strips nothing (keep-first; worst case is a kept stray, never removing the only
usable cred -> never fail-closes a legitimate byok workspace).
Threads loadWorkspaceSecrets's globalKeys provenance side-channel into
applyPlatformManagedLLMEnv (signature +map[string]struct{}); caller
prepareProvisionContext already has it.
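
The keep/strip rule can be sketched as below. stripNonMatching and the short bypassKeys list are hypothetical; the real applyPlatformManagedLLMEnv signature is only partly described above. The three properties shown are provider matching, the provenance exemption for user-authored keys, and fail-open when the provider's auth_env is unknown.

```go
package main

import "fmt"

// bypassKeys is an illustrative subset of the LLM credential env names.
var bypassKeys = []string{
	"CLAUDE_CODE_OAUTH_TOKEN",
	"MINIMAX_API_KEY",
	"ANTHROPIC_AUTH_TOKEN",
	"ANTHROPIC_API_KEY",
}

// stripNonMatching removes global-origin credentials that the resolved
// provider does not accept. Keys absent from globalKeys are user-authored
// and never touched. An empty authEnv (underivable provider) strips nothing.
func stripNonMatching(env map[string]string, globalKeys map[string]struct{}, authEnv []string) {
	if len(authEnv) == 0 {
		return // fail-open
	}
	keep := make(map[string]struct{}, len(authEnv))
	for _, k := range authEnv {
		keep[k] = struct{}{}
	}
	for _, k := range bypassKeys {
		if _, global := globalKeys[k]; !global {
			continue
		}
		if _, ok := keep[k]; ok {
			continue
		}
		delete(env, k)
	}
}

func main() {
	env := map[string]string{"CLAUDE_CODE_OAUTH_TOKEN": "oauth", "MINIMAX_API_KEY": "mm"}
	global := map[string]struct{}{"CLAUDE_CODE_OAUTH_TOKEN": {}, "MINIMAX_API_KEY": {}}
	minimax := []string{"MINIMAX_API_KEY", "ANTHROPIC_AUTH_TOKEN", "ANTHROPIC_API_KEY"}
	stripNonMatching(env, global, minimax)
	_, stillThere := env["CLAUDE_CODE_OAUTH_TOKEN"]
	fmt.Println(stillThere, env["MINIMAX_API_KEY"]) // false mm
}
```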
Tests (llm_billing_mode_provision_parity_test.go):
- MinimaxStripsStrayGlobalOAuth — DevB repro: minimax-resolving ws strips the
stray global oauth + keeps MINIMAX_API_KEY routing.
- WorkspaceOriginCredExemptFromStrip — user-authored ws_secrets cred survives
even when non-matching.
- ByokGlobalScopeOAuthSurvives (strengthened) — global-origin oauth on opus
SURVIVES via provider match (PM/reno regression guard).
Mutation-load-bearing (verified RED): (1) remove strip -> blanket-keep regresses
DevB; (2) empty keep set (provider-unaware) -> minimax routing + reno oauth
stripped; (3) iterate all bypass keys (provenance-unaware) -> user-authored cred
stripped.
build ok; build -tags=integration ok; go test ./internal/handlers/ ok;
golangci-lint ./internal/handlers/ -> 0 issues. Refs internal#728.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The production auto-deploy aggregated per-tenant redeploy-fleet results
but never asserted fleet COVERAGE: a tenant that was enumerated but
silently skipped, or that SSM-succeeded onto the old image, passed as a
clean deploy. That is how agents-team stayed 46h behind the fleet with no
straggler reported.
Pairs with the controlplane fix that adds per-tenant verified_on_target
(docker-inspect proof the container is on the target tag). This change:
- rollout_stragglers(): every enumerated tenant NOT proven on the target
build is a straggler — errored, skipped (no result row, the agents-team
class), or verified_on_target=false. Backward-compatible: a missing key
(pre-fix CP) is treated as verified so the gate degrades to the old
ok-based behavior against an un-upgraded CP rather than failing spuriously.
- assert_full_coverage(): raises RolloutFailed (→ non-zero exit, response
JSON written with ok=false + stragglers) when any straggler remains
after a non-dry-run rollout. A dry run asserts nothing (it proves
nothing landed).
- publish-workspace-server-image.yml: per-tenant summary gains an
"On target" column and a loud ⚠ Stragglers section; the step emits a
::error:: naming the off-target tenants before failing.
Tests: straggler detection (off-target, no-result, dry-run-skip,
backward-compat missing key) + end-to-end execute_scoped_rollout fail/pass
— mutation-verified RED with the coverage gate removed. All existing
prod-auto-deploy tests still pass; ruff + py_compile clean; workflow YAML
validates.
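
A rough sketch of the coverage gate, written in Go for illustration only (the change itself lives in the Python deploy script). The type and function names are hypothetical; the three straggler classes and the missing-key back-compat rule are taken from the description above.

```go
package main

import (
	"fmt"
	"sort"
)

// tenantResult is a hypothetical per-tenant result row. VerifiedOnTarget is a
// pointer so a missing key (pre-fix control plane) differs from an explicit false.
type tenantResult struct {
	OK               bool
	VerifiedOnTarget *bool
}

// rolloutStragglers returns every enumerated tenant not proven on the target
// build: no result row, errored, or verified_on_target=false. A missing
// verified_on_target key is treated as verified.
func rolloutStragglers(enumerated []string, results map[string]tenantResult) []string {
	var out []string
	for _, tenant := range enumerated {
		r, found := results[tenant]
		switch {
		case !found:
			out = append(out, tenant) // enumerated but silently skipped
		case !r.OK:
			out = append(out, tenant) // errored
		case r.VerifiedOnTarget != nil && !*r.VerifiedOnTarget:
			out = append(out, tenant) // succeeded onto the old image
		}
	}
	sort.Strings(out)
	return out
}

// assertFullCoverage fails a real rollout that leaves stragglers.
// A dry run asserts nothing.
func assertFullCoverage(dryRun bool, stragglers []string) error {
	if dryRun || len(stragglers) == 0 {
		return nil
	}
	return fmt.Errorf("rollout incomplete, stragglers: %v", stragglers)
}

func main() {
	offTarget := false
	results := map[string]tenantResult{
		"tenant-a": {OK: true},
		"tenant-b": {OK: true, VerifiedOnTarget: &offTarget},
	}
	s := rolloutStragglers([]string{"tenant-a", "tenant-b", "tenant-c"}, results)
	fmt.Println(s, assertFullCoverage(false, s))
}
```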
Refs: internal#724
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Corrected-model credential fix (CTO-confirmed). `global_secrets` is the
TENANT's own secret store (shared across that tenant's workspaces), NOT the
platform's. The platform's own LLM credential is the CP proxy usage token,
injected separately on the platform_managed path; it is never stored in a
tenant's global_secrets.
The internal#711 provider-aware strip rested on the inverted premise that a
global-scope LLM credential was "the platform's own". On the byok/disabled
branch it stripped the tenant's OWN oauth when that oauth lived at global
scope, leaving the workspace credential-less -> MISSING_BYOK_CREDENTIAL ->
dead (Reno Stars Marketing/SEO byok agents, live-confirmed 2026-05-28).
Changes:
- workspace_provision.go: remove the stripGlobalOriginLLMCreds call on the
byok/disabled branch; delete the now-dead function; drop the unused
globalKeys parameter from applyPlatformManagedLLMEnv.
- secrets.go: remove the symmetric byok strip on the remote-pull path
(GET /workspaces/:id/secrets/values) + its now-unused globalKeys tracking;
the bundle is the tenant's merged secrets served verbatim.
- platform_managed path UNCHANGED: still strips direct oauth + forces the CP
proxy usage token (metered). Only byok/disabled stop being stripped.
- Fail-closed UNCHANGED in spirit: a byok workspace with no LLM credential at
ANY scope still aborts MISSING_BYOK_CREDENTIAL; the trigger narrowed from
"no workspace-scoped cred" to "no cred at any scope".
Guard (co-mingling prevention at the write boundary):
- SetGlobal still rejects bypass-list keys for a platform_managed tenant
(keeps a platform-shaped credential out of global_secrets going forward);
added a regression test pinning it.
Tests: inverted the strip-asserting unit + e2e tests to the corrected model
(global-scope oauth survives, byok runs direct, no proxy); added genuinely-
credential-less byok fail-closed coverage; all three behavior changes are
mutation-load-bearing (re-adding either strip / dropping the SetGlobal guard
turns the respective test RED). build + vet + golangci-lint + the full
integration-tagged handlers suite green. The #1994 model-passthrough fix and
the MiniMax A2A e2e on this branch are untouched.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The A2A e2e historically asserted only response SHAPE (test_a2a_e2e.sh
checked '"kind":"text"' only). A broken agent returns its error AS a
text part -- {"kind":"text","text":"Agent error (Exception) ..."} --
which STILL matches the shape check, so it PASSED on a fully broken
agent. That is why the 2026-05-2x drained-key / byok-misroute failures
(agents-team PM + reno marketing erroring on every LLM call) sailed
through CI. "Channel returns text shape" is not "agent completed an LLM
round-trip."
Adds, ADDITIVELY (no existing assertion weakened or removed):
- tests/e2e/lib/completion_assert.sh -- reusable gates:
* a2a_assert_real_completion: deterministic known-answer round-trip;
asserts CONTAINS the expected token AND NOT an error-as-text marker
(Agent error / Exception / error result / MISSING_BYOK_CREDENTIAL).
* provider_liveness_matrix + offered_platform_models_for_runtime:
per-offered-provider cheap (max_tokens:4) probe; the offered set is
read from the providers.yaml SSOT (runtimes.<rt>.providers[platform]
.models) -- not a hardcoded list -- so the matrix tracks the SSOT.
* assert_byok_not_platform_proxy: #1994 regression guard -- a
byok-resolving workspace must NOT resolve platform_managed (reads the
same derived resolver GET /admin/workspaces/:id/llm-billing-mode the
provision strip gate uses).
- tests/e2e/test_staging_full_saas.sh (the live-agent lane, MiniMax
primary): new stanzas 8b (PINEAPPLE known-answer, the core gate),
8c (byok-routing guard), 8d (SSOT-driven per-provider liveness matrix).
- tests/e2e/test_a2a_e2e.sh: added check_no_error_as_text on Echo + SEO
replies so the brief's literal shape-only example now FAILS on an
error-as-text payload.
- tests/e2e/test_completion_assert_unit.sh: offline fail-direction proof
(16 cases) that the negative gates are load-bearing -- error-as-text
MUST fail, platform_managed MUST trip the #1994 guard. Wired into
ci.yml "Run E2E bash unit tests (no live infra)" (required, per-PR +
main). e2e-staging-saas.yml paths filter extended to re-trigger the
live lane on lib changes.
No #1994 fix code touched -- tests/e2e + workflow wiring only.
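
For illustration, the core of the known-answer gate can be sketched in Go (the real gates are bash helpers in completion_assert.sh). The function name and signature here are hypothetical; the marker list and the PINEAPPLE token come from the description above.

```go
package main

import (
	"fmt"
	"strings"
)

// errorMarkers are the error-as-text markers named in the commit message.
var errorMarkers = []string{
	"Agent error",
	"Exception",
	"error result",
	"MISSING_BYOK_CREDENTIAL",
}

// assertRealCompletion passes only when the reply text contains the expected
// known-answer token AND carries no error-as-text marker. A shape-only check
// would accept both replies in main below.
func assertRealCompletion(replyText, expected string) error {
	for _, marker := range errorMarkers {
		if strings.Contains(replyText, marker) {
			return fmt.Errorf("reply carries error-as-text marker %q", marker)
		}
	}
	if !strings.Contains(replyText, expected) {
		return fmt.Errorf("reply does not contain expected token %q", expected)
	}
	return nil
}

func main() {
	fmt.Println(assertRealCompletion("The word is PINEAPPLE.", "PINEAPPLE"))
	fmt.Println(assertRealCompletion("Agent error (Exception) ...", "PINEAPPLE"))
}
```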
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The provision-time LLM billing resolver diverged from the read endpoint:
a byok workspace (claude-code, opus) was provisioned platform_managed and
routed through the platform LLM proxy, billing the platform Anthropic key
for the customer's own usage (Reno Stars Marketing 6b66de8d; live-confirmed
2026-05-28).
Root cause: applyPlatformManagedLLMEnv passed the RAW payload.Model to
ResolveLLMBillingModeDerived. On a re-provision (restart/resume/
auto-restart) the payload is rebuilt from the DB with Name+Tier+Runtime
only (workspace_restart.go:333/844/1017 via withStoredCompute, which
backfills Compute but NOT Model), so payload.Model == "". DeriveProvider
errors on an empty model, the resolver defaults closed to platform_managed
and bakes ANTHROPIC_BASE_URL=<platform proxy>. The read endpoint
(ResolveLLMBillingMode -> readWorkspaceDeriveInputs) reads MODEL from
workspace_secrets, derives opus -> anthropic-oauth -> byok. Divergence,
deterministic on every re-provision.
Fix: extract effectiveModelForBilling (the fallback chain
applyRuntimeModelEnv already used: explicit -> MOLECULE_MODEL -> MODEL)
into a shared helper and have the billing resolver consult it, so the
provision-path derive inputs match the read-path. The stored model already
lives in the merged envVars (loadWorkspaceSecrets) — no new DB query. The
byok branch (no proxy override; strip only global-origin platform creds;
fail-closed on missing own cred, internal#711) is preserved unchanged;
genuinely-platform and no-model workspaces still default platform_managed
(CTO: default stays platform).
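
A sketch of the fallback chain, assuming the helper takes the explicit payload model plus the merged envVars map; the real signature may differ, and the whitespace trimming is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// effectiveModelForBilling walks explicit -> MOLECULE_MODEL -> MODEL and
// returns the first non-empty value, or "" when no model is known.
func effectiveModelForBilling(explicit string, envVars map[string]string) string {
	if m := strings.TrimSpace(explicit); m != "" {
		return m
	}
	for _, key := range []string{"MOLECULE_MODEL", "MODEL"} {
		if m := strings.TrimSpace(envVars[key]); m != "" {
			return m
		}
	}
	return ""
}

func main() {
	// Re-provision case: payload.Model is empty, the stored MODEL secret is not.
	envVars := map[string]string{"MODEL": "opus"}
	fmt.Println(effectiveModelForBilling("", envVars))
}
```

With the raw payload value the resolver sees an empty model on re-provision; with the helper it sees the same stored model the read path derives from.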
Tests (mutation-load-bearing): re-provision-uses-stored-model byok repro,
read/provision parity guard, default-preservation, and the #711 global-
only-oauth fail-closed guard. Reverting the envVars fallback turns the
repro + parity + #711 tests RED; default-preservation stays GREEN.
BEHAVIOR-AFFECTING (provisioning hot path) — needs CTO merge-go.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
agent-reviewer #7790 (blocking) found that ConfigTab.registryBilling.test.tsx
did not actually pin retire-list #5's core claim — both existing assertions
("platform"→platform_managed, "anthropic-oauth"→byok) return the SAME value
under both the registry-authoritative impl and a regression to the old
hardcoded billingModeForProvider rule, so the test was tautological and a
regression would still pass. The misleading comment on the anthropic-oauth
case claimed it was "a case the hardcoded rule gets WRONG" but the hardcoded
rule actually agrees there too.
This commit adds a genuine disagreement case: a registry provider
"managed-federated" whose registry-served billing_mode is "platform_managed"
even though its name is not "" / "platform" (so the legacy
billingModeForProvider rule would return "byok"). The new test asserts the
two rules disagree on this input (sanity) and then asserts
billingModeForSelectedProvider returns the REGISTRY value
("platform_managed"), which is only reachable by honoring the catalog.
Load-bearing proof: with the registry-first impl, the new test PASSES; when
billingModeForSelectedProvider is temporarily forced to fall through to the
hardcoded rule, the new test (and only the new test) FAILS with
expected 'platform_managed' / received 'byok' — proving it pins the
registry-wins contract.
Also fixes the misleading "hardcoded rule gets WRONG" comment on the
anthropic-oauth case (explicitly annotates it as non-discriminating and
points to the new disagreement case as the registry-WINS proof).
Implementation (billingModeForSelectedProvider) untouched — confirmed
byte-identical to PR #1978 HEAD (f2d7f1da).
Verification:
- targeted: 5 passed (was 4 — adds the discriminating case)
- regressed-impl: only the new test fails, others pass (= they are
non-discriminating as the review found)
- full canvas vitest: 223 files / 3381 passed | 1 skipped (3382) — +1
vs the 3380/1 baseline
- tsc: 0 new errors (touched file clean; pre-existing 223 baseline
unchanged with my diff stashed)
- eslint on touched file: 0
Refs: #1978, review #7790, internal#718 P3 retire-list #5.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
P3 item 2. The canvas Provider/Model selector + Config-tab billing-mode now
consume the registry-served GET /templates fields (registry_backed /
registry_providers / registry_models from PR-A) instead of re-deriving provider
knowledge client-side. Retires the hardcoded vocabularies as the PRIMARY path:
- ProviderModelSelector (#4): new buildProviderCatalogFromRegistry(providers,
models) builds the dropdown catalog from the registry payload — provider
label = registry display_name, bucket = DERIVED provider, billing + auth_env
from the registry — instead of inferVendor / VENDOR_LABELS /
BARE_VENDOR_PATTERNS. The selector takes an optional pre-built `catalog`
prop and uses it verbatim when supplied. inferVendor/buildProviderCatalog
remain ONLY as the fallback for non-registry runtimes / older backends.
- ConfigTab (#5): when the selected runtime is registry-backed, the provider
catalog + selector models come from registry_providers/registry_models, and
billingModeForSelectedProvider(provider, catalog) reads the DERIVED provider's
billing_mode off the registry catalog. The hardcoded billingModeForProvider
('' | 'platform' → platform_managed else byok) stays as the fallback only.
So the billing-mode the UI shows/sends reflects the DERIVED provider
(folds in the canvas intent of the closed #1931).
Federation/back-compat preserved: a non-registry runtime (external/mock/kimi/
future third-party) or an older backend that doesn't serve the registry fields
yields registry_backed=false → the canvas keeps the template-served models +
its heuristic, unchanged. NO hard-reject (the canvas just can't render an
option the registry didn't serve for registry-backed runtimes).
Out of scope (per brief): the manifest runtime allowlist
(SUPPORTED_RUNTIME_VALUES / FALLBACK_RUNTIME_OPTIONS) is NOT a provider
vocabulary and is untouched; PUT /workspaces/:id/provider is NOT retired (that
CTO #3 follow-through is a later phase).
Stacked on PR-A (workspace-server registry-served /templates); re-target to
main after PR-A merges.
TDD: ProviderModelSelector.registry.test.tsx (catalog bucketed by derived
provider, labelled from display_name, carries billing_mode + auth_env, no empty
buckets), ConfigTab.registryBilling.test.tsx (billing reads registry catalog;
falls back to the legacy rule with no catalog / unknown provider). Full canvas
suite green (3380 passed / 1 skipped), tsc clean for touched files, eslint 0.
internal#718 P3 — not merged; CTO merge-go after Five-Axis (UI-affecting).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The provider-SSOT closure: with the registry-derived provider model
(P0-P4) flowing through every decision point — proxy (P1), billing
(P2-B), templates (P3 PR-A/B), provisioner (P3 PR-C) — the
LLM_PROVIDER workspace_secret has no reader left on core. This PR
retires:
- WorkspaceHandler.Create's setProviderSecret writes (the
payload.LLMProvider and deriveProviderFromModelSlug-derived
write paths). payload.LLMProvider is preserved on the request
struct for backwards-compat with older canvases that still send
it; the value is intentionally ignored. Coverage moved to
TestWorkspaceCreate_FirstDeploy_OnlyPersistsMODEL (asserts only
the MODEL secret is written, even on a slug-prefixed model that
pre-P4 would have triggered an LLM_PROVIDER write).
- SecretsHandler.SetProvider / GetProvider gin handlers + the
setProviderSecret helper. Both route registrations now point at
handlers.ProviderEndpointGone, which returns 410 Gone with a
structured PROVIDER_ENDPOINT_RETIRED body so older canvases that
still call PUT /provider on Save fail loud rather than silently
writing into a vanished row. Coverage: TestPutProvider_410Gone +
TestGetProvider_410Gone + TestProviderEndpointGone_BodyShape.
- deriveProviderFromModelSlug (retire-list #3) — the hand-rolled
35-arm slug-prefix→provider switch in workspace_provision.go.
Its only caller was Create's setProviderSecret write; the
derivation now flows through providers.Manifest.DeriveProvider
against the registry SSOT at every decision point. The drift
test (derive_provider_drift_test.go) that pinned parity with the
hermes template's derive-provider.sh is deleted with it. The
shell script remains the in-container fallback; its byte-identity
with the registry view of hermes is a P4 follow-up gated on
registry data growth (see codegen of hermes config.yaml from the
registry).
- loadWorkspaceSecrets LLM_PROVIDER drop (defence-in-depth):
any straggler workspace_secrets or global_secrets row keyed
LLM_PROVIDER is filtered out before envVars is built, so a
rolling deploy (new code, old DB) cannot re-emit the retired key
into the CP-side provisioner env.
- Canvas: ConfigTab.tsx no longer GETs or PUTs
/workspaces/:id/provider, and the provider→billing-mode linkage
(internal#703 Gap 2) is retired together — P2-B moved the
platform-vs-byok decision to ResolveLLMBillingModeDerived, which
derives the provider from (runtime, model). The provider
dropdown still renders for display so users can preview the
derived value locally. The two retired vitest suites
(ConfigTab.provider, ConfigTab.billingMode) are replaced with
documentation files pointing at the new coverage.
- Migration 20260528000000_drop_llm_provider_workspace_secret
removes any straggler rows from workspace_secrets. Idempotent:
a fresh tenant with zero LLM_PROVIDER rows produces a 0-row
delete. The .down.sql is a documented no-op (the rows cannot
be reconstituted from a soft-delete, and the writers are gone).
Behavior delta — explicitly tested:
- Registered (runtime, model) workspace → 201, provider derived,
no LLM_PROVIDER stored. UNCHANGED for the runtime-visible
`provider:` in /configs/config.yaml (CP-side commit derives it
from the same registry).
- PUT /workspaces/:id/provider → 410 Gone {code:
PROVIDER_ENDPOINT_RETIRED, error, issue: internal#718}. Was 200
with a workspace_secrets write.
- GET /workspaces/:id/provider → 410 Gone. Was 200 + {provider,
source}.
- WorkspaceHandler.Create with a slug-prefixed model (e.g.
minimax/MiniMax-M2.7) + an explicit llm_provider in the payload
→ only the MODEL workspace_secret is written. Pre-P4 both rows
were written.
- Existing workspace with an LLM_PROVIDER row → migration drops
it at next deploy; loadWorkspaceSecrets filters it defensively
in the interim.
Five-Axis review notes:
- Correctness: the four readers of stored LLM_PROVIDER (core
GetProvider, core loadWorkspaceSecrets, CP resolveModelAndProvider,
CP ValidateProviderEnv) are all migrated in this PR + the
CP-side commit. Audit query trail in the brief; the empirical
finding is that no fifth reader exists (verified across both
repos via grep of LLM_PROVIDER, setProviderSecret, SetProvider,
GetProvider, llm_provider).
- Tests: TDD red→green for the 410 Gone shape; SQL-mock for the
"no LLM_PROVIDER write on Create" contract; existing P2-B
billing tests confirm the derived-provider billing path is
untouched (the regression risk this PR could have created).
- Backward-compat: payload.LLMProvider preserved on the
CreateWorkspacePayload struct; the canvas still sends it; the
server ignores it. Older canvases that PUT /provider get a loud
410 with a recognizable code so they can stop calling.
- Rollback: revert the migration + revert this commit; the
LLM_PROVIDER workspace_secret writers stay gone (the PUT route
has no handler symbol to wire back without a separate revert).
- Observability: provider derivation is logged in
applyPlatformManagedLLMEnv (existing P2-B emission); no new
structured-event surface added — the retirement is silent at
the request boundary and the 410 Gone surface is the
operator-facing signal.
cp#362 anthropic passthrough untouched. P1 proxy ResolveUpstream
untouched. P2-B billing derives via DeriveProvider — still reads
the same derivation, never the stored LLM_PROVIDER. P3 PR-A
templates-from-registry + P3 PR-C ValidateProviderEnv-from-registry
untouched. P4 PR-2 hard-reject 422 untouched.
NOT MERGED.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
WorkspaceHandler.Create now returns 422 UNREGISTERED_MODEL_FOR_RUNTIME when the provider registry knows the runtime but the (runtime, model) pair is not in its native model set. Previously this was the P2-B WARN-mode signal (X-Molecule-Model-Unregistered header + log; create proceeds); it is now a hard rejection at the boundary with no DB rows touched.
Behavior delta (under test):
* Workspace with a REGISTERED (runtime, model) → 201, unchanged.
* Workspace with an UNREGISTERED (runtime, model) → 422 with body
{code:UNREGISTERED_MODEL_FOR_RUNTIME, error, runtime, model}, no DB writes (mock ExpectationsWereMet asserts zero unexpected DB calls).
* Workspace with the legacy colon-form anthropic:claude-opus-4-7 for runtime=claude-code → 201 (P4 PR-1 reconciled the colon-vocab into the registry, making this a first-class registered model alongside the slash form).
* Workspace with runtime NOT in the registry (langgraph/external/kimi/mock/federated) → unchanged (fails OPEN — federation-ready, the registry cannot speak to non-first-party runtimes).
* External workspaces (external=true or external-like runtime) → unchanged (URL is the contract, not the model).
Why P4 vs P2-B: P2-B kept WARN-mode because the legacy colon-namespaced BYOK vocabulary (anthropic:claude-opus-4-7 etc.) was live across the create/import/template corpus and not yet in the registry. P4 PR-1 reconciled that vocab into the per-runtime native sets (each runtime now lists bare + slash + colon forms for the BYOK ids in the live corpus). With the reconcile landed, an unregistered pair is a real misconfiguration and the gate flips loud — the codex anthropic:claude-opus-4-7 wedge class (the MODEL_REQUIRED gate targets the same failure mode) now fails AT THE BOUNDARY instead of provisioning a workspace that will wedge at adapter init.
Test surface (workspace_test.go):
* TestWorkspaceCreate_718_P4_UnregisteredModelHardReject422 (NEW) — explicit 422 + body code + no DB writes
* TestWorkspaceCreate_718_P4_RegisteredModelProceeds (renamed from _RegisteredModelNoWarnHeader) — 201 + no legacy WARN header
* TestWorkspaceCreate_718_P4_LegacyColonVocabAccepted (NEW) — anthropic:claude-opus-4-7 on claude-code proceeds (the central regression guard for the PR-1 reconcile + PR-2 flip combo)
* TestWorkspaceCreate_718_NonRegistryRuntimeFailsOpen — unchanged (federation path)
Fixture updates for the flip (tests that previously used an unregistered model as a fixture for OTHER gate paths; updated to a valid model so those gates can actually fire):
* TestWorkspaceCreate_WithInvalidCompute_ReturnsBadRequest — gpt-4 (no runtime owns it) → claude-opus-4-7 (so the compute-validation 400 path tests what it should)
* TestWorkspaceCreate_TemplateDefaultsMissingRuntimeAndModel — hermes/nousresearch/hermes-4-70b → hermes/moonshot/kimi-k2.6 (hermes native set per the CTO matrix)
* TestWorkspaceCreate_TemplateDefaultsLegacyTopLevelModel — hermes/anthropic:claude-sonnet-4-5 → hermes/moonshot/kimi-k2.5
* TestWorkspaceCreate_CallerModelOverridesTemplateDefault — hermes override minimax/MiniMax-M2.7 → moonshot/kimi-k2.5 (still tests the caller-overrides-template-default mechanic, just with a hermes-valid pair)
Phase-1 falsification + Phase-2 design were established in PR-1. Phase-3 TDD: each new behavior assertion mapped to a discriminating test (422 vs 201 vs unchanged WARN-header absence). Phase-4 Five-Axis to follow in PR review.
NOT regressed (verified via -short + -tags=integration -short for handlers + providers):
* cp#362 anthropic passthrough (proxy layer; unaffected).
* P1 proxy ResolveUpstream (registry resolution by namespace token; unaffected).
* P2-B billing-derive (DeriveProvider semantics unchanged by the reconcile).
* P3 templates-from-registry (GET /templates still serves ModelsForRuntime; PR-1 enlarges the set, this PR rejects calls outside it).
Stacked on feat/internal-718-p4-pr1-reconcile-colon-vocab-sync (PR-1 must merge first; this PR's tests would 422 the legacy colon vocab otherwise).
Refs internal#718.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the canonical change in molecule-controlplane PR feat/internal-718-p4-pr1-reconcile-colon-vocab:
adds the legacy colon-namespaced BYOK model ids (anthropic:claude-*, moonshot:kimi-k2.*, minimax:MiniMax-M2*) to each runtime native set so DeriveProvider / Manifest.ModelsForRuntime returns true for every legitimate model in the live workspace-create corpus (canvas/ConfigTab default + ~44 test files + openclaw template precedent).
Per the sync_canonical_test.go header procedure:
1. Copied molecule-controlplane/internal/providers/providers.yaml verbatim.
2. Regenerated internal/providers/gen/registry_gen.go via go run ./cmd/gen-providers.
3. Bumped canonicalProvidersYAMLSHA256 to the new canonical sha (73e8003062edaa4ce75bfb324be615b6e2b380f07487e3af4dc16cb644dc12bc).
4. Synced runtimes_test.go to match CP's expanded claude-code expectation set.
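
Step 3 relies on a hermetic hash pin. A minimal sketch of that kind of check, with hypothetical helper names and an inline stand-in for the synced file:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// sha256Hex returns the lowercase hex SHA-256 of content.
func sha256Hex(content []byte) string {
	sum := sha256.Sum256(content)
	return hex.EncodeToString(sum[:])
}

// checkPin fails when the synced copy no longer hashes to the pinned value,
// which catches a hand-edit without any network access.
func checkPin(content []byte, pinned string) error {
	if got := sha256Hex(content); got != pinned {
		return fmt.Errorf("synced copy drifted: got %s, pinned %s", got, pinned)
	}
	return nil
}

func main() {
	synced := []byte("providers: []\n") // stand-in for the synced providers.yaml
	pinned := sha256Hex(synced)         // in practice a constant bumped on each sync
	fmt.Println(checkPin(synced, pinned))
	fmt.Println(checkPin(append(synced, '#'), pinned) != nil)
}
```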
ZERO behavior change in core: the WARN-mode validateRegisteredModelForRuntime gate (workspace.go:451-456) just goes silent for the now-registered colon-form models; the X-Molecule-Model-Unregistered response header stops being emitted for legitimate colon-form workspaces. No new rejection path; no proxy/billing-derive change.
Stacked atop molecule-controlplane PR-1 — merge order: CP PR-1 → core PR-1 sync. The cross-repo sync-providers-yaml CI gate stays green once the canonical lands.
Refs internal#718.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
P3 item 1 (retire-list #1 surface). GET /templates (templates.go List) now
ANNOTATES each registry-known runtime's template with an authoritative
registry-served selectable list, sourced from the provider registry
(workspace-server/internal/providers, the P2-A synced SSOT) instead of the
template's hand-authored config.yaml providers:/runtime_config.models block:
- registry_backed: true when the runtime is in the registry runtimes: block.
- registry_providers: the runtime's NATIVE provider set (ProvidersForRuntime),
each with display_name + auth_env + billing_mode (platform_managed if the
registry IsPlatform predicate holds, else byok) — the SSOT the canvas
Provider dropdown consumes instead of its hardcoded VENDOR_LABELS map.
- registry_models: the runtime's NATIVE model ids (ModelsForRuntime), each
annotated with its DERIVED provider (DeriveProvider) + the billing_mode that
provider implies — so the canvas shows the billing source of the DERIVED
provider (folds in #1931 intent) and can render no model the registry did
not list for the runtime ("only registered selectable").
Additive + federation-ready + fail-OPEN: the existing template-served
Models/Providers/ProviderRegistry fields are UNCHANGED, so non-registry
runtimes (external/mock/kimi/future third-party) and older canvases keep
working — a runtime absent from the registry yields registry_backed=false and
no synthesized block. NO hard-reject: templates whose model isn't
registry-derivable are still served (WARN-level only; legacy-vocab reconcile
is P4).
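
A sketch of the annotation step under assumed shapes. The struct names and the `id` / `provider` JSON keys are guesses for illustration; registry_backed, registry_models, billing_mode and the fail-open rule are from the description above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// modelEntry is a hypothetical registry row: a native model id, its derived
// provider, and whether that provider satisfies the platform predicate.
type modelEntry struct {
	ID       string
	Provider string
	Platform bool
}

type registryModel struct {
	ID          string `json:"id"`
	Provider    string `json:"provider"`
	BillingMode string `json:"billing_mode"`
}

type annotation struct {
	RegistryBacked bool            `json:"registry_backed"`
	RegistryModels []registryModel `json:"registry_models,omitempty"`
}

// annotate builds the registry-served block for a runtime. A runtime absent
// from the registry yields registry_backed=false and no synthesized list.
func annotate(reg map[string][]modelEntry, runtime string) annotation {
	entries, known := reg[runtime]
	if !known {
		return annotation{}
	}
	out := annotation{RegistryBacked: true}
	for _, e := range entries {
		mode := "byok"
		if e.Platform {
			mode = "platform_managed"
		}
		out.RegistryModels = append(out.RegistryModels, registryModel{
			ID: e.ID, Provider: e.Provider, BillingMode: mode,
		})
	}
	return out
}

func main() {
	reg := map[string][]modelEntry{
		"claude-code": {{ID: "claude-opus-4-7", Provider: "anthropic-oauth"}},
	}
	b, _ := json.Marshal(annotate(reg, "claude-code"))
	fmt.Println(string(b))
}
```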
Reuses the package-level providerRegistry() accessor + LLMBillingModePlatformManaged/
LLMBillingModeBYOK constants from llm_billing_mode.go (P2-B / #1972, now on
main) — one accessor + one constant set for the package; both the billing
derivation and this templates projection wrap the same providers.LoadManifest()
SSOT and the same wire strings.
Proxy ResolveUpstream / billing DeriveProvider untouched (P1/P2). Templates'
own config.yaml providers: codegen untouched (P4).
TDD: TestTemplatesList_RegistryServesSelectableModels (a template's bogus model
id never leaks into the registry-served list; native ids present),
TestTemplatesList_RegistryAnnotatesDerivedProviderAndBilling (derived
provider + platform_managed/byok per model; provider display_name/auth_env/
billing from the registry), TestTemplatesList_NonRegistryRuntimeFallsOpenToTemplate
(mock runtime: registry_backed=false, template fields untouched). All existing
TestTemplatesList_* stay green (template-served fields unchanged). Rebased onto
main after P2-B (#1972) landed; full handlers+providers suites green alongside it.
internal#718 P3 — not merged; CTO merge-go after Five-Axis (UI/API-affecting).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cross-repo drift gate fetched controlplane providers.yaml from the
Gitea /contents endpoint with Accept: application/vnd.gitea.raw. On this
Gitea (1.22.6) that header is NOT honored on /contents -- it returns the
JSON+base64 envelope ({"name":"providers.yaml","content":"<base64>"...},
~45.6 KB), not raw bytes. So diff -u compared JSON-vs-YAML and exited 1
(RED) on every run even when byte-identical, making the gate inert
(detected neither sync nor real drift).
Switch the fetch to the /raw endpoint, which returns the file bytes
directly (33319 B, sha256 48a66921...), byte-identical to core's synced
copy. diff now exits 0 on the in-sync state and goes RED on real drift.
Authorization: token header kept; soft-fail backstop and the hermetic
sha-pin in sync_canonical_test.go are untouched.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Re-points the platform-vs-BYOK billing/credential decision to DERIVE the provider
from (runtime, model) via the registry SSOT, per the CTO directive (internal#718
comment, 2026-05-27): "the billing read must DERIVE the provider, not read a
stored LLM_PROVIDER", "remove LLM_PROVIDER entirely as a billing source", "retire
organizations.llm_billing_mode as a billing source".
## BEHAVIOR DELTA (this PR changes behavior — tested explicitly)
- platform-derived (or unset → platform default) → platform_managed → platform
creds. UNCHANGED.
- non-platform-derived → byok → the already-merged #1963 strips platform
scope:global LLM creds + FAIL-CLOSES if the workspace has no own cred. THIS IS
THE INTENDED FIX (the Reno billing-leak class: Reno Stars SEO 352e3c2b /
Marketing 6b66de8d ran on the platform's Anthropic credits because the never-
written org rung always resolved platform_managed).
- unset model → platform default (CTO-confirmed).
## What changed
- `ResolveLLMBillingModeDerived(ctx, ws, runtime, model, authEnv)` — the new SSOT
resolver: explicit `workspaces.llm_billing_mode` override (precedence 1, the
only stored billing signal that survives — operator escape hatch) → else
DeriveProvider + IsPlatform → else default-closed platform_managed.
- `ResolveLLMBillingMode(ctx, ws, orgMode)` legacy signature retained for callers
without (runtime, model) (admin route, secrets remote-pull): reads the stored
runtime + MODEL + auth-env names from DB and delegates to the derived resolver.
`orgMode` is RETIRED/ignored; the org rung is gone.
- `applyPlatformManagedLLMEnv` calls the derived resolver directly (it has
runtime + model + the workspace env) — no stored LLM_PROVIDER read. Feeds
#1963's strip + fail-closed the correct DERIVED signal.
- SUPERSEDES core#1966: that PR made the billing read consult a stored
LLM_PROVIDER first; this reworks the decision onto derive-from-provider. #1966
should be closed in favor of this.
- Removed the now-dead org-default normalization (normalizeOrgDefault).
- ONLY-REGISTERED validation at create (model_registry_validation.go +
WorkspaceHandler.Create): a (runtime, model) not in the registry's
ModelsForRuntime for a REGISTRY-known runtime is flagged
(X-Molecule-Model-Unregistered header + warning log). P2 = WARN mode (NOT hard
422) because the legacy colon-namespaced model vocabulary ("anthropic:claude-
opus-4-7") is still live across the create/import/template corpus and is not
yet reconciled into the registry — hard-reject is a one-line flip gated on
P3/P4 vocabulary convergence. Fails OPEN for non-registry runtimes
(langgraph/external/kimi/mock/federated) so those flows are unchanged.
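
The resolver's precedence can be sketched as below. This is a simplified stand-in, not the real `ResolveLLMBillingModeDerived`: the context, DB access and auth-env disambiguation are omitted, and the lookup table is hypothetical sample data.

```go
package main

import "fmt"

const (
	modePlatformManaged = "platform_managed"
	modeBYOK            = "byok"
)

// derived is a hypothetical result of deriving a provider from (runtime, model).
type derived struct {
	Provider string
	Platform bool
}

// resolveBillingMode applies the precedence described above:
// 1. an explicit per-workspace override wins,
// 2. else derive the provider and check the platform predicate,
// 3. else (unset or underivable model) default closed to platform_managed.
func resolveBillingMode(override string, table map[string]derived, runtime, model string) string {
	if override == modePlatformManaged || override == modeBYOK {
		return override
	}
	d, ok := table[runtime+"|"+model]
	if !ok {
		return modePlatformManaged
	}
	if d.Platform {
		return modePlatformManaged
	}
	return modeBYOK
}

func main() {
	table := map[string]derived{
		"claude-code|opus": {Provider: "anthropic-oauth"},
	}
	fmt.Println(resolveBillingMode("", table, "claude-code", "opus"))
	fmt.Println(resolveBillingMode("", table, "claude-code", ""))
}
```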
## Tests (TDD; behavior delta explicit)
- llm_billing_mode_derived_test.go — platform/non-platform/unset/override/
unregistered/auth-env-disambiguation table + DB-error default-closed + empty-id.
- workspace_provision_shared_test.go — DERIVED platform→unchanged,
non-platform→byok+strip+fail-closed (the FIX), unset→platform default, through
the real applyPlatformManagedLLMEnv path. Existing #1963 override-byok strip +
fail-closed tests unchanged (still pass).
- model_registry_validation_test.go + workspace_test.go — only-registered warn +
registered-no-warn + non-registry-fail-open.
- Reworked the legacy resolver/admin/secrets tests off the retired org rung.
## Build/CI
go build ./... (+ -tags=integration) green; full `go test ./...` (43 pkgs) green
incl. -race on handlers; vet clean; changed files gofmt-clean. cp#362 anthropic
passthrough untouched (CP repo); merged #1963 strip+fail-closed reused unchanged.
internal#718 P2-B. BEHAVIOR-AFFECTING. Supersedes #1966. Not merged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Distributes the provider-registry SSOT into molecule-core per the CTO-decided
shape (internal#718 comment, 2026-05-27): "Distribution = SDK via codegen +
verify-CI", multi-repo branch "codegen-checked-into-each-repo + verify-CI".
molecule-core has no Go module dependency on molecule-controlplane, so this
lands a SYNCED COPY of the canonical providers.yaml plus the loader,
DeriveProvider/IsPlatform/ResolveUpstream, the generated Go projection
(cmd/gen-providers), and the drift gates — a byte-faithful mirror of the
controlplane P0/P1 machinery. Canonical SSOT stays in controlplane
internal/providers/providers.yaml.
ZERO behavior change (additive, like P0): NO production code path imports the
new package yet. P2-B wires the billing/credential decision onto the loader.
What lands:
- internal/providers/{providers.go,derive_provider.go,providers.yaml} — mirror
of the controlplane loader + canonical YAML (synced copy).
- internal/providers/gen/registry_gen.go — generated projection; fingerprint
faffcbe59bb9f38c matches controlplane.
- cmd/gen-providers — the generator (go generate + -check drift mode).
- .gitea/workflows/verify-providers-gen.yml — artifact ↔ synced-copy drift gate
(mirror of the controlplane workflow; standalone, not in branch protection
yet — same soak-then-promote posture).
- .gitea/workflows/sync-providers-yaml.yml — NEW cross-repo gate: fetches the
controlplane canonical providers.yaml and byte-compares against core's synced
copy (RED on canonical drift). Read-only AUTO_SYNC_TOKEN; degrades to a
warning if the token is absent.
- internal/providers/sync_canonical_test.go — hermetic sha pin of the synced
copy (the always-on backstop; catches a hand-edit even with no network).
- internal/providers/gen_import_boundary_test.go — arch-lint-equivalent AST gate
(core has no go-arch-lint): no production package may import the raw gen
projection. Proven load-bearing.
Build/test: go build ./... (+ -tags=integration) green; providers/gen/
gen-providers suites pass (incl. -race); gen -check in sync; gofmt + vet clean.
internal#718 P2-A. NO behavior change. Not merged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
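The hermetic pin in sync_canonical_test.go can be pictured with the sketch below. It assumes the pin is a SHA-256 hex digest of the synced file's bytes; the message only says "sha pin", so the algorithm, the helper names, and the sample YAML content are assumptions.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// fingerprint returns the hex SHA-256 of the given bytes.
// SHA-256 is an assumption; the commit only mentions a "sha pin".
func fingerprint(content []byte) string {
	sum := sha256.Sum256(content)
	return hex.EncodeToString(sum[:])
}

// verifyPin fails when the synced copy no longer matches the pinned digest,
// which is how a hand-edit would be caught without any network access.
func verifyPin(content []byte, pinned string) error {
	if got := fingerprint(content); got != pinned {
		return fmt.Errorf("synced providers.yaml drifted: got %s, pinned %s", got, pinned)
	}
	return nil
}

func main() {
	synced := []byte("providers: []\n") // placeholder content, not the real file
	pin := fingerprint(synced)

	fmt.Println("clean copy:", verifyPin(synced, pin))

	edited := append([]byte{}, synced...)
	edited = append(edited, '#')
	fmt.Println("hand-edited copy rejected:", verifyPin(edited, pin) != nil)
}
```

The network-based sync-providers-yaml gate and this offline pin are complementary: the first detects canonical drift upstream, the second detects a local edit.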
Suppresses the lint finding while adding enough context that a reviewer
can distinguish "intentional side-effect from the loop" from an
accidental _ prefixed attribute mutation.
Addresses follow-up from #1769 suppression-comment audit.
TestProxyA2A_CrossTenant_RoutingDenied expected the old behavior where
CanCommunicate's root-sibling bypass ALLOWED unrelated org roots and the
org-scope guard denied afterward. Post-#1955 fix (e69d6383), CanCommunicate
correctly denies unrelated org roots at the hierarchy check, so:
- Error message is now hierarchy-level denial, not org-scope denial
- WITH RECURSIVE org_chain AS queries are never reached
Updated test expectations and removed stale sqlmock setups.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The sop-checklist.yml workflow subscribes only to issue_comment:[created]
(consolidated in PR #1345 / issue #1280 to reduce runner-slot occupancy).
The script header still claimed [created, edited, deleted], which could
mislead future maintainers into thinking edited/deleted events are handled.
No behavior change — comment-only.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
When the Gitea token owner is not a member of the qa/security team,
every team-membership probe returns 403. Previously the final error
message said "none are in team", which misled ops into verifying the
team roster when the real issue was token provisioning (Bug C).
Add tracking for all-403 vs mixed-response scenarios. When every
candidate returns 403, emit an explicit error naming the root cause
and the remediation (add token owner to team or switch tokens).
No behavior change — still fail-closed; only the diagnostic message
is improved.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
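The all-403 versus mixed-response distinction in the commit above reduces to a small classification. The real logic lives in a shell script; the Go sketch below only illustrates the decision, and the `diagnose` function, its return labels, and the treatment of an empty candidate list are assumptions.

```go
package main

import "fmt"

// diagnose classifies the HTTP codes returned by the per-candidate
// team-membership probes after no approving member was found.
// An empty candidate list is treated as "none-in-team" here (assumption).
func diagnose(codes []int) string {
	all403 := len(codes) > 0
	for _, c := range codes {
		if c != 403 {
			all403 = false // any 404 or unexpected code means the token could probe
		}
	}
	if all403 {
		return "token-provisioning" // token owner is not in the team
	}
	return "none-in-team" // probes worked; candidates are simply not members
}

func main() {
	fmt.Println(diagnose([]int{403, 403, 403}))
	fmt.Println(diagnose([]int{403, 404}))
}
```

Both outcomes remain failures; only the error text differs, which is why the commit describes itself as fail-closed with no behavior change.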
A workspace whose resolved LLM billing mode is NOT platform_managed
(byok / subscription) was still being injected with the platform's
scope:global CLAUDE_CODE_OAUTH_TOKEN and ran on the platform's Anthropic
credits. Confirmed live 2026-05-27 on the Reno Stars tenant: the SEO
(352e3c2b-...) and Marketing (6b66de8d-...) claude-code agents had no
workspace-scoped LLM credential, yet ran MODEL=opus directly on
api.anthropic.com using the platform's global OAuth token.
Root cause: loadWorkspaceSecrets merges ALL global_secrets into every
workspace's env provenance-blind. applyPlatformManagedLLMEnv's
non-platform (byok/disabled) path then early-returned WITHOUT stripping
those inherited platform globals — so a workspace with no LLM credential
of its own kept the platform's scope:global CLAUDE_CODE_OAUTH_TOKEN.
The same leak existed on the remote-pull path (GET
/workspaces/:id/secrets/values), which also merged globals unconditionally.
Fix (provider-aware, both injection vectors):
- applyPlatformManagedLLMEnv now takes the global-provenance key set and,
on the non-platform path, strips every platform-managed LLM bypass key
(CLAUDE_CODE_OAUTH_TOKEN + the rest) that originated from global_secrets.
A workspace's OWN LLM cred (a workspace_secrets row — provenance flag
dropped by loadWorkspaceSecrets) is NOT in the global set and survives.
- secrets.Values applies the same provenance-aware gate before returning
the merged bundle to a remote agent.
- Fail closed: a byok workspace left with no usable LLM credential aborts
provision with code MISSING_BYOK_CREDENTIAL instead of starting on the
(now-stripped) platform creds. Scoped to byok; disabled mode strips but
still boots (no-LLM workspaces are legitimate).
- platform_managed path is unchanged (it still receives + force-routes the
platform creds via the CP proxy), and the LLM-proxy anthropic path is
untouched.
Tests (all green; go build/test ./... + -tags=integration build pass):
- ByokStripsGlobalOriginOAuthToken — platform global token stripped, no cred.
- ByokKeepsWorkspaceOwnOAuthEvenWithGlobal — workspace's own token survives.
- DisabledStripsGlobalButReportsNoCred — disabled strips but does not abort.
- PlatformManagedStillReceivesGlobalCreds — no regression on platform path.
- PrepareProvisionContext_ByokWithOnlyGlobalOAuthFailsClosed — e2e abort.
- SecretsValues_ByokStripsGlobalLLMCred — remote-pull path gated.
Note: open PR #1930 (refactor/drop-org-tier-llm-billing-mode, internal#691
follow-up) changes ResolveLLMBillingMode's signature in the same files.
This change is built on current main and is orthogonal in intent; whichever
merges second needs a mechanical 1-line resolver-call adjustment (drop the
orgMode arg). #1930 does NOT fix this leak.
Refs internal#711
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
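A minimal sketch of the provenance-aware gate from the commit above, assuming the env is a plain map and the global-provenance key set is a `map[string]bool`. The function name `gateLLMEnv` and the second entry in `llmBypassKeys` are hypothetical; the message names only CLAUDE_CODE_OAUTH_TOKEN "+ the rest".

```go
package main

import (
	"errors"
	"fmt"
)

var errMissingBYOKCredential = errors.New("MISSING_BYOK_CREDENTIAL")

// llmBypassKeys is an illustrative subset. Only CLAUDE_CODE_OAUTH_TOKEN is
// named in the commit message; ANTHROPIC_API_KEY is an assumption.
var llmBypassKeys = []string{"CLAUDE_CODE_OAUTH_TOKEN", "ANTHROPIC_API_KEY"}

// gateLLMEnv strips platform-managed LLM keys that came from global_secrets
// when the workspace is not platform_managed, then fails closed for byok
// workspaces that are left without any LLM credential of their own.
func gateLLMEnv(mode string, env map[string]string, globalKeys map[string]bool) error {
	if mode == "platform_managed" {
		return nil // platform path keeps the platform credentials
	}
	hasOwnCred := false
	for _, key := range llmBypassKeys {
		if globalKeys[key] {
			delete(env, key) // inherited platform global: strip
			continue
		}
		if env[key] != "" {
			hasOwnCred = true // workspace-scoped credential survives
		}
	}
	if mode == "byok" && !hasOwnCred {
		return errMissingBYOKCredential // abort provision instead of booting
	}
	return nil // disabled mode strips but still boots
}

func main() {
	env := map[string]string{"CLAUDE_CODE_OAUTH_TOKEN": "platform-token"}
	err := gateLLMEnv("byok", env, map[string]bool{"CLAUDE_CODE_OAUTH_TOKEN": true})
	fmt.Println("keys left:", len(env), "error:", err)
}
```

The key design point is that the decision is made per key by provenance, not by key name alone, so a workspace that overrides the global token with its own row keeps it.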
Fixes 6 failing tests that asserted the old insecure root-sibling
behavior after removing the root-sibling fast path from CanCommunicate:
- delegation_test.go: give testDelivery workspaces a shared parent
- handlers_additional_test.go: TestDiscover_TargetOffline +
TestCheckAccess_SiblingsAllowed → shared parent
- handlers_extended_test.go: TestExtended_DiscoverWithCallerID +
TestExtended_CheckAccess → shared parent
- tests/e2e/test_api.sh: Tests 12 + 14 now expect denial for
unrelated root-level workspaces (peers list unchanged)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The `caller.ParentID == nil && target.ParentID == nil` fast path
treated any two org-root workspaces as siblings, allowing cross-tenant
communication when the workspaces table has no org_id column.
Rules after this change:
- self → self (unchanged)
- siblings with same parent (unchanged)
- ancestor ↔ descendant, any depth (unchanged)
- unrelated org roots → DENIED (fixed)
Updates integration-test fixtures to place source/target under a shared
parent so CanCommunicate still returns true for the test scenario.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
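The four rules listed in the commit above can be sketched over an in-memory model of the workspaces table. This is an illustration under stated assumptions, not the registry package's implementation: the `workspace` struct, `fixture`, and `isAncestor` are hypothetical, and the real check runs against the database.

```go
package main

import "fmt"

// workspace is a hypothetical in-memory stand-in for a workspaces row.
type workspace struct {
	ID       string
	ParentID *string // nil marks an org root
}

func ptr(s string) *string { return &s }

// fixture builds two unrelated org roots; rootA has two children and a grandchild.
func fixture() map[string]workspace {
	return map[string]workspace{
		"rootA": {ID: "rootA"},
		"rootB": {ID: "rootB"},
		"a1":    {ID: "a1", ParentID: ptr("rootA")},
		"a2":    {ID: "a2", ParentID: ptr("rootA")},
		"a1x":   {ID: "a1x", ParentID: ptr("a1")},
	}
}

// isAncestor reports whether anc appears on desc's parent chain.
// The hop bound guards against a corrupted, cyclic chain.
func isAncestor(byID map[string]workspace, anc, desc string) bool {
	cur, ok := byID[desc]
	for hops := 0; ok && cur.ParentID != nil && hops < len(byID); hops++ {
		if *cur.ParentID == anc {
			return true
		}
		cur, ok = byID[*cur.ParentID]
	}
	return false
}

// canCommunicate applies the rules: self, same-parent siblings, and
// ancestor/descendant at any depth are allowed; everything else is denied.
func canCommunicate(byID map[string]workspace, callerID, targetID string) bool {
	caller, okC := byID[callerID]
	target, okT := byID[targetID]
	if !okC || !okT {
		return false // unknown workspace: deny
	}
	if callerID == targetID {
		return true
	}
	if caller.ParentID != nil && target.ParentID != nil && *caller.ParentID == *target.ParentID {
		return true
	}
	// Two unrelated org roots match neither direction and are denied.
	return isAncestor(byID, callerID, targetID) || isAncestor(byID, targetID, callerID)
}

func main() {
	ws := fixture()
	fmt.Println("siblings:", canCommunicate(ws, "a1", "a2"))
	fmt.Println("root to grandchild:", canCommunicate(ws, "rootA", "a1x"))
	fmt.Println("unrelated roots:", canCommunicate(ws, "rootA", "rootB"))
}
```

Note that the sibling rule requires both parents to be non-nil, which is exactly what removes the old "both no parent" fast path.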
The #1953 fixture re-seed made Summarizer a CHILD of Echo (same-org) so
the peer-discovery assertions exercise legit same-org enumeration. But
Test 21 still deleted the PARENT (Echo) first and asserted the other
workspace survives (count=1). CascadeDelete walks the recursive parent_id
CTE, so deleting Echo also removed its child Summarizer -> "List after
delete" saw 0, and Test 22 then hit 410 Gone deleting an already-removed
Summarizer ("got: {error: workspace removed}").
Fix: capture Summarizer's bundle, delete the CHILD (Summarizer) first
(child delete does not cascade upward so Echo survives -> count=1), then
delete the parent Echo in the round-trip block and re-import the captured
bundle. Cross-tenant isolation and the same-org parent/child relationship
are unchanged; only the delete ordering is corrected.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The #1953 sameOrg() guard over-blocked legitimate SAME-ORG a2a routing:
orgRootSubtreeCTE carried `id AS root_id` from the recursive SEED, so a
non-root workspace resolved to ITSELF instead of its topmost ancestor.
sameOrg(child, root) therefore compared child-id vs root-id, reported the
pair as DIFFERENT orgs, and 403'd a legitimate same-org delegation. The
cross-org case was unaffected (two distinct roots already resolve to
different ids), so isolation stayed closed — but real same-org delegation
broke. Caught only by the real-Postgres integration suite: the sqlmock
unit tests hand-feed sameOrg() a root_id row and so structurally cannot
exercise the CTE.
Fix: select the parentless chain row's own `id` (aliased root_id) instead
of the seed-carried value. A node that already IS an org root has a
one-row chain and still resolves to itself.
Why the two required checks were red:
- handlers-postgres-integration (real CTE): the executeDelegation
success-path fixtures seeded source AND target both parent_id=NULL —
two DISTINCT org roots, i.e. a CROSS-tenant pair that only ever
"communicated" via the OLD leaky root-sibling behavior #1953 closes.
Re-seeded target as a CHILD of source (same org). With the same-org
fixture, the CTE bug surfaced and is now fixed; all 5 ExecuteDelegation
tests pass (success + failure paths). Added
TestIntegration_SameOrg_RealCTE_ResolvesAncestorChain as the real-SQL
regression gate for root→child→grandchild resolution + cross-org denial.
- e2e-api (test_api.sh): created Echo + Summarizer both as org roots and
asserted they appear in each other's /registry/:id/peers — that
enumeration WAS the cross-tenant leak (org root seeing another org
root). Re-created Summarizer as a child of Echo so the peer assertions
exercise legitimate same-org parent/child enumeration.
Cross-tenant isolation remains closed (all cross-org negative tests pass);
same-org peers + a2a now work. go build ./... + go test ./internal/handlers/...
green; integration suite green.
Refs #1953
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
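The corrected root resolution in the commit above (return the parentless chain row's own id, not the seed's id) can be modelled without SQL. The sketch below is an in-memory analogue of the recursive CTE; the `parents` map type and the fixture are hypothetical, and the real orgRootID / sameOrg run against Postgres.

```go
package main

import (
	"errors"
	"fmt"
)

// parents is a hypothetical in-memory model of the parent_id column:
// workspace id -> parent id, with "" marking an org root.
type parents map[string]string

func fixture() parents {
	return parents{"root": "", "child": "root", "grandchild": "child", "other": ""}
}

// orgRootID walks the parent chain and returns the id of the parentless row.
// Returning the starting id instead would reproduce the bug described above.
func orgRootID(p parents, id string) (string, error) {
	for hops := 0; hops <= len(p); hops++ {
		parent, ok := p[id]
		if !ok {
			return "", errors.New("workspace not found: " + id)
		}
		if parent == "" {
			return id, nil // an org root resolves to itself
		}
		id = parent
	}
	return "", errors.New("parent_id cycle detected")
}

// sameOrg is fail-closed: any lookup error denies.
func sameOrg(p parents, a, b string) bool {
	rootA, errA := orgRootID(p, a)
	rootB, errB := orgRootID(p, b)
	if errA != nil || errB != nil {
		return false
	}
	return rootA == rootB
}

func main() {
	p := fixture()
	root, _ := orgRootID(p, "grandchild")
	fmt.Println("grandchild resolves to:", root)
	fmt.Println("child vs root same org:", sameOrg(p, "child", "root"))
	fmt.Println("child vs other same org:", sameOrg(p, "child", "other"))
}
```

A unit test that hand-feeds a root id into sameOrg never runs the chain walk, which is the gap the real-Postgres integration suite covers.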
The audit-force-merge workflow previously used a single flat list of
required status checks for all branches. This caused false negatives on
staging merges (staging requires only 2 checks, main requires 3) and
false positives if a check existed on one branch but not the other.
Changes:
- audit-force-merge.sh:
- Accept REQUIRED_CHECKS_JSON (branch-keyed dict) as primary input.
- Fall back to REQUIRED_CHECKS (newline list) for backward compat.
- Look up checks by PR base branch; empty set → no-op gracefully.
- audit-force-merge.yml:
- Replace flat REQUIRED_CHECKS with REQUIRED_CHECKS_JSON declaring
main (3 checks) and staging (2 checks) explicitly.
Rework of PR #1946; closes internal#1739.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
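The branch-keyed lookup with its legacy fallback, as described in the commit above, can be sketched like this. The real logic is in audit-force-merge.sh; the Go function, the assumed JSON shape (branch name mapped to an array of check names), and the check names themselves are illustrative assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// requiredChecks returns the checks for the PR's base branch.
// checksJSON mirrors REQUIRED_CHECKS_JSON (assumed shape: {"branch": ["check", ...]});
// legacyList mirrors the newline-separated REQUIRED_CHECKS fallback.
func requiredChecks(checksJSON, legacyList, baseBranch string) ([]string, error) {
	if strings.TrimSpace(checksJSON) != "" {
		var byBranch map[string][]string
		if err := json.Unmarshal([]byte(checksJSON), &byBranch); err != nil {
			return nil, fmt.Errorf("parse REQUIRED_CHECKS_JSON: %w", err)
		}
		// An unknown branch yields an empty set, which the caller treats as a no-op.
		return byBranch[baseBranch], nil
	}
	var out []string
	for _, line := range strings.Split(legacyList, "\n") {
		if s := strings.TrimSpace(line); s != "" {
			out = append(out, s)
		}
	}
	return out, nil
}

func main() {
	// Check names are placeholders, not the repository's real check names.
	cfg := `{"main":["check-a","check-b","check-c"],"staging":["check-a","check-b"]}`
	for _, branch := range []string{"main", "staging", "feature/x"} {
		checks, err := requiredChecks(cfg, "", branch)
		fmt.Println(branch, len(checks), err)
	}
}
```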
Three workspace-server paths computed an "org-root sibling set" as
`WHERE parent_id IS NULL`, which matches EVERY tenant's org root (the
workspaces table has no org_id column) → cross-tenant data exposure:
1. GET /registry/:id/peers (discovery.Peers) — returned peer
id/name/role/url/agent_card across ALL tenants when the caller
was itself an org root.
2. MCP toolListPeers (mcp_tools.go) — same cross-tenant peer
enumeration via the MCP bridge.
3. a2a routing (a2a_proxy.proxyA2ARequest → resolveAgentURL) —
CanCommunicate's "root-level siblings, both no parent" rule treats
every tenant's org root as a sibling, and resolveAgentURL accepts
ANY workspace id with no org check, so an org root could resolve
and route a2a to another tenant's org root.
Fix — reuse the OFFSEC-015 broadcast scoping (commit 5a05302c,
workspace_broadcast.go): the org is the parent_id-chain subtree from a
single org root. New org_scope.go centralises that recursive CTE
(orgRootID / sameOrg) so all paths derive "the caller's org" the same way:
- discovery.Peers + toolListPeers: drop the `parent_id IS NULL`
sibling branch entirely. An org root has no siblings inside its own
org; its peers are its children (still enumerated). Only the
parent_id-bound sibling branch remains, already scoped to one tenant.
- a2a proxyA2ARequest: after CanCommunicate, add a sameOrg() guard that
rejects (403) before resolveAgentURL when caller and target resolve
to different org roots. Fail-closed: a DB error denies routing.
No org_id column is added — that is a separate architecture decision
pending CTO. This uses the existing parent_id-chain scoping.
Tests (cross_tenant_isolation_test.go): per-path cross-tenant regression
— a DIFFERENT-org workspace must NOT appear in /registry peers, must NOT
appear in toolListPeers, and a2a MUST reject resolving/routing to a
workspace outside the caller's org; plus same-org positive tests. The
three negative tests were verified to FAIL against the pre-fix code.
Existing peer/a2a/delegation tests updated to the org-scoped behavior.
Follow-up for CTO: registry.CanCommunicate still treats any two org
roots as siblings, so discovery.Discover and CheckAccess share the same
root-sibling weakness. Scoping CanCommunicate itself (registry package)
would close that class fully; flagged separately as it is outside the
three #1953 paths.
Refs #1953
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three test fixes after rebasing #1669 onto latest main:
1. TestWorkspaceCreate_ReturnsAuthToken_201:
- Removed extra sqlmock.AnyArg() for status column (now
hardcoded as 'provisioning' in SQL, not a parameter).
- Changed expected runtime from "langgraph" to "claude-code" to
match model resolution for "anthropic:claude-opus-4-7".
2. TestWorkspaceCreate_SaaSHardForcesTier4:
- Removed INSERT INTO workspace_auth_tokens expectation.
- External workspaces return early before the inline auth_token
mint at the bottom of Create.
3. TestWorkspaceCreate_ExternalURL_SSRFSafe:
- Same fix — external workspaces don't reach the non-external
auth_token minting path.
Full handlers package now passes (18.5s).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace remaining user-facing references to the old repo name
molecule-monorepo with molecule-core in clone instructions,
documentation links, path examples, and source links.
Affected files:
- README.md (clone commands in Quick Start)
- docs/quickstart.md (clone commands in one-command and manual paths)
- docs/architecture/molecule-technical-doc.md (repo links)
- docs/development/local-development.md (path example)
- docs/infra/workspace-terminal.md (factually incorrect rename claim)
- docs/integrations/opencode.md (task example)
- docs/internal-content-policy.md (repo name and path references)
- canvas/src/app/pricing/page.tsx (source code link)
- .env.example (repo name in comment)
- tools/check-template-parity.sh (path example in comment)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Covers detection, immediate fix (fresh PAT + secret update), long-term
fix (update provisioning templates), and prevention for the engineer-class
agent read:issue scope gap that blocks swarm-pull issue discovery.
Refs: #1750
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Resolves the merge conflict between main's schedule seeding (#1929) and
PR#1669's inline auth_token minting (#1644) in workspace.go Create handler.
Changes:
- Bring template_schedules.go + template_schedules_test.go from main so
parseTemplateSchedules / seedTemplateSchedules are available (#1929).
- Capture provisionOK return from provisionWorkspaceAuto (main pattern).
- Insert schedule seeding block BEFORE auth_token minting, matching main's
ordering and comment structure.
- Preserve auth_token inline minting with non-fatal fallback (PR#1669).
Both features now coexist: workspaces created from templates get schedules
seeded, AND the 201 response includes the first bearer token.
Refs: #1669, #1920, #1929
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
PR#1669 introduced func TestBuildProvisionerConfig_IncludesAwarenessSettings
without a body or closing brace, causing Go compilation failures in
Platform (Go) and Handlers Postgres Integration CI lanes.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
PR #1669 CI statuses were all showing None / not started. Pushing an
empty commit to wake the Gitea Actions runner and re-evaluate required
status checks.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Six additional tests across handlers_test.go, handlers_additional_test.go,
workspace_compute_test.go, and workspace_budget_test.go also reach the 201
path and need the INSERT INTO workspace_auth_tokens expectation.
Refs PR #1669 / mc#1644
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
PR #1669 adds inline auth_token minting via wsauth.IssueToken in the
Create handler. This inserts into workspace_auth_tokens after the
workspace row commits. Nine existing Create tests reach the 201 path
but don't mock the INSERT, causing sqlmock unmet-expectation failures.
Add the expectation to each affected test. Tests that fail before
the workspace INSERT (400/422/500-rollback) are left unchanged.
Refs PR #1669 / mc#1644
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Empirical trigger (issue #1644): staging peer-visibility E2E cannot mint
an MCP bearer for managed runtimes. The create response shipped only
{id, status, awareness_namespace, workspace_access} — no token. Callers
had two fallbacks, both broken on staging:
- POST /admin/workspaces/:id/tokens (AdminAuth-gated, canonical mint)
— returns HTML 404 on staging because the CP-admin route prefix
differs from local (`/cp/admin/...` per reference_controlplane_admin_api_access).
- GET /admin/workspaces/:id/test-token (dev-only mint) — deliberately
404s when MOLECULE_ENV=production per admin_test_token.go::TestTokensEnabled.
Per feedback_no_dev_only_routes_in_e2e (CTO 2026-05-21), E2E must
use production paths only; this fallback was always wrong.
Fix: mint the workspace's first bearer inline at the end of Create and
return it as `auth_token` in the 201 response. Now every caller (canvas
Save, org_import, E2E, third-party API) gets the bearer they need in
the same round trip — single production path, no separate mint
endpoint, no dev-only fallback, no path-prefix gotcha.
Mirrors the existing pre-register external-workspace mint shape (lines
~605-615), where the create response already includes a
`connection.token` field for the same reason. This commit extends the
pattern to spawned-runtime workspaces.
Failure mode: non-fatal. If wsauth.IssueToken errors (extremely rare —
the workspace row just committed a microsecond ago), the 201 still
ships without auth_token + a log line. Callers that need the bearer
can recover via POST /admin/workspaces/:id/tokens (canonical admin
mint). Returning the 201 without the field is friendlier than 500'ing
a partial-success write.
Tests:
- New TestWorkspaceCreate_ReturnsAuthToken_201: asserts auth_token
is present, non-empty, and >= 40 chars (sanity-bounds the
wsauth.IssueToken base64-RawURL encoding of the 32-byte payload).
Pins the INSERT INTO workspace_auth_tokens expectation so the
inline mint path can't silently drop without surfacing as
unexpected ExecQuery.
- Existing TestWorkspaceCreate (and the broader Create test family)
continue to pass — they don't assert auth_token, and the non-fatal
error branch keeps the 201 shape stable.
Verified: `go test -count=1 -short ./internal/handlers/... → OK`.
Coordinated follow-ups:
- Part A (in molecule-core test E2E scripts): once this lands +
deploys, update `test_peer_visibility_mcp_local.sh` /
`test_peer_visibility_mcp_staging.sh` to consume the inline
auth_token instead of the GET /test-token fallback. Tracked
separately; gated on Engineer-A (Kimi) Gitea persona token
injection per the production-team auth-block surface 2026-05-22.
- Drop the dev-only GET /admin/workspaces/:id/test-token route in
a follow-up once all E2E callers migrate to the inline shape.
Memory refs: feedback_no_dev_only_routes_in_e2e,
reference_controlplane_admin_api_access,
feedback_workspace_model_required_no_platform_default_dynamic_credential_intake
(this PR is the "production credential path" sibling of the model SSOT in PR#1667).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 21:58:15 -07:00
183 changed files with 15529 additions and 3828 deletions
# MOLECULE_IN_DOCKER= # Set when running the platform inside Docker (accepts 1/0, true/false). Triggers A2A proxy to rewrite 127.0.0.1:<port> agent URLs to Docker bridge hostnames. Auto-detected via /.dockerenv; only set if detection fails or to force off.
# GITHUB_TOKEN= # Personal access token / installation token used by agents that clone private repos. Register as a global secret via POST /admin/secrets for propagation to workspace env. Token is used in-URL during clone and then scrubbed from .git/config via `git remote set-url`.
debug "probe ${U} in team ${TEAM} (id=${TEAM_ID}) → HTTP ${CODE}"
@@ -317,14 +325,20 @@ for U in $CANDIDATES; do
continue
;;
404)
_ALL_CANDIDATES_403="no"
debug "${U} not a member of ${TEAM}"
;;
*)
_ALL_CANDIDATES_403="no"
echo "::warning::team-probe for ${U} in ${TEAM} returned unexpected HTTP ${CODE}"
cat "$TEAM_PROBE_TMP" >&2
;;
esac
done
echo "::error::${TEAM}-review awaiting non-author APPROVE from ${TEAM} team (candidates: $(echo "$CANDIDATES" | tr '\n' ',' | sed 's/,$//') — none are in team)"
echo "::error::${TEAM}-review FAILED — every candidate returned 403 (token owner is not a member of the ${TEAM} team). This is a TOKEN PROVISIONING issue, not a reviewer-eligibility issue. Add the token owner to the '${TEAM}' Gitea team (id=${TEAM_ID}) or use a token whose owner is already in that team."
else
echo "::error::${TEAM}-review awaiting non-author APPROVE from ${TEAM} team (candidates: $(echo "$CANDIDATES" | tr '\n' ',' | sed 's/,$//') — none are in team)"
Engineer-class agents (e.g. `agent-dev-a`, `agent-dev-b`) fail swarm-pull issue discovery or receive HTTP 403 when calling Gitea issue-list APIs, while PR review and repository API operations continue to work.
Typical failing call:
```bash
GET /api/v1/repos/molecule-ai/molecule-core/issues?state=open&labels=approved&limit=50
# => 403 Forbidden
```
Typical working calls (same token):
```bash
GET /api/v1/repos/molecule-ai/molecule-core/pulls?state=open&limit=50
POST /api/v1/repos/molecule-ai/molecule-core/pulls/1666/comments
# => 200 OK
```
## Root Cause
Gitea v1.22.6 routes issue-list under the `Issue` scope category (`routers/api/v1/api.go:1379-1491`), while PR routes live under repository/pull routing (`api.go:1278-1305`). The scope gate derives required read/write level from HTTP method (`api.go:309-313`), so `GET /issues?...` requires `read:issue`.
Engineer-class agent PATs were provisioned with repository and PR scopes but without `read:issue`, causing the asymmetric 403.
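The asymmetry described in this section follows from deriving the required scope from the route category plus the HTTP method. The sketch below is a simplified illustration of that rule, not Gitea's code: `requiredScope` and `allowed` are hypothetical helpers, the example token scopes are placeholders, and treating a write scope as implying read is an assumption.

```go
package main

import (
	"fmt"
	"net/http"
)

// requiredScope derives the scope a request needs: read-only methods need
// read:<category>, everything else needs write:<category>.
func requiredScope(method, category string) string {
	level := "write"
	if method == http.MethodGet || method == http.MethodHead {
		level = "read"
	}
	return level + ":" + category
}

// allowed reports whether any granted scope satisfies the request.
// A write scope is assumed to cover the matching read scope.
func allowed(granted []string, method, category string) bool {
	need := requiredScope(method, category)
	for _, g := range granted {
		if g == need || g == "write:"+category {
			return true
		}
	}
	return false
}

func main() {
	token := []string{"write:repository"} // placeholder scope set without read:issue
	fmt.Println("list pulls:", allowed(token, http.MethodGet, "repository"))
	fmt.Println("list issues:", allowed(token, http.MethodGet, "issue"))
	fmt.Println("list issues after adding read:issue:",
		allowed(append(token, "read:issue"), http.MethodGet, "issue"))
}
```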
## Detection
1. **Agent-side**: swarm-pull workflow logs show `403 Forbidden` on issue enumeration but not on PR list/review.
2. **Platform-side**: Gitea access logs show `GET /repos/{owner}/{repo}/issues` returning 403 for the affected token.
3. **Reproduction** (from any workspace with a suspected token):
# Trigger the agent's autonomous tick or delegate a task that enumerates open issues.
```
## Long-Term Fix
Update the **workspace secret injection path** that writes `/configs/secrets.d/GITEA_TOKEN` for engineer-class agents. The provisioning template or secret-distribution job should request `read:issue` (and optionally `write:issue`) at token-creation time.
File locations to audit:
- `.gitea/scripts/` — any token-provisioning automation
- `infra/terraform/` or equivalent — IAM/secret-manager templates
- `workspace-configs-templates/` — engineer-class workspace templates that declare required secrets
## Prevention
1. **Token scope checklist**: when provisioning new engineer-class agent tokens, verify the scope set includes `read:issue` before distributing the secret.
2. **Monitoring**: add an agent health-check that probes `GET /repos/molecule-ai/molecule-core/issues?limit=1` and surfaces a non-fatal warning if it returns 403.
3. **Documentation**: update the onboarding runbook for new engineer agents to include the full required scope list.
> **⚠️ Accuracy correction (2026-05-29):** this page is **aspirational, not
> shipped.** There is **no `gemini-cli` runtime** in `manifest.json` or the
> provisioner's `knownRuntimes`, and the "PR #379" cited below is unrelated (a
> CI-workflow-cleanup PR, not a gemini-cli adapter). Do not follow this as-is.
>
> **For Gemini on Molecule, use the real `google-adk` runtime instead** — see
> [`google-adk-runtime.md`](./google-adk-runtime.md) (ADK engine + Gemini on
> Vertex AI/AI Studio), implemented in PR
> [`molecule-ai-workspace-template-google-adk#1`](https://git.moleculesai.app/molecule-ai/molecule-ai-workspace-template-google-adk) per RFC `internal#730`.
> This gemini-cli page is retained only until it's either implemented for real or removed.
Molecule AI now ships a `gemini-cli` runtime adapter alongside the existing `claude-code` adapter. This tutorial walks you from zero to a running Gemini agent workspace in under five minutes.
Google's Agent Development Kit (ADK) is now a first-class runtime on Molecule AI. This tutorial walks you from zero to a running ADK agent workspace — one that persists per-conversation session state and sits alongside your Claude Code and Gemini CLI workers in the same A2A network.
> **Status (2026-05-29):** the `google-adk` runtime is **landing**, not yet on
After step 4, ADK streams the Gemini response through its event bus, filters for `is_final_response()` events, and returns the agent's reply as a standard A2A text part. Step 5 should reference the prior answer — the adapter ties each A2A `context_id` to an `InMemorySessionService` session, so conversation state is isolated per task context and survives across calls within the same session.
## How it works
The `google-adk` adapter wraps Google ADK's runner/session model behind the same `AgentExecutor` interface used by every other Molecule AI runtime. On each turn, `GoogleADKA2AExecutor` calls `runner.run_async()` with the incoming message wrapped in a `google.genai.types.Content` object, then drains the event stream until it collects a final-response event. The `google:` model prefix is stripped before being passed to ADK — so `google:gemini-2.0-flash` in your workspace config becomes `gemini-2.0-flash` in the ADK `LlmAgent`. Error class names are sanitized before leaving the executor; raw Google SDK stack traces never reach the A2A caller.
## Mixed-runtime teams
ADK workspaces participate in the same A2A network as Claude Code, Gemini CLI, Hermes, and LangGraph workers. An orchestrator can delegate long-context summarisation to a `google-adk` worker (Gemini 1.5 Pro's 1M token window) while routing tool-use tasks to a `claude-code` worker — with no provider-specific code in the orchestrator itself. Add an ADK peer with `POST /workspaces`, set `GOOGLE_API_KEY`, and it's available for `delegate_task` immediately.
Send it a task via the A2A proxy (`POST /workspaces/:id/a2a`, JSON-RPC
`message/send`) and it replies through the ADK `Runner`. Verified end-to-end:
a Gemini 2.5 round-trip on Vertex via ADC returns through the built image.
# (a) does NOT contain any error-as-text marker (broken-agent trap), AND
# (b) CONTAINS <expected_token> (case-insensitive) — proving a real LLM
# round-trip produced the deterministic known answer.
# Calls fail() (which exits) on either violation. This MUST fail on an
# error-as-text payload — that is the property test_completion_assert_unit.sh
# pins.
a2a_assert_real_completion() {
local text="$1"
local expected="$2"
local ctx="${3:-A2A}"
if [ -z "$text" ]; then
fail "$ctx — real-completion gate: agent returned EMPTY text (no round-trip)."
fi
local hit
if hit=$(a2a_completion_error_marker "$text"); then
fail "$ctx — real-completion gate: agent returned an ERROR-AS-TEXT payload (matched '$hit'). A broken agent that surfaces its error as a text part is NOT a completed round-trip. This is the trap the shape-only check missed (#1994). Raw: ${text:0:200}"
fi
# Known-answer: real LLM round-trip yields the deterministic token. A
fail "$ctx — real-completion gate: reply did NOT contain expected known-answer token '$expected'. The channel returned a text shape but no real completion. Raw: ${text:0:200}"
fi
ok "$ctx — real completion verified (contains '$expected', no error-as-text). Reply: \"${text:0:80}\""
}
# offered_platform_models_for_runtime <runtime>
# Emits, one per line, the platform-servable model ids the providers.yaml
# SSOT (runtimes.<runtime>.providers[name=platform].models) declares for
# <runtime>. This is the SSOT-driven offered/platform-servable matrix — NOT
# a hardcoded provider list — so a provider added/removed in providers.yaml
# automatically changes the matrix this probe exercises.
#
# Reads the embedded copy at workspace-server/internal/providers/providers.yaml
# (the same file go:embed compiles into the binary). Requires python3 +
# PyYAML (already a test-harness dep). On parse failure, emits nothing and
# returns 1 so the caller can fail loud rather than silently skip.
offered_platform_models_for_runtime() {
local runtime="$1"
local yaml_path="${PROVIDERS_YAML_PATH:-}"
if [ -z "$yaml_path" ]; then
# This lib lives at tests/e2e/lib/ -> repo root is three dirs up
fail "$ctx — byok-routing guard: could not read resolved_mode from billing-mode response. Raw: ${body:0:200}"
fi
if [ "$mode" = "platform_managed" ]; then
fail "$ctx — byok-routing guard TRIPPED (#1994 regression): a byok-configured workspace resolved to 'platform_managed' (provider_selection=$prov) → it would route through the platform proxy and drain the platform LLM key. Expected resolved_mode=byok. Raw: ${body:0:200}"
log " probe $model_id: error-as-text or empty: ${text:0:120}"
return 1
fi
return 0
}
if ! provider_liveness_matrix "$RUNTIME" provider_liveness_probe; then
fail "Per-provider liveness matrix: at least one offered provider failed its auth+routing probe (see matrix above). This is the #1994 class — a drained key / wrong base-URL / byok-misroute."
fi
ok "Per-provider liveness matrix passed (all probed offered providers completed without error)"
| `workspaceToken` | Per-workspace bearer, bound to one workspace id (+ routing header) | Read/lifecycle/secrets on a single `/workspaces/:id/*`. **Rejected** on admin list/create/delete when ADMIN_TOKEN is set — use `orgApiKey`. |
| `orgRoutingHeaderId` / `orgRoutingHeaderSlug` | `X-Molecule-Org-Id` / `X-Molecule-Org-Slug` | Required on every tenant-host request so the edge / TenantGuard route + authorize against the correct org. Send one of them alongside the bearer. |
### Guards worth knowing (modelled per-operation)
- **Dry-run:** `POST /api/v1/admin/orgs?dry_run=true` — validate + echo, no org
created. (The only dry-run on the whole management API.)
- **Confirm token:** `DELETE /api/v1/admin/tenants/:slug` and
`…/scrub-artifacts` — body `confirm` MUST equal the URL slug, else `400`
before any teardown.
- **Force flag:** `POST /api/v1/admin/workspaces/:id/env` — keys matching the
t.Errorf("period %s missing from budgetPeriods SSOT list", p)
}
}
}
Some files were not shown because too many files have changed in this diff