IsRunning used an unbounded json.NewDecoder(resp.Body).Decode on
CP status responses. Start already caps its body read at 64 KiB
(cp_provisioner.go:137) to defend against a misconfigured or
compromised CP streaming a huge body and exhausting memory.
IsRunning is called reactively per-request from a2a_proxy and
periodically from healthsweep, so it's a hotter path than Start
and arguably needs this defense even more.
Adds TestIsRunning_BoundedBodyRead that serves a body padded past
the cap and asserts the decode still succeeds on the JSON prefix.
Follow-up to code-review Nit-2 on #1073.
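A minimal sketch of the bounded decode, assuming the same 64 KiB cap
Start uses; the helper and field names are illustrative, not the
actual code:

```go
package cp

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// maxStatusBody mirrors the 64 KiB cap Start already applies.
const maxStatusBody = 64 << 10

// decodeStatus shows the shape of the change: LimitReader caps how much of
// the response body is ever buffered, and the decoder still succeeds as long
// as the JSON value fits inside that prefix.
func decodeStatus(resp *http.Response) (string, error) {
	var body struct {
		State string `json:"state"`
	}
	if err := json.NewDecoder(io.LimitReader(resp.Body, maxStatusBody)).Decode(&body); err != nil {
		return "", fmt.Errorf("cp provisioner: status: decode: %w", err)
	}
	return body.State, nil
}
```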
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
My #1071 made IsRunning return (false, err) on all error paths, but that
breaks a2a_proxy, which depends on the Docker provisioner's (true, err)
contract.
Without this fix, any brief CP outage causes a2a_proxy to mark workspaces
offline and trigger restart cascades across every tenant.
Contract now matches Docker.IsRunning:
- transport error → (true, err) — alive, degraded signal
- non-2xx response → (true, err) — alive, degraded signal
- JSON decode error → (true, err) — alive, degraded signal
- 2xx, state != running → (false, nil)
- 2xx, state == running → (true, nil)
healthsweep.go is also happy with this — it skips on err regardless.
Adds TestIsRunning_ContractCompat_A2AProxy as regression guard that
asserts each error path explicitly against the a2a_proxy expectations.
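A hypothetical caller-side sketch of that contract; Provisioner,
checkWorkspace, and markOffline are illustrative names, not the actual
a2a_proxy code:

```go
package a2aproxy

import (
	"context"
	"log"
)

// Provisioner is a slice of the real interface, shown only to pin the
// contract a2a_proxy relies on.
type Provisioner interface {
	IsRunning(ctx context.Context, workspaceID string) (bool, error)
}

// checkWorkspace treats err as a degraded-signal condition, never as proof
// the workspace is down; only an authoritative (false, nil) flips it offline.
func checkWorkspace(ctx context.Context, p Provisioner, id string, markOffline func(string)) {
	running, err := p.IsRunning(ctx, id)
	if err != nil {
		log.Printf("cp status degraded for workspace %s: %v", id, err)
	}
	if !running && err == nil {
		markOffline(id)
	}
}
```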
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-existing silent-failure path: IsRunning decoded CP responses
regardless of HTTP status, so a CP 500 → empty body → State="" →
returned (false, nil). The sweeper couldn't distinguish "workspace
stopped" from "CP broken" and would leave a dead row in place.
## Fix
- Non-2xx → wrapped error, does NOT echo the body (CP 5xx bodies may
contain echoed headers; leaking them into logs would expose the
bearer token)
- JSON decode error → wrapped error
- Transport error → now wrapped with "cp provisioner: status:"
prefix for easier log grepping
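A minimal sketch of the non-2xx path under those constraints; the
helper name is illustrative:

```go
package cp

import (
	"fmt"
	"net/http"
)

// statusError maps a non-2xx CP response to a wrapped error without echoing
// the body, since CP 5xx bodies may contain echoed request headers and the
// bearer token must not leak into logs. The caller maps this to (true, err),
// so the workspace is still treated as alive with a degraded signal.
func statusError(resp *http.Response) error {
	if resp.StatusCode < 200 || resp.StatusCode >= 300 {
		return fmt.Errorf("cp provisioner: status: unexpected HTTP %d", resp.StatusCode)
	}
	return nil
}
```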
## Tests
+7 cases (5-status table + malformed JSON + existing transport).
IsRunning coverage 100%; overall cp_provisioner at 98%.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes review gap: pre-PR coverage on CPProvisioner was 37%.
After this commit every exported method is exercised:
- NewCPProvisioner 100%
- authHeaders 100%
- Start 91.7% (remainder: json.Marshal error path, unreachable
with a fixed-type request struct)
- Stop 100% (new — header + path + error)
- IsRunning 100% (new — 4-state matrix + auth)
- Close 100% (new — contract no-op)
New cases assert both auth headers (shared secret + admin_token) land
on every outbound request, transport failures surface clear errors
on Start/Stop, and IsRunning doesn't misreport on transport failure.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #729 tightened AdminAuth to require ADMIN_TOKEN, breaking the
workspace credential helper, which called /admin/github-installation-token
with a workspace bearer token. With that call rejected, GitHub tokens
expired after 60 min with no way to refresh them.
Fix: Add /workspaces/:id/github-installation-token under WorkspaceAuth
so any authenticated workspace can refresh its GitHub token. Keep the
admin path as a backward-compatible alias.
Update molecule-git-token-helper.sh to use the workspace-scoped path
when WORKSPACE_ID is set.
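A sketch of the routing shape, assuming a gin-style router; the actual
framework, middleware, and handler names are not confirmed here:

```go
package server

import "github.com/gin-gonic/gin"

// registerTokenRoutes sketches the two paths described above. Middleware and
// handler arguments are hypothetical stand-ins for WorkspaceAuth / AdminAuth.
func registerTokenRoutes(r *gin.Engine, workspaceAuth, adminAuth, h gin.HandlerFunc) {
	// New workspace-scoped route: any authenticated workspace can refresh
	// its own GitHub installation token.
	r.GET("/workspaces/:id/github-installation-token", workspaceAuth, h)
	// Admin path kept as a backward-compatible alias.
	r.GET("/admin/github-installation-token", adminAuth, h)
}
```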
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
controlplane #118 + #130 made /cp/workspaces/* require a per-tenant
admin_token header in addition to the platform-wide shared secret.
Without it, every workspace provision / deprovision / status call
now 401s.
ADMIN_TOKEN is already injected into the tenant container by the
controlplane's Secrets Manager bootstrap, so this is purely a
header-plumbing change — no new config required on the tenant side.
## Change
- CPProvisioner carries adminToken alongside sharedSecret
- New authHeaders method sets BOTH auth headers on every outbound
request (old authHeader deleted — single call site was misleading
once the semantics changed)
- Empty values on either header are no-ops so self-hosted / dev
deployments without a real CP still work
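A minimal sketch of the authHeaders behaviour; the header names here
are assumptions, not the confirmed CP API:

```go
package cp

import "net/http"

type CPProvisioner struct {
	sharedSecret string
	adminToken   string
}

// authHeaders sets both auth headers on an outbound request. Empty values are
// no-ops so self-hosted / dev deployments without a real CP keep working.
func (p *CPProvisioner) authHeaders(req *http.Request) {
	if p.sharedSecret != "" {
		req.Header.Set("Authorization", "Bearer "+p.sharedSecret) // assumed header
	}
	if p.adminToken != "" {
		req.Header.Set("X-Admin-Token", p.adminToken) // assumed header name
	}
}
```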
## Tests
Renamed + expanded cp_provisioner_test cases:
- TestAuthHeaders_NoopWhenBothEmpty — self-hosted path
- TestAuthHeaders_SetsBothWhenBothProvided — prod happy path
- TestAuthHeaders_OnlyAdminTokenWhenSecretEmpty — transition window
Full workspace-server suite green.
## Rollout
Next tenant provision will ship an image with this commit merged.
Existing tenants (none in prod right now — hongming was the only
one and was purged earlier today) will auto-update via the 5-min
image-pull cron.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
These directories are cloned from their standalone repos
(molecule-ai-org-template-*, molecule-ai-plugin-*) and should
never be committed to molecule-core directly.
Removed the !/org-templates/molecule-dev/ exception that allowed
PR #1056 to land template files in the wrong repo.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Mock get_hma_instructions in exact-match tests so they don't break
when HMA content is appended. Add a dedicated test for HMA inclusion.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comprehensive rewrite of the Molecule AI dev team org template:
- Rename agents to {team}-{role} convention (e.g., core-be, cp-lead, app-qa)
- Add 5 new team leads: Core Platform Lead, Controlplane Lead, App & Docs Lead, Infra Lead, SDK Lead
- Add new roles: Release Manager, Integration Tester, Technical Writer, Infra-SRE, Infra-Runtime-BE, SDK-Dev, Plugin-Dev
- Delete triage-operator and triage-operator-2 (leads own triage now)
- Set default model to MiniMax-M2.7, tier 3, idle_interval_seconds 900
- Update org.yaml category_routing to new agent names
- Add orchestrator-pulse schedules for all leads (*/5 cron)
- Add pick-up-work schedules for engineers (*/15 cron)
- Add qa-review schedules for QA agents (*/15 cron)
- Add security-scan schedules for security agents (*/30 cron)
- Add release-cycle and e2e-test schedules for Release Manager and Integration Tester
- Update marketing agents with web search MCP and media generation capabilities
- All schedule prompts reference Molecule-AI/internal for PLAN.md and known-issues.md
- Un-ignore org-templates/molecule-dev/ in .gitignore for version tracking
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add MemorySeed model and initial_memories support at three levels:
- POST /workspaces payload: seed memories on workspace creation
- org.yaml workspace config: per-workspace initial_memories with
defaults fallback
- org.yaml global_memories: org-wide GLOBAL scope memories seeded
on the first root workspace during import
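A sketch of the implied shapes; struct and field names beyond
initial_memories and the LOCAL/TEAM/GLOBAL scopes are assumptions:

```go
package models

// MemorySeed carries one memory to commit when the workspace is created.
// Scope values mirror the LOCAL / TEAM / GLOBAL scopes used elsewhere.
type MemorySeed struct {
	Scope   string `json:"scope"`
	Content string `json:"content"`
}

// CreateWorkspaceRequest is hypothetical; it shows where initial_memories
// would sit on the POST /workspaces payload.
type CreateWorkspaceRequest struct {
	InitialMemories []MemorySeed `json:"initial_memories,omitempty"`
}
```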
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Every agent now gets hierarchical memory instructions in their system
prompt automatically — no template configuration needed. Instructions
cover commit_memory (LOCAL/TEAM/GLOBAL scopes), recall_memory, and
when to use each proactively.
Follows the same pattern as A2A instructions: defined in
executor_helpers.py, injected by _build_system_prompt() in the
claude_sdk_executor.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The provisioner was unconditionally writing CLAUDE_CODE_OAUTH_TOKEN into
config.yaml's required_env for all claude-code workspaces. When the
baked token expired, preflight rejected every workspace — even those
with a valid token injected via the secrets API at runtime.
Changes:
- workspace_provision.go: remove hardcoded required_env for claude-code
and codex runtimes; tokens are injected at container start via secrets
- workspace_provision_test.go: flip assertion to reject hardcoded token
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a workspace is deleted (status set to 'removed'), its schedules
remained enabled, causing the scheduler to keep firing cron jobs for
non-existent containers. Add a cascade disable query alongside the
existing token revocation and canvas layout cleanup.
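A sketch of the cascade, with assumed table and column names:

```go
package workspaces

import (
	"context"
	"database/sql"
)

// disableWorkspaceSchedules runs alongside the existing token revocation and
// canvas layout cleanup when a workspace is marked 'removed', so the
// scheduler stops firing cron jobs for the now-missing container.
func disableWorkspaceSchedules(ctx context.Context, db *sql.DB, workspaceID string) error {
	_, err := db.ExecContext(ctx,
		`UPDATE schedules SET enabled = false WHERE workspace_id = $1`, workspaceID)
	return err
}
```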
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three changes to boost agent throughput:
1. Event-driven cron triggers (webhooks.go): GitHub issues/opened events
fire all "pick-up-work" schedules immediately. PR review/submitted
events fire "PR review" and "security review" schedules. Uses
next_run_at=now() so the scheduler picks them up on next tick.
2. Auto-push hook (executor_helpers.py): After every task completion,
agents automatically push unpushed commits and open a PR targeting
staging. Guards: only on non-protected branches with unpushed work.
Uses /usr/local/bin/git and /usr/local/bin/gh wrappers with baked-in
GH_TOKEN. Never crashes the agent — all errors logged and continued.
3. Integration (claude_sdk_executor.py): auto_push_hook() called in the
_execute_locked finally block after commit_memory.
Closes productivity gap where agents wrote code but never pushed,
and where work crons only fired on timers instead of reacting to events.
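A sketch of the event-driven trigger in item 1, with assumed table and
column names:

```go
package webhooks

import (
	"context"
	"database/sql"
)

// fireSchedulesNow makes matching enabled schedules due immediately by setting
// next_run_at to now(), so the scheduler picks them up on its next tick
// instead of waiting for the cron window. The name filter is illustrative.
func fireSchedulesNow(ctx context.Context, db *sql.DB, namePattern string) error {
	_, err := db.ExecContext(ctx,
		`UPDATE schedules SET next_run_at = now() WHERE enabled = true AND name ILIKE $1`,
		namePattern)
	return err
}
```

e.g. fireSchedulesNow(ctx, db, "%pick-up-work%") on an issues/opened event.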
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Mechanical lint fix. github-code-quality[bot] flagged unused
import on line 18 — fireEvent is imported but never referenced in
the test file. Removing it clears the code quality gate without
changing any test behaviour.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The canary-verify workflow blocked the self-hosted runner for a fixed
6 minutes regardless of whether canaries had already updated, wasting
the runner slot when canaries typically update in 2-3 minutes.
Fix: poll each canary's /health endpoint every 30s for up to 7 min.
Exit early when all canaries report the expected SHA. Falls back to
proceeding after timeout — the smoke suite validates regardless.
Typical time saving: ~3-4 minutes per canary verify run.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Silent data loss on mid-cursor DB errors — partial sub-workspace
bundles were returned instead of surfacing the iteration error. Adds a
rows.Err() check after the SELECT id FROM workspaces query in
Export(), mirroring the pattern already used in scheduler.go
and handlers with similar recursion patterns.
Closes: R1 MISSING-ROWS-ERR findings (bundle/exporter.go)
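The pattern being added, as a sketch; the query shape beyond
SELECT id FROM workspaces is assumed:

```go
package bundle

import (
	"context"
	"database/sql"
	"fmt"
)

// listSubWorkspaceIDs shows why the rows.Err() check matters: a mid-cursor
// failure ends the Next() loop silently, and without the final check the
// partial slice would be returned as if it were complete.
func listSubWorkspaceIDs(ctx context.Context, db *sql.DB, parentID string) ([]string, error) {
	rows, err := db.QueryContext(ctx, `SELECT id FROM workspaces WHERE parent_id = $1`, parentID)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var ids []string
	for rows.Next() {
		var id string
		if err := rows.Scan(&id); err != nil {
			return nil, err
		}
		ids = append(ids, id)
	}
	if err := rows.Err(); err != nil { // the added check
		return nil, fmt.Errorf("iterating workspaces: %w", err)
	}
	return ids, nil
}
```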
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
GitHub's UI-configured "Code quality" scan only fires on the default
branch (staging), which leaves every staging→main promotion PR
unscanned. The "On push and pull requests to" field in the UI has no
dropdown; multi-branch scanning on private repos without GHAS isn't
available there.
A workflow file gives us the control we can't get in the UI: triggers
on push + pull_request for both branches. Runs on the same
self-hosted mac mini via [self-hosted, macos, arm64].
upload: never — GHAS isn't enabled on this repo so the SARIF upload
API 403s. Keep results locally, filter to error+warning severity,
fail the PR check on findings, publish SARIF as a workflow artifact.
Flipping upload: never → always after GHAS is enabled (if ever) is
a one-line change.
Picks up the review-flagged improvements from the earlier closed PR:
- jq install step (brew, no assumption it's present)
- severity filter (error+warning only, drops noisy note-level)
- set -euo pipefail
- SARIF glob (file name doesn't match matrix language id)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The test docstring promised polling coverage but I'd only wired the
describe-block header, not the actual tests. Closing that gap — vitest
fake timers drive three cases:
- `provisioning` org → 2nd fetch fires after 5.1s advance
- all `running` → no 2nd fetch even after 10s advance
- `awaiting_payment` org, unmount before timer fires → no post-unmount
fetch (cleanup correctly clears the pollTimer)
The unmount case is the meaningful one: without it a fast nav-away
leaves the 5s interval chasing the CP forever. page.tsx L97-99 does
clear the timer; the test pins the contract.
Local baseline on origin/staging tip ede6597 + this branch:
canvas vitest: 50 files / 781 tests, all green (+3 vs prior commit)
canvas build: clean
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two independent test additions that harden the surface freshly landed on
staging via PRs #982 (canvas fetch timeout), #992 (/orgs landing), #994
(post-checkout redirect to /orgs).
canvas/src/lib/__tests__/api.test.ts (+74 lines, 7 new tests)
- GET/POST/PATCH/PUT/DELETE each pass an AbortSignal to fetch
- TimeoutError (DOMException name=TimeoutError) propagates to the caller
- Each request installs its own signal — no shared module-level controller
that would allow one slow request to cancel an unrelated fast one
This is the hardening nit I flagged in my APPROVE-w/-nit review of
fix/canvas-api-fetch-timeout. Landing as a follow-up now that #982 is in
staging.
canvas/src/app/__tests__/orgs-page.test.tsx (+251 lines, new file, 10 tests)
- Auth guard: signed-out → redirectToLogin and no /cp/orgs fetch
- Error state: failed /cp/orgs → Error message + Retry button
- Empty list: CreateOrgForm renders
- CTA by status:
running → "Open" link targets {slug}.moleculesai.app
awaiting_payment → "Complete payment" → /pricing?org=<slug>
failed → "Contact support" mailto
- Post-checkout: ?checkout=success renders CheckoutBanner AND
history.replaceState scrubs the query param
- Fetch contract: /cp/orgs called with credentials:include + AbortSignal
Local baseline on origin/staging tip ede6597:
canvas vitest: 50 files / 778 tests, all green
canvas build: clean, /orgs route present (2.83 kB / 105 kB first-load)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>