GET /admin/memories/export returns all agent memories with workspace
name mapping. POST /admin/memories/import accepts the same format,
resolves workspaces by name, and deduplicates on content+scope.
Both endpoints are AdminAuth-gated.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The provisioner was unconditionally writing CLAUDE_CODE_OAUTH_TOKEN into
config.yaml's required_env for all claude-code workspaces. When the
baked token expired, preflight rejected every workspace — even those
with a valid token injected via the secrets API at runtime.
Changes:
- workspace_provision.go: remove hardcoded required_env for claude-code
and codex runtimes; tokens are injected at container start via secrets
- workspace_provision_test.go: flip assertion to reject hardcoded token
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a workspace is deleted (status set to 'removed'), its schedules
remained enabled, causing the scheduler to keep firing cron jobs for
non-existent containers. Add a cascade disable query alongside the
existing token revocation and canvas layout cleanup.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three changes to boost agent throughput:
1. Event-driven cron triggers (webhooks.go): GitHub issues/opened events
fire all "pick-up-work" schedules immediately. PR review/submitted
events fire "PR review" and "security review" schedules. Uses
next_run_at=now() so the scheduler picks them up on next tick.
2. Auto-push hook (executor_helpers.py): After every task completion,
agents automatically push unpushed commits and open a PR targeting
staging. Guards: only on non-protected branches with unpushed work.
Uses /usr/local/bin/git and /usr/local/bin/gh wrappers with baked-in
GH_TOKEN. Never crashes the agent — all errors logged and continued.
3. Integration (claude_sdk_executor.py): auto_push_hook() called in the
_execute_locked finally block after commit_memory.
Closes productivity gap where agents wrote code but never pushed,
and where work crons only fired on timers instead of reacting to events.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GitHub's UI-configured "Code quality" scan only fires on the default
branch (staging), which leaves every staging→main promotion PR
unscanned. The "On push and pull requests to" field in the UI has no
dropdown; multi-branch scanning on private repos without GHAS isn't
available there.
Workflow file gives us the control we can't get in the UI: triggers
on push + pull_request for both branches. Runs on the same
self-hosted mac mini via [self-hosted, macos, arm64].
upload: never — GHAS isn't enabled on this repo so the SARIF upload
API 403s. Keep results locally, filter to error+warning severity,
fail the PR check on findings, publish SARIF as a workflow artifact.
Flipping upload: never → always after GHAS is enabled (if ever) is
a one-line change.
Picks up the review-flagged improvements from the earlier closed PR:
- jq install step (brew, no assumption it's present)
- severity filter (error+warning only, drops noisy note-level)
- set -euo pipefail
- SARIF glob (file name doesn't match matrix language id)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GitHub-hosted ubuntu-latest runs on this repo hit "recent account
payments have failed or your spending limit needs to be increased"
— same root cause as the publish + CodeQL + molecule-app workflow
moves earlier this quarter. canary-verify was the last one still on
ubuntu-latest.
Switches both jobs to [self-hosted, macos, arm64]. crane install
switched from Linux tarball to brew (matches promote-latest.yml's
install pattern + avoids /usr/local/bin write perms on the shared
mac mini).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously, the scheduler skipped cron fires entirely when a workspace
had active_tasks > 0 (#115). This caused permanent cron misses for
workspaces kept perpetually busy by the 5-min Orchestrator pulse — work
crons (pick-up-work, PR review) were skipped every fire because the
agent was always processing a delegation.
Measured impact on Dev Lead: 17 context-deadline-exceeded timeouts in
2 hours, ~30% of inter-agent messages silently dropped.
Fix: when workspace is busy, poll every 10s for up to 2 minutes waiting
for idle. If idle within the window, fire normally. If still busy after
2 min, fall back to the original skip behavior.
This is a minimal, safe change:
- No new goroutines or channels
- Same fire path once idle
- Bounded wait (2 min max, won't block the scheduler pool)
- Falls back to skip if workspace never becomes idle
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wraps /orgs in a TermsGate that polls /cp/auth/terms-status on mount
and overlays a blocking modal when the current terms version hasn't
been accepted yet. "I agree" POSTs /cp/auth/accept-terms and dismisses
the modal; the backend records IP + UA as GDPR Art. 7 proof-of-consent.
Also adds a short data residency notice under the page header:
workspaces run in AWS us-east-2 (Ohio, US). An EU region selector is
a future lift once the infra is provisioned there.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the UI surface for the credit system to /orgs:
- CreditsPill next to each org row. Tone shifts from zinc → amber at
10% of plan to red at zero.
- LowCreditsBanner appears under the pill for running orgs when the
balance crosses thresholds: overage_used > 0 → "overage active",
balance <= 0 → "out of credits, upgrade", trial tail → "trial almost
out".
- Pure helpers extracted to lib/credits.ts so formatCredits, pillTone,
and bannerKind are unit-tested without jsdom.
Backend List query now returns credits_balance / plan_monthly_credits
/ overage_used_credits / overage_cap_credits so no second round-trip
is needed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Escape hatch for the initial rollout window (canary fleet not yet
provisioned, so canary-verify.yml's automatic promotion doesn't fire)
AND for manual rollback scenarios.
Uses the default GITHUB_TOKEN which carries write:packages on repo-
owned GHCR images, so no new secrets are needed. crane handles the
remote retag without pulling or pushing layers.
Validates the src tag exists before retagging + verifies the :latest
digest post-retag so a typo can't silently promote the wrong image.
Trigger from Actions → promote-latest → Run workflow → enter the
short sha (e.g. "4c1d56e").
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Publish has been failing since the 2026-04-18 open-source restructure
(#964's merge) because workspace-server/Dockerfile still COPYs
./molecule-ai-plugin-github-app-auth/ but the restructure moved that
code out to its own repo. Every main merge since has produced a
"failed to compute cache key: /molecule-ai-plugin-github-app-auth:
not found" error — prod images haven't moved.
Fix: add an actions/checkout step that fetches the plugin repo into
the build context before docker build runs.
Private-repo safe: uses PLUGIN_REPO_PAT secret (fine-grained PAT with
Contents:Read on Molecule-AI/molecule-ai-plugin-github-app-auth).
Falls back to the default GITHUB_TOKEN if the plugin repo is public.
Ops: set repo secret PLUGIN_REPO_PAT before the next main merge, or
publish will fail with a 404 on the checkout step.
Also gitignores the cloned dir so local dev builds don't accidentally
commit it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two small polish items that together close the signup-to-running-tenant
flow for real users:
1. Stripe success_url now points at /orgs?checkout=success instead of
the current page (was pricing). The old behavior left people staring
at plan cards with no indication payment went through — the new
behavior drops them right onto their org list where they can watch
the status flip.
2. /orgs shows a green "Payment confirmed, workspace spinning up"
banner when it sees ?checkout=success, then clears the query
param via replaceState so a reload doesn't show it again.
3. /orgs now polls every 5s while any org is awaiting_payment or
provisioning. Users see the Stripe webhook's effect live — no
manual refresh needed — and once every org settles the polling
stops so idle tabs don't hammer /cp/orgs.
Paired with PR #992 (the /orgs page itself) this makes the end-to-end
flow on BILLING_REQUIRED=true deployments feel right:
/pricing → Stripe → /orgs?checkout=success → banner → live poll →
"Open" button when org.status transitions to running.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CP's Callback handler redirects every new WorkOS session to
APP_URL/orgs, but canvas had no such route — new users hit the canvas
Home component, which tries to call /workspaces on a tenant that
doesn't exist yet, and saw a confusing error. This PR plugs that gap
with a dedicated landing page that:
- Bounces anonymous visitors back to /cp/auth/login
- Zero-org users see a slug-picker (POST /cp/orgs, refresh)
- For each existing org, shows status + CTA:
* awaiting_payment → amber "Complete payment" → /pricing?org=…
* running → emerald "Open" → https://<slug>.moleculesai.app
* failed → "Contact support" → mailto
* provisioning → read-only "provisioning…"
- Surfaces errors inline with a Retry button
Deliberately server-light: one GET /cp/orgs, no WebSocket, no canvas
store hydration. Goal is to move the user from signup to either
Stripe Checkout or their tenant URL with one click each.
Closes the last UX gap between the BILLING_REQUIRED gate landing on
the CP and real users being able to complete a signup today.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The phantom-producer detector (#795) was doing UPDATE + SELECT in two
roundtrips — first incrementing consecutive_empty_runs, then re-
reading to check the stale threshold. Switch to UPDATE ... RETURNING
so the post-increment value comes back in one query.
Called once per schedule per cron tick. At 100 tenants × dozens of
schedules per tenant, the halved DB traffic on the empty-response
path is measurable, not just cosmetic.
Also now properly logs if the bump itself fails (previously it silent-
swallowed the ExecContext error and still ran the SELECT, which would
confuse debugging).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Post-merge audit flagged cp_provisioner.go as the only new file from
the canary/C1 work without test coverage. Fills the gap:
- NewCPProvisioner_RequiresOrgID — self-hosted without MOLECULE_ORG_ID
refuses to construct (avoids silent phone-home to prod CP).
- NewCPProvisioner_FallsBackToProvisionSharedSecret — the operator
ergonomics of using one env-var name on both sides of the wire.
- AuthHeader noop + happy path — bearer only set when secret is set.
- Start_HappyPath — end-to-end POST to stubbed CP, bearer forwarded,
instance_id parsed out of response.
- Start_Non201ReturnsStructuredError — when CP returns structured
{"error":"…"}, that message surfaces to the caller.
- Start_NoStructuredErrorFallsBackToSize — regression gate for the
anti-log-leak change from PR #980: raw upstream body must NOT
appear in the error, only the byte count.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the canary loop with the escape hatch and a single place to
read about the whole flow.
scripts/rollback-latest.sh <sha>
uses crane to retag :latest ← :staging-<sha> for BOTH the platform
and tenant images. Pre-checks the target tag exists and verifies
the :latest digest after the move so a bad ops typo doesn't
silently promote the wrong thing. Prod tenants auto-update to the
rolled-back digest within their 5-min cycle. Exit codes: 0 = both
retagged, 1 = registry/tag error, 2 = usage error.
docs/architecture/canary-release.md
The one-page map of the pipeline: how PR → main → staging-<sha> →
canary smoke → :latest promotion works end-to-end, how to add a
canary tenant, how to roll back, and what this gate explicitly does
NOT catch (prod-only data, config drift, cross-tenant bugs).
No code changes in the CP or workspace-server — this PR is shell
+ docs only, so it's safe to land independently of the other Phase
{1,1.5,2,3} PRs still in review.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Completes the canary release train. Before this, publish-workspace-
server-image.yml pushed both :staging-<sha> and :latest on every
main merge — meaning the prod tenant fleet auto-pulled every image
immediately, before any post-deploy smoke test. A broken image
(think: this morning's E2E current_task drift, but shipped at 3am
instead of caught in CI) would have fanned out to every running
tenant within 5 min.
Now:
- publish workflow pushes :staging-<sha> ONLY
- canary tenants are configured to track :staging-<sha>; they pick
up the new image on their next auto-update cycle
- canary-verify.yml runs the smoke suite (Phase 2) after the sleep
- on green: a new promote-to-latest job uses crane to remotely
retag :staging-<sha> → :latest for both platform and tenant images
- prod tenants auto-update to the newly-retagged :latest within
their usual 5-min window
- on red: :latest stays frozen on prior good digest; prod is untouched
crane is pulled onto the runner (~4 MB, GitHub release) rather than
docker-daemon retag so the workflow doesn't need a privileged runner.
Rollback: if canary passed but something surfaces post-promotion,
operator runs "crane tag ghcr.io/molecule-ai/platform:<prior-good-sha>
latest" manually. A follow-up can wrap that in a Phase 4 admin
endpoint / script.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Post-deploy verification for staging tenant images. Runs against the
canary fleet after each publish-workspace-server-image build — catches
auto-update breakage (a la today's E2E current_task drift) before it
propagates to the prod tenant fleet that auto-pulls :latest every 5 min.
scripts/canary-smoke.sh iterates a space-sep list of canary base URLs
(paired with their ADMIN_TOKENs) and checks:
- /admin/liveness reachable with admin bearer (tenant boot OK)
- /workspaces list responds (wsAuth + DB path OK)
- /memories/commit + /memories/search round-trip (encryption + scrubber)
- /events admin read (AdminAuth C4 path)
- /admin/liveness without bearer returns 401 (C4 fail-closed regression)
.github/workflows/canary-verify.yml runs after publish succeeds:
- 6-min sleep (tenant auto-updater pulls every 5 min)
- bash scripts/canary-smoke.sh with secrets pulled from repo settings
- on failure: writes a Step Summary flagging that :latest should be
rolled back to prior known-good digest
Phase 3 follow-up will split the publish workflow so only
:staging-<sha> ships initially, and canary-verify's green gate is
what promotes :staging-<sha> → :latest. This commit lays the test
gate alone so we have something running against tenants immediately.
Secrets to set in GitHub repo settings before this workflow can run:
- CANARY_TENANT_URLS (space-sep list)
- CANARY_ADMIN_TOKENS (same order as URLs)
- CANARY_CP_SHARED_SECRET (matches staging CP PROVISION_SHARED_SECRET)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>