Commit Graph

1046 Commits

Author SHA1 Message Date
Hongming Wang
ccaa0a6b8a Merge pull request #1012 from Molecule-AI/ci/codeql-workflow-covers-main
ci(codeql): scan main + staging via workflow (UI can't multi-branch)
2026-04-19 14:37:41 -07:00
Hongming Wang
07ec90a23c ci(codeql): cover main + staging via workflow
GitHub's UI-configured "Code quality" scan only fires on the default
branch (staging), which leaves every staging→main promotion PR
unscanned. The "On push and pull requests to" field in the UI has no
dropdown; multi-branch scanning on private repos without GHAS isn't
available there.

Workflow file gives us the control we can't get in the UI: triggers
on push + pull_request for both branches. Runs on the same
self-hosted mac mini via [self-hosted, macos, arm64].

upload: never — GHAS isn't enabled on this repo so the SARIF upload
API 403s. Keep results locally, filter to error+warning severity,
fail the PR check on findings, publish SARIF as a workflow artifact.
Flipping upload: never → always after GHAS is enabled (if ever) is
a one-line change.

Picks up the review-flagged improvements from the earlier closed PR:
  - jq install step (brew, no assumption it's present)
  - severity filter (error+warning only, drops noisy note-level)
  - set -euo pipefail
  - SARIF glob (file name doesn't match matrix language id)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 14:34:04 -07:00
Hongming Wang
0bea001ef3 Merge pull request #1008 from Molecule-AI/fix/ci-canary-verify-self-hosted
fix(ci): move canary-verify to self-hosted runner
2026-04-19 11:41:11 -07:00
Hongming Wang
53c55097f8 fix(ci): move canary-verify to self-hosted runner
GitHub-hosted ubuntu-latest runs on this repo hit "recent account
payments have failed or your spending limit needs to be increased"
— same root cause as the publish + CodeQL + molecule-app workflow
moves earlier this quarter. canary-verify was the last one still on
ubuntu-latest.

Switches both jobs to [self-hosted, macos, arm64]. crane install
switched from Linux tarball to brew (matches promote-latest.yml's
install pattern + avoids /usr/local/bin write perms on the shared
mac mini).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 11:26:41 -07:00
Hongming Wang
ec14cb2975 Merge pull request #1006 from Molecule-AI/feat/tos-gate-eu-notice
feat(canvas): ToS gate modal + us-east-2 data residency notice
2026-04-19 07:54:15 -07:00
Hongming Wang
dd5654b803 feat(canvas): ToS gate modal + us-east-2 data residency notice
Wraps /orgs in a TermsGate that polls /cp/auth/terms-status on mount
and overlays a blocking modal when the current terms version hasn't
been accepted yet. "I agree" POSTs /cp/auth/accept-terms and dismisses
the modal; the backend records IP + UA as GDPR Art. 7 proof-of-consent.

Also adds a short data residency notice under the page header:
workspaces run in AWS us-east-2 (Ohio, US). An EU region selector is
a future lift once the infra is provisioned there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 07:44:47 -07:00
Hongming Wang
adf43c50d8 Merge pull request #1005 from Molecule-AI/feat/credits-phase-5-ui
feat(canvas): Phase 5 — credit balance pill + low-balance banner
2026-04-19 07:32:44 -07:00
Hongming Wang
18894bebe8 feat(canvas): Phase 5 — credit balance pill + low-balance banner
Adds the UI surface for the credit system to /orgs:
- CreditsPill next to each org row. Tone shifts from zinc → amber at
  10% of plan to red at zero.
- LowCreditsBanner appears under the pill for running orgs when the
  balance crosses thresholds: overage_used > 0 → "overage active",
  balance <= 0 → "out of credits, upgrade", trial tail → "trial almost
  out".
- Pure helpers extracted to lib/credits.ts so formatCredits, pillTone,
  and bannerKind are unit-tested without jsdom.

Backend List query now returns credits_balance / plan_monthly_credits
/ overage_used_credits / overage_cap_credits so no second round-trip
is needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 07:27:29 -07:00
Hongming Wang
24e8c5affd Merge pull request #1004 from Molecule-AI/staging
promote: staging → main — brew cleanup fix
2026-04-19 05:56:18 -07:00
Hongming Wang
eeb19f2584 Merge pull request #1003 from Molecule-AI/ci/promote-latest-self-hosted
ci(promote-latest): suppress brew cleanup perm-denied
2026-04-19 05:56:01 -07:00
Hongming Wang
7619e44802 ci(promote-latest): suppress brew cleanup that hits perm-denied on shared runner 2026-04-19 05:55:45 -07:00
Hongming Wang
57596a5b7a Merge pull request #1002 from Molecule-AI/staging
promote: staging → main — self-hosted promote-latest
2026-04-19 05:54:22 -07:00
Hongming Wang
37cc0a004c Merge pull request #1001 from Molecule-AI/ci/promote-latest-self-hosted
ci(promote-latest): run on self-hosted mac mini
2026-04-19 05:53:54 -07:00
Hongming Wang
fb2c126ed1 ci(promote-latest): run on self-hosted mac mini (GH-hosted quota blocked) 2026-04-19 05:53:39 -07:00
Hongming Wang
5de0110dd1 Merge pull request #1000 from Molecule-AI/staging
promote: staging → main — promote-latest workflow + codeql self-hosted
2026-04-19 05:52:06 -07:00
Hongming Wang
756e57b788 Merge pull request #999 from Molecule-AI/ci/promote-latest-workflow
ci(promote-latest): workflow_dispatch retag :staging-<sha> → :latest
2026-04-19 05:43:45 -07:00
Hongming Wang
5a67c6be4a ci(promote-latest): workflow_dispatch to retag :staging-<sha> → :latest
Escape hatch for the initial rollout window (canary fleet not yet
provisioned, so canary-verify.yml's automatic promotion doesn't fire)
AND for manual rollback scenarios.

Uses the default GITHUB_TOKEN which carries write:packages on repo-
owned GHCR images, so no new secrets are needed. crane handles the
remote retag without pulling or pushing layers.

Validates the src tag exists before retagging + verifies the :latest
digest post-retag so a typo can't silently promote the wrong image.

Trigger from Actions → promote-latest → Run workflow → enter the
short sha (e.g. "4c1d56e").

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 05:42:48 -07:00
Hongming Wang
e9be46ab0b Merge pull request #997 from Molecule-AI/staging
promote: staging → main — unblock publish workflow (private-repo plugin clone)
2026-04-19 05:34:39 -07:00
Hongming Wang
9ecadaf573 Merge pull request #996 from Molecule-AI/fix/publish-clone-plugin-sibling
fix(ci): clone sibling plugin repo so publish-workspace-server-image builds
2026-04-19 05:32:01 -07:00
Hongming Wang
ac85ee2a0d fix(ci): clone sibling plugin repo so publish-workspace-server-image builds
Publish has been failing since the 2026-04-18 open-source restructure
(#964's merge) because workspace-server/Dockerfile still COPYs
./molecule-ai-plugin-github-app-auth/ but the restructure moved that
code out to its own repo. Every main merge since has produced a
"failed to compute cache key: /molecule-ai-plugin-github-app-auth:
not found" error — prod images haven't moved.

Fix: add an actions/checkout step that fetches the plugin repo into
the build context before docker build runs.

Private-repo safe: uses PLUGIN_REPO_PAT secret (fine-grained PAT with
Contents:Read on Molecule-AI/molecule-ai-plugin-github-app-auth).
Falls back to the default GITHUB_TOKEN if the plugin repo is public.

Ops: set repo secret PLUGIN_REPO_PAT before the next main merge, or
publish will fail with a 404 on the checkout step.

Also gitignores the cloned dir so local dev builds don't accidentally
commit it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 05:19:31 -07:00
Hongming Wang
62b409423d Merge pull request #995 from Molecule-AI/staging
promote: staging → main — #994 post-checkout UX
2026-04-19 04:35:34 -07:00
Hongming Wang
ede6597cc0 Merge pull request #994 from Molecule-AI/feat/canvas-post-checkout-redirect
feat(canvas): post-checkout UX — Stripe success lands on /orgs with live banner
2026-04-19 04:32:02 -07:00
Hongming Wang
3021785391 Merge pull request #993 from Molecule-AI/staging
promote: staging → main — canary infra + /orgs + env refresh + perf
2026-04-19 04:26:13 -07:00
Hongming Wang
26b89400ed test(canvas): bump billing test for /orgs success_url 2026-04-19 04:26:01 -07:00
Hongming Wang
d77378294b feat(canvas): post-checkout UX — Stripe success lands on /orgs with banner
Two small polish items that together close the signup-to-running-tenant
flow for real users:

1. Stripe success_url now points at /orgs?checkout=success instead of
   the current page (was pricing). The old behavior left people staring
   at plan cards with no indication payment went through — the new
   behavior drops them right onto their org list where they can watch
   the status flip.

2. /orgs shows a green "Payment confirmed, workspace spinning up"
   banner when it sees ?checkout=success, then clears the query
   param via replaceState so a reload doesn't show it again.

3. /orgs now polls every 5s while any org is awaiting_payment or
   provisioning. Users see the Stripe webhook's effect live — no
   manual refresh needed — and once every org settles the polling
   stops so idle tabs don't hammer /cp/orgs.

Paired with PR #992 (the /orgs page itself) this makes the end-to-end
flow on BILLING_REQUIRED=true deployments feel right:
  /pricing → Stripe → /orgs?checkout=success → banner → live poll →
  "Open" button when org.status transitions to running.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 04:18:32 -07:00
Hongming Wang
08e37f3c87 Merge pull request #992 from Molecule-AI/feat/canvas-orgs-landing
feat(canvas): /orgs landing page for post-signup users
2026-04-19 04:15:50 -07:00
Hongming Wang
b29ffb9546 feat(canvas): /orgs landing page for post-signup users
CP's Callback handler redirects every new WorkOS session to
APP_URL/orgs, but canvas had no such route — new users hit the canvas
Home component, which tries to call /workspaces on a tenant that
doesn't exist yet, and saw a confusing error. This PR plugs that gap
with a dedicated landing page that:

- Bounces anonymous visitors back to /cp/auth/login
- Zero-org users see a slug-picker (POST /cp/orgs, refresh)
- For each existing org, shows status + CTA:
  * awaiting_payment → amber "Complete payment" → /pricing?org=…
  * running          → emerald "Open" → https://<slug>.moleculesai.app
  * failed           → "Contact support" → mailto
  * provisioning     → read-only "provisioning…"
- Surfaces errors inline with a Retry button

Deliberately server-light: one GET /cp/orgs, no WebSocket, no canvas
store hydration. Goal is to move the user from signup to either
Stripe Checkout or their tenant URL with one click each.

Closes the last UX gap between the BILLING_REQUIRED gate landing on
the CP and real users being able to complete a signup today.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 04:13:54 -07:00
Hongming Wang
393ecc74e3 Merge pull request #991 from Molecule-AI/perf/scheduler-returning-clause
perf(scheduler): collapse empty-run bump to single RETURNING query
2026-04-19 03:48:42 -07:00
Hongming Wang
69d5af6636 Merge pull request #990 from Molecule-AI/fix/cp-provisioner-tests
test(ws-server): CPProvisioner coverage — auth, env fallback, error paths
2026-04-19 03:48:40 -07:00
Hongming Wang
b749c558f3 perf(scheduler): collapse empty-run bump to single RETURNING query
The phantom-producer detector (#795) was doing UPDATE + SELECT in two
roundtrips — first incrementing consecutive_empty_runs, then re-
reading to check the stale threshold. Switch to UPDATE ... RETURNING
so the post-increment value comes back in one query.

Called once per schedule per cron tick. At 100 tenants × dozens of
schedules per tenant, the halved DB traffic on the empty-response
path is measurable, not just cosmetic.

Also now properly logs if the bump itself fails (previously it silent-
swallowed the ExecContext error and still ran the SELECT, which would
confuse debugging).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 03:44:48 -07:00
Hongming Wang
5a33784ea1 Merge pull request #989 from Molecule-AI/feat/canary-rollback-script
feat(canary): rollback script + release-pipeline doc (Phase 4)
2026-04-19 03:41:53 -07:00
Hongming Wang
296c52cb25 test(ws-server): cover CPProvisioner — auth, env fallback, error paths
Post-merge audit flagged cp_provisioner.go as the only new file from
the canary/C1 work without test coverage. Fills the gap:

- NewCPProvisioner_RequiresOrgID — self-hosted without MOLECULE_ORG_ID
  refuses to construct (avoids silent phone-home to prod CP).
- NewCPProvisioner_FallsBackToProvisionSharedSecret — the operator
  ergonomics of using one env-var name on both sides of the wire.
- AuthHeader noop + happy path — bearer only set when secret is set.
- Start_HappyPath — end-to-end POST to stubbed CP, bearer forwarded,
  instance_id parsed out of response.
- Start_Non201ReturnsStructuredError — when CP returns structured
  {"error":"…"}, that message surfaces to the caller.
- Start_NoStructuredErrorFallsBackToSize — regression gate for the
  anti-log-leak change from PR #980: raw upstream body must NOT
  appear in the error, only the byte count.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 03:41:16 -07:00
Hongming Wang
ee2cfe1148 Merge pull request #988 from Molecule-AI/feat/canary-gate-latest-tag
feat(canary): gate :latest tag promotion on canary verify green (Phase 3)
2026-04-19 03:38:22 -07:00
Hongming Wang
97dbe1c987 feat(canary): rollback-latest script + release-pipeline doc (Phase 4)
Closes the canary loop with the escape hatch and a single place to
read about the whole flow.

scripts/rollback-latest.sh <sha>
  uses crane to retag :latest ← :staging-<sha> for BOTH the platform
  and tenant images. Pre-checks the target tag exists and verifies
  the :latest digest after the move so a bad ops typo doesn't
  silently promote the wrong thing. Prod tenants auto-update to the
  rolled-back digest within their 5-min cycle. Exit codes: 0 = both
  retagged, 1 = registry/tag error, 2 = usage error.

docs/architecture/canary-release.md
  The one-page map of the pipeline: how PR → main → staging-<sha> →
  canary smoke → :latest promotion works end-to-end, how to add a
  canary tenant, how to roll back, and what this gate explicitly does
  NOT catch (prod-only data, config drift, cross-tenant bugs).

No code changes in the CP or workspace-server — this PR is shell
+ docs only, so it's safe to land independently of the other Phase
{1,1.5,2,3} PRs still in review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 03:37:42 -07:00
Hongming Wang
7e3a6fd756 feat(canary): gate :latest tag promotion on canary verify green (Phase 3)
Completes the canary release train. Before this, publish-workspace-
server-image.yml pushed both :staging-<sha> and :latest on every
main merge — meaning the prod tenant fleet auto-pulled every image
immediately, before any post-deploy smoke test. A broken image
(think: this morning's E2E current_task drift, but shipped at 3am
instead of caught in CI) would have fanned out to every running
tenant within 5 min.

Now:
- publish workflow pushes :staging-<sha> ONLY
- canary tenants are configured to track :staging-<sha>; they pick
  up the new image on their next auto-update cycle
- canary-verify.yml runs the smoke suite (Phase 2) after the sleep
- on green: a new promote-to-latest job uses crane to remotely
  retag :staging-<sha> → :latest for both platform and tenant images
- prod tenants auto-update to the newly-retagged :latest within
  their usual 5-min window
- on red: :latest stays frozen on prior good digest; prod is untouched

crane is pulled onto the runner (~4 MB, GitHub release) rather than
docker-daemon retag so the workflow doesn't need a privileged runner.

Rollback: if canary passed but something surfaces post-promotion,
operator runs "crane tag ghcr.io/molecule-ai/platform:<prior-good-sha>
latest" manually. A follow-up can wrap that in a Phase 4 admin
endpoint / script.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 03:33:04 -07:00
Hongming Wang
fdcc715e97 Merge pull request #987 from Molecule-AI/feat/canary-smoke-harness
feat(canary): smoke harness + GHA verify workflow (Phase 2)
2026-04-19 03:31:22 -07:00
Hongming Wang
c1ac32365d feat(canary): smoke harness + GHA verification workflow (Phase 2)
Post-deploy verification for staging tenant images. Runs against the
canary fleet after each publish-workspace-server-image build — catches
auto-update breakage (a la today's E2E current_task drift) before it
propagates to the prod tenant fleet that auto-pulls :latest every 5 min.

scripts/canary-smoke.sh iterates a space-sep list of canary base URLs
(paired with their ADMIN_TOKENs) and checks:
- /admin/liveness reachable with admin bearer (tenant boot OK)
- /workspaces list responds (wsAuth + DB path OK)
- /memories/commit + /memories/search round-trip (encryption + scrubber)
- /events admin read (AdminAuth C4 path)
- /admin/liveness without bearer returns 401 (C4 fail-closed regression)

.github/workflows/canary-verify.yml runs after publish succeeds:
- 6-min sleep (tenant auto-updater pulls every 5 min)
- bash scripts/canary-smoke.sh with secrets pulled from repo settings
- on failure: writes a Step Summary flagging that :latest should be
  rolled back to prior known-good digest

Phase 3 follow-up will split the publish workflow so only
:staging-<sha> ships initially, and canary-verify's green gate is
what promotes :staging-<sha> → :latest. This commit lays the test
gate alone so we have something running against tenants immediately.

Secrets to set in GitHub repo settings before this workflow can run:
- CANARY_TENANT_URLS (space-sep list)
- CANARY_ADMIN_TOKENS (same order as URLs)
- CANARY_CP_SHARED_SECRET (matches staging CP PROVISION_SHARED_SECRET)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 03:30:19 -07:00
Hongming Wang
c40e7ba68c Merge pull request #986 from Molecule-AI/feat/tenant-cp-env-refresh
feat(ws-server): pull env from CP on startup
2026-04-19 03:27:14 -07:00
Hongming Wang
f09ff8dc4f Merge pull request #985 from Molecule-AI/docs/saas-migration-notes-prod
docs: 2026-04-19 SaaS prod migration notes
2026-04-19 03:27:12 -07:00
Hongming Wang
bc83f0ac23 Merge pull request #982 from Molecule-AI/fix/canvas-api-fetch-timeout
fix(canvas): add 15s fetch timeout on API calls
2026-04-19 03:27:09 -07:00
Hongming Wang
80caba97ee feat(ws-server): pull env from CP on startup
Paired with molecule-controlplane PR #55 (GET /cp/tenants/config). Lets
existing tenants heal themselves when we rotate or add a CP-side env
var (e.g. MOLECULE_CP_SHARED_SECRET landing earlier today) without any
ssh or re-provision.

Flow: main() calls refreshEnvFromCP() before any other os.Getenv read.
The helper reads MOLECULE_ORG_ID + ADMIN_TOKEN from the baked-in
user-data env, GETs {MOLECULE_CP_URL}/cp/tenants/config with those
credentials, and applies the returned string map via os.Setenv so
downstream code (CPProvisioner, etc.) sees the fresh values.

Best-effort semantics:
- self-hosted / no MOLECULE_ORG_ID → no-op (return nil)
- CP unreachable / non-200 → log + return error (main keeps booting)
- oversized values (>4 KiB each) rejected to avoid env pollution
- body read capped at 64 KiB

Once this image hits GHCR, the 5-minute tenant auto-updater picks it
up, the container restarts, refresh runs, and every tenant has
MOLECULE_CP_SHARED_SECRET within ~5 minutes — no operator toil.

Also fixes workspace-server/.gitignore so `server` no longer matches
the cmd/server package dir — it only ignored the compiled binary but
pattern was too broad. Anchored to `/server`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 02:41:15 -07:00
Hongming Wang
ca5a5f1a7f docs: 2026-04-19 SaaS prod migration notes
Captures the 10-PR staging→main cutover: what shipped, the three new
Railway prod env vars (PROVISION_SHARED_SECRET / EC2_VPC_ID /
CP_BASE_URL), and the sharp edge for existing tenants — their
containers pre-date PR #53 so they still need MOLECULE_CP_SHARED_SECRET
added manually (or a re-provision) before the new CPProvisioner's
outbound bearer works.

Also includes a post-deploy verification checklist and rollback plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 02:29:31 -07:00
Hongming Wang
5e1032289e Merge pull request #983 from Molecule-AI/staging
promote: staging → main (security hardening + Phase 35.1)
2026-04-19 02:28:05 -07:00
Hongming Wang
99417b7b20 Merge pull request #984 from Molecule-AI/fix/e2e-current-task-public-get
fix(e2e): stop asserting current_task on public workspace GET
2026-04-19 02:21:08 -07:00
Hongming Wang
f32196d351 fix(e2e): stop asserting current_task on public workspace GET (#966)
PR #966 intentionally stripped current_task, last_sample_error, and
workspace_dir from the public GET /workspaces/:id response to avoid
leaking task bodies to anyone with a workspace bearer. The E2E smoke
test hadn't caught up — it was still asserting "current_task":"..."
on the single-workspace GET, which made every post-#966 CI run fail
with '60 passed, 2 failed'.

Swap the per-workspace asserts to check active_tasks (still exposed,
canonical busy signal) and keep the list-endpoint check that proves
admin-auth'd callers still see current_task end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 02:19:15 -07:00
Hongming Wang
956c1fb9b9 fix(canvas): add 15s fetch timeout on API calls
Pre-launch audit flagged api.ts as missing a timeout on every fetch.
A slow or hung CP response would leave the UI spinning indefinitely
with no way for the user to abort — effectively a client-side DoS.

15s is long enough for real CP queries (slowest observed is Stripe
portal redirect at ~3s) and short enough that a stalled backend
surfaces as a clear error with a retry affordance.

Uses AbortSignal.timeout (widely supported since 2023) so the
abort propagates through React Query / SWR consumers cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 02:12:47 -07:00
Hongming Wang
896a34429a Merge pull request #981 from Molecule-AI/fix/security-tenant-cpprovisioner-bearer
fix(security): tenant CPProvisioner sends CP bearer on provision / stop / status
2026-04-19 01:55:20 -07:00
Hongming Wang
a79366a04a fix(security): tenant CPProvisioner attaches CP bearer on all calls
Completes the C1 integration (PR #50 on molecule-controlplane). The CP
now requires Authorization: Bearer <PROVISION_SHARED_SECRET> on all
three /cp/workspaces/* endpoints; without this change the tenant-side
Start/Stop/IsRunning calls would all 401 (or 404 when the CP's routes
refused to mount) and every workspace provision from a SaaS tenant
would silently fail.

Reads MOLECULE_CP_SHARED_SECRET, falling back to PROVISION_SHARED_SECRET
so operators can use one env-var name on both sides of the wire. Empty
value is a no-op: self-hosted deployments with no CP or a CP that
doesn't gate /cp/workspaces/* keep working as before.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 01:53:12 -07:00
Hongming Wang
c8c92ffe21 Merge pull request #980 from Molecule-AI/fix/security-log-scrubbing
fix(security): scrub workspace-server token + upstream error logs
2026-04-19 01:39:39 -07:00
Hongming Wang
365f13199e fix(security): scrub workspace-server token + upstream error logs
Two findings from the pre-launch log-scrub audit:

1. handlers/workspace_provision.go:548 logged `token[:8]` — the exact
   H1 pattern that panicked on short keys. Even with a length guard,
   leaking 8 chars of an auth token into centralized logs shortens the
   search space for anyone who gets log-read access. Now logs only
   `len(token)` as a liveness signal.

2. provisioner/cp_provisioner.go:101 fell back to logging the raw
   control-plane response body when the structured {"error":"..."}
   field was absent. If the CP ever echoed request headers (Authorization)
   or a portion of user-data back in an error path, the bearer token
   would end up in our tenant-instance logs. Now logs the byte count
   only; the structured error remains in place for the happy path.
   Also caps the read at 64 KiB via io.LimitReader to prevent
   log-flood DoS from a compromised upstream.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 01:33:47 -07:00