Commit Graph

2958 Commits

Author SHA1 Message Date
rabbitblood
5ce7af2d2c fix(ci): set WORKSPACE_ID for the runtime-pin smoke import
platform_auth.py validates WORKSPACE_ID at module load — EC2 user-data
sets it from cloud-init, but the CI smoke-test was missing it and
failed with 'WORKSPACE_ID is empty'. Set a placeholder UUID so the
import gate exercises only the dep-resolution path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 01:59:56 -07:00
rabbitblood
b817251c85 refactor(ci): apply simplify findings on #2083
Review of the runtime-pin-compat workflow:

- Add merge_group trigger so when this becomes a required check the
  queue green-checks it (mirrors ci.yml convention).
- Cache pip on workspace/requirements.txt — actions/setup-python@v5
  with cache: pip + cache-dependency-path. Saves ~30s per fire.
- Document the load-bearing install order: runtime FIRST so pip
  honors the runtime's declared a2a-sdk constraint (the surface that
  broke 2026-04-24); workspace/requirements.txt SECOND so a2a-sdk
  is upgraded to the runtime image's pinned version. Import smoke
  validates the upgraded combination.

Skipped: branch-protection wiring (separate ops decision, not in
scope here); ci.yml integration (the standalone schedule trigger
is the load-bearing reason to keep this workflow separate).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 01:32:56 -07:00
rabbitblood
9b42a5e311 test(ci): runtime + a2a-sdk pin compatibility gate (controlplane#253)
Closes Molecule-AI/molecule-controlplane#253.

Prevents recurrence of the 5-hour staging outage from 2026-04-24:
molecule-ai-workspace-runtime 0.1.13 declared `a2a-sdk<1.0` in its
metadata but actually imported `a2a.server.routes` (1.0+ only). pip
resolved successfully; every tenant workspace crashed at import. The
canary tenant ultimately caught it but only after 5 hours of degraded
staging. PR #249 fixed the version pin manually; nothing automated
catches the same class of bug for the next release.

This workflow:

- Installs molecule-ai-workspace-runtime fresh from PyPI in a Python
  3.11 venv (mirrors EC2 user-data install pattern)
- Layers in workspace/requirements.txt (the runtime image's actual
  dep set, including the a2a-sdk[http-server]>=1.0,<2.0 pin)
- Runs `from molecule_runtime.main import main_sync` — same import
  the runtime entrypoint does
- Fails CI if pip resolution silently produced a combo that the
  runtime can't actually import

Triggers:
- PR + push to main/staging touching workspace/requirements.txt or
  this workflow (catches local pin changes)
- Daily 13:00 UTC schedule (catches upstream PyPI publishes that
  break the pin combo without any change in our repo)
- workflow_dispatch (manual)

Concurrency cancels in-progress runs on the same ref.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 01:30:36 -07:00
Hongming Wang
cbb8ee0807
Merge pull request #2080 from Molecule-AI/fix/retarget-action-handle-duplicate-pr-1884
ci(retarget): handle 422 'duplicate PR' by closing redundant main-PR (closes #1884)
2026-04-26 07:56:13 +00:00
Hongming Wang
b5f9cbbc55 ci(retarget): handle 422 'duplicate PR' by closing redundant main-PR (closes #1884)
When a bot opens a PR against main and there's already another PR on
the same head branch targeting staging, GitHub's PATCH /pulls returns
422 with:

  "A pull request already exists for base branch 'staging' and
   head branch '<branch>'"

Pre-fix: the retarget Action exited 1 with no further action. The
target-main PR sat there as a duplicate, the workflow run showed
red, and someone had to manually close the duplicate. Today's case
(#1881 duplicate of #1820) had to be closed manually.

Fix: catch that specific 422 message and close the main-PR as
redundant instead of failing. Any OTHER 422 (or other error) still
fails loud — the grep matches the specific duplicate-base text, not
a blanket "any 422 means duplicate".

Behaviour matrix:

  PATCH succeeds                           → retargeted, explainer
                                              comment posted
  PATCH 422 "already exists for staging"   → close main-PR with
                                              explainer (NEW)
  PATCH any other failure                  → workflow fails (preserves
                                              loud-fail for real bugs)

Tests: GitHub Actions don't have an inline unit-test framework here.
The workflow YAML parses (validated locally) and the bash logic is
straightforward. Real verification will be the next duplicate-PR
scenario in production.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 00:53:55 -07:00
Hongming Wang
194121c674
Merge pull request #2063 from Molecule-AI/feat/redeploy-tenants-on-main-merge
ci(redeploy): auto-redeploy tenant EC2s after every main merge
2026-04-26 07:00:59 +00:00
Hongming Wang
944ddcb4e5
Merge pull request #2062 from Molecule-AI/fix/sweep-script-env-override
fix(scripts): make sweep-cf-orphans MAX_DELETE_PCT env override actually work
2026-04-26 06:55:14 +00:00
Hongming Wang
20cce3c27c
Merge pull request #2078 from Molecule-AI/fix/api-401-probe-before-redirect
fix(api): probe /cp/auth/me before redirecting on 401
2026-04-26 06:51:38 +00:00
Hongming Wang
5a3dbb95e1 fix(api): probe /cp/auth/me before redirecting on 401
The actual cause-fix for the staging-tabs E2E saga (#2073/#2074/#2075).

Old behaviour: ANY 401 from any fetch on a SaaS tenant subdomain
called redirectToLogin → window.location.href = AuthKit. This is
wrong. Plenty of 401s don't mean "session is dead":

  - workspace-scoped endpoints (/workspaces/:id/peers, /plugins)
    require a workspace-scoped token, not the tenant admin bearer
  - resource-permission mismatches (user has tenant access but not
    this specific workspace)
  - misconfigured proxies returning 401 spuriously

A single transient one of those yanked authenticated users back to
AuthKit. Same bug yanked the staging-tabs E2E off the tenant origin
mid-test for 6+ hours tonight, leading to the cascade of test-side
mocks (#2073/#2074/#2075) that worked around the symptom without
fixing the cause.

This PR fixes it at the source. The new logic:

  - 401 on /cp/auth/* path → that IS the canonical session-dead
    signal → redirect (unchanged)
  - 401 on any other path with slug present → probe /cp/auth/me:
      probe 401 → session genuinely dead → redirect
      probe 200 → session fine, endpoint refused this token →
                  throw a real Error, caller renders error state
      probe network err → assume session-fine (conservative) →
                  throw real Error
  - slug empty (localhost / LAN / reserved subdomain) → throw
    without redirect (unchanged)

The probe adds one extra fetch on a 401, only when slug is set
and the path isn't already auth-scoped. That's rare and
worthwhile — a transient probe round-trip is cheap; an unwanted
auth redirect is a UX disaster.

Tests:
  - api-401.test.ts rewritten with the full matrix:
      * /cp/auth/me 401 → redirect (no probe, that IS the signal)
      * non-auth 401 + probe 401 → redirect
      * non-auth 401 + probe 200 → throw, no redirect  ← the fix
      * non-auth 401 + probe network err → throw, no redirect
      * empty slug paths (localhost/LAN/reserved) → throw, no probe
  - 43 tests in canvas/src/lib/__tests__/api*.test.ts all pass
  - tsc clean

The staging-tabs E2E spec's universal-401 route handler stays as
defense-in-depth (silences resource-load console noise + guards
against panels without try/catch), but the comment now describes
its role honestly: api.ts is the primary fix, the route is the
safety net.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 23:49:28 -07:00
Hongming Wang
23bea6e793
Merge pull request #2075 from Molecule-AI/fix/canvas-e2e-filter-resource-404
fix(canvas/e2e): filter generic 'Failed to load resource' + add URL diagnostics
2026-04-25 19:09:19 +00:00
Hongming Wang
bef6fca395 fix(canvas/e2e): filter generic "Failed to load resource" + add URL diagnostics
After #2074, the staging-tabs spec stopped failing on the auth-redirect
locator timeout (good — the broadened 401-mock works) but started
failing on a different aggregate check:

  Error: unexpected console errors:
  Failed to load resource: the server responded with a status of 404
  Failed to load resource: the server responded with a status of 404
  Failed to load resource: the server responded with a status of 404

Browser console messages for resource-load failures omit the URL,
so the message is uninformative on its own — we can't filter
selectively (e.g. "is this a missing-CSS noise or a real broken
endpoint?"). The previous filter list (sentry/vercel/WebSocket/
favicon/molecule-icon) catches specific known-noisy strings but
this generic "Failed to load resource" doesn't contain any of them.

Two changes:

1. Add page.on('requestfailed') + page.on('response>=400') logging
   to capture the URL of any failed request. Logs to test stdout
   (visible in the workflow log) — leaves a breadcrumb so a real
   bug isn't completely hidden when we filter the generic message.

2. Add "Failed to load resource" to the filter list. With (1) in
   place we still see the URLs for diagnosis; the generic console
   message is just noise.

Real JS exceptions (panel crash, undefined access, etc.) come with
a file path and stack trace and aren't matched by either filter,
so the gate still catches actual bugs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 12:07:07 -07:00
Hongming Wang
cdfe4e7b85
Merge pull request #2074 from Molecule-AI/fix/canvas-e2e-broaden-401-mock
fix(canvas/e2e): broaden 401-mock to all fetches
2026-04-25 18:43:07 +00:00
Hongming Wang
a84b167d4d fix(canvas/e2e): broaden 401-mock to all fetches, not just /workspaces/*
#2073 caught workspace-scoped 401s but missed non-workspace paths.
SkillsTab.tsx alone fetches /plugins and /plugins/sources, both
outside the /workspaces/<id>/* tree. Either of those 401s with the
tenant admin bearer in SaaS mode → canvas/src/lib/api.ts:62-74
redirects to AuthKit → page navigates away mid-test → next locator
times out.

Same failure signature observed at 16:03Z post-#2073 merge:

  e2e/staging-tabs.spec.ts:45:7 › tab: skills
  TimeoutError: locator.scrollIntoViewIfNeeded: Timeout 5000ms
  - navigated to "https://scenic-pumpkin-83.authkit.app/?..."

Broaden the route to "**" with `request.resourceType() !== "fetch"`
short-circuit (preserves HTML/JS/CSS pass-through) and a
/cp/auth/me skip (the dedicated mock above wins). Same 401 →
empty-body conversion logic; just a wider net.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 11:40:48 -07:00
Hongming Wang
14fab6e544
Merge pull request #2073 from Molecule-AI/fix/canvas-e2e-mock-workspace-apis
fix(canvas/e2e): swap workspace-scoped 401s for empty 200s in staging-tabs spec
2026-04-25 15:23:07 +00:00
Hongming Wang
979d4a0b7a fix(canvas/e2e): swap workspace-scoped 401s for empty 200s
The staging-tabs E2E has been failing for 6+ hours on the same
locator timeout — diagnosed earlier today as the canvas's
lib/api.ts:62-74 redirect-on-401 path firing mid-test:

  e2e/staging-tabs.spec.ts:45:7 › tab: skills
  TimeoutError: locator.scrollIntoViewIfNeeded: Timeout 5000ms
  - navigated to "https://scenic-pumpkin-83.authkit.app/?..."

Several side-panel tabs (Peers, Skills, Channels, Memory, Audit,
and anything workspace-scoped) hit endpoints under
`/workspaces/<id>/*` that require a workspace-scoped token, NOT
the tenant admin bearer the test uses. The endpoints respond 401
in SaaS mode. canvas/src/lib/api.ts:62-74 reacts to ANY 401 by
setting `window.location.href` to AuthKit — yanking the page off
the tenant origin mid-test.

The test comment at line 18 already acknowledged the 401 class
("Peers tab: 401 without workspace-scoped token") but assumed
those would surface as "errored content" rather than a hard
navigation. The redirect logic in api.ts was added later and
breaks the assumption.

Fix: add a Playwright route handler that catches any 401 from
`/workspaces/<id>/*` paths and replaces with `200 + empty body`.
Body shape is best-effort by URL — list endpoints (paths not
ending in a UUID-shaped segment) get `[]`, single-resource
endpoints get `{}`. Both are valid JSON and well-written panels
render an empty state for either rather than crashing.

The two route patterns (`/workspaces/...` and `/cp/auth/me`)
don't overlap — the existing `/cp/auth/me` mock continues to
gate AuthGate's session check independently.

Verification:
- Type-check passes (tsc clean for the spec; pre-existing errors
  in unrelated test files unchanged)
- Can't run staging E2E locally without CP admin token; CI will
  exercise the real path against the freshly-provisioned tenant
- E2E Staging SaaS (full lifecycle) is currently green at 08:07Z,
  confirming the underlying staging infra works — the failures
  have been narrowly in this Playwright-tabs spec

Targets staging per molecule-core convention.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 08:08:05 -07:00
Hongming Wang
fc54601999
Merge pull request #2067 from Molecule-AI/fix/canary-openai-key-staging
ci(canary): inject E2E_OPENAI_API_KEY so A2A turn doesn't 500
2026-04-25 06:12:30 +00:00
Hongming Wang
52d203a098
Merge pull request #2068 from Molecule-AI/ci/sweep-stale-e2e-orgs
ci: hourly sweep of stale e2e-* orgs on staging
2026-04-25 06:12:29 +00:00
Hongming Wang
fe075ee1ba ci: hourly sweep of stale e2e-* orgs on staging
Adds a janitor workflow that runs every hour and deletes any
e2e-prefixed staging org older than MAX_AGE_MINUTES (default 120).
Catches orgs left behind when per-test-run teardown didn't fire:
CI cancellation, runner crash, transient AWS error mid-cascade,
bash trap missed (signal 9), etc.

Why it exists despite per-run teardown:
- Per-run teardown is best-effort by definition. Any process death
  after the test starts but before the trap fires leaves debris.
- GH Actions cancellation kills the runner with no grace period —
  the workflow's `if: always()` step usually catches this but can
  still fail on transient CP 5xx at the wrong moment.
- The CP cascade itself has best-effort branches today
  (cascadeTerminateWorkspaces logs+continues on individual EC2
  termination failures; DNS deletion same shape). Those need
  cleanup-correctness work in the CP, but a safety net belongs in
  CI either way — defense in depth.

Behaviour:
- Cron every hour. Manual workflow_dispatch with overrideable
  max_age_minutes + dry_run inputs for one-off cleanups.
- Concurrency group prevents two sweeps fighting.
- SAFETY_CAP=50 — refuses to delete more than 50 orgs in a single
  tick. If the CP admin endpoint goes weird and returns no
  created_at (or returns no orgs at all), every e2e-* would look
  stale; the cap catches the runaway-nuke case.
- DELETE is idempotent CP-side via org_purges.last_step, so a
  half-deleted org from a prior sweep gets picked up cleanly on the
  next tick.
- Per-org delete failures don't fail the workflow. Next hourly tick
  retries. The workflow only fails loud at the safety-cap gate.

Tonight's specific motivation: ~10 canvas-tabs E2E retries in 2 hours
with various failure modes; each provisioned a fresh tenant + EC2 +
DNS + DB row. Some fraction leaked. Without this loop, ops has to
periodically run the manual sweep-cf-orphans.sh script. With it,
staging self-heals.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 23:07:57 -07:00
Hongming Wang
43c28710ac
Merge pull request #2066 from Molecule-AI/fix/e2e-staging-status-field
fix(e2e): poll instance_status not status — staging E2E never matched the field, masked all real bugs
2026-04-25 05:58:36 +00:00
Hongming Wang
06c85bd185
Merge pull request #2045 from Molecule-AI/feat/flat-rate-pricing-1833
feat(canvas): flat-rate pricing — rename Starter→Team, Pro→Growth (Issue #1833)
2026-04-25 05:54:06 +00:00
Hongming Wang
9a785e9c32 ci(canary): inject E2E_OPENAI_API_KEY so A2A turn doesn't 500
The canary workflow has been failing for ~30 consecutive runs (issue
#1500, opened 2026-04-21) on the same line:

  [hermes-agent error 500] No LLM provider configured. Run `hermes
  model` to select a provider, or run `hermes setup` for first-time
  configuration.

Root cause: the canary's env block was missing E2E_OPENAI_API_KEY.
Without it, tests/e2e/test_staging_full_saas.sh provisions the workspace
with empty secrets; template-hermes start.sh seeds ~/.hermes/.env with
no provider keys; derive-provider.sh resolves the model slug
`openai/gpt-4o` to PROVIDER=openrouter (hermes has no native openai
provider in its registry); A2A request at step 8/11 fails with the
"No LLM provider configured" error from hermes-agent.

The full-lifecycle workflow (e2e-staging-saas.yml line 84) carries the
same secret correctly. Mirror its pattern + add a fail-fast preflight
so future regressions surface in <5s instead of after 8 min of
provision-then-die.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 22:37:13 -07:00
Hongming Wang
e58ecf2974 fix(e2e): scrollIntoView before toBeVisible — clipped tabs were "missing"
Seventh E2E bug, surfaced after the AuthGate mock from the previous
commit finally let the harness reach the tab-iteration loop:

  Error: tab-skills button missing — TABS list may have drifted
  Locator: locator('#tab-skills')

The TABS bar in SidePanel is `overflow-x-auto` (intentional — there
are 13 tabs and they don't all fit on smaller viewports; the
right-edge fade gradient signals the overflow). Tabs after position
~3 are clipped, and Playwright's `toBeVisible()` returns false for
clipped elements (it checks getBoundingClientRect against viewport).

Fix: `scrollIntoViewIfNeeded()` before the visibility assertion,
mirroring what SidePanel's own keyboard handler does on arrow-key
navigation. The tab is then in view and `toBeVisible()` passes.

This was the test's 7th and (probably) final harness bug. The
chain mapping all the way from "staging E2E timed out at 1200s"
this morning:

  1. instance_status field name (#2066)
  2. staging.moleculesai.app DNS zone (#2066)
  3. X-Molecule-Org-Id TenantGuard header (#2066)
  4. Hydration selector waited pre-click (#2066)
  5. networkidle never settles (this PR's parent commits)
  6. AuthGate /cp/auth/me redirect
  7. Tab buttons clipped by overflow-x-auto

If THIS run still fails, the failure surfaces in actual product
behavior (a tab's panel content), not test mechanics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 20:37:36 -07:00
Hongming Wang
6c70b413e0 fix(e2e): mock /cp/auth/me — AuthGate redirect was preventing canvas render
Sixth E2E bug, surfaced after the page.goto-domcontentloaded fix
finally let the navigation complete. The harness now reaches the
canvas-root selector wait but still times out because the canvas
never renders:

  TimeoutError: page.waitForSelector: Timeout 45000ms exceeded.
  waiting for [aria-label="Molecule AI workspace canvas"]

Root cause: canvas/src/components/AuthGate.tsx wraps the page,
fetches /cp/auth/me on mount, and redirects to the login page when
the response is 401. The bearer header we set via
context.setExtraHTTPHeaders works for platform API calls but does
NOT satisfy /cp/auth/me — that endpoint is cookie-based (WorkOS
session). So:

  1. AuthGate mounts
  2. Calls fetchSession() → /cp/auth/me → 401 (no session cookie)
  3. AuthGate transitions to anonymous → redirectToLogin()
  4. Browser navigates away from tenant URL
  5. The React Flow canvas root with the aria-label never mounts
  6. waitForSelector times out at 45s

Fix: context.route() intercepts /cp/auth/me and returns a fake
Session JSON so AuthGate resolves to "authenticated" and renders
its children. The session contents are cosmetic — Session.org_id
and Session.user_id appear in a few canvas surfaces but never fail
on dummy values.

This is the cleanest fix path. Alternatives considered + rejected:
  - Add a ?e2e=1 backdoor to AuthGate: production code shouldn't
    have a "skip auth" flag, even gated.
  - Real WorkOS login flow in Playwright: too much overhead per run.
  - Skip the canvas UI test, test only API: defeats the point of
    the staging E2E (which is to catch UI regressions before
    promotion).

After this lands the harness should reach the workspace-node click
step and exercise tabs — only then can a real product bug (rather
than a test-harness bug) surface. The 6-bug chain mapped to:
  1. instance_status field name (#2066)
  2. staging.moleculesai.app DNS zone (#2066)
  3. X-Molecule-Org-Id TenantGuard header (#2066)
  4. Hydration selector waited pre-click (#2066)
  5. networkidle never settles (this commit's parent)
  6. AuthGate /cp/auth/me redirect (this commit)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 19:59:04 -07:00
Hongming Wang
c2504d9361 fix(e2e): page.goto waitUntil networkidle never settles — switch to domcontentloaded
Fifth E2E bug surfaced by the previous run. After the four setup-
phase fixes (instance_status, DNS zone, X-Molecule-Org-Id, hydration
selector) plus CP#259 ending the pq cache class, the harness finally
reached the actual page navigation step — and timed out there:

  TimeoutError: page.goto: Timeout 45000ms exceeded.
    navigating to "https://...staging.moleculesai.app/", waiting until "networkidle"

`waitUntil: "networkidle"` waits for 500ms of network silence. The
canvas keeps a WebSocket connection open + polls /events and
/workspaces every few seconds for status updates, so the network
is never idle — page.goto sits on it until the default 45s timeout
and throws.

Fix: switch to `waitUntil: "domcontentloaded"`. Returns as soon as
the HTML is parsed. React hydration plus the existing
`waitForSelector` line below is what actually gates ready-for-
interaction; the goto's job is just to land on the page.

This is a generally-applicable lesson — networkidle is broken for
any SPA with a heartbeat. Notably, our existing canvas unit tests
that mock @xyflow/react and don't open WebSockets DON'T hit this,
which is why this only surfaces against staging.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 19:43:46 -07:00
Hongming Wang
59b5449a4e chore: re-trigger CI — staging CP now has CP#259 SetMaxIdleConns(0) fix 2026-04-24 19:07:32 -07:00
Hongming Wang
4e3bb3795a fix(e2e): canvas-hydration wait used a selector that never appears pre-click
Fourth E2E bug in the staging→main chain. The previous three (#2066
setup-phase fixes) let the harness reach the actual Playwright spec.
This one is in staging-tabs.spec.ts itself.

The spec at L78 waits 45s for one of:

  [role="tablist"], [data-testid="hydration-error"]

Both targets are wrong:

  1. [role="tablist"] only appears AFTER the workspace node is
     clicked (which happens 25 lines later at L100). Waiting for
     it BEFORE the click can never resolve, so the wait always
     times out at 45s regardless of whether the canvas actually
     loaded.

  2. [data-testid="hydration-error"] doesn't exist anywhere in
     the canvas. The error banner at app/page.tsx:62 only had
     role="alert" — which collides with toast notifications and
     other alert-type elements, so a more-specific selector was
     never wired.

Two-part fix:

  - Test waits on `[aria-label="Molecule AI workspace canvas"]`
    instead — that's the React Flow wrapper (Canvas.tsx:150),
    always present once hydrated regardless of workspace count
    or selection state. Hydration-error banner remains the
    secondary OR target for the failure path.

  - app/page.tsx hydration-error banner gets the missing
    `data-testid="hydration-error"` attribute. role="alert"
    stays for accessibility; the testid is for programmatic
    detection without conflict.

After this lands, the staging-tabs spec should advance past the
initial wait, click the workspace node, and exercise each tab.
If a tab fails, we get a proper test failure rather than a 45s
timeout that obscures everything.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 18:38:28 -07:00
Hongming Wang
4fdeabdbe0 fix(e2e): send X-Molecule-Org-Id header — TenantGuard 404s without it
Third E2E bug in the staging→main chain, found while debugging the
\`Workspace create 404\` failure that surfaced after the previous two
E2E fixes (instance_status, staging.moleculesai.app DNS).

Root cause: workspace-server's \`middleware/TenantGuard\` middleware
returns 404 (not 401/403, intentionally — see comment in
\`tenant_guard.go\`: "must not be inferable by probing other orgs'
machines") when a request to the tenant origin lacks one of:
  - X-Molecule-Org-Id header matching MOLECULE_ORG_ID env on the tenant
  - Fly-Replay-Src state from the CP router (production browser path)
  - Same-origin Canvas (Referer == Host)

The E2E was a direct GitHub-Actions curl with neither — every non-
allowlisted route 404'd with the platform's ratelimit headers but
none of the security headers, which made it look like a missing
route in the platform.

The org UUID is already on the admin-orgs row alongside instance_status,
so capture it during the readiness poll and add it to the tenantAuth
header bag. Both /workspaces (POST) and /workspaces/:id (GET) now
carry it.

Allowlist still contains /health, /metrics, /registry/register,
/registry/heartbeat — so the TLS readiness step (which hits /health)
keeps working without the header.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 18:13:13 -07:00
Hongming Wang
edcac16b81 fix(e2e): use staging.moleculesai.app for tenant DNS — wrong zone hung TLS poll
Second related E2E bug, surfaced after #2066's instance_status fix
let the harness reach the TLS readiness step:

  Error: tenant TLS: timed out after 180s

The CP provisioner writes staging tenant DNS as
<slug>.staging.moleculesai.app (with the staging. subdomain
prefix — visible in the EC2 provisioner DNS log line). The harness
was building https://<slug>.moleculesai.app (prod-zone shape),
so DNS literally didn't resolve, fetch threw NXDOMAIN inside the
silent catch, and waitFor saw null on every 5s poll until 180s
elapsed.

Fix: parameterize as STAGING_TENANT_DOMAIN env var, default
staging.moleculesai.app. Doc-comment example updated to match.
Override hatch is there only for ops running this harness against
a non-default zone.

Verified manually: a freshly-provisioned tenant
(e2e-canvas-20260425-sav9fe) was unreachable at the prod-shaped
URL (NXDOMAIN) but reached CF at the staging-shaped URL.

teardown.ts only hits CP, not the tenant URL — no fix needed there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 17:45:48 -07:00
Hongming Wang
754f361c03 fix(e2e): poll instance_status not status — waitFor never matched, masked real bugs
Staging Canvas Playwright E2E has been timing out at 1200s on every
recent run. Found via /code-review-and-quality on the staging→main
promotion chain.

The CP /cp/admin/orgs response shape is (handlers/admin.go:118):

  type adminOrgSummary struct {
    ...
    InstanceStatus string `json:"instance_status,omitempty"`
    ...
  }

There is NO top-level `status` field. The waitFor predicate compared
`row.status === "running"` against undefined on every poll — the
predicate could never resolve truthy. The harness invariably wedged
on the 20-min timeout regardless of whether the tenant was actually
provisioned.

This bug has been double-edged:
  - It MASKED the #242 pq-cache-collision class for hours: the
    tenants WERE provisioning fine, but the test couldn't tell.
  - It survived #255, #257 (real CP fixes) — the test still timed
    out, making us suspect more CP bugs that didn't exist.

Fix: poll `row.instance_status` instead. One-line change. Identical
fix for the failed-state branch one line below.

No new tests for the harness itself; the fix's correctness is
verified by the next E2E run on the affected branch passing
end-to-end. If it doesn't pass after this, there's a separate
bug we can hunt cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 17:32:12 -07:00
Hongming Wang
184f8256cd ci(redeploy): fire post-main tenant fleet redeploy via CP admin endpoint
Closes the "main merged but prod tenants still on old image" gap.

## Trigger chain

  main merge
   └─> publish-workspace-server-image (builds + pushes :latest + :<sha>)
        └─> redeploy-tenants-on-main (this workflow)
             └─> POST https://api.moleculesai.app/cp/admin/tenants/redeploy-fleet
                  └─> Canary hongmingwang + 60s soak, then batches of 3
                       with SSM Run Command redeploying each tenant EC2

## Features

- Auto-fires on every successful publish-workspace-server-image run.
- Manual dispatch with optional target_tag (for rollback to an older
  SHA), canary_slug override, batch_size, dry_run.
- 30s delay before calling CP so GHCR edge cache serves the new
  :latest consistently to every tenant's docker pull.
- Skips when publish job failed (workflow_run fires on any completion).
- Job summary renders per-tenant results as a markdown table so ops
  can see which tenant, if any, broke the chain.
- Exits non-zero on HTTP != 200 or ok=false so a broken rollout marks
  the commit status red.

## Secrets + vars required

- secret CP_ADMIN_API_TOKEN  — Railway prod molecule-platform / CP_ADMIN_API_TOKEN
                               Mirrored into this repo's secrets.
- var    CP_URL (optional)   — defaults to https://api.moleculesai.app

## Paired with

- Molecule-AI/molecule-controlplane branch feat/tenant-auto-redeploy
  which adds the /cp/admin/tenants/redeploy-fleet endpoint + the SSM
  orchestration. This workflow is a no-op until that lands on prod CP.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 14:34:28 -07:00
Hongming Wang
817b8b0307 fix(scripts): make MAX_DELETE_PCT actually honor env override
The script's own help text documents \`MAX_DELETE_PCT=62 ./sweep-cf-orphans.sh\`
as the way to relax the safety gate, but the in-script assignment on line 35
was unconditional and overwrote any env value — so the override never worked.

During today's staging tenant-provision recovery (CP #255 context), hit the
57%-delete threshold and needed the documented override to clear 64 orphan
records. The one-char change to \`\${MAX_DELETE_PCT:-50}\` honors the env
while keeping the 50% default when no caller overrides.

Ran with MAX_DELETE_PCT=62 after the fix — deleted 64 records, CF zone 111→47.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 14:14:55 -07:00
Hongming Wang
62217250ed test(pricing): finish Starter→Team, Pro→Growth rename in 6 stale assertions
Marketing-lead agent's rename pass updated the "renders all three plans"
test (lines 56-57) but missed lines 77, 94, 114, 132, 143, 158 which still
referenced the pre-rename "Upgrade to Starter" / "Upgrade to Pro" button
names. Canvas (Next.js) build failed with getByRole timeout because the
component now says "Upgrade to Team" / "Upgrade to Growth".

Internal PlanId tuple ("free" | "starter" | "pro") and startCheckout(planId)
call are unchanged — only the user-facing button labels shifted, so
assertions like startCheckout("pro", "acme") still match the server-side API.

Verified locally: 9/9 PricingTable tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 13:01:40 -07:00
Hongming Wang
2dbd06d52e
Merge pull request #2055 from Molecule-AI/feat/lark-channel-first-class-v2
feat(channels): first-class Lark/Feishu support via schema-driven config
2026-04-24 19:57:57 +00:00
rabbitblood
998cd03265 fix(tabs-a11y): mock config_schema on adapter response
Schema-driven ChannelsTab renders no inputs when config_schema is
absent — the test's bare {type, display_name} mock mismatched the
real API shape and every getByLabelText("Bot Token") failed.

Mock now mirrors GET /channels/adapters with the Telegram schema
(bot_token password + chat_id text) so the a11y assertions run
against the actual rendered form.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 12:04:51 -07:00
molecule-ai[bot]
92a0c0073d
Merge pull request #2058 from Molecule-AI/chore/canvas-node22-upgrade
chore(canvas): upgrade node:20-alpine → node:22-alpine
2026-04-24 19:04:25 +00:00
molecule-ai[bot]
17f29e874a
Merge pull request #2029 from Molecule-AI/fix/canvas-a11y-tabs-v2
fix(canvas/a11y): add type=button to tab toolbar and settings buttons
2026-04-24 19:01:24 +00:00
molecule-ai[bot]
02406ea823
Merge pull request #2024 from Molecule-AI/fix/gh-identity-plugin-role-env-v2
feat(#1957): wire gh-identity plugin into workspace-server
2026-04-24 19:01:22 +00:00
Hongming Wang
fc2e6150d3
Merge pull request #2056 from Molecule-AI/fix/compliance-default-owasp-agentic
fix(compliance): flip default mode to owasp_agentic (detect-only)
2026-04-24 18:56:00 +00:00
molecule-ai[bot]
58745145cb
Merge pull request #2038 from Molecule-AI/hotfix/audit34-to-main
hotfix: Audit #34 fixes to main
2026-04-24 18:55:39 +00:00
1e5fc48acb chore(canvas): upgrade node:20-alpine → node:22-alpine
Node.js 20 reaches EOL 2026-09 and actions/checkout@v4 emits
Node.js 20 deprecation warnings on GitHub Actions (Node 24 forced
2026-06-02). Next.js 15.1 is fully compatible with Node 22.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 18:54:30 +00:00
Hongming Wang
9af058b82d fix(compliance): flip default mode to owasp_agentic (detect-only)
Prior state: compliance.mode default was "" (fully off) and no template
in the repo set it explicitly — so prompt-injection detection, PII
redaction, and agency-limit checks were silently disabled on every
live workspace, despite the machinery being present in
workspace/builtin_tools/compliance.py.

This was surfaced during a 2026-04-24 review of the A2A inbound path:
a2a_executor.py gates three security checks on
  _compliance_cfg.mode == "owasp_agentic"
and default config never matches, so every A2A message skipped all three.

Fix: default is now owasp_agentic + prompt_injection=detect. Detect mode
logs injection attempts as audit events without blocking — no UX cost,
just visibility. Operators who want stricter enforcement set
`prompt_injection: block` per workspace. Operators who genuinely want
compliance fully off can set `mode: ""` (not recommended; documented).

Changes:
- ComplianceConfig.mode default: "" → "owasp_agentic"
- Yaml parser fallback default: "" → "owasp_agentic" (must match dataclass)
- Docstring updated with rationale + opt-out snippet

Tests: 66/66 test_compliance.py + test_a2a_executor.py pass. 19/19
test_config.py pass. The one test asserting compliance_mode == "" is
for the "config load failed" fallback path (different from the default
config path) — correctly unchanged.

Security posture improvement: prompt-injection detection is now always
on for every workspace created after this ships, with zero behavior
change for legitimate inputs. Block mode remains an opt-in when an
operator wants to actively reject injection attempts rather than just
log them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 11:52:09 -07:00
Hongming Wang
04e60e7303
Merge pull request #2052 from Molecule-AI/fix/canvas-provisioning-timeout-runtime-aware
fix(canvas): runtime-aware provisioning-timeout threshold (hermes 12min vs default 2min)
2026-04-24 18:51:46 +00:00
rabbitblood
00265d7028 feat(channels): first-class Lark/Feishu support via schema-driven config
Lark adapter was already implemented in Go (lark.go — outbound Custom Bot
webhook + inbound Event Subscriptions with constant-time token verify),
but the Canvas connect-form hardcoded a Telegram-shaped pair of inputs
(bot_token + chat_id). Selecting "Lark / Feishu" from the dropdown
silently sent the wrong field names — there was no way to enter a
webhook URL.

Fix: move form shape to the server.

- Add `ConfigField` struct + `ConfigSchema()` method to the
  `ChannelAdapter` interface. Each adapter declares its own fields with
  label/type/required/sensitive/placeholder/help.
- Implement per-adapter schemas:
  - Lark: webhook_url (required+sensitive) + verify_token (optional+sensitive)
  - Slack: bot_token/channel_id/webhook_url/username/icon_emoji
  - Discord: webhook_url + optional public_key
  - Telegram: bot_token + chat_id (unchanged UX, keeps Detect Chats)
- Change `ListAdapters()` to return `[]AdapterInfo` with config_schema
  inline. Sorted deterministically by display name so UI ordering is
  stable across Go's random map iteration.
- Update the 3 existing `ListAdapters` test sites to struct access.

Canvas (`ChannelsTab.tsx`):
- Replace the two hardcoded bot_token/chat_id inputs with a single
  schema-driven `SchemaField` component. Renders one input per field in
  the order the adapter returns them.
- Form state becomes `formValues: Record<string,string>` keyed by
  `ConfigField.key`. Values reset on platform-switch so stale
  Telegram credentials can't leak into a new Lark channel.
- "Detect Chats" stays but only renders for platforms in
  `SUPPORTS_DETECT_CHATS` (Telegram only — the only provider with
  getUpdates).
- Only schema-known keys are posted in `config`, scrubbing any stale
  values from previous platform selections.

Regression tests:
- `TestLark_ConfigSchema` locks in the 2-field Lark contract with the
  required/sensitive flags correctly set.
- `TestListAdapters_IncludesLark` confirms registry wiring + schema
  survives round-trip through ListAdapters.

Known pre-existing `TestStripPluginMarkers_AwkScript` failure in
internal/handlers is unrelated to this change (verified via stash+test
on clean staging).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 11:51:15 -07:00
Hongming Wang
0b237ed9dd refactor(canvas): extract runtime profiles to @/lib/runtimeProfiles
Preparation for a "hundreds of runtimes" plugin ecosystem. Keeping the
runtime-specific UX knobs in-line inside ProvisioningTimeout scales badly
— every new runtime would require editing a component, not just adding a
table entry. Other components (create-workspace dialog, workspace card
tooltips, etc.) will want the same runtime metadata.

Changes:

- New file `canvas/src/lib/runtimeProfiles.ts` owns:
  * `RuntimeProfile` type — structural shape, every field optional so
    new runtimes can partially-fill without breaking consumers.
  * `DEFAULT_RUNTIME_PROFILE` — 2-min default floor (docker-fast).
  * `RUNTIME_PROFILES` — named overrides (currently: hermes 12 min).
  * `WorkspaceRuntimeOverrides` — interface for server-provided
    per-workspace overrides, so operators can tune via template
    manifest / workspace metadata without a canvas release.
  * `getRuntimeProfile()` — resolver with
    overrides → profile → default priority.
  * `provisionTimeoutForRuntime()` — convenience wrapper.

- `ProvisioningTimeout.tsx` now delegates to the profile module.
  `DEFAULT_PROVISION_TIMEOUT_MS` re-exported for legacy test importers.

- Tests: 16/16 (up from 9 before the first fix). Adds pinning for:
  * overrides > profile > default priority chain
  * "every entry in RUNTIME_PROFILES resolves to a number" contract
  * backward-compat export

Adding a new slow runtime is now one table entry in
`canvas/src/lib/runtimeProfiles.ts` with a mandatory `WHY` comment.
Moving to server-driven profiles later is a ~10-line change (the
resolver already threads WorkspaceRuntimeOverrides through).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 11:48:39 -07:00
molecule-ai[bot]
1a27370e7b
Merge pull request #2051 from Molecule-AI/fix/canvas-embeddedteam-removal-and-canvasorbearer-return
refactor(canvas): remove unused EmbeddedTeam component from WorkspaceNode
2026-04-24 18:47:16 +00:00
Hongming Wang
9597d262ca fix(canvas): runtime-aware provisioning-timeout threshold
Hermes workspaces cold-boot in 8-13 min (ripgrep + ffmpeg + node22 +
hermes-agent source build + Playwright + Chromium ~300MB). The canvas's
2-min hardcoded "Provisioning Timeout" warning fired at ~2min and told
users their workspace was "stuck" while it was still mid-install. Users
hit Retry, triggering fresh cold boots and cancelling healthy workspaces.

User-facing symptom (reported 2026-04-24 18:35Z): hermes workspace showed
"has been provisioning for 3m 15s — it may have encountered an issue"
with Retry + Cancel buttons, while the EC2 was installing node_modules.

Fix:
- Keep DEFAULT_PROVISION_TIMEOUT_MS = 120_000 (2min) — correct for fast
  docker runtimes (claude-code, langgraph, crewai) where cold boot is
  30-90s.
- Add RUNTIME_TIMEOUT_OVERRIDES_MS = { hermes: 720_000 } (12min).
  Aligns with tests/e2e/test_staging_full_saas.sh's
  PROVISION_TIMEOUT_SECS=900 (15min) so UI warns shortly before the
  backend itself gives up.
- New timeoutForRuntime() resolves the base; per-node lookup in the
  check-timeouts interval so a mixed batch (1 hermes + 2 langgraph) uses
  the right threshold for each.
- timeoutMs prop is now optional. Undefined → per-runtime lookup; a
  number → forces a single threshold for every workspace (tests use this
  for deterministic behavior).

Tests: 4 new cases pinning the runtime-aware resolution, including a
guard that catches future regressions that would weaken hermes's budget.
Existing tests unchanged (they import DEFAULT_PROVISION_TIMEOUT_MS which
still exports 120_000).

13/13 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 11:46:09 -07:00
molecule-ai[bot]
345dc9c2b4
Merge pull request #2033 from Molecule-AI/fix/validateagenturl-testnet-blocklist
fix(registry): block RFC 5737 TEST-NET and RFC 3849 documentation IPs
2026-04-24 18:42:18 +00:00
molecule-ai[bot]
312af5a94a
Merge pull request #2020 from Molecule-AI/fix/gh-identity-plugin-role-env
feat(#1957): wire gh-identity plugin into workspace-server
2026-04-24 18:42:14 +00:00
Molecule AI Core Platform Lead
49fc97e6e4 refactor(canvas): remove unused EmbeddedTeam component from WorkspaceNode
EmbeddedTeam was defined in WorkspaceNode.tsx but had no call site —
TeamMemberChip (which is called directly) covers the same rendering
responsibility. The function was stranded after a prior refactor and
was flagged by github-code-quality on PR #1989 (merged 2026-04-24T14:09Z
without this cleanup because the token died before push).

Removes 25 lines of dead code. MAX_NESTING_DEPTH is kept — it is used
by TeamMemberChip at line 498.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 18:30:36 +00:00
Hongming Wang
40cfc55784 feat(#1957): wire gh-identity plugin into workspace-server
Ships the monorepo side of molecule-core#1957 (agent identity collapse).
Companion to molecule-ai-plugin-gh-identity (new repo, merged-and-tagged
separately).

Changes:
- manifest.json: add gh-identity plugin to Tier 1 registry
- workspace-server/go.mod: require github.com/Molecule-AI/molecule-ai-plugin-gh-identity
- cmd/server/main.go: build a shared provisionhook.Registry, register
  gh-identity first (always), then github-app-auth (gated on GITHUB_APP_ID)
- workspace_provision.go: propagate workspace.Role into
  env["MOLECULE_AGENT_ROLE"] before calling the mutator chain, so the
  gh-identity plugin can see which agent is booting
- provisionhook/mutator.go: add Registry.Mutators() accessor so
  individual-plugin registries can be merged onto a shared one at boot

Boot log gains a line like:
  env-mutator chain: [gh-identity github-app-auth]

Effect per workspace:
- env contains MOLECULE_AGENT_ROLE, MOLECULE_OWNER, MOLECULE_ATTRIBUTION_BADGE,
  MOLECULE_GH_WRAPPER_B64, MOLECULE_GH_WRAPPER_SHA
- Each workspace template's install.sh can decode + install the wrapper at
  /usr/local/bin/gh, intercepting @me assignment and prepending agent
  attribution on PR/issue creates

Does not break existing workspaces — absent workspace.role, the plugin is
a no-op. Absent install.sh updates in each template, the env vars are
simply unused.

Follow-up template PRs (hermes, claude-code, langgraph, etc.) each add
~15 lines to install.sh to decode + install the wrapper.

Ref: #1957

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 18:28:18 +00:00