Commit Graph

2937 Commits

Author SHA1 Message Date
Hongming Wang
43c28710ac
Merge pull request #2066 from Molecule-AI/fix/e2e-staging-status-field
fix(e2e): poll instance_status not status — staging E2E never matched the field, masked all real bugs
2026-04-25 05:58:36 +00:00
Hongming Wang
06c85bd185
Merge pull request #2045 from Molecule-AI/feat/flat-rate-pricing-1833
feat(canvas): flat-rate pricing — rename Starter→Team, Pro→Growth (Issue #1833)
2026-04-25 05:54:06 +00:00
Hongming Wang
e58ecf2974 fix(e2e): scrollIntoView before toBeVisible — clipped tabs were "missing"
Seventh E2E bug, surfaced after the AuthGate mock from the previous
commit finally let the harness reach the tab-iteration loop:

  Error: tab-skills button missing — TABS list may have drifted
  Locator: locator('#tab-skills')

The TABS bar in SidePanel is `overflow-x-auto` (intentional — there
are 13 tabs and they don't all fit on smaller viewports; the
right-edge fade gradient signals the overflow). Tabs after position
~3 are clipped, and Playwright's `toBeVisible()` returns false for
clipped elements (it checks getBoundingClientRect against viewport).

Fix: `scrollIntoViewIfNeeded()` before the visibility assertion,
mirroring what SidePanel's own keyboard handler does on arrow-key
navigation. The tab is then in view and `toBeVisible()` passes.

This was the test's 7th and (probably) final harness bug. The
chain mapping all the way from "staging E2E timed out at 1200s"
this morning:

  1. instance_status field name (#2066)
  2. staging.moleculesai.app DNS zone (#2066)
  3. X-Molecule-Org-Id TenantGuard header (#2066)
  4. Hydration selector waited pre-click (#2066)
  5. networkidle never settles (this PR's parent commits)
  6. AuthGate /cp/auth/me redirect
  7. Tab buttons clipped by overflow-x-auto

If THIS run still fails, the failure surfaces in actual product
behavior (a tab's panel content), not test mechanics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 20:37:36 -07:00
Hongming Wang
6c70b413e0 fix(e2e): mock /cp/auth/me — AuthGate redirect was preventing canvas render
Sixth E2E bug, surfaced after the page.goto-domcontentloaded fix
finally let the navigation complete. The harness now reaches the
canvas-root selector wait but still times out because the canvas
never renders:

  TimeoutError: page.waitForSelector: Timeout 45000ms exceeded.
  waiting for [aria-label="Molecule AI workspace canvas"]

Root cause: canvas/src/components/AuthGate.tsx wraps the page,
fetches /cp/auth/me on mount, and redirects to the login page when
the response is 401. The bearer header we set via
context.setExtraHTTPHeaders works for platform API calls but does
NOT satisfy /cp/auth/me — that endpoint is cookie-based (WorkOS
session). So:

  1. AuthGate mounts
  2. Calls fetchSession() → /cp/auth/me → 401 (no session cookie)
  3. AuthGate transitions to anonymous → redirectToLogin()
  4. Browser navigates away from tenant URL
  5. The React Flow canvas root with the aria-label never mounts
  6. waitForSelector times out at 45s

Fix: context.route() intercepts /cp/auth/me and returns a fake
Session JSON so AuthGate resolves to "authenticated" and renders
its children. The session contents are cosmetic — Session.org_id
and Session.user_id appear in a few canvas surfaces but never fail
on dummy values.

This is the cleanest fix path. Alternatives considered + rejected:
  - Add a ?e2e=1 backdoor to AuthGate: production code shouldn't
    have a "skip auth" flag, even gated.
  - Real WorkOS login flow in Playwright: too much overhead per run.
  - Skip the canvas UI test, test only API: defeats the point of
    the staging E2E (which is to catch UI regressions before
    promotion).

After this lands the harness should reach the workspace-node click
step and exercise tabs — only then can a real product bug (rather
than a test-harness bug) surface. The 6-bug chain mapped to:
  1. instance_status field name (#2066)
  2. staging.moleculesai.app DNS zone (#2066)
  3. X-Molecule-Org-Id TenantGuard header (#2066)
  4. Hydration selector waited pre-click (#2066)
  5. networkidle never settles (this commit's parent)
  6. AuthGate /cp/auth/me redirect (this commit)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 19:59:04 -07:00
Hongming Wang
c2504d9361 fix(e2e): page.goto waitUntil networkidle never settles — switch to domcontentloaded
Fifth E2E bug surfaced by the previous run. After the four setup-
phase fixes (instance_status, DNS zone, X-Molecule-Org-Id, hydration
selector) plus CP#259 ending the pq cache class, the harness finally
reached the actual page navigation step — and timed out there:

  TimeoutError: page.goto: Timeout 45000ms exceeded.
    navigating to "https://...staging.moleculesai.app/", waiting until "networkidle"

`waitUntil: "networkidle"` waits for 500ms of network silence. The
canvas keeps a WebSocket connection open + polls /events and
/workspaces every few seconds for status updates, so the network
is never idle — page.goto sits on it until the default 45s timeout
and throws.

Fix: switch to `waitUntil: "domcontentloaded"`. Returns as soon as
the HTML is parsed. React hydration plus the existing
`waitForSelector` line below is what actually gates ready-for-
interaction; the goto's job is just to land on the page.

This is a generally-applicable lesson — networkidle is broken for
any SPA with a heartbeat. Notably, our existing canvas unit tests
that mock @xyflow/react and don't open WebSockets DON'T hit this,
which is why this only surfaces against staging.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 19:43:46 -07:00
Hongming Wang
59b5449a4e chore: re-trigger CI — staging CP now has CP#259 SetMaxIdleConns(0) fix 2026-04-24 19:07:32 -07:00
Hongming Wang
4e3bb3795a fix(e2e): canvas-hydration wait used a selector that never appears pre-click
Fourth E2E bug in the staging→main chain. The previous three (#2066
setup-phase fixes) let the harness reach the actual Playwright spec.
This one is in staging-tabs.spec.ts itself.

The spec at L78 waits 45s for one of:

  [role="tablist"], [data-testid="hydration-error"]

Both targets are wrong:

  1. [role="tablist"] only appears AFTER the workspace node is
     clicked (which happens 25 lines later at L100). Waiting for
     it BEFORE the click can never resolve, so the wait always
     times out at 45s regardless of whether the canvas actually
     loaded.

  2. [data-testid="hydration-error"] doesn't exist anywhere in
     the canvas. The error banner at app/page.tsx:62 only had
     role="alert" — which collides with toast notifications and
     other alert-type elements, so a more-specific selector was
     never wired.

Two-part fix:

  - Test waits on `[aria-label="Molecule AI workspace canvas"]`
    instead — that's the React Flow wrapper (Canvas.tsx:150),
    always present once hydrated regardless of workspace count
    or selection state. Hydration-error banner remains the
    secondary OR target for the failure path.

  - app/page.tsx hydration-error banner gets the missing
    `data-testid="hydration-error"` attribute. role="alert"
    stays for accessibility; the testid is for programmatic
    detection without conflict.

After this lands, the staging-tabs spec should advance past the
initial wait, click the workspace node, and exercise each tab.
If a tab fails, we get a proper test failure rather than a 45s
timeout that obscures everything.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 18:38:28 -07:00
Hongming Wang
4fdeabdbe0 fix(e2e): send X-Molecule-Org-Id header — TenantGuard 404s without it
Third E2E bug in the staging→main chain, found while debugging the
\`Workspace create 404\` failure that surfaced after the previous two
E2E fixes (instance_status, staging.moleculesai.app DNS).

Root cause: workspace-server's \`middleware/TenantGuard\` middleware
returns 404 (not 401/403, intentionally — see comment in
\`tenant_guard.go\`: "must not be inferable by probing other orgs'
machines") when a request to the tenant origin lacks one of:
  - X-Molecule-Org-Id header matching MOLECULE_ORG_ID env on the tenant
  - Fly-Replay-Src state from the CP router (production browser path)
  - Same-origin Canvas (Referer == Host)

The E2E was a direct GitHub-Actions curl with neither — every non-
allowlisted route 404'd with the platform's ratelimit headers but
none of the security headers, which made it look like a missing
route in the platform.

The org UUID is already on the admin-orgs row alongside instance_status,
so capture it during the readiness poll and add it to the tenantAuth
header bag. Both /workspaces (POST) and /workspaces/:id (GET) now
carry it.

Allowlist still contains /health, /metrics, /registry/register,
/registry/heartbeat — so the TLS readiness step (which hits /health)
keeps working without the header.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 18:13:13 -07:00
Hongming Wang
edcac16b81 fix(e2e): use staging.moleculesai.app for tenant DNS — wrong zone hung TLS poll
Second related E2E bug, surfaced after #2066's instance_status fix
let the harness reach the TLS readiness step:

  Error: tenant TLS: timed out after 180s

The CP provisioner writes staging tenant DNS as
<slug>.staging.moleculesai.app (with the staging. subdomain
prefix — visible in the EC2 provisioner DNS log line). The harness
was building https://<slug>.moleculesai.app (prod-zone shape),
so DNS literally didn't resolve, fetch threw NXDOMAIN inside the
silent catch, and waitFor saw null on every 5s poll until 180s
elapsed.

Fix: parameterize as STAGING_TENANT_DOMAIN env var, default
staging.moleculesai.app. Doc-comment example updated to match.
Override hatch is there only for ops running this harness against
a non-default zone.

Verified manually: a freshly-provisioned tenant
(e2e-canvas-20260425-sav9fe) was unreachable at the prod-shaped
URL (NXDOMAIN) but reached CF at the staging-shaped URL.

teardown.ts only hits CP, not the tenant URL — no fix needed there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 17:45:48 -07:00
Hongming Wang
754f361c03 fix(e2e): poll instance_status not status — waitFor never matched, masked real bugs
Staging Canvas Playwright E2E has been timing out at 1200s on every
recent run. Found via /code-review-and-quality on the staging→main
promotion chain.

The CP /cp/admin/orgs response shape is (handlers/admin.go:118):

  type adminOrgSummary struct {
    ...
    InstanceStatus string `json:"instance_status,omitempty"`
    ...
  }

There is NO top-level `status` field. The waitFor predicate compared
`row.status === "running"` against undefined on every poll — the
predicate could never resolve truthy. The harness invariably wedged
on the 20-min timeout regardless of whether the tenant was actually
provisioned.

This bug has been double-edged:
  - It MASKED the #242 pq-cache-collision class for hours: the
    tenants WERE provisioning fine, but the test couldn't tell.
  - It survived #255, #257 (real CP fixes) — the test still timed
    out, making us suspect more CP bugs that didn't exist.

Fix: poll `row.instance_status` instead. One-line change. Identical
fix for the failed-state branch one line below.

No new tests for the harness itself; the fix's correctness is
verified by the next E2E run on the affected branch passing
end-to-end. If it doesn't pass after this, there's a separate
bug we can hunt cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 17:32:12 -07:00
Hongming Wang
62217250ed test(pricing): finish Starter→Team, Pro→Growth rename in 6 stale assertions
Marketing-lead agent's rename pass updated the "renders all three plans"
test (lines 56-57) but missed lines 77, 94, 114, 132, 143, 158 which still
referenced the pre-rename "Upgrade to Starter" / "Upgrade to Pro" button
names. Canvas (Next.js) build failed with getByRole timeout because the
component now says "Upgrade to Team" / "Upgrade to Growth".

Internal PlanId tuple ("free" | "starter" | "pro") and startCheckout(planId)
call are unchanged — only the user-facing button labels shifted, so
assertions like startCheckout("pro", "acme") still match the server-side API.

Verified locally: 9/9 PricingTable tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 13:01:40 -07:00
Hongming Wang
2dbd06d52e
Merge pull request #2055 from Molecule-AI/feat/lark-channel-first-class-v2
feat(channels): first-class Lark/Feishu support via schema-driven config
2026-04-24 19:57:57 +00:00
rabbitblood
998cd03265 fix(tabs-a11y): mock config_schema on adapter response
Schema-driven ChannelsTab renders no inputs when config_schema is
absent — the test's bare {type, display_name} mock mismatched the
real API shape and every getByLabelText("Bot Token") failed.

Mock now mirrors GET /channels/adapters with the Telegram schema
(bot_token password + chat_id text) so the a11y assertions run
against the actual rendered form.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 12:04:51 -07:00
molecule-ai[bot]
92a0c0073d
Merge pull request #2058 from Molecule-AI/chore/canvas-node22-upgrade
chore(canvas): upgrade node:20-alpine → node:22-alpine
2026-04-24 19:04:25 +00:00
molecule-ai[bot]
17f29e874a
Merge pull request #2029 from Molecule-AI/fix/canvas-a11y-tabs-v2
fix(canvas/a11y): add type=button to tab toolbar and settings buttons
2026-04-24 19:01:24 +00:00
molecule-ai[bot]
02406ea823
Merge pull request #2024 from Molecule-AI/fix/gh-identity-plugin-role-env-v2
feat(#1957): wire gh-identity plugin into workspace-server
2026-04-24 19:01:22 +00:00
Hongming Wang
fc2e6150d3
Merge pull request #2056 from Molecule-AI/fix/compliance-default-owasp-agentic
fix(compliance): flip default mode to owasp_agentic (detect-only)
2026-04-24 18:56:00 +00:00
molecule-ai[bot]
58745145cb
Merge pull request #2038 from Molecule-AI/hotfix/audit34-to-main
hotfix: Audit #34 fixes to main
2026-04-24 18:55:39 +00:00
1e5fc48acb chore(canvas): upgrade node:20-alpine → node:22-alpine
Node.js 20 reaches EOL 2026-09 and actions/checkout@v4 emits
Node.js 20 deprecation warnings on GitHub Actions (Node 24 forced
2026-06-02). Next.js 15.1 is fully compatible with Node 22.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 18:54:30 +00:00
Hongming Wang
9af058b82d fix(compliance): flip default mode to owasp_agentic (detect-only)
Prior state: compliance.mode default was "" (fully off) and no template
in the repo set it explicitly — so prompt-injection detection, PII
redaction, and agency-limit checks were silently disabled on every
live workspace, despite the machinery being present in
workspace/builtin_tools/compliance.py.

This was surfaced during a 2026-04-24 review of the A2A inbound path:
a2a_executor.py gates three security checks on
  _compliance_cfg.mode == "owasp_agentic"
and default config never matches, so every A2A message skipped all three.

Fix: default is now owasp_agentic + prompt_injection=detect. Detect mode
logs injection attempts as audit events without blocking — no UX cost,
just visibility. Operators who want stricter enforcement set
`prompt_injection: block` per workspace. Operators who genuinely want
compliance fully off can set `mode: ""` (not recommended; documented).

Changes:
- ComplianceConfig.mode default: "" → "owasp_agentic"
- Yaml parser fallback default: "" → "owasp_agentic" (must match dataclass)
- Docstring updated with rationale + opt-out snippet

Tests: 66/66 test_compliance.py + test_a2a_executor.py pass. 19/19
test_config.py pass. The one test asserting compliance_mode == "" is
for the "config load failed" fallback path (different from the default
config path) — correctly unchanged.

Security posture improvement: prompt-injection detection is now always
on for every workspace created after this ships, with zero behavior
change for legitimate inputs. Block mode remains an opt-in when an
operator wants to actively reject injection attempts rather than just
log them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 11:52:09 -07:00
Hongming Wang
04e60e7303
Merge pull request #2052 from Molecule-AI/fix/canvas-provisioning-timeout-runtime-aware
fix(canvas): runtime-aware provisioning-timeout threshold (hermes 12min vs default 2min)
2026-04-24 18:51:46 +00:00
rabbitblood
00265d7028 feat(channels): first-class Lark/Feishu support via schema-driven config
Lark adapter was already implemented in Go (lark.go — outbound Custom Bot
webhook + inbound Event Subscriptions with constant-time token verify),
but the Canvas connect-form hardcoded a Telegram-shaped pair of inputs
(bot_token + chat_id). Selecting "Lark / Feishu" from the dropdown
silently sent the wrong field names — there was no way to enter a
webhook URL.

Fix: move form shape to the server.

- Add `ConfigField` struct + `ConfigSchema()` method to the
  `ChannelAdapter` interface. Each adapter declares its own fields with
  label/type/required/sensitive/placeholder/help.
- Implement per-adapter schemas:
  - Lark: webhook_url (required+sensitive) + verify_token (optional+sensitive)
  - Slack: bot_token/channel_id/webhook_url/username/icon_emoji
  - Discord: webhook_url + optional public_key
  - Telegram: bot_token + chat_id (unchanged UX, keeps Detect Chats)
- Change `ListAdapters()` to return `[]AdapterInfo` with config_schema
  inline. Sorted deterministically by display name so UI ordering is
  stable across Go's random map iteration.
- Update the 3 existing `ListAdapters` test sites to struct access.

Canvas (`ChannelsTab.tsx`):
- Replace the two hardcoded bot_token/chat_id inputs with a single
  schema-driven `SchemaField` component. Renders one input per field in
  the order the adapter returns them.
- Form state becomes `formValues: Record<string,string>` keyed by
  `ConfigField.key`. Values reset on platform-switch so stale
  Telegram credentials can't leak into a new Lark channel.
- "Detect Chats" stays but only renders for platforms in
  `SUPPORTS_DETECT_CHATS` (Telegram only — the only provider with
  getUpdates).
- Only schema-known keys are posted in `config`, scrubbing any stale
  values from previous platform selections.

Regression tests:
- `TestLark_ConfigSchema` locks in the 2-field Lark contract with the
  required/sensitive flags correctly set.
- `TestListAdapters_IncludesLark` confirms registry wiring + schema
  survives round-trip through ListAdapters.

Known pre-existing `TestStripPluginMarkers_AwkScript` failure in
internal/handlers is unrelated to this change (verified via stash+test
on clean staging).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 11:51:15 -07:00
Hongming Wang
0b237ed9dd refactor(canvas): extract runtime profiles to @/lib/runtimeProfiles
Preparation for a "hundreds of runtimes" plugin ecosystem. Keeping the
runtime-specific UX knobs in-line inside ProvisioningTimeout scales badly
— every new runtime would require editing a component, not just adding a
table entry. Other components (create-workspace dialog, workspace card
tooltips, etc.) will want the same runtime metadata.

Changes:

- New file `canvas/src/lib/runtimeProfiles.ts` owns:
  * `RuntimeProfile` type — structural shape, every field optional so
    new runtimes can partially-fill without breaking consumers.
  * `DEFAULT_RUNTIME_PROFILE` — 2-min default floor (docker-fast).
  * `RUNTIME_PROFILES` — named overrides (currently: hermes 12 min).
  * `WorkspaceRuntimeOverrides` — interface for server-provided
    per-workspace overrides, so operators can tune via template
    manifest / workspace metadata without a canvas release.
  * `getRuntimeProfile()` — resolver with
    overrides → profile → default priority.
  * `provisionTimeoutForRuntime()` — convenience wrapper.

- `ProvisioningTimeout.tsx` now delegates to the profile module.
  `DEFAULT_PROVISION_TIMEOUT_MS` re-exported for legacy test importers.

- Tests: 16/16 (up from 9 before the first fix). Adds pinning for:
  * overrides > profile > default priority chain
  * "every entry in RUNTIME_PROFILES resolves to a number" contract
  * backward-compat export

Adding a new slow runtime is now one table entry in
`canvas/src/lib/runtimeProfiles.ts` with a mandatory `WHY` comment.
Moving to server-driven profiles later is a ~10-line change (the
resolver already threads WorkspaceRuntimeOverrides through).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 11:48:39 -07:00
molecule-ai[bot]
1a27370e7b
Merge pull request #2051 from Molecule-AI/fix/canvas-embeddedteam-removal-and-canvasorbearer-return
refactor(canvas): remove unused EmbeddedTeam component from WorkspaceNode
2026-04-24 18:47:16 +00:00
Hongming Wang
9597d262ca fix(canvas): runtime-aware provisioning-timeout threshold
Hermes workspaces cold-boot in 8-13 min (ripgrep + ffmpeg + node22 +
hermes-agent source build + Playwright + Chromium ~300MB). The canvas's
2-min hardcoded "Provisioning Timeout" warning fired at ~2min and told
users their workspace was "stuck" while it was still mid-install. Users
hit Retry, triggering fresh cold boots and cancelling healthy workspaces.

User-facing symptom (reported 2026-04-24 18:35Z): hermes workspace showed
"has been provisioning for 3m 15s — it may have encountered an issue"
with Retry + Cancel buttons, while the EC2 was installing node_modules.

Fix:
- Keep DEFAULT_PROVISION_TIMEOUT_MS = 120_000 (2min) — correct for fast
  docker runtimes (claude-code, langgraph, crewai) where cold boot is
  30-90s.
- Add RUNTIME_TIMEOUT_OVERRIDES_MS = { hermes: 720_000 } (12min).
  Aligns with tests/e2e/test_staging_full_saas.sh's
  PROVISION_TIMEOUT_SECS=900 (15min) so UI warns shortly before the
  backend itself gives up.
- New timeoutForRuntime() resolves the base; per-node lookup in the
  check-timeouts interval so a mixed batch (1 hermes + 2 langgraph) uses
  the right threshold for each.
- timeoutMs prop is now optional. Undefined → per-runtime lookup; a
  number → forces a single threshold for every workspace (tests use this
  for deterministic behavior).

Tests: 4 new cases pinning the runtime-aware resolution, including a
guard that catches future regressions that would weaken hermes's budget.
Existing tests unchanged (they import DEFAULT_PROVISION_TIMEOUT_MS which
still exports 120_000).

13/13 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 11:46:09 -07:00
molecule-ai[bot]
345dc9c2b4
Merge pull request #2033 from Molecule-AI/fix/validateagenturl-testnet-blocklist
fix(registry): block RFC 5737 TEST-NET and RFC 3849 documentation IPs
2026-04-24 18:42:18 +00:00
molecule-ai[bot]
312af5a94a
Merge pull request #2020 from Molecule-AI/fix/gh-identity-plugin-role-env
feat(#1957): wire gh-identity plugin into workspace-server
2026-04-24 18:42:14 +00:00
Molecule AI Core Platform Lead
49fc97e6e4 refactor(canvas): remove unused EmbeddedTeam component from WorkspaceNode
EmbeddedTeam was defined in WorkspaceNode.tsx but had no call site —
TeamMemberChip (which is called directly) covers the same rendering
responsibility. The function was stranded after a prior refactor and
was flagged by github-code-quality on PR #1989 (merged 2026-04-24T14:09Z
without this cleanup because the token died before push).

Removes 25 lines of dead code. MAX_NESTING_DEPTH is kept — it is used
by TeamMemberChip at line 498.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 18:30:36 +00:00
Hongming Wang
40cfc55784 feat(#1957): wire gh-identity plugin into workspace-server
Ships the monorepo side of molecule-core#1957 (agent identity collapse).
Companion to molecule-ai-plugin-gh-identity (new repo, merged-and-tagged
separately).

Changes:
- manifest.json: add gh-identity plugin to Tier 1 registry
- workspace-server/go.mod: require github.com/Molecule-AI/molecule-ai-plugin-gh-identity
- cmd/server/main.go: build a shared provisionhook.Registry, register
  gh-identity first (always), then github-app-auth (gated on GITHUB_APP_ID)
- workspace_provision.go: propagate workspace.Role into
  env["MOLECULE_AGENT_ROLE"] before calling the mutator chain, so the
  gh-identity plugin can see which agent is booting
- provisionhook/mutator.go: add Registry.Mutators() accessor so
  individual-plugin registries can be merged onto a shared one at boot

Boot log gains a line like:
  env-mutator chain: [gh-identity github-app-auth]

Effect per workspace:
- env contains MOLECULE_AGENT_ROLE, MOLECULE_OWNER, MOLECULE_ATTRIBUTION_BADGE,
  MOLECULE_GH_WRAPPER_B64, MOLECULE_GH_WRAPPER_SHA
- Each workspace template's install.sh can decode + install the wrapper at
  /usr/local/bin/gh, intercepting @me assignment and prepending agent
  attribution on PR/issue creates

Does not break existing workspaces — absent workspace.role, the plugin is
a no-op. Absent install.sh updates in each template, the env vars are
simply unused.

Follow-up template PRs (hermes, claude-code, langgraph, etc.) each add
~15 lines to install.sh to decode + install the wrapper.

Ref: #1957

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 18:28:18 +00:00
a2a6121a3f fix(registry): block RFC 5737 TEST-NET and RFC 3849 documentation IPs
PR #2021 follow-up: add TEST-NET reserved ranges and IPv6 documentation
prefix to validateAgentURL blocklist in all SaaS/self-hosted modes.

RFC 5737 reserves 192.0.2.0/24, 198.51.100.0/24, and 203.0.113.0/24 for
documentation and example code — no production agent has a legitimate
reason to use them. RFC 3849 designates 2001:db8::/32 as the IPv6
documentation prefix. All are blocked unconditionally.

Also adds 8 regression test cases covering each blocked range.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 18:27:07 +00:00
molecule-ai[bot]
f5d44eba8c
Merge pull request #2048 from Molecule-AI/fix/active-tasks-cancellation-stuck-2026
fix(executors): active_tasks stuck at 1 under CancelledError — queue drain blocked (#2026)
2026-04-24 18:17:03 +00:00
molecule-ai[bot]
90def3f3b9
Merge pull request #2040 from Molecule-AI/hotfix/canvasorbearer-return-main
hotfix(middleware): P0 — add missing return after AbortWithStatusJSON in CanvasOrBearer
2026-04-24 18:16:05 +00:00
f11b1703f0 hotfix(wsauth+restart_template): CanvasOrBearer return + CWE-22 path traversal guard
- wsauth_middleware: add missing return after AbortWithStatusJSON in
  CanvasOrBearer final else branch (CRITICAL auth bypass)
- restart_template: apply sanitizeRuntime before filepath.Join to
  prevent CWE-22 path traversal via dbRuntime field
2026-04-24 18:12:07 +00:00
molecule-ai[bot]
6b557082d5
Merge branch 'staging' into hotfix/canvasorbearer-return-main 2026-04-24 18:10:35 +00:00
Hongming Wang
4b0c85b2a4
Merge pull request #2046 from Molecule-AI/fix/scheduler-wedge-2026
fix(scheduler): prevent wedge on invalid UTF-8 + unbounded DB ops (#2026)
2026-04-24 18:05:33 +00:00
molecule-ai[bot]
f71557482f fix(test): rename duplicate TestCanvasOrBearer_WrongOrigin test at line 946 — resolves Platform(Go) CI compile error on PR #2040 2026-04-24 18:04:13 +00:00
4034f0dc55 fix(middleware): add missing return after AbortWithStatusJSON in CanvasOrBearer
P0 security: CanvasOrBearer final else branch aborts with 401 but
continues execution to c.Next() — allowing the downstream handler to
overwrite the 401 response. Regression tests added to verify the handler
is not called after AbortWithStatusJSON in both no-cred and wrong-origin
paths.

Confirmed on origin/main @ 69408ab6 and origin/staging @ 6b62391e.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 18:04:13 +00:00
Molecule AI Core Platform Lead
6f24cc0961 fix(executors): move set_current_task inside try so active_tasks always decrements (#2026)
If asyncio.CancelledError arrived during the heartbeat HTTP push inside
set_current_task() (the increment call), the code raised before entering
the try/finally block in _execute_locked. The finally block never ran,
so active_tasks stayed at 1 forever. Every subsequent heartbeat reported
active_tasks=1, the server saw active_tasks < max_concurrent_tasks as
false (1 < 1), and DrainQueueForWorkspace never fired. Queued A2A
requests were permanently stuck.

Fix: move set_current_task(increment) to be the FIRST statement inside
the try block, not before it. set_current_task's synchronous portion
(heartbeat.active_tasks mutation) still runs unconditionally; only the
optional HTTP push can be cancelled. The finally block now always runs
and always decrements active_tasks back to 0.

Affected executors: claude_sdk_executor, cli_executor, a2a_executor.
hermes_executor is not affected (does not call set_current_task).

Root cause of today's "active_tasks: 1 + queue drain never triggers"
P1 pattern across three workspaces.

All 167 executor tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 18:03:12 +00:00
rabbitblood
fa56cc964b fix(scheduler): prevent wedge on invalid UTF-8 + unbounded DB ops (#2026)
Two stalls in cycle 132 traced to the same root cause: activity_logs
INSERTs were wedging on invalid UTF-8 bytes (observed: 0xe2 0x80 0x2e)
and the surrounding DB operations had no deadlines, so a single stuck
transaction blocked wg.Wait() in tick() and stalled the whole scheduler
until a container restart.

Root cause: truncate() did byte-slicing without UTF-8 boundary checks.
A prompt containing U+2026 (`…` = 0xe2 0x80 0xa6) at byte ~197 was
sliced at maxLen-3, producing the trailing fragment 0xe2 0x80 followed
by '.' (0x2e) from the "..." suffix — Postgres rejects this as invalid
UTF-8 for jsonb, holds the transaction open, and the INSERT never
returns.

Fix:
- truncate(): UTF-8 safe — backs up to a rune boundary via utf8.RuneStart
- sanitizeUTF8(): new helper applied to every agent-produced string
  before it crosses the DB boundary (prompt, error detail, schedule name)
- dbQueryTimeout = 10s on every scheduler DB call:
  - tick() due-schedules query
  - capacity-check queries in fireSchedule
  - empty-run counter UPDATE / reset
  - activity_logs INSERTs (fireSchedule + recordSkipped)
  - recordSkipped bookkeeping UPDATE
- Bookkeeping writes use context.Background() parent (F1089 pattern)
  so fireTimeout / shutdown cancellation can't silently skip the UPDATE.

Regression tests lock in the 0xe2 0x80 0x2e wedge: truncate() is
verified UTF-8-valid and never produces that byte sequence even when
input contains a multi-byte rune at the cut position.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 11:00:47 -07:00
Hongming Wang
a59f1a6ce4
Merge pull request #2036 from Molecule-AI/sync/staging-to-main-2026-04-24-final
chore: promote sync-to-main-final → main (finish #1981)
2026-04-24 11:00:41 -07:00
Molecule AI Marketing Lead
de19cf9bae fix(canvas): apply flat-rate pricing copy for Phase 34 launch (Issue #1833)
Rename "Starter" → "Team", update tagline + pricing page hero copy to
lead with flat-rate per-org positioning — deliberate wedge against
Cursor/Windsurf per-seat pricing ($40/seat vs $29/org).

PMM decision: Issue #1833. Approved by Marketing Lead 2026-04-24.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 17:54:23 +00:00
molecule-ai[bot]
ad89049c66
Merge pull request #2034 from Molecule-AI/hotfix/canvasorbearer-return-staging
hotfix(wsauth_middleware): add missing return after AbortWithStatusJSON — CRITICAL auth bypass
2026-04-24 17:23:53 +00:00
95f0f3c9e9 fix(wsauth_middleware): add missing return after AbortWithStatusJSON in CanvasOrBearer (CRITICAL auth bypass) 2026-04-24 17:14:26 +00:00
molecule-ai[bot]
fa1536e2f8
chore: sync staging to main — 2026-04-24 04h (71 commits)
chore: sync staging to main — 2026-04-24 04h (71 commits)
2026-04-24 17:13:22 +00:00
ca7fa3b65e fix(e2e): increase hermes workspace wait from 20 to 30 min
Root cause of PR #1981 E2E failures (step 7 timeout):
- hermes-agent install from NousResearch (Node 22 tarball + Python
  deps from source) + gateway health wait takes 15-25 min on staging
2026-04-24 17:11:37 +00:00
molecule-ai[bot]
3dda26766f
Merge pull request #2025 from Molecule-AI/fix/ki005-orgtoken-terminal-routing
fix(terminal): org-token A2A routing regression — skip ValidateToken when org_token_id already set
2026-04-24 17:02:02 +00:00
molecule-ai[bot]
a157ae2188
Merge pull request #2023 from Molecule-AI/fix/ssrf-wrapper-tests
test(handlers): add SaaS-mode wrapper tests for isSafeURL and validateAgentURL
2026-04-24 17:02:01 +00:00
molecule-ai[bot]
60b85dc553
Merge pull request #1977 from Molecule-AI/feat/1957-gh-identity-plugin-wireup
feat(#1957): wire gh-identity plugin — per-agent attribution via env injection
2026-04-24 16:54:57 +00:00
Molecule AI Core Platform Lead
4ff45f8955 fix(registry): add always-blocked ranges to validateAgentURL (TEST-NET, CGNAT, multicast, fc00)
The validateAgentURL function was missing several ranges from the always-
blocked list. In SaaS mode only link-local, loopback, and IPv6 metadata
were blocked — TEST-NET (192.0.2/24, 198.51.100/24, 203.0.113/24),
CGNAT (100.64.0.0/10), IPv4 multicast (224.0.0.0/4), and fc00::/8 (IPv6
ULA non-routable prefix) were allowed through.

These ranges are never valid agent URLs in any deployment:
- TEST-NET (RFC-5737): documentation-only, no real hosts
- CGNAT (RFC-6598): never used as VPC subnets on AWS/GCP/Azure
- IPv4 multicast: never a unicast agent endpoint
- fc00::/8: non-routable prefix (fd00::/8 stays allowed in SaaS mode)

Also tighten the non-SaaS ULA block: instead of blocking fc00::/7 (the
supernet covering both fc00 and fd00), split it into always-blocked
fc00::/8 (above) + non-SaaS-only fd00::/8. This makes the SaaS relaxation
explicit and auditable.

Fixes TestValidateAgentURL_SaaSMode_StillBlocksMetadataEtAl failure.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 16:54:23 +00:00
Molecule AI Core Platform Lead
78f8391f02 fix(terminal): check org_token_id context to allow org-token A2A routing (KI-005 followup)
PR #1885 introduced a regression: HandleConnect called wsauth.ValidateToken
for any bearer token when X-Workspace-ID ≠ workspaceID. Org-scoped tokens
(org_api_tokens table) are not in workspace_auth_tokens, so ValidateToken
always returned ErrInvalidToken for them → hard 401 for all A2A routing
that uses org tokens.

Fix: if WorkspaceAuth already validated an org token (org_token_id set in
gin context by orgtoken.Validate), skip the workspace_auth_tokens lookup and
trust the X-Workspace-ID claim. Hierarchy enforcement via canCommunicateCheck
is unchanged — org token holders are still subject to the workspace hierarchy.

Workspace-scoped tokens continue to require ValidateToken binding. Invalid
tokens (neither workspace-bound nor org-level) still return 401. This closes
the regression while preserving the KI-005 security property.

Add TestKI005_OrgToken_SkipsValidateToken to terminal_test.go as a regression
guard for this exact path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 16:17:50 +00:00