From fef6a6210e7693d95605ae83c742b60471854036 Mon Sep 17 00:00:00 2001
From: Hongming Wang <hongmingwangrabbit@gmail.com>
Date: Thu, 30 Apr 2026 00:05:47 -0700
Subject: [PATCH] fix: probe ordering + PID file cleanup on exit(2) (v0.2.2)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Independent five-axis review of v0.2.1 surfaced three issues. All
three only manifest after the upgrade-or-die path actually fires,
which is rare in normal operation but exactly the failure mode the
probe was added for — so quiet correctness matters.

1. **CRITICAL: process.exit(2) orphans PID file → cross-process-kill
   hazard.** v0.2.1 added a fatal exit on too_old probe response, but
   left the PID file pointing at this (about-to-die) process. Next
   launch reads PID_FILE, calls `process.kill(stale, 'SIGTERM')` —
   the dead PID is now likely owned by an unrelated process, so the
   plugin SIGTERMs whatever happens to be there. This is exactly the
   bug the singleton lock dance was designed to prevent. Fix: install
   a `process.on('exit')` listener that unlinks PID_FILE iff this
   process still owns it (handles all exit paths: clean shutdown,
   probe failure, unhandledException, etc.). Listener uses sync I/O
   (only path 'exit' permits) and is no-op-safe if another process
   has already taken ownership of the file.

2. **Probe ran AFTER mcp.connect(transport).** Claude Code had
   already finished the MCP initialize handshake when the probe
   detected too_old and exited; from Claude's perspective this looked
   like "MCP server crashed mid-session" — the carefully-written
   stderr explanation isn't surfaced. Probe must run BEFORE
   mcp.connect so the failure mode is "server failed to start" with
   stderr visible to the user.

3. **Probe ran AFTER registerAsPoll().** If the platform predates
   delivery_mode=poll AND since_id (consistent — both shipped under
   #2339), registerAsPoll() may have already mutated the workspace's
   delivery_mode on the platform. Then the probe fails and we exit,
   leaving the workspace in a broken half-configured state. Probe
   must run BEFORE registerAsPoll for the same reason it must run
   before mcp.connect — fail without side effects.

Reordered boot sequence:
  loadCursors → probeCursorSupport (parallel via allSettled) →
  mcp.connect → registerAsPoll → poll loop.

Also parallelized the probe with Promise.allSettled — sequentially
it was up to N × 15s on a hung platform, which adds up for
multi-workspace channels.

README updated to document that 401/403/404/5xx from the probe are
treated as inconclusive (orthogonal to cursor support — usually
auth or transient) and the plugin continues; this avoids users
debugging a phantom probe failure when the real issue is their
token.

Bumps to 0.2.2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 README.md    |  2 +-
 package.json |  2 +-
 server.ts    | 78 ++++++++++++++++++++++++++++++++++++----------------
 3 files changed, 57 insertions(+), 25 deletions(-)

diff --git a/README.md b/README.md
index e100a92..287da87 100644
--- a/README.md
+++ b/README.md
@@ -136,7 +136,7 @@ A2A messages can carry `Part` entries with `url` and `media_type`. The MVP deliv
 
 ## Compatibility
 
-- **molecule-runtime/workspace-server**: requires `delivery_mode=poll` support (`/registry/register` + a2a_proxy short-circuit, molecule-core PRs #2348 + #2353) and the `since_id` cursor on `GET /activity` (PR #2354). All three shipped under issue #2339, available staging-onward. The plugin probes for cursor support on startup (sends a known-invalid UUID, expects `410 Gone`) and exits with code 2 if the platform predates PR #2354 — silent re-delivery is a worse failure mode than failing to start.
+- **molecule-runtime/workspace-server**: requires `delivery_mode=poll` support (`/registry/register` + a2a_proxy short-circuit, molecule-core PRs #2348 + #2353) and the `since_id` cursor on `GET /activity` (PR #2354). All three shipped under issue #2339, available staging-onward. The plugin probes for cursor support on startup (sends a known-invalid UUID, expects `410 Gone`) and exits with code 2 if the platform predates PR #2354 — silent re-delivery is a worse failure mode than failing to start. `401`/`403`/`404`/`5xx` from the probe are treated as inconclusive (orthogonal to cursor support — usually a token, workspace_id, or transient-network issue) and the plugin continues to the poll loop where the real failure surfaces with workspace-level context.
 - **Claude Code**: tested against the channel-plugin contract that expects `notifications/claude/channel` with `{content, meta}` (matches `@claude-plugins-official/telegram` v0.0.6).
 - **bun**: the MCP server runs under bun for fast startup; `package.json` `start` does `bun install --no-summary && bun server.ts` so no global install needed.
 
diff --git a/package.json b/package.json
index 57b4c4d..c516c60 100644
--- a/package.json
+++ b/package.json
@@ -1,6 +1,6 @@
 {
   "name": "molecule-mcp-claude-channel",
-  "version": "0.2.1",
+  "version": "0.2.2",
   "description": "Molecule AI channel for Claude Code — bridges A2A traffic into a Claude Code session via MCP",
   "license": "Apache-2.0",
   "type": "module",
diff --git a/server.ts b/server.ts
index 673d43e..7de05b7 100644
--- a/server.ts
+++ b/server.ts
@@ -37,7 +37,7 @@ import {
   CallToolRequestSchema,
 } from '@modelcontextprotocol/sdk/types.js'
 import { z } from 'zod'
-import { readFileSync, writeFileSync, mkdirSync, chmodSync, existsSync, renameSync } from 'fs'
+import { readFileSync, writeFileSync, mkdirSync, chmodSync, existsSync, renameSync, unlinkSync } from 'fs'
 import { homedir } from 'os'
 import { join } from 'path'
 
@@ -131,6 +131,23 @@ try {
 } catch {}
 writeFileSync(PID_FILE, String(process.pid))
 
+// Unlink the PID file on every exit path — including process.exit(N)
+// from the cursor-support probe (v0.2.1) which doesn't go through the
+// SIGINT/SIGTERM handlers. Without this, a non-clean exit leaves a
+// stale pid in PID_FILE pointing at a dead pid; the next launch's
+// `process.kill(stale, 'SIGTERM')` (above) would deliver the signal to
+// whatever unrelated process now owns that PID — exactly the cross-
+// process-kill hazard the singleton lock exists to prevent. exit
+// listeners only run synchronous code; unlinkSync is the right tool.
+process.on('exit', () => {
+  try {
+    const owned = parseInt(readFileSync(PID_FILE, 'utf8'), 10)
+    if (owned === process.pid) unlinkSync(PID_FILE)
+  } catch {
+    // Already gone, or another process took ownership — leave it alone.
+  }
+})
+
 // Last-resort safety net — without these the process dies silently on any
 // unhandled promise rejection. With them it logs and keeps serving tools.
 process.on('unhandledRejection', err => {
@@ -495,7 +512,7 @@ function emitNotification(mcp: Server, workspaceId: string, act: ActivityEntry):
 // ─── MCP server ─────────────────────────────────────────────────────────
 
 const mcp = new Server(
-  { name: 'molecule', version: '0.2.1' },
+  { name: 'molecule', version: '0.2.2' },
   { capabilities: { tools: {} } },
 )
 
@@ -621,29 +638,31 @@ mcp.setRequestHandler(CallToolRequestSchema, async req => {
 
 loadCursors()
 
-const transport = new StdioServerTransport()
-await mcp.connect(transport)
-
-// Self-register each workspace as poll-mode BEFORE the first poll fires.
-// Sequenced (not Promise.all) so failures are surfaced one at a time and
-// the operator can spot which workspace's token is bad.
-if (AUTO_REGISTER_POLL) {
-  for (const id of WORKSPACE_IDS) {
-    await registerAsPoll(id)
-  }
-}
-
-// Compat probe: confirm the platform supports the since_id cursor before
-// we start the poll loop. We probe each workspace independently because
-// a multi-tenant deployment could theoretically have tenants on different
-// workspace-server image SHAs (rolling redeploy in progress, blue/green,
-// etc.). Any 'too_old' answer kills the channel — silent re-delivery is
-// the worst failure mode.
+// Compat probe FIRST — before we open the MCP transport or self-register
+// any workspaces. v0.2.1 had this probe AFTER mcp.connect+registerAsPoll,
+// which had two bugs:
+//   1. mcp.connect already finished the initialize handshake, so a
+//      probe-failure exit looked like "MCP server crashed mid-session"
+//      to Claude Code (which swallows the stderr explanation) instead of
+//      the cleaner "server failed to start" with the upgrade message.
+//   2. registerAsPoll() may have already mutated the platform's
+//      delivery_mode for a workspace whose workspace-server can't honor
+//      poll, leaving the workspace in a broken state if we then exit.
+// Probing first is purely a startup-ordering fix; the probe semantics
+// (410 → ok, 200 → too_old, anything else → inconclusive) are unchanged.
+//
+// Probes run in parallel (allSettled) — sequentially they were N × 15s
+// at worst, which adds up for multi-workspace channels. Order doesn't
+// matter for the verdict; we only care if any one came back too_old.
 {
+  const results = await Promise.allSettled(
+    WORKSPACE_IDS.map(id => probeCursorSupport(id).then(r => ({ id, r }))),
+  )
   let anyTooOld = false
-  for (const id of WORKSPACE_IDS) {
-    const result = await probeCursorSupport(id)
-    if (result === 'too_old') {
+  for (const settled of results) {
+    if (settled.status !== 'fulfilled') continue
+    const { id, r } = settled.value
+    if (r === 'too_old') {
       anyTooOld = true
       process.stderr.write(
         `molecule channel: workspace ${id} on a platform that predates ` +
@@ -658,10 +677,23 @@ if (AUTO_REGISTER_POLL) {
       `molecule channel: refusing to start in poll mode against an older platform. ` +
       `Pin MOLECULE_PLATFORM_URL to an upgraded tenant or downgrade to plugin v0.1.\n`
     )
+    // exit triggers the 'exit' listener, which unlinks the PID file.
     process.exit(2)
   }
 }
 
+const transport = new StdioServerTransport()
+await mcp.connect(transport)
+
+// Self-register each workspace as poll-mode BEFORE the first poll fires.
+// Sequenced (not Promise.all) so failures are surfaced one at a time and
+// the operator can spot which workspace's token is bad.
+if (AUTO_REGISTER_POLL) {
+  for (const id of WORKSPACE_IDS) {
+    await registerAsPoll(id)
+  }
+}
+
 process.stderr.write(
   `molecule channel: connected — watching ${WORKSPACE_IDS.length} workspace(s) at ${PLATFORM_URL}\n` +
   `  workspaces: ${WORKSPACE_IDS.join(', ')}\n` +