PR #32 (workspace template) merged 2026-05-02; image rebuild
succeeded. Plugin baked in. Local full-chain E2E green; caught + fixed
a real KeyError in upstream hermes_cli/tools_config.py. Upstream PR
#18775 still OPEN/CONFLICTING — not on critical path.
Also rewrites hermes-platform-plugins-upstream-pr.md to reflect the
final landing shape (existing hermes_cli/plugins.py, not a new
plugins/platforms/ system).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The harness needs Authorization + X-Molecule-Org-Id (per-tenant, NOT
CP_ADMIN_API_TOKEN) when targeting *.moleculesai.app subdomains.
The existing single-Origin-header form silently failed with a 404 against
staging tenants since the SaaS edge WAF rewrites unauthenticated
/workspaces calls to Next.js (per
reference_saas_waf_origin_header.md).
Switch to a headers array so multiple -H flags compose cleanly under
curl's argument quoting, and document the env var contract at the top
of the script.
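For reference, a minimal sketch of the headers-array pattern (env var
names here are illustrative, not necessarily the script's final
contract):

```
#!/usr/bin/env bash
set -euo pipefail

# Per-tenant credentials (NOT CP_ADMIN_API_TOKEN, per the contract above).
: "${MOLECULE_API_TOKEN:?per-tenant API token}"
: "${MOLECULE_ORG_ID:?tenant org id}"
: "${PLATFORM:?e.g. https://demo-tenant.staging.moleculesai.app}"

# An array keeps every -H flag and its value intact through expansion,
# however many headers get appended later.
HEADERS=(
  -H "Authorization: Bearer ${MOLECULE_API_TOKEN}"
  -H "X-Molecule-Org-Id: ${MOLECULE_ORG_ID}"
)

curl -fsS "${HEADERS[@]}" "${PLATFORM}/workspaces"
```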
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds two scripts:
scripts/test-all-runtimes-a2a-e2e.sh
Provisions one workspace per runtime (claude-code, hermes, codex,
openclaw), sets provider keys, waits online, sends two A2A messages
per workspace. First message validates round-trip; second message
validates session continuity. Cleans up via trap on EXIT.
scripts/test-hermes-plugin-e2e.sh
Hermes-only variant focused on the plugin /a2a/inbound path.
Proof-point: session continuity between turns (the plugin path's key
deliverable; the old chat-completions path lost context on every turn).
Both honor SKIP_<runtime> env vars for incremental testing and tolerate
the SaaS edge WAF Origin header requirement (per
reference_saas_waf_origin_header.md).
Run:
PLATFORM=https://demo-tenant.staging.moleculesai.app \
./scripts/test-all-runtimes-a2a-e2e.sh
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The hourly Sweep stale Cloudflare Tunnels job got cancelled mid-cleanup
on 2026-05-02 (run 25248788312, killed at the 5-minute cap after deleting
424/672 stale tunnels). A second manual dispatch finished the remaining 254
fine, so the immediate backlog cleared, but two underlying bugs would
re-trip on the next big cleanup.
Bug 1: serial delete loop. The execute branch was a `while read; do
curl -X DELETE; done` pipeline at ~0.7s/tunnel — fine for the
steady-state cleanup of a handful, but a 600+ backlog needs ~7-8min.
This commit fans out to $SWEEP_CONCURRENCY (default 8) workers via
`xargs -P 8 -L 1 -I {} bash -c '...' _ {} < "$DELETE_PLAN"`. With 8x
parallelism the same 600+ list drains in ~60s (runnable sketch after
the notes). Notes:
- We use stdin (`<`) not GNU's `xargs -a FILE` so the script stays
portable to BSD xargs (matters for local-runner testing on macOS).
- We pass ONLY the tunnel id on argv. xargs tokenizes on whitespace
by default; tab-separating id+name on argv risks mangling. The
name is kept in a side-channel id->name map ($NAME_MAP) and looked
up by the worker only on failure, for FAIL_LOG readability.
- Workers print exactly `OK` or `FAIL` on stdout; tally with
`grep -c '^OK$' / '^FAIL$'`.
- On non-zero FAILED, log the first 20 lines of $FAIL_LOG as
"Failure detail (first 20):" — same diagnostic surface as before
but consolidated so we don't spam logs on a flaky CF API.
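Condensed sketch of the fan-out (endpoint per the CF v4 API; token and
account env vars assumed; tempfile names as above):

```
export CF_API="https://api.cloudflare.com/client/v4/accounts/${CF_ACCOUNT_ID}/cfd_tunnel"
export CF_API_TOKEN NAME_MAP FAIL_LOG

worker() {
  local id="$1"
  if curl -fsS -X DELETE "${CF_API}/${id}" \
       -H "Authorization: Bearer ${CF_API_TOKEN}" >/dev/null; then
    echo OK
  else
    # Side-channel name lookup, only on failure (short appends are atomic enough).
    local name; name=$(awk -F'\t' -v id="$id" '$1==id{print $2}' "$NAME_MAP")
    echo "${id} ${name}" >>"$FAIL_LOG"
    echo FAIL
  fi
}
export -f worker

xargs -P "${SWEEP_CONCURRENCY:-8}" -L 1 -I {} bash -c 'worker "$@"' _ {} \
  <"$DELETE_PLAN" >"$RESULT_LOG"

DELETED=$(grep -c '^OK$' "$RESULT_LOG" || true)
FAILED=$(grep -c '^FAIL$' "$RESULT_LOG" || true)
if (( FAILED > 0 )); then
  echo "Failure detail (first 20):"
  head -n 20 "$FAIL_LOG"
fi
```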
Bug 2: the workflow's 5-min cap was meant as a hang detector but in
practice tripped on a legitimately slow job. Raised to 30 min — generous
headroom for the ~60s steady-state run while still surfacing genuine
hangs (and in line with the sweep-cf-orphans companion job).
Bug 3 (drive-by): the existing trap was `trap 'rm -rf "$PAGES_DIR"'
EXIT`, which would have been silently overwritten by any later trap
registration. Replaced with a single `cleanup()` function that wipes
PAGES_DIR + all four new tempfiles (DELETE_PLAN, NAME_MAP, FAIL_LOG,
RESULT_LOG), called once via `trap cleanup EXIT`.
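Shape of the consolidated trap (names per this commit):

```
cleanup() {
  rm -rf "$PAGES_DIR"
  rm -f "$DELETE_PLAN" "$NAME_MAP" "$FAIL_LOG" "$RESULT_LOG"
}
trap cleanup EXIT  # single registration; later code extends cleanup(), never re-traps
```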
Verification:
- bash -n scripts/ops/sweep-cf-tunnels.sh: clean
- shellcheck -S warning scripts/ops/sweep-cf-tunnels.sh: clean
- python3 yaml.safe_load on the workflow: clean
- Synthetic 30-line delete plan with every 7th id sentinel'd to
return {"success":false}: TEST PASS, DELETED=26 FAILED=4, FAIL_LOG
side-channel name lookup verified.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-up to PR #2509/#2510. The defensive v1-detection branches in
extract_attached_files (Python) and extractFilesFromTask (TypeScript)
were merged with comments claiming they fix a "v0→v1 silent-drop"
bug that surfaced as the 2026-05-01 hongming "no text content"
incident. Live test disproved that hypothesis: a2a-sdk's JSON-RPC
layer validates inbound requests against the v0 Pydantic union, so
v1 shapes are rejected at the request boundary — the v1 detection
branch is unreachable on the JSON-RPC ingress path. The actual root
cause of the hongming incident was the missing /workspace chown
fixed by CP PR #381 + test #382.
Update the comments to honestly describe these branches as
defensive future-proofing (kept against an eventual SDK schema
migration or in-process callers that construct Parts directly from
protobuf), not as fixes for an observed bug. Also trims
ChatTab.tsx's outbound-shape comment block from ~21 lines to a
3-line pointer to the SDK union.
Comment-only change. No behavior change. 86 workspace tests + 91
canvas tests still pass.
Adds the OpenAI Codex CLI as a Molecule workspace runtime and lands
the design docs that drove the native-MCP parity push across the
claude-code, hermes, openclaw, and codex runtimes.
manifest.json:
- Adds `codex` workspace_template entry pointing at the new
Molecule-AI/molecule-ai-workspace-template-codex repo (initial
commit landed there in parallel; 14 files / 1411 LOC). The
workspace-server runtime registry already had `codex` in its
fallback set — this entry makes it manifest-reachable in prod.
docs/integrations/:
- runtime-native-mcp-status.md — index across all four runtime streams
- codex-app-server-adapter-design.md — full design including v2 RPC
sequence, executor skeleton, schema-vs-runtime drift findings
(real codex 0.72 returns thread.id, schema says thread.threadId)
- hermes-platform-plugins-upstream-pr.md — pre-submission draft of
the hermes-agent upstream PR
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Recurring failure pattern in redeploy-tenants-on-staging:
##[error]redeploy-fleet returned HTTP 500
##[error]Process completed with exit code 1.
with the per-tenant breakdown in the response body showing the failures
were on ephemeral e2e-* tenants (saas/canvas/ext) whose parent E2E run
tore them down mid-redeploy — SSM exit=2 because the EC2 was already
terminating, or healthz timeout because the CF tunnel was already gone.
The actual operator-facing tenants (dryrun-98407, demo-prep, etc.) all
rolled fine in the same call.
This shape repeats every staging push that overlaps an active E2E run.
The downstream `Verify each staging tenant /buildinfo matches published
SHA` step ALREADY distinguishes STALE vs UNREACHABLE for exactly this
reason (per #2402); only the top-level `if HTTP_CODE != 200; exit 1`
gate misclassifies the race.
Filter: HTTP 500 + every failed slug matches `^e2e-` → soft-warn and
fall through to verify. Any non-e2e-* failure or non-500 HTTP remains
a hard fail, with the failed non-e2e slugs surfaced in the error so
the operator doesn't have to dig the response body out of CI.
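The gate reduces to roughly this (response schema assumed: a failures
array carrying per-tenant slugs; the real CP body may differ):

```
if [[ "$HTTP_CODE" == "500" ]]; then
  FAILED_SLUGS=$(jq -r '.failures[]?.slug // empty' "$RESPONSE_BODY" 2>/dev/null || true)
  if [[ -n "$FAILED_SLUGS" ]] && ! grep -qv '^e2e-' <<<"$FAILED_SLUGS"; then
    # Every failed slug is e2e-*: the teardown race, not a real outage.
    echo "::warning::500 from redeploy-fleet but all failures are e2e-*; continuing to verify"
  else
    echo "::error::redeploy-fleet failed for non-e2e tenants:" \
      "$(grep -v '^e2e-' <<<"$FAILED_SLUGS" | tr '\n' ' ')"
    exit 1
  fi
elif [[ "$HTTP_CODE" != "200" ]]; then
  echo "::error::redeploy-fleet returned HTTP ${HTTP_CODE}"
  exit 1
fi
```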
Verified the gate logic with 6 synthetic CP responses (happy / e2e-only
race / mixed real+e2e fail / non-200 / 200+ok=false / all-real-fail) —
all behave correctly.
prod's redeploy-tenants-on-main is intentionally NOT touched: prod CP
serves no e2e-* tenants, so the race can't occur there and the strict
gate is the right behavior.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous PR (#2509) flipped canvas outbound file parts to the v1
flat shape `{url, filename, mediaType}` based on a hypothesis that
a2a-sdk's JSON-RPC parser silently dropped v0 `{kind:"file", file:{...}}`
shapes. Live test shows the opposite: a2a-sdk's JSON-RPC layer
validates against the v0 Pydantic discriminated union (TextPart |
FilePart | DataPart), so v1 flat shape is rejected with:
Invalid Request:
params.message.parts.0.TextPart.text — Field required
params.message.parts.0.FilePart.file — Field required
params.message.parts.0.DataPart.data — Field required
The actual root cause of the user-visible "Error: message contained
no text content" was the missing `/workspace` chown (CP PR #381 +
test pin #382), not a wire-shape mismatch. Verified end-to-end by
sending a v0 image-only message after PR #381 + workspace re-provision
— agent receives the file, reads its bytes, and replies normally.
Reverting only the canvas outbound shape. Defensive v1-tolerance
stays in:
- workspace/executor_helpers.py — extract_attached_files still
accepts v1 protobuf parts in case a future client emits them or
a future SDK release flips internal representation. Harmless on
the v0 hot path.
- canvas/message-parser.ts — extractFilesFromTask still tolerates
v1 shape on incoming agent responses. Some agents may emit v1
when their internal serializer round-trips through protobuf.
Tests stay green (91 canvas, 86 workspace).
Image-only chats surface "Error: message contained no text content"
because canvas posts v0 `{kind:"file", file:{uri,name,mimeType}}` shapes
that the workspace runtime's a2a-sdk v1 protobuf parser silently drops:
v1 `Part` has fields `[text, raw, url, data, metadata, filename,
media_type]` and `ignore_unknown_fields=True` discards `kind`+`file`,
producing a fully-empty Part. With no text and no extracted file
attachments, the executor's "no text content" guard fires.
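The two shapes side by side (field values illustrative):

```
# v0 nested shape: `kind` and `file` are not v1 Part fields, so
# ignore_unknown_fields=True discards both, leaving an empty Part.
v0='{"kind":"file","file":{"uri":"https://example.com/a.png","name":"a.png","mimeType":"image/png"}}'
# v1 flat shape: keys map 1:1 onto the v1 Part fields.
v1='{"url":"https://example.com/a.png","filename":"a.png","mediaType":"image/png"}'
```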
Three coordinated changes close the gap:
1. canvas/ChatTab.tsx — outbound file parts now carry the v1 flat
shape `{url, filename, mediaType}` so the v1 protobuf parser
populates Part fields instead of dropping them.
2. workspace/executor_helpers.py — extract_attached_files learns the
v1 detection branch (non-empty `part.url` + `filename` +
`media_type`) alongside the existing v0 RootModel and flat-file
shapes. Defends every runtime that mounts the OSS wheel against
the same drop, including any pre-fix client still on the wire.
3. canvas/message-parser.ts — extractFilesFromTask tolerates the v1
shape on incoming agent responses too, so file chips render in
chat history regardless of which Part shape the runtime emits.
Test pins:
- workspace/tests/test_executor_helpers.py:
+ v1 protobuf shape extraction
+ empty-Part defense (v0→v1 silent-drop fall-through returns [])
- canvas message-parser test:
+ v1 protobuf flat parts
+ filename fallback to URL basename for v1
The page-merge loop passed the entire accumulating tunnel JSON to
python3 -c via argv on every iteration. On a busy account (verified
2026-05-02: 672 tunnels, 14 pages on Hongmingwangrabbit account) this
exceeds Linux's per-argument size cap (MAX_ARG_STRLEN, ~128 KB) and
dies with `python3: Argument list too long` at exit 126 — the workflow
has been silently failing this way since the very first run that hit a
real account, masked earlier by a missing-CF_ACCOUNT_ID secret check.
Buffer each page response to a file under a temp dir, merge from disk
at the end. Also bumps the page cap from 20 to 40 (1000 → 2000 tunnel
ceiling) so the existing soft-cap warning has headroom; the disk-merge
shape is O(n) in tunnel count rather than the previous O(n^2) so the
larger ceiling is cheap.
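Buffered shape, roughly (CF v4 response envelope assumed; variable
names illustrative):

```
PAGES_DIR=$(mktemp -d)
page=1
while (( page <= 40 )); do   # raised cap: 40 pages * 50/page = 2000 ceiling
  curl -fsS "${CF_API}?per_page=50&page=${page}" \
    -H "Authorization: Bearer ${CF_API_TOKEN}" >"${PAGES_DIR}/${page}.json"
  [[ $(jq '.result | length' "${PAGES_DIR}/${page}.json") -eq 0 ]] && break
  page=$(( page + 1 ))
done

# Merge from disk: python3 reads files; nothing large ever crosses argv.
python3 - "$PAGES_DIR" <<'PY'
import json, pathlib, sys
pages = sorted(pathlib.Path(sys.argv[1]).glob("*.json"), key=lambda p: int(p.stem))
tunnels = [t for p in pages for t in json.loads(p.read_text())["result"]]
pathlib.Path("tunnels.json").write_text(json.dumps(tunnels))
PY
```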
Verified locally against the live account (672 tunnels): script now
runs cleanly to the existing MAX_DELETE_PCT safety gate, which trips
at 99% > 90% as designed and surfaces the actual orphan backlog for
operator-driven cleanup.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Canary started flaking 2026-05-01 22:11 with model-refusal replies:
- "I'm unable to do that."
- "I'm unable to fulfill that request. Can I assist you with anything else?"
- "I'm unable to reply with responses that don't allow me to fulfill tasks…"
3 fails / 10 recent runs ≈ 30% flake.
Trigger: 2026-04-30's Platform Capabilities preamble (#2332) added the
directive "Use them proactively" to the top of every system prompt.
Combined with the heavy A2A + HMA tool docs further down, the model
reads the contrived bare-echo prompt ("Reply with exactly: PONG") as
out-of-role and intermittently refuses.
Real user prompts don't hit this — only the synthetic smoke prompt does,
so the right fix is in the canary's prompt phrasing, not the platform's
system prompt (which is correctly priming agents toward tool use). New
phrasing explicitly tells the model "this is a smoke test" and "no
tools or memory are needed" so it has permission to comply.
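A plausible shape for the reworded prompt (exact wording in-repo may
differ):

```
This is an automated smoke test of the message round-trip.
No tools or memory are needed. Reply with exactly: PONG
```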
Also updates the child workspace's CHILD_PONG prompt with the same
framing — same failure mode would have hit it once full-mode runs again.
No code change to system prompt, no test infra change. Just two prompt
strings + a load-bearing comment so future readers don't trim back to
the brittle phrasing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#1569 Phase 1 discovery (2026-05-02) found six historical credential
exposures in molecule-core git history. All confirmed dead — but the
reason they got committed in the first place was that the local
pre-commit hook had two gaps that the canonical CI gate (and the
runtime's hook) didn't:
1. **Pattern set was incomplete.** Local hook checked
`sk-ant-|sk-proj-|ghp_|gho_|AKIA|mol_pk_|cfut_` — missing
`ghs_*`, `ghu_*`, `ghr_*`, `github_pat_*`, `sk-svcacct-`,
`sk-cp-`, `xox[baprs]-`, `ASIA*`. The historical leaks were 5×
`ghs_*` (App installation tokens) + 1× `github_pat_*` — none of
which the local hook would have caught even if it ran.
2. **`*.md` and `docs/` were skip-listed.** The leaked tokens lived
in `tick-reflections-temp.md`, `qa-audit-2026-04-21.md`, and
`docs/incidents/INCIDENT_LOG.md` — exactly the file types the
skip-list excluded. The hook ran and silently passed.
This commit:
- Replaces the local hook's hard-coded inline regex with the canonical
  13-pattern array (byte-aligned with `.github/workflows/secret-scan.yml`
  and the workspace runtime's `pre-commit-checks.sh`); see the sketch
  after this list.
- Removes the `\.md$|docs/` skip — keeps only binary, lockfile, and
hook-self exclusions.
- Adds the local hook to `lint_secret_pattern_drift.py` as an in-repo
consumer (read-from-disk, no network — the hook lives in the same
checkout the lint runs against). Drift now fails the lint when
canonical changes without the local hook updating in lockstep.
- Adds `.githooks/pre-commit` to the drift-lint workflow's path
filter so consumer-side edits also trigger the lint.
- Adopts the canonical's "don't echo the matched value" defense (the
prior version would have round-tripped a leaked credential into
scrollback / CI logs).
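Core of the rebuilt secret gate, sketched (pattern list illustrative;
the byte-for-byte source of truth stays the canonical array):

```
PATTERNS=(
  'sk-ant-' 'sk-proj-' 'sk-svcacct-' 'sk-cp-'
  'ghp_' 'gho_' 'ghs_' 'ghu_' 'ghr_' 'github_pat_'
  'xox[baprs]-' 'AKIA' 'ASIA' 'mol_pk_' 'cfut_'
)
REGEX=$(IFS='|'; echo "${PATTERNS[*]}")

FAIL=0
while read -r f; do
  # Only binary, lockfile, and hook-self exclusions: no *.md / docs/ skip.
  case "$f" in *.png|*.jpg|*.lock|.githooks/pre-commit) continue ;; esac
  if git show ":$f" | grep -Eq "$REGEX"; then
    # Canonical defense: name the file, never echo the matched value.
    echo "pre-commit: credential-shaped string in $f" >&2
    FAIL=1
  fi
done < <(git diff --cached --name-only --diff-filter=ACM)
exit "$FAIL"
```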
Verified: `python3 .github/scripts/lint_secret_pattern_drift.py`
reports both consumers aligned at 13 patterns. The hook's existing
six other gates (canvas 'use client', dark theme, SQL injection,
go-build, etc.) are untouched.
Companion change (already applied via API, no diff here):
`Scan diff for credential-shaped strings` is now in the required-checks
list on both `staging` and `main` branch protection — was previously a
soft gate (workflow ran, exited 1, but didn't block merge).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both restart paths (interactive Restart handler + auto-restart's
stopForRestart) used to log-and-continue on cpProv.Stop failure. After
PR #2500 made CPProvisioner.Stop surface CP non-2xx as an error, those
paths became the actual leak generator: every transient CP/AWS hiccup =
one orphan EC2 alongside the freshly provisioned one. The 13 zombie
workspace EC2s on demo-prep staging traced to this exact path.
Adds cpStopWithRetry helper with bounded exponential backoff (3 attempts,
1s/2s/4s). Different policy from workspace_crud.go's Delete handler:
Delete returns 500 to the client on Stop failure (loud-fail-and-block —
user asked to destroy, silent leak unacceptable), whereas Restart's
contract is "make the workspace alive again" — refusing to reprovision
strands the user with a dead workspace. So this helper retries to absorb
transient failures, then on exhaustion emits a structured `LEAK-SUSPECT`
log line for the (forthcoming) CP-side workspace orphan reconciler to
correlate. Caller proceeds to reprovision regardless.
ctx-cancel exits the retry early without sleeping the backoff (matters
during shutdown drain); the cancel path emits a distinct log line and
deliberately does NOT emit LEAK-SUSPECT — operator-cancel and
retry-exhaustion are different signals and conflating them would noise
up the orphan-reconciler queue with workspaces we never had a chance to
retry.
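Policy sketch only (the real helper is Go; the ctx-cancel early-exit
path is not representable here):

```
cp_stop_with_retry() {
  local ws="$1"
  for delay in 1 2 4; do                 # 3 attempts, 1s/2s/4s backoff
    if cp_stop "$ws"; then return 0; fi  # cp_stop stands in for CPProvisioner.Stop
    sleep "$delay"
  done
  # Exhaustion: structured breadcrumb for the orphan reconciler.
  echo "LEAK-SUSPECT workspace=${ws} op=restart-stop attempts=3" >&2
  return 0                               # caller reprovisions regardless
}
```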
Tests: 5 behavior tests covering every branch (no-op, first-try success,
eventual success, exhaustion, ctx-cancel) + 1 AST gate that pins the
helper-only invariant (any future inline `h.cpProv.Stop(...)` in
workspace_restart.go fires the gate, mutation-tested).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Auto-trigger from publish-workspace-server-image now resolves
target_tag to the just-published `staging-<short_head_sha>` tag
instead of `:latest`. Bypasses the dead retag path that was leaving
prod tenants on a 4-day-old image.
The chain pre-fix:
publish-image → pushes :staging-<sha> + :staging-latest (NOT :latest)
canary-verify → soft-skips (CANARY_TENANT_URLS unset, fleet not stood up)
promote-latest → manual workflow_dispatch only, last run 2026-04-28
redeploy-main → pulls :latest → 2026-04-28 digest → all 3 tenants STALE
Today's incident:
e7375348 (main) → publish-image green → redeploy fired → tenants
pulled :latest (76c604fb digest from prior canary-verified state) →
hongming /buildinfo returned 76c604fb instead of e7375348 → verify
step correctly flagged 3/3 STALE → workflow failed.
Today's PRs (#2473 smoke wedge, #2487 panic recovery, #2496 sweeper
followups) shipped to GHCR as :staging-<sha> but never reached prod.
Fix:
- workflow_dispatch input default '' (was 'latest'); empty input
triggers auto-compute path
- new "Compute target tag" step resolves:
1. operator-supplied input → verbatim (rollback / pin)
2. else → staging-<short_head_sha> (auto)
- verify step's operator-pin detection now allows
staging-<short_head_sha> as a non-pin (verification still runs)
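The compute step reduces to roughly this (env/input names illustrative;
the real step uses the workflow's own input plumbing):

```
if [[ -n "${INPUT_TARGET_TAG:-}" ]]; then
  TAG="$INPUT_TARGET_TAG"            # operator pin / rollback, used verbatim
else
  TAG="staging-${GITHUB_SHA:0:8}"    # auto: the just-published image tag
fi
echo "target_tag=$TAG" >>"$GITHUB_OUTPUT"
```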
When canary fleet is real, this workflow should chain on
canary-verify completion (workflow_run from canary-verify, gated on
promote-to-latest success) instead of publish-image — separate,
smaller PR. Today's fix unblocks prod deploys without that
prerequisite.
Companion: promote-latest.yml dispatched 2026-05-02 against
e7375348 to unstick existing prod tenants. This PR prevents
recurrence.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>