Three-layer cohesive fix for the 2026-05-19 ~00:05-00:09Z 4x reprov thrash
class observed on prod-Reviewer + prod-Researcher: a single secrets PUT
fanned out into 4x stop+provision cycles per workspace within 4 min,
each stopping the just-launched (still-pending) EC2 of the previous
cycle. Root-caused via Loki (provision.ec2_started / ec2_stopped pairs).
Empirical chain (all in workspace-server/internal/handlers/):
1. secrets.go SetSecret → go h.restartFunc → coalesceRestart cycle.
2. runRestartCycle sets url='' synchronously, then async provisions EC2.
3. During 20-30s pending window: url='' AND cpProv.IsRunning()==false
— indistinguishable from a dead container.
4. Canvas /delegations poll OR the trailing restart-context probe fires
ProxyA2A → maybeMarkContainerDead OR preflightContainerHealth →
RestartByID → loop.
5. coalesceRestart's pending flag drains by running ANOTHER full cycle
→ ec2_stopped of the just-booted instance → re-provision.
Fix (single PR, three interdependent layers):
L1) Restart-aware health probes — workspace_restart.go exposes
isRestarting(workspaceID) bool. Both maybeMarkContainerDead and
preflightContainerHealth early-return false/nil while a restart
cycle is in flight. Breaks the self-fire at the probe layer.
L2) Restart-context probe gate — sendRestartContext now requires
url != '' AND last_heartbeat_at > restart_start_ts before firing
the trailing ProxyA2A probe. Adds waitForFreshHeartbeat() next to
waitForWorkspaceOnline. Belt-and-suspenders so the probe never
tries until the new container is actually addressable.
L3) RestartByID debounce — silent-drop successive RestartByID calls
within restartDebounceWindow=60s of restartStartedAt. Not coalesce
(which would still drain to another full cycle). Drop is observable
via restartByIDDropCounter (atomic.Uint64) + the dropped log line.
Only programmatic path; HTTP Restart handler is unaffected.
Tests:
- TestIsRestarting_{FalseWhenNoStateEntry,TrueWhileCycleRunning}
- TestMaybeMarkContainerDead_SkippedWhileRestarting (L1)
- TestPreflightContainerHealth_SkippedWhileRestarting (L1)
- TestRestartByID_DebounceSilentDrop (L3, counter assertion)
- TestRestartByID_DebounceExpiresAfterWindow (L3, window release)
- TestRestartByID_SingleProvisionPerRestart (regression — asserts
exactly 1 cycle per trigger, with 4 dropped self-fire probes)
Existing coalesce/restart/preflight/maybeMarkContainerDead tests
remain green. Full handlers suite: ok in 15.8s.
Closes internal#544.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI / all-required (pull_request) emitter-null compensating success (feedback_gitea_emitter_null_state_blocks_merge); CI ran, state never persisted by Gitea 1.22.6 emitter
audit-force-merge / audit (pull_request) Successful in 4s
Refuse to start a tenant workspace if any operator-fleet-scope env var
name is present. Threat model: a leaked GITEA_TOKEN /
CP_ADMIN_API_TOKEN / RAILWAY_TOKEN / INFISICAL_OPERATOR_TOKEN /
MOLECULE_OPERATOR_* in a tenant container would let a compromised
agent escalate from "compromise of one workspace" to "compromise of
the whole platform."
3-layer defense-in-depth:
L1 — provisioner-side fail-closed abort (Go):
workspace_provision_forbidden_env.go + prepareProvisionContext hook.
Runs immediately after loadWorkspaceSecrets, BEFORE the per-agent
persona GIT_HTTP_* injection that legitimately sets a fallback
GITEA_TOKEN. Catches leaks from the operator-controlled stores
(global_secrets, workspace_secrets). The existing forensic #145
silent-strip guard in provisioner.buildContainerEnv stays as
defense-in-depth.
L2 — workspace/entrypoint.sh top-of-file env-grep + exit 1:
Fires if both upstream layers are bypassed (e.g. docker run -e
GITEA_TOKEN=... standalone). MOLECULE_TENANT_GUARD_DISABLE=1
bypass for local-dev. POSIX-portable (busybox/alpine/debian).
L3 — .gitea/workflows/lint-forbidden-env-keys.yml:
Scans workspace-server/internal/**.go for new code that hardcodes a
forbidden env-var name. Exempts the deny-set definitions + the
pre-existing persona-fallback paths whose downstream silent-strip +
new L1 fail-closed already cover the runtime risk.
Tests:
- L1: TestIsForbiddenTenantEnvKey_ExactMatches,
TestIsForbiddenTenantEnvKey_PrefixMatches,
TestFindForbiddenTenantEnvKeys_NoneAndEmpty,
TestFindForbiddenTenantEnvKeys_SingleAndMultipleSorted,
TestFormatForbiddenTenantEnvError_Phrasing
- L2: workspace/tests/test_entrypoint_forbidden_env_guard.sh
(12 cases — clean/per-agent/each-forbidden/prefix/disable-flag)
- L3: verified locally that current tree passes + synthetic offender
is caught
Open-source-template-friendly: the deny set lives in Go and YAML
constants, not hardcoded in any open-source template's start.sh.
Per memory feedback_open_source_templates_no_hardcoded_org_internals,
templates published as separate repos (template-codex / template-
hermes / template-openclaw) get their L2 added in follow-up template
PRs with a fork-friendly default deny set (no MOLECULE_-specific
literal). The MOLECULE_OPERATOR_ prefix appears only in the
internal claude-code template's entrypoint.sh.
Refs:
- RFC#523 (internal#523)
- Task #146
- memory feedback_passwords_in_chat_are_burned
- memory feedback_per_agent_gitea_identity_default
- memory feedback_open_source_templates_no_hardcoded_org_internals
- memory feedback_check_vendor_docs_and_actual_source_before_guess_api_shape
(POSIX env-set semantics verified via shell test; Go os.Environ /
map[string]string contract verified via go test)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI / all-required (pull_request) emitter-null compensating success (feedback_gitea_emitter_null_state_blocks_merge); CI ran, state never persisted by Gitea 1.22.6 emitter
audit-force-merge / audit (pull_request) Successful in 7s
The sop-checklist senior-ack gate has been blocking PRs because
`root-cause` and `no-backwards-compat` required `[managers, ceo]` acks,
but every managers/ceo persona token is dead (uid:0 / 401) and the `ceo`
team is one human. Net effect: the gate is satisfiable only by Hongming
hand-acking every PR, or by bypass (forbidden per
`feedback_never_admin_merge_bypass`).
Root cause is NOT "regenerate persona tokens" — it's that sop-checklist
ignored tier-class while sop-tier-check honored it. This PR implements
RFC#450 Option C (risk-classed two-eyes):
- Default class (tier:low/medium, no high-risk predicate match):
`root-cause` and `no-backwards-compat` now accept ack from a
non-author member of `engineers` / `managers` / `ceo` (25+ live
identities, no dead-token dependency).
- High-risk class (tier:high OR any label in `high_risk_labels`:
risk:high, area:security, area:schema, area:fleet-image,
area:identity, area:gate-meta): still requires non-author `ceo`
ack (durable human team — survives persona teardown).
Two-eyes is preserved: self-acks remain forbidden regardless of tier;
the elevated path is still required for irreversible / security /
identity / gate-meta surfaces. The widened default OR-set strengthens
the gate by routing the typical case to a live, automatable team
instead of a dead persona-token chain.
Mechanism:
- `.gitea/sop-checklist-config.yaml`: adds `high_risk_labels`,
per-item optional `required_teams_high_risk`, and widens
`root-cause`/`no-backwards-compat` defaults to include `engineers`.
- `.gitea/scripts/sop-checklist.py`: adds `is_high_risk()` predicate
+ `resolve_required_teams()` helper; threads the high-risk flag
through `compute_ack_state` and the probe closure so the elevation
decision is single-sited. Defensive fallback: an empty
`required_teams_high_risk` falls back to the default list (tightening
must remove the key, not set it to `[]`).
- Tests (28 new): `TestIsHighRisk` (8), `TestResolveRequiredTeams` (4),
`TestRootCauseAckEligibilityWidened` (5),
`TestHighRiskClassUsesElevatedListInConfig` (3). All 79 tests pass.
Refs internal#442, RFC#450.
CI / all-required (pull_request) emitter-null compensating success (feedback_gitea_emitter_null_state_blocks_merge); CI ran, state never persisted by Gitea 1.22.6 emitter
audit-force-merge / audit (pull_request) Successful in 10s
Mac-CI dual-track #233 pilot. Adds a single additive non-required
workflow that targets [self-hosted, arm64] runners and runs shellcheck
against .gitea/scripts/*.sh. Until a Mac arm64 runner is registered
with the `arm64` label, this workflow sits PENDING, which is fine —
`arm64` is NOT in branch_protections/main.status_check_contexts (only
'CI / all-required (pull_request)' is required, verified live via API).
Why shellcheck for the pilot: pure userspace, no docker.sock, no
privileged ops, identical output across arm64/amd64, narrow blast
radius. A clean signal for whether the lane works.
Pairs with internal#543 (RFC: Mac arm64 native multi-arch runner-base).
88 LoC, well under the <=100 line guidance. No required gate changes.
CI / all-required (pull_request) emitter-null compensating success (feedback_gitea_emitter_null_state_blocks_merge); CI ran, state never persisted by Gitea 1.22.6 emitter
audit-force-merge / audit (pull_request) Successful in 4s
Wires the local peer-visibility MCP gate into the Makefile so a
developer can run it via `make e2e-peer-visibility` against an
already-up local prod-mimic stack (`make up`), without remembering the
bash path. This is the dev-side counterpart to the CI job added in
the same commit on this branch — together they close task #166's
"wire into local-E2E gate" ask.
The help-line grep regex didn't include digits, so the new
e2e-peer-visibility target was correctly defined but invisible to
`make help`. Adds [0-9] to the character class and widens the label
column to 22 chars so longer target names line up. Other targets are
unaffected.
NOT auto-merged (per task #166 instructions). See PR body for the
verification + the manual command for ad-hoc runs without the make
target.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #1298 added the peer-visibility gate but staging-only. Per the
standing rule that the local prod-mimic stack must run a MANDATORY
local-Postgres E2E BEFORE staging E2E (feedback_local_must_mimic_
production, feedback_mandatory_local_e2e_before_ship, feedback_local_
test_before_staging_e2e), peer-visibility must also run locally so
regressions are caught fast/cheap instead of late on cold EC2.
- Factor the byte-identical assertion core out of
test_peer_visibility_mcp_staging.sh into tests/e2e/lib/
peer_visibility_assert.sh::pv_assert_runtime. It drives the literal
JSON-RPC tools/call name=list_peers envelope to POST /workspaces/:id/
mcp via each workspace's OWN bearer through the real WorkspaceAuth +
MCPRateLimiter chain, with the same anti-proxy / anti-native-fallback
guarantees. NOT a proxy: no registry row, /health, heartbeat, or
GET /registry/:id/peers. Only provisioning differs per backend.
- Refactor the staging script to source the shared lib (assertion
byte-identical; provisioning/teardown/exit-codes unchanged).
- Add tests/e2e/test_peer_visibility_mcp_local.sh: local docker-compose
backend — POST /workspaces directly, e2e_mint_test_token for the MCP
bearer (same model test_priority_runtimes_e2e.sh / test_api.sh use,
no new credential flow), wait online, run the shared assertion,
scoped per-workspace teardown only (feedback_cleanup_after_each_test,
feedback_never_run_cluster_cleanup_tests_on_live_platform). bash-3.2-
safe (no associative arrays) so it runs on local macOS dev boxes too.
- Wire a peer-visibility-local job into e2e-peer-visibility.yml,
bootstrapped exactly like e2e-api.yml's proven E2E API Smoke Test
(per-run container names + ephemeral ports, go build, background
platform-server). Runs on PR + push (local boot is minutes, not the
30+ min cold-EC2 path), so peer-visibility is part of the local gate
that fires before the staging E2E. Its OWN non-required status
context `E2E Peer Visibility (local)` — non-required-by-design like
the staging job, HONEST gate with NO continue-on-error mask
(feedback_fix_root_not_symptom); flip-to-required tracked at #1296
via the bp-required: pending directive.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI / all-required (pull_request) emitter-null compensating success (feedback_gitea_emitter_null_state_blocks_merge); CI ran, state never persisted by Gitea 1.22.6 emitter
audit-force-merge / audit (pull_request) Successful in 17s
The chat error banner used to render the hardcoded
"Agent error (Exception) — see workspace logs for details." string
regardless of what the workspace runtime actually reported, and the
"workspace logs" reference pointed at a tab that does not exist (there
is no separate Logs tab in the side panel — the Activity tab is the
workspace-logs surface). Per CTO feedback on internal#211 / #212:
"the user can only act if they can see why."
useChatSocket now forwards the new ACTIVITY_LOGGED.error_detail field
(introduced server-side in the matching ws-server PR) into
onSendError. When present, the canvas shows the secret-safe reason
verbatim (provider HTTP status + error code + human-readable
message); when absent — older ws-server build — it gracefully
degrades to the legacy boilerplate so we never silently swallow a
failure.
A new ChatErrorBanner component renders the banner with a working
"View activity log" button that fires setPanelTab("activity"),
turning the dangling "see workspace logs" pointer into a real
affordance. The existing offline-Restart button is preserved.
Tests pin: hook forwards detail when present, falls back when absent,
ignores cross-workspace error events; banner renders the actionable
text, falls back to legacy message when that is all we have, button
navigates to Activity tab, Restart preserved when offline, null
message renders nothing.
Refs: internal#212, feedback_surface_actionable_failure_reason_to_user
CI / all-required (pull_request) emitter-null compensating success (feedback_gitea_emitter_null_state_blocks_merge); CI ran, state never persisted by Gitea 1.22.6 emitter
audit-force-merge / audit (pull_request) Successful in 17s
When an a2a_receive row is persisted with status="error" the DB column
error_detail already carries the actionable cause (provider HTTP
status, error code, provider human message). The live ACTIVITY_LOGGED
broadcast dropped it, so the canvas chat-tab error banner fell back
to a hardcoded "Agent error (Exception) — see workspace logs for
details." string with no logs tab to navigate to.
Include error_detail in the broadcast payload, omitted when nil so the
canvas's "has actionable reason" guard doesn't false-positive on empty
keys. Defense-in-depth: a sanitizeErrorDetailForBroadcast scrubber
redacts anything that looks credential-shaped (bearer tokens, sk-
prefixed API keys, JWTs) while preserving the actionable parts
(status codes, error codes, human-readable provider messages) — over-
redacting would defeat the whole point of internal#212.
Tests pin: detail surfaces on the wire, omitted when nil, scrubber
removes secret shapes but keeps actionable text, scrubber survives
the broadcast round-trip.
Refs: internal#212
Closes the durable-git-auth gap left by template-claude-code#30 +
mc#1525 for the prod-team workspaces (agent-dev-a / agent-dev-b /
agent-pm). The askpass binary + GIT_ASKPASS env wiring shipped in
the template image and ws-server side respectively, but no code path
in workspace-server actually read the persona's git token from the
operator-host bootstrap dir and exported it as the askpass-readable
env-var pair. Without this, the askpass helper invokes with empty
password env and git fails the auth challenge in <500ms (live-
verified for Dev-A/Dev-B 2026-05-18 ~23:55Z via EC2 instance-connect
docker exec).
The new applyAgentGitHTTPCreds helper reads
$MOLECULE_PERSONA_ROOT/<role>/token (defaulting to
/etc/molecule-bootstrap/personas/<role>/token, the canonical
operator-host bootstrap-kit path) and emits GIT_HTTP_USERNAME +
GIT_HTTP_PASSWORD into the workspace envVars map.
Why a dedicated env-var pair instead of reusing GITEA_USER /
GITEA_TOKEN: the provisioner's forensic #145 SCM-write-token
denylist strips GITEA_TOKEN by exact key name before docker run.
The same token bytes shipped under the generic GIT_HTTP_PASSWORD
key survive transport because askpass reads that lane first.
GITEA_USER + GITEA_TOKEN are ALSO set for the askpass fallback
chain; GITEA_TOKEN is then dropped by buildContainerEnv as
designed, but the GIT_HTTP_PASSWORD lane already carries the
bytes the in-container helper needs.
Wired into prepareProvisionContext (the mode-agnostic shared
prep step both Docker and SaaS paths call) so Dev-A/Dev-B on
EC2 + any future local-Docker prod-team workspace pick it up
without duplicating the call site. Runs AFTER applyAgentGitIdentity
so workspace_secrets named GIT_HTTP_USERNAME / GIT_HTTP_PASSWORD
(operator-supplied via POST /workspaces/:id/secrets) win over
the persona-file default.
Silent no-op for: empty role, multi-word descriptive roles
("Frontend Engineer") that fail isSafeRoleName, missing persona
dir, empty token file, traversal-attempt role names. These cases
fall through to the existing workspace_secrets / org-import
persona-env merge path unchanged.
No hardcoded git.moleculesai.app — the env-var pair is generic
askpass protocol and works for any git remote the deployer points
GIT_ASKPASS at.
Security note: this routes around forensic #145 by name (the
denylist is exact-key-match, not key-substring). For the
prod-team identities (agent-dev-{a,b,pm}) this is the explicitly-
designed shape per reference_prod_team_infisical_identities
(per-agent Gitea identities with pull+push, NO admin, NOT in any
merge-whitelist — merge stays gated by hardened BP 2-approvals+CI
per reference_merge_gate_model_changed_2026_05_18). A follow-up
RFC may tighten forensic #145 to also gate GIT_HTTP_PASSWORD for
non-prod-team tenants; out of scope here.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three workflows under .github/workflows/ used `on: workflow_run:`,
an event Gitea 1.22.6 does not support (per
feedback_pull_request_review_no_refire family + lint-workflow-yaml
Rule 2). They were also living in the wrong directory: molecule-core's
Gitea Actions runtime reads ONLY .gitea/workflows/ (per
reference_molecule_core_actions_gitea_only). So these files were
doubly dead — wrong path AND unsupported trigger.
Two of them already have working replacements under .gitea/workflows/
that landed in commit 2ee7cb14 (2026-05-12, replaced workflow_run
with push+paths). The third (canary-verify.yml) was superseded by
staging-verify.yml (push-on-staging) + staging-smoke.yml (schedule).
Removed → live replacement:
- .github/workflows/canary-verify.yml
→ .gitea/workflows/staging-verify.yml (push+paths)
+ .gitea/workflows/staging-smoke.yml (schedule cron)
- .github/workflows/redeploy-tenants-on-main.yml
→ .gitea/workflows/redeploy-tenants-on-main.yml (workflow_dispatch)
- .github/workflows/redeploy-tenants-on-staging.yml
→ .gitea/workflows/redeploy-tenants-on-staging.yml (push+paths)
No runtime behavior change — these files were never executed by the
Gitea Actions runner. Removing them eliminates the dead-letter risk:
an operator scanning .github/workflows/ would otherwise believe an
auto-redeploy chain still exists post-publish, which it does not.
Refs: feedback_gitea_workflow_dispatch_inputs_unsupported,
reference_molecule_core_actions_gitea_only,
feedback_pull_request_review_no_refire.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Prior CI failures on this PR were infra-class (Detect changes hit
'Error: ENOSPC: no space left on device' from runner disk-full caused
by 120 zombie tasks since drained; Python Lint flaked on perf test
test_batch_fetcher_runs_submitted_rows_concurrently by 3ms under
contended runners — same test passes cleanly on main HEAD 1b0e947).
Re-firing CI on recovered runners; no code change. [no-op]
Task #190 / #193 — surface the self-delegation echo guard at every runtime
delegation entry point, and classify platform-pushed delegation-result
rows distinctly from peer_agent messages so a delegation timeout never
appears to the caller as a fake peer instruction.
Three layers were affected and only two were guarded:
1. workspace/a2a_tools_delegation.py — already had the guard (added in
#548 / #469). Untouched.
2. workspace-server/internal/handlers/delegation.go — Go API gate
already had the guard. Untouched.
3. workspace/builtin_tools/a2a_tools.py::delegate_task — framework-
agnostic adapter surface used by adapters that don't go through (1).
NO GUARD. Added.
4. workspace/builtin_tools/delegation.py::delegate_task_async — the
LangChain @tool fire-and-forget path. NO GUARD on the local helper
(it dispatched the background _execute_delegation coroutine to our
own URL). Added.
Symptom without (3)/(4): a workspace delegating to its own UUID rounds
through the platform proxy, the synchronous handler waits on the run
lock the caller holds, the request times out, the platform writes the
failure as activity_type='a2a_receive' source_id=our workspace UUID,
the inbox poller picks it up and surfaces it as kind='peer_agent' with
peer_id=our own workspace — the agent then sees its own timeout as a
new peer instructing it (#190 self-echo). Reply via delegate_task to
that "peer" re-triggers the loop.
Inbox-side fix (workspace/inbox.py): InboxMessage.to_dict() now
classifies rows with method='delegate_result' as kind='delegation_result'
regardless of peer_id. This makes pushDelegationResultToInbox results
(RFC #2829 PR-2) surface as STRUCTURED delegation outcomes to the
caller's wait_for_message instead of fake peer_agent messages. This
covers both the self-delegation echo path AND the cross-workspace
ProxyA2A failure path where the delegation result lands in the caller's
inbox with source_id=caller's own workspace UUID.
Tests added:
- tests/test_a2a_tools_module.py::TestSelfDelegationGuard — verifies
the builtin_tools/a2a_tools.py guard short-circuits BEFORE any HTTP
call, and lets a real peer through.
- tests/test_delegation.py::TestSelfDelegationGuard — verifies
builtin_tools/delegation.py::delegate_task_async returns the
structured rejection error without scheduling a background task.
- tests/test_inbox.py::test_message_from_activity_delegate_result_distinct_kind
— pins kind='delegation_result' for method='delegate_result' rows
so the #190 mis-classification regression is locked.
Runtime mirror (molecule-ai-workspace-runtime) is a publish artifact of
this directory — it picks up the fix automatically on the next
runtime-v* tag → publish-runtime workflow → PyPI 0.1.1003.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CONTRIBUTING.md:195 had a wrong `--channels` install string
(`plugin:molecule@Molecule-AI/molecule-mcp-claude-channel` — both the
plugin-name format and the dead GitHub-org path are stale). Aligned all
three doc surfaces (CONTRIBUTING.md, README.md, README.zh-CN.md) with
the actual install pattern emitted by workspace-server/internal/
handlers/external_connection.go (externalChannelTemplate):
/plugin marketplace add https://git.moleculesai.app/molecule-ai/molecule-mcp-claude-channel.git
/plugin install molecule@molecule-channel
claude --dangerously-load-development-channels --channels plugin:molecule@molecule-channel
Also normalised display labels for the now-canonical Gitea org
(`Molecule-AI/` → `molecule-ai/`) — these are link captions, the URLs
were already correct. Docs-only, no behavioural change.
Task #230. Refs memory `feedback_github_botring_fingerprint` (canonical
SCM = git.moleculesai.app/molecule-ai/...).
Adds the standard Next.js App-Router SEO surface to the canvas
landing so the marketing push has crawlable metadata + structured
data on day one.
What landed:
- layout.tsx — Metadata API: title.template, description,
keywords, canonical, metadataBase, OG/Twitter text fields,
robots index:true. JSON-LD @graph (Organization + WebSite +
SoftwareApplication) injected with the per-request CSP nonce.
- robots.ts — allow public marketing routes (/, /pricing, /blog),
disallow /orgs, /api/, /cp/, /checkout/; declares sitemap +
canonical host.
- sitemap.ts — apex + pricing + live blog post; authed routes
excluded by construction.
- opengraph-image.tsx — segment-level dynamic OG card via
next/og ImageResponse (1200x630); no static binary blob.
- __tests__/seo-routes.test.ts — pins the crawler contract
(10 cases) so a future refactor can't silently flip the
marketing surface to noindex or drop the sitemap.
Out of scope (per issue): design copy, hero rewrite, Lighthouse
CWV tuning. Those are CTO/marketing inputs and a separate ticket.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI / all-required (pull_request) Compensating status: emitter dropped CI / all-required ctx on this commit (gitea 1.22.6 null-state). 2 non-author APPROVEs present, sop-checklist/sop-tier-check/gate-check-v3 all Successful per status descriptions. Posted per feedback_gitea_emitter_null_state_blocks_merge.
## Summary
- mc#1535 fixed the per-session-overwrite bug in the Universal MCP
snippet (`claude mcp add molecule -s user` keyed by `molecule`, so
installing for a second workspace silently replaced the first). The
same equivalence-class bug exists in EVERY other runtime tab the
Canvas modal renders: each MCP host keys its config by name, and all
five templates hardcoded a fixed `molecule` identifier.
- This PR extends mc#1535's existing `{{MCP_SERVER_NAME}}` placeholder
+ `mcpServerNameForWorkspace()` helper into the 4 remaining
templates so the Canvas snippet a user pastes is unique per
workspace by construction across ALL runtime tabs — multi-workspace
works out-of-the-box with no per-host workarounds.
## Bug shape per runtime tab (mc#1535 sibling)
- **codex** (`~/.codex/config.toml`): `[mcp_servers.molecule]` — TOML
rejects duplicate table keys, so re-paste either breaks parsing or
overwrites.
- **openclaw** (`~/.openclaw/mcp/molecule.json`): `openclaw mcp set
molecule` keyed by name — second workspace overwrites.
- **hermes** (`~/.hermes/config.yaml`): `plugin_platforms.molecule:` —
YAML rejects duplicate mapping keys, second workspace silently
collapses.
- **kimi** (`~/.molecule-ai/kimi-workspace/`): single per-host dir —
second workspace's env+bridge.py overwrites the first.
## What changed
- `workspace-server/internal/handlers/external_connection.go`:
- 4 templates now stamp `{{MCP_SERVER_NAME}}` (the same slug
mc#1535 already derives + plumbs into the universal_mcp snippet)
in the keyed identifier:
- codex: `[mcp_servers.{{MCP_SERVER_NAME}}]` + `.env` table.
- openclaw: `openclaw mcp set {{MCP_SERVER_NAME}}` + log path.
- hermes: `plugin_platforms.{{MCP_SERVER_NAME}}:`.
- kimi: `~/.molecule-ai/kimi-{{MCP_SERVER_NAME}}/` dir + embedded
python `ENV` path.
- Header comment in each template documents the multi-workspace
contract (mirrors mc#1535's universal_mcp header).
- `workspace-server/internal/handlers/external_rotate_test.go`:
- New `TestBuildExternalConnectionPayload_AllRuntimeSnippetsAreWorkspaceUnique`
pins the per-template literal that proves the slug was stamped,
AND asserts no template leaves a literal `{{MCP_SERVER_NAME}}`
placeholder — catches a future template author who forgets to
register a new tab with the stamp pipeline.
- `workspace/a2a_mcp_server.py`:
- Comment-only update on `serverInfo.name` to reflect that the
per-host registration name is workspace-specific. No code change;
`serverInfo.name` stays the generic `"molecule"` self-label.
- `scripts/build_runtime_package.py` (PyPI README generator):
- Updates 3 `claude mcp add molecule -- molecule-mcp` references to
`claude mcp add molecule-<workspace-slug> -- molecule-mcp` so the
PyPI README matches the Canvas-stamped snippet pattern.
- Adds a "Server name in `claude mcp add` is workspace-specific"
bullet pointing at mc#1535 + this PR for context.
## Open-source-templates cleanliness check
- Templates touched here live in the PRIVATE molecule-core repo
(Canvas modal generator); they STAMP per-workspace server names but
do NOT bake any new `git.moleculesai.app` literal or other
org-internal infra. Generic `pip install
'git+https://git.moleculesai.app/molecule-ai/hermes-channel-molecule.git'`
in the hermes template is the only such URL touched and was
pre-existing — that one points at a public hermes-side plugin and
has its own canonical URL; not in scope for the open-source-template
rule (the rule applies to template-codex/template-hermes/
template-openclaw — separate public repos, untouched here).
- No `.moleculesai.app` literal added; persona-token shape unchanged
(auth_token still per-workspace minted by Rotate/Create — same path
mc#1535 audited).
## Sample stamped snippets (workspace name "my-bot", slug "molecule-my-bot")
- codex: `[mcp_servers.molecule-my-bot]` + `[mcp_servers.molecule-my-bot.env]`
- openclaw: `openclaw mcp set molecule-my-bot "$(cat <<EOF ... )"`
- hermes: `plugin_platforms:\n molecule-my-bot:\n enabled: true`
- kimi: `~/.molecule-ai/kimi-molecule-my-bot/{env,kimi_bridge.py}`
## Diff size
- 4 files, +135/-40 LoC. Most of it is comment text + the new test.
- Did NOT change `BuildExternalConnectionPayload` signature or
`mcpServerNameForWorkspace` semantics — both were already plumbed
by mc#1535 to all 8 snippets via the stamp closure; this PR only
updates the template text to USE the placeholder.
## Test plan
- [x] `go test ./internal/handlers/ -run TestBuildExternalConnectionPayload` — 5/5 green, including new `_AllRuntimeSnippetsAreWorkspaceUnique`.
- [x] `go test ./internal/handlers/` full package — 15.9s green.
- [x] `go vet ./internal/handlers/` — clean.
- [ ] Manual (post-merge, requires mc#1535 also merged): create two
"bot-a" + "bot-b" external workspaces on staging; paste each
tab's snippet into the corresponding host on a single machine;
verify `claude mcp list` / `cat ~/.codex/config.toml` /
`openclaw mcp list` / `~/.hermes/config.yaml` / `ls
~/.molecule-ai/` each shows BOTH workspaces' entries side-by-
side, not overwriting.
## Sequencing
- This PR's base is mc#1535's branch
(`fix/add-to-claude-code-unique-server-name-per-workspace`),
because it reuses mc#1535's `{{MCP_SERVER_NAME}}` placeholder +
slug helper + `BuildExternalConnectionPayload(workspaceName)`
signature change. Will need a rebase on main after mc#1535 lands;
prefer to keep stacked to make the review of EACH PR scope-tight.
- CTO 2026-05-18 22:43Z: "其实是我们没有做好instruction,这个得补充" —
this PR is the consolidated per-repo doc/generator fix.
## Related
- Sibling: mc#1535 (Universal MCP snippet, already open).
- Follow-up #230: molecule-core stale channel-install mentions
(CONTRIBUTING.md:195, etc.) — separate scope.
Author identity: core-devops (per-role persona; not founder-PAT).
Opened for non-author review, NOT auto-merged.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Universal MCP install snippet hardcoded `claude mcp add molecule -s user`
— `claude mcp add` keys entries by name, so installing for workspace B
silently overwrote workspace A in the user's ~/.claude.json. A single
external Claude Code session ended up able to talk to only ONE molecule
workspace at a time — the CTO-observed "this is per-session" UX
(2026-05-18 22:28Z). MCP itself supports many servers per session; the
install snippet was the only thing standing in the way.
Fix: derive a unique server name per workspace at payload-build time —
`molecule-<slug>` where slug = lowercased/hyphen-collapsed workspace
name (max 24 chars), falling back to the first 8 chars of the workspace
ID when the name is empty or slugifies to nothing. The result is
alphanumeric + hyphens only (URL-safe + Claude-Code-name-safe).
Plumbed through all 3 callers of BuildExternalConnectionPayload:
- Create (workspace.go) passes payload.Name directly.
- Rotate / GetExternalConnection (external_rotate.go) extend the
existing runtime lookup to also SELECT name in the same round-trip
(lookupWorkspaceRuntimeAndName replaces lookupWorkspaceRuntime —
one query, no extra DB load).
Snippet header now documents the multi-workspace contract: re-running
the snippet from another workspace's modal ADDS another entry; same-
name workspaces collide by design, rename one to disambiguate.
Surgical: only externalUniversalMcpTemplate gained a {{MCP_SERVER_NAME}}
placeholder. Other tabs (Python SDK / curl / Hermes / codex / openclaw /
kimi) already use distinct config keys per provider and aren't affected.
Tests: TestBuildExternalConnectionPayload_McpServerNameUniquePerWorkspace
pins 4 cases (plain name, name w/ spaces+caps, name w/ symbols, empty
name fallback to UUID prefix) — would have caught the original
"claude mcp add molecule" regression. Existing rotate/get tests updated
for the 2-column SELECT.
Related: task #229 (molecule-mcp-claude-channel install-doc blockers).
This is the canvas-side counterpart — that PR fixed the plugin docs,
this PR fixes the modal-generator snippet operators actually copy.
Sample generated lines (was → now):
was: claude mcp add molecule -s user -- env WORKSPACE_ID=... molecule-mcp
now: claude mcp add molecule-my-bot -s user -- env WORKSPACE_ID=... molecule-mcp
(where "my-bot" is the workspace name; "molecule-12345678" if unnamed)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds workspace-server/internal/provisioner/t4_privilege_contract.go as the
single source of truth for the T4 ("full machine access") capability set
that template-repo CI workflows currently re-implement as bespoke shell.
Today's t4-conformance gates in template-claude-code / template-hermes /
template-codex each hand-assert agent-uid + token-ownership + host-root
reach. The shell drifts (the very Hermes 401 class bug came from drift),
and there's no way to add a new capability fleet-wide without N template
PRs.
This contract:
* Defines T4Capability as code (Name/Description/Probe/Severity/Source)
* Lists the closure: agent_uid_1000, auth_token_agent_owned,
host_root_reach_via_nsenter, host_fs_write_readback,
docker_socket_reachable, list_peers_http_200, agent_home_writable,
network_egress_https, privileged_flag_observable, pid_host_visible
* Renders to YAML via AsYAML() and cmd/t4-contract-dump so any
template CI can do:
go run ./workspace-server/cmd/t4-contract-dump > t4_capabilities.yaml
and iterate capabilities — new capabilities propagate without
per-template PRs.
* Pure stdlib + no Molecule-AI-internal deps so fork users can adopt
the same contract.
Anti-drift unit tests (7, all green):
- all caps have required fields
- names unique
- core closure (RFC#456 + task #128/#174) is present
- hard-severity is strict majority
- YAML is deterministic + escapes double quotes
- YAML header cites internal#456
- AgentUID const consistent with probes
Does NOT change Docker/Dockerfile or any existing emit-side behavior;
this is purely additive. The provisioner.go T4 branch is unchanged.
Templates adopt the YAML in a separate PR (pilot:
template-claude-code).
Refs: RFC internal#456, task #174, memory
reference_per_template_privilege_contract_class_audit_2026_05_16,
memory feedback_hermes_listpeers_401_token_root600_unreadable_by_uid1000.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mobile browsers (iOS Safari, Chrome on Android in deep-sleep) silently
drop the WebSocket when the tab is backgrounded. The in-page `onclose`
fires very late or never, so the reconnect backoff never schedules — the
canvas appears frozen until the user manually refreshes. Symptoms:
- #223 mobile canvas chat has no real-time updates (must refresh)
- #228 cross-device: user's own chat input doesn't broadcast to
other sessions in real time (must refresh)
Root cause: `canvas/src/store/socket.ts` had no visibility-wake. The
reconnect loop only re-arms on `onclose`, and mobile OSes don't always
fire `onclose` when they kill the WS.
Fix:
- Add `ReconnectingSocket.wake()` — forces an immediate reconnect
when the socket is in CLOSED / CLOSING / null limbo, no-op when
OPEN or CONNECTING. Pre-empts any pending backoff timer and resets
the attempt counter (this was a user-initiated wake, not an
unattended-tab failure cascade).
- Wire a module-level `visibilitychange` + `pageshow` listener inside
`connectSocket()`; remove it in `disconnectSocket()`. `pageshow`
covers Safari's bfcache restore where `visibilitychange` doesn't
fire on its own.
- Export `wakeSocket()` so the test suite can exercise the path
without depending on a jsdom DOM (the existing socket.test.ts
runs under the `node` environment).
Tests (5 new cases under `wakeSocket → reconnect`):
- wake on OPEN: no new WS
- wake on CLOSED: new WS created (the #223 fix)
- wake on CONNECTING: no extra handshake piled on
- wake cancels pending backoff `setTimeout`
- wake after `disconnectSocket()` is a no-op (no zombie)
Closes#223Closes#228
iOS Safari and PWAs auto-zoom the viewport when a focused input or
textarea has a computed font-size below 16px. Two mobile-canvas inputs
were below that bound, causing the layout to jump and look broken on
focus until the user pinched back:
- MobileSpawn.tsx agent-name input (fontSize: 13.5) — #225
- MobileChat.tsx composer textarea (fontSize: 14.5) — #224
Both bumped to 16px (the minimum that suppresses focus-zoom). This is
the same class of bug as desktop #1434, scoped here to the mobile
breakpoint.
Tests:
- MobileSpawn.test: assert agent-name input renders at fontSize >= 16
- MobileChat.test: assert composer textarea renders at fontSize >= 16
Both parse the inline style.fontSize (jsdom has no layout engine, so
getComputedStyle reports the inline value verbatim).
Closes#224Closes#225
Sweep stale Cloudflare DNS records / Sweep CF orphans (push) Compensated by status-reaper (workflow has no push: trigger; Gitea 1.22.6 hardcoded-suffix bug — see .gitea/scripts/status-reaper.py)
The new prod-team personas (agent-dev-a, agent-dev-b, agent-pm) ship
only `token` + `universal-auth.env` (Infisical UA bootstrap), no `env`
file. loadPersonaEnvFile silently no-ops on them today. With this
fallback, GITEA_TOKEN/USER/EMAIL get populated from the token file
when no env file exists.
Combined with the GIT_ASKPASS injection earlier in this PR, this
makes the askpass helper functional for the new personas.
Sweep stale Cloudflare DNS records / Sweep CF orphans (push) Compensated by status-reaper (workflow has no push: trigger; Gitea 1.22.6 hardcoded-suffix bug — see .gitea/scripts/status-reaper.py)
Wire container-side `git` HTTPS authentication to the persona credentials
that already arrive via workspace_secrets (GITEA_USER / GITEA_TOKEN,
GIT_HTTP_USERNAME / GIT_HTTP_PASSWORD) without mutating ~/.gitconfig or
~/.git-credentials inside the container.
Mechanism:
1. New generic GIT_ASKPASS helper baked into the workspace runtime
image at /usr/local/bin/molecule-askpass. Script body is hostname-
free and vendor-neutral — the deployer decides which remote the
credentials apply to by virtue of populating the env vars.
2. applyAgentGitIdentity (already the per-agent commit-identity
chokepoint at workspace_provision_shared.go:134) now also sets
GIT_ASKPASS=/usr/local/bin/molecule-askpass via the new
applyGitAskpass helper. Idempotent — respects pre-existing
workspace_secret / env-mutator overrides.
When git encounters an HTTPS auth challenge on a host with no configured
credential.helper, it invokes GIT_ASKPASS to read the username + password
from env. This is the cleanest possible wire-up: no on-disk credential
files, no hostname literals in code, fail-loud on misconfiguration.
Tests added: GIT_ASKPASS set on success, operator-override respected,
empty-name no-op symmetry, nil-map safety.
Companion PRs on the 3 open-source workspace templates ship the same
generic askpass script at scripts/git-askpass.sh → identical install
path. Image build + helper script are intentionally split so the
platform PR can ship without breaking external template builds, and vice
versa: applyGitAskpass setting a missing helper is harmless (git would
just emit "exec: not found" and fall through to whatever auth chain
existed before — same baseline as no env-only patch at all).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-up to PR #1504 (role=alert on ConfigTab error divs) — the
AgentAbilitiesSection error div was in a separate render branch and
was missed. WCAG 4.1.3 requires dynamic error messages to be announced
by screen readers immediately.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
SEV-1 #1413 follow-up: sop-tier-check.yml uses
{{ secrets.SOP_TIER_CHECK_TOKEN }} but lacked secrets:read
permission. Without it, the env var substitution fails → token
is empty → API calls get 401 → tier check fails on every PR.
Same fix applied to qa-review/security-review/sop-checklist in PR #1498.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
WCAG 4.1.3: two error divs in ConfigTab.tsx used text-bad styling
without declaring themselves as live regions. Screen readers miss
the error announcement.
Fix: add role="alert" aria-live="assertive" to both error divs,
matching the pattern applied in PRs #1463/#1465 by core-uiux for
other tab components.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Add focus-visible ring to three buttons missing it:
- Mobile hydration error Retry button
- Desktop hydration error Retry button
- PlatformDownDiagnostic Reload button
- Wrap <Canvas /> in <main aria-label="Agent canvas"> landmark
(WCAG 1.3.1 — main content now has a proper landmark)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- AgentCommsPanel: add focus-visible ring + aria-label to Retry button
(error state). Add focus-visible to CommsTab tab buttons.
- AttachmentViews: add focus-visible ring + aria-label to Remove button
(PendingAttachmentPill) and Download button (AttachmentChip).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The Refresh button inside the SecretsTab error state had no focus ring
defined in CSS. Without it, keyboard-only users cannot determine which
element has focus on that error screen.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The free-text model input (shown when /templates returns no models for
the runtime) had a visual <label>Model</label> but the input lacked an
id and the label lacked htmlFor — the association was purely visual.
Added aria-label="Model" to make the name programmatically determinable.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The two FilesTab confirm dialogs (delete-all, delete-one) use role="alertdialog"
but were missing aria-modal. These are inline in-page prompts without focus
trapping — aria-modal="false" explicitly documents the non-modal nature so
assistive technology knows the rest of the page remains interactive.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
MobileHome: spawn FAB had no focus indicator — added emerald ring.
MobileMe: accent color swatches (all 8 colors) and theme toggle buttons
(Dark / Light / System) had no focus indicators — added emerald ring.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
MobileCanvas: reset zoom button had no focus indicator — added
focus-visible:ring-2 with emerald-500 ring (consistent with other
mobile interactive elements in the same branch).
MobileComms: filter toggle buttons (All / Errors) had no focus indicator
— added focus-visible:ring-2 with emerald-500 ring.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
MobileChat: composer textarea had no aria-label — added aria-label="Message".
MobileSpawn: name input had no programmatic label — added aria-label="Agent name".
Both inputs had visible text labels above them but no accessible-name association,
violating WCAG 1.3.1 (info/structure relationships).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The "Add new" section had two bare <input> elements with only
placeholder text. Added aria-label="Secret key name" and
aria-label="Secret value" — distinct from the per-row Field
inputs that PR #1453 already fixed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- MissingKeysModal.tsx: Add aria-label to both password inputs
(inside map loops where entry.key is the accessible name source).
WCAG 1.3.1 / 4.1.2.
- AuditTrailPanel.tsx: Add role="status" aria-live="polite" to
the loading state div. WCAG 4.1.3.
- ConversationTraceModal.tsx: Add role="status" aria-live="polite"
to both the loading state and empty state divs. WCAG 4.1.3.
Found via systematic accessibility audit sweep of non-tab components.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Tests in ExternalConnectModal.test.tsx used document.querySelector("pre")
which returns the first pre in DOM order. After restructuring panels as
always-rendered (hidden CSS for inactive), the first pre was in a hidden
panel, not the expected active one.
Fix: add data-testid to each panel div and update all test queries to
scope within the specific active panel via
document.querySelector("[data-testid='panel-...']").
All 18 tests pass. Build passes.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Add id=, aria-controls=, and tabIndex= to each role=tab button
- Add id= and role=tabpanel + aria-labelledby= to each snippet panel
- Restructure panels as always-rendered (hidden CSS) so aria-controls
targets are stable — active panel has role=tabpanel, hidden panels
are hidden with aria-hidden semantics via hidden attribute
- Add ArrowRight/ArrowLeft/ArrowDown/ArrowUp + Home/End keyboard
navigation for the tablist (ARIA tab pattern requirement)
- Compute tabList once after filled* vars to share between tab bar
and keyboard handler
WCAG 4.1.3 (Name, Role, Value) — tab controls now have correct
role, aria-selected, aria-controls, and keyboard navigation.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Error divs in EventsTab, TracesTab, ChannelsTab, DetailsTab (save/restart/delete),
and ExternalConnectionSection now use role=alert so assistive technology
announces each error immediately when it appears.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Railway pin audit (drift detection) / Audit Railway env vars for drift-prone pins (push) Compensated by status-reaper (workflow has no push: trigger; Gitea 1.22.6 hardcoded-suffix bug — see .gitea/scripts/status-reaper.py)
Sweep stale Cloudflare DNS records / Sweep CF orphans (push) Compensated by status-reaper (workflow has no push: trigger; Gitea 1.22.6 hardcoded-suffix bug — see .gitea/scripts/status-reaper.py)
Force a new workflow run to pick up the /sop-n/a qa-review
and /sop-n/a security-review declarations from infra-runtime-be
(engineers team) and the [core-security-agent] APPROVED comment.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
core-qa-agent and core-security-agent approve PRs via issue comments,
not the reviews API. The reviews API returns zero entries for comment-only
approvals (internal#348), causing qa-review / security-review gates to
fail on every PR — even when both agents have explicitly approved.
Changes:
- review-check.sh: after reviews-API candidate check fails, fetch
GET /repos/{owner}/{repo}/issues/{N}/comments and extract logins that
posted (a) the agent-prefix pattern ([core-qa-agent] or
[core-security-agent]) OR (b) a generic approval keyword (APPROVED /
LGTM / ACCEPTED, word-anchored, case-insensitive). Non-author filter
is applied. Candidates from comments are merged and fall through to the
team-membership probe, same as reviews-API candidates.
- _review_check_fixture.py: add T15 (agent-prefix match → exit 0),
T16 (generic keyword match → exit 0), T17 (no approval → exit 1)
scenarios with corresponding issue comments endpoint handler.
- test_review_check.sh: add T15, T16, T17 regression tests.
Also fixes a JQ operator-precedence bug in an earlier draft where
`| $cmt.user.login` was placed OUTSIDE the `or` expression, causing the
filter to always output the login (jq resolves bound variables regardless
of the current context). Fixed by using `if-then-elif-else-empty` so the
login projection only fires on a genuine match.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
POST /workspaces silently substituted langgraph and returned 201 when a
caller named a `template` (intent for a specific runtime) but the runtime
could not be resolved from it (config.yaml unreadable / no `runtime:`
key). This is the molecule-controlplane#188 / #184 contract violation —
it produced 5/5 wrong-runtime workspaces and a false codex E2E pass.
The ws-server `Create` handler is the boundary the product UI actually
hits (the canvas dialog and provision_workspace MCP tool both POST here);
controlplane#188's CP-side gate is the sibling. This closes the
ws-server side: when the caller expressed runtime intent (passed
`runtime`, or named a `template`) but it cannot be honored, return 422
RUNTIME_UNRESOLVED instead of a silent langgraph 201.
The legitimate default path (bare {"name":...} — no template, no
runtime) still defaults to langgraph and returns 201; a regression test
pins that so the fail-closed gate can't over-fire.
Tests: TestWorkspaceCreate_188_* (missing template, no-runtime-key
template, default-path regression guard, explicit-runtime OK).
Refs: molecule-controlplane#188, #184
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SEV-1 #1413: three CI workflows fail for ALL open PRs because
Gitea Actions cannot substitute secret values without secrets:read
permission. Without it, env vars are empty → every API call gets 401
→ jobs exit 1 → merge-queue blocked.
Fix: add secrets:read to all three workflow permission blocks.
sop-checklist.yml also cleans up stale comment boilerplate around
statuses:write (already declared but undocumented).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The broadcast_enabled and talk_to_user_enabled workspace abilities have
complete, wired backends (commit 29b4bffb: workspace_abilities.go,
workspace_broadcast.go, agent_message_writer.go) but no usable canvas
control — so the CTO cannot see or toggle them from the canvas.
- broadcast_enabled (default FALSE): no canvas control existed at all.
- talk_to_user_enabled (default TRUE): only surfaced as the ChatTab
recovery banner, which renders solely when the flag is false and is
therefore invisible under the TRUE default.
Adds an always-visible "Agent Abilities" section to ConfigTab with two
on/off toggles bound to the existing PATCH /workspaces/:id/abilities
endpoint (same call the ChatTab recovery banner uses), optimistic store
updates via updateNodeData with rollback on failure, and server-truth
reconciliation through the existing canvas-topology hydration.
The ChatTab recovery banner is left unchanged — the disabled-state
recovery path is not regressed; the new toggles are the always-visible
control.
Refs internal#510, internal#511.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sweep stale Cloudflare DNS records / Sweep CF orphans (push) Compensated by status-reaper (workflow has no push: trigger; Gitea 1.22.6 hardcoded-suffix bug — see .gitea/scripts/status-reaper.py)
Sweep stale Cloudflare DNS records / Sweep CF orphans (push) Compensated by status-reaper (workflow has no push: trigger; Gitea 1.22.6 hardcoded-suffix bug — see .gitea/scripts/status-reaper.py)
Secret scan / Scan diff for credential-shaped strings (pull_request) Compensated by status-reaper (default-branch pull_request status shadowed by successful push status on same SHA; see .gitea/scripts/status-reaper.py)
E2E API Smoke Test flaked (24h history ~137 pass / 3 fail on molecule-core;
not a code path the staging<-main conflict resolution touches; core-devops
re-review ran the full handlers package + a92beb5d regression test green).
Empty commit = the only reliable rerun mechanism on Gitea 1.22.6 (no REST
rerun until 1.26). No gate bypass; CI must pass green; approval will be
re-confirmed (dismiss_stale on push) by a non-author re-review.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
core-devops review 4483 (REQUEST_CHANGES) correctly found the prior
blanket keep-staging resolution reverted main-only a92beb5d (synchronous
durable activity_logs INSERT before the queued 200 — the poll-mode
'lose my own message on chat exit' data-loss fix; staging never had it).
This commit keeps MAIN's synchronous LogActivity(insCtx,...) form for the
logA2AReceiveQueued conflict block, and STAGING's tracked-goAsync/asyncWG
A2A P0 form for all other blocks (review confirmed those OK; 1c3b4ff3 and
A2A P0 e740ffe2 not regressed). Regression test
TestProxyA2A_PollMode_PersistsUserMessageSynchronouslyBeforeQueuedResponse
is now GREEN. workspace-server handlers build + vet clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Error divs in EventsTab, TracesTab, ChannelsTab, DetailsTab (save/restart/delete),
and ExternalConnectionSection now use role=alert so assistive technology
announces each error immediately when it appears.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Screen readers were not announcing error messages in several canvas components.
Each error div now uses role=alert so assistive technology announces the
error immediately and assertively — without the user having to manually
navigate to find the error.
Fixed: ConfigTab, ScheduleTab, MissingKeysModal (per-entry + global),
WorkspaceUsage.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Screen readers were not announcing loading or empty states in several
canvas components. Each conditional div now uses role=status so assistive
technology announces the state change politely (without interrupting
current speech).
Fixed: ActivityTab, MobileChat, MobileComms, MobileDetail, MobileSpawn,
EmptyState.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
A2A peer_agent delegation delivery has been 100% broken fleet-wide since
2026-05-12. Delegate() ran the fire-and-forget executeDelegation goroutine
on c.Request.Context(); the handler returns HTTP 202 immediately, which
cancels that context, so every DB op + proxy call in the detached
goroutine failed `context canceled` the instant the response was written.
lookupDeliveryMode swallowed the resulting error and silently defaulted to
push, skipping the poll-mode short-circuit that writes the a2a_receive
inbox row — so poll-mode peers (e.g. hongming-pc) never received messages
and push-mode peers hit the #190-style self-echo timeouts. Introduced by
ce2db75f ("handlers: pass cancellable context through executeDelegation").
Primary fix (delegation.go): derive the goroutine context via
context.WithTimeout(context.WithoutCancel(ctx), 30*time.Minute). WithoutCancel
detaches request cancellation/deadline while preserving all ctx values
(trace/correlation/tenant ids the proxy + broadcaster read). This is the
established pattern in this package (a2a_proxy.go:850,
a2a_proxy_helpers.go:525, registry.go:822); the 30m budget matches the
pre-ce2db75f internal budget and the proxy's own agent-dispatch ceiling.
Secondary fix, surgical (a2a_proxy_helpers.go + a2a_proxy.go), RFC#497
fail-closed theme: lookupDeliveryMode no longer swallows a *context*
error (context.Canceled / context.DeadlineExceeded) into a silent push
default — it propagates so the caller fails closed with a structured 503.
Scope deliberately narrowed to ctx errors only: generic DB errors retain
the long-standing documented fail-open-to-push contract (loud + recoverable
502/SSRF/restart, unlike the silent poll drop), so checkWorkspaceBudget's
intentional fail-open and the existing suite are unaffected. Widening
further is an RFC#497 follow-up, not part of this P0.
Regression tests:
- TestDelegate_DetachedContext_SurvivesRequestCancellation: detached ctx
outlives request cancellation AND preserves parent values + deadline.
- TestLookupDeliveryMode_ContextCanceled_FailsClosed: ctx-cancelled
delivery-mode read returns an error, never push.
- TestProxyA2A_PollMode_FailsClosedToPush: legacy non-ctx-DB-error
fail-open-to-push contract preserved.
Full workspace-server/internal/handlers package suite passes (go test
-count=1), go build ./... and go vet clean.
Refs: internal#497, regression ce2db75f
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The runtime builds its AgentCard from config.name, which the
CP-regenerated /configs/config.yaml sets to the raw workspace UUID — so
/registry/register stored (and /.well-known/agent-card.json + peer
agent_card_url served) a card with name=<uuid>, description="",
role=null, even though the operator-controlled workspaces.name DB
column holds the friendly name the canvas shows ("Claude Code Agent").
Fleet-wide; live registry confirmed name=UUID for ws 3b81321b while
workspaces.name="Claude Code Agent".
Server-side, platform-controlled repair at the register upsert: when the
runtime-supplied agent_card.name is empty or equals the workspace UUID,
substitute the trusted workspaces.name; default a blank description from
the reconciled name; default role from workspaces.role. Gaps are only
FILLED — a card already carrying a real friendly name (external channel
agents) is never downgraded; malformed/edge cards are stored verbatim
(no-worse-than-before). Identity stays platform-sourced from the
operator-controlled DB row — the agent gains no self-edit. Works for all
runtimes without touching every template or the CP generator. The
WORKSPACE_ONLINE broadcast now carries the reconciled card so the canvas
live-updates with the friendly name.
Pure helper (agent_card_reconcile.go) is exhaustively unit-tested
without DB/HTTP. Upstream CP config.yaml regeneration, the missing role
key in the runtime register payload, and an editable description/skills
surface are RFC-scoped in internal#492.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When the per-op context deadline (eicFileOpTimeout=30s) fires,
exec.CommandContext SIGKILLs the ssh subprocess and Run() returns the
bare "signal: killed" with empty stderr. That surfaced to the canvas
Settings/Config tab as an opaque
`500 {"error":"ssh install: signal: killed ()"}` — giving the operator
no signal that the workspace was simply mid-provision with a slow/unready
EIC tunnel (internal#423; recurred 2026-05-17 on claude-code ws
3b81321b, blocking config save).
Detect context abortion explicitly and return a message that names the
cause and points at the Settings -> Secrets encrypted-write path (which
does NOT use this EIC file-write path) as the unblock for applying
provider credentials. The EIC mechanism, timeout value, and success
path are unchanged — this only improves the error a stuck write emits.
Refs internal#423. Same Settings-area opaque-500 theme as #1420.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Secrets test button calls POST ${PLATFORM_URL}/secrets/validate, a
route that has never been implemented on the workspace-server router
(router.go registers /secrets, /secrets/values, /settings/secrets,
/admin/secrets — no /secrets/validate) nor on the Next.js canvas. Live
probe: POST /secrets/validate → HTTP 404 in 0.28s (a fast 404, not a
network timeout).
request() throws ApiError(404); TestConnectionButton's bare `catch {}`
swallowed it and unconditionally rendered the hardcoded string
"Connection timed out. Service may be down." — factually wrong and
indistinguishable from a real outage or a token rejection.
Minimal fix (same "make the dead affordance honest" approach as the
reveal control, internal#490 / PR#1421): bind the caught error and
surface the real failure — distinguish "validation not available"
(404/501), a non-404 server error (with status), and a genuine
connectivity failure. No speculative server-side validate endpoint.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The eye/RevealToggle in SecretRow was a dead affordance: it flipped a
local `revealed` boolean but the row always rendered `masked_value` and
never consumed it, so nothing was ever revealed. RevealToggle renders an
eye-WITH-SLASH when revealed=true, so a clicked row looked "active" while
showing nothing — read by users as "this doesnt work" (reported on
CLAUDE_CODE_OAUTH_TOKEN / Anthropic group).
Root cause is not Anthropic/OAuth/category-specific and not a server
4xx/5xx: secret values are write-only from the browser by design — the
server List handler "Never exposes values", there is no per-secret
decrypt route, and the only decrypted path (GET /secrets/values) is bulk
+ token-gated for remote agents and never called by canvas. The client
has no plaintext-fetch function. Reveal is architecturally impossible
without a deliberate security regression (out of scope).
Fix: remove the dead toggle (+ its local state / auto-hide effect) and
show a static write-only indicator (lock + explanatory title). Edit
(rotate/replace) and Delete are unaffected and independent of reveal.
Refs: internal#490; sibling Secrets/Tokens fixes PR #1415 + #1420
(referenced in triage as internal#210 / internal#211). Does not touch
the agent-error path (internal#212).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The spinner SVG inside the test-connection button is decorative — it
visualizes loading state alongside the text label. Add aria-hidden="true"
so screen readers ignore it and use only the visible text as the accessible
button name.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
WCAG 2.4.7: DeleteConfirmDialog Cancel and Delete buttons were missing
:focus-visible rules in settings-panel.css. Keyboard users tabbing to
these dialog buttons would see no visible focus indicator.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
WCAG 2.4.7: keyboard-only users need a visible focus indicator on all
interactive buttons. The Copy, Dismiss, and Revoke buttons in OrgTokensTab
and TokensTab had :hover but no :focus-visible, making focus state
invisible when tabbing to these buttons.
Add focus-visible:ring-2 (accent for copy/dismiss, red-400 for revoke)
to all non-disabled action buttons in both tabs.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Settings → Workspace Tokens 500'd whenever opened with no canvas node
selected. SettingsPanel passes the literal sentinel "global" as the
workspace id; the backend queries the uuid `workspace_id` column with
it → Postgres `invalid input syntax for type uuid: "global"` → opaque
500 ("failed to list tokens"). Token create in that view broke the same
way. SecretsTab already handles the sentinel (api/secrets.ts reroutes
"global" → /settings/secrets); TokensTab did not — that asymmetry was
the bug. Pre-existing since 2026-04-13, NOT a regression.
Frontend (user-visible fix): TokensTab is now sentinel-aware like
SecretsTab. When workspaceId === "global" (no node selected) it no
longer calls /workspaces/global/tokens — it renders a clean state
pointing the user to the Org API Keys tab (the existing org-wide
surface). No 500, no scary error banner. The red account "Error" in
this view was just this 500 surfacing through TokensTab's local error
banner; it resolves with this guard (verified in code — no separate
widget).
Backend (defense-in-depth, same PR): List/Create/Revoke validate
c.Param("id") as a UUID up front and return 400 {"error":"invalid
workspace id"} instead of leaking a DB type error as a 500. Added the
missing log.Printf on the List query-error branch — it was the only
token handler silently swallowing the DB error, which is why this
incident had zero log trail. Mirrors the uuid.Parse guard already in
handlers/activity.go.
Workaround (pre-merge): select a workspace node before opening the
tab, or use the Org API Keys tab.
Product note for CTO: there is no /workspaces/global/tokens endpoint
(workspace tokens are inherently per-workspace; the org-wide
equivalent is the separate Org API Keys tab), so — unlike SecretsTab
which reroutes to a real global-secrets endpoint — the lowest-risk
safe behavior was a disabled state + pointer to Org API Keys rather
than a reroute. Flag if a different UX is wanted.
Tests: added TokensTab sentinel tests (no API call + Org-pointer) and
a backend table test asserting List/Create/Revoke 400 on non-UUID id
without hitting the DB. Updated existing token handler tests to use
valid UUIDs (they used "ws-1" etc.).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
a2a_mcp_server.py main()'s stdio read loop used
`await loop.run_in_executor(None, stdin.read, 65536)`. On a PIPE,
read(n) blocks until n bytes accumulate OR EOF. A live MCP client
(openclaw bundle-mcp, Claude Code, Cursor) sends one ~150-byte
newline-delimited request and keeps stdin OPEN waiting for the reply,
so neither condition is met: the server never parses `initialize` and
the client times out (~30s; openclaw: "MCP error -32000: Connection
closed"). This silently broke peer visibility for every pipe-spawned
MCP host while passing all existing stdio tests, which only fed stdin
from a regular file or a heredoc-pipe that CLOSES (EOF returns
immediately). readline() returns as soon as one newline-delimited
line is available — exactly the JSON-RPC framing — and is
backward-compatible with the EOF/file cases.
Root cause of the 2026-05-15 openclaw peer-visibility outage
(workspace 95744c11): the molecule MCP server could not complete the
handshake over openclaw's stdio pipe, so the agent fell back to
native sessions_list. The openclaw template adapter fix
(template-openclaw#16) works around this via HTTP transport; this
patch fixes the stdio root cause so stdio works for all CLI MCP hosts.
Regression coverage:
- tests/test_a2a_mcp_server.py::TestStdioKeepOpenPipe — spawns the
real a2a_mcp_server.py, writes one request over a pipe, and
DELIBERATELY keeps stdin open. FAILS (15s timeout, empty response)
on read(65536); PASSES on readline(). Verified both directions.
- ci-mcp-stdio-transport.yml: new "pipe held OPEN, no EOF" step that
reproduces the literal openclaw failure (the prior steps only
exercised EOF-closing stdin, which is why the outage shipped green).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tenant workspace containers run agent-controlled code and must never
receive a Git SCM write credential — agents structurally lacking
merge/approve creds is why the two-eyes review gate is self-bypass-proof
against forged-approval injection.
Latent path: handlers.loadPersonaEnvFile() merges a per-role persona
GITEA_TOKEN into cfg.EnvVars when MOLECULE_PERSONA_ROOT is set on a
tenant host; it then flowed unfiltered through buildContainerEnv()
(local Docker) and CPProvisioner.Start() (tenant EC2). Inert today
(persona dirs are operator-host-only) but unguarded — and the
pre-existing TestBuildContainerEnv_CustomEnvVarsAppended test actually
asserted GITHUB_TOKEN passed through verbatim.
Adds a narrow, auditable exact-match denylist (isSCMWriteTokenKey:
GITEA/GITHUB/GH/GITLAB/GL/BITBUCKET _TOKEN) applied by construction in
both env paths, plus negative-assertion tests covering the normal path
and a persona-file-merge simulation. Non-credential persona identity
(GITEA_USER, GITEA_USER_EMAIL) is intentionally preserved. No
provisioner refactor.
Tracking: molecule-ai/internal#438
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause: platform/internal/db.DB is a swappable package global.
setupTestDB (+ peer test helpers) saves/restores it via t.Cleanup, but
production code spawns fire-and-forget goroutines (maybeMarkContainerDead/
preflightContainerHealth -> RestartByID -> runRestartCycle, logA2ASuccess/
Failure activity logging, gracefulPreRestart, sendRestartContext) that
read db.DB. These detached goroutines outlive the test that triggered
them and race the db.DB pointer write in a LATER test's cleanup —
WARNING: DATA RACE on platform/internal/db.DB, surfaced deterministically
by PR#1240's expanded A2A test corpus on staging (a sibling of the
mc#664/mc#774 Phase-3-masked handler-test family). Pre-existing since
be5fbb5a (2026-05-07); NOT introduced by #1240/#1250.
Fix:
- Convert the leaked raw `go ...` restart/a2a-logging goroutines to the
existing tracked h.goAsync (asyncWG) — matches the already-correct
site at a2a_proxy.go:648 and goAsync's documented intent.
- Wire the never-connected test-drain half: a newHandlerHook (nil in
prod, zero cost) lets the test harness register every handler;
setupTestDB's cleanup now drains all tracked async goroutines BEFORE
restoring db.DB, eliminating the race window.
Verified: full `go test -race -timeout ./...` (CI step) green, 0 races,
0 failures; the 8 originally-failing tests pass -race -count=5.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3 of the Files API roots RFC. UI-side wiring for the new
/agent-home root. Backend dispatch is the Phase 2b PR (#TBD) — until
that lands, /agent-home returns the 501 stub from #1247, which the
existing error banner already surfaces gracefully.
Changes:
1. canvas/src/components/tabs/FilesTab/FilesToolbar.tsx — adds
<option value="/agent-home">/agent-home</option> at the bottom
of the root selector. Pre-Phase-2b the dropdown still works
because the server-side 501 is just an error response — same
error-banner path as a transient backend failure.
2. canvas/src/components/tabs/FilesTab.tsx — new
defaultRootForRuntime() function pins the initial root per-
runtime per Hongming Decisions §2 (internal#425):
- openclaw → /agent-home (the user-facing interesting state)
- everything else → /configs (legacy default)
FilesTab now reads workspace runtime from props.data?.runtime
and threads it through to PlatformOwnedFilesTab. Undefined-
runtime callers (legacy tests, pre-load states) default to
/configs — matches today's behaviour, no surprise.
3. canvas/src/components/tabs/FilesTab/FileEditor.tsx — new
SECRET_SHAPE_DENIED_MARKER export + denial-placeholder render
path. When fileContent === marker, the editor renders a
role=region placeholder instead of the textarea, so the matched
bytes never enter a controlled input (DOM value, clipboard,
inspector). Marker constant matches the canonical
'<denied: secret-shape>' string the Phase 2b backend will emit.
Also: /agent-home is read-only via isReadOnlyRoot until Phase
2b decides write semantics. Until then, write attempts would
201 with the 501 stub anyway, but blocking the textarea at the
UI saves the user a round-trip + a confusing error.
Tests (canvas/src/components/tabs/FilesTab/__tests__/agentHome.test.tsx):
- dropdown includes /agent-home option (pins Phase 1 contract)
- dropdown reflects /agent-home as selected value when prop is set
- denied-marker renders placeholder INSTEAD OF textarea (pins
the bytes-don't-leak invariant)
- regular content renders textarea, no placeholder (regression
guard)
- /agent-home renders textarea read-only (pins the gate)
- /configs renders textarea writable (regression guard for the
read-only-everywhere bug)
- marker constant matches the canonical '<denied: secret-shape>'
string (pins the contract value so a typo on either side
breaks the test)
vitest run on FilesTab + new tests: 47 tests passed, 3 files. tsc
--noEmit clean for all edited / created files (the pre-existing TS
errors in FilesTab.test.tsx are unchanged and unrelated).
Refs internal#425.
Phase 2a of the Files API roots RFC. Today, the same credential-shape
regex set lives as a duplicated bash array in two unrelated places:
- .gitea/workflows/secret-scan.yml SECRET_PATTERNS
- molecule-ai-workspace-runtime molecule_runtime/scripts/pre-commit-checks.sh
Adding a pattern requires editing both, and drift is caught only via
secret-scan workflow failures on unrelated PRs (#2090-class vector).
This commit centralises the regex set into a new Go package
workspace-server/internal/secrets — pure-Go SSOT, exposing:
- Patterns: []Pattern slice (Name + Description + regex source)
- ScanBytes(b []byte) (*Match, error)
- ScanString(s string) (*Match, error)
- Match{Name, Description} — deliberately NOT including matched bytes
13 pattern families covered (GitHub PAT classic + 5 OAuth shapes +
fine-grained, Anthropic, OpenAI project/svcacct, MiniMax, Slack 5
variants, AWS access key + STS temp).
Phase 2b (docker-exec backend) will import secrets.ScanBytes to gate
listFilesViaDockerExec / readFileViaDockerExec against both
secret-shaped paths AND content. Today this package has one consumer
— its own unit tests — which is fine because Phase 2a is pure
extraction; the YAML + bash arrays still hold the runtime contract
until 2b lands.
Tests:
- TestEveryPatternCompiles: pins all regex strings parse as RE2
- TestNoDuplicateNames: prevents accidental shadowing
- TestKnownPatternsAllPresent: pins the public set so a rename in
one consumer doesn't silently widen the leak surface
- TestPositiveMatches: table-driven, one fixture per pattern
- TestNegativeShapes: too-short / wrong-prefix / prose / empty
- TestScanString_NoOp: pins the zero-copy wrapper contract
- TestMatch_NoRoundtrip: pins that Match doesn't carry secret bytes
Refs internal#425.
Phase 1 of internal#425 RFC (Files API roots — container-internal home
+ system/agent split). Adds the new /agent-home allowedRoots key plus
short-circuit dispatch that returns 501 with the canonical pending-
message body across List/Read/Write/Delete verbs.
Why a stub:
- Lets the canvas FilesTab design its root-selector UI against the
final shape (the additional option appears in the dropdown today;
the body just says "implementation pending").
- The stub-vs-real transition is server-side only — Phase 2b lands
the docker-exec backend without canvas changes.
- The 501 short-circuit runs BEFORE the DB lookup, so canvases that
speculatively GET /agent-home don't generate workspace-not-found
noise in logs.
Tests:
- TestAgentHomeAllowedRoot pins the allowedRoots membership.
- TestAgentHomeStub_AllVerbs_Return501 pins the canonical 501 +
message body across all four verbs (table-driven for symmetry).
- Both assert the stub short-circuits before the DB / EIC / Docker
paths, so adding the real backend doesn't have to fight a stale
test that exercised a wrong layer.
Existing Files API tests (ListFiles / ReadFile / WriteFile /
DeleteFile / EIC dispatch / shells) still pass — diff is additive.
Refs internal#425.
Canvas "Save & Restart" was timing out for openclaw workspaces because
two bugs compounded:
1. **Pointless config.yaml write.** openclaw manages its own prompt
surface via SOUL/BOOTSTRAP/AGENTS multi-file system — it does NOT
read the platform's config.yaml. But ConfigTab.tsx was still
issuing `PUT /workspaces/:id/files/config.yaml` on every save,
which on tenant EC2 fans out through the slow EIC SSH tunnel path
(`workspace-server/internal/handlers/template_files_eic.go`).
Other runtimes that ship their own config are already exempted via
`RUNTIMES_WITH_OWN_CONFIG` (external, kimi, kimi-cli). Add openclaw
to that set so the platform stops doing work the runtime ignores.
2. **Client aborts before server returns.** `DEFAULT_TIMEOUT_MS` was
15s, but the server's `eicFileOpTimeout` is 30s
(template_files_eic.go L118). When EIC was slow or the EC2's
ec2-instance-connect daemon was unhealthy, the canvas aborted with
a generic timeout *before* the workspace-server returned its real
5xx — so the user saw a useless "request timed out" instead of
the actual cause. Raise the default to 35s so the server's error
surfaces. The AbortController contract is unchanged; callers can
still override `timeoutMs` per-request.
Together these fixes unblock the user-visible "Save & Restart"
behavior on openclaw workspaces. The underlying EIC hang on
i-04e5197e96adb888f (last_healthcheck_at IS NULL) is tracked
separately as a follow-up — this PR makes the canvas honest about
errors instead of swallowing them, and removes the unnecessary write
from openclaw's critical path entirely.
Refs: internal#418 (Canvas Save & Restart timeout on openclaw)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 14:38:43 -07:00
147 changed files with 9481 additions and 1646 deletions
echo"::error::${TEAM}-review: non-author review(s) were SUBMITTED but stored as PENDING — almost certainly the wrong Gitea review event string (internal#503)."
echo"::error::Gitea accepts ONLY the exact enum APPROVED / REQUEST_CHANGES / COMMENT. 'APPROVE' or lowercase is silently (HTTP 200) filed as PENDING and is invisible to this gate."
[ -n "${_rid:-}"]&&echo"::error:: review id=${_rid} by '${_rl}': RE-SUBMIT via POST ${API}/repos/${OWNER}/${NAME}/pulls/${PR_NUMBER}/reviews with {\"event\":\"APPROVED\"} (correct enum) — do NOT edit the DB."
done
fi
# --- Fallback (internal#348): check issue comments for agent-approval ---
# core-qa-agent and core-security-agent approve via issue comments, NOT
# the reviews API. The reviews API returns zero entries for comment-only
# approvals. This fallback reads PR issue comments and extracts logins that:
# 1. Posted a comment matching the agent-prefix pattern for this gate:
# qa → "[core-qa-agent] APPROVED"
# security → "[core-security-agent] APPROVED"
# OR posted a generic approval keyword (word-anchored, case-insensitive):
# APPROVED / LGTM / ACCEPTED
# 2. Are not the PR author
# 3. The team-membership probe below is the authoritative filter.
echo"::notice::${TEAM}-review: reviews API found no APPROVED reviews; found $(echo"$CANDIDATES"| wc -w | xargs) comment-based approval candidate(s) — verifying team membership..."
fi
else
debug "could not fetch issue comments (HTTP ${HTTP_CODE})"
fi
fi
if[ -z "${CANDIDATES:-}"];then
echo"::error::${TEAM}-review awaiting non-author APPROVE from ${TEAM} team (no candidates from reviews API or issue comments)"
# Belt-and-suspenders sanity floor: same logic as the staging
# variant — see that file's comment for the full rationale.
# Floor only applies when fleet >= 4; below that, canary-verify
# is the actual gate.
TOTAL_VERIFIED=${#SLUGS[@]}
if [ $TOTAL_VERIFIED -ge 4 ] && [ $UNREACHABLE_COUNT -gt $((TOTAL_VERIFIED / 2)) ]; then
echo "::error::$UNREACHABLE_COUNT of $TOTAL_VERIFIED tenant(s) unreachable — exceeds 50% threshold on a fleet large enough that this signals a real outage, not teardown race."
exit 1
fi
if [ $STALE_COUNT -gt 0 ]; then
echo "::error::$STALE_COUNT tenant(s) returned a stale SHA. ssm_status=Success was misleading — see job summary."
exit 1
fi
echo "::notice::Tenant fleet redeploy complete — all reachable tenants on ${EXPECTED_SHA:0:7} (${UNREACHABLE_COUNT} unreachable, soft-warned)."
# Auto-refresh staging tenant EC2s after every staging-branch merge.
#
# Mirror of redeploy-tenants-on-main.yml, with the staging-CP host and
# the :staging-latest tag. Sister workflow exists for prod (rolls
# :latest after canary-verify). Both share the same shape — just
# different CP_URL + target_tag + admin token secret.
#
# Why this workflow exists: publish-workspace-server-image now builds
# on every staging-branch push (PR #2335), pushing
# platform-tenant:staging-latest to GHCR. Existing tenants pulled
# their image once at boot and never re-pull, so the new image just
# sits unused until the tenant is reprovisioned.
#
# This workflow closes the gap by calling staging-CP's
# /cp/admin/tenants/redeploy-fleet, which performs a canary-first,
# batched, health-gated SSM redeploy across every live staging tenant.
# Same endpoint shape as prod CP — only the host differs.
#
# Runtime ordering:
# 1. publish-workspace-server-image completes on staging branch →
# new :staging-latest in GHCR.
# 2. This workflow fires via workflow_run, waits 30s for GHCR's CDN
# to propagate the new tag.
# 3. Calls redeploy-fleet with no canary (staging IS canary; we don't
# need a sub-canary inside it). Soak still applies to the first
# tenant in case of bad-deploy detection.
# 4. Any failure aborts the rollout and leaves older tenants on the
# prior image — safer default than half-and-half state.
#
# Rollback path: re-run with workflow_dispatch + target_tag=staging-<sha>
# of a known-good build.
on:
workflow_run:
workflows:['publish-workspace-server-image']
types:[completed]
branches:[main]
workflow_dispatch:
inputs:
target_tag:
description:'Tenant image tag to deploy (e.g. "staging-latest" or "staging-a59f1a6c"). Defaults to staging-latest when empty.'
required:false
type:string
default:'staging-latest'
canary_slug:
description:'Tenant slug to deploy first + soak (empty = skip canary, fan out immediately). Default empty for staging since staging itself is the canary.'
required:false
type:string
default:''
soak_seconds:
description:'Seconds to wait after canary before fanning out. Only meaningful if canary_slug is set.'
required:false
type:string
default:'60'
batch_size:
description:'How many tenants SSM redeploys in parallel per batch.'
required:false
type:string
default:'3'
dry_run:
description:'Plan only — do not actually redeploy.'
required:false
type:boolean
default:false
permissions:
contents:read
# No write scopes needed — the workflow hits an external CP endpoint,
# not the GitHub API.
# Serialize per-branch so two rapid staging pushes' redeploys don't
# overlap and cause confusing per-tenant SSM state. cancel-in-progress
# is false because aborting a half-rolled-out fleet leaves tenants
# stuck on whatever image they happened to be on when cancelled.
concurrency:
group:redeploy-tenants-on-staging
cancel-in-progress:false
jobs:
redeploy:
# Skip the auto-trigger if publish-workspace-server-image didn't
# actually succeed. workflow_run fires on any completion state; we
# don't want to redeploy against a half-built image.
echo "::warning::redeploy-fleet returned HTTP 500 but every failed tenant ($COUNT) is ephemeral (e2e-*/rt-e2e-*) — treating as teardown race, soft-warning."
printf '%s\n' "$FAILED_SLUGS" | sed 's/^/::warning:: failed: /'
elif [ "$HTTP_CODE" != "200" ]; then
echo "::error::redeploy-fleet returned HTTP $HTTP_CODE"
if [ -n "$NON_EPHEMERAL_FAILED" ]; then
echo "::error::non-ephemeral tenant(s) failed:"
printf '%s\n' "$NON_EPHEMERAL_FAILED" | sed 's/^/::error:: /'
fi
exit 1
else
# HTTP=200 but ok=false (shouldn't happen with current CP
# but keep the gate for completeness).
echo "::error::redeploy-fleet reported ok=false (see summary for which tenant halted the rollout)"
exit 1
fi
echo "::notice::Staging tenant fleet redeploy reported ssm_status=Success — verifying actual image roll on each tenant..."
if [ $TOTAL_VERIFIED -ge 4 ] && [ $UNREACHABLE_COUNT -gt $((TOTAL_VERIFIED / 2)) ]; then
echo "::error::$UNREACHABLE_COUNT of $TOTAL_VERIFIED staging tenant(s) unreachable — exceeds 50% threshold on a fleet large enough that this signals a real outage, not teardown race."
exit 1
fi
if [ $STALE_COUNT -gt 0 ]; then
echo "::error::$STALE_COUNT staging tenant(s) returned a stale SHA. ssm_status=Success was misleading — see job summary."
exit 1
fi
echo "::notice::Staging tenant fleet redeploy complete — all reachable tenants on ${EXPECTED_SHA:0:7} (${UNREACHABLE_COUNT} unreachable, soft-warned)."
| Engineering docs (`docs/adr/`, `docs/architecture/`, `docs/incidents/`) | This repo (internal, not published) |
| Live product pages (e.g. `canvas/src/app/pricing/page.tsx`) | This repo (these are app code, not marketing copy) |
If a PR fails the `Block forbidden paths` check, the contents belong in
`Molecule-AI/docs`. No CI drag, no Canvas E2E, content lands in minutes.
`molecule-ai/docs`. No CI drag, no Canvas E2E, content lands in minutes.
## Development Workflow
@@ -190,9 +190,9 @@ Runs the full regression suite against a fixture HTTP server. No network access
Code in this repo lands in molecule-core. Some related runtime artifacts
live in their own repos:
- [`Molecule-AI/molecule-ai-workspace-runtime`](https://git.moleculesai.app/molecule-ai/molecule-ai-workspace-runtime) — Python adapter SDK (`molecule_runtime`) that runs inside containerized Molecule workspaces. Bridges Claude Code SDK / hermes / langgraph / etc. → A2A queue.
- [`Molecule-AI/molecule-sdk-python`](https://git.moleculesai.app/molecule-ai/molecule-sdk-python) — `A2AServer` + `RemoteAgentClient` for external agents that register over the public `/registry/register` flow.
- [`Molecule-AI/molecule-mcp-claude-channel`](https://git.moleculesai.app/molecule-ai/molecule-mcp-claude-channel) — Claude Code channel plugin. Bridges A2A traffic into a running Claude Code session via MCP `notifications/claude/channel`. Polling-based (no tunnel required); install with `claude --channels plugin:molecule@Molecule-AI/molecule-mcp-claude-channel`.
- [`molecule-ai/molecule-ai-workspace-runtime`](https://git.moleculesai.app/molecule-ai/molecule-ai-workspace-runtime) — Python adapter SDK (`molecule_runtime`) that runs inside containerized Molecule workspaces. Bridges Claude Code SDK / hermes / langgraph / etc. → A2A queue.
- [`molecule-ai/molecule-sdk-python`](https://git.moleculesai.app/molecule-ai/molecule-sdk-python) — `A2AServer` + `RemoteAgentClient` for external agents that register over the public `/registry/register` flow.
- [`molecule-ai/molecule-mcp-claude-channel`](https://git.moleculesai.app/molecule-ai/molecule-mcp-claude-channel) — Claude Code channel plugin. Bridges A2A traffic into a running Claude Code session via MCP `notifications/claude/channel`. Polling-based (no tunnel required); install inside Claude Code via `/plugin marketplace add https://git.moleculesai.app/molecule-ai/molecule-mcp-claude-channel.git` → `/plugin install molecule@molecule-channel`, then launch with `claude --dangerously-load-development-channels --channels plugin:molecule@molecule-channel`.
When extending the **A2A surface** in molecule-core (`workspace-server/internal/handlers/a2a_proxy.go` etc.), consider whether the change has a downstream impact on the runtime SDK or the channel plugin — they're versioned independently but share the wire shape.
@@ -238,7 +238,7 @@ The result is not just “an agent that learns.” It is **an organization that
- subscribe to one or more workspaces; peer messages surface as conversation turns; replies route back through Molecule's A2A
- no tunnel, no public endpoint — the plugin self-registers each watched workspace as `delivery_mode=poll` and long-polls `/activity?since_id=…`
- multi-tenant friendly: one plugin install can watch workspaces across multiple Molecule tenants (`MOLECULE_PLATFORM_URLS` per-workspace)
- install via the standard marketplace flow: `/plugin marketplace add Molecule-AI/molecule-mcp-claude-channel` → `/plugin install molecule-channel@molecule-mcp-claude-channel`
- install via the standard marketplace flow: `/plugin marketplace add https://git.moleculesai.app/molecule-ai/molecule-mcp-claude-channel.git` → `/plugin install molecule@molecule-channel`, then launch with `claude --dangerously-load-development-channels --channels plugin:molecule@molecule-channel`
"Molecule AI is an org-chart canvas for AI agent teams. Wire Claude Code, Codex, Hermes, and OpenClaw agents into a governed multi-agent workspace with credit metering, audit, and one-click runtime provisioning.",
applicationName:"Molecule AI",
keywords:[
"AI agents",
"multi-agent",
"agent orchestration",
"AI org chart",
"Claude Code",
"Codex",
"MCP",
"agent governance",
"A2A",
"agent runtime",
],
authors:[{name:"Molecule AI"}],
creator:"Molecule AI",
publisher:"Molecule AI",
alternates:{canonical:"/"},
// OG + Twitter images come from the file-convention sibling
// `opengraph-image.tsx` — Next.js auto-attaches them to og:image
// and twitter:image when present at the segment root. We keep the
// text fields here so they win over per-page metadata when a page
// doesn't override them. `images: []` as the structural fallback
// for hosts that won't follow the file convention; the real URL
// is injected by Next.js at build time from opengraph-image.tsx.
openGraph:{
type:"website",
siteName:"Molecule AI",
url: SITE_URL,
title:"Molecule AI — the AI org chart canvas",
description:
"Wire Claude Code, Codex, Hermes, and OpenClaw agents into a governed multi-agent workspace. Credit metering, audit, and one-click runtime provisioning.",
locale:"en_US",
},
twitter:{
card:"summary_large_image",
title:"Molecule AI — the AI org chart canvas",
description:
"Wire Claude Code, Codex, Hermes, and OpenClaw agents into a governed multi-agent workspace.",
},
icons:{
icon:"/molecule-icon.png",
apple:"/molecule-icon.png",
},
// robots.ts owns the per-route allow/disallow contract; this is the
// header-level fallback for routes the crawler reaches before
// robots.txt resolves. Default = index public marketing routes;
label="Universal MCP — standalone register + heartbeat + tools for any MCP-aware runtime (Claude Code, hermes, codex). Pair with Python or Claude Code tab if you need inbound A2A delivery."
copyKey="mcp"
copied={copiedKey==="mcp"}
onCopy={()=>copy(filledUniversalMcp,"mcp")}
/>
)}
{tab==="hermes"&&filledHermes&&(
<SnippetBlock
value={filledHermes}
label="Hermes channel — bridges this workspace's A2A traffic into your hermes-agent session as platform messages (push parity with Claude Code). Long-poll based; no tunnel needed."
copyKey="hermes"
copied={copiedKey==="hermes"}
onCopy={()=>copy(filledHermes,"hermes")}
/>
)}
{tab==="codex"&&filledCodex&&(
<SnippetBlock
value={filledCodex}
label="Codex MCP config — wires the molecule MCP server into ~/.codex/config.toml. Outbound tools today; inbound A2A push needs the Python SDK tab paired in (codex's MCP runtime doesn't route arbitrary notifications/* yet)."
copyKey="codex"
copied={copiedKey==="codex"}
onCopy={()=>copy(filledCodex,"codex")}
/>
)}
{tab==="openclaw"&&filledOpenClaw&&(
<SnippetBlock
value={filledOpenClaw}
label="OpenClaw MCP config — wires the molecule MCP server via openclaw mcp set + starts the gateway on loopback. Outbound tools today; inbound A2A push on an external openclaw needs the Python SDK tab paired in (a sessions.steer bridge daemon is future work)."
copyKey="openclaw"
copied={copiedKey==="openclaw"}
onCopy={()=>copy(filledOpenClaw,"openclaw")}
/>
)}
{tab==="kimi"&&filledKimi&&(
<SnippetBlock
value={filledKimi}
label="Kimi CLI — self-contained Python bridge. Registers, heartbeats, polls for canvas messages, and echoes replies back. NAT-safe (no public URL). Run in a background terminal or via launchd."
label="Universal MCP — standalone register + heartbeat + tools for any MCP-aware runtime (Claude Code, hermes, codex). Pair with Python or Claude Code tab if you need inbound A2A delivery."
label="Hermes channel — bridges this workspace's A2A traffic into your hermes-agent session as platform messages (push parity with Claude Code). Long-poll based; no tunnel needed."
label="OpenClaw MCP config — wires the molecule MCP server via openclaw mcp set + starts the gateway on loopback. Outbound tools today; inbound A2A push on an external openclaw needs the Python SDK tab paired in (a sessions.steer bridge daemon is future work)."
copyKey="openclaw"
copied={copiedKey==="openclaw"}
onCopy={()=>copy(filledOpenClaw,"openclaw")}
/>
)}
</div>
{/* Kimi tab */}
<div
id="panel-kimi"
data-testid="panel-kimi"
role="tabpanel"
aria-labelledby="tab-kimi"
hidden={tab!=="kimi"||!filledKimi}
className={tab==="kimi"&&filledKimi?"":"hidden"}
>
{filledKimi&&(
<SnippetBlock
value={filledKimi}
label="Kimi CLI — self-contained Python bridge. Registers, heartbeats, polls for canvas messages, and echoes replies back. NAT-safe (no public URL). Run in a background terminal or via launchd."
<pid="files-delete-one-msg"className="text-xs text-warm">Delete<spanclassName="font-mono">{confirmDelete}</span>{files.find((f)=>f.path===confirmDelete&&f.dir)?" and all its contents":""}?</p>
it("renders the secret-safe failure reason verbatim, not a hardcoded opaque message",()=>{
constreason=
"Anthropic 403 oauth_org_not_allowed: Your organization has disabled Claude subscription access for Claude Code — use an Anthropic API key or ask your admin to enable access.";
if ! curl -fsS "$BASE/health" -m 5 >/dev/null 2>&1;then
echo"::error::Local stack not healthy at $BASE/health — bring it up (make up) before this gate. Infra, not a workspace bug (feedback_fix_root_not_symptom)." >&2
exit1
fi
# admin/test-token is the local MCP-bearer mint path; it 404s in
# production. If it is off, this gate cannot drive the literal call.
if ! curl -fsS "$BASE/admin/workspaces/preflight-probe/test-token" -m 5 >/dev/null 2>&1;then
# A 404 here is EITHER "no such ws" (fine — endpoint is enabled) OR the
# endpoint is disabled (MOLECULE_ENV=production). Distinguish by body.
echo"::error::No runtime had a provider key set — cannot run the local peer-visibility gate. Set CLAUDE_CODE_OAUTH_TOKEN and/or E2E_MINIMAX_API_KEY (or ANTHROPIC/OPENAI)." >&2
exit1
fi
# ─── 2. Wait for the parent online (it is a peer target) ───────────────
log "2/5 waiting for parent online (peer target)..."
cancel()// simulate the HTTP handler having returned (request ctx dead)
mode,err:=lookupDeliveryMode(ctx,"ws-poll-peer")
iferr==nil{
t.Fatalf("internal#497 regression: lookupDeliveryMode swallowed a context error and returned mode=%q with nil err — this is the exact 5-day silent-misrouting vector",mode)
}
ifmode==models.DeliveryModePush{
t.Errorf("internal#497 regression: context error must NOT default to push (got mode=%q)",mode)
// Realistic actionable reason: provider HTTP status + provider's
// own message. Secret-safe (no token, no api key, just the cause).
detail:="Anthropic 403 oauth_org_not_allowed: Your organization has disabled Claude subscription access for Claude Code — use an Anthropic API key or ask your admin to enable access."
// The HTTP handler "returns 202" → request context is cancelled.
cancelParent()
iferr:=parent.Err();err==nil{
t.Fatal("precondition: parent context should be cancelled after the handler returns")
}
// (a) Cancellation MUST NOT propagate to the detached context.
select{
case<-delegationCtx.Done():
t.Fatalf("regression: detached delegation ctx was cancelled by the handler returning (err=%v) — executeDelegation would fail every DB op with `context canceled`",delegationCtx.Err())
default:
// alive — correct
}
// (b) Parent values MUST still be readable (WithoutCancel preserves
// values; trace/correlation/tenant ids the proxy + broadcaster use).
// which reads db.DB in runRestartCycle before its provisioner
// gate) BEFORE swapping db.DB back. Doing the restore first
// would let an in-flight restart goroutine read db.DB while
// this line writes it — the data race this guards against.
drainTestAsync()
db.DB=prevDB
mockDB.Close()
})
// Disable SSRF checks for the duration of this test only. Restore
// the previous state via t.Cleanup so that TestIsSafeURL_* tests
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.