Bug: a2a_response.py:197 returned Queued(method=method) without passing
delivery_mode, silently defaulting to "poll" for push-mode busy-queue
responses. Callers branching on v.delivery_mode would mis-identify push-mode
responses as poll-mode, causing wrong dispatch logic.
Fix: pass delivery_mode="push" explicitly in the push-mode branch.
Tests: add push_queued_full/notify/no_method fixtures and 4 test cases
asserting delivery_mode="push" for all three envelope shapes. Also add
adversarial {"queued": "yes"} and {"queued": False} → Malformed guards.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Run 5160 publish-runtime build step failed:
error: TOP_LEVEL_MODULES drifted from workspace/*.py contents:
in workspace/ but NOT in TOP_LEVEL_MODULES (will ship un-rewritten): ['_sanitize_a2a']
Edit scripts/build_runtime_package.py:TOP_LEVEL_MODULES to match.
workspace/_sanitize_a2a.py was added recently but the allowlist in
scripts/build_runtime_package.py was not updated. The build script
intentionally aborts (exit 3) when it detects the drift, because
shipping a module un-rewritten breaks the package's flat-layout import
contract.
Fix: add '_sanitize_a2a' to the set. Alphabetical order preserved
(it sorts before 'a2a_*').
Third workflow defect after #353 (workflow_dispatch.inputs parser) and
#355 (Publish step working-directory). After this lands, attempt #4 of
runtime-v0.1.130 should finally succeed.
Refs: #351, #353, #355, #348 Q3
First-ever publish-runtime.yml dispatch (run 5097 post-#353, 2026-05-11
02:06Z) failed at the twine upload step:
ERROR InvalidDistribution: Cannot find file (or expand pattern): 'dist/*'
Cause: the Publish step was missing 'working-directory: ${{ runner.temp
}}/runtime-build' while the preceding Build/Verify steps all had it.
Result: twine ran from the workspace checkout dir where dist/ doesn't
exist.
Fix: add working-directory to match the rest of the publish job.
This is the second of three workflow defects exposed by #353 finally
making the workflow run at all:
1. workflow_dispatch.inputs rejection → fixed in #353
2. Publish step missing working-directory → THIS PR
3. (anything else surfaced by 0.1.130 attempt #2)
After merge: push runtime-v0.1.130 again (tag was already pushed once
post-#353 but the run failed at publish; need a fresh trigger). Should
finally land 0.1.130 on PyPI.
Refs: #351, #348 Q3, #353
test_audit_ledger.py imports sqlalchemy directly (line 42).
Without an explicit sqlalchemy install, pip dependency resolution can
omit it when pytest/pytest-asyncio/pytest-cov are installed as a
separate step after requirements.txt.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Gitea Actions reads .gitea/workflows/, not .github/workflows/. The
.github/ copy of this workflow has been kept in lockstep with .gitea/
since the post-suspension migration (e.g. 6d94fd30, 5216e781, 67b2e488
all touch both files). The functional code is identical between the
two; the only differences are comment verbosity and the path-filter
self-reference (each version watches its own location).
Removing the .github/ copy:
- eliminates the dual-edit maintenance tax (two files touched per fix)
- prevents accidental drift where one is updated and the other isn't
- leaves a single source-of-truth at .gitea/workflows/
Cross-references confirmed safe:
- canary-verify.yml + redeploy-tenants-on-{staging,main}.yml all use
`workflows: ['publish-workspace-server-image']` (workflow name,
not file path) — they trigger off the workflow_run event keyed on
`name:`, which is identical in both files.
- No other workflow path-watches .github/workflows/publish-workspace-
server-image.yml.
Other two triplicates from task #287 (publish-runtime.yml and
secret-scan.yml) are NOT addressed in this PR — see PR description for
the ambiguity report flagging them for human review.
Refs: task #287
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trivial empty commit to force a fresh workflow run now that the
PR has tier:low label and approvals on the rebased branch.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Root cause (from infra-lead PR#7 review id=724):
Sanitization in PR#7 wrapped peer text in [A2A_RESULT_FROM_PEER]
markers, but the markers themselves were not escaped — a malicious
peer could inject "[/A2A_RESULT_FROM_PEER]" to close the trust
boundary early, making subsequent text appear inside the trusted zone.
Fix:
- Create workspace/_sanitize_a2a.py (leaf module, no circular import
risk) with shared sanitize_a2a_result() + _escape_boundary_markers()
- _escape_boundary_markers() escapes boundary open/close markers in the
raw peer text before wrapping (primary security control)
- Defense-in-depth: also escapes SYSTEM/OVERRIDE/INSTRUCTIONS/IGNORE
ALL/YOU ARE NOW patterns (secondary, per PR#7 design intent)
- Update a2a_tools_delegation.py: import from _sanitize_a2a; wrap
tool_delegate_task return and tool_check_task_status response_preview
- Add 15 tests covering boundary escape, injection patterns, integration
shapes (workspace/tests/test_a2a_sanitization.py)
Follow-up (non-blocking, noted in PR#7 infra-lead review):
- Deduplicate if a2a_tools.py also wraps (currently handled in
delegation module only — callers get sanitized output regardless)
- tool_check_task_status: consider sanitizing 'summary' field too
Closes: molecule-ai/molecule-ai-workspace-runtime#7 (wrong-repo PR
that this supersedes)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implements the Claude Design handoff (Molecules AI Mobile.html) as a
viewport-gated React tree under canvas/src/components/mobile/. < 640px
renders the new shell instead of the desktop ReactFlow canvas.
Six screens, all bound to live store data:
- Home (agent list + filter chips + spawn FAB)
- Canvas (mini-graph with pinch-to-zoom + pan + reset)
- Detail (status pills, tabs: Overview / Activity / Config / Memory;
Activity hits /workspaces/:id/activity)
- Chat (textarea composer, IME-safe Enter, sendInFlightRef guard;
bootstraps from agentMessages so the prior thread shows on entry)
- Comms (live A2A feed via /workspaces/:id/activity + ACTIVITY_LOGGED)
- Spawn (bottom sheet; fetches /templates so users pick what's actually
installed on their platform)
Plus a Me tab for mobile theme/accent/density.
Design system (palette.ts + primitives.tsx) ports tokens 1:1 from the
handoff: cream + dark palettes, T1-T4 tier chips, status dots with
halo, JetBrains Mono for IDs/timestamps. Inter + JetBrains Mono are
self-hosted via next/font/google so CSP `font-src 'self'` is honoured.
URL routing: routes sync to ?m=<route>&a=<id>; popstate restores route;
deep links seed initial state. /?m=detail without ?a collapses to home.
Accent override flows through React context (MobileAccentProvider) —
not by mutating the static MOL_LIGHT/MOL_DARK singletons.
SSR flash: isMobile is tri-state; loading spinner stays up until
matchMedia resolves so mobile devices never paint the desktop tree.
Desktop responsiveness fixes (separate but ride along):
- Toolbar: full-width with overflow-x-auto on mobile, logo text + count
hidden < sm, divider/border collapse to sm: only.
- SidePanel: full-screen on mobile via matchMedia, resize handle hidden.
- Canvas: MiniMap hidden < sm (was overlapping the New Workspace FAB).
Tests (51 total, 33 new):
- palette.test.ts (12) - normalizeStatus, tierCode, light/dark parity
- components.test.ts (10) - toMobileAgent field mapping + classifyForFilter
- MobileApp.test.tsx (12) - route stack, deep links, popstate, tab bar
hidden on chat, spawn overlay
- SidePanel.tabs.test.tsx (18) - regression-clean
Verified: tsc --noEmit clean across mobile/, page.tsx, layout.tsx.
Not yet verified: live phone browser (needs CP backend hydrated).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ROOT CAUSE found in Gitea server logs:
actions/workflows.go:DetectWorkflows() [W] ignore invalid workflow
"publish-runtime.yml": unknown on type:
map["version":{"description":...,"required":true,"type":"string"}]
Gitea 1.22.6's workflow parser flattens workflow_dispatch.inputs.* into
top-level 'on:' event-keys and rejects the workflow when it doesn't
recognize them. Once rejected, the workflow never registers — so NO
event triggers it. publish-runtime.yml has 0 runs in action_run since
the .gitea port for exactly this reason; the runtime-v1.0.0 tag from
yesterday and hongming-pc's runtime-v0.1.130 from tonight both pushed
successfully but went nowhere.
This supersedes the paths-vs-tags hypothesis from #351 (PR #352).
The split is still useful for clarity but was NOT the cause — even
the original tags-only port had this same parse failure.
Fix: drop the inputs block. workflow_dispatch in Gitea 1.22.6 supports
no-input dispatch only. The bash logic for version derivation now uses
just two cases: tag-push (strip prefix) or anything-else (PyPI auto-bump).
Post-merge verification:
- watch for first-ever publish-runtime.yml run in action_run
- check Gitea log no longer emits 'ignore invalid workflow' for this file
- push a runtime-v0.1.130 tag → workflow fires → PyPI 0.1.130
Refs: #351 (root cause), #348 Q3 (the blocker)
publish-runtime.yml has never fired since the .gitea port (0 rows in
action_run.workflow_id='publish-runtime.yml' ever), which is why PyPI
is still at 0.1.129 despite Gitea having a runtime-v1.0.0 tag.
Root cause hypothesis: Gitea Actions evaluates the on.push.paths filter
against tag-push events too (no path diff → workflow skipped). PR #349
made this visible by adding the paths trigger, but the same defect
existed for the originally-ported tags-only trigger on this Gitea version
— hence the runtime-v1.0.0 tag also never published.
Fix: split into two files, each with a single unambiguous trigger shape.
- publish-runtime.yml : on.push.tags only (the publisher)
- publish-runtime-autobump.yml : on.push.branches+paths (NEW; the bumper)
The autobump file computes next version from PyPI latest, pushes
'runtime-v$VERSION' tag via DISPATCH_TOKEN (not GITHUB_TOKEN — needed
to trigger downstream workflows on Gitea), and exits. The tag push
then triggers publish-runtime.yml.
Test plan after merge:
1. Push no-op commit to workspace/. Observe autobump fire, push tag.
2. Observe publish-runtime.yml fire on the tag, publish 0.1.130 to
PyPI, cascade to template repos.
3. Verify 'action_run' shows >0 rows for both workflow_ids.
Adds back the original GitHub workflow's auto-publish trigger that was
dropped during the 2026-05-10 .gitea port (#206). Push to main or
staging filtered by workspace/** falls into the existing PyPI-latest
auto-bump path — no logic changes, just the missing trigger and a
comment correction.
Caveat: the workflow still requires PYPI_TOKEN as a repository secret
(or org-level). Without it the publish step will fail loudly with a
descriptive error. Q2 follow-up tracks setting the secret.
Refs: molecule-core#348
Issue: _delegate_sync_via_polling (RFC #2829 PR-5 sync path) returned
unsanitized response_preview and error_detail fields to the agent context.
A malicious peer could inject trust-boundary markers to break the boundary
established by the main sanitization layer.
Changes:
- a2a_tools_delegation.py: sanitize response_preview before returning on
completed; sanitize error_detail/summary before wrapping in _A2A_ERROR_PREFIX
- test_a2a_tools_delegation.py: TestPollingPathSanitization covers both paths
Companion to PR #382 (runtime/offsec-003-executor-sanitize) which covers
the async heartbeat path in executor_helpers.read_delegation_results.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds _sanitize_a2a.py (from PR #346) and integrates sanitize_a2a_result()
into read_delegation_results() so peer-supplied summary and response_preview
fields are escaped before being injected into the agent prompt.
Output is wrapped in [A2A_RESULT_FROM_PEER]...[/A2A_RESULT_FROM_PEER]
boundary markers so content after the block is clearly not from a peer.
Fixes:
- test_a2a_executor.py: correct mock patch path to executor_helpers
- test_executor_helpers.py: fix boundary-injection test assertion to match
_strip_closed_blocks behaviour (closes marker, removes following text)
Follow-up to PR #346 (OFFSEC-003 boundary escape) which noted
"read_delegation_results() path still needs sanitization" as a gap.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Root cause: the sop-tier-check.sh script uses jq extensively for all
JSON API parsing (whoami, labels, team IDs, reviews). Gitea Actions
runners (ubuntu-latest label) do not bundle jq — script exits at
line 67 with "jq: command not found", producing "Failing after 1-3s"
status on every staging PR.
Fix: add apt-get install -y jq step before the script run.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The Canvas template-deploy path returned HTTP 500 with raw pq error
when a user clicked a template card twice in quick succession. Root
cause: migration 20260506000000 added the partial-unique index
`workspaces_parent_name_uniq` on (COALESCE(parent_id, sentinel), name)
WHERE status != 'removed' to close TOCTOU on /org/import (#2872). The
org-import handler resolves the constraint via ON CONFLICT DO NOTHING
+ idempotent re-select. The Canvas Create handler did not — it
bubbled the pq violation as a generic 500.
Fix: auto-suffix the user-typed name on collision via a small retry
helper that pins on SQLSTATE 23505 + constraint name (so unrelated
unique indexes still fail loud), retries with " (2)", " (3)" up to
N=20, and threads the actually-persisted name back into the response
+ broadcast payload (so the canvas displays what the DB actually
holds). Exhaustion maps to a clean 409 Conflict instead of a 500.
#2872 protection is preserved unchanged — the index stays in place,
and /org/import's ON CONFLICT path is unaffected. The bundle-import
INSERT (handlers/bundle.go) is a separate code path and is not
touched here; if it surfaces the same UX issue a follow-up can adopt
the same helper.
Verification (against running localhost:8080 platform):
Three back-to-back POSTs with name="ManualVerify-1778459812":
POST #1 -> 201, id=db2dacf7-…, persisted name="ManualVerify-1778459812"
POST #2 -> 201, id=f468083d-…, persisted name="ManualVerify-1778459812 (2)"
POST #3 -> 201, id=5f5ae905-…, persisted name="ManualVerify-1778459812 (3)"
Log lines: "name collision auto-suffix \"…\" -> \"… (N)\""
Tests:
- workspace_create_name_test.go — 4 unit tests via sqlmock pin the
retry contract (happy path no-suffix, single-collision -> " (2)",
non-retryable error pass-through, exhaustion -> errWorkspaceNameExhausted).
- workspace_create_name_integration_test.go — 2 real-Postgres tests
(build tag `integration`) confirm the partial-unique index
behaviour AND the WHERE status != 'removed' tombstone exemption.
- Watch-it-fail confirmed: temporarily removing the
`fmt.Sprintf("%s (%d)", baseName, attempt+1)` candidate-naming
line makes TestInsertWorkspaceWithNameRetry_SecondAttemptSuffixed
fail with the expected argument-mismatch from sqlmock.
Pre-existing test failures in handlers/ (TestExecuteDelegation_…,
TestMCPHandler_CommitMemory_GlobalScope_Blocked) reproduce on
unmodified staging and are NOT caused by this change.
Cherry-pick of d79a4bd2 from PR #318 onto fresh main base (PR #318 closed).
Issue #310: platform a2a-proxy logs ~300/hr
`timeout awaiting response headers` because ResponseHeaderTimeout was hardcoded
to 60s. Opus agent turns (big context + internal delegate_task round-trips)
routinely exceed 60s, so the proxy gave up before headers arrived even when
the workspace agent was healthy.
Changes:
- a2a_proxy.go: ResponseHeaderTimeout: 60s hardcoded →
envx.Duration("A2A_PROXY_RESPONSE_HEADER_TIMEOUT", 180s).
180s gives Opus turns comfortable headroom. The X-Timeout caller header
still bounds the absolute request ceiling independently.
- a2a_proxy_test.go: TestA2AClientResponseHeaderTimeout verifies the 180s
default and env-override parsing logic.
Env var: A2A_PROXY_RESPONSE_HEADER_TIMEOUT (e.g. 5m, 300s).
Closes#310.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Plugin adapters in molecule-skill-* repos do:
from plugins_registry.builtins import AgentskillsAdaptor as Adaptor
But _load_module_from_path() used exec_module() with a fresh module
namespace that did NOT have plugins_registry or its submodules in sys.modules,
causing:
ModuleNotFoundError: No module named 'plugins_registry'
Fix: before exec_module(), import and register plugins_registry + all three
submodules (builtins, protocol, raw_drop) in sys.modules so adapter imports
resolve correctly. Follows the Option 1 recommendation from issue #296.
Also adds test_resolve_plugin.py verifying the fix for both the
AgentskillsAdaptor import and the full InstallContext/resolve/protocol import.
Closes#296.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Ports the bounded retry+backoff around each `git clone` in
scripts/clone-manifest.sh onto main, mirroring PR #298 which landed the
same change on staging. CI-infra carve-out: publish-workspace-server-image.yml
fires on `push: branches:[main]`, so the retry mitigation must be on main for
the workflow to be resilient to the OOM-killed-git-mid-clone flake
(`error: git-remote-https died of signal 9`, run 4622) when triggered by a
main push. Same one-file change as #298 (+45/-5), POSIX-sh, sh -n clean.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The Docker daemon health-check fix should not change which branches trigger
the build. Revert accidental addition of 'staging' to branch filters.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Cover the canvas image publish workflow with the same `docker info`
guard added to publish-workspace-server-image.yml (commit 5216e781).
publish-canvas-image.yml was the only docker-build workflow still
missing the step.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The publish-workspace-server-image / build-and-push job clones the full
manifest (~36 repos) serially in the "Pre-clone manifest deps" step on a
memory-constrained Gitea Actions runner. Under host memory pressure the
OOM killer SIGKILLs git-remote-https mid-clone:
cloning .../molecule-ai-plugin-molecule-skill-code-review.git ...
error: git-remote-https died of signal 9
fatal: the remote end hung up unexpectedly
❌ Failure - Main Pre-clone manifest deps
exitcode '128': failure
Observed in run 4622 (2026-05-10, staging HEAD b5d2ab88) — died on the
14th of 36 clones, which red-lights CI and wedges staging→main.
Wrap each `git clone` in clone-manifest.sh with bounded retry + backoff
(3 attempts, 3s/6s), wiping any partial checkout between tries. A single
transient SIGKILL / network blip no longer fails the whole tenant image
rebuild. Benefits every caller of the script (publish-workspace-server-image,
harness-replays, Dockerfile builds, local quickstart).
This is a mitigation; the durable fix is more runner RAM/swap on the
operator host — tracked separately with Infra-SRE.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The A2A proxy can return three error shapes:
{"error": "plain string"}
{"error": {"message": "...", "code": ...}}
{"error": {"message": {"nested": "object"}}} ← value at .message is a string
builtin_tools/a2a_tools.py:72 called data["error"].get("message")
without guarding against error being a string, which raised:
AttributeError: 'str' object has no attribute 'get'
This broke every delegation attempt through the legacy a2a_tools path
(the LangChain-wrapped version used by adapter templates). The
SSOT parser a2a_response.py already handled string errors; the
legacy inline sniffer in a2a_tools.py did not.
Fix: branch on isinstance(err, dict/str/other) before calling .get().
Also update both publish-workflow files to remove the dead
`staging` branch trigger — trunk-based migration (PR #109,
2026-05-08) removed the staging branch.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>