Closes #9.
Three pieces, all small:
1. **docs/e2e-coverage.md** — source of truth for which E2E suites
guard which surfaces. Until now three suites ran on staging but were
informational-only; that's how the org-import silent-drop bug shipped
without a test catching it pre-merge. The matrix now shows what's
required where + a follow-up note for the two suites that need an
always-emit refactor before they can be required.
2. **tools/branch-protection/apply.sh** — branch protection as code.
Lets `staging` and `main` required-checks live in a reviewable
shell script instead of UI clicks that get lost between admins.
This PR's net change: add `E2E API Smoke Test` and `Canvas tabs E2E`
as required on staging. Both already use the always-emit path-filter
pattern (no-op step emits SUCCESS when the workflow's paths weren't
touched), so making them required can't deadlock unrelated PRs.
3. **branch-protection-drift.yml** — daily cron + drift_check.sh
that compares live protection against apply.sh's desired state.
Catches out-of-band UI edits before they drift further. Fails the
workflow on mismatch; ops re-runs apply.sh or updates the script.
Out of scope (filed as follow-ups):
- e2e-staging-saas + e2e-staging-external use plain `paths:` filters
and never trigger when paths are unchanged. They need refactoring
to the always-emit shape (same as e2e-api / e2e-staging-canvas)
before they can be required.
- main branch protection mirrors staging here; if main wants the
E2E SaaS / External added later, do it in apply.sh and rerun.
Operator must apply once after merge:
bash tools/branch-protection/apply.sh
The drift check picks it up from there.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User feedback (2026-05-04 conversation):
> "Skills and Tools are having their own tab as plugin, and Prompt
> Files are in the file system which can be directly edited. Am I
> missing something?"
> "Tools should be merged into plugin then, and for prompt files... it
> should be in another section than in skill& tools"
The "Skills & Tools" section in ConfigTab had three TagList inputs:
- Skills: managed via the dedicated SkillsTab (per-workspace
skill folders) — duplicate UI affordance
- Tools: managed via the Plugins tab (install a plugin → its
tools become available) — duplicate UI affordance
- Prompt Files: load order for system-prompt files — semantically
unrelated to skills/tools
Drop the Skills + Tools inputs. Move Prompt Files into its own
section with explanatory copy that names the auto-loaded files
(system-prompt.md, CLAUDE.md, AGENTS.md) and points users at the
Files tab for actual editing.
Schema fields `config.skills` and `config.tools` are KEPT (load-bearing
for runtime skill loading + tool registry); only the inline editor goes
away. Operators who need to edit them can still use the Raw YAML toggle.
Tests:
- New ConfigTab.sections.test.tsx with 4 cases:
1. "Skills & Tools" section title is gone
2. Skills tag input is absent
3. Tools tag input is absent
4. Prompt Files section exists with explanatory copy
Sibling ConfigTab tests (hermes, provider) all still pass (20/20).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #2813 (team-collapse) and #2814 (workspace delete).
Two leaks, one class. Both call sites had the same shape pre-fix:
if h.provisioner != nil {
h.provisioner.Stop(ctx, wsID)
}
On SaaS where h.provisioner (Docker) is nil and h.cpProv is set, that
gate evaluates false and the EC2 keeps running. Workspace gets marked
removed in DB; EC2 lives on until the orphan sweeper catches it.
Same drift class as PR #2811's org-import provision bug — a Docker-
only check on what should be a both-backend operation. Confirmed in
production: PR #2811's verification step deleted a test workspace and
the EC2 stayed running until I terminated it manually.
Fix: WorkspaceHandler.StopWorkspaceAuto(ctx, wsID) — symmetric mirror
of provisionWorkspaceAuto. CP first, Docker second, no-op when neither
is wired (a workspace nobody is running can't be stopped — that's a
no-op, not a failure, distinct from provision's mark-failed contract).
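The dispatcher shape, sketched with stub interfaces standing in for the
real CP/Docker provisioner types (only the routing order and the no-op
contract are from this PR):

    package handlers

    import "context"

    // Stub interfaces for illustration; the real types are the concrete
    // CP and Docker provisioners.
    type stopper interface {
        Stop(ctx context.Context, wsID string) error
    }

    type dockerBackend interface {
        stopper
        RemoveVolume(ctx context.Context, wsID string) error // Docker-only: host-bind volumes
    }

    type WorkspaceHandler struct {
        cpProv      stopper       // SaaS control-plane backend
        provisioner dockerBackend // self-hosted Docker backend
    }

    // StopWorkspaceAuto: CP first, Docker second, no-op when neither is
    // wired — a workspace nobody is running can't be stopped, and that's
    // fine, unlike provision's mark-failed contract.
    func (h *WorkspaceHandler) StopWorkspaceAuto(ctx context.Context, wsID string) error {
        switch {
        case h.cpProv != nil:
            return h.cpProv.Stop(ctx, wsID)
        case h.provisioner != nil:
            return h.provisioner.Stop(ctx, wsID)
        default:
            return nil // no backend wired: no-op, not an error
        }
    }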
Three call-site changes:
- team.go:208 (Collapse) → h.wh.StopWorkspaceAuto(ctx, childID)
- workspace_crud.go:432 (stopAndRemove) → h.StopWorkspaceAuto(...);
RemoveVolume stays Docker-only behind an explicit gate since
CP-managed workspaces have no host-bind volumes
- TeamHandler.provisioner field + NewTeamHandler's *Provisioner param
removed as dead code (Stop was the only call site)
Volume cleanup separation is intentional: the abstraction is "stop
the running workload," not "tear down all state." Callers that need
volume cleanup keep their `if h.provisioner != nil { RemoveVolume }`
gate AFTER the Stop call.
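Continuing the sketch above, the intended call-site ordering
(RemoveVolume's signature is assumed):

    // stopAndRemove, post-fix shape: stop via the dispatcher first, then
    // Docker-only volume cleanup behind its own explicit gate.
    func (h *WorkspaceHandler) stopAndRemove(ctx context.Context, wsID string) error {
        if err := h.StopWorkspaceAuto(ctx, wsID); err != nil {
            return err
        }
        if h.provisioner != nil { // CP-managed workspaces have no host-bind volumes
            return h.provisioner.RemoveVolume(ctx, wsID)
        }
        return nil
    }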
Tests:
- TestStopWorkspaceAuto_RoutesToCPWhenSet — SaaS path
- TestStopWorkspaceAuto_RoutesToDockerWhenOnlyDocker — self-hosted
- TestStopWorkspaceAuto_NoBackendIsNoOp — pins the contract distinction
from provisionWorkspaceAuto's mark-failed
- TestNoCallSiteCallsBareStop — source-level pin against
`.provisioner.Stop(` / `.cpProv.Stop(` outside the dispatcher,
per-backend bodies, restart helper, and the Docker-daemon-direct
short-lived-container path. Strips Go comments before substring
match so archaeology in code comments doesn't trip the gate.
- Verified: pin FAILS against the buggy shape (workspace_crud.go
reversion); team.go reversion compile-fails because the field is
gone — even stronger than the test.
Out of scope (tracked under #2799):
- workspace_restart.go's manual if-cpProv-else dispatch with retry
semantics tuned for the restart hot path. Functionally equivalent
+ wraps cpStopWithRetry, so it's not the bug class this PR closes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`codex-channel-molecule` 0.1.0 is now on PyPI, so operators no longer need
the `git+https://...` URL workaround.
Verified: `pip install codex-channel-molecule` from a clean venv installs
the wheel and the `codex-channel-molecule --help` console script runs.
PyPI: https://pypi.org/project/codex-channel-molecule/0.1.0/
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GET /workspaces/:id/files/config.yaml on hongming.moleculesai.app's
Hermes workspace returned 500 with body:
ssh cat: exit status 1 (Warning: Permanently added '[127.0.0.1]:37951'
(ED25519) to the list of known hosts.)
Root cause: ssh emits the "Permanently added" notice on every fresh
tunnel connection, even with UserKnownHostsFile=/dev/null (that
prevents persistence, not the warning). It lands on stderr, fooling
readFileViaEIC's classifier:
if len(out) == 0 && stderr.Len() == 0 {
return nil, os.ErrNotExist
}
return nil, fmt.Errorf("ssh cat: %w (%s)", runErr, ...)
stderr was non-empty (the warning), so we returned the wrapped error
→ 500 from the HTTP layer instead of 404.
Fix: add `-o LogLevel=ERROR` to BOTH writeFileViaEIC and readFileViaEIC
ssh invocations. Silences info+warning while keeping real auth/tunnel
errors visible (those emit at ERROR level).
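A sketch of where the flag lands on the read side; apart from the two
-o options named above, the argument wiring is assumed:

    package handlers

    import (
        "context"
        "os/exec"
    )

    // sshCatCmd (hypothetical name) shows the flag placement;
    // writeFileViaEIC adds the same -o LogLevel=ERROR to its own args.
    func sshCatCmd(ctx context.Context, port, remotePath string) *exec.Cmd {
        return exec.CommandContext(ctx, "ssh",
            "-o", "UserKnownHostsFile=/dev/null", // prevents persistence, not the warning
            "-o", "LogLevel=ERROR",               // drops info/warning; auth and tunnel errors still surface
            "-p", port, "user@127.0.0.1",
            "cat", remotePath,
        )
    }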
Test: TestSSHArgs_LogLevelErrorBothSites pins the flag in both blocks.
Mutation-tested: stripping the flag from one site fails the gate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bug: the case statement at line 189 grouped completed/failure |
completed/cancelled | completed/timed_out into the same "abort
+ exit 1" branch. cancelled ≠ failure — when per-SHA concurrency
(memory: feedback_concurrency_group_per_sha) cancels an older E2E
run because a newer push landed, the workflow blocked the whole
auto-promote chain on a non-failure.
Caught 2026-05-05 02:03 on sha 31f9a5e: E2E got cancelled by
concurrency, auto-promote :latest aborted with exit 1, the next
auto-promote-staging cycle had to manually clean up.
Split: failure/timed_out keep the abort path. cancelled gets its
own clean-defer branch (same shape as in_progress) — proceed=false
without exit 1, with a step-summary explaining likely concurrency
supersession and pointing operators at manual dispatch if they
need that specific SHA promoted.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-review of PR #2810 caught a regression: my mass-fix added
`2>/dev/null` to every curl invocation, suppressing stderr. The
original `|| echo "000"` shape only swallowed exit codes — stderr
(curl's `-sS`-shown dial errors, timeouts, DNS failures) still went
to the runner log so operators could see WHY a connection failed.
After PR #2810 the next deploy failure would log only the bare
HTTP code with no context. That's exactly the kind of diagnostic
loss that makes outages take longer to triage.
Drop `2>/dev/null` from each curl line — keep it on the `cat`
fallback (which legitimately suppresses "no such file" when curl
crashed before -w ran). The `>tempfile` redirect alone captures
curl's stdout (where -w writes) without touching stderr.
Same 8 files as #2810: redeploy-tenants-on-{main,staging},
sweep-stale-e2e-orgs, e2e-staging-{sanity,saas,external,canvas},
canary-staging.
Tests:
- All 8 files pass the lint
- YAML valid
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SSOT pass — replace 4 bare `h.provisioner == nil && h.cpProv == nil`
checks with `!h.HasProvisioner()`. When a third backend lands (k8s,
containerd, whatever), HasProvisioner picks up the new field in one
place; bare both-nil checks would each need to be hunted down and updated.
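A plausible shape for the helper, with stub types for illustration — a
third backend becomes one extra clause here instead of a fourth bare
check at every site:

    package handlers

    type dockerProvisioner struct{} // stub for the Docker backend
    type cpProvisioner struct{}     // stub for the control-plane backend

    type WorkspaceHandler struct {
        provisioner *dockerProvisioner
        cpProv      *cpProvisioner
    }

    // HasProvisioner is the SSOT for "is any backend wired?". Call sites
    // ask the question; only this method knows which fields exist.
    func (h *WorkspaceHandler) HasProvisioner() bool {
        return h.provisioner != nil || h.cpProv != nil
    }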
Sites:
- a2a_proxy_helpers.go:166 — maybeMarkContainerDead skip-no-backend
- workspace_restart.go:118 — Restart endpoint guard
- workspace_restart.go:363 — RestartByID coalescer guard
- workspace_restart.go:660 — Resume endpoint guard
Adds TestNoBareBothNilCheck (source-level) so the antipattern can't
slip back in.
Out of scope but discovered during the audit (filed separately):
- team.go:207 — team-collapse Stop is Docker-only, leaks EC2 on SaaS
- workspace_crud.go:423 — workspace delete cleanup is Docker-only,
leaks EC2 on SaaS
Both need a StopWorkspaceAuto mirror of provisionWorkspaceAuto. Same
class of bug as today's org-import incident, different verb (stop vs
provision).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two changes that close the silent-drop bug class:
1. Add WorkspaceHandler.HasProvisioner() and use it as the org-import
gate. Pre-fix, org_import.go:178 read `h.provisioner != nil` (Docker-
only) — on SaaS tenants where cpProv is wired but Docker is nil, the
entire 220-line provisioning prep block was skipped. The Auto call
PR #2798 added at line 395 was unreachable on SaaS.
Repro: 2026-05-05 01:14 — hongming prod tenant, 7-workspace org
import, every workspace sat in 'provisioning' for 10 min until the
sweeper marked it failed with the misleading "container started but
never called /registry/register".
2. provisionWorkspaceAuto self-marks-failed on the no-backend path.
Defense in depth: even if a future caller bypasses HasProvisioner
gating or ignores the bool return (TeamHandler pre-#2367 did exactly
this), the workspace ends in a clean failed state with an actionable
error message instead of lingering until the 10-min sweep.
Auto becomes the single source of truth for "start a workspace" —
routing AND the no-backend failure path. Create's redundant
if-not-Auto-then-mark-failed block collapses; only the workspace_config
UPSERT is kept, a Create-specific UI concern for rendering
runtime/model on the Config tab.
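A sketch of the resulting contract (stub types; markWorkspaceFailed and
the goroutine handoff stand in for the real code):

    package handlers

    import "context"

    // Stubs so the sketch compiles; real implementations do the work.
    type WorkspaceHandler struct{ cpProv, provisioner *struct{} }

    func (h *WorkspaceHandler) provisionWorkspaceCP(ctx context.Context, wsID string)     {}
    func (h *WorkspaceHandler) provisionWorkspace(ctx context.Context, wsID string)       {}
    func (h *WorkspaceHandler) markWorkspaceFailed(ctx context.Context, wsID, msg string) {}

    // provisionWorkspaceAuto: routing AND the no-backend failure path in
    // one place. It returns false only after marking the workspace failed,
    // so a caller that ignores the bool still leaves a clean state behind.
    func (h *WorkspaceHandler) provisionWorkspaceAuto(ctx context.Context, wsID string) bool {
        switch {
        case h.cpProv != nil:
            go h.provisionWorkspaceCP(ctx, wsID) // SaaS / EC2 path
            return true
        case h.provisioner != nil:
            go h.provisionWorkspace(ctx, wsID) // self-hosted Docker path
            return true
        default:
            h.markWorkspaceFailed(ctx, wsID, "no provisioning backend wired")
            return false
        }
    }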
Tests:
- TestProvisionWorkspaceAuto_NoBackendMarksFailed pins the new contract
- TestHasProvisioner_TrueOnCPOnly catches the SaaS-only blind spot
- TestHasProvisioner_TrueOnDockerOnly preserves self-hosted shape
- TestHasProvisioner_FalseWhenNeitherWired pins the gate-out path
- TestOrgImportGate_UsesHasProvisionerNotBareField source-pins the gate
(verified: FAILS against the buggy `h.provisioner != nil` shape, PASSES
with `h.workspace.HasProvisioner()`)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 2026-05-04 redeploy-tenants-on-main run for sha 2b862f6 emitted
"HTTP 000000" and failed the deploy. Root cause: when curl exits non-
zero (connection reset → 56, --fail-with-body 4xx/5xx → 22), the
`-w '%{http_code}'` already wrote a status to stdout; the inline
`|| echo "000"` then fires AND appends another "000" to the captured
substitution stdout. Result: HTTP_CODE="<actual><000>" — fails string
comparisons against "200" while looking superficially right.
Same class of bug the synth-E2E §7c gate hit twice (PRs #2779/#2783
+ #2797). Memory: feedback_curl_status_capture_pollution.md.
Mass fix in 8 workflows: route -w into a tempfile so curl's exit
code can't pollute stdout. Wrap with set +e/-e so the non-zero
curl exit doesn't trip the outer pipeline.
redeploy-tenants-on-main.yml (production-critical, caught the bug)
redeploy-tenants-on-staging.yml (sibling)
sweep-stale-e2e-orgs.yml (cleanup loop)
e2e-staging-sanity.yml (E2E safety-net teardown)
e2e-staging-saas.yml
e2e-staging-external.yml
e2e-staging-canvas.yml
canary-staging.yml
Plus a new lint workflow `lint-curl-status-capture.yml` that runs on
every PR/push touching `.github/workflows/**`. Multi-line aware:
collapses bash `\` continuations, then matches the buggy
$(curl ... -w '%{http_code}' ... || echo "000") subshell shape.
Distinguishes from the SAFE $(cat tempfile || echo "000") shape
(cat with missing file emits empty stdout, no pollution).
Verified:
- All 8 workflows pass the lint locally
- A known-bad injection is caught
- A known-safe cat-fallback passes through
- yaml.safe_load clean on all changed files
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the pattern hermes-channel-molecule uses (line 256). Drops
the broken `pip install codex-channel-molecule` which would 404.
PyPI publish workflow is a separate piece of work — until then,
git+https install is the path operators get.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The codex tab in the External Connect modal had an "outbound-tools-only
first cut" caveat — operators got the MCP wiring for codex calling
platform tools, but there was no documented inbound path. Canvas
messages couldn't wake an idle codex session.
That gap is now filled by codex-channel-molecule
(github.com/Molecule-AI/codex-channel-molecule), shipped today as the
codex counterpart to hermes-channel-molecule. The daemon long-polls
the platform inbox, runs `codex exec --resume <session>` per inbound
message, captures the assistant reply, routes it back via
send_message_to_user / delegate_task, and acks the inbox row.
Per-thread session continuity persisted to disk so daemon restarts
don't lose conversation context.
This commit:
- Updates externalCodexTemplate to include `pip install
codex-channel-molecule` (step 1) and a foreground `nohup
codex-channel-molecule` invocation (step 3) using the same env-var
contract as the MCP server (WORKSPACE_ID + PLATFORM_URL +
MOLECULE_WORKSPACE_TOKEN).
- Adds a "Canvas messages don't wake codex" common-issues entry to the
TAB_HELP codex section pointing at the bridge daemon log.
- Updates the doc comment to record the upstream deprecation path:
when openai/codex#17543 lands, the bridge becomes redundant and the
wired MCP server delivers push natively.
Verified: TestExternalTemplates_NoMoleculeOrgIDPlaceholder still
passes (no MOLECULE_ORG_ID re-introduction); full handlers suite
green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex / openclaw / hermes-channel snippets each instructed operators
to set `MOLECULE_ORG_ID = "<your org id>"`. The molecule_runtime MCP
subprocess these snippets spawn never reads MOLECULE_ORG_ID — that
env var is consumed only by workspace-server's TenantGuard
middleware, server-side, on the tenant box itself (set by the control
plane via user-data on provision).
External operator → tenant calls pass TenantGuard via the
isSameOriginCanvas path (Origin matches Host), with auth via Bearer
token + X-Workspace-ID. The universal_mcp snippet — which calls into
the same molecule_runtime — has always (correctly) omitted
MOLECULE_ORG_ID; this brings codex / openclaw / hermes-channel into
line.
Symptom that caught it: an external codex CLI session, after pasting
the codex-tab snippet, surfaced "MOLECULE_ORG_ID is still set to
'<your org id>'" as an unresolved blocker — the agent reasonably treated
the placeholder as required setup, and the operator has no value to
fill in.
Pinned with a structural test
(TestExternalTemplates_NoMoleculeOrgIDPlaceholder) so the placeholder
can't drift back across all six external-tab templates.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MEMORY_V2_CUTOVER=true gates the admin export/import path on the v2
plugin, but the cutoverActive() check in admin_memories.go silently
returns false when the plugin isn't wired:
func (h *AdminMemoriesHandler) cutoverActive() bool {
if os.Getenv(envMemoryV2Cutover) != "true" {
return false
}
return h.plugin != nil && h.resolver != nil
}
Two operator misconfigs hit the silent-fallback path:
1. MEMORY_V2_CUTOVER=true set, MEMORY_PLUGIN_URL unset
→ wiring.Build returns nil → handler stays on legacy SQL path
→ operator sees no error, assumes cutover is live, but every
request still writes the legacy table.
2. MEMORY_V2_CUTOVER=true set, MEMORY_PLUGIN_URL set, but plugin
unreachable at boot
→ wiring.Build still returns the bundle (intentional — circuit
breaker handles ongoing unavailability), but every cutover
write quietly falls back via the breaker.
→ only signal: legacy table keeps growing.
Both are exactly the "structurally invisible until prod" failure
mode; the only real-world detection today is "notice the legacy
table is still being written to," which no operator will check.
Add loud, distinctive WARN log lines at Build() time for both
shapes. Boot logs are operator-visible, so a half-config is
immediately obvious without needing dashboards.
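A sketch of the two warn shapes at Build() time (log wording, the
probe, and the wiring details are assumptions):

    package wiring

    import (
        "log"
        "os"
    )

    type Bundle struct{ probeOK bool } // stub; the real bundle carries the plugin client

    func connect(url string) *Bundle { return &Bundle{} } // stub connect + boot probe

    func Build() *Bundle {
        cutover := os.Getenv("MEMORY_V2_CUTOVER") == "true"
        url := os.Getenv("MEMORY_PLUGIN_URL")
        if url == "" {
            if cutover {
                // Shape 1: cutover requested, no plugin URL — without this
                // warning the handler silently stays on the legacy SQL path.
                log.Printf("WARN memory-v2: MEMORY_V2_CUTOVER=true but MEMORY_PLUGIN_URL unset; legacy path stays active")
            }
            return nil
        }
        b := connect(url)
        if cutover && !b.probeOK {
            // Shape 2: plugin unreachable at boot — the breaker will quietly
            // fall back on every cutover write.
            log.Printf("WARN memory-v2: cutover enabled but plugin at %s unreachable at boot; writes will fall back via circuit breaker", url)
        }
        return b
    }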
Tests:
* 4 new (cutover+no-URL → warn, neither set → silent, cutover+probe-
fail → loud warn, probe-fail-without-cutover → quiet generic)
* 6 existing (still pass; pin no-warning-on-happy-path)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Org-import called h.workspace.provisionWorkspace directly — same silent-
drop bug that bit TeamHandler.Expand on 2026-05-04 (see the
workspace.go:121-125 comment + #2486). Symptom on SaaS: every claude-code workspace
sat in "provisioning" until the 600s sweeper marked it failed with
"container started but never called /registry/register" — because no
container ever existed; the goroutine returned silently when the Docker
provisioner field was nil.
User reproduced 2026-05-04 ~22:30Z importing a 7-workspace template on
the hongming prod tenant. Tenant CP logs (queried live via SSM) showed
ZERO "Provisioner: goroutine entered" or "CPProvisioner: goroutine
entered" lines for any of the 7 failed workspace UUIDs in the 60min
window — confirming the goroutine never ran past line 384 of
org_import.go because provisionWorkspace returned early in SaaS mode.
The fix is one line: replace h.workspace.provisionWorkspace with
h.workspace.provisionWorkspaceAuto. Auto is the single source of
truth for backend selection (workspace.go:130) — picks CP-mode when
h.cpProv is wired, Docker-mode when h.provisioner is wired, returns
false when neither.
ALSO adds a generic source-level gate
(TestNoCallSiteCallsDirectProvisionerExceptAuto) so the next future
caller can't repeat the pattern. Walks every non-test .go file in
handlers/ and fails if any direct call to provisionWorkspace( or
provisionWorkspaceCP( appears outside the dispatcher's own definition
file.
The gate currently allowlists workspace_restart.go, which has its own
manual if-h.cpProv-else dispatch: functionally equivalent to Auto, so
not this bug class, but architectural duplication all the same
(follow-up filed for proper de-dup).
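A sketch of the gate's shape — the allowlist entries are from the text,
the scanning details are assumed (the real gate may normalize source
before matching):

    package handlers

    import (
        "os"
        "path/filepath"
        "strings"
        "testing"
    )

    func TestNoCallSiteCallsDirectProvisionerExceptAuto(t *testing.T) {
        allowed := map[string]bool{
            "workspace.go":         true, // dispatcher's own definition file
            "workspace_restart.go": true, // manual dispatch; de-dup follow-up filed
        }
        files, err := filepath.Glob("*.go")
        if err != nil {
            t.Fatal(err)
        }
        for _, f := range files {
            if strings.HasSuffix(f, "_test.go") || allowed[f] {
                continue
            }
            src, err := os.ReadFile(f)
            if err != nil {
                t.Fatal(err)
            }
            // "provisionWorkspace(" does not match "provisionWorkspaceAuto(",
            // so the dispatcher's callers pass cleanly.
            for _, call := range []string{"provisionWorkspace(", "provisionWorkspaceCP("} {
                if strings.Contains(string(src), call) {
                    t.Errorf("%s calls %s directly; route through provisionWorkspaceAuto", f, call)
                }
            }
        }
    }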
Test plan:
- TestOrgImport_UsesAutoNotDirectDockerPath: pin the org_import.go
call site
- TestNoCallSiteCallsDirectProvisionerExceptAuto: generic gate against
future drift
- TestTeamExpand_UsesAutoNotDirectDockerPath (existing): symmetric for
team.go
All 3 + the rest of the handler suite pass.
Closes #2486
Pairs with: PR #2794 (configurable provision concurrency), which made
it possible to bisect concurrency-vs-routing as the cause.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The §9c "Memory KV Edit round-trip" gate (added in #2787) captured the
expected-409 status code via:
$(tenant_call ... -w "%{http_code}" || echo "000")
tenant_call uses CURL_COMMON which carries --fail-with-body. On the
expected 409, curl exits 22; the `|| echo "000"` then fires and
appends "000" to the captured stdout — yielding "409000" instead of
"409", failing the gate even though the contract was satisfied.
Caught on PR #2792's first E2E run (status got "409000"). Has been
silently failing the staging-SaaS E2E since #2787 merged earlier
today; nothing else surfaced it because the workflow is informational,
not required.
Fix: route -w into its own tempfile so curl's exit code can't pollute
the captured stdout. Wrap with set +e/-e so the 22 doesn't trip the
outer pipeline. Same shape as the §7c gate fix that PR #2779/#2783
landed for the same class of bug.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes part of #2790 (Phase A). The Python total floor at 86% (set in
workspace/pytest.ini, issue #1817) averages over ~6000 lines, so a
single MCP-critical file could regress to ~50% with no CI complaint as
long as other modules compensate. This is the same distribution gap
that #1823 closed Go-side: total floor passes while a critical handler
sits at 0%.
Added gates for these five files (per-file floor 75%):
- workspace/a2a_mcp_server.py — MCP dispatcher (PR #2766 / #2771)
- workspace/mcp_cli.py — molecule-mcp standalone CLI entry
- workspace/a2a_tools.py — workspace-scoped tool implementations
- workspace/inbox.py — multi-workspace inbox + per-workspace cursors
- workspace/platform_auth.py — per-workspace token resolver
These handle multi-tenant routing, auth tokens, and inbox dispatch.
Risk shape mirrors Go-side tokens*/secrets* — a 0%/50% file here is
exactly where the PR #2766 dispatcher bug class slips through without
a structural test.
Floor 75% is strictly additive — current actuals 80-96% (measured
2026-05-04). No existing PR fails. Ratchet plan in COVERAGE_FLOOR.md:
target 90% by 2026-08-04.
Implementation: pytest already writes .coverage; new step emits a JSON
view scoped to the critical files via `coverage json --include="*name"`,
then jq extracts each file's percent_covered. Keys are matched on the
exact path rather than basename, so workspace/builtin_tools/a2a_tools.py
(a different file, already at 100%) doesn't shadow workspace/a2a_tools.py.
Verified locally with the actual coverage data:
- floor=75 → 0 failures (matches current state)
- floor=81 → 1 failure (a2a_tools.py at 80%) — proves the gate trips
Pairs with PR #2791 (Phase B — schema↔dispatcher AST drift gate). Phase
C (molecule-mcp e2e harness) remains the largest piece in #2790.
YAML validated locally before commit per
feedback_validate_yaml_before_commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Org-import was hard-capped at 3 concurrent workspace provisions (#1084),
calibrated for Docker-mode workspaces where each provision was a
docker-run. Now that workspaces are EC2 instances, AWS RunInstances
parallelises happily and the artificial cap of 3 makes a 7-workspace
org-import take 3-4× longer than necessary (3 batches × ~70s/provision
≈ 4 min wall time when AWS could absorb all 7 in parallel for ~70s).
This PR makes the cap configurable via MOLECULE_PROVISION_CONCURRENCY:
unset → 3 (Docker-mode default, unchanged)
"0" → effectively unlimited (SaaS / EC2 backend; AWS rate-limit
+ vCPU quota are the real backpressure)
N>0 → exactly N
N<0 → fall back to default 3 + warning log
garbage → fall back to default 3 + warning log
The "0 = unlimited" mapping is the user-facing convention requested for
SaaS deployments — operators don't have to pick an arbitrary large
number. Implementation hands off 1<<20 internally so the channel-based
semaphore stays a no-op without infinite-buffer risk.
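A sketch of the resolution logic (function name and log wording are
assumptions; the mapping is the table above):

    package handlers

    import (
        "log"
        "os"
        "strconv"
        "strings"
    )

    const defaultProvisionConcurrency = 3

    func resolveProvisionConcurrency() int {
        raw := strings.TrimSpace(os.Getenv("MOLECULE_PROVISION_CONCURRENCY"))
        if raw == "" {
            return defaultProvisionConcurrency // Docker-mode default, unchanged
        }
        n, err := strconv.Atoi(raw)
        switch {
        case err != nil:
            log.Printf("WARN: MOLECULE_PROVISION_CONCURRENCY=%q is not an integer; using default %d", raw, defaultProvisionConcurrency)
            return defaultProvisionConcurrency
        case n < 0:
            log.Printf("WARN: MOLECULE_PROVISION_CONCURRENCY=%d is negative; using default %d", n, defaultProvisionConcurrency)
            return defaultProvisionConcurrency
        case n == 0:
            return 1 << 20 // "unlimited": channel semaphore stays a no-op, no infinite buffer
        default:
            return n
        }
    }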
Test coverage (org_provision_concurrency_test.go, 6 cases / 15 subtests):
- unset → default
- "0" → large unlimited cap
- positive integer exact (1, 5, 10, 50)
- negative → default + warning
- non-numeric → default + warning
- whitespace-trimmed (" 7 " → 7)
Boot-time log line confirms the resolved cap so an operator can verify
their env is being honored without re-deploying.
Does NOT address the separate 600s "never registered" timeout the user
also reported during org-import — that's filed as molecule-core#2793
for proper investigation (parallel-provision contention, network
routing, register-retry budget, or container-start failure are all
candidates and need live SSM capture to bisect).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Parent → child knowledge sharing previously lived behind a `shared_context`
list in config.yaml: at boot, every child workspace HTTP-fetched its parent's
listed files via GET /workspaces/:id/shared-context and prepended them as
a "## Parent Context" block. That paid the full transfer cost on every
boot regardless of whether the agent needed it, single-parent SPOF, no team
or org scope, and broken if the parent was unreachable.
Replace with memory v2's team:<id> namespace: agents call recall_memory
on demand. For large blob-shaped artefacts see RFC #2789 (platform-owned
shared file storage).
Removed:
- workspace/coordinator.py: get_parent_context()
- workspace/prompt.py: parent_context arg + injection block
- workspace/adapter_base.py: import + call + arg pass
- workspace/config.py: shared_context field + parser entry
- workspace-server/internal/handlers/templates.go: SharedContext handler
- workspace-server/internal/router/router.go: GET /shared-context route
- canvas/src/components/tabs/ConfigTab.tsx: Shared Context tag input
- canvas/src/components/tabs/config/form-inputs.tsx: schema field + default
- canvas/src/components/tabs/config/yaml-utils.ts: serializer entry
- 6 tests pinning the removed behavior; 5 doc references
Added regression gates so any reintroduction is loud:
- workspace/tests/test_prompt.py: build_system_prompt must NOT emit
"## Parent Context"
- workspace/tests/test_config.py: legacy YAML key loads cleanly but
shared_context attr must NOT exist on WorkspaceConfig
- tests/e2e/test_staging_full_saas.sh §9d: GET /shared-context must NOT
return 200 against a live tenant
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes part of #2790 (Phase B). Prevents a recurrence of the PR #2766 →
PR #2771 cycle: PR #2766 added ``source_workspace_id`` to four tools'
``input_schema`` and tool implementations, but the dispatcher in
``a2a_mcp_server.handle_tool_call`` silently dropped the kwarg for
``commit_memory`` / ``recall_memory`` / ``chat_history`` /
``get_workspace_info``. Schema lied; LLMs populated the param; every
call fell back to ``WORKSPACE_ID``, defeating multi-tenant isolation.
Existing dispatcher tests asserted return-value substrings (``"working"
in result``) instead of kwarg flow, so the bug shipped to main and was
only caught by re-reviewing post-merge.
This change adds an AST-driven gate. For every ToolSpec in
platform_tools.registry.TOOLS, the gate finds the matching
``elif name == "<tool>"`` arm in a2a_mcp_server.py and asserts that
every property declared in input_schema.properties is read by an
``arguments.get("<property>", ...)`` call inside that arm. A new schema
field the dispatcher forgets to forward fails CI loudly.
Three tests:
- test_every_dispatch_arm_reads_every_schema_property: main drift gate.
Walks registry, matches dispatch arms by name, diffs declared vs
read keys.
- test_dispatch_arms_reach_every_registered_tool: inverse direction.
A registered tool with no dispatch arm is "Unknown tool" at runtime,
even though docs/wrappers/schema all advertise it. Catches PRs that
add a ToolSpec but forget the dispatcher.
- test_drift_gate_self_check_finds_known_arms: pin the AST parser. If
handle_tool_call is refactored into a different shape (dict dispatch,
registry-driven, etc.) and _load_dispatch_arms returns {}, the main
gate vacuously passes — this self-check makes that failure mode
explicit by requiring 12 known arms to be discovered.
Verified the gate catches the PR #2766 bug: stripping
``source_workspace_id=arguments.get(...)`` from the commit_memory arm
fails the gate with a descriptive error pointing at the missing kwarg
and referencing the prior incident. Restored → 3 tests pass.
Suite: 1733 passed (was 1730 + 3 new), 3 skipped, 2 xfailed.
Why AST, not runtime invocation: the runtime mock-based tests in
test_a2a_mcp_server.py already assert kwargs flow correctly for four
explicitly-tested tools. This gate is cheaper (~1ms), catches new
properties before someone has to remember the runtime test, and runs
as a structural invariant.
Phase A (Python coverage floor) and Phase C (molecule-mcp e2e harness)
remain in #2790 as separate follow-ups.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>