Simplify pass on top of the canary fix:
- Drop the three CP commit SHAs from comments — issue #2090 covers
the audit trail, SHAs would rot.
- Pull the inline `900` into TLS_TIMEOUT_SEC=$((15 * 60)) so the
bash mirrors the TS side (15 min) at a glance.
- TENANT_HOST extraction now strips http(s) AND any port suffix, so
getent doesn't silently fail on a ws://host:443 style URL.
- sed-redact Authorization/Cookie out of the curl -v dump, defensive
against future callers adding an auth header to this probe.
Pure cleanup; no behaviour change to the happy path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Canary #2090 has been red for 6 consecutive runs over 4+ hours, all
timing out at the TLS-readiness step exactly at the 10-min cap. Time
window correlates with three CP commits that landed today/yesterday
and changed EC2 boot behaviour:
- molecule-controlplane@a3eb8be — fix(ec2): force fresh clone of /opt/adapter
- molecule-controlplane@ed70405 — feat(sweep): wire up healthcheck loop
- molecule-controlplane@4ab339e — fix(provisioner): aggregate cleanup errors
Two changes here, both surgical:
1. Bump the bash-side TLS deadline from 600s to 900s, and the canvas TS
mirror from 10m to 15m. Stays below the 20-min provision envelope
(so a genuinely-stuck tenant still fails loud at the earlier
provision step instead of masquerading as TLS).
2. On TLS-timeout, dump a diagnostic burst before exiting:
- getent hosts $TENANT_HOST (DNS resolution state)
- curl -kv $TENANT_URL/health (TLS handshake + HTTP layer)
The previous failure log was just "no 2xx in N min" with no signal
for which layer was actually broken. After this, the next timeout
tells us whether DNS, TLS handshake, or HTTP layer is the culprit
so the CP root cause can be isolated without speculation.
This is the unblock; a separate molecule-controlplane issue tracks the
underlying regression suspicion.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI on PR #2105 caught 7 Create-handler tests still mocking the
pre-#1408 10-arg INSERT signature. With the column now wired
unconditionally into the INSERT, every WithArgs that pinned
budget_limit as the 10th arg needed an 11th slot for the resolved
max_concurrent_tasks value.
Files:
- workspace_test.go: 6 tests (DBInsertError, DefaultsApplied,
WithSecrets_Persists, TemplateDefaultsMissingRuntimeAndModel,
TemplateDefaultsLegacyTopLevelModel, CallerModelOverridesTemplateDefault)
- workspace_budget_test.go: 1 test (Budget_Create_WithLimit)
All resolved values are the schema-default mirror, so the test
expectation reads as the same models.DefaultMaxConcurrentTasks
const that the handler writes. New imports added to both files.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Simplify pass on top of the wire-up commit:
- New const models.DefaultMaxConcurrentTasks = 1; handlers and tests
reference the symbol so the schema-default mirror lives in one place.
- Strip 5 multi-line comments that narrated what the code does.
- Drop the duplicate field-rationale on OrgWorkspace; the one on
CreateWorkspacePayload is canonical.
- Drop test-side positional comments that would silently lie if columns
get reordered.
Pure cleanup; no behaviour change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 4 of #1408 (active_tasks counter). Runtime increment/decrement,
schema column (037), and scheduler enforcement (scheduler.go:312)
already shipped — but the write path from template config.yaml +
direct API was missing, so every workspace silently fell through to
the schema default of 1. Leaders that set max_concurrent_tasks: 3 in
their org template were getting 1 anyway, defeating the entire
feature for the use case it was built for (cron-vs-A2A contention on
PM/lead workspaces).
- OrgWorkspace gains MaxConcurrentTasks (yaml + json tags)
- CreateWorkspacePayload gains MaxConcurrentTasks (json tag)
- Both INSERTs now write the column unconditionally; 0/omitted
payload value falls back to 1 (schema default mirror) so the wire
stays single-shape — no forked column list / goto.
- Existing Create-handler test mocks updated to expect the 11th arg.
- New TestWorkspaceCreate_MaxConcurrentTasksOverride locks the
payload→DB propagation for the leader case (value=3).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Copilot Auto-fix in 5a8f42b4 addressed the duplicate-import lint by
removing 'import claude_sdk_executor as _executor_mod' entirely, but the
async wedge tests (test_execute_marks_wedge_*, test_execute_clears_wedge_*)
still call _executor_mod._reset_sdk_wedge_for_test() etc. — so they failed
with NameError once that line was removed.
Restore the alias, but at the top of the file (alongside the other module-
level imports) rather than at line 1248. The late-file binding was the
proximate cause of the original CI failure: with --cov enabled (#1817),
sys.settrace + the @pytest.mark.asyncio wrapper combination caused the
late module-level binding to not be visible from inside the async test
bodies, even though the binding existed at module-load time. Hoisting
fixes that scope-resolution issue.
Verified locally with the exact CI config (--cov-fail-under=86):
1280 passed, 2 xfailed — total coverage 90.25%
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Three files conflicted with staging changes that landed while this PR
sat open. Resolved each by combining both intents (not picking one side):
- a2a_proxy.go: keep the branch's idle-timeout signature
(workspaceID parameter + comment) AND apply staging's #1483 SSRF
defense-in-depth check at the top of dispatchA2A. Type-assert
h.broadcaster (now an EventEmitter interface per staging) back to
*Broadcaster for applyIdleTimeout's SubscribeSSE call; falls through
to no-op when the assertion fails (test-mock case).
- a2a_proxy_test.go: keep both new test suites — branch's
TestApplyIdleTimeout_* (3 cases for the idle-timeout helper) AND
staging's TestDispatchA2A_RejectsUnsafeURL (#1483 regression). Updated
the staging test's dispatchA2A call to pass the workspaceID arg
introduced by the branch's signature change.
- workspace_crud.go: combine both Delete-cleanup intents:
* Branch's cleanupCtx detachment (WithoutCancel + 30s) so canvas
hang-up doesn't cancel mid-Docker-call (the container-leak fix)
* Branch's stopAndRemove helper that skips RemoveVolume when Stop
fails (orphan sweeper handles)
* Staging's #1843 stopErrs aggregation so Stop failures bubble up
as 500 to the client (the EC2 orphan-instance prevention)
Both concerns satisfied: cleanup runs to completion past canvas
hangup AND failed Stop calls surface to caller.
Build clean, all platform tests pass.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
PR #2103 widened the SSRF saasMode branch to also relax RFC-1918 + ULA
under MOLECULE_ENV=development (so the docker-compose dev pattern stops
rejecting workspace registrations on 172.18.x.x bridge IPs). The
existing TestIsSafeURL_DevMode_StillBlocksOtherRanges covered the
security floor (metadata / TEST-NET / CGNAT stay blocked), but no
test asserted the positive side — that 10.x / 172.x / 192.168.x / fd00::
ARE now allowed under dev mode.
Without this test, a future refactor that quietly drops the
`|| devModeAllowsLoopback()` from isPrivateOrMetadataIP wouldn't trip
any assertion, and the docker-compose dev loop would silently re-break.
Adds TestIsSafeURL_DevMode_AllowsRFC1918 — table of 4 URLs covering
the three RFC-1918 IPv4 ranges + IPv6 ULA fd00::/8. Sets
MOLECULE_DEPLOY_MODE=self-hosted explicitly so the test exercises the
devMode branch, not a SaaS-mode pass.
Closes the Optional finding I left on PR #2103.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The production-side end of the runtime CD chain. Operators (or the post-
publish CI workflow) hit this after a runtime release to pull the latest
workspace-template-* images from GHCR and recreate any running ws-* containers
so they adopt the new image. Without this, a freshly published runtime sat in
the registry but containers kept the old image until naturally cycled.
Implementation notes:
- Uses Docker SDK ImagePull rather than shelling out to docker CLI — the
alpine platform container has no docker CLI installed.
- ghcrAuthHeader() reads GHCR_USER + GHCR_TOKEN env, builds the base64-
encoded JSON payload Docker engine expects in PullOptions.RegistryAuth.
Both empty → public/cached images only; both set → private GHCR pulls.
- Container matching uses ContainerInspect (NOT ContainerList) because
ContainerList returns the resolved digest in .Image, not the human tag.
Inspect surfaces .Config.Image which is what we need.
- Provisioner.DefaultImagePlatform() exported so admin handler picks the
same Apple-Silicon-needs-amd64 platform as the provisioner — single
source of truth for the multi-arch override.
Local-dev companion: scripts/refresh-workspace-images.sh runs on the
host and inherits the host's docker keychain auth — alternate path for
when GHCR_USER/TOKEN aren't set in the platform env.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Three env vars the platform now reads:
- MOLECULE_ENV=development (default) — activates the WorkspaceAuth /
AdminAuth dev fail-open path so the canvas's bearer-less requests pass
through. Also unlocks RFC-1918 relaxation in the SSRF guard so docker-
bridge IPs work. Override to 'production' for staged deploys.
- GHCR_USER + GHCR_TOKEN — feed POST /admin/workspace-images/refresh's
ImagePull auth payload. Both empty → endpoint can pull cached/public
images only. Set with a fine-grained PAT (read:packages on Molecule-AI
org) to pull private GHCR images.
- MOLECULE_IMAGE_PLATFORM=linux/amd64 (default) — workspace-template-*
images ship single-arch amd64. On Apple Silicon hosts, the daemon's
native linux/arm64/v8 request misses the manifest and pulls fail.
Forcing amd64 makes Docker Desktop run them under Rosetta — slower
(~2-3×) but functional.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Pre-fix, POST /workspaces/:id/notify (the side-channel agents use to push
interim updates and follow-up results) only broadcast via WebSocket — no
DB write. When the user refreshed the page, the chat-history loader
(which queries activity_logs) couldn't restore those messages and they
vanished from the chat.
This hits the most common path: when the platform's POST /a2a times out (idle),
the runtime keeps working and eventually pushes its reply via
send_message_to_user. The reply rendered live but disappeared on reload.
Fix: also INSERT an activity_logs row with the shape the existing loader
already understands (type=a2a_receive, source_id=NULL, response_body=
{result: text}). Persistence is best-effort — a DB hiccup doesn't block
the WebSocket push (which the user is already seeing).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
The docker-compose dev pattern puts platform and workspace containers on
the same docker bridge network (172.18.0.0/16, RFC-1918). The runtime
registers via its docker-internal hostname which DNS-resolves to a
172.18.x.x IP. The SSRF defence's isPrivateOrMetadataIP rejected those,
so every workspace POST through the platform proxy returned
'workspace URL is not publicly routable' — breaking the entire docker-
compose dev loop.
Fix: in isPrivateOrMetadataIP, treat MOLECULE_ENV=development the same
as SaaS mode for RFC-1918 relaxation. Both share the 'trusted intra-
network routing' property — SaaS is sibling EC2s in the same VPC, dev
is sibling containers on the same docker bridge. Always-blocked
categories (metadata link-local, TEST-NET, CGNAT) stay blocked.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
When proxyA2A returns 202+{queued:true} (target busy → enqueued for drain
on next heartbeat), executeDelegation previously treated it as a successful
completion and ran extractResponseText on the queued JSON. The result was
'Delegation completed (workspace agent busy — request queued, will dispatch...)'
landing in activity_logs.summary, which the LLM then echoed to the user
chat as garbage.
Two fixes:
1. delegation.go: detect queued shape via new isQueuedProxyResponse helper,
write status='queued' with clean summary 'Delegation queued — target at
capacity', store delegation_id in response_body so the drain can stitch
back later. Also embed delegation_id in params.message.metadata + use it
as messageId so the proxy's idempotency-key path keys off the same id.
2. a2a_queue.go: when DrainQueueForWorkspace successfully drains a queued
item, extract delegation_id from the body's metadata and UPDATE the
originating delegate_result row (queued → completed with real
response_body). Broadcast DELEGATION_COMPLETE so the canvas chat feed
flips the queued line to completed in real time.
Closes the loop so check_task_status reflects ground truth instead of
perpetual 'queued' even after the queued request eventually drained.
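
As a rough illustration of the queued-shape detection in fix 1: the
helper name isQueuedProxyResponse comes from this change, but its
signature and body below are assumptions made for the sketch.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// isQueuedProxyResponse reports whether a proxy reply is the
// "202 + {queued:true}" shape rather than a real completion.
// (Signature assumed for illustration.)
func isQueuedProxyResponse(status int, body []byte) bool {
	if status != http.StatusAccepted {
		return false
	}
	var payload struct {
		Queued bool `json:"queued"`
	}
	if err := json.Unmarshal(body, &payload); err != nil {
		return false // not JSON, so not the queued shape
	}
	return payload.Queued
}

func main() {
	fmt.Println(isQueuedProxyResponse(202, []byte(`{"queued":true}`)))
	fmt.Println(isQueuedProxyResponse(200, []byte(`{"result":"done"}`)))
}
```

Checking both the status code and the body flag keeps an ordinary 202
without the queued marker from being misread as a queued delegation.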
🤖 Generated with [Claude Code](https://claude.com/claude-code)
The staging E2E suite already greps for 5 known regression patterns
in the A2A response (hermes-agent 401, model_not_found, Encrypted
content, Unknown provider, hermes-agent unreachable). The comment
block at lines 386-395 lists "Invalid API key" as the signal for the
CP #238 boot-event 401 race + stale OPENAI_API_KEY paths, but the
explicit grep was never added — meaning a regression in that class
would slip through the generic `error|exception` catch-all.
Closes the gap with one specific-pattern check that fails loud with
the relevant bug references in the message.
Verified `bash -n` clean; pre-existing shellcheck SC2015 at line 88
is unrelated.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 3 skipped tests in workspace_provision_test.go (#1206 regression
tests) were blocked because captureBroadcaster's struct-embed wouldn't
type-check against WorkspaceHandler.broadcaster's concrete
*events.Broadcaster field. This PR fixes the interface blocker for
the 2 broadcaster-related tests; the 3rd (plugins.Registry resolver)
is a separate blocker tracked elsewhere.
Changes:
- internal/events/broadcaster.go: define `EventEmitter` interface with
RecordAndBroadcast + BroadcastOnly. *Broadcaster satisfies it via
its existing methods (compile-time assertion guards future drift).
SubscribeSSE / Subscribe stay off the interface because only sse.go
+ cmd/server/main.go call them, and both still hold the concrete
*Broadcaster.
- internal/handlers/workspace.go: WorkspaceHandler.broadcaster type
changes from *events.Broadcaster to events.EventEmitter.
NewWorkspaceHandler signature updated to match. Production callers
unchanged — they pass *events.Broadcaster, which the interface
accepts.
- internal/handlers/activity.go: LogActivity takes events.EventEmitter
for the same reason — tests passing a stub no longer need to
construct the full broadcaster.
- internal/handlers/workspace_provision_test.go: captureBroadcaster
drops the struct embed (no more zero-value Broadcaster underlying
the SSE+hub fields), implements RecordAndBroadcast directly, and
adds a no-op BroadcastOnly to satisfy the interface. Skip messages
on the 2 empty broadcaster-blocked tests updated to reflect the
new "interface unblocked, test body still needed" state.
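
The interface extraction above can be sketched as follows. The method
signatures, the event name, and the logActivity helper are hypothetical
and chosen only for illustration; the real methods on *Broadcaster may
take different parameters.

```go
package main

import "fmt"

// EventEmitter is the narrow surface handlers depend on.
// (Parameter lists are assumed for this sketch.)
type EventEmitter interface {
	RecordAndBroadcast(eventType, workspaceID string, payload any) error
	BroadcastOnly(workspaceID, eventType string, payload any)
}

// Broadcaster stands in for the concrete production type.
type Broadcaster struct{}

func (*Broadcaster) RecordAndBroadcast(eventType, workspaceID string, payload any) error {
	return nil
}
func (*Broadcaster) BroadcastOnly(workspaceID, eventType string, payload any) {}

// SubscribeSSE stays off the interface; callers that need it keep the
// concrete *Broadcaster.
func (*Broadcaster) SubscribeSSE(workspaceID string) {}

// Compile-time assertion: the build breaks if *Broadcaster ever stops
// satisfying EventEmitter.
var _ EventEmitter = (*Broadcaster)(nil)

// captureBroadcaster is a test stub with no embedded Broadcaster.
type captureBroadcaster struct{ events []string }

func (c *captureBroadcaster) RecordAndBroadcast(eventType, workspaceID string, payload any) error {
	c.events = append(c.events, eventType)
	return nil
}
func (c *captureBroadcaster) BroadcastOnly(workspaceID, eventType string, payload any) {}

// logActivity accepts either the production type or the stub.
func logActivity(e EventEmitter, workspaceID string) error {
	return e.RecordAndBroadcast("ACTIVITY_LOGGED", workspaceID, nil)
}

func main() {
	stub := &captureBroadcaster{}
	_ = logActivity(stub, "ws-1")
	_ = logActivity(&Broadcaster{}, "ws-1")
	fmt.Println(stub.events)
}
```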
Verified `go build ./...`, `go test ./internal/handlers/`, and
`go vet ./...` all clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The token-cache helper had three #1552 findings, all in the
mode-600-after-the-fact pattern:
1. _write_cache writes .tmp with default umask (typically 022 → 644
on disk) and then chmod 600's after the mv. A concurrent reader
in that microsecond-wide window sees the token at mode 644.
2. Each chmod was swallowed via `|| true` — if it ever fails, the
tokens stay world-readable with no operator signal.
3. _refresh_gh's gh_token_file write has the same shape and same
two issues.
Hardening:
- Wrap the .tmp creates in a `umask 077` block so the files are 600
from creation. Restore the previous umask before return so callers
aren't perturbed.
- Replace `chmod ... 2>/dev/null || true` with `if ! chmod ...; then
echo WARN ...; fi`. A chmod failure is a real signal worth grep'ing.
- Apply the same pattern to the _refresh_gh gh_token_file path.
`local` is illegal in a top-level case branch, so use a uniquely-
named global (_gh_prev_umask) and unset it after.
Verified `bash -n` clean and `shellcheck` clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes a 4+ cycle Canvas tabs E2E flake pattern that's been blocking
staging→main PRs since 2026-04-24+ (#2096, #2094, #2055, #2079, ...).
Root cause: TLS_TIMEOUT_MS=180s (3 min) is too tight for the layered
realities of staging tenant TLS readiness:
1. Cloudflare DNS propagation through the edge (1-2 min typical)
2. Tenant CF Tunnel registering the new hostname (1-2 min)
3. CF edge ACME cert provisioning + cache (1-3 min)
Each layer can add 1-3 min on its own under heavy staging load — the
realistic worst case is well past the 3-min cap.
Provision and workspace-online timeouts were already raised to 20 min
(staging-setup.ts:42-46 history). The TLS gate was the remaining
under-budgeted step. Bumping to 10 min keeps it inside the 20-min
PROVISION envelope so a genuinely-stuck tenant still fails loud at
the earlier provision step rather than masquerading as a TLS issue.
Both call sites raised together:
- canvas/e2e/staging-setup.ts: TLS_TIMEOUT_MS = 10 * 60 * 1000
- tests/e2e/test_staging_full_saas.sh: TLS_DEADLINE += 600
Each carries an inline rationale comment so the next reviewer sees
the layer-by-layer decomposition without re-reading the issue thread.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The sweep-cf-orphans workflow shipped in #2088 was noisier than
intended in two ways. This PR fixes both. It follows up on the
Optional finding I left on the original review, which now matters
because the noise is observably hitting the merge queue.
1) `merge_group: types: [checks_requested]` was firing the entire
sweep job on every PR through the merge queue. The original intent
("future required-check support without a workflow edit") never
materialized, and meanwhile every recent merge-queue eval (#2091,
#2092, #2093, #2094, #2095, #2097) generated a red `Sweep CF
orphans (merge_group)` run.
Drop the trigger. Comment in the workflow explains the re-add path
if/when the workflow IS wired as a required check (re-add the
trigger AND gate the actual sweep step with
`if: github.event_name != 'merge_group'` so merge-queue evals are
no-op success).
2) The `Verify required secrets present` step exits 2 when the 6
secrets aren't configured yet (the PR body's post-merge step,
still pending). That turns the hourly schedule into an hourly red
CI run for as long as the secrets stay unset.
Convert to a soft skip: emit a `::warning::` listing the missing
secrets and set a `skip=true` step output, then gate the sweep
step with `if: steps.verify.outputs.skip != 'true'`. Workflow
reports green and ops still sees the warning when they review
recent runs.
Net effect:
- merge-queue evals stop generating spurious red runs
- the schedule reports green-with-warning until secrets land
- once secrets land, behavior is identical to today's (real sweep
runs, hard-fails if a secret is later removed)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#1483 flagged that dispatchA2A() doesn't call isSafeURL internally —
the guard exists only at the caller level (resolveAgentURL at
a2a_proxy.go:424). The primary call path through proxyA2ARequest
is safe today, but if any future code path ever calls dispatchA2A
directly without going through resolveAgentURL, the SSRF check
would be silently bypassed.
This adds the one-line defense-in-depth guard the issue prescribed:
    if err := isSafeURL(agentURL); err != nil {
        return nil, nil, &proxyDispatchBuildError{err: err}
    }
Wrapping as *proxyDispatchBuildError preserves the existing caller
error-classification path — the same shape that maps to 500 elsewhere.
Adds TestDispatchA2A_RejectsUnsafeURL pinning the contract:
re-enables SSRF for the test (setupTestDB disables it for normal
unit tests), passes a metadata IP, asserts the build error returns
and cancel is nil so no resource is leaked.
The 4 existing dispatchA2A unit tests use setupTestDB → SSRF
disabled, so they continue passing unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#1484 flagged that discoverHostPeer() and writeExternalWorkspaceURL()
return URLs sourced from the workspaces table without an isSafeURL
check. Workspace runtimes register their own URLs via /registry/register
— a misbehaving / compromised runtime could register a metadata-IP URL.
Today both functions are gated by Phase 30.6 bearer-required Discover,
so exposure is theoretical. The fix makes them safe regardless of
upstream auth shape.
Changes:
- discoverHostPeer: isSafeURL on resolved URL before responding;
503 + log on rejection.
- writeExternalWorkspaceURL: same guard applied to the post-rewrite
outURL (so a host.docker.internal rewrite is checked AND a
metadata-IP that survived the rewrite untouched is rejected).
- 3 new regression tests:
* RejectsMetadataIPURL on host-peer path (169.254.169.254 → 503)
* AcceptsPublicURL on host-peer path (8.8.8.8 → 200; positive
counterpart so the rejection test can't pass via universal-fail)
* RejectsMetadataIPURL on external-workspace path
setupTestDB already disables SSRF checks via setSSRFCheckForTest,
so the 16+ existing discovery tests remain untouched. Only the new
tests opt in to enabled SSRF.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Extract walkTemplateConfigs(configsDir, fn) shared helper. Both
templates.List and loadRuntimeProvisionTimeouts walked configsDir
+ parsed config.yaml — same boilerplate twice. Now centralised so
a future template-discovery rule (subdir naming, README sentinel,
etc.) lands in one place.
- templates.List uses the walker — net -10 lines.
- loadRuntimeProvisionTimeouts uses the walker — net -10 lines.
- Document runtimeProvisionTimeoutsCache as 'NOT SAFE for
package-level reuse' so a future change doesn't accidentally
promote it to a singleton (sync.Once can't be reset → tests
would lock out other fixtures).
Skipped (review finding): atomic.Pointer[map[string]int] for
future hot-reload. The doc comment already documents the
limitation; YAGNI-promoting the primitive now would buy a
not-yet-built feature at the cost of more code today.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 2 of #2054 — workspace-server reads runtime-level
provision_timeout_seconds from template config.yaml manifests and
includes provision_timeout_ms in the workspace List/Get response.
Phase 1 (canvas, #2092) already plumbs the field through socket →
node-data → ProvisioningTimeout's resolver, so the moment a
template declares the field the per-runtime banner threshold
adjusts without a canvas release.
Implementation:
- templates.go: parse runtime_config.provision_timeout_seconds in
the templateSummary marshaller. The /templates API now surfaces
the field too — useful for ops dashboards and future tooling.
- runtime_provision_timeouts.go (new): loadRuntimeProvisionTimeouts
scans configsDir, parses every immediate subdir's config.yaml,
returns runtime → seconds. Multiple templates with the same
runtime: max wins (so a slow template's threshold doesn't get
cut by a fast template's). Bad/empty inputs are silently
skipped — workspace-server starts cleanly with no templates.
- runtimeProvisionTimeoutsCache: sync.Once-backed lazy cache.
First workspace API request after process start pays the read
cost (~few KB across ~50 templates); every subsequent request is
a map lookup. Cache lifetime = process lifetime; invalidates on
workspace-server restart, which is the normal template-change
cadence.
- WorkspaceHandler gets a provisionTimeouts field (zero-value struct
is valid — the cache lazy-inits on first get()).
- addProvisionTimeoutMs decorates the response map with
provision_timeout_ms (seconds × 1000) when the runtime has a
declared timeout. Absent = no key in the response, canvas falls
through to its runtime-profile default. Wired into both List
(per-row decoration in the loop) and Get.
Tests (5 new in runtime_provision_timeouts_test.go):
- happy path: hermes declares 720, claude-code doesn't, only
hermes appears in the map
- max-on-duplicate: same runtime in two templates → max wins
- skip-bad-inputs: missing runtime, zero timeout, malformed yaml,
loose top-level files all silently ignored
- missing-dir: returns empty map, no crash
- cache: lazy-init on first get; subsequent gets hit cache even
after underlying file changes (sync.Once contract); unknown
runtime returns zero
Phase 3 (separate template-repo PR): template-hermes config.yaml
declares provision_timeout_seconds: 720 under runtime_config.
canvas RUNTIME_PROFILES.hermes becomes redundant + removable.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 97% number from CI run 24956647701 was measured WITHOUT a
.coveragerc omit list. Once this PR's prescribed omit set is in
effect (`*/__init__.py`, `*/tests/*`, `plugins_registry/*` — files
that don't carry behavior), the actual measurement of behavior-bearing
code on the same staging snapshot is 91.11% (run 24957664272).
86% sits at the issue's prescribed `current − 5pp` margin and
unblocks CI without lowering the bar in real terms.
The Python workspace already runs pytest-cov in CI but with no
threshold and inline-flagged config. CI run 24956647701 (2026-04-26
staging) reports 97% coverage on the package — well above the issue's
75% target. The actionable gap is locking in a floor so a regression
can't sneak past, and centralizing config so local `pytest` matches CI.
Changes:
- workspace/pytest.ini — coverage flags moved into addopts (-q,
--cov=., --cov-report=term-missing, --cov-fail-under=92).
92% = current 97% measurement minus the 5pp safety margin
the issue's Step 3 prescribes.
- workspace/.coveragerc (new) — [run] omit list and [report]
skip_covered. coverage.py doesn't read pytest.ini sections, so
the omit config has to live here.
- .github/workflows/ci.yml — removed the inline --cov flags from the
Python Lint & Test step; now reads from pytest.ini. Workflow stays
the same single-command shape, just simpler.
Result: any PR that drops coverage below 92% fails CI loudly. Floor
ratchets up by replacing 92 with current measurement on a future
test-writing pass — same shape as Go coverage gates landed elsewhere.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
simplify-review note: the |/,-delimited node string is brittle if a
future string-typed field is added without sanitization. Document
which fields are user-typed (name — already sanitized) vs primitive
(id is UUID, runtime is a slug, provisionTimeoutMs is numeric) so
the next field-add doesn't accidentally introduce an injection
vector for the splitter.
Skipped (false-positive review finding): the agent flagged the
prop > runtime-profile order as inconsistent with the docstring,
but the docstring explicitly lists the prop at #2 (between node and
runtime-profile) — matches both the implementation AND the original
behavior pre-#2054 (the prop was 'timeoutMs ?? runtime-profile').
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 1 of moving runtime UX knobs server-side. Builds the canvas
foundation: a workspace can carry its own provision_timeout_ms
(sourced server-side from a template manifest in a follow-up PR),
and ProvisioningTimeout's resolver respects it per-node.
Until now the resolver had a Props-level timeoutMs that applied to ALL
nodes — fine for tests but wrong for production where one batch
could mix runtimes (hermes 12-min cold boot alongside docker 2-min).
The runtime profile fallback already handles per-runtime defaults;
this PR adds the per-WORKSPACE override layer above that.
Resolution priority (most specific wins):
1. node.provisionTimeoutMs — server-declared per-workspace
override (this PR's new field)
2. timeoutMs prop — single-threshold test override
3. runtime profile in @/lib/runtimeProfiles
4. DEFAULT_RUNTIME_PROFILE
Changes:
- WorkspaceData (socket): add optional provision_timeout_ms
- WorkspaceNodeData: add optional provisionTimeoutMs
- canvas-topology hydrate: thread the field through to node.data
- ProvisioningTimeout: extend the serialized-string node iteration
to carry provisionTimeoutMs (4-field positional split); pass as
the second arg to provisionTimeoutForRuntime
- 3 new tests in ProvisioningTimeout.test.tsx covering hydrate
threading, null fall-through, and resolver priority
Phase 2 (separate PR, blocked on workspace-server template-config
loader): workspace-server reads provision_timeout_seconds from
template config.yaml at provision time, includes
provision_timeout_ms in the workspace API/socket response. Phase 3
(template-repo PR): template-hermes config.yaml declares
provision_timeout_seconds: 720; canvas RUNTIME_PROFILES.hermes
becomes redundant and can be removed.
19/19 tests pass (3 new + 16 existing).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GitHub Code Quality bot flagged the empty `except (AttributeError,
TypeError): pass` at workspace/a2a_executor.py:424 as a nit on PR #1783.
The suppression IS intentional — `new_agent_text_message()` returns
a plain string in MagicMock paths in tests where assignment to
`.metadata` raises despite hasattr being true.
This:
- Adds a why-comment citing the test-mock motivation, commit
dcbcf19 (the original guard), and issue #1787 so the next
code-quality pass doesn't re-flag it.
- Adds `logger.debug("metadata attach skipped (non-Message ...")`
for observability — debug-level so production logs stay quiet
but ops can flip the level if metadata loss is ever suspected.
Behavior unchanged. 43 existing a2a_executor tests still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Drop redundant 'aws --version' step. Script's own 'aws ec2
describe-instances' fails just as loud with a more actionable
error; the pre-check added ~1s with no signal value.
- timeout-minutes 10 → 3. Realistic worst case is ~2min (4 curls +
1 aws + N×CF-DELETE each individually capped at 10s by the
script's curl -m flag). 3 surfaces hangs within one cron tick
instead of burning the full interval.
- Document the schedule-vs-dispatch dry-run asymmetry inline so
the next reader doesn't need to trace input defaults.
- Add merge_group: types: [checks_requested] for queue parity with
runtime-pin-compat.yml — cheap insurance if this ever becomes a
required check.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes Molecule-AI/molecule-controlplane#239.
CF zone hit the 200-record quota 2026-04-23+ — every E2E and canary
left a record on moleculesai.app, and no scheduled job pruned them.
Provisions started failing with code 81045 ('Record quota exceeded').
The sweep-cf-orphans.sh script (PR #1978, with decision-function
unit tests added in #2079) already exists but no workflow fires it.
Adding it here as a parallel janitor to sweep-stale-e2e-orgs.yml:
- hourly schedule at :15 (offset from the e2e-orgs sweep at :00 so
the two converge cleanly without racing the same CP admin endpoint)
- workflow_dispatch with dry_run input default true (ad-hoc verify
without committing to deletes)
- workflow_dispatch with max_delete_pct input for major cleanups
(the script's own MAX_DELETE_PCT defaults to 50% as a safety gate)
- concurrency group prevents schedule + manual-dispatch from racing
the same zone
Why a separate workflow vs sweep-stale-e2e-orgs.yml:
- That workflow drives DELETE /cp/admin/tenants/:slug, assumes CP
has the org row. Doesn't catch records left when CP itself never
knew about the tenant (canary scratch, manual ops experiments)
or when the CP-side cascade's CF-delete branch failed.
- sweep-cf-orphans.sh enumerates the CF zone directly + matches
against live CP slugs + AWS EC2 names. Catches what the CP-driven
sweep can't.
Required secrets (will need to be set on the repo): CF_API_TOKEN,
CF_ZONE_ID, CP_PROD_ADMIN_TOKEN, CP_STAGING_ADMIN_TOKEN,
AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY. Pre-flight verify-secrets
step fails loud if any are missing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>