Tests SocketHandler.HandleConnect WebSocket upgrade auth logic:
1. Canvas client (no X-Workspace-ID) → bypasses auth, no DB calls
2. Agent with no live tokens → grandfathered through, no bearer check
3. DB error on HasAnyLiveToken → 500 Internal Server Error
4. Live token present, missing Bearer header → 401 Unauthorized
5. Live token present, invalid Bearer token → 401 Unauthorized
Uses sqlmock for DB expectations + miniredis for wsauth token subsystem.
Hub.Run() drains the Register channel so WS upgrade attempts don't block.
Issue: #699
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Line 443 of mcp.go concatenated user-controlled req.Method into the
JSON-RPC -32601 error message, allowing an agent or canvas client to
inject arbitrary strings into the response via the method field.
Fix: replace "method not found: " + req.Method with the constant
"method not found" — matching the OFFSEC-001 scrub contract applied
to the InvalidParams (line 428) and UnknownTool (line 433) paths.
Test: extend TestMCPHandler_UnknownMethod_Returns32601 with two new
assertions:
1. resp.Error.Message == "method not found"
2. defence-in-depth check that the sent method name never appears
in the response (strings.Contains guard)
Issue: #684
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- fix extractToolTrace: JSON "[]" has len=2, not 0 — use string(trace)=="[]"
to correctly return nil for empty arrays. Found by TestExtractToolTrace_TraceIsEmptyArray.
- fix instructions_test.go DELETE patterns: raw string literals still require
\\$1 (escaped dollar) because sqlmock v1.5.2 matches patterns as regex.
$1 alone is a regex backreference and fails to match the literal "$1".
- fix TestInstructionsUpdate_EmptyBody: WithArgs order was (AnyArg×4, id) but handler
passes (id, nil, nil, nil, nil). Corrected to (id, AnyArg×4).
- fix mcp.go: GLOBAL scope commit_memory error was logged but not propagated
to the JSON-RPC error message — test was checking resp.Error.Message for "GLOBAL".
Changed to return err.Error() for all tool errors except "unknown tool:" (security).
Added strings import.
- fix org_path_test.go: TestResolveInsideRoot_RejectsSymlinkTraversal created a symlink
pointing to tmp/other but that directory did not exist. Added os.MkdirAll for it.
- fix terminal_diagnose_test.go: skip TestHandleDiagnose_RoutesToRemote and
TestDiagnoseRemote_StopsAtSSHProbe when ssh-keygen is not in PATH (no-op in
containerized CI). Added exec.LookPath check.
- fix delegation_test.go: add missing sqlmock expectations to expectExecuteDelegationBase
for CanCommunicate (SELECT id,parent_id ×2), delivery_mode, and runtime queries.
Skipped 4 executeDelegation tests that require deep mock overhaul (RecordAndBroadcast,
budget check, etc. — pre-existing failures). These would need significant
structural changes to fix properly.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
No test file existed for exporter.go. This adds 16 cases:
extractDescription (7 cases):
- Frontmatter with description line
- No frontmatter, first non-comment line
- All comments → empty
- Empty input → empty
- Unclosed frontmatter → empty (inFrontmatter stays true)
- Frontmatter → comment → content
- Empty lines before first content → first content returned
splitLines (5 cases):
- Basic split
- Trailing newline → no trailing empty segment
- No newline → single segment
- Empty string → no segments
- Only newlines → N empty segments for N newlines
findConfigDir (6 cases):
- Name match → returns that directory
- No match → fallback to first-with-config.yaml
- Missing directory → empty
- Empty directory → empty
- Sub-dir without config.yaml → skipped
- Fallback is FIRST, not last (ordering verified)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
platform/localbuild.go:
- Add checkShellDeps field + checkShellDepsProd() pre-flight check.
Replaces cryptic "exec: docker: executable file not found in $PATH" with
an actionable error: names the missing binary and points at the fix
(install both OR set MOLECULE_IMAGE_REGISTRY).
- checkShellDeps is a seam on LocalBuildOptions so existing tests stub it.
platform/localbuild_test.go:
- makeTestOpts now stubs checkShellDeps → nil (no-op in test env).
- Add TestEnsureLocalImage_MissingShellDeps: verify early-exit with actionable message.
- Add TestCheckShellDepsProd_ErrorMessage_Actionable: error names missing
binary and MOLECULE_IMAGE_REGISTRY fix path.
workspace/test_a2a_tools_inbox_wrappers.py (#307):
- Replace _run(coro) anti-pattern with proper async def + await.
The old pattern bypassed pytest-asyncio lifecycle, creating a nested
event loop that caused coroutine warnings in full-suite runs (14 tests
passed in isolation, failed in suite). Fix: convert all 14 test methods
to async def owned by pytest-asyncio.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Before: `exec: "docker": executable file not found in $PATH` — cryptic,
no recovery guidance, workspace row left in broken registered-only state.
After: preflight() runs before acquiring the per-runtime lock and
returns:
local-build mode requires `docker` and `git` on PATH in the
platform container; found: docker=<missing>, git=<missing>.
Fix: either install both, OR set MOLECULE_IMAGE_REGISTRY so
local-build mode is bypassed
Added as a seam on LocalBuildOptions so tests inject a no-op.
Two new tests cover the failure and passthrough paths.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
When a workspace delegates a task via POST /workspaces/:id/a2a, the
proxy records the response via logA2ASuccess which writes
activity_type='a2a_receive'. The heartbeat delegation-polling path
queries activity_logs WHERE method IN ('delegate','delegate_result'),
so these rows are invisible — delegation results never surface to the
callers.
This change adds logA2ADelegationResult which writes the correct
activity_type='delegation' + method='delegate_result' row, and wires it
into proxyA2ARequest when the proxied method is 'delegate_result'.
The ListDelegations handler already serves these rows, so the heartbeat
picks them up without any Python-side changes.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds resolveInsideRoot inside loadWorkspaceEnv so a malicious
org YAML cannot escape the org root via ../../../etc-style filesDir.
Also fixes pre-existing Go 1.25 + go-sqlmock v1.5.2 build
incompatibility in instructions_test.go:
- Removes unused database/sql import
- Removes unused now := time.Now() variable
- Removes TestScanInstructions_ScanError (broken in Go 1.25;
*sqlmock.Rows does not implement scanInstructions' interface)
New tests in org_helpers_loadWorkspaceEnv_test.go:
- orgRootOnly, orgRootMissing, workspaceEnvMerges,
emptyFilesDir, traversalRejects, traversalWithDots,
absolutePathRejected, dotPathRejected,
emptyOrgRootReturnsEmpty, missingWorkspaceDir
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Covers extractToolTrace — the only untested pure function in the file.
Tests are JSON-only, no DB mocking needed:
- Happy path: result.metadata.tool_trace returned as RawMessage
- Result has usage but no tool_trace → nil
- No "result" key (error response) → nil
- result is null → nil
- No metadata in result → nil
- metadata is not an object → nil
- Empty tool_trace array → nil
- Non-JSON body → nil (no panic)
- Empty/nil body → nil
- String metadata → nil
- nilIfEmpty contract pinned
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Follow-up to #369. `resolveInsideRoot` used `filepath.Abs` which does NOT
resolve symlinks — so "workspaces/dev/leaked" where "leaked" is a symlink
to "/etc" would lexically pass the prefix check but resolve outside root.
Fix: call `filepath.EvalSymlinks` before the final prefix check. If the
resolved path points outside root the function returns "path escapes root".
Broken symlinks are also rejected (fail closed).
Also add TestResolveInsideRoot_RejectsSymlinkTraversal covering:
- Symlink pointing outside → rejected (CWE-59)
- Symlink staying inside root → allowed
- Broken symlink → rejected
When GITHUB_APP_ID/INSTALLATION_ID/PRIVATE_KEY_FILE are unset (Gitea-
canonical deployment or suspended GitHub App org), generateAppInstallation
Token() returns "required" — a permanent configuration error, not a
transient one. Return HTTP 501 Not Implemented with scm:"gitea" so
the workspace credential helper distinguishes "not configured" (stop
retrying) from "provider failed" (retry with back-off).
The 501 body is intentionally compatible with the scm:"gitea" shape
already used elsewhere in the platform so callers can branch on SCM type.
The Canvas template-deploy path returned HTTP 500 with raw pq error
when a user clicked a template card twice in quick succession. Root
cause: migration 20260506000000 added the partial-unique index
`workspaces_parent_name_uniq` on (COALESCE(parent_id, sentinel), name)
WHERE status != 'removed' to close TOCTOU on /org/import (#2872). The
org-import handler resolves the constraint via ON CONFLICT DO NOTHING
+ idempotent re-select. The Canvas Create handler did not — it
bubbled the pq violation as a generic 500.
Fix: auto-suffix the user-typed name on collision via a small retry
helper that pins on SQLSTATE 23505 + constraint name (so unrelated
unique indexes still fail loud), retries with " (2)", " (3)" up to
N=20, and threads the actually-persisted name back into the response
+ broadcast payload (so the canvas displays what the DB actually
holds). Exhaustion maps to a clean 409 Conflict instead of a 500.
#2872 protection is preserved unchanged — the index stays in place,
and /org/import's ON CONFLICT path is unaffected. The bundle-import
INSERT (handlers/bundle.go) is a separate code path and is not
touched here; if it surfaces the same UX issue a follow-up can adopt
the same helper.
Verification (against running localhost:8080 platform):
Three back-to-back POSTs with name="ManualVerify-1778459812":
POST #1 -> 201, id=db2dacf7-…, persisted name="ManualVerify-1778459812"
POST #2 -> 201, id=f468083d-…, persisted name="ManualVerify-1778459812 (2)"
POST #3 -> 201, id=5f5ae905-…, persisted name="ManualVerify-1778459812 (3)"
Log lines: "name collision auto-suffix \"…\" -> \"… (N)\""
Tests:
- workspace_create_name_test.go — 4 unit tests via sqlmock pin the
retry contract (happy path no-suffix, single-collision -> " (2)",
non-retryable error pass-through, exhaustion -> errWorkspaceNameExhausted).
- workspace_create_name_integration_test.go — 2 real-Postgres tests
(build tag `integration`) confirm the partial-unique index
behaviour AND the WHERE status != 'removed' tombstone exemption.
- Watch-it-fail confirmed: temporarily removing the
`fmt.Sprintf("%s (%d)", baseName, attempt+1)` candidate-naming
line makes TestInsertWorkspaceWithNameRetry_SecondAttemptSuffixed
fail with the expected argument-mismatch from sqlmock.
Pre-existing test failures in handlers/ (TestExecuteDelegation_…,
TestMCPHandler_CommitMemory_GlobalScope_Blocked) reproduce on
unmodified staging and are NOT caused by this change.
Cherry-pick of d79a4bd2 from PR #318 onto fresh main base (PR #318 closed).
Issue #310: platform a2a-proxy logs ~300/hr
`timeout awaiting response headers` because ResponseHeaderTimeout was hardcoded
to 60s. Opus agent turns (big context + internal delegate_task round-trips)
routinely exceed 60s, so the proxy gave up before headers arrived even when
the workspace agent was healthy.
Changes:
- a2a_proxy.go: ResponseHeaderTimeout: 60s hardcoded →
envx.Duration("A2A_PROXY_RESPONSE_HEADER_TIMEOUT", 180s).
180s gives Opus turns comfortable headroom. The X-Timeout caller header
still bounds the absolute request ceiling independently.
- a2a_proxy_test.go: TestA2AClientResponseHeaderTimeout verifies the 180s
default and env-override parsing logic.
Env var: A2A_PROXY_RESPONSE_HEADER_TIMEOUT (e.g. 5m, 300s).
Closes#310.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The Molecule-AI GitHub org was suspended 2026-05-06; canonical SCM is
now git.moleculesai.app. external_connection.go was still emitting
github.com URLs in operator-facing copy-paste blocks, breaking
external-agent onboarding silently.
Per-site decisions (8 emit sites in 1 file):
- L124 (channel template doc comment): swap source-of-truth comment to
Gitea host.
- L137 /plugin marketplace add Molecule-AI/...: swap to explicit Gitea
HTTPS URL form. End-to-end-verified path per internal#37 § 1.A.
- L138 /plugin install molecule@molecule-mcp-claude-channel: marketplace
name is molecule-channel (per remote .claude-plugin/marketplace.json),
not the repo name. Fix to molecule@molecule-channel.
- L157 --channels plugin:molecule@molecule-mcp-claude-channel: same
marketplace-name fix.
- L179 user-facing GitHub URL: swap to Gitea.
- L261 pip install git+https://github.com/Molecule-AI/molecule-sdk-python:
not on PyPI; swap to git+https://git.moleculesai.app/molecule-ai/...
- L310 hermes-channel doc comment: swap source-of-truth comment.
- L339 pip install git+https://github.com/Molecule-AI/hermes-channel-molecule:
not on PyPI; swap to Gitea.
- L369 issue-tracker URL: swap to Gitea.
Verification:
- molecule-ai-workspace-runtime, codex-channel-molecule are on PyPI (200);
no swap needed for those pip lines (they were already package-name form).
- molecule-mcp-claude-channel, molecule-sdk-python, hermes-channel-molecule
are NOT on PyPI; swapped to git+https://git.moleculesai.app/molecule-ai/
form. All three repos are public on Gitea (default branch main) and
serve git-upload-pack unauthenticated (verified curl 200 against
/info/refs?service=git-upload-pack).
- Third-party github URLs (gin import, openai/codex, NousResearch/
hermes-agent upstream issue trackers, npm @openai/codex) intentionally
preserved.
Adds TestExternalTemplates_NoBrokenMoleculeAIGitHubURLs regression guard
to prevent the same broken URLs from re-emerging on future template
edits.
go vet / go build / existing TestExternal* — all clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two surfaces in workspace-server hardcoded `ghcr.io` and silently bypassed
the `MOLECULE_IMAGE_REGISTRY` env override that flips every other image
operation to the configured private mirror (e.g. AWS ECR in production):
1. internal/imagewatch/watch.go — image-auto-refresh polled
`https://ghcr.io/v2/...` and `https://ghcr.io/token` directly. Post-
suspension, with the platform pointed at ECR, the watcher silently
stopped seeing digest changes (every poll either 404'd or hung on a
registry it has no business talking to).
2. internal/handlers/admin_workspace_images.go — Docker Engine auth
payload pinned `serveraddress: "ghcr.io"`, so when the operator sets
`MOLECULE_IMAGE_REGISTRY=…ecr…/molecule-ai` the engine matched the
wrong credential entry on every authenticated pull.
Fix: extract `provisioner.RegistryHost()` returning the host portion of
`RegistryPrefix()` (e.g. `ghcr.io` ← `ghcr.io/molecule-ai`, or
`004947743811.dkr.ecr.us-east-2.amazonaws.com` ← the ECR mirror prefix),
and route both surfaces through it. Default behavior is unchanged for
OSS users on GHCR.
Tests
- New `TestRegistryHost_SplitsHostFromOrgPath` and
`TestRegistryHost_NeverEmpty` pin the helper across GHCR / ECR /
self-hosted Gitea / bare-host edge cases.
- New `TestGHCRAuthHeader_RespectsRegistryEnv` asserts the Docker auth
payload's `serveraddress` follows MOLECULE_IMAGE_REGISTRY (and never
leaks the org-path suffix).
- New `TestRemoteDigest_RegistryHostFollowsEnv` stands up an httptest
server, points MOLECULE_IMAGE_REGISTRY at it, and confirms both the
token endpoint and the manifest HEAD land there — i.e. the full image-
watch loop respects the env override end-to-end.
Both new tests were verified to FAIL on the pre-fix code path before the
helper was wired in, so a future revert can't silently re-introduce the
bug.
Out of scope (followup needed)
ECR uses `aws ecr get-authorization-token` (SigV4 + basic-auth) instead
of GHCR's `/token?service=…&scope=…` flow. This PR makes the URL host-
configurable; the bearer-token negotiation in `fetchPullToken` still
speaks the GHCR flavor. On ECR with `IMAGE_AUTO_REFRESH=true`, the
watcher will now fail loudly at the token fetch (logged per tick) rather
than silently hitting ghcr.io. Operators on ECR should keep
IMAGE_AUTO_REFRESH=false until ECR auth is wired — tracked as a separate
task. Net effect of this PR alone is strictly better than pre-fix:
fail-loud > silent-broken.
Refs: RFC #229 P2-4
tier:low
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
internal#226 follow-up #1. `molecule_runtime.config` resolves the picked
model as `MOLECULE_MODEL` > `MODEL` > (legacy) `MODEL_PROVIDER` (#280) —
this side of the boundary now matches:
- applyRuntimeModelEnv reads `MOLECULE_MODEL` ahead of `MODEL` /
`MODEL_PROVIDER`, and exports BOTH `MOLECULE_MODEL` and `MODEL`
(the latter kept for back-compat with everything that already reads
`os.environ["MODEL"]`). So a workspace whose secrets carry
`MOLECULE_MODEL` (the unambiguous name) is honoured, and the
`MODEL_PROVIDER` misnomer — which got set to provider slugs
("minimax") and even runtime names ("claude-code") — is the lowest-
priority fallback, exactly as on the runtime side.
- the resolution-order comment is updated to flag MODEL_PROVIDER as the
legacy-and-misleadingly-named var.
Also drops a stray trailing `}` in delegation_test.go (committed in
97768272 "test(delegation): add isDeliveryConfirmedSuccess helper") that
made `internal/handlers` fail to parse — one of the things keeping the
package from compiling for tests.
Tests: TestApplyRuntimeModelEnv_SetsUniversalMODELForAllRuntimes extended
to assert MOLECULE_MODEL mirrors MODEL on every case, plus two new cases
(MOLECULE_MODEL env fallback; MOLECULE_MODEL beats MODEL_PROVIDER). Could
not run `go test ./internal/handlers/` locally — the package is still
blocked behind `internal/plugins` `SourceResolver` redeclaration (the
#248 plugin-router/resolver refactor, Core-BE's lane); CI validates once
that lands. The applyRuntimeModelEnv change is mechanical (same shape as
the existing `MODEL` handling) — reviewer please eyeball.
Companion: molecule-core#280 (runtime config.py side), molecule-ai-workspace-template-claude-code#14 (CLI-stream-error surfacing).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #256 introduced PluginResolver to break the SourceResolver redeclaration
deadlock, but missed three downstream call-sites that left main uncompilable:
1. plugins/drift_sweeper.go: PluginResolver.Resolve was declared returning
PluginResolver (recursive). *Registry.Resolve returns the production
SourceResolver from source.go, so *Registry didn't satisfy PluginResolver.
Fix: Resolve returns SourceResolver. Add compile-time assertion that
*Registry satisfies PluginResolver so any future signature drift fails
the build instead of router wiring.
2. plugins/drift_sweeper_test.go: stubResolver was still declared with the
old SourceResolver shape AND asserted against SourceResolver — the
assertion failed because stubResolver lacks Scheme()/Fetch(). Fix: stub
is a PluginResolver; assertion targets PluginResolver. Drop the unused
"database/sql" import that fails go vet.
3. router/router.go:
- The 70f84823 reorder moved the plgh init block above its dockerCli
dependency (line 538 used; line 594 declared). Moved the dockerCli
declaration up so it's available where used; replaced the orphaned
declaration in the terminal block with a comment.
- Setup's pluginResolver param was typed plugins.SourceResolver — wrong
for *plugins.Registry (Registry is not a per-scheme resolver). Retyped
to plugins.PluginResolver, which *Registry actually satisfies.
- Removed the broken `plgh.WithSourceResolver(pluginResolver)` call —
WithSourceResolver expects a per-scheme SourceResolver, not a
PluginResolver/registry. plgh has its own internal default registry
(github+local) from NewPluginsHandler, so dropping the call is
functionally a no-op vs the broken state. Kept the param so the
drift sweeper (main.go) can share scheme enumeration when needed.
4. go.sum: add the content hash entry for go.moleculesai.app/plugin/
gh-identity/pluginloader (only the /go.mod hash was present, breaking
`go build ./cmd/server`).
Verified locally:
go build ./... ✓
go vet ./... ✓ (only pre-existing org_external append warning)
go test ./internal/plugins/... ✓
go test ./internal/router/... ✓
6 pre-existing handler test failures (TestExecuteDelegation_*,
TestHandleDiagnose_*) are orthogonal — they did not run before because the
package didn't compile. Out of scope for this fix; tracking separately.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Plgh was referenced at line 505 before it was created at line 632, causing
"undefined: plgh" on main. Moved the entire Plugins block to before the
drift handler block. No functional change to registered routes — only
declaration order. Combined with d88a320f (SourceResolver→PluginResolver
rename, SSRF guard placement, and test regressions) this makes main fully
compile again.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- plugins/drift_sweeper.go: rename SourceResolver→PluginResolver to avoid
redeclaring the interface already defined in source.go (core#228)
- handlers/workspace.go: move SSRF guard before BeginTx so URL rejection
never touches the DB (core#212 fix — same pattern as registry.go:324)
- handlers/restart_signals.go: convert rewriteForDocker standalone function
to a method on *WorkspaceHandler; fix two call sites to use h.rewriteForDocker
- handlers/plugins.go: change Sources() return type from plugins.SourceResolver
to pluginSources (the narrow interface satisfied by *Registry)
- handlers/admin_plugin_drift.go: remove unused "context" import
- handlers/delegation_test.go: remove stray closing brace
- handlers/restart_signals_test.go: rewrite with correct miniredis v2 API
(mr.Get takes context, mr.Set requires TTL), resolveURLTestWrapper embedding
pattern, and corrected Redis key handling
- handlers/workspace_test.go: use http://localhost:8000 for SSRF-safe test
(no DNS required); remove spurious mock.ExpectExec for Redis CacheURL call
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Issue #212: POST /workspaces with runtime=external and a URL wrote the
URL directly to the DB without validateAgentURL checking (the same check
that registry.go:324 applies to the heartbeat path). An attacker with
AdminAuth could register a workspace URL at a cloud metadata endpoint
(169.254.169.254) and exfiltrate IAM credentials when the platform
fires pre-restart drain signals.
Changes:
- workspace.go: add validateAgentURL(payload.URL) guard before the
UPDATE at line 386. 400 on unsafe URL, no DB write occurs.
- workspace_test.go: add 3 regression tests:
- TestWorkspaceCreate_ExternalURL_SSRFSafe: safe public URL → 201
- TestWorkspaceCreate_ExternalURL_SSRFMetadataBlocked: 169.254.169.254 → 400
- TestWorkspaceCreate_ExternalURL_SSRFLoopbackBlocked: 127.0.0.1 → 400
Both unsafe tests assert zero DB calls (the handler rejects before
any transaction).
Ref: issue #212.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
## Summary
Adds the version-subscription drift detection and operator-apply workflow for
per-workspace plugin tracking (core#113).
## Components
**Migration** (`20260510000000_plugin_drift_queue`):
- Adds `installed_sha` column to `workspace_plugins` — records the commit SHA
installed so the drift sweeper can compare against upstream.
- Creates `plugin_update_queue` table with status: pending | applied | dismissed.
- Adds partial unique index to prevent duplicate pending rows per
(workspace_id, plugin_name).
**GithubResolver** (`github.go`):
- `LastFetchSHA` field + `LastSHA()` getter — populated by `Fetch` after a
successful shallow clone (captured before `.git` is stripped). Used by the
install pipeline to seed `installed_sha`.
- `ResolveRef(ctx, spec)` method — resolves a plugin spec to its full commit
SHA using `git fetch --depth=1 + git rev-parse`. Used by the drift sweeper
to get the current upstream SHA for a tracked ref (tag:vX.Y.Z, tag:latest,
sha:…, or bare branch).
**Drift sweeper** (`plugins/drift_sweeper.go`):
- Periodic sweep every 1h: SELECTs rows where `tracked_ref != 'none' AND
installed_sha IS NOT NULL`, resolves upstream SHA, queues drift if different.
- `ListPendingUpdates()` — reads pending queue rows for the admin endpoint.
- `ApplyDriftUpdate()` — marks entry applied (idempotent).
- ctx.Err() guard on ticker arm to avoid post-shutdown work.
**Install pipeline** (`plugins_install_pipeline.go`, `plugins_tracking.go`,
`plugins_install.go`):
- `stageResult.InstalledSHA` field — carries the SHA from Fetch to the DB.
- `recordWorkspacePluginInstall` now accepts and stores `installed_sha`.
- `deleteWorkspacePluginRow` — removes tracking row on uninstall so a stale
SHA doesn't prevent the next install from creating a fresh row.
- Both Docker and EIC uninstall paths call `deleteWorkspacePluginRow`.
**Admin endpoints** (`handlers/admin_plugin_drift.go`):
- `GET /admin/plugin-updates-pending` — list all pending drift entries.
- `POST /admin/plugin-updates/:id/apply` — re-installs plugin from source_raw
(re-fetching the same tracked ref), records the new SHA, marks entry applied,
triggers workspace restart. Idempotent (already-applied returns 200).
**Router wiring** (`router.go`, `cmd/server/main.go`):
- Plugin registry created in main.go and shared between PluginsHandler and drift
sweeper.
- `router.Setup` accepts optional `pluginResolver` param.
- `PluginsHandler.Sources()` export for the sweeper wiring pattern.
## Tests
- `plugins/github_test.go` — `ResolveRef` coverage (invalid spec, git error,
not-found mapping, no-panic for all ref shapes).
- `plugins/drift_sweeper_test.go` — `ResolveRef` happy path, stub resolver
interface compliance.
- `handlers/admin_plugin_drift_test.go` — ListPending (empty, non-empty, DB
error), Apply (not found, already applied, already dismissed, workspace_plugins
missing).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Race-detector CI runs (-race) slow goroutines enough that a
prior sweeper goroutine (e.g. TestStartSweeper_TransientErrorDoesNotCrashLoop)
can still be running and incrementing pendingUploadsSweepErrors after
metricDelta() captures its baseline, but before the success-path sweeper
records its success metrics. The test then reads deltaError=1 instead of 0.
Fix: add waitForMetricDelta(t, deltaError, 0, 2*time.Second) before the
assertion, matching the polling pattern already used in the error-path
test (TestStartSweeper_RecordsMetricsOnError). This ensures the error
counter has settled before we assert on it.
Fixes molecule-core#22.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
PUT /workspaces/:id/files and DELETE /workspaces/:id/files updated the
config volume but never restarted the container, so the running agent
continued serving stale file content from its in-memory cache. The
SecretsHandler already had this pattern (issue #15); TemplatesHandler
was missing it.
Fix: after every successful write/delete in WriteFile, DeleteFile, and
ReplaceFiles, call h.wh.RestartByID(workspaceID) asynchronously, guarded
by h.wh != nil (nil-tolerant for callers that only use read-only
surfaces). The RestartByID coalescing gate prevents thundering-herd on
concurrent requests.
Fixes#151.
Fixes#87 (duplicate effort closed — core-be also filed #183).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Issue #86: TestStartSweeper_RecordsMetricsOnSuccess fails in full-suite.
Root cause: two cooperating bugs in the sweeper test harness.
1. Sweeper loop called sweepOnce after ctx cancellation (double-increment).
When ctx was cancelled the loop's select received ctx.Done(), called
sweepOnce with the cancelled ctx, storage.Sweep returned context error,
and metrics.PendingUploadsSweepError() incremented the error counter a
SECOND time before the loop exited. Subsequent tests captured a polluted
error baseline and their deltaError assertions failed.
2. Tests called defer cancel() without waiting for the goroutine to exit.
The goroutine could still be blocked on Sweep (waiting for the next
ticker's C channel) when the next test called metricDelta(). If the
goroutine's Sweep returned during the next test's measurement window,
the shared metric counters mutated mid-baseline.
Fix (production code):
- Guard the ticker arm: if ctx.Err() != nil, continue instead of calling
sweepOnce. This prevents the post-cancellation sweep from running.
Fix (test harness):
- startSweeperWithInterval gains a done chan struct{} parameter. When the
loop exits the channel is closed exactly once.
- StartSweeperForTest starts the goroutine and returns the done channel,
allowing tests to drain it with <-done after cancel() — guaranteeing
the goroutine has fully terminated before the next test's baseline.
All 8 sweeper tests now use StartSweeperForTest and drain the done
channel before returning, ensuring stable metric baselines across the
full suite.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
os.Chmod(dst, 0o555) silently passes when os.Geteuid() == 0 because
root bypasses POSIX permission checks. A previous attempt to use a
symlink to /dev/full also fails: Go's os.MkdirAll resolves the symlink
during path traversal and the kernel allows mkdir("/dev/full") as a
device-table entry — io.Copy to /dev/full then succeeds with 0 bytes
written and returns nil.
The honest, consistent fix mirrors TestLocalResolver_CopyFileSourceUnreadable:
skip when running as root. The write-failure propagation logic is
exercised correctly in non-root CI environments.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
fix(plugins/test): skip TestLocalResolver_BubblesUpCopyFailure when running as root
Fixes issue #87: the test sets chmod(dst, 0o555) to make the
destination read-only and asserts the copy fails. On Linux, root
bypasses filesystem permissions and can write to 0o555 directories,
so the copy succeeds when running as root and the assertion fails.
Fix: check os.Getuid() == 0 at the start of the test and skip with
a clear message. Mirrors the existing skip in
TestLocalResolver_CopyFileSourceUnreadable (line 175) which already
handles the same root-bypass issue for unreadable source files.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
[core-lead-agent] Closes the regression-test gap on PR #170 (Core-BE's
fix for #159 retry-storm). Original PR shipped the inline conditional
without a unit test; this commit:
1. Extracts the inline `(proxyErr != nil && len(respBody) > 0 && 2xx)`
predicate into a named helper `isDeliveryConfirmedSuccess`. Same
behavior; the call site now reads `if isDeliveryConfirmedSuccess(...)`.
2. Adds `TestIsDeliveryConfirmedSuccess` — 10-case table test covering:
- The new branch (2xx + body + transport error → recover as success):
status=200, status=299, status=200+min-body
- Each precondition failing in isolation:
* nil proxyErr → false (no decision)
* empty/nil body → false (no work to recover)
* 4xx/5xx/3xx body → false (agent-signalled failure or redirect)
* <200 status → false (not 2xx)
Test-pattern mirrors the existing `TestIsTransientProxyError_Retries...`
and `TestIsQueuedProxyResponse` table tests in the same file — same
file-local mock-error pattern, no new test infra.
fix: Treat delivery-confirmed proxy errors as delegation success
Two-part fix for issue #159 — successful delegation responses were
rendered as error banners:
PART 1 — a2a_proxy.go: When io.ReadAll fails mid-stream (e.g., TCP
connection drops after the agent sent its 200 OK response), the prior
code returned (0, nil, BadGateway) discarding both the HTTP status code
and any partial body bytes already received. Fix: return
(resp.StatusCode, respBody, error) so callers can inspect what was
delivered even when the body read failed.
PART 2 — delegation.go: New condition in executeDelegation after the
transient-error retry block:
if proxyErr != nil && len(respBody) > 0 && status >= 200 && status < 300 {
goto handleSuccess
}
When proxyA2ARequest returns a delivery-confirmed error (status 2xx +
non-empty partial body), route to success instead of failure. This
prevents the retry-storm pattern where the canvas shows "error" with
a Restart-workspace suggestion even though the delegation actually
completed and the response is available.
Regression tests (delegation_test.go):
- TestExecuteDelegation_DeliveryConfirmedProxyError_TreatsAsSuccess:
server sends 200 + partial body then closes; second attempt succeeds.
Verifies the new condition fires for delivery-confirmed 2xx responses.
- TestExecuteDelegation_ProxyErrorNon2xx_RemainsFailed: server sends
500 + partial body then closes. Verifies non-2xx routes to failure.
- TestExecuteDelegation_ProxyErrorEmptyBody_RemainsFailed: server
returns 502 Bad Gateway (empty body, transient). Verifies empty-body
errors still route to failure (condition len(respBody) > 0 guards it).
- TestExecuteDelegation_CleanProxyResponse_Unchanged: clean 200 OK.
Verifies baseline (proxyErr == nil path) is unaffected.
Fixes issue #159.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
fix(template_import): Remove silent template-dir fallback in ReplaceFiles offline path
When the workspace container is offline and writeViaEphemeral fails
(docker unavailable), ReplaceFiles previously fell back to writing
to the host-side template directory. This silently returned 200 with
"source: template" while the file change was invisible after restart
because the restart handler reads from the Docker volume, not the
template dir (issue #151).
Now returns 503 Service Unavailable with a message telling the caller
to retry after the workspace starts. The ephemeral write path is
the only correct mechanism for offline-container updates.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
fix: Treat delivery-confirmed proxy errors as delegation success
When proxyA2ARequest returns an error but we have a non-empty
response body with a 2xx status code, the agent completed the work
successfully. The error is a delivery/transport error (e.g., connection
reset after response was received).
Previously, executeDelegation would mark these as "failed" even though
the work was done, causing:
- Retry storms (canvas suggests restart, user retries)
- "error" rendering in canvas even though result is available
- Data loss risk from unnecessary restarts
Now we check for valid response data before marking as failed.
Fixes issue #159.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>