Comprehensive-review follow-ups (MED/LOW/NIT) — reconciler debounce coupling, drift-gate UNREACHABLE soft spot, cross-repo sync detection #2284

Closed
opened 2026-06-05 03:33:15 +00:00 by hongming · 0 comments
Owner

Lower-severity findings from the comprehensive review (HIGHs already fixed in PRs #2282/#2283)

The HIGH findings are remediated (PR #2282 providers byte-sync, PR #2283 reconciler TOCTOU/degraded/cycle-deadline; PR #2281 e2e merged). These remaining items are MED/LOW/NIT — tracked here.

[MED] Reconciler debounce window == reconcile interval (fragile coupling)

The reconciler delegates ALL status-writing to async RestartByID; the only guard against a cycle N+1 double-flip of a still-/again-online row is restartDebounceWindow = 60s, set EXACTLY equal to the reconcile interval (main.go). Zero margin, and nothing in the reconciler references the debounce — the protection is incidental to two constants matching. If someone drops the reconciler interval (the orphan-sweeper comment invites 30s) the double-reprovision thrash class (internal#544) silently reopens. PR #2283's per-row re-confirm narrows but doesn't eliminate this. Fix: make the coupling explicit (derive/assert the debounce ≥ interval, or have the reconciler set an intermediate status itself) + a comment.

[LOW] cp#508 enforcing gate: check-template-models UNREACHABLE-skip is a soft spot

The google-adk template config.yaml currently 404s and is treated as a non-fatal skip (exit 0). Now that the gate is branch-required, a template that becomes PERMANENTLY 404 (renamed/deleted/token-scope) would silently never be drift-checked — a hole in the gate. By-design fail-safe + pre-existing, but worth a bounded-staleness guard (e.g. fail if a known template has been UNREACHABLE for >N days, or alert).

[NIT] serving-e2e TestEveryOfferedProviderHasAnArm omits hermes

offeredRuntimes (serving_e2e_test.go) lists claude-code/codex/google-adk/openclaw but not hermes, so hermes's platform arm (moonshot/kimi-k2.6/.5) isn't checked for serving coverage. Pre-existing (predates cp#529). Add hermes to the list.

[LOW] Cross-repo byte-sync detection is weak (the gap that let #2282 happen)

core's hermetic sync_canonical_test only pins core-vs-its-own-copy; only the live sync-providers-yaml CI catches cross-repo drift, and it's (a) AUTO_SYNC_TOKEN-gated (else exit 0 skip) and (b) paths-filtered to core's own providers.yaml — so a CP-ONLY providers.yaml change (cp#521) is caught only by a daily cron, not at merge time. Consider: a CP-side gate that fails if CP providers.yaml changes without a matching core sync, or widening the core sync-check trigger.

Refs: comprehensive review of the cp#529 + core#2261 session. HIGHs fixed in #2281/#2282/#2283.

## Lower-severity findings from the comprehensive review (HIGHs already fixed in PRs #2282/#2283) The HIGH findings are remediated (PR #2282 providers byte-sync, PR #2283 reconciler TOCTOU/degraded/cycle-deadline; PR #2281 e2e merged). These remaining items are MED/LOW/NIT — tracked here. ### [MED] Reconciler debounce window == reconcile interval (fragile coupling) The reconciler delegates ALL status-writing to async `RestartByID`; the only guard against a cycle N+1 double-flip of a still-/again-`online` row is `restartDebounceWindow = 60s`, set EXACTLY equal to the reconcile interval (main.go). Zero margin, and nothing in the reconciler references the debounce — the protection is incidental to two constants matching. If someone drops the reconciler interval (the orphan-sweeper comment invites 30s) the double-reprovision thrash class (internal#544) silently reopens. PR #2283's per-row re-confirm narrows but doesn't eliminate this. Fix: make the coupling explicit (derive/assert the debounce ≥ interval, or have the reconciler set an intermediate status itself) + a comment. ### [LOW] cp#508 enforcing gate: `check-template-models` UNREACHABLE-skip is a soft spot The google-adk template config.yaml currently 404s and is treated as a non-fatal skip (`exit 0`). Now that the gate is branch-required, a template that becomes PERMANENTLY 404 (renamed/deleted/token-scope) would silently never be drift-checked — a hole in the gate. By-design fail-safe + pre-existing, but worth a bounded-staleness guard (e.g. fail if a known template has been UNREACHABLE for >N days, or alert). ### [NIT] serving-e2e `TestEveryOfferedProviderHasAnArm` omits `hermes` `offeredRuntimes` (serving_e2e_test.go) lists claude-code/codex/google-adk/openclaw but not hermes, so hermes's `platform` arm (moonshot/kimi-k2.6/.5) isn't checked for serving coverage. Pre-existing (predates cp#529). Add hermes to the list. ### [LOW] Cross-repo byte-sync detection is weak (the gap that let #2282 happen) core's hermetic `sync_canonical_test` only pins core-vs-its-own-copy; only the live `sync-providers-yaml` CI catches cross-repo drift, and it's (a) `AUTO_SYNC_TOKEN`-gated (else exit 0 skip) and (b) paths-filtered to core's own providers.yaml — so a CP-ONLY providers.yaml change (cp#521) is caught only by a daily cron, not at merge time. Consider: a CP-side gate that fails if CP providers.yaml changes without a matching core sync, or widening the core sync-check trigger. Refs: comprehensive review of the cp#529 + core#2261 session. HIGHs fixed in #2281/#2282/#2283.
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#2284