fix(registry): clear last_register_failure_at on healthy heartbeat agent_card backfill #2668

Merged
devops-engineer merged 1 commit from fix/registry-clear-failure-on-healthy-heartbeat into main 2026-06-12 21:10:44 +00:00

Bug: After an authenticated non-200 /registry/register stamps last_register_failure_at, a later healthy heartbeat that backfills agent_card did not clear that marker. evaluateStatus therefore kept the workspace stuck in degraded forever.

Fix: In the heartbeat backfill path, clear last_register_failure_at in the same UPDATE that writes the agent_card.
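
A minimal sketch of what "the same UPDATE" could look like, not the actual handler code. The column names `agent_card` and `last_register_failure_at` come from this PR; the `workspaces` table, the `id` column, the `execer` interface, and every function name below are assumptions made for illustration. A recording fake stands in for the database so the example runs with the standard library only.

```go
package main

import (
	"context"
	"database/sql"
	"database/sql/driver"
	"fmt"
	"strings"
)

// backfillSQL writes the card and resets the failure marker in one statement.
// Table "workspaces" and column "id" are hypothetical; the IS NULL guard keeps
// the statement from overwriting a card that already exists.
const backfillSQL = `
UPDATE workspaces
SET agent_card = $1,
    last_register_failure_at = NULL
WHERE id = $2
  AND agent_card IS NULL`

// execer is the subset of *sql.DB this sketch needs (hypothetical interface).
type execer interface {
	ExecContext(ctx context.Context, query string, args ...any) (sql.Result, error)
}

// execBackfill issues the single UPDATE for one workspace.
func execBackfill(ctx context.Context, db execer, workspaceID string, card []byte) error {
	_, err := db.ExecContext(ctx, backfillSQL, card, workspaceID)
	return err
}

// recordingExecer is a fake that only records the statements it receives.
type recordingExecer struct{ queries []string }

func (r *recordingExecer) ExecContext(_ context.Context, q string, _ ...any) (sql.Result, error) {
	r.queries = append(r.queries, q)
	return driver.RowsAffected(1), nil
}

// recordedBackfill runs one backfill against the fake and returns what was sent.
func recordedBackfill() []string {
	rec := &recordingExecer{}
	if err := execBackfill(context.Background(), rec, "ws-1", []byte(`{}`)); err != nil {
		return nil
	}
	return rec.queries
}

// containsAll reports whether q contains every fragment in parts.
func containsAll(q string, parts ...string) bool {
	for _, p := range parts {
		if !strings.Contains(q, p) {
			return false
		}
	}
	return true
}

func main() {
	qs := recordedBackfill()
	fmt.Println("statements issued:", len(qs))
	fmt.Println("resets marker and keeps guard:",
		containsAll(qs[0], "last_register_failure_at = NULL", "agent_card IS NULL"))
}
```

Folding the reset into the existing statement means there is no window in which the card has been written but the marker is still set, and it adds no extra round trip.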

Verification: Added TestHeartbeatHandler_BackfillAgentCard_ClearsRegisterFailure covering the degraded→online recovery path.
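
The recovery path that test covers can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the real `evaluateStatus` or the sqlmock-based test: the `workspace` struct, the 0.5 error-rate threshold, and all function names are hypothetical, and the real status evaluation may weigh other inputs.

```go
package main

import (
	"fmt"
	"time"
)

// workspace is a hypothetical in-memory stand-in for a registry row.
type workspace struct {
	AgentCard             *string
	LastRegisterFailureAt *time.Time
	ErrorRate             float64
}

// evaluateStatusModel is an assumed simplification, not the real evaluateStatus:
// a set failure marker or a high error rate reports "degraded".
func evaluateStatusModel(ws workspace) string {
	const maxErrorRate = 0.5 // hypothetical threshold
	if ws.LastRegisterFailureAt != nil || ws.ErrorRate > maxErrorRate {
		return "degraded"
	}
	return "online"
}

// backfillModel writes the card only when it is missing. clearFailure toggles
// the pre-fix (false) and post-fix (true) behavior.
func backfillModel(ws *workspace, card string, clearFailure bool) {
	if ws.AgentCard != nil {
		return
	}
	ws.AgentCard = &card
	if clearFailure {
		ws.LastRegisterFailureAt = nil
	}
}

// statusAfterHeartbeat stamps a register failure, applies one healthy
// heartbeat backfill, and returns the resulting status.
func statusAfterHeartbeat(clearFailure bool) string {
	failedAt := time.Now()
	ws := workspace{LastRegisterFailureAt: &failedAt}
	backfillModel(&ws, `{"name":"example"}`, clearFailure)
	return evaluateStatusModel(ws)
}

func main() {
	fmt.Println("without reset:", statusAfterHeartbeat(false)) // degraded
	fmt.Println("with reset:   ", statusAfterHeartbeat(true))  // online
}
```

In this model the only difference between the two runs is whether the backfill also clears the marker, which mirrors the single-column change described above.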

SOP checklist

  • Comprehensive testing performed — unit test covers the degraded→online recovery path
  • Local-postgres E2E run — N/A (registry handler unit test; no DB schema change)
  • Staging-smoke verified or pending — N/A (registry status evaluation, not provisioning)
  • Root-cause not symptom — root cause is the missing last_register_failure_at reset on healthy heartbeat backfill
  • Five-Axis review walked — correctness (single-column reset), readability (commented), architecture (reuse existing backfill path), security (no new auth path), performance (single UPDATE)
  • No backwards-compat shim / dead code added — no shim; behavior change is the fix
  • Memory consulted — #2659/#2665 registry-degraded context

Relates-to: #2659 #2665

Route: CR2 (CI-green)

agent-dev-a added 1 commit 2026-06-12 20:52:06 +00:00
fix(registry): clear last_register_failure_at on healthy heartbeat agent_card backfill
CI / Python Lint & Test (pull_request) Successful in 5s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 7s
E2E API Smoke Test / detect-changes (pull_request) Successful in 7s
E2E Chat / detect-changes (pull_request) Successful in 6s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 7s
Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 4s
E2E Chat / E2E Chat (pull_request) Successful in 3s
Harness Replays / detect-changes (pull_request) Successful in 7s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 7s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 4s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 7s
CI / Detect changes (pull_request) Successful in 16s
Harness Replays / Harness Replays (pull_request) Successful in 2s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 15s
reserved-path-review / reserved-path-review (pull_request_target) Successful in 9s
CI / Canvas (Next.js) (pull_request) Successful in 2s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 2s
CI / Canvas Deploy Status (pull_request) Successful in 1s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 19s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (local) (pull_request) Successful in 28s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (pull_request) Successful in 4s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (stub) (pull_request) Successful in 39s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 49s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory) (pull_request) Successful in 34s
sop-checklist / review-refire (pull_request_target) Has been skipped
sop-checklist / all-items-acked (pull_request) acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4
sop-checklist / na-declarations (pull_request) N/A: (none)
sop-checklist / all-items-acked (pull_request_target) Successful in 8s
gate-check-v3 / gate-check (pull_request_target) Failing after 11s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2m19s
CI / Platform (Go) (pull_request) Successful in 2m54s
CI / all-required (pull_request) Successful in 3s
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge user_tasks (pull_request) Has been skipped
E2E Staging SaaS (full lifecycle) / E2E Staging Workspace Requests (core#2606) (pull_request) Has been skipped
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Creates Workspace (pull_request) Has been skipped
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Platform Agent (pull_request) Has been skipped
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge (compile+skip) (pull_request) Successful in 10s
E2E Staging SaaS (full lifecycle) / pr-validate (pull_request) Successful in 22s
E2E Staging External Runtime / E2E Staging External Runtime (pull_request) Successful in 5m40s
E2E Staging SaaS (full lifecycle) / E2E Staging Platform Boot (pull_request) Failing after 7m13s
E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (pull_request) Failing after 8m38s
qa-review / approved (pull_request_target) Approved via pull_request_review trigger
qa-review / approved (pull_request_review) Successful in 4s
security-review / approved (pull_request_target) Approved via pull_request_review trigger
security-review / approved (pull_request_review) Successful in 5s
reserved-path-review / reserved-path-review (pull_request_review) Successful in 7s
audit-force-merge / audit (pull_request_target) Successful in 4s
dd3dad7952
A transient /registry/register failure stamps last_register_failure_at and
forces the workspace to degraded. When a subsequent heartbeat carries a
valid agent_card and backfills the missing card, it is now a healthy recovery
path, so clear last_register_failure_at in the same UPDATE. Without this,
evaluateStatus keeps the workspace stuck degraded forever.

Relates-to: #2659 #2665
agent-dev-a force-pushed fix/registry-clear-failure-on-healthy-heartbeat from 2f16fce90a to dd3dad7952 2026-06-12 20:52:06 +00:00
agent-dev-a requested review from agent-reviewer-cr2 2026-06-12 20:54:14 +00:00
agent-reviewer-cr2 approved these changes 2026-06-12 21:10:28 +00:00
agent-reviewer-cr2 left a comment

APPROVED: reviewed head dd3dad7952bd90062f41f1eb4692cb655b1a5308 with the 5-axis lens. CI / all-required is green; the red staging/SOP contexts are outside this required-code path. The change is scoped to the heartbeat agent_card backfill recovery: when a healthy heartbeat writes a missing agent_card, it also clears last_register_failure_at before evaluateStatus reads it, allowing degraded->online recovery once error_rate/runtime_state are healthy. It preserves the agent_card IS NULL guard, does not broaden auth or registration behavior, and the new sqlmock test covers the degraded recovery path and status transition. No blockers found.
devops-engineer merged commit e1625d8f2b into main 2026-06-12 21:10:44 +00:00
Reference: molecule-ai/molecule-core#2668