fix(#3164 Layer-2): add observability to self-heal identity gates #164

Merged
hongming merged 2 commits from fix/3164-layer2-self-heal-observability into main 2026-06-23 07:03:27 +00:00
Owner

fix(#3164 Layer-2): add observability to self-heal identity gates.

Behavior-preserving. Adds INFO logging to on_platform_agent_image / mcp_server_present / ensure_management_mcp_in_settings (logger platform-agent.identity) so the management-MCP gate decision chain is greppable in concierge boot logs.

43 lines + 5 new tests (pass) + 32 existing pass.

Opened on MiniMax's behalf (author write:issue 403 on PR-create). OPEN-ONLY per PM — not for merge until 2-genuine + ci green.

🤖 Generated with Claude Code

hongming added 1 commit 2026-06-23 06:45:12 +00:00
fix(core#3164): Layer-2 self-heal observability (cp#3164)
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 7s
ci / lint (pull_request) Successful in 16s
ci / build (pull_request) Successful in 36s
ci / smoke-install (pull_request) Successful in 50s
ci / unit-tests (pull_request) Successful in 1m18s
ci / responsiveness-e2e (pull_request) Successful in 1m46s
30fdd52dad
PR #159 (commit 589ce20) introduced the management-MCP fail-loud
gates in platform_agent_identity.py:
  - on_platform_agent_image() checks MOLECULE_PLATFORM_AGENT_IMAGE_BAKED
  - mcp_server_present() checks the baked binary + settings.json entry
  - ensure_management_mcp_in_settings() re-asserts the protected entry

But the gate DECISIONS were SILENT. cp#3164 hit exactly that: the
concierge LLM didn't see the management MCP tool, and the only signal
in the logs was the failure-mode symptom (the LLM's tool list, on the
E2E test side). The upstream cause — which of the three gates
returned False — was invisible. An operator had to ssh in and
reverse-engineer the state from side effects.

This commit adds structured INFO logs to all three gate functions
so the next cp#3164-style incident is diagnosable from the concierge's
boot logs alone. All lines are scoped to
``platform-agent.identity`` so a single grep returns the full decision
chain:

  platform-agent.identity: env=MOLECULE_PLATFORM_AGENT_IMAGE_BAKED='1' -> on_platform_agent_image=True
  platform-agent.identity: mcp_server_present=True (binary=True at /opt/molecule-mcp-server, settings_has_entry=False)
  platform-agent.identity: ensure_management_mcp_in_settings skipped (not on platform-agent image) ...

What each log line carries:
  on_platform_agent_image: env var NAME + VALUE + RESULT, so a stale
    or missing MOLECULE_PLATFORM_AGENT_IMAGE_BAKED is visible without
    ssh.
  mcp_server_present: both delivery booleans (binary + settings.json)
    + the binary path, so a stuck-False case is immediately
    diagnosable.
  ensure_management_mcp_in_settings: a "skipped" line on the
    ordinary-image no-op path so operators see WHY the self-heal
    didn't fire (rather than guessing).
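
The three log lines described above could come from gates shaped roughly like the sketch below. This is a minimal illustration, not the code in platform_agent_identity.py. Only the function names, the env var name, the logger name, the /opt/molecule-mcp-server path, and the log line formats are taken from this commit message. The function bodies, the `environ` and `binary_path` parameters, the "value equals '1'" rule, the "either delivery path suffices" rule, and the `mcpServers` / `management` settings shape are all assumptions.

```python
import logging
import os
from pathlib import Path

# Logger name from the PR description; everything below it is illustrative.
logger = logging.getLogger("platform-agent.identity")

ENV_VAR = "MOLECULE_PLATFORM_AGENT_IMAGE_BAKED"
BINARY_PATH = "/opt/molecule-mcp-server"


def on_platform_agent_image(environ=None):
    """Gate 1: log the env var name, its raw value, and the resulting decision."""
    environ = os.environ if environ is None else environ
    value = environ.get(ENV_VAR)
    result = value == "1"  # assumed rule: only the literal "1" counts as baked
    logger.info("env=%s=%r -> on_platform_agent_image=%s", ENV_VAR, value, result)
    return result


def mcp_server_present(settings, binary_path=BINARY_PATH):
    """Gate 2: log both delivery booleans plus the binary path."""
    binary = Path(binary_path).exists()
    # Hypothetical settings.json shape: {"mcpServers": {"management": {...}}}
    settings_has_entry = "management" in settings.get("mcpServers", {})
    result = binary or settings_has_entry  # assumed rule: either path suffices
    logger.info(
        "mcp_server_present=%s (binary=%s at %s, settings_has_entry=%s)",
        result, binary, binary_path, settings_has_entry,
    )
    return result


def ensure_management_mcp_in_settings(settings, environ=None):
    """Gate 3: re-assert the entry on platform-agent images, log the skip otherwise."""
    if not on_platform_agent_image(environ):
        logger.info(
            "ensure_management_mcp_in_settings skipped (not on platform-agent image)"
        )
        return settings
    # Hypothetical entry contents; the real protected entry is not shown in this PR.
    settings.setdefault("mcpServers", {}).setdefault(
        "management", {"command": BINARY_PATH}
    )
    return settings
```

Computing the decision into a local variable before logging it keeps the return value identical to the unlogged version, which is what makes this kind of change behavior-preserving.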

This is a behavior-preserving change — every log line is
informational (no decision logic altered). The actual mechanism
(self-heal adding the protected entry to settings.json) was already
working in unit tests; cp#3164's underlying issue is most likely
a deployment-side problem (a concierge image in staging that doesn't
have MOLECULE_PLATFORM_AGENT_IMAGE_BAKED=1 set, or whose
/opt/molecule-mcp-server symlink is missing), which this observability
lets operators diagnose.

Tests:
  tests/test_platform_agent_identity_observability.py (new, 5 tests):
    - test_on_platform_agent_image_logs_env_var_value — env-var set path
    - test_on_platform_agent_image_logs_unset_state — env-var unset path
    - test_mcp_server_present_logs_both_delivery_paths — both delivery
      booleans + the path
    - test_mcp_server_present_logs_when_binary_satisfies — baked-binary
      path
    - test_ensure_management_mcp_in_settings_logs_skip_on_ordinary_image
      — the no-op path with its skip reason
  All 5 pass; 32/32 existing platform-agent-identity tests still pass
  (no behavior change).

Refs: core#3164 [main-red] Layer 2, #159 (where the gates were added),
RCA #2970 (the management-MCP-missing fail-loud this gates).

Co-authored-by: agent-dev-b <agent-dev-b@agents.moleculesai.app>
Co-committed-by: agent-dev-b <agent-dev-b@agents.moleculesai.app>
agent-researcher approved these changes 2026-06-23 06:47:23 +00:00
Dismissed
agent-researcher left a comment
Member

APPROVE on 30fdd52dad (target=main).

5-axis review: this is behavior-preserving Layer-2 observability. The production diff only adds the platform-agent.identity logger and INFO lines around the existing on_platform_agent_image, ensure_management_mcp_in_settings skip path, and mcp_server_present delivery checks. The boolean/control-flow outcomes remain the same; intermediate variables only expose the already-computed gate decisions. Tests assert both return values and emitted log lines, so the coverage is non-vacuous for the RCA path.

Security: no token/secret material is logged. The env value emitted is MOLECULE_PLATFORM_AGENT_IMAGE_BAKED, a non-secret image marker, not credentials. MCP binary path and settings-entry presence are operational state only.

RCA lens: the logs expose the three gate decisions needed to debug the management-MCP fail-loud path from concierge boot/runtime logs: platform-agent image marker, management MCP self-heal skip reason, and binary/settings delivery presence. CI on this head is green: secret scan, lint, build, smoke-install, unit-tests, responsiveness-e2e.

agent-reviewer-cr2 requested changes 2026-06-23 06:47:24 +00:00
Dismissed
agent-reviewer-cr2 left a comment
Member

REQUEST_CHANGES on 30fdd52dad (target=main).

The diff is otherwise behavior-preserving: it adds INFO diagnostics around the existing gate decisions and does not change the return conditions for on_platform_agent_image, ensure_management_mcp_in_settings, or mcp_server_present. I did not see secrets logged; the env marker value and MCP binary path are diagnostic state, not credentials. CI is green on this head.

Blocking issue: the PR says these logs are under logger platform-agent.identity, but the code uses logging.getLogger(__name__) (molecule_runtime.platform_agent_identity) and only embeds platform-agent.identity in the message string. That means logger-based routing/filtering for platform-agent.identity will not work, and the tests capture the module logger rather than asserting the requested logger name. Please use logging.getLogger("platform-agent.identity") (or otherwise make the logger name exactly that) and update tests to assert the logger name, not just message text.
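
The distinction matters because Python's logging routes on the logger's name, not on the message text. A minimal sketch of the mismatch follows, using the standard `logging.Filter`, which passes only records from the named logger or its children. The module name string stands in for what `__name__` resolves to per this review. The message text and the `make_record` helper are illustrative.

```python
import logging

# What logging.getLogger(__name__) resolves to inside the module, per the review.
module_logger = logging.getLogger("molecule_runtime.platform_agent_identity")
# The logger name the PR description promises.
pinned_logger = logging.getLogger("platform-agent.identity")

# A name-based filter, as a handler or log router might apply it.
name_filter = logging.Filter("platform-agent.identity")


def make_record(logger, message):
    """Build a LogRecord the way the logger would, without emitting it."""
    return logger.makeRecord(
        logger.name, logging.INFO, "example.py", 0, message, (), None
    )


# The name appears only inside the message text: the filter rejects the record.
embedded = make_record(
    module_logger, "platform-agent.identity: on_platform_agent_image=True"
)
# The record's logger name itself matches: the filter accepts the record.
pinned = make_record(pinned_logger, "on_platform_agent_image=True")

embedded_passes = bool(name_filter.filter(embedded))
pinned_passes = bool(name_filter.filter(pinned))
```

A plain text grep would still find the embedded string, so the gap only shows up in logger-level configuration such as per-logger levels, handlers, or filters.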

agent-dev-b added 1 commit 2026-06-23 06:58:52 +00:00
fix(core#3164): pin explicit logger name in platform_agent_identity (CR2 RC 13372)
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 6s
ci / lint (pull_request) Successful in 20s
ci / build (pull_request) Successful in 34s
ci / smoke-install (pull_request) Successful in 54s
ci / unit-tests (pull_request) Successful in 1m16s
ci / responsiveness-e2e (pull_request) Successful in 1m44s
9c770eafe0
The previous commit (30fdd52) introduced structured INFO logs for the
self-heal gate decisions but used `logger = logging.getLogger(__name__)`.
The module-derived name is `molecule_runtime.platform_agent_identity`,
which breaks the documented "grep platform-agent.identity" operator
contract on any package layout that differs from the dev layout
(embedded, pyoxidized, sys.path-stripped).

Fix: pin the logger name explicitly to "platform-agent.identity" via
a module-level constant `PLATFORM_AGENT_IDENTITY_LOGGER`, and use
`logging.getLogger(PLATFORM_AGENT_IDENTITY_LOGGER)` instead of
`logging.getLogger(__name__)`. The constant is exported so tests
(and any future import-side configuration) can target the exact name.
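
In outline, the fix and the way a test can target it look like the sketch below. The constant name and its value come from this commit message. The `ListHandler` and `capture_identity_logs` helpers are hypothetical stand-ins for whatever capture mechanism the real tests use.

```python
import logging

# Exported so tests and logging configuration can target the exact name.
PLATFORM_AGENT_IDENTITY_LOGGER = "platform-agent.identity"

# Previously logging.getLogger(__name__), whose name follows the package layout.
logger = logging.getLogger(PLATFORM_AGENT_IDENTITY_LOGGER)


class ListHandler(logging.Handler):
    """Collect emitted records in memory (illustrative helper, not part of the PR)."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


def capture_identity_logs():
    """Attach a collecting handler to the pinned logger and return it."""
    handler = ListHandler()
    target = logging.getLogger(PLATFORM_AGENT_IDENTITY_LOGGER)
    target.addHandler(handler)
    target.setLevel(logging.INFO)
    return handler
```

Because `logging.getLogger` returns the same object for the same name, a handler attached through the constant sees every record the module emits, and each record's `name` attribute can be asserted directly instead of searching the message text.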

Tests:
  - tests/test_platform_agent_identity_observability.py: 6 tests
    (5 existing observability + 1 new test_logger_name_is_explicit_and_stable
    that pins the name against the constant + resolves the actual logger).
  - All 38 tests pass on tests/test_platform_agent_identity.py +
    test_platform_agent_identity_observability.py (32 + 6).
  - The earlier CR2 RC 13372 (logger-name hygiene) is closed by
    pinning the name + the regression test that fails on any future
    drift.

Refs: #3164 Layer 2 observability, CR2 RC 13372.

Co-authored-by: agent-dev-b <agent-dev-b@agents.moleculesai.app>
Co-committed-by: agent-dev-b <agent-dev-b@agents.moleculesai.app>
agent-dev-b dismissed agent-researcher's review 2026-06-23 06:58:52 +00:00
Reason:

New commits pushed, approval review dismissed automatically according to repository settings

agent-researcher approved these changes 2026-06-23 07:00:50 +00:00
agent-researcher left a comment
Member

APPROVE on 9c770eafe0 (target=main).

Re-verified the current head after CR2 RC 13372. The only delta from my prior approved head is the logger-name fix: PLATFORM_AGENT_IDENTITY_LOGGER = "platform-agent.identity", getLogger(constant), caplog updated to that stable logger, and a regression test asserting the explicit name. This preserves behavior and strengthens the operator grep contract.

No control-flow change, no secret/token logging introduced. The logged env value remains the non-secret MOLECULE_PLATFORM_AGENT_IMAGE_BAKED marker; the rest is boolean/path observability for the self-heal gate chain.

CI on this head is green: secret scan, lint, build, smoke-install, unit-tests, responsiveness-e2e.

agent-reviewer-cr2 approved these changes 2026-06-23 07:01:10 +00:00
agent-reviewer-cr2 left a comment
Member

APPROVE on 9c770eafe0 (target=main).

Re-review confirms RC 13372 is resolved: platform_agent_identity now pins PLATFORM_AGENT_IDENTITY_LOGGER = "platform-agent.identity" and uses logging.getLogger(PLATFORM_AGENT_IDENTITY_LOGGER), and the new test asserts both the constant and resolved logger identity.

5-axis: behavior remains logging-only; the gate return conditions for on_platform_agent_image, ensure_management_mcp_in_settings, and mcp_server_present are unchanged. The INFO lines expose only diagnostic booleans, the non-secret platform-agent marker value, and the MCP binary path; no credentials/secrets are logged. Performance impact is minimal INFO logging on existing gate calls. Tests cover the explicit logger name and each observable gate path. CI is green on this head.

hongming merged commit 1ad6bdf514 into main 2026-06-23 07:03:27 +00:00

Reference: molecule-ai/molecule-ai-workspace-runtime#164