canary-staging.yml: 38h+ chronic red — A2A agent error + teardown leak + Gitea-incompatible alerting #129

Closed
opened 2026-05-08 17:42:00 +00:00 by claude-ceo-assistant · 0 comments

Closed — all 3 failure modes resolved

Live verification, 2026-05-08T20:14-20:17 UTC (log timestamps below are in local time, UTC-7):

[13:14:24] tenant up → e2e-canary-20260508-multipath-17.staging.moleculesai.app
[13:16:58] ✅ A2A returned PONG
[13:17:05] ✅ Teardown clean — no orphan resources (0s)

Fix chain (in resolution order)

| Mode | Fix | Status |
|---|---|---|
| 1. Agent error (Exception) | template-claude-code#6 (Dockerfile bundles config.yaml) + #7 (restore multi-path _load_providers) | ✅ |
| 2. Teardown leak | molecule-controlplane#45 (CleanupTunnelConnections before DeleteTunnel) | ✅ |
| 3. Issue-filing 404 | molecule-core#130 (sticky issue, no listWorkflowRuns) | ✅ |
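For readers unfamiliar with the mode 2 fix, the ordering it describes (clean up a tunnel's connections before deleting the tunnel) can be sketched roughly as below. This is an illustrative Python sketch only, not the code from molecule-controlplane#45: `FakeTunnelAPI`, `teardown_tunnel`, and the rule that a delete fails while connections remain are all assumptions made for the example.

```python
class FakeTunnelAPI:
    """Hypothetical in-memory stand-in for a tunnel provider API."""

    def __init__(self):
        self.connections = {}  # tunnel_id -> list of open connection ids
        self.calls = []        # ordered record of successful calls

    def cleanup_connections(self, tunnel_id):
        # Drop every open connection for the tunnel.
        self.connections[tunnel_id] = []
        self.calls.append(("cleanup", tunnel_id))

    def delete_tunnel(self, tunnel_id):
        # Assumption: the provider refuses to delete a tunnel that still
        # has open connections, which would leave it behind as an orphan.
        if self.connections.get(tunnel_id):
            raise RuntimeError("tunnel still has active connections")
        del self.connections[tunnel_id]
        self.calls.append(("delete", tunnel_id))


def teardown_tunnel(api, tunnel_id):
    """Clean up connections first, then delete the tunnel itself."""
    api.cleanup_connections(tunnel_id)
    api.delete_tunnel(tunnel_id)
```

Under these assumptions, calling `delete_tunnel` directly on a tunnel with open connections raises and leaves the tunnel in place, while `teardown_tunnel` removes it.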

Side-quests unblocked

  • molecule-controlplane#43 (MOLECULE_IMAGE_REGISTRY env propagation) — closed via PR #44
  • molecule-controlplane#41 (TestRedeploy ctx-cancel timing) — closed via PR #42
  • 4 of 27 hermes-agent test failures from hermes-agent#9 — closed via hermes-agent#10
  • internal#100 (recover-tunnels.py auto-discover) — closed via PR #103
  • internal#101 runbook update with reno-stars 4th tenant
  • internal#104 Decision B RFC (full AWS account split, future)

Operational fix applied during investigation

Updated CP staging Neon runtime_image_pins.image_digest for claude-code to the rebuilt image digest sha256:d36b80e3fedef4e9c4779d0e2f8b5879798f034a71b3f7b8ff8a251a6942d3f7. This was done with manual SQL, because the admin endpoint for managing pins is still unimplemented (per the follow-up note in 047_runtime_image_pins.up.sql).
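Until that admin endpoint exists, a small helper can make the manual pin update less error-prone by validating the digest and keeping the statement parameterized. The sketch below is a guess at the shape of such a helper, not the SQL that was actually run: only the table `runtime_image_pins` and the column `image_digest` come from this issue, while the `runtime` key column, the `%s` placeholder style (as used by psycopg), and the name `build_pin_update` are assumptions.

```python
import re

# A sha256 image digest is "sha256:" followed by 64 lowercase hex characters.
DIGEST_RE = re.compile(r"sha256:[0-9a-f]{64}")


def build_pin_update(runtime, digest):
    """Return (sql, params) for updating one runtime's pinned image digest.

    Assumption: rows are keyed by a column named `runtime`; adjust this to
    the real schema in 047_runtime_image_pins.up.sql before using it.
    """
    if not DIGEST_RE.fullmatch(digest):
        raise ValueError(f"not a sha256 image digest: {digest!r}")
    sql = "UPDATE runtime_image_pins SET image_digest = %s WHERE runtime = %s"
    return sql, (digest, runtime)
```

The returned pair can be passed to a database cursor's `execute(sql, params)`, which keeps the digest out of the SQL string itself.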

Reference: molecule-ai/molecule-core#129