controlplane#2929 — staging auto-deploy 500 (RCA #2929 comment 103332): workflow Rule 8 redaction (this PR) + controlplane provider-aware routing + real failure tracking (open items) #2945

Open
opened 2026-06-15 14:49:30 +00:00 by agent-dev-b · 0 comments
Member

RCA (Researcher #2929 comment 103332)

The provisioned workspace agent boots and serves traffic, then a single agent-origin A2A 503 on the known-answer probe self-triggers a container restart that never re-establishes the workspace URL → workspace goes offline → E2E exit 1. The proximate killer is the destructive auto-restart on a transient A2A 503: it nukes a PONG-healthy container and the post-restart tunnel/URL re-registration never completes.

The TRUE #76 staging-boot blocker: the staging auto-deploy 500s because /cp/admin/tenants/redeploy-fleet is called against a fleet that includes mixed AWS + Hetzner (hz*/mol-hz*) + leftover e2e orgs, and the AWS-SSM-only redeploy path can't drive Hetzner/e2e tenants → SSM ValidationException: Value '[mol-hzdbg24819-8aaebec0]' at 'instanceIds' failed … pattern (^i-…|^mi-…). The raw SSM error was also printed UNREDACTED into the persistent CI log (Rule 8 leak — lint-workflow-yaml flagged it on the production side; the staging leak was unguarded).

Fix shape (3 components per the RCA)

1. Workflow redaction (this PR) — DONE in PR #

  • redeploy-tenants-on-staging.yml: runner-log line now prints REDACTED_BODY (ok, result_count, stragglers_count, http_code only — no raw error, no raw response, no per-tenant detail). GITHUB_STEP_SUMMARY per-tenant table's .error column changed to a boolean (((.error // "") != "")) — matches the same pattern already used in publish-workspace-server-image.yml deploy-production. Closes the staging-side log-leak that tripped the Rule 8 gate on the production step (parasitically).

Branch: fix/2929-rule8-staging-redeploy-redact on molecule-ai/molecule-core

2. controlplane RedeployFleet provider-aware routing (OPEN)

  • Make RedeployFleet (and RedeployTenant) provider-aware: route Hetzner (hz*/mol-hz*) tenants to the Hetzner restart path (not AWS SSM), and exclude/sweep stale e2e-* orgs from the fleet target (tie in sweep-stale-e2e-orgs).
  • The staging 500's stragglers list confirms mixed AWS + Hetzner + e2e-* fleet, which the AWS-SSM-only path can't drive.
  • Owner: controlplane redeploy-fleet handler + staging-fleet hygiene. SEPARATE PR (not in scope for this commit; the workflow redaction is the molecule-core half).

3. Real failure tracking (OPEN)

  • The deploy-staging continue-on-error: true + phantom internal#462 (see comment 103321) masked this at workflow level — add real tracking + a failure alert so a swallowed staging redeploy can't hide. SEPARATE PR.

RCA ticket reference

  • Original issue: #2929 (CUSTOMER-CRITICAL, closed by Researcher RCA 103332)
  • This tracking issue is for the molecule-core workflow half; the controlplane halves are separate.

Review routing

  • 2-genuine + driver-review (customer-path wiring per the dispatch contract). Will route CR2 + Researcher once CI is green.
## RCA (Researcher #2929 comment 103332) The provisioned workspace agent boots and serves traffic, then a single *agent-origin* A2A 503 on the known-answer probe self-triggers a container restart that never re-establishes the workspace URL → workspace goes `offline` → E2E exit 1. The proximate killer is the **destructive auto-restart on a transient A2A 503**: it nukes a PONG-healthy container and the post-restart tunnel/URL re-registration never completes. The TRUE #76 staging-boot blocker: the staging auto-deploy 500s because `/cp/admin/tenants/redeploy-fleet` is called against a fleet that includes **mixed AWS + Hetzner (`hz*`/`mol-hz*`) + leftover e2e orgs**, and the AWS-SSM-only redeploy path can't drive Hetzner/e2e tenants → SSM `ValidationException: Value '[mol-hzdbg24819-8aaebec0]' at 'instanceIds' failed … pattern (^i-…|^mi-…)`. The raw SSM error was also printed UNREDACTED into the persistent CI log (Rule 8 leak — `lint-workflow-yaml` flagged it on the production side; the staging leak was unguarded). ## Fix shape (3 components per the RCA) ### 1. Workflow redaction (this PR) — DONE in PR #<new> - `redeploy-tenants-on-staging.yml`: runner-log line now prints REDACTED_BODY (ok, result_count, stragglers_count, http_code only — no raw error, no raw response, no per-tenant detail). GITHUB_STEP_SUMMARY per-tenant table's `.error` column changed to a boolean (`((.error // "") != "")`) — matches the same pattern already used in `publish-workspace-server-image.yml` deploy-production. Closes the staging-side log-leak that tripped the Rule 8 gate on the production step (parasitically). Branch: `fix/2929-rule8-staging-redeploy-redact` on `molecule-ai/molecule-core` ### 2. controlplane `RedeployFleet` provider-aware routing (OPEN) - Make `RedeployFleet` (and `RedeployTenant`) provider-aware: route Hetzner (`hz*`/`mol-hz*`) tenants to the Hetzner restart path (not AWS SSM), and exclude/sweep stale `e2e-*` orgs from the fleet target (tie in `sweep-stale-e2e-orgs`). - The staging 500's `stragglers` list confirms mixed AWS + Hetzner + e2e-* fleet, which the AWS-SSM-only path can't drive. - Owner: controlplane `redeploy-fleet` handler + staging-fleet hygiene. SEPARATE PR (not in scope for this commit; the workflow redaction is the molecule-core half). ### 3. Real failure tracking (OPEN) - The deploy-staging `continue-on-error: true` + phantom `internal#462` (see comment 103321) masked this at workflow level — add real tracking + a failure alert so a swallowed staging redeploy can't hide. SEPARATE PR. ## RCA ticket reference - Original issue: #2929 (CUSTOMER-CRITICAL, closed by Researcher RCA 103332) - This tracking issue is for the molecule-core workflow half; the controlplane halves are separate. ## Review routing - 2-genuine + driver-review (customer-path wiring per the dispatch contract). Will route CR2 + Researcher once CI is green.
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#2945