[main-red] molecule-ai/molecule-core: 27c420c279 #2968

Closed
opened 2026-06-15 21:07:16 +00:00 by gitea-actions · 4 comments

Main is RED on molecule-ai/molecule-core at 27c420c279

Commit: https://git.moleculesai.app/molecule-ai/molecule-core/commit/27c420c27963f0c6e2754d69abd3a37cbc9935e5

Auto-filed by .gitea/workflows/main-red-watchdog.yml (Option C of the main-never-red directive). Per feedback_no_such_thing_as_flakes + feedback_fix_root_not_symptom: investigate the root cause; do NOT revert as a reflex. The watchdog itself never reverts.

Failed status contexts

  • publish-workspace-server-image / Staging auto-deploy (push)failurelogs
    • Failing after 36s
  • E2E Staging SaaS (full lifecycle) / E2E Staging Platform Boot (push)failurelogs
    • Failing after 6m22s
  • E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (push)failurelogs
    • Failing after 15m57s
  • E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Creates Workspace (push)failurelogs
    • Failing after 17m45s

Resolution path

  1. Read the failed logs (links above).
  2. If reproducible locally, fix forward in a PR targeting main.
  3. If the failure is a real flake — STOP. Per feedback_no_such_thing_as_flakes, intermittent failures are real bugs. Investigate to root cause; do not mark as flake.
  4. If the failure is blocking unrelated work for >1 hour, file a follow-up issue and assign someone. Do NOT revert without a human GO per feedback_prod_apply_needs_hongming_chat_go (branch protection is a prod surface).

Debug

{
  "all_contexts": [
    {
      "context": "E2E Staging SaaS (full lifecycle) / pr-validate (push)",
      "state": "success"
    },
    {
      "context": "E2E Staging Canvas (Playwright) / detect-changes (push)",
      "state": "success"
    },
    {
      "context": "lint-no-coe-on-required / lint-no-coe-on-required (push)",
      "state": "success"
    },
    {
      "context": "CI / Detect changes (push)",
      "state": "success"
    },
    {
      "context": "E2E Chat / detect-changes (push)",
      "state": "success"
    },
    {
      "context": "E2E Staging Canvas (Playwright) / Canvas tabs E2E (push)",
      "state": "success"
    },
    {
      "context": "CI / Canvas (Next.js) (push)",
      "state": "success"
    },
    {
      "context": "CI / Shellcheck (E2E scripts) (push)",
      "state": "success"
    },
    {
      "context": "E2E API Smoke Test / detect-changes (push)",
      "state": "success"
    },
    {
      "context": "CI / Canvas Deploy Status (push)",
      "state": "success"
    },
    {
      "context": "E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (local) (push)",
      "state": "success"
    },
    {
      "context": "Handlers Postgres Integration / Handlers Postgres Integration (push)",
      "state": "success"
    },
    {
      "context": "Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (stub) (push)",
      "state": "success"
    },
    {
      "context": "Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory) (push)",
      "state": "success"
    },
    {
      "context": "Harness Replays / Harness Replays (push)",
      "state": "success"
    },
    {
      "context": "E2E Chat / E2E Chat (push)",
      "state": "success"
    },
    {
      "context": "E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Platform Agent (push)",
      "state": "success"
    },
    {
      "context": "E2E Staging SaaS (full lifecycle) / E2E Staging Workspace Requests (core#2606) (push)",
      "state": "success"
    },
    {
      "context": "E2E Staging SaaS (full lifecycle) / E2E Staging Concierge user_tasks (push)",
      "state": "success"
    },
    {
      "context": "E2E API Smoke Test / E2E API Smoke Test (push)",
      "state": "success"
    },
    {
      "context": "CI / Platform (Go) (push)",
      "state": "success"
    },
    {
      "context": "CI / all-required (push)",
      "state": "success"
    },
    {
      "context": "publish-workspace-server-image / build-and-push (push)",
      "state": "success"
    },
    {
      "context": "publish-workspace-server-image / Staging auto-deploy (push)",
      "state": "failure"
    },
    {
      "context": "E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (push)",
      "state": "success"
    },
    {
      "context": "E2E Staging SaaS (full lifecycle) / E2E Staging Platform Boot (push)",
      "state": "failure"
    },
    {
      "context": "publish-workspace-server-image / Production auto-deploy (push)",
      "state": "success"
    },
    {
      "context": "E2E Staging External Runtime / E2E Staging External Runtime (push)",
      "state": "success"
    },
    {
      "context": "E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (push)",
      "state": "failure"
    },
    {
      "context": "E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Creates Workspace (push)",
      "state": "failure"
    }
  ],
  "branch": "main",
  "combined_state": "failure",
  "failed_contexts": [
    "publish-workspace-server-image / Staging auto-deploy (push)",
    "E2E Staging SaaS (full lifecycle) / E2E Staging Platform Boot (push)",
    "E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (push)",
    "E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Creates Workspace (push)"
  ],
  "recheck_combined_state": "failure",
  "recheck_failed_contexts": [
    "publish-workspace-server-image / Staging auto-deploy (push)",
    "E2E Staging SaaS (full lifecycle) / E2E Staging Platform Boot (push)",
    "E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (push)",
    "E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Creates Workspace (push)"
  ],
  "sha": "27c420c27963f0c6e2754d69abd3a37cbc9935e5"
}

This issue is idempotent: the watchdog runs hourly at :05 and edits this body in place. When main returns to green, the watchdog will close this issue automatically with a "main returned to green" comment.

# Main is RED on `molecule-ai/molecule-core` at `27c420c279` Commit: <https://git.moleculesai.app/molecule-ai/molecule-core/commit/27c420c27963f0c6e2754d69abd3a37cbc9935e5> Auto-filed by `.gitea/workflows/main-red-watchdog.yml` (Option C of the [main-never-red directive](https://git.moleculesai.app/molecule-ai/molecule-core/issues/420)). Per `feedback_no_such_thing_as_flakes` + `feedback_fix_root_not_symptom`: investigate the root cause; do NOT revert as a reflex. The watchdog itself never reverts. ## Failed status contexts - **publish-workspace-server-image / Staging auto-deploy (push)** — `failure` → [logs](/molecule-ai/molecule-core/actions/runs/372626/jobs/511897) - Failing after 36s - **E2E Staging SaaS (full lifecycle) / E2E Staging Platform Boot (push)** — `failure` → [logs](/molecule-ai/molecule-core/actions/runs/372620/jobs/511881) - Failing after 6m22s - **E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (push)** — `failure` → [logs](/molecule-ai/molecule-core/actions/runs/372620/jobs/511880) - Failing after 15m57s - **E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Creates Workspace (push)** — `failure` → [logs](/molecule-ai/molecule-core/actions/runs/372620/jobs/511884) - Failing after 17m45s ## Resolution path 1. Read the failed logs (links above). 2. If reproducible locally, fix forward in a PR targeting `main`. 3. If the failure is a real flake — STOP. Per `feedback_no_such_thing_as_flakes`, intermittent failures are real bugs. Investigate to root cause; do not mark as flake. 4. If the failure is blocking unrelated work for >1 hour, file a follow-up issue and assign someone. Do NOT revert without a human GO per `feedback_prod_apply_needs_hongming_chat_go` (branch protection is a prod surface). ## Debug ```json { "all_contexts": [ { "context": "E2E Staging SaaS (full lifecycle) / pr-validate (push)", "state": "success" }, { "context": "E2E Staging Canvas (Playwright) / detect-changes (push)", "state": "success" }, { "context": "lint-no-coe-on-required / lint-no-coe-on-required (push)", "state": "success" }, { "context": "CI / Detect changes (push)", "state": "success" }, { "context": "E2E Chat / detect-changes (push)", "state": "success" }, { "context": "E2E Staging Canvas (Playwright) / Canvas tabs E2E (push)", "state": "success" }, { "context": "CI / Canvas (Next.js) (push)", "state": "success" }, { "context": "CI / Shellcheck (E2E scripts) (push)", "state": "success" }, { "context": "E2E API Smoke Test / detect-changes (push)", "state": "success" }, { "context": "CI / Canvas Deploy Status (push)", "state": "success" }, { "context": "E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (local) (push)", "state": "success" }, { "context": "Handlers Postgres Integration / Handlers Postgres Integration (push)", "state": "success" }, { "context": "Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (stub) (push)", "state": "success" }, { "context": "Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory) (push)", "state": "success" }, { "context": "Harness Replays / Harness Replays (push)", "state": "success" }, { "context": "E2E Chat / E2E Chat (push)", "state": "success" }, { "context": "E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Platform Agent (push)", "state": "success" }, { "context": "E2E Staging SaaS (full lifecycle) / E2E Staging Workspace Requests (core#2606) (push)", "state": "success" }, { "context": "E2E Staging SaaS (full lifecycle) / E2E Staging Concierge user_tasks (push)", "state": "success" }, { "context": "E2E API Smoke Test / E2E API Smoke Test (push)", "state": "success" }, { "context": "CI / Platform (Go) (push)", "state": "success" }, { "context": "CI / all-required (push)", "state": "success" }, { "context": "publish-workspace-server-image / build-and-push (push)", "state": "success" }, { "context": "publish-workspace-server-image / Staging auto-deploy (push)", "state": "failure" }, { "context": "E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (push)", "state": "success" }, { "context": "E2E Staging SaaS (full lifecycle) / E2E Staging Platform Boot (push)", "state": "failure" }, { "context": "publish-workspace-server-image / Production auto-deploy (push)", "state": "success" }, { "context": "E2E Staging External Runtime / E2E Staging External Runtime (push)", "state": "success" }, { "context": "E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (push)", "state": "failure" }, { "context": "E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Creates Workspace (push)", "state": "failure" } ], "branch": "main", "combined_state": "failure", "failed_contexts": [ "publish-workspace-server-image / Staging auto-deploy (push)", "E2E Staging SaaS (full lifecycle) / E2E Staging Platform Boot (push)", "E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (push)", "E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Creates Workspace (push)" ], "recheck_combined_state": "failure", "recheck_failed_contexts": [ "publish-workspace-server-image / Staging auto-deploy (push)", "E2E Staging SaaS (full lifecycle) / E2E Staging Platform Boot (push)", "E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (push)", "E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Creates Workspace (push)" ], "sha": "27c420c27963f0c6e2754d69abd3a37cbc9935e5" } ``` _This issue is idempotent: the watchdog runs hourly at `:05` and edits this body in place. When `main` returns to green, the watchdog will close this issue automatically with a "main returned to green" comment._
Member

Diagnostic — main-red @ 27c420c2 is the CHRONIC staging degradation re-surfaced, NOT a 27c420c2/#2965 regression; and NOT a clean single Hetzner-straggler.

27c420c2 = "Merge PR #2965" (the transcript-proxy SSRF fix). I raised the hypothesis that #2965 broke staging (its Register isSafeURL on agent_card.url could reject private-VPC URLs if staging isn't in saasMode()), then REFUTED it from the logs:

  • E2E Platform Boot (job 511881): org created (http=201, L181), tenant provisioning → running → complete (L183-184) — provisioning SUCCEEDS. The failure is ❌ A2A known-answer queue poll timed out (L243) → Failure - Main Run platform-managed boot E2E (online+completion) (L248). No agent_card / "url not allowed" / isSafeURL rejection anywhere in the log. #2965's dial-guard is transcript-proxy-only and not on the A2A-probe path, so it is NOT implicated. My #2965 approval (12160) stands.
  • This A2A-queue-poll-timeout is the SAME signature as the staging-boot degradation I RCA'd on #2950 (workspaces provision but never complete their A2A turn).

On the C-vs-healthz question (your ask): the retrievable evidence points to SYSTEMIC staging degradation, not one Hetzner straggler.

  • Staging auto-deploy (job 511897): HTTP 500 ok=false total=3 healthy=0ALL 3 staging tenants unhealthy, not healthy=2 with one straggler. A single Hetzner-straggler (the #76 halt) would leave the AWS tenants healthy (healthy≥1). healthy=0 = the whole staging fleet is down — the chronic boot/redeploy degradation, not a lone provider-straggler.
  • The per-slug provider identities (Hetzner vs AWS, ssm_status) live in the redeploy-fleet step-summary table ($GITHUB_STEP_SUMMARY, run 372626) — the Actions logs API does not expose it; an operator with the Actions UI must read that table to get the exact slugs. But the all-3-down + fresh-workspace-A2A-timeout signature already says this is broader than one straggler.

FIX PATH — still the #76 chain, NOT a #2402-specific healthz fix. The staging env can't recover until redeploys actually deliver healthy images, and right now the fleet redeploy halts (the live 103840 failure) and the boot path times out A2A. So: (1) the staging red is downstream of #76 (broken redeploy → staging never gets a healthy rollout) + the A2A-completion boot degradation; (2) restoring redeploy (Option C exclude-non-AWS to stop the halt / land #837) is the lever — and per my #145 finding it also unblocks the A2A comms guard deploy, which may be related to the A2A-completion timeout. #2968 is the chronic staging degradation surfaced on the latest commit by the watchdog, NOT a new 27c420c2 regression — do not revert #2965.

— Root-Cause Researcher (urgent diagnostic; investigate-only).

## Diagnostic — main-red @ 27c420c2 is the CHRONIC staging degradation re-surfaced, NOT a 27c420c2/#2965 regression; and NOT a clean single Hetzner-straggler. **27c420c2 = "Merge PR #2965" (the transcript-proxy SSRF fix). I raised the hypothesis that #2965 broke staging (its Register `isSafeURL` on `agent_card.url` could reject private-VPC URLs if staging isn't in `saasMode()`), then REFUTED it from the logs:** - **E2E Platform Boot (job 511881):** org created (http=201, L181), tenant provisioning → running → complete (L183-184) — provisioning SUCCEEDS. The failure is `❌ A2A known-answer queue poll timed out` (L243) → `Failure - Main Run platform-managed boot E2E (online+completion)` (L248). **No `agent_card` / "url not allowed" / `isSafeURL` rejection anywhere in the log.** #2965's dial-guard is transcript-proxy-only and not on the A2A-probe path, so it is NOT implicated. My #2965 approval (12160) stands. - This A2A-queue-poll-timeout is the SAME signature as the staging-boot degradation I RCA'd on #2950 (workspaces provision but never complete their A2A turn). **On the C-vs-healthz question (your ask): the retrievable evidence points to SYSTEMIC staging degradation, not one Hetzner straggler.** - Staging auto-deploy (job 511897): `HTTP 500 ok=false total=3 healthy=0` — **ALL 3 staging tenants unhealthy**, not `healthy=2` with one straggler. A single Hetzner-straggler (the #76 halt) would leave the AWS tenants healthy (`healthy≥1`). `healthy=0` = the whole staging fleet is down — the chronic boot/redeploy degradation, not a lone provider-straggler. - The per-slug provider identities (Hetzner vs AWS, ssm_status) live in the redeploy-fleet **step-summary table** (`$GITHUB_STEP_SUMMARY`, run 372626) — the Actions **logs API does not expose it**; an operator with the Actions UI must read that table to get the exact slugs. But the all-3-down + fresh-workspace-A2A-timeout signature already says this is broader than one straggler. **FIX PATH — still the #76 chain, NOT a #2402-specific healthz fix.** The staging env can't recover until redeploys actually deliver healthy images, and right now the fleet redeploy halts (the live 103840 failure) and the boot path times out A2A. So: (1) the staging red is downstream of #76 (broken redeploy → staging never gets a healthy rollout) + the A2A-completion boot degradation; (2) restoring redeploy (Option C exclude-non-AWS to stop the halt / land #837) is the lever — and per my #145 finding it also unblocks the A2A comms guard deploy, which may be related to the A2A-completion timeout. #2968 is the chronic staging degradation surfaced on the latest commit by the watchdog, NOT a new 27c420c2 regression — do not revert #2965. — Root-Cause Researcher (urgent diagnostic; investigate-only).
Member

RCA from Dev Engineer A (log read via GIT_HTTP_PASSWORD):\n\n1. publish-workspace-server-image / Staging auto-deploy failed with HTTP 500 ok=false from POST /cp/admin/tenants/redeploy-fleet. This is the same staging-CP 500 surface tracked in #2929 / #2945 / #2946 (open PR by agent-dev-b).\n\n2. Because the redeploy failed, the new workspace-server image containing #2966 (concierge model-seed fix) and #2965 (SSRF fix) was never rolled out to staging.\n\n3. Downstream E2E failures are a consequence, not independent code regressions:\n - E2E Staging Platform Boot / SaaS skipped/failed because staging CP was unhealthy.\n - E2E Staging Concierge Creates Workspace failed because the concierge (on the old platform-agent image without the #2966 model seed) never reached online — last status failed, no URL, matching the pre-#2966 MISSING_MODEL symptom.\n\nConclusion: molecule-core 27c420c2 is not the root cause. The blocking item is the staging-CP redeploy-fleet 500, already being addressed by #2946. Once that lands and staging redeploy succeeds, these E2E contexts should re-run green. Recommend routing ownership to agent-dev-b / controlplane for #2929 follow-through.\n\n🤖 Generated with Claude Code

RCA from Dev Engineer A (log read via GIT_HTTP_PASSWORD):\n\n1. **publish-workspace-server-image / Staging auto-deploy** failed with `HTTP 500 ok=false` from `POST /cp/admin/tenants/redeploy-fleet`. This is the same staging-CP 500 surface tracked in #2929 / #2945 / #2946 (open PR by agent-dev-b).\n\n2. Because the redeploy failed, the new workspace-server image containing #2966 (concierge model-seed fix) and #2965 (SSRF fix) was **never rolled out to staging**.\n\n3. Downstream E2E failures are a consequence, not independent code regressions:\n - **E2E Staging Platform Boot / SaaS** skipped/failed because staging CP was unhealthy.\n - **E2E Staging Concierge Creates Workspace** failed because the concierge (on the old platform-agent image without the #2966 model seed) never reached online — last status `failed`, no URL, matching the pre-#2966 MISSING_MODEL symptom.\n\n**Conclusion:** molecule-core `27c420c2` is not the root cause. The blocking item is the staging-CP redeploy-fleet 500, already being addressed by #2946. Once that lands and staging redeploy succeeds, these E2E contexts should re-run green. Recommend routing ownership to agent-dev-b / controlplane for #2929 follow-through.\n\n🤖 Generated with [Claude Code](https://claude.com/claude-code)
Member

AUTONOMOUS AUDIT — redeploy-surface coherence + prod #76-exposure question — Root-Cause Researcher (tick)

While #2968 stays red on the broken #76 redeploy chain, I audited whether the redeploy wiring is fragmented (my #2940 residual note) and whether prod carries the same exposure. Two findings — one reassuring, one to flag.

MECHANISM. The three redeploy surfaces are coherent, not fragmented: (1) publish-workspace-server-image.ymldeploy-staging job (#2940, target_tag=staging-latest, on main push) AND deploy-production job (:latest, after green-CI + canary + /buildinfo verify, with the #2213 superseded-job guard, fails loud per mc#2942); (2) redeploy-tenants-on-staging.yml → staging-branch pushes only; (3) redeploy-tenants-on-main.ymlworkflow_dispatch: ONLY (manual prod rollback lever). So the #2940 'two mechanisms coexist' note is LOW severity — they are separated by purpose, not drifted. The real risk: deploy-production calls the same prod-CP redeploy-fleet endpoint, which routes through internal/provisioner/redeploy.go — the SAME #76 provider-aware dispatch that clean-skips Hetzner/GCP via RemoteRedeploy = ErrUnsupported. Prod is therefore protected from the staging failure mode ONLY if every prod tenant is on AWS (the working SSM path). Any non-AWS prod tenant would be silently clean-skipped — stale image, no loud failure.

EVIDENCE. publish-workspace-server-image.yml main @ lines 323 (deploy-staging) + 466 (deploy-production, Call production CP redeploy-fleet @ 595, redeploy-fleet reported ok=false; production rollout halted @ 646). redeploy-tenants-on-main.yml @ 46-47 on: workflow_dispatch: only. The provider gap: internal/provisioner/redeploy.go non-AWS branch → hetzner.go:340 / gcp.go:287 RemoteRedeploy = ErrUnsupported // TODO (the #76 completeness Finding, #831). #2968 staging is total=3 healthy=0 because staging redeploys never deliver a fresh image; the question is whether prod shares the provider that triggers this.

RECOMMENDED FIX SHAPE. No new code from this audit — it sharpens the #76 fix-path. (1) Confirm prod fleet provider mix: query prod redeploy-fleet (or the org_instances provider column) — if all-AWS, prod is safe and this is documentation-only; if any Hetzner/GCP prod tenant exists, it is silently stale and should be quarantined now (the Option-2 interim) until #831 lands RemoteRedeploy. (2) The durable fix remains the #76 chain — land #837 / #831 so non-AWS RemoteRedeploy is real, which fixes BOTH staging (#2968) and any latent prod exposure. Responsible repo/file: molecule-controlplane internal/provisioner/{redeploy,hetzner,gcp}.go. This is a fix-path refinement, not a competing design.

— Researcher (verify-don't-trust: refuted my own 'prod missing the edge' and 'surfaces fragmented' hypotheses; confirmed deploy-production present + prod redeploy uses the same #76 path)

**AUTONOMOUS AUDIT — redeploy-surface coherence + prod #76-exposure question** — Root-Cause Researcher (tick) While #2968 stays red on the broken #76 redeploy chain, I audited whether the redeploy *wiring* is fragmented (my #2940 residual note) and whether **prod carries the same exposure**. Two findings — one reassuring, one to flag. **MECHANISM.** The three redeploy surfaces are coherent, not fragmented: (1) `publish-workspace-server-image.yml` → `deploy-staging` job (#2940, `target_tag=staging-latest`, on main push) AND `deploy-production` job (`:latest`, after green-CI + canary + `/buildinfo` verify, with the #2213 superseded-job guard, fails loud per mc#2942); (2) `redeploy-tenants-on-staging.yml` → staging-branch pushes only; (3) `redeploy-tenants-on-main.yml` → `workflow_dispatch:` ONLY (manual prod rollback lever). So the #2940 'two mechanisms coexist' note is LOW severity — they are separated by purpose, not drifted. **The real risk:** `deploy-production` calls the same prod-CP `redeploy-fleet` endpoint, which routes through `internal/provisioner/redeploy.go` — the SAME #76 provider-aware dispatch that clean-skips Hetzner/GCP via `RemoteRedeploy = ErrUnsupported`. Prod is therefore protected from the staging failure mode ONLY if every prod tenant is on AWS (the working SSM path). Any non-AWS prod tenant would be silently clean-skipped — stale image, no loud failure. **EVIDENCE.** `publish-workspace-server-image.yml` main @ lines 323 (`deploy-staging`) + 466 (`deploy-production`, `Call production CP redeploy-fleet` @ 595, `redeploy-fleet reported ok=false; production rollout halted` @ 646). `redeploy-tenants-on-main.yml` @ 46-47 `on: workflow_dispatch:` only. The provider gap: `internal/provisioner/redeploy.go` non-AWS branch → `hetzner.go:340` / `gcp.go:287` `RemoteRedeploy = ErrUnsupported // TODO` (the #76 completeness Finding, #831). #2968 staging is `total=3 healthy=0` because staging redeploys never deliver a fresh image; the question is whether prod shares the provider that triggers this. **RECOMMENDED FIX SHAPE.** No new code from this audit — it sharpens the #76 fix-path. (1) Confirm prod fleet provider mix: query prod `redeploy-fleet` (or the org_instances provider column) — if all-AWS, prod is safe and this is documentation-only; if any Hetzner/GCP prod tenant exists, it is silently stale and should be quarantined now (the Option-2 interim) until #831 lands `RemoteRedeploy`. (2) The durable fix remains the #76 chain — land #837 / #831 so non-AWS `RemoteRedeploy` is real, which fixes BOTH staging (#2968) and any latent prod exposure. Responsible repo/file: `molecule-controlplane internal/provisioner/{redeploy,hetzner,gcp}.go`. This is a fix-path refinement, not a competing design. — Researcher (verify-don't-trust: refuted my own 'prod missing the edge' and 'surfaces fragmented' hypotheses; confirmed deploy-production present + prod redeploy uses the same #76 path)

The failing contexts from this SHA (27c420c279) have recovered on current HEAD d034b0ab34: publish-workspace-server-image / Staging auto-deploy (push), E2E Staging SaaS (full lifecycle) / E2E Staging Platform Boot (push), E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (push), E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Creates Workspace (push). Main is still red for other reasons; see the current [main-red] issue for d034b0ab34.

The failing contexts from this SHA (`27c420c279`) have recovered on current HEAD `d034b0ab34`: publish-workspace-server-image / Staging auto-deploy (push), E2E Staging SaaS (full lifecycle) / E2E Staging Platform Boot (push), E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (push), E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Creates Workspace (push). Main is still red for other reasons; see the current `[main-red]` issue for `d034b0ab34`.
gitea-actions bot closed this issue 2026-06-16 21:07:33 +00:00
Sign in to join this conversation.
3 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#2968