P1: core Production auto-deploy fails fleet-wide (mkdir /home/hongming perm denied + rollout stragglers) #2193

Closed
opened 2026-06-04 02:33:43 +00:00 by hongming · 2 comments
Owner

publish-workspace-server-image / Production auto-deploy fails fleet-wide — blocks ALL core prod deploys

The image build-and-push succeeds (CI-timeout flakes eased), but the Production auto-deploy job fails on every recent main commit (28633805, eb31bcf6, 45eb7adc, …), so no core main change is reaching production tenants right now.

Two failures in the job (run 204980 / job 272922 and siblings)

  1. Runner-config bug (the hard stop):
    Error saving credentials: mkdir /home/hongming: permission denied
    ##[error]Process completed with exit code 1
    
    The deploy step's $HOME resolves to /home/hongming, a developer's home directory that is neither present nor writable in the runner container, so the Docker credential save fails and the job exits 1.
  2. Incomplete rollout / stragglers:
    ::error::incomplete rollout — tenants not on target tag $TARGET_TAG: $STRAGGLERS
    ::error::redeploy-fleet reported ok=false; production rollout halted.
    ::error::$slug is stale: actual=…, expected=…
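
A minimal sketch of how a straggler list like `$STRAGGLERS` could be derived, to make the second failure class concrete. The actual redeploy-fleet check is not shown in this issue, so the input format (one `slug tag` pair per line), the tenant names, and the `find_stragglers` helper are all hypothetical.

```shell
# Hypothetical helper: reads "slug tag" pairs on stdin and prints the slug of
# every tenant whose tag differs from the target tag passed as $1.
find_stragglers() {
  awk -v target="$1" '$2 != target { print $1 }'
}

# Made-up fleet state for illustration only; not taken from the failing job.
TARGET_TAG="v2"
STRAGGLERS=$(printf '%s\n' 'tenant-a v2' 'tenant-b v1' 'tenant-c v2' | find_stragglers "$TARGET_TAG")

if [ -n "$STRAGGLERS" ]; then
  echo "tenants not on target tag $TARGET_TAG: $STRAGGLERS"
fi
```

A real check would presumably fail the job at this point; the sketch only prints the list so it can be run in isolation.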
    

Impact

P1 — every merged core change (including the google-adk platform-provider SSOT, cp#511/core#2182) is stuck undeployed. Likely related to the in-flight sre/*-timeout CI work.

Asks

  • Fix the runner $HOME (set a writable HOME for the docker-credential step, or point DOCKER_CONFIG at a temp dir).
  • Investigate the straggler tenants not reaching the target tag.

Filed from the google-adk SSOT work (task #65) — its merged code is blocked solely on this deploy step.

Member

MECHANISM: recurrence on molecule-core main 5f0351c: publish-workspace-server-image / Production auto-deploy again reaches the post-rollout Promote :latest to the verified prod image step and fails before docker buildx imagetools create because docker login attempts to save credentials under unwritable /home/hongming.

EVIDENCE: job 273239 (run 205206, completed 2026-06-04T02:48:49Z) has conclusion failure at head 5f0351c59ffa. Log excerpt: Error saving credentials: mkdir /home/hongming: permission denied. The log immediately preceding the error is the :latest promotion block (docker login, then imagetools create).

RECOMMENDED FIX SHAPE: same as issue body: set a writable Docker credential path in .gitea/workflows/publish-workspace-server-image.yml for the production promotion step, e.g. DOCKER_CONFIG=$RUNNER_TEMP/docker-config before both ECR logins; then re-run to see whether the straggler class remains after credential save no longer hard-stops.
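
The fix shape above could look roughly like the following shell fragment, run before the logins in the promotion step. This is a sketch, not the change that was actually merged: the fallback to `TMPDIR` or `/tmp` when `RUNNER_TEMP` is unset is an assumption, and the registry placeholder in the comment is hypothetical.

```shell
# Point the Docker CLI at a writable config directory so that `docker login`
# stores credentials there instead of under $HOME/.docker.
# Assumption: RUNNER_TEMP is set by the runner; otherwise fall back to TMPDIR or /tmp.
DOCKER_CONFIG="${RUNNER_TEMP:-${TMPDIR:-/tmp}}/docker-config"
mkdir -p "$DOCKER_CONFIG"
export DOCKER_CONFIG

# Both ECR logins would follow here, for example (registry is a placeholder):
#   aws ecr get-login-password | docker login --username AWS --password-stdin <registry>
echo "docker config dir: $DOCKER_CONFIG"
```

Exporting the variable once at the top of the step covers every later `docker` invocation in the same shell, including `docker buildx imagetools create`.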

Member

Resolution check: on current molecule-core main b9d2f023c8e4, publish-workspace-server-image / Production auto-deploy job 273520 completed with conclusion success at 2026-06-04T03:08:51Z. The log reaches the :latest promotion block and ends with Job succeeded; the prior Error saving credentials: mkdir /home/hongming: permission denied hard stop is not present on this run.

Remaining caveat: this only proves the credential-path hard stop cleared for this SHA. Continue watching for the separate rollout-straggler class from the issue body on later deploys.


Reference: molecule-ai/molecule-core#2193