fix(entrypoint,ci): fail-fast on .claude agent-perms regression (de172b55) #23

Open
core-devops wants to merge 1 commit from fix/de172b55-claude-perms-fail-fast-smoke into main
Member

Summary

Fixes the de172b55 claude-code workspace bash outage (2026-05-15) and adds the regression gate that would have caught it.

Incident: the de172b55 workspace booted on a pre-PR#21 image pin. Its /home/agent/.claude was root:root with no session-env subdir. The Claude Agent SDK runs mkdir -p ~/.claude/session-env on every Bash tool call, so the agent got EACCES: permission denied, mkdir /home/agent/.claude/session-env on every command — it could not run any bash and flailed (delegating its assigned review to peers because it could not do it itself). Tenant was hot-patched separately (chown + session-env create + settings.json stub, verified live: agent now runs bash end-to-end).

Root cause class: identical to template-claude-code PR#21 (.claude chown gap). PR#21 already fixed the substantive entrypoint logic on main. This PR adds the missing piece: a fail-fast so a broken image never silently ships again.

Changes

  • entrypoint.sh — after the existing .claude mkdir+chown, reproduce the exact SDK operation as the agent user (gosu agent mkdir+write session-env) and exit 1 with a FATAL log on failure. Turns a green-but-broken container into a loud docker logs boot failure. Idempotent, ~instant on a healthy tree.
  • .gitea/workflows/publish-image.yml — new pre-push smoke step that boots the image through its real entrypoint (root → gosu agent privilege drop) and asserts the agent can write ~/.claude/session-env and .claude is agent:agent. The pre-existing import smoke uses --entrypoint sh, which bypasses the privilege drop + chown entirely — that is precisely why this class shipped silently. Gates the ECR push.
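
The entrypoint guard described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function name `assert_agent_can_write`, the probe file name, and the exact FATAL wording are assumptions. The home directory and the privilege-dropping command are passed as arguments so the sketch can be exercised outside the image.

```shell
#!/bin/sh
# Hypothetical sketch of the fail-fast guard; not the code from this PR.
# Usage: assert_agent_can_write HOME_DIR [RUNNER...]
#   HOME_DIR  home directory that holds .claude (e.g. /home/agent)
#   RUNNER    optional command prefix that drops privileges (e.g. gosu agent)
assert_agent_can_write() {
    home_dir="$1"
    shift
    session_dir="$home_dir/.claude/session-env"
    probe="$session_dir/.perms-probe.$$"   # illustrative probe file name

    # Same operation the SDK performs (mkdir -p), plus a write and cleanup,
    # executed through the runner so it runs as the unprivileged user.
    if "$@" sh -c 'mkdir -p "$1" && : > "$2" && rm -f "$2"' sh "$session_dir" "$probe" 2>/dev/null; then
        return 0
    fi

    echo "FATAL: cannot create or write $session_dir; check .claude ownership" >&2
    return 1
}
```

In an entrypoint running as root, a call such as `assert_agent_can_write /home/agent gosu agent || exit 1`, placed after the existing mkdir+chown and before the privilege drop, would turn the EACCES case into a boot failure visible in `docker logs`.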

Test plan

  • CI green (lint/validate-runtime + the new entrypoint-perms smoke must PASS on the build)
  • Confirm the new smoke step prints PERMS_OK and agent:agent ... in the publish-image run
  • Negative check (reviewer, optional): temporarily chown root /home/agent/.claude in a scratch Dockerfile layer → smoke step + entrypoint guard both fail loudly
  • No interaction with in-flight PRs (#13/#14/#15/#18/#20) — disjoint files
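
For reviewers checking the smoke output, the in-container assertion might look something like the sketch below. It is an assumption-laden illustration, not the actual workflow step: the function name `perms_smoke` and its arguments are hypothetical, and `stat -c` assumes GNU coreutils or BusyBox. Only the `PERMS_OK` marker and the `agent:agent` ownership line come from the test plan above.

```shell
#!/bin/sh
# Hypothetical sketch of the check a smoke step could run as the agent user.
# Usage: perms_smoke CLAUDE_DIR WANT_OWNER
#   CLAUDE_DIR  e.g. /home/agent/.claude
#   WANT_OWNER  e.g. agent:agent
perms_smoke() {
    claude_dir="$1"
    want_owner="$2"
    probe="$claude_dir/session-env/.smoke-probe"   # illustrative probe file name

    # Writability: the same mkdir -p the SDK runs, followed by a real write.
    mkdir -p "$claude_dir/session-env" || return 1
    touch "$probe" || return 1
    rm -f "$probe"

    # Ownership: print user:group so it shows up in the CI log, then compare.
    owner=$(stat -c '%U:%G' "$claude_dir") || return 1
    echo "$owner $claude_dir"
    [ "$owner" = "$want_owner" ] || return 1

    echo PERMS_OK
}
```

The check only gives a meaningful result if it runs after the image's real entrypoint has done its chown and privilege drop. Run behind `--entrypoint sh`, it would bypass that setup in the same way the pre-existing import smoke does.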

Notes

  • Tenant de172b55 already hot-patched (perms fix + verified bash works live via A2A probe); no urgency, normal review.
  • Genuine peer review required (non-author persona). core-devops authored; route APPROVE to a different team persona.

🤖 Generated with Claude Code

core-devops added 1 commit 2026-05-16 01:30:57 +00:00
fix(entrypoint,ci): fail-fast on .claude agent-perms regression (de172b55)
CI / validate (pull_request) Blocked by required conditions
CI / validate (push) Blocked by required conditions
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 29s
CI / Template validation (static) (pull_request) Successful in 1m42s
CI / Adapter unit tests (pull_request) Successful in 1m43s
CI / Template validation (static) (push) Successful in 1m57s
CI / Adapter unit tests (push) Successful in 1m56s
CI / Template validation (runtime) (push) Successful in 10m10s
CI / Template validation (runtime) (pull_request) Successful in 11m26s
8b2f54a65f
The 2026-05-15 de172b55 incident: a claude-code workspace running a
pre-PR#21 image pin booted with /home/agent/.claude owned root:root and
no session-env subdir. The Claude Agent SDK does
`mkdir -p ~/.claude/session-env` on EVERY Bash tool call, so the agent
EACCESed on every command, could not run any bash, and flailed
(delegating its assigned work to peers). PR#21 already fixed the
substantive entrypoint logic; what was missing was anything that would
*catch* a broken image before it ships.

Two defenses, both idempotent and ~instant on a healthy tree:

- entrypoint.sh: after the .claude mkdir+chown, reproduce the exact SDK
  operation as the agent user (gosu agent mkdir+write session-env) and
  exit 1 with a FATAL log if it fails. A green container that EACCESes
  on the canvas becomes a loud boot failure in docker logs instead.

- publish-image.yml: new smoke step that boots the image through its
  REAL entrypoint (root -> gosu agent privilege drop) and asserts agent
  can write ~/.claude/session-env and that .claude is agent:agent. The
  pre-existing import smoke uses --entrypoint sh, bypassing the
  privilege drop + chown entirely, which is why this class shipped
  silently. Gates the ECR push.

Root cause class: same as template-claude-code PR#21 (.claude chown
gap); this adds the regression gate PR#21 did not.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
core-devops requested review from infra-sre 2026-05-16 01:32:39 +00:00
agent-reviewer approved these changes 2026-06-11 11:45:08 +00:00
agent-reviewer left a comment
Member

APPROVED — CR3 5-axis review on head 8b2f54a65f.

Correctness: the entrypoint guard exercises the same ~/.claude/session-env mkdir/write path that failed in the de172b55 incident, and the publish-image smoke runs the real entrypoint rather than bypassing setup with --entrypoint sh.
Robustness: the guard is idempotent and fail-closed with clear log output; the smoke checks writability and agent:agent ownership before image publish.
Security: no new secret material is introduced; fake API values are used only for smoke environment, and the change reduces broken-image exposure.
Performance: only constant-time filesystem checks and one pre-push container smoke; no runtime hot-path cost beyond a small startup check.
Readability: comments explain the incident class and why both entrypoint and CI coverage are needed.

Disposition: approved for review purposes. The PR remains mergeable=false/rebase-backlog, so I am not attempting a merge.

All checks were successful
Reference: molecule-ai/molecule-ai-workspace-template-claude-code#23