From 5d5114d99e4e173cdc6e110f00e7a4d233c21a79 Mon Sep 17 00:00:00 2001
From: core-devops
Date: Wed, 20 May 2026 09:30:31 -0700
Subject: [PATCH] fix(image,ci): create /agent-home + expose silent T4 probe failures
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Unblocks chronic-red T4 tier-4 conformance gate on this repo (the pilot
template for the uniform contract migration; e31c176).

## Root cause

Commit e31c176 ("consume uniform privilege contract from molecule-core
(pilot)") migrated this template's T4 gate from the OLD hand-written
2-probe shell to the 10-capability uniform contract emitted by
molecule-core/workspace-server/internal/provisioner/t4_privilege_contract.go.

The uniform contract added an `agent_home_writable` capability (severity
hard) that asserts /agent-home is writable by uid-1000 agent WITHOUT
sudo (per task #128 Files API redesign). The template's image never
created /agent-home, so every run since e31c176 has failed:

    FAIL agent_home_writable (hard): rc=2
      stderr: sh: 1: cannot create /agent-home/.t4-cap-write-probe-*: Directory nonexistent

The other two failing probes — `docker_socket_reachable` and
`pid_host_visible` — emit NO output on failure (the contract probe
redirects to /dev/null, and `readlink` comparisons are silent), so the
iterator could not diagnose them. The set-x diag re-run added in this PR
makes the next run self-explanatory if either still fails.

Fleet check: template-hermes / template-openclaw / template-codex all
have GREEN T4 because they still consume the OLDER hand-written gate;
template-cc was the migration pilot and is the only one bound to the
uniform contract today.

## Fix shape

1. Dockerfile — `install -d -m 0755 -o agent -g agent /agent-home`
   immediately after the `useradd agent` + T4 escalation leg block.
   This is image-side only; no platform/provisioner contract change.

2. entrypoint.sh — defense-in-depth `mkdir -p /agent-home + chown` in
   the root branch, mirroring the existing /configs handling. Covers
   the case where a platform volume mount masks the build-time
   directory or comes up root-owned. Idempotent.

3. .gitea/workflows/ci.yml — when a probe fails with no stderr AND no
   stdout (the docker_socket_reachable / pid_host_visible class), the
   iterator now re-runs it with `sh -xc` and prints the tail. Purely
   diagnostic — the verdict is still the original returncode.

## Empirical evidence

- cc#38 head e017015913 — T4 run 197 job 2 (op-host log filename
  molecule-ai/molecule-ai-workspace-template-claude-code/...; see PR
  status target_url): FAIL agent_home_writable + docker_socket +
  pid_host_visible.
- cc T4 run 165 (commit 19945021973b) PASSED — but that was BEFORE
  e31c176 with the OLD 2-probe gate (PASS line: "uid-1000 agent reaches
  host root AND /configs/.auth_token is agent-owned").
- hermes T4 run 220 (today) PASSED — still the OLD hand-written gate.

## Anti-regression

- The `agent_home_writable` fix is verified at build time (`install`
  fails closed if uid/gid invalid; agent exists by then per useradd).
- The diagnostic re-run does NOT mask any failure; rc gate unchanged.

Refs:
- task #305 (T4 chronic red)
- task #128 (Files API redesign)
- RFC internal#456 (uniform privilege-contract class)
- memory feedback_platform_must_hardgate_base_contract
- memory feedback_hermes_listpeers_401_token_root600_unreadable_by_uid1000
---
 .gitea/workflows/ci.yml | 23 +++++++++++++++++++++++
 Dockerfile              | 18 ++++++++++++++++++
 entrypoint.sh           |  9 +++++++++
 3 files changed, 50 insertions(+)

diff --git a/.gitea/workflows/ci.yml b/.gitea/workflows/ci.yml
index dcc04d8..2229ea7 100644
--- a/.gitea/workflows/ci.yml
+++ b/.gitea/workflows/ci.yml
@@ -356,8 +356,31 @@ jobs:
                     else:
                         msg = f"FAIL {name} ({sev}): rc={r.returncode} source={cap.get('source','?')}"
                         print(f"::error::{msg}")
+                        # Some probes redirect stderr to /dev/null in the
+                        # contract YAML (e.g. docker_socket_reachable) or
+                        # produce nothing on failure (e.g. test-bracket
+                        # exits). Re-run those WITHOUT the contract's
+                        # internal redirection by prefixing `set -x` so the
+                        # actual failing command + its error surface in
+                        # stderr, then dump both streams. This is purely a
+                        # diagnostic re-run — the verdict above
+                        # (`r.returncode`) is the gate.
                         if r.stderr.strip():
                             print(f" stderr: {r.stderr.strip()}")
+                        if r.stdout.strip():
+                            print(f" stdout: {r.stdout.strip()}")
+                        if not (r.stderr.strip() or r.stdout.strip()):
+                            # Silent failure — re-run with `set -x` to
+                            # expose the failing command path.
+                            r2 = subprocess.run(
+                                ["docker","exec","-u","agent",probe,"sh","-xc",probe_sh],
+                                capture_output=True, text=True,
+                            )
+                            tail = (r2.stderr + r2.stdout).strip().splitlines()[-10:]
+                            if tail:
+                                print(" diag (set -x tail):")
+                                for line in tail:
+                                    print(f" | {line}")
                         if sev == "hard":
                             fails_hard.append(name)
                         else:
diff --git a/Dockerfile b/Dockerfile
index f8864cd..c03c745 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -57,6 +57,24 @@ RUN set -eux; \
     usermod -aG docker agent; \
     id agent
 
+# --- Files API redesign root (task #128) — /agent-home -----------------
+# /agent-home is the user-writable file-tree root the runtime exposes
+# to the agent (per task #128). The Layer-3 T4 conformance gate asserts
+# `agent_home_writable` via the uniform contract emitted by molecule-core
+# (workspace-server/internal/provisioner/t4_privilege_contract.go).
+# The probe writes a marker file at /agent-home/.t4-cap-write-probe-* as
+# uid-1000 agent WITHOUT sudo — so the directory must (a) exist in the
+# image, and (b) be writable by agent without a recursive chown step at
+# entrypoint time (the entrypoint may not run in the T4 smoke probe — it
+# starts the container with `--entrypoint /bin/sh ... 'sleep 600'`).
+# Creating it during the image build, owned agent:agent mode 0755, is
+# the minimal change that satisfies the contract for both the live boot
+# path (entrypoint may re-chown if a volume mount masks the build-time
+# dir) and the smoke-probe path (which bypasses entrypoint).
+#
+# This is image-side only; no platform/provisioner contract changes.
+RUN install -d -m 0755 -o agent -g agent /agent-home
+
 WORKDIR /app
 
 # RUNTIME_VERSION is forwarded from the reusable publish workflow as
diff --git a/entrypoint.sh b/entrypoint.sh
index d360fa3..f7cd310 100644
--- a/entrypoint.sh
+++ b/entrypoint.sh
@@ -52,6 +52,15 @@ if [ "$(id -u)" = "0" ]; then
     # Layer-3 conformance gate asserts owner_uid==1000 on the running
     # container alongside the host-root-reach assertion.
     chown -R agent:agent /configs 2>/dev/null
+    # /agent-home — Files API redesign root (task #128). Created with
+    # agent:agent ownership during image build (see Dockerfile). The
+    # idempotent mkdir + chown here is defense-in-depth in case a
+    # platform volume mount masks the build-time directory or comes up
+    # root-owned (typical for empty Docker named volumes on Linux). The
+    # T4 conformance gate's `agent_home_writable` probe runs as uid-1000
+    # WITHOUT sudo, so ownership must be correct before exec gosu.
+    mkdir -p /agent-home 2>/dev/null || true
+    chown agent:agent /agent-home 2>/dev/null || true
     # /workspace handling — only chown when the contents are root-owned
     # (typical on Docker Desktop on Windows where host uid maps to 0).
     # On Linux Docker with matching uids the recursive chown is skipped
-- 
2.52.0
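
The silent-probe handling added to ci.yml can be tried outside the workflow. The following is a minimal sketch, assuming a local POSIX `sh` in place of `docker exec -u agent`; `ProbeResult` and `run_probe` are hypothetical names used only for illustration and do not exist in the workflow. It shows that a probe which redirects its own stderr to /dev/null still leaves a trace when re-run with `sh -xc`, because the shell writes the trace to its own stderr, and that the original return code remains the verdict.

```python
import subprocess
from dataclasses import dataclass, field


@dataclass
class ProbeResult:
    """Outcome of one probe; `diag` is filled only for silent failures."""
    returncode: int
    stdout: str
    stderr: str
    diag: list[str] = field(default_factory=list)


def run_probe(probe_sh: str, tail_lines: int = 10) -> ProbeResult:
    """Run a shell probe; on a silent failure, re-run it with xtrace.

    Hypothetical helper: the real workflow runs the probe via
    `docker exec -u agent`; plain `sh` is used here so the sketch
    works without a container.
    """
    r = subprocess.run(["sh", "-c", probe_sh], capture_output=True, text=True)
    result = ProbeResult(r.returncode, r.stdout.strip(), r.stderr.strip())
    if r.returncode != 0 and not (result.stdout or result.stderr):
        # Diagnostic re-run only: its return code is ignored, so the
        # verdict stays the return code of the first run.
        r2 = subprocess.run(
            ["sh", "-xc", probe_sh], capture_output=True, text=True
        )
        combined = (r2.stderr + r2.stdout).strip()
        result.diag = combined.splitlines()[-tail_lines:]
    return result


if __name__ == "__main__":
    # The path below is an assumed-missing directory chosen for the demo.
    res = run_probe("test -w /no-such-dir-for-this-sketch 2>/dev/null")
    print(f"rc={res.returncode}")
    for line in res.diag:
        print(f"  | {line}")
```

A probe that already prints something on failure skips the re-run, so the extra `sh -xc` invocation only costs time for the silent class of failures.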