infra: codex runtime sandbox unshares net -> bwrap loopback EPERM blocks all review agents (durable fix for CTO hotpatch) #2128

Open
opened 2026-06-02 18:12:36 +00:00 by devops-engineer · 3 comments

Infra fix (CTO-hotpatched 2026-06-02; durable fix needed).

Symptom: codex review/research agents (Reviewer 7d88be80, Researcher fd42c9d6 in agents-team) could ACK A2A dispatches but every shell/Gitea/CI command failed at startup with:
bwrap: loopback: Failed RTM_NEWADDR: Operation not permitted
As a result the agents were network-isolated and could not fetch PR diffs, check CI, or post reviews. This silently stalled the ENTIRE review→merge pipeline (40+ PRs piled up while the judgment-tier agents looked merely "idle").

Root cause: codex 0.130 defaults to a sandbox that unshares the network namespace; on the workspace AMI kernel, bwrap then fails to bring up loopback (RTM_NEWADDR EPERM), which aborts every sandboxed command. Neither render_provider_toml.py (which writes nothing for built-in OpenAI/subscription providers) nor codex_mcp_config.sh sets any sandbox config, so codex falls back to the net-unsharing default.

Hotpatch applied (temporary): on both EC2s, appended a block to /usr/local/bin/codex_mcp_config.sh that prepends the following to ~/.codex/config.toml:

sandbox_mode = "workspace-write"

[sandbox_workspace_write]
network_access = true
Then docker restart. Verified the config and that the container came up. This is lost on a full re-provision (rm+run from image).

Durable fix wanted: the codex runtime image / render_provider_toml.py (or codex_mcp_config.sh) must always emit sandbox_mode = "workspace-write" + [sandbox_workspace_write] network_access = true (the tenant container is already the isolation boundary, so codex's net-unshare is redundant and breaks reviewers). Add a boot assertion/test that a codex-sandboxed curl https://git.moleculesai.app/api/v1/version returns 200. Applies to ALL codex-runtime workspaces, not just these two.

Repo note: this is the codex workspace runtime/template (render_provider_toml.py + codex_mcp_config.sh live in /usr/local/bin of the codex image). Routine SOP gate; tier = infra/runtime.


Correction after second pass: enabling network_access alone is NOT sufficient. After the net-unshare was fixed, codex still failed at bwrap: setting up uid map: Permission denied — this host kernel ALSO blocks unprivileged user-namespace creation. So a workspace-write sandbox cannot initialize here at all.

Durable fix is therefore one of:

  1. sandbox_mode = "danger-full-access" (+ approval_policy = "never") — disable codex's inner sandbox; the tenant container is already the isolation boundary (this is what claude-code agents already do). This is what the hotpatch now uses on both codex agents. Simplest; recommended for our own internal agents.
  2. OR enable unprivileged user+net namespaces on the workspace AMI/host (sysctl kernel.unprivileged_userns_clone=1 + the netns caps) so codex's bwrap sandbox can initialize — only needed if we want codex's inner sandbox for untrusted/external tenants.

Recommend (1) for the codex runtime image default. Boot assertion unchanged (codex-sandboxed curl gitea → 200).


Spec clarification from backlog/RCA sweep: the boot assertion should exercise the actual sandboxed Codex command path, not only bare container networking. A useful acceptance test is codex exec (or the runtime's wrapper entrypoint) running a minimal network command such as curl https://git.moleculesai.app/api/v1/version and returning 200. Also make config emission idempotent so repeated starts do not prepend duplicate conflicting sandbox_mode blocks.
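
A minimal sketch of such an acceptance check, under stated assumptions. The thread does not specify how the runtime routes a command through the sandboxed Codex path, so the `wrapper` prefix is left to the caller and is purely hypothetical here. The function names and the choice to compare curl's printed status code are also illustrative, and the sketch assumes the wrapper passes curl's stdout through unchanged.

```python
import subprocess
from typing import Callable, Sequence, Tuple

VERSION_URL = "https://git.moleculesai.app/api/v1/version"

# curl discards the body and prints only the HTTP status code on stdout.
CURL_CMD = [
    "curl", "-sS", "-o", "/dev/null",
    "-w", "%{http_code}",
    "--max-time", "10",
    VERSION_URL,
]

Runner = Callable[[Sequence[str]], Tuple[int, str]]


def run_subprocess(argv: Sequence[str]) -> Tuple[int, str]:
    """Run argv and return (exit code, stdout)."""
    proc = subprocess.run(list(argv), capture_output=True, text=True, timeout=60)
    return proc.returncode, proc.stdout


def assert_sandboxed_network(wrapper: Sequence[str], run: Runner = run_subprocess) -> None:
    """Fail loudly unless the wrapped curl exits 0 and reports HTTP 200.

    `wrapper` is the (hypothetical) command prefix that sends the command
    through the sandboxed Codex path rather than the bare container shell.
    """
    argv = [*wrapper, *CURL_CMD]
    code, out = run(argv)
    status = out.strip()
    if code != 0 or status != "200":
        raise RuntimeError(
            f"sandboxed network check failed: exit={code} http_status={status!r}"
        )
```

Injecting `run` keeps the check testable without a live agent; at boot the default runner would be used, and a raised RuntimeError would surface the failure instead of leaving the agent looking idle.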


Corrective fix up: 'molecule-ai/molecule-ai-workspace-template-codex'#82 (supersedes PR #77's variant; #59 closed).

Ground truth from the live CR2 agent (i-0094faed, agents-team), 2026-06-04: the kernel does NOT block userns (unprivileged_userns_clone=1, userns creation succeeds). The real block is the uid-map write: the agent (uid 1000, CapEff=0) in an identity-mapped container (uid_map 0 0 4294967295, no /etc/subuid) cannot map root into a new userns. unshare --user --map-root-user fails with "write failed /proc/self/uid_map: Operation not permitted", which is exactly the bwrap: setting up uid map: Permission denied error reported earlier.
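
For anyone re-checking this on another workspace, the evidence cited above can be gathered with a short script. This is only a sketch: the helper names are made up for illustration, it reads the standard Linux /proc files, and it treats /proc/sys/kernel/unprivileged_userns_clone as optional because not every kernel exposes that sysctl. It reports facts only and does not attempt the userns creation itself.

```python
from pathlib import Path
from typing import List, Optional, Tuple

FULL_RANGE = 4294967295  # length column of the identity map quoted above

IdMap = List[Tuple[int, int, int]]


def parse_id_map(text: str) -> IdMap:
    """Parse uid_map content: one 'inside outside length' triple per line."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            rows.append((int(fields[0]), int(fields[1]), int(fields[2])))
    return rows


def is_identity_map(rows: IdMap) -> bool:
    """True for the single full-range mapping '0 0 4294967295'."""
    return rows == [(0, 0, FULL_RANGE)]


def effective_caps(status_text: str) -> int:
    """Return the CapEff bitmask from /proc/<pid>/status content."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("CapEff line not found")


def read_or_none(path: str) -> Optional[str]:
    """Read a file, returning None if it is missing or unreadable."""
    try:
        return Path(path).read_text()
    except OSError:
        return None


if __name__ == "__main__":
    uid_map = read_or_none("/proc/self/uid_map")
    status = read_or_none("/proc/self/status")
    userns = read_or_none("/proc/sys/kernel/unprivileged_userns_clone")
    if uid_map is not None:
        print("identity-mapped:", is_identity_map(parse_id_map(uid_map)))
    if status is not None:
        print("CapEff:", hex(effective_caps(status)))
    print("unprivileged_userns_clone:", userns.strip() if userns else "not exposed")
    print("/etc/subuid present:", Path("/etc/subuid").exists())
```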

So PR #77 (sandbox_mode=workspace-write + network_access=true) is a NON-FIX: it clears the netns RTM_NEWADDR error but the workspace-write sandbox still cannot initialize (dies at uid_map). The live agent is in exactly PR #77 state and remains network-blocked (Task #194). Durable fix in #82 = disable the inner sandbox (danger-full-access + approval_policy=never; tenant container is the isolation boundary) + idempotent migration off the workspace-write shape + the boot assertion. 9 tests pass, shellcheck clean.
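
The idempotent migration can be pictured roughly as follows. This is a hedged sketch, not the code in #82: the marker comments and function name are hypothetical, and it assumes the legacy [sandbox_workspace_write] table holds only network_access, as in the hotpatch. It works line by line rather than with a TOML parser, so multi-line values that begin with "[" are not handled.

```python
BEGIN = "# >>> managed sandbox settings (hypothetical marker) >>>"
END = "# <<< managed sandbox settings (hypothetical marker) <<<"
MANAGED_KEYS = ("sandbox_mode", "approval_policy")
LEGACY_TABLE = "[sandbox_workspace_write]"

MANAGED_BLOCK = "\n".join([
    BEGIN,
    'sandbox_mode = "danger-full-access"',
    'approval_policy = "never"',
    END,
])


def upsert_sandbox_block(config_text: str) -> str:
    """Return config text with exactly one managed block at the top.

    Drops any previous managed block, stray top-level sandbox_mode /
    approval_policy keys, and the legacy workspace-write table header
    together with its network_access key.
    """
    kept = []
    scope = "top"            # "top", "legacy", or "other" (inside another table)
    in_managed = False
    for line in config_text.splitlines():
        stripped = line.strip()
        if stripped == BEGIN:
            in_managed = True
            continue
        if stripped == END:
            in_managed = False
            continue
        if in_managed:
            continue
        if stripped.startswith("["):
            if stripped == LEGACY_TABLE:
                scope = "legacy"
                continue
            scope = "other"
        key = stripped.split("=", 1)[0].strip() if "=" in stripped else ""
        if scope in ("top", "legacy") and key in MANAGED_KEYS:
            continue
        if scope == "legacy" and key == "network_access":
            continue
        kept.append(line)
    body = "\n".join(kept).strip("\n")
    if body:
        return MANAGED_BLOCK + "\n\n" + body + "\n"
    return MANAGED_BLOCK + "\n"
```

Running the function on its own output returns the same text, so repeated container starts do not stack duplicate or conflicting sandbox_mode lines. Keeping the managed keys at the very top also keeps them top-level in TOML, since they appear before any table header.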

Next after merge: rebuild codex image -> re-provision CR2/Researcher to bake the fix (hotpatch is lost on re-provision today).


Reference: molecule-ai/molecule-core#2128