test(provisioner): fast local-Docker parity test for the token-injection ownership bug class
Some checks failed
Block internal-flavored paths / Block forbidden paths (pull_request) Failing after 0s
lint-required-no-paths / lint-required-no-paths (pull_request) Failing after 0s
Secret scan / Scan diff for credential-shaped strings (pull_request) Failing after 0s
qa-review / approved (pull_request) Failing after 0s
security-review / approved (pull_request) Failing after 1s
sop-checklist / all-items-acked (pull_request) Successful in 15s
gate-check-v3 / gate-check (pull_request) Waiting to run
sop-tier-check / tier-check (pull_request) Waiting to run
lint-mask-pr-atomicity / lint-mask-pr-atomicity (pull_request) Successful in 2m1s
The Hermes fleet-wide list_peers 401 (#1877/#418) came from
WriteAuthTokenToVolume + WriteFilesToContainer delivering the /configs
token files as root:root AFTER the entrypoint's chown -R agent /configs,
so the a2a_mcp_server running as AgentUID got EACCES → empty bearer → 401.
Those are Docker API ops, NOT AWS: they were only "prod-only" because the
local stack didn't drive the same post-start re-injection sequence,
NOT because they need EC2.
This test invokes the REAL WriteAuthTokenToVolume + WriteFilesToContainer
against the LOCAL Docker daemon and asserts AgentUID can re-write
/configs/.auth_token + .platform_inbound_secret (the save_token
O_WRONLY|O_TRUNC recovery path that actually 401'd Hermes — a read
probe stays green on root:root because the file is world-readable, so
that would have been a vacuous proxy assertion).
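The read-versus-write distinction above can be illustrated with a small model of the classic POSIX owner/group/other check. This is a simplified sketch, not the kernel's logic and not code from this PR: `allowed`, `wantRead`, and `wantWrite` are hypothetical names, the uid/gid 1000 mirrors AgentUID for illustration only, and ACLs, capabilities, and supplementary groups are ignored.

```go
package main

import "fmt"

// Requested-access bits within a single rwx triplet.
const (
	wantRead  = 4
	wantWrite = 2
)

// allowed models the owner/group/other permission check for a non-root
// caller: exactly one rwx triplet is selected based on who is asking,
// and the requested bit is tested against it. Hypothetical helper.
func allowed(fileUID, fileGID int, mode uint32, uid, gid int, want uint32) bool {
	var triplet uint32
	switch {
	case uid == fileUID:
		triplet = (mode >> 6) & 7 // owner bits
	case gid == fileGID:
		triplet = (mode >> 3) & 7 // group bits
	default:
		triplet = mode & 7 // other bits
	}
	return triplet&want == want
}

func main() {
	const agent = 1000 // illustrative, mirrors AgentUID

	// root:root 0644, the pre-fix delivery shape.
	fmt.Println("pre-fix read: ", allowed(0, 0, 0o644, agent, agent, wantRead))
	fmt.Println("pre-fix write:", allowed(0, 0, 0o644, agent, agent, wantWrite))

	// agent:agent 0644, the post-fix delivery shape.
	fmt.Println("post-fix write:", allowed(agent, agent, 0o644, agent, agent, wantWrite))
}
```

Under this model a read probe cannot distinguish the two ownership states at mode 0644, while a write probe can, which is why the test asserts writability.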
Demonstrated both directions against the two code states:
- pre-fix (pristine staging): headline test FAILS in ~0.9s — would
  have caught Hermes locally instead of an ~1h EC2 round-trip.
- post-fix (this PR's base, the agent-owned-injection fix): PASSES
  in ~0.87s.
TestTokenOwnership_FailPre_ProvesCatch pins the pre-fix root:root
delivery shape independently so the catch stays demonstrable on this
fix-based branch (the assertion is load-bearing, not vacuously green).
TestTokenOwnership_DockerIsLocalNotAWS statically guards that the
provisioner has no AWS SDK dep — the reason this bug class is locally
reproducible at all.
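The static guard amounts to scanning a dependency listing for a forbidden import-path fragment. A minimal sketch of that scan follows, assuming one import path per line as `go list -deps` prints them. `forbiddenDeps` is a hypothetical name and the sample listing is made up for illustration; the real test shells out to `go list` instead.

```go
package main

import (
	"fmt"
	"strings"
)

// forbiddenDeps returns every non-empty line of a one-path-per-line
// dependency listing that contains needle. Hypothetical helper.
func forbiddenDeps(listOutput, needle string) []string {
	var hits []string
	for _, line := range strings.Split(listOutput, "\n") {
		line = strings.TrimSpace(line)
		if line != "" && strings.Contains(line, needle) {
			hits = append(hits, line)
		}
	}
	return hits
}

func main() {
	// Made-up sample listing, for illustration only.
	sample := "fmt\ngithub.com/docker/docker/client\ngithub.com/aws/aws-sdk-go-v2/service/ec2\n"
	for _, dep := range forbiddenDeps(sample, "aws-sdk-go") {
		fmt.Println("forbidden dependency:", dep)
	}
}
```

A substring match is deliberately coarse: it flags any path containing the fragment, so a new AWS SDK major version would also be caught.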
Wired into the mandatory local-E2E gate via `make test-local-e2e`
(feedback_mandatory_local_e2e_before_ship); self-skips when no Docker
daemon is reachable so `make test`/CI stays green on Docker-less
runners. Local fast counterpart to the staging-required gate.
Stacked on fix/workspace-token-injection-agent-owned (PR #1327) so it
lands green; references the exported provisioner.AgentUID contract
rather than a duplicated literal.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
parent f986444dbd
commit 4d3c326fd9
16
Makefile
@@ -4,7 +4,7 @@
 # use this Makefile; CI calls docker compose / go test directly so the
 # Makefile can evolve without breaking the build.
 
-.PHONY: help dev up down logs build test
+.PHONY: help dev up down logs build test test-local-e2e
 
 help: ## Show this help.
 	@grep -E '^[a-zA-Z_-]+:.*?## ' $(MAKEFILE_LIST) | awk 'BEGIN {FS = ":.*?## "}; {printf "\033[36m%-12s\033[0m %s\n", $$1, $$2}'
@@ -26,3 +26,17 @@ build: ## Force a fresh build of the platform image (no cache).
 
 test: ## Run Go unit tests in workspace-server/.
 	cd workspace-server && go test -race ./...
+
+# Mandatory local-E2E gate (feedback_mandatory_local_e2e_before_ship,
+# feedback_local_must_mimic_production). The provisioner-parity tests
+# invoke the REAL WriteAuthTokenToVolume + WriteFilesToContainer against
+# the LOCAL Docker daemon and assert /configs token files are
+# AgentUID-owned — the formerly-prod-only token-injection ownership bug
+# class (Hermes list_peers 401, #1877/#418), now caught in SECONDS
+# locally instead of an ~1h EC2 round-trip. These self-skip when no
+# Docker daemon is reachable (so `make test`/CI stays green on
+# Docker-less runners); this target requires a daemon and is the local
+# fast counterpart to the staging-required gate. Run before pushing any
+# workspace-server provisioner change.
+test-local-e2e: ## Run Docker-gated local-E2E parity tests (requires a local Docker daemon).
+	cd workspace-server && go test -run 'TestTokenOwnership' -v ./internal/provisioner/
@@ -0,0 +1,456 @@
+package provisioner
+
+// Fast local provisioner-parity test — the proof-of-pattern that the
+// formerly "prod-only / slow" token-injection ownership bug class is
+// reproducible against a LOCAL Docker daemon in SECONDS, not an
+// ~hour-round-trip through EC2.
+//
+// WHY THIS EXISTS
+//
+// The fleet-wide list_peers 401 incident (Hermes et al; #1877 / #418)
+// came from THIS package's WriteAuthTokenToVolume +
+// WriteFilesToContainer running their Docker API operations as root and
+// never chowning the delivered files to the in-container agent uid
+// (AgentUID, 1000). The agent runs as AgentUID (every workspace
+// template `useradd -u 1000 agent`; workspace/entrypoint.sh `gosu
+// agent`), so the platform injecting /configs/.auth_token as root:root
+// AFTER the entrypoint's `chown -R agent /configs` means the agent-uid
+// a2a_mcp_server gets EACCES → empty bearer → platform 401 on
+// /registry/{id}/peers (the literal list_peers path).
+//
+// Those are Docker API operations, NOT AWS operations. They run
+// IDENTICALLY against a local Docker daemon. They were only "prod-only"
+// because the handler forks local-vs-EC2 at
+// handlers/workspace_dispatchers.go (cpProv-vs-provisioner) and the
+// local stack historically didn't drive the same post-start
+// re-injection sequence — NOT because the bug needs AWS. This test
+// invokes the REAL functions against REAL local Docker and asserts the
+// REAL file ownership semantics the agent depends on. NO mock — a mock
+// here would bypass the exact ownership semantics that caused the
+// incident (the proxy-trap).
+//
+// EXPECTED BEHAVIOUR ACROSS CODE STATES (both proven in review):
+//
+//   - pre-fix code (WriteAuthTokenToVolume: `printf > .auth_token &&
+//     chmod 0600`, no chown; WriteFilesToContainer: tar.Header with
+//     Uid/Gid unset → 0/root): the post-start re-injection lands
+//     root:root → the AgentUID write probe FAILS. Measured: the
+//     headline test fails in ~0.9s. This test would have caught the
+//     Hermes incident LOCALLY, in seconds, instead of an ~1h prod
+//     round-trip.
+//   - post-fix code (this PR's base — the agent-owned-injection fix:
+//     chown 1000:1000 in the volume writer + tar Uid/Gid=AgentUID):
+//     files land AgentUID-owned → the probe PASSES. Measured: ~0.87s.
+//
+// The probe asserts WRITABILITY by AgentUID, not mere readability: the
+// token file is delivered world-readable (mode 0644 via the tar
+// header), so a naive `cat` probe stays green even on root:root and the
+// assertion would be vacuously useless. The real failure is on the
+// RECOVERY path — heartbeat 401 → platform issues a fresh token →
+// platform_auth.save_token does os.open(O_WRONLY|O_CREAT|O_TRUNC,
+// 0o600) to persist it — and a root:root file denies the AgentUID
+// write, so the agent can never persist a rotated token and 401s
+// forever. TestTokenOwnership_FailPre_ProvesCatch pins the pre-fix
+// root:root delivery shape independently of the production source and
+// asserts the SAME probe catches it, so the catch stays demonstrable
+// even on this fix-based branch where the headline test is green.
+//
+// RUNTIME: seconds. One-time alpine pull (cached), throwaway
+// containers, no EC2, no provision wait.
+
+import (
+	"archive/tar"
+	"bytes"
+	"context"
+	"fmt"
+	"io"
+	"os/exec"
+	"strings"
+	"testing"
+	"time"
+
+	"github.com/docker/docker/api/types/container"
+	"github.com/docker/docker/api/types/volume"
+)
+
+// agentUID is the in-container uid the workspace agent runs as. Bound to
+// the production contract via the package's exported AgentUID constant
+// (the agent-owned-injection fix defines it; every template
+// `useradd -u 1000 agent` and entrypoint.sh `gosu agent` honour it). We
+// reference AgentUID rather than a private literal so this test and the
+// production fix share ONE source of truth — if the runtime ever moves
+// the agent off 1000, AgentUID and this test move together.
+const agentUID = AgentUID
+
+// buildRootOwnedTar builds a tar stream with each entry's Mode set but
+// Uid/Gid LEFT UNSET (Go tar.Header zero value → 0 → root). This is a
+// faithful reproduction of the PRE-fix WriteFilesToContainer header
+// (`&tar.Header{Name, Mode, Size}` — no Uid/Gid). Used by the FailPre
+// control to pin the OLD delivery shape independently of the production
+// source, so the catch stays demonstrable after the fix lands.
+func buildRootOwnedTar(t *testing.T, files map[string][]byte) io.Reader {
+	t.Helper()
+	var buf bytes.Buffer
+	tw := tar.NewWriter(&buf)
+	for name, data := range files {
+		if err := tw.WriteHeader(&tar.Header{
+			Name: name,
+			Mode: 0644,
+			Size: int64(len(data)),
+			// Uid/Gid intentionally unset == 0 == root (the bug).
+		}); err != nil {
+			t.Fatalf("tar header: %v", err)
+		}
+		if _, err := tw.Write(data); err != nil {
+			t.Fatalf("tar write: %v", err)
+		}
+	}
+	if err := tw.Close(); err != nil {
+		t.Fatalf("tar close: %v", err)
+	}
+	return &buf
+}
+
+// dockerOrSkip returns a Provisioner bound to the local Docker daemon,
+// or skips if no daemon is reachable. This is a real-Docker integration
+// test by design (per feedback_real_subprocess_test_for_boot_path:
+// in-process / mock misses the ownership bug entirely). Skipping on
+// no-daemon keeps `go test ./...` green on Docker-less machines while
+// the local-E2E lane (which always has a daemon) still gates on it.
+func dockerOrSkip(t *testing.T) *Provisioner {
+	t.Helper()
+	if testing.Short() {
+		t.Skip("skipping real-Docker provisioner-parity test in -short mode")
+	}
+	p, err := New()
+	if err != nil {
+		t.Skipf("no local Docker daemon (New: %v) — this lane requires one", err)
+	}
+	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
+	defer cancel()
+	if _, err := p.cli.Ping(ctx); err != nil {
+		t.Skipf("local Docker daemon unreachable (Ping: %v) — this lane requires one", err)
+	}
+	return p
+}
+
+// ensureAlpine pulls alpine once if absent. WriteAuthTokenToVolume
+// itself uses the `alpine` image, so the daemon needs it regardless;
+// pulling here makes the one-time cost explicit and the timed
+// assertions honest (pull is one-time, not per-run).
+func ensureAlpine(t *testing.T, p *Provisioner) {
+	t.Helper()
+	ctx, cancel := context.WithTimeout(context.Background(), 90*time.Second)
+	defer cancel()
+	if _, err := p.cli.ImageInspect(ctx, "alpine"); err == nil {
+		return
+	}
+	if err := pullImageAndDrain(ctx, p.cli, "alpine", ""); err != nil {
+		t.Fatalf("failed to pull alpine (needed by WriteAuthTokenToVolume too): %v", err)
+	}
+}
+
+// uniqueWorkspaceID returns a short, test-scoped workspace id. Volume +
+// container names derive from the first 12 chars (ConfigVolumeName /
+// ContainerName truncate), so the entropy must live in the prefix.
+func uniqueWorkspaceID() string {
+	return fmt.Sprintf("t%011d", time.Now().UnixNano()%1e11)
+}
+
+// cleanupWorkspace force-removes the container + config volume the test
+// created. Best-effort: honours feedback_cleanup_after_each_test —
+// leave the daemon as we found it.
+func cleanupWorkspace(p *Provisioner, workspaceID string) {
+	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
+	defer cancel()
+	_ = p.cli.ContainerRemove(ctx, ContainerName(workspaceID),
+		container.RemoveOptions{Force: true})
+	_ = p.cli.VolumeRemove(ctx, ConfigVolumeName(workspaceID), true)
+}
+
+// rewritableByAgentUID returns nil iff uid agentUID can re-open `path`
+// inside `containerID` for WRITE+TRUNCATE — the EXACT operation the
+// runtime's platform_auth.save_token performs:
+//
+//	os.open(path, O_WRONLY|O_CREAT|O_TRUNC, 0o600)
+//
+// This is the operation that 401'd Hermes. Modelled with `: > path`
+// (shell O_WRONLY|O_CREAT|O_TRUNC), the minimal faithful equivalent. A
+// non-nil error is the Hermes failure: a rotated token cannot be
+// persisted over a root:root file → 401 forever.
+func rewritableByAgentUID(ctx context.Context, p *Provisioner, containerID, path string) error {
+	execCfg := container.ExecOptions{
+		User:         fmt.Sprintf("%d", agentUID),
+		Cmd:          []string{"sh", "-c", ": > " + path},
+		AttachStdout: true,
+		AttachStderr: true,
+	}
+	ex, err := p.cli.ContainerExecCreate(ctx, containerID, execCfg)
+	if err != nil {
+		return fmt.Errorf("exec create: %w", err)
+	}
+	att, err := p.cli.ContainerExecAttach(ctx, ex.ID, container.ExecAttachOptions{})
+	if err != nil {
+		return fmt.Errorf("exec attach: %w", err)
+	}
+	// Drain the multiplexed stream to completion. This BLOCKS until the
+	// exec process exits — without it, ContainerExecInspect races the
+	// process and reports ExitCode 0 / Running=true, silently inverting
+	// the assertion (the proxy-trap that almost shipped this test green
+	// against the real bug — caught and fixed in review).
+	_, _ = io.Copy(io.Discard, att.Reader)
+	att.Close()
+	var insp container.ExecInspect
+	deadline := time.Now().Add(10 * time.Second)
+	for {
+		insp, err = p.cli.ContainerExecInspect(ctx, ex.ID)
+		if err != nil {
+			return fmt.Errorf("exec inspect: %w", err)
+		}
+		if !insp.Running {
+			break
+		}
+		if time.Now().After(deadline) {
+			return fmt.Errorf("exec %s still running after 10s", ex.ID)
+		}
+		time.Sleep(50 * time.Millisecond)
+	}
+	if insp.ExitCode != 0 {
+		return fmt.Errorf("uid %d cannot O_WRONLY|O_TRUNC %s (exit %d) — "+
+			"this is the save_token write-denial that 401'd Hermes: the "+
+			"agent receives a rotated token but cannot persist it over a "+
+			"root:root file, so it 401s forever",
+			agentUID, path, insp.ExitCode)
+	}
+	return nil
+}
+
+// startConfigsContainer creates and starts a long-lived container that
+// mounts the workspace's REAL config volume at /configs and models the
+// production ownership lifecycle: runs as root, `chown -R agent
+// /configs` (mirrors workspace/entrypoint.sh), then idles. The agent is
+// created at uid agentUID exactly like the workspace templates. This is
+// the container into which WriteFilesToContainer re-injects (#418)
+// post-start — the precise moment the prod bug manifests.
+func startConfigsContainer(t *testing.T, p *Provisioner, workspaceID string) string {
+	t.Helper()
+	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
+	defer cancel()
+
+	volName := ConfigVolumeName(workspaceID)
+	entry := fmt.Sprintf(
+		"adduser -D -u %d agent 2>/dev/null || true; "+
+			"mkdir -p /configs; chown -R agent /configs; "+
+			"exec sleep 600", agentUID)
+
+	resp, err := p.cli.ContainerCreate(ctx, &container.Config{
+		Image: "alpine",
+		Cmd:   []string{"sh", "-c", entry},
+		User:  "0", // entrypoint starts as root, like the runtime
+	}, &container.HostConfig{
+		Binds: []string{volName + ":/configs"},
+	}, nil, nil, ContainerName(workspaceID))
+	if err != nil {
+		t.Fatalf("create configs container: %v", err)
+	}
+	if err := p.cli.ContainerStart(ctx, resp.ID, container.StartOptions{}); err != nil {
+		t.Fatalf("start configs container: %v", err)
+	}
+	if err := waitForChownedConfigs(ctx, p, resp.ID); err != nil {
+		t.Fatalf("configs container never became ready: %v", err)
+	}
+	return resp.ID
+}
+
+// waitForChownedConfigs blocks until /configs is present AND owned by
+// the agent uid — i.e. the entrypoint's `chown -R agent /configs` has
+// completed. Polling on ownership (not mere existence) is the correct
+// readiness signal: the bug is the post-start re-injection landing
+// root:root *after* this chown, so the test must not re-inject until
+// the chown is definitively done. Each probe drains the exec stream to
+// completion before trusting ExitCode (same race fix as
+// rewritableByAgentUID).
+func waitForChownedConfigs(ctx context.Context, p *Provisioner, containerID string) error {
+	deadline := time.Now().Add(20 * time.Second)
+	for time.Now().Before(deadline) {
+		ex, err := p.cli.ContainerExecCreate(ctx, containerID, container.ExecOptions{
+			Cmd: []string{"sh", "-c",
+				fmt.Sprintf("[ \"$(stat -c %%u /configs)\" = \"%d\" ]", agentUID)},
+			AttachStdout: true,
+			AttachStderr: true,
+		})
+		if err == nil {
+			if att, aerr := p.cli.ContainerExecAttach(ctx, ex.ID,
+				container.ExecAttachOptions{}); aerr == nil {
+				_, _ = io.Copy(io.Discard, att.Reader)
+				att.Close()
+				if insp, ierr := p.cli.ContainerExecInspect(ctx, ex.ID); ierr == nil &&
+					!insp.Running && insp.ExitCode == 0 {
+					return nil
+				}
+			}
+		}
+		time.Sleep(150 * time.Millisecond)
+	}
+	return fmt.Errorf("/configs not chowned to uid %d after wait", agentUID)
+}
+
+// TestTokenOwnership_LocalProvisionerParity is the headline test. It
+// drives the EXACT production sequence against local Docker:
+//
+//  1. WriteAuthTokenToVolume (#1877): pre-start write of .auth_token
+//     into the named config volume via the real throwaway alpine writer.
+//  2. start a container modelling the runtime (root → chown /configs →
+//     would gosu agent), mounting that same real volume.
+//  3. WriteFilesToContainer (#418): post-start re-injection of
+//     .auth_token + .platform_inbound_secret into the running container
+//     via the real tar/CopyToContainer path.
+//  4. assert uid agentUID can RE-WRITE both files (the save_token
+//     recovery path) — the exact property the Hermes 401 violated.
+//
+// FAILS in ~0.9s on pre-fix code (files root:root). PASSES on this PR's
+// fix-based branch (files agent-owned).
+func TestTokenOwnership_LocalProvisionerParity(t *testing.T) {
+	p := dockerOrSkip(t)
+	ensureAlpine(t, p)
+
+	workspaceID := uniqueWorkspaceID()
+	t.Cleanup(func() { cleanupWorkspace(p, workspaceID) })
+
+	start := time.Now()
+
+	// --- Step 1: REAL WriteAuthTokenToVolume (#1877, pre-start) ------
+	const tokenPlain = "wsauth-fasttest-tok-abc123"
+	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
+	defer cancel()
+	if err := p.WriteAuthTokenToVolume(ctx, workspaceID, tokenPlain); err != nil {
+		t.Fatalf("WriteAuthTokenToVolume (real #1877 path) failed: %v", err)
+	}
+
+	// --- Step 2: start the runtime-modelling container --------------
+	containerID := startConfigsContainer(t, p, workspaceID)
+
+	// --- Step 3: REAL WriteFilesToContainer (#418 post-start) -------
+	files := map[string][]byte{
+		".auth_token":              []byte(tokenPlain),
+		".platform_inbound_secret": []byte("inbound-secret-xyz789"),
+	}
+	if err := p.WriteFilesToContainer(ctx, containerID, files); err != nil {
+		t.Fatalf("WriteFilesToContainer (real #418 path) failed: %v", err)
+	}
+
+	elapsed := time.Since(start)
+
+	// --- Step 4: assert the property Hermes violated ----------------
+	for _, f := range []string{
+		"/configs/.auth_token",
+		"/configs/.platform_inbound_secret",
+	} {
+		if err := rewritableByAgentUID(ctx, p, containerID, f); err != nil {
+			t.Errorf("PROVISIONER-PARITY BUG (Hermes class): %v\n"+
+				" This is the formerly-prod-only token-injection "+
+				"ownership bug, caught LOCALLY in %s. The real "+
+				"WriteFilesToContainer (#418 post-start re-inject) "+
+				"delivered %s owned root:root; the workspace agent "+
+				"runs as uid %d and its save_token O_WRONLY|O_TRUNC "+
+				"is denied → it cannot persist a rotated token → "+
+				"heartbeat / list_peers 401 forever. Fix: inject "+
+				"delivered files owned by AgentUID (this PR's base). "+
+				"See incident notes #1877 / #418.",
+				err, elapsed, f, agentUID)
+		}
+	}
+
+	t.Logf("provisioner-parity sequence (real WriteAuthTokenToVolume + "+
+		"WriteFilesToContainer vs local Docker) completed in %s — "+
+		"the SECONDS-local replacement for the ~1h EC2 round-trip "+
+		"that previously gated this bug class", elapsed)
+
+	// Hard guard on the headline claim: this MUST be seconds. 90s
+	// ceiling absorbs a cold alpine pull on a slow link; steady-state
+	// (image cached) is sub-2s.
+	if elapsed > 90*time.Second {
+		t.Errorf("provisioner-parity test took %s — claimed to be "+
+			"SECONDS-local; investigate (image pull? daemon load?)",
+			elapsed)
+	}
+}
+
+// TestTokenOwnership_FailPre_ProvesCatch pins the PRE-fix delivery shape
+// (post-start re-injection via a tar with Uid/Gid unset → root) and
+// asserts the SAME AgentUID write probe the headline test uses FAILS on
+// it. This proves the assertion genuinely catches the bug — it is not
+// vacuously green — and keeps the fail-direction demonstrable on this
+// fix-based branch where the headline test is green. Per
+// feedback_assert_exact_not_substring: verify the tightened assertion
+// FAILS on old behaviour.
+func TestTokenOwnership_FailPre_ProvesCatch(t *testing.T) {
+	p := dockerOrSkip(t)
+	ensureAlpine(t, p)
+
+	workspaceID := uniqueWorkspaceID()
+	t.Cleanup(func() { cleanupWorkspace(p, workspaceID) })
+	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
+	defer cancel()
+
+	if _, err := p.cli.VolumeCreate(ctx, volume.CreateOptions{
+		Name: ConfigVolumeName(workspaceID), Labels: managedLabels(),
+	}); err != nil {
+		t.Fatalf("volume create: %v", err)
+	}
+
+	// Start a runtime-model container (root → chown -R agent /configs →
+	// idle) so the entrypoint chown has ALREADY run, exactly as in
+	// production. The bug is the POST-start re-injection landing
+	// root:root *after* that chown — so pinning the pre-fix
+	// WriteFilesToContainer tar shape and copying it in post-start is
+	// the faithful pre-fix reproduction.
+	containerID := startConfigsContainer(t, p, workspaceID)
+
+	preFixTar := buildRootOwnedTar(t, map[string][]byte{
+		".auth_token": []byte("pre-fix-token"),
+	})
+	if err := p.cli.CopyToContainer(ctx, containerID, "/configs",
+		preFixTar, container.CopyToContainerOptions{}); err != nil {
+		t.Fatalf("pre-fix CopyToContainer: %v", err)
+	}
+
+	probeErr := rewritableByAgentUID(ctx, p, containerID, "/configs/.auth_token")
+	if probeErr == nil {
+		t.Fatalf("REGRESSION IN THE TEST ITSELF: the uid-%d write "+
+			"probe PASSED against the known-pre-fix root:root "+
+			"post-start re-injection. The headline assertion would "+
+			"be vacuously green and would NOT have caught Hermes. "+
+			"Investigate rewritableByAgentUID before trusting the "+
+			"parity test.", agentUID)
+	}
+	t.Logf("confirmed: pre-fix root:root post-start re-injection is "+
+		"correctly caught by the uid-%d write probe (%v) — the "+
+		"parity assertion is load-bearing, not vacuous", agentUID, probeErr)
+}
+
+// TestTokenOwnership_DockerIsLocalNotAWS asserts the core insight:
+// WriteAuthTokenToVolume + WriteFilesToContainer reach ONLY the Docker
+// client — no AWS SDK, no EC2/EBS/SG calls. That is why the bug class
+// is locally reproducible at all. Static guard: the provisioner package
+// must not import the AWS SDK (belt-and-braces with architecture_test.go
+// and the driver-seam RFC, internal #184).
+func TestTokenOwnership_DockerIsLocalNotAWS(t *testing.T) {
+	t.Parallel()
+	out, err := exec.Command("go", "list", "-deps",
+		"github.com/Molecule-AI/molecule-monorepo/platform/internal/provisioner").
+		Output()
+	if err != nil {
+		t.Skipf("go list unavailable in this env: %v", err)
+	}
+	for _, line := range strings.Split(string(out), "\n") {
+		if strings.Contains(line, "aws-sdk-go") {
+			t.Fatalf("provisioner pulls in the AWS SDK (%s) — the "+
+				"token-injection paths are supposed to be "+
+				"driver-agnostic Docker-only ops; an AWS dep here "+
+				"would undermine the local-testability this test "+
+				"proves. See the driver-seam RFC (internal #184).",
+				strings.TrimSpace(line))
+		}
+	}
+}