molecule-core/scripts/ops/audit-railway-sha-pins.sh
Hongming Wang 026f5e51d9 ops: add Railway SHA-pin drift audit script + regression test (#2001)
#2000 fixed one symptom — TENANT_IMAGE pinned to `staging-a14cf86`
(10 days stale) silently no-op'd four upstream fixes on 2026-04-24.
This adds the audit pattern as a re-runnable script so the broader
class is observable on demand without new CI infrastructure.

Audit results today (2026-04-27):
  controlplane / production: 54 vars audited, 0 drift-prone pins
  controlplane / staging:    52 vars audited, 0 drift-prone pins

So the immediate audit deliverable is clean — TENANT_IMAGE is the only
known violation and #2000 already fixed it. The script makes the
ongoing audit a 5-second command instead of a manual one.

Detection regex catches:
  * branch-SHA suffixes (`(staging|main|prod|production)-<6+ hex>`),
    the exact 2026-04-24 incident shape
  * version pins after `:` or `=`, with an optional `v`
    (`:v1.2.3`, `=v0.1.16`), the same drift class rendered differently

Anchoring on `:` or `=` keeps prose like "version 1.2.3 of the api"
out of the false-positive set. UUIDs, ARNs, AMI IDs, secrets, and
floating tags (`:staging-latest`, `:main`) pass through untouched.
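
The behaviour described above can be checked by hand with a small sketch like the one below. It copies the regex from the script and applies it directly to sample `KEY=VALUE` lines. The helper names `is_drift_prone` and `report`, the `registry.example` host, and the sample variables are assumptions made for illustration; they do not come from the script or from Railway.

```shell
#!/usr/bin/env bash
# Same pattern as DRIFT_REGEX in audit-railway-sha-pins.sh.
DRIFT_REGEX='(staging|main|prod|production)-[a-f0-9]{6,}|[:=]v?[0-9]+\.[0-9]+\.[0-9]+([^a-z0-9]|$)'

# Hypothetical helper: succeeds (exit 0) when the line contains a
# drift-prone pin, fails (exit 1) otherwise.
is_drift_prone() {
  printf '%s\n' "$1" | grep -qE "$DRIFT_REGEX"
}

# Hypothetical helper: print the verdict next to the input line.
report() {
  if is_drift_prone "$1"; then
    echo "FLAG  $1"
  else
    echo "pass  $1"
  fi
}

# Illustrative inputs only; these are not real Railway variables.
report 'TENANT_IMAGE=registry.example/tenant:staging-a14cf86'
report 'AGENT_IMAGE=registry.example/agent:v1.2.3'
report 'TENANT_IMAGE=registry.example/tenant:staging-latest'
report 'RELEASE_NOTE=version 1.2.3 of the api'
```

The first two lines are flagged (a branch-SHA pin and a semver pin after `:`); the last two pass, because `latest` is not a hex run and the prose version number has no `:` or `=` directly in front of it.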

Regression test (tests/ops/test_audit_railway_sha_pins.sh) pins 20
representative cases — 9 should-flag (covering all four branch
prefixes + semver variants + middle-of-value matches) and 11
should-pass (the false-positive guards).  Same regex inlined in both
files so a future tweak that weakens detection fails the test in
lockstep with weakening the audit.
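
A table-driven test of this shape could look like the sketch below. It is not the content of tests/ops/test_audit_railway_sha_pins.sh, which is not reproduced here; the `check` helper and every sample value are hypothetical, and only the regex is taken from the audit script.

```shell
#!/usr/bin/env bash
# Regex inlined from audit-railway-sha-pins.sh so the two stay in lockstep.
DRIFT_REGEX='(staging|main|prod|production)-[a-f0-9]{6,}|[:=]v?[0-9]+\.[0-9]+\.[0-9]+([^a-z0-9]|$)'

failures=0

# check EXPECT VALUE (hypothetical helper): EXPECT is "flag" or "pass".
check() {
  check_got="pass"
  if printf '%s\n' "$2" | grep -qE "$DRIFT_REGEX"; then
    check_got="flag"
  fi
  if [ "$check_got" != "$1" ]; then
    echo "FAIL: expected $1, got $check_got: $2"
    failures=$((failures + 1))
  fi
}

# Should-flag: one case per branch prefix, plus semver variants.
check flag 'img:staging-a14cf86'
check flag 'img:main-0123abcd'
check flag 'img:prod-deadbeef'
check flag 'img:production-cafe1234'
check flag 'img:v1.2.3'
check flag 'VERSION=v0.1.16'

# Should-pass: the false-positive guards.
check pass 'img:staging-latest'
check pass 'img:main'
check pass 'version 1.2.3 of the api'
check pass '3f2b8c1e-9a4d-4e7b-8c2a-1d5e6f7a8b9c'
check pass 'arn:aws:iam::123456789012:role/example'

echo "failures: $failures"
# The last command's status becomes the script's exit status.
[ "$failures" -eq 0 ]
```

Keeping the cases as plain `check` calls means a weakened regex shows up as a named failing input rather than a bare non-zero exit.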

Both files shellcheck clean.

CI gate (acceptance criterion's "regression: add a CI check") is
deliberately scoped out — querying Railway from CI requires plumbing
RAILWAY_TOKEN as a repo secret, which is multi-step setup. The
re-runnable script + test cover the same surface today; the CI
workflow is a small follow-up once the token is provisioned.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 05:01:23 -07:00

#!/usr/bin/env bash
# Audit Railway env vars for drift-prone image-tag pins.
#
# Background (#2001): on 2026-04-24 a stale `:staging-a14cf86` SHA pin
# in CP's TENANT_IMAGE caused 3+ hours of E2E failure with the
# appearance that "every fix didn't propagate" — really the tenant
# image was so old it didn't read the env vars those fixes produced.
# This script flags anywhere we've re-introduced that pattern.
#
# Pattern matched: any env-var value containing `<branch>-<hex>` (e.g.
# `staging-a14cf86`, including embedded refs like
# `repo/img:staging-abc1234`) or a semver pin preceded by `:` or `=`
# (e.g. `:v1.2.3`). Floating tags (`:staging-latest`, `:main`,
# `:latest`) and other values pass through untouched.
#
# Usage:
#   bash scripts/ops/audit-railway-sha-pins.sh              # both envs
#   bash scripts/ops/audit-railway-sha-pins.sh production   # one env
#   bash scripts/ops/audit-railway-sha-pins.sh staging
#
# Exit codes:
#   0: no drift-prone pins
#   1: drift detected, list printed
#   2: bad argument, or railway CLI unauthenticated / project unlinked
#
# Pre-req: run from a directory linked to a Railway project
# (e.g. molecule-controlplane). The script does not chdir for you
# because the linked project's identity matters.
set -euo pipefail
ENV_FILTER="${1:-}"
ENVS=()
case "$ENV_FILTER" in
  "") ENVS=(production staging) ;;
  production|staging) ENVS=("$ENV_FILTER") ;;
  *) echo "usage: $0 [production|staging]" >&2; exit 2 ;;
esac
# All services in the linked Railway project. Discovery isn't worth
# the complexity — list them explicitly and add new services here.
SERVICES=(controlplane)
# Drift-prone patterns, same class as the 2026-04-24 TENANT_IMAGE
# incident. A single regex, matched against full env-var lines
# (KEY=VALUE).
#
# branch-SHA (e.g. `staging-a14cf86`):
#   a branch-name prefix (staging|main|prod|production), a dash, then
#   6+ lowercase hex chars. Requiring the prefix keeps UUIDs, ARNs,
#   and AMI IDs from tripping the audit even though they contain
#   hex-looking runs.
#
# semver pin (`:v1.2.3`, `=v0.1.16`; the `v` is optional):
#   requires `:` or `=` immediately before, so prose like
#   "version 1.2.3 of the api" is NOT flagged. The trailing
#   `([^a-z0-9]|$)` requires the version to be followed by something
#   other than a lowercase letter or digit, or by end of line, so a
#   value like `:1.2.3abc` is not treated as a pin.
DRIFT_REGEX='(staging|main|prod|production)-[a-f0-9]{6,}|[:=]v?[0-9]+\.[0-9]+\.[0-9]+([^a-z0-9]|$)'
drift_count=0
for env in "${ENVS[@]}"; do
  for svc in "${SERVICES[@]}"; do
    echo "─── env=$env service=$svc ───"
    if ! out=$(railway variables --service "$svc" --environment "$env" --kv 2>&1); then
      # Detect "not authenticated" / "no linked project" vs "service not found"
      if echo "$out" | grep -qiE 'not (authenticated|logged in)|unlinked|no project'; then
        echo " ❌ railway CLI not authenticated or project not linked" >&2
        exit 2
      fi
      echo " (skipped: $out)" >&2
      continue
    fi
    # Match DRIFT_REGEX directly. Prefixing it with `=.*` would consume
    # the KEY=VALUE separator, so a pin like `FOO=v0.1.16` (where that
    # `=` is the only anchor) would be missed.
    matched=$(echo "$out" | grep -nE "$DRIFT_REGEX" || true)
    if [ -z "$matched" ]; then
      # grep -c prints 0 but exits 1 when nothing matches; `|| true`
      # keeps set -e happy without printing a second 0.
      total=$(echo "$out" | grep -c '=' || true)
      echo "$total env vars audited, no drift-prone pins"
    else
      lines=$(echo "$matched" | wc -l | tr -d ' ')
      drift_count=$((drift_count + lines))
      echo "$lines drift-prone pin(s):"
      # Truncate values past 80 chars so a tokenful one-liner doesn't
      # hide the relevant suffix off-screen.
      echo "$matched" | sed -E 's/(.{80}).+/\1.../' | sed 's/^/ /'
    fi
  done
done
if [ "$drift_count" -gt 0 ]; then
  echo
  echo "Total drift-prone pins: $drift_count"
  echo "Replace with floating tags (e.g. :staging-latest, :main) unless"
  echo "intentional and documented in the ops runbook."
  exit 1
fi
echo
echo "✓ Clean — no drift-prone image pins in any audited env."
exit 0