fix(workspaces): add missing 'awaiting_agent' + 'hibernating' to workspace_status enum
Migration 043 (2026-04-25) introduced the workspace_status enum but
omitted two values application code had been writing for days, so every
UPDATE that tried to write either value failed silently in production:
'awaiting_agent' (since 2026-04-24, commit 1e8b5e01):
- handlers/workspace.go:333 — external workspace pre-register
- handlers/registry.go (via PR #2382) — liveness offline transition
- registry/healthsweep.go (via PR #2382) — heartbeat-staleness sweep
'hibernating' (since hibernation feature shipped):
- handlers/workspace_restart.go:271 — DB-level claim before stop
All four/five sites swallowed the enum-cast error. User-visible impact:
external workspaces never transition to a stale state when their agent
disconnects (canvas shows them stuck on 'online'/'degraded' indefinitely),
new external workspaces never advance past 'provisioning', and idle
workspaces never auto-hibernate (resources held forever).
PR #2382 didn't *cause* this — it inherited the gap and added two more
silent-fail paths on top. The pre-existing two had been broken for five
days and went unnoticed because:
1. sqlmock matches SQL by regex, not against the live enum constraint.
Every test passed despite the prod-only failure.
2. The handlers either drop the Exec error entirely (workspace.go:333)
or log+continue without an alert (the other three).
Fix in three pieces:
1. migrations/046_*.up.sql — ALTER TYPE workspace_status ADD VALUE
'awaiting_agent', 'hibernating'. IF NOT EXISTS makes it idempotent
across re-runs (RunMigrations re-applies until schema_migrations
records the file). ALTER TYPE ADD VALUE doesn't take a heavy lock
and commits immediately, safe under live traffic.
2. migrations/046_*.down.sql — full rename → recreate → cast → drop
recipe. Postgres has no DROP VALUE so this is the only honest
rollback. Pre-flights existing rows to compatible values
(awaiting_agent → offline, hibernating → hibernated) before the
type swap.
3. internal/db/workspace_status_enum_drift_test.go — static gate that
parses every UPDATE/INSERT against `workspaces` in workspace-server/
internal/, extracts every status literal, and asserts each is in
the enum union (CREATE TYPE + every ALTER TYPE ADD VALUE). The gate
runs in unit tests, no DB required, and would have caught both
omissions on the day they shipped. Pattern matches
feedback_behavior_based_ast_gates and feedback_mock_at_drifting_layer.
Verification:
- go test ./internal/db/ -count=1 -race ✓
- go vet ./... ✓
- Drift gate flips red if I delete either ADD VALUE from the migration
(validated via local mutation).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent
9b061672a0
commit
c6cb82e1c0
227
workspace-server/internal/db/workspace_status_enum_drift_test.go
Normal file
227
workspace-server/internal/db/workspace_status_enum_drift_test.go
Normal file
@ -0,0 +1,227 @@
|
||||
package db_test
|
||||
|
||||
// Static drift gate: every workspaces.status literal written in the Go
|
||||
// tree must exist in the workspace_status enum defined by the migrations.
|
||||
//
|
||||
// Why this exists: the `workspace_status` enum (migrations 043 + 046)
|
||||
// shipped without 'awaiting_agent' even though application code wrote
|
||||
// that value, and every UPDATE silently failed in production for five
|
||||
// days before the gap surfaced (see 046_workspace_status_awaiting_agent.up.sql).
|
||||
// The unit tests passed because sqlmock matches SQL by regex, not against
|
||||
// a live enum constraint.
|
||||
//
|
||||
// Approach: extract every Go string literal whose body matches
|
||||
// (?i)workspaces[^a-z_].*status (so "UPDATE workspaces SET status",
|
||||
// "FROM workspaces WHERE ... status", "INSERT INTO workspaces ... status",
|
||||
// CTEs that reference workspaces, etc.). For each such SQL fragment,
|
||||
// pull the single-quoted status values out of `status =`, `status IN`,
|
||||
// `THEN`, and `ELSE`. Every value must be in the union of CREATE TYPE +
|
||||
// ALTER TYPE ADD VALUE across all migrations.
|
||||
|
||||
import (
|
||||
"os"
|
||||
"path/filepath"
|
||||
"regexp"
|
||||
"sort"
|
||||
"strings"
|
||||
"testing"
|
||||
)
|
||||
|
||||
func TestWorkspaceStatusEnum_NoLiteralDrift(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
repoRoot := findRepoRoot(t)
|
||||
migrationsDir := filepath.Join(repoRoot, "workspace-server", "migrations")
|
||||
internalDir := filepath.Join(repoRoot, "workspace-server", "internal")
|
||||
|
||||
enum := loadWorkspaceStatusEnum(t, migrationsDir)
|
||||
if len(enum) == 0 {
|
||||
t.Fatalf("could not parse workspace_status enum from %s — gate is non-functional", migrationsDir)
|
||||
}
|
||||
|
||||
literals := collectWorkspacesStatusLiterals(t, internalDir)
|
||||
if len(literals) == 0 {
|
||||
t.Fatalf("found zero workspaces.status literals under %s — gate is non-functional", internalDir)
|
||||
}
|
||||
|
||||
var rogue []string
|
||||
for lit := range literals {
|
||||
if _, ok := enum[lit]; ok {
|
||||
continue
|
||||
}
|
||||
rogue = append(rogue, lit)
|
||||
}
|
||||
if len(rogue) > 0 {
|
||||
sort.Strings(rogue)
|
||||
t.Errorf(
|
||||
"workspaces.status literal(s) %v are written by Go code but not in the workspace_status enum.\n"+
|
||||
"Add a migration `ALTER TYPE workspace_status ADD VALUE 'X';` (see 046 for shape).\n"+
|
||||
"Enum currently is: %v",
|
||||
rogue, sortedKeys(enum),
|
||||
)
|
||||
}
|
||||
}
|
||||
|
||||
// loadWorkspaceStatusEnum scans every *.up.sql file for either:
|
||||
//
|
||||
// CREATE TYPE workspace_status AS ENUM ('a', 'b', ...)
|
||||
// ALTER TYPE workspace_status ADD VALUE [IF NOT EXISTS] 'X' [BEFORE|AFTER 'Y']
|
||||
//
|
||||
// and returns the union of every value the enum will hold after all
|
||||
// migrations apply.
|
||||
func loadWorkspaceStatusEnum(t *testing.T, migrationsDir string) map[string]struct{} {
|
||||
t.Helper()
|
||||
|
||||
out := make(map[string]struct{})
|
||||
|
||||
files, err := filepath.Glob(filepath.Join(migrationsDir, "*.up.sql"))
|
||||
if err != nil {
|
||||
t.Fatalf("glob migrations: %v", err)
|
||||
}
|
||||
sort.Strings(files)
|
||||
|
||||
createRE := regexp.MustCompile(`(?is)CREATE\s+TYPE\s+workspace_status\s+AS\s+ENUM\s*\(([^)]+)\)`)
|
||||
addValueRE := regexp.MustCompile(`(?i)ALTER\s+TYPE\s+workspace_status\s+ADD\s+VALUE(?:\s+IF\s+NOT\s+EXISTS)?\s+'([^']+)'`)
|
||||
literalRE := regexp.MustCompile(`'([^']+)'`)
|
||||
|
||||
for _, f := range files {
|
||||
body, err := os.ReadFile(f)
|
||||
if err != nil {
|
||||
t.Fatalf("read %s: %v", f, err)
|
||||
}
|
||||
for _, m := range createRE.FindAllStringSubmatch(string(body), -1) {
|
||||
for _, lit := range literalRE.FindAllStringSubmatch(m[1], -1) {
|
||||
out[lit[1]] = struct{}{}
|
||||
}
|
||||
}
|
||||
for _, m := range addValueRE.FindAllStringSubmatch(string(body), -1) {
|
||||
out[m[1]] = struct{}{}
|
||||
}
|
||||
}
|
||||
return out
|
||||
}
|
||||
|
||||
// collectWorkspacesStatusLiterals walks every non-test .go file under
|
||||
// root, finds Go string literals that contain `UPDATE workspaces` or
|
||||
// `INSERT INTO workspaces`, and extracts the status literals appearing
|
||||
// inside the matching SQL statement.
|
||||
//
|
||||
// Why this scope: any UPDATE/INSERT against `workspaces` is the moment
|
||||
// a status literal hits the column constrained by the enum. Read-side
|
||||
// SQL (SELECT ... WHERE status = 'X') cannot fail on enum drift, so it's
|
||||
// out of scope. JOINs to `workspaces` from other tables (e.g. approvals
|
||||
// joining workspaces for display) write to a different table's status —
|
||||
// also out of scope. Anchoring on the leading `UPDATE workspaces` /
|
||||
// `INSERT INTO workspaces` keyword unambiguously identifies the writes
|
||||
// we care about.
|
||||
func collectWorkspacesStatusLiterals(t *testing.T, root string) map[string]struct{} {
|
||||
t.Helper()
|
||||
|
||||
// Match raw-string and double-quoted Go string literals. Backtick
|
||||
// strings can span multiple lines. Both forms are extracted via the
|
||||
// same DOTALL regex over the whole file body.
|
||||
rawRE := regexp.MustCompile("(?s)`([^`]*?)`")
|
||||
dquoteRE := regexp.MustCompile(`"((?:[^"\\]|\\.)*)"`)
|
||||
|
||||
// A SQL string is in scope if it begins (after optional leading
|
||||
// whitespace) with UPDATE workspaces or INSERT INTO workspaces.
|
||||
// `(?i)` is case-insensitive; `\s*` allows the format-friendly
|
||||
// leading newline and indent that the codebase uses.
|
||||
updateWorkspacesRE := regexp.MustCompile(`(?is)^\s*UPDATE\s+workspaces\b`)
|
||||
insertWorkspacesRE := regexp.MustCompile(`(?is)^\s*INSERT\s+INTO\s+workspaces\b`)
|
||||
|
||||
// Inside a scoped SQL fragment, status literals appear in:
|
||||
// status = 'X' — assignment in SET (or filter in WHERE)
|
||||
// status IN ('X', ...) — filter
|
||||
// status NOT IN ('X') — filter
|
||||
// THEN 'X' — CASE arm
|
||||
// ELSE 'X' — CASE default
|
||||
statusEqRE := regexp.MustCompile(`(?i)status\s*(?:=|!=|<>)\s*'([a-z_]+)'`)
|
||||
statusInRE := regexp.MustCompile(`(?i)status\s+(?:NOT\s+)?IN\s*\(([^)]*)\)`)
|
||||
thenRE := regexp.MustCompile(`(?i)THEN\s+'([a-z_]+)'`)
|
||||
elseRE := regexp.MustCompile(`(?i)ELSE\s+'([a-z_]+)'`)
|
||||
inListLiteralRE := regexp.MustCompile(`'([a-z_]+)'`)
|
||||
|
||||
out := make(map[string]struct{})
|
||||
|
||||
walkErr := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
if info.IsDir() {
|
||||
return nil
|
||||
}
|
||||
if !strings.HasSuffix(path, ".go") {
|
||||
return nil
|
||||
}
|
||||
if strings.HasSuffix(path, "_test.go") {
|
||||
return nil
|
||||
}
|
||||
body, err := os.ReadFile(path)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
text := string(body)
|
||||
|
||||
harvest := func(fragment string) {
|
||||
if !updateWorkspacesRE.MatchString(fragment) && !insertWorkspacesRE.MatchString(fragment) {
|
||||
return
|
||||
}
|
||||
for _, m := range statusEqRE.FindAllStringSubmatch(fragment, -1) {
|
||||
out[m[1]] = struct{}{}
|
||||
}
|
||||
for _, m := range statusInRE.FindAllStringSubmatch(fragment, -1) {
|
||||
for _, lit := range inListLiteralRE.FindAllStringSubmatch(m[1], -1) {
|
||||
out[lit[1]] = struct{}{}
|
||||
}
|
||||
}
|
||||
for _, m := range thenRE.FindAllStringSubmatch(fragment, -1) {
|
||||
out[m[1]] = struct{}{}
|
||||
}
|
||||
for _, m := range elseRE.FindAllStringSubmatch(fragment, -1) {
|
||||
out[m[1]] = struct{}{}
|
||||
}
|
||||
}
|
||||
|
||||
for _, m := range rawRE.FindAllStringSubmatch(text, -1) {
|
||||
harvest(m[1])
|
||||
}
|
||||
for _, m := range dquoteRE.FindAllStringSubmatch(text, -1) {
|
||||
harvest(m[1])
|
||||
}
|
||||
return nil
|
||||
})
|
||||
if walkErr != nil {
|
||||
t.Fatalf("walk %s: %v", root, walkErr)
|
||||
}
|
||||
return out
|
||||
}
|
||||
|
||||
func findRepoRoot(t *testing.T) string {
|
||||
t.Helper()
|
||||
dir, err := os.Getwd()
|
||||
if err != nil {
|
||||
t.Fatalf("getwd: %v", err)
|
||||
}
|
||||
for i := 0; i < 8; i++ {
|
||||
if _, err := os.Stat(filepath.Join(dir, "workspace-server", "migrations")); err == nil {
|
||||
return dir
|
||||
}
|
||||
parent := filepath.Dir(dir)
|
||||
if parent == dir {
|
||||
break
|
||||
}
|
||||
dir = parent
|
||||
}
|
||||
t.Fatalf("could not locate repo root with workspace-server/migrations from %s", dir)
|
||||
return ""
|
||||
}
|
||||
|
||||
func sortedKeys(m map[string]struct{}) []string {
|
||||
out := make([]string, 0, len(m))
|
||||
for k := range m {
|
||||
out = append(out, k)
|
||||
}
|
||||
sort.Strings(out)
|
||||
return out
|
||||
}
|
||||
@ -0,0 +1,53 @@
|
||||
-- 046_workspace_status_awaiting_agent.down.sql
|
||||
--
|
||||
-- Reverse 046_workspace_status_awaiting_agent.up.sql.
|
||||
--
|
||||
-- Postgres does NOT support DROP VALUE on an enum. The standard rollback
|
||||
-- recipe is rename → recreate → cast → drop, which is intrusive (locks
|
||||
-- workspaces with ACCESS EXCLUSIVE the same way migration 043 did). We
|
||||
-- punt: only run this manually, and only if you're prepared to nuke and
|
||||
-- recreate the type. Application code WILL fail if the value disappears
|
||||
-- and any row currently has it; pre-flight with:
|
||||
--
|
||||
-- UPDATE workspaces SET status = 'offline'
|
||||
-- WHERE status = 'awaiting_agent';
|
||||
--
|
||||
-- before running the recipe below.
|
||||
|
||||
BEGIN;
|
||||
|
||||
SET LOCAL lock_timeout = '5s';
|
||||
|
||||
-- Convert any existing awaiting_agent / hibernating rows to a value the
|
||||
-- new enum will accept. 'offline' is the safest fallback for awaiting_agent
|
||||
-- (operator can re-register to bring them back online); 'hibernated' is
|
||||
-- the natural terminal of an in-flight 'hibernating'.
|
||||
UPDATE workspaces SET status = 'offline' WHERE status = 'awaiting_agent';
|
||||
UPDATE workspaces SET status = 'hibernated' WHERE status = 'hibernating';
|
||||
|
||||
ALTER TYPE workspace_status RENAME TO workspace_status_with_awaiting;
|
||||
|
||||
CREATE TYPE workspace_status AS ENUM (
|
||||
'provisioning',
|
||||
'online',
|
||||
'offline',
|
||||
'degraded',
|
||||
'failed',
|
||||
'removed',
|
||||
'paused',
|
||||
'hibernated'
|
||||
);
|
||||
|
||||
ALTER TABLE workspaces
|
||||
ALTER COLUMN status DROP DEFAULT;
|
||||
|
||||
ALTER TABLE workspaces
|
||||
ALTER COLUMN status TYPE workspace_status
|
||||
USING status::text::workspace_status;
|
||||
|
||||
ALTER TABLE workspaces
|
||||
ALTER COLUMN status SET DEFAULT 'provisioning'::workspace_status;
|
||||
|
||||
DROP TYPE workspace_status_with_awaiting;
|
||||
|
||||
COMMIT;
|
||||
@ -0,0 +1,60 @@
|
||||
-- 046_workspace_status_awaiting_agent.up.sql
|
||||
--
|
||||
-- Add the missing 'awaiting_agent' and 'hibernating' values to the
|
||||
-- workspace_status enum.
|
||||
--
|
||||
-- Migration 043 (2026-04-25) introduced the workspace_status enum but
|
||||
-- omitted two values that application code had already been writing:
|
||||
--
|
||||
-- 'awaiting_agent' — handlers/workspace.go:333 (since 2026-04-24,
|
||||
-- commit 1e8b5e01, "first-class BYO-compute"):
|
||||
-- external workspaces created without a URL are
|
||||
-- meant to park here until the agent registers.
|
||||
-- 'hibernating' — handlers/workspace_restart.go:271-272 (since
|
||||
-- 29_workspace_hibernation): intermediate state
|
||||
-- between 'online' and the final 'hibernated';
|
||||
-- acts as a DB-level claim while provisioner.Stop
|
||||
-- runs.
|
||||
--
|
||||
-- Every UPDATE that tried to write either value failed with
|
||||
-- `invalid input value for enum workspace_status: "..."`. The
|
||||
-- consequences were silent because both call sites either dropped or
|
||||
-- log-and-continued the error:
|
||||
--
|
||||
-- - workspace.go:333 discards the Exec result entirely. Canvas shows
|
||||
-- `awaiting_agent` in the response, but the row stays in
|
||||
-- 'provisioning' until first /registry/register.
|
||||
-- - workspace_restart.go:277 logs and returns; hibernation simply
|
||||
-- never happens. Idle workspaces consumed resources indefinitely
|
||||
-- for five days before this gate flagged it.
|
||||
--
|
||||
-- 2026-04-30 PR #2382 ("default external runtime to poll-mode +
|
||||
-- awaiting_agent") added two more silent-fail sites in
|
||||
-- registry/liveness.go and registry/healthsweep.go. Both log + continue.
|
||||
-- Result: liveness expiry no longer transitions runtime='external'
|
||||
-- rows, and the heartbeat-staleness sweep is a no-op for them.
|
||||
-- UI/canvas shows external workspaces stuck on 'online' or 'degraded'
|
||||
-- indefinitely after the agent disconnects.
|
||||
--
|
||||
-- Tests across all four sites used sqlmock, which matches SQL by regex
|
||||
-- but does not validate against the live enum constraint, so unit
|
||||
-- tests passed despite the prod-only failures. See
|
||||
-- feedback_mock_at_drifting_layer in operator memory.
|
||||
--
|
||||
-- This migration adds both enum values so all six call sites start
|
||||
-- succeeding. IF NOT EXISTS makes the migration idempotent across
|
||||
-- re-runs (RunMigrations in postgres.go re-applies migrations until
|
||||
-- schema_migrations records them; a future operator-driven re-run is
|
||||
-- harmless).
|
||||
--
|
||||
-- ALTER TYPE ADD VALUE is committed immediately by Postgres regardless
|
||||
-- of transaction wrapping, and does NOT take a heavy lock on workspaces.
|
||||
-- Safe to run during normal traffic.
|
||||
--
|
||||
-- A regression gate lives at internal/db/workspace_status_enum_drift_test.go:
|
||||
-- it parses every UPDATE/INSERT against `workspaces` in the Go tree
|
||||
-- and asserts every status literal is in the enum. That test would
|
||||
-- have caught both omissions on the day they shipped.
|
||||
|
||||
ALTER TYPE workspace_status ADD VALUE IF NOT EXISTS 'awaiting_agent';
|
||||
ALTER TYPE workspace_status ADD VALUE IF NOT EXISTS 'hibernating';
|
||||
Loading…
Reference in New Issue
Block a user