Compare commits

...

26 Commits

Author SHA1 Message Date
claude-ceo-assistant 72b0d4b1ab feat(plugins): workspace_plugins tracking table — version-subscription foundation
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Successful in 7s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Successful in 6s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Successful in 14s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 35s
CI / Detect changes (pull_request) Successful in 43s
Check migration collisions / Migration version collision check (pull_request) Successful in 44s
E2E API Smoke Test / detect-changes (pull_request) Successful in 31s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 28s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 27s
Harness Replays / detect-changes (pull_request) Successful in 33s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 30s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 22s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 9s
CI / Canvas (Next.js) (pull_request) Successful in 12s
CI / Python Lint & Test (pull_request) Successful in 15s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 14s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 12s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
Harness Replays / Harness Replays (pull_request) Failing after 29s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2m20s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 7m1s
CI / Platform (Go) (pull_request) Successful in 14m52s
Closes core#113 partial. Adds the DB foundation for the
version-subscription model. Drift detection + queue + admin apply
endpoint are follow-up scope (separate PR; filed as a new issue).

WHY THIS PR ONLY GETS US PART-WAY
  Plugin install state today is filesystem-only — '/configs/plugins/<name>/'
  inside the container. There's no DB record of 'plugin X installed at
  workspace W from source S, tracking ref T'. That makes drift detection
  impossible: nothing to compare upstream tags against.

  This PR adds the table + the install-endpoint hook that writes to it.
  With baseline tags now on every plugin (post internal#92), the table
  starts collecting tracked-ref values immediately on the next install.
  The actual drift-check job + queue + apply endpoint layer on top.

WHAT THIS ADDS
  workspace_plugins table:
    workspace_id   FK → workspaces(id) ON DELETE CASCADE
    plugin_name    canonical name from plugin.yaml
    source_raw     full source URL the install used
    tracked_ref    'none' | 'tag:vX.Y.Z' | 'tag:latest' | 'sha:<full>'
    installed_at, updated_at

  installRequest gains optional 'track' field (defaults to 'none').
  Install handler upserts the workspace_plugins row after delivery
  succeeds. DB write failure is logged but doesn't fail the install
  (the plugin IS in the container; surfacing 500 misleads the caller).

  validateTrackedRef enforces the closed set of accepted shapes:
    'none' | 'tag:<non-empty>' | 'sha:<non-empty>'
  Bare values like 'latest' / 'main' / version-strings without
  prefix are rejected — the drift detector keys on prefix to know
  what kind of resolution to do.

WHAT THIS DOES NOT ADD (filed separately)
  - Drift detector job (cron / on-demand) that scans
    'WHERE tracked_ref != none' rows and queues updates on upstream drift
  - plugin_update_queue table (separate migration once detector lands)
  - GET /admin/plugin-updates-pending and POST .../apply endpoints
  - Tier-aware apply (core#115 — composes here)

PHASE 4 SELF-REVIEW (FIVE-AXIS)
  Correctness: No finding — install endpoint behavior unchanged for
    callers that don't pass 'track'. DB write is best-effort + logged
    on failure. validateTrackedRef rejects ambiguous bare strings.
  Readability: No finding — separate file plugins_tracking.go isolates
    the new concern; install handler delta is a single 4-line block.
  Architecture: No finding — additive table; existing schema untouched.
    Migration 20260508160000_* uses the timestamp-prefixed convention.
  Security: No finding — INSERT params via  placeholders (no string
    interpolation). validateTrackedRef rejects unexpected shapes before
    the column constraint would.
  Performance: No finding — one extra ExecContext per install. Install
    is already seconds-scale (network fetch + tar + docker exec); rounds
    to noise.

TESTS (1 new, all green)
  TestValidateTrackedRef — pin closed set + structural validators

REFS
  core#113 — this issue (foundation only; drift+queue+apply = follow-up)
  internal#92, internal#93 — plugin/template baseline tags (now exists for tracking)
  core#114 — atomic install (this PR composes — no atomicity regression)
  core#115 — canary tier filter (will key off the same DB foundation)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 08:52:35 -07:00
claude-ceo-assistant f78d844960 Merge pull request 'feat(plugins): hot-reload classifier — skip restart on SKILL-content-only updates' (#121) from feat/plugin-hot-reload-classifier into main
CodeQL / Analyze (${{ matrix.language }}) (go) (push) Successful in 3s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (push) Successful in 3s
CodeQL / Analyze (${{ matrix.language }}) (python) (push) Successful in 5s
Block internal-flavored paths / Block forbidden paths (push) Successful in 18s
CI / Detect changes (push) Successful in 21s
E2E API Smoke Test / detect-changes (push) Successful in 15s
E2E Staging Canvas (Playwright) / detect-changes (push) Successful in 15s
Handlers Postgres Integration / detect-changes (push) Successful in 17s
Harness Replays / detect-changes (push) Successful in 18s
publish-workspace-server-image / build-and-push (push) Failing after 20s
Secret scan / Scan diff for credential-shaped strings (push) Successful in 19s
Runtime PR-Built Compatibility / detect-changes (push) Successful in 21s
CI / Shellcheck (E2E scripts) (push) Successful in 9s
CI / Canvas (Next.js) (push) Successful in 10s
CI / Canvas Deploy Reminder (push) Has been skipped
CI / Python Lint & Test (push) Successful in 9s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (push) Successful in 11s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (push) Successful in 7s
Harness Replays / Harness Replays (push) Failing after 19s
E2E API Smoke Test / E2E API Smoke Test (push) Failing after 1m15s
Handlers Postgres Integration / Handlers Postgres Integration (push) Successful in 3m20s
CI / Platform (Go) (push) Failing after 4m59s
Canary — staging SaaS smoke (every 30 min) / Canary smoke (push) Failing after 3m52s
2026-05-08 15:26:32 +00:00
claude-ceo-assistant 249e760fbd feat(plugins): hot-reload classifier — skip restart on SKILL-content-only updates
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Successful in 6s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Successful in 6s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Successful in 5s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 16s
Check merge_group trigger on required workflows / Required workflows have merge_group trigger (pull_request) Successful in 17s
branch-protection drift check / Branch protection drift (pull_request) Successful in 21s
E2E API Smoke Test / detect-changes (pull_request) Successful in 20s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 20s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 19s
Harness Replays / detect-changes (pull_request) Successful in 22s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 23s
CI / Detect changes (pull_request) Successful in 27s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 20s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 22s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 12s
CI / Canvas (Next.js) (pull_request) Successful in 10s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 6s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
CI / Python Lint & Test (pull_request) Successful in 8s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 9s
Harness Replays / Harness Replays (pull_request) Failing after 25s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 1m41s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 3m33s
CI / Platform (Go) (pull_request) Successful in 5m11s
Closes molecule-core#112. Composes with #114 (atomic install).

Before issuing restartFunc, classify the diff between staged and live:
  - skill-content-only: only **/SKILL.md content changed
                        → skip restart (Claude Code re-reads SKILL.md on
                          each Skill invocation; no in-memory cache)
  - cold: anything else
                        → restartFunc as before
                          (hooks/settings load at session start;
                          plugin.yaml is structural; added/removed files
                          require a fresh load)

DETECTION
  - Hash every regular file in staged tree (host filesystem, sha256)
  - Hash every regular file in live tree (in-container via docker exec
    sh -c 'cd <livePath> && find . -type f -print0 | xargs -0 sha256sum')
  - .complete marker dropped from comparison (mtime varies install-to-
    install; including it would force-cold every reinstall)
  - File added/removed → cold
  - File content differs but isn't SKILL.md → cold
  - All differences are SKILL.md basenames → skill-content-only

DEFAULTS COLD
  - First install (no live tree) → cold
  - Live tree read failure → cold (conservative; never hot-reload speculatively)
  - Symlinks skipped during hash (same posture as tar walker)

PHASE 4 SELF-REVIEW
  Correctness: No finding — all error paths default to cold; never
    falsely classify as skill-content-only. The .complete drop is
    a deliberate exception (the marker is bookkeeping, not content).
  Readability: No finding — single-purpose helpers (hashLocalTree,
    hashContainerTree, isSkillMarkdown, shQuote) each do one thing.
    The classifier itself reads as 'compare set, then walk diff with
    isSkillMarkdown gate.'
  Architecture: No finding — composes existing execAsRoot primitive;
    new helpers in plugins_classifier.go don't touch any other
    handler. Old behavior unchanged when live read fails.
  Security: No finding — shQuote single-quotes any non-trivial path,
    pluginName comes from validatePluginName-validated source, and
    the docker exec command takes the path as a single arg (xargs -0
    handles binary-safe path delimiting). Symlinks skipped.
  Performance: No finding — adds two tree walks (host + container)
    per install. Container walk is one docker exec call returning
    sha256 lines; for typical plugins (~10-50 files) round-trip is
    ~100ms. Versus the saved ~5-10s of restart on a hot-reloadable
    update, this is a clear win.

TESTS (4 new, all green; full handler suite green)
  TestIsSkillMarkdown        — basename match, case-sensitive
  TestHashLocalTree_StableHash — re-hash same dir = same map
  TestHashLocalTree_SymlinkSkipped — hostile link doesn't poison classifier
  TestShQuote                — quoting boundary for shell injection safety

REFS
  molecule-core#112 — this issue
  molecule-core#114 — atomic install (.complete marker added there)
  Reno-Stars iteration safety (Hongming 2026-05-08)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 08:26:05 -07:00
claude-ceo-assistant 3a4b62a52a Merge pull request 'chore(workflows): delete obsolete promote/sync workflows (Phase 3C of internal#81)' (#119) from chore/trunk-based-delete-obsolete-workflows into main
CI / Canvas Deploy Reminder (push) Blocked by required conditions
CodeQL / Analyze (${{ matrix.language }}) (go) (push) Successful in 5s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (push) Successful in 5s
CodeQL / Analyze (${{ matrix.language }}) (python) (push) Successful in 6s
Block internal-flavored paths / Block forbidden paths (push) Successful in 21s
E2E API Smoke Test / detect-changes (push) Successful in 18s
Check merge_group trigger on required workflows / Required workflows have merge_group trigger (push) Successful in 20s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (push) Successful in 14s
CI / Detect changes (push) Successful in 21s
Secret scan / Scan diff for credential-shaped strings (push) Successful in 13s
E2E Staging Canvas (Playwright) / detect-changes (push) Successful in 17s
Handlers Postgres Integration / detect-changes (push) Successful in 16s
Runtime PR-Built Compatibility / detect-changes (push) Successful in 16s
CI / Shellcheck (E2E scripts) (push) Successful in 6s
CI / Platform (Go) (push) Successful in 7s
E2E API Smoke Test / E2E API Smoke Test (push) Successful in 8s
Handlers Postgres Integration / Handlers Postgres Integration (push) Successful in 7s
CI / Python Lint & Test (push) Successful in 8s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (push) Successful in 7s
CI / Canvas (Next.js) (push) Successful in 8s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (push) Successful in 11s
2026-05-08 15:26:00 +00:00
claude-ceo-assistant b4eab9cef2 Merge branch 'main' into chore/trunk-based-delete-obsolete-workflows
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Successful in 7s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Successful in 7s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 24s
Check merge_group trigger on required workflows / Required workflows have merge_group trigger (pull_request) Successful in 24s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Successful in 18s
branch-protection drift check / Branch protection drift (pull_request) Successful in 25s
pr-guards / disable-auto-merge-on-push (pull_request) Successful in 10s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 23s
E2E API Smoke Test / detect-changes (pull_request) Successful in 29s
CI / Detect changes (pull_request) Successful in 35s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 24s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 22s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 21s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 21s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 10s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 9s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 12s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 12s
CI / Canvas (Next.js) (pull_request) Successful in 12s
CI / Platform (Go) (pull_request) Successful in 13s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
CI / Python Lint & Test (pull_request) Successful in 7s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 6s
2026-05-08 15:24:55 +00:00
claude-ceo-assistant 3e96184d6f Merge pull request 'feat(plugins): atomic install — stage→snapshot→swap→marker (docker path)' (#120) from feat/plugin-atomic-install into main
CodeQL / Analyze (${{ matrix.language }}) (go) (push) Successful in 5s
CodeQL / Analyze (${{ matrix.language }}) (python) (push) Successful in 5s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (push) Successful in 5s
Block internal-flavored paths / Block forbidden paths (push) Successful in 14s
CI / Detect changes (push) Successful in 19s
E2E API Smoke Test / detect-changes (push) Successful in 14s
Auto-sync main → staging / sync-staging (push) Failing after 25s
Handlers Postgres Integration / detect-changes (push) Successful in 16s
Harness Replays / detect-changes (push) Successful in 17s
E2E Staging Canvas (Playwright) / detect-changes (push) Successful in 19s
Secret scan / Scan diff for credential-shaped strings (push) Successful in 14s
publish-workspace-server-image / build-and-push (push) Failing after 18s
Runtime PR-Built Compatibility / detect-changes (push) Successful in 18s
CI / Shellcheck (E2E scripts) (push) Successful in 8s
CI / Canvas (Next.js) (push) Successful in 9s
CI / Python Lint & Test (push) Successful in 9s
CI / Canvas Deploy Reminder (push) Has been skipped
E2E Staging Canvas (Playwright) / Canvas tabs E2E (push) Successful in 12s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (push) Successful in 7s
Harness Replays / Harness Replays (push) Failing after 18s
E2E API Smoke Test / E2E API Smoke Test (push) Failing after 1m30s
CI / Platform (Go) (push) Has been cancelled
Handlers Postgres Integration / Handlers Postgres Integration (push) Has been cancelled
2026-05-08 15:23:31 +00:00
claude-ceo-assistant 48a24e6b3e Merge branch 'main' into chore/trunk-based-delete-obsolete-workflows
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Successful in 5s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Successful in 4s
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Successful in 5s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 10s
Check merge_group trigger on required workflows / Required workflows have merge_group trigger (pull_request) Successful in 9s
pr-guards / disable-auto-merge-on-push (pull_request) Successful in 4s
branch-protection drift check / Branch protection drift (pull_request) Successful in 15s
CI / Detect changes (pull_request) Successful in 14s
E2E API Smoke Test / detect-changes (pull_request) Successful in 13s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 11s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 13s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 12s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 12s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 13s
CI / Platform (Go) (pull_request) Successful in 7s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 6s
CI / Canvas (Next.js) (pull_request) Successful in 7s
CI / Python Lint & Test (pull_request) Successful in 7s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 9s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 9s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 10s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 1m4s
2026-05-08 15:23:05 +00:00
claude-ceo-assistant 7fbb8cb6e9 feat(plugins): atomic install — stage→snapshot→swap→marker (docker path)
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Successful in 4s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Successful in 4s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Successful in 4s
Retarget main PRs to staging / Retarget to staging (pull_request) Has been skipped
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 10s
E2E API Smoke Test / detect-changes (pull_request) Successful in 13s
CI / Detect changes (pull_request) Successful in 15s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 13s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 14s
Harness Replays / detect-changes (pull_request) Successful in 15s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 15s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 15s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 9s
CI / Canvas (Next.js) (pull_request) Successful in 10s
CI / Python Lint & Test (pull_request) Successful in 8s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 9s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 11s
Harness Replays / Harness Replays (pull_request) Failing after 20s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 1m55s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 3m47s
CI / Platform (Go) (pull_request) Successful in 7m36s
Closes molecule-core#114 for the docker (local-OSS) path.
EIC (SaaS) path tracked as a follow-up — same shape, different
exec primitives (ssh vs docker exec); shipping both in one PR
doubles the test surface.

THE FOUR-STEP DANCE
  1. STAGE     — docker.CopyToContainer extracts tar into
                 /configs/plugins/.staging/<name>.<ts>/
  2. SNAPSHOT  — if /configs/plugins/<name>/ exists, mv to
                 /configs/plugins/.previous/<name>.<ts>/
  3. SWAP      — atomic mv staging → live (single rename(2))
  4. MARKER    — touch /configs/plugins/<name>/.complete

Workspace-side plugin loaders should refuse to load any plugin dir
without .complete (separate small change, not in this PR — the marker
write is the necessary precursor; consumer side is a follow-up so
existing-content plugins don't break before they're re-installed).

ROLLBACK
  - Stage failure: rm -rf staging dir; live untouched
  - Snapshot failure: rm -rf staging dir; live untouched (no rename happened)
  - Swap failure with snapshot present: mv previous back to live
  - Swap failure (no snapshot): rm -rf staging; live (which never
    existed) stays absent
  - Marker failure: content already in place, log loudly with manual
    recovery hint (touch <plugin>/.complete) — don't roll back since
    the new content is what we wanted, just unmarked

GC
  Best-effort delete of previous-version snapshot after successful
  marker write. Failures non-fatal — next install or a separate
  sweeper reclaims. Sweeper for stale .previous/* across reboots is
  follow-up scope.

CONCURRENCY
  Each install gets a unique stamp (UTC second precision), so two
  concurrent reinstalls land in distinct staging dirs and the second
  swap simply overwrites the first's live result. The atomicity is
  per-install, not cross-install — by design (the platform serializes
  POST /workspaces/:id/plugins via Go-side semaphore upstream of
  this code, so cross-install collisions don't reach here).

CHANGES
  + plugins_atomic.go        — installVersion + atomicCopyToContainer
  + plugins_atomic_tar.go    — tarWalk/tarHostDirWithPrefix helpers
  + plugins_atomic_test.go   — 5 unit tests (paths, stamp shape,
                               tar happy path, symlink-skip, prefix
                               normalization). All green.
  ~ plugins_install_pipeline.go::deliverToContainer — swap
    copyPluginToContainer call to atomicCopyToContainer

Old copyPluginToContainer is retained (still called by Download()) so
this PR is purely additive on the install path; no public API change.

PHASE 4 SELF-REVIEW (FIVE-AXIS)
  Correctness: Required (addressed) — swap-failure rollback writes mv
    of previous back to live before returning the error; if rollback
    itself fails, we wrap both errors and surface the combined fault.
    Marker-write failure is treated as content-landed-but-unmarked
    (LOG, don't roll back the new content).
  Readability: No finding — installVersion path methods make the
    /staging/.previous/live/marker layout obvious from one struct.
    tarWalk extracted from the inline filepath.Walk in
    plugins_install_pipeline.go for testability.
  Architecture: No finding — atomicCopyToContainer composes existing
    execAsRoot / docker.CopyToContainer primitives; no new dependencies.
    Old copyPluginToContainer kept for Download() — single responsibility
    per function.
  Security: No finding — symlinks still skipped during tar walk
    (defense vs hostile plugin escaping its own dir). Marker writes
    use composeable path.Join, no user input touches the path.
  Performance: No finding — adds ~3 docker exec calls per install
    (mkdir, mv-snapshot, mv-swap, touch — actually 4) on top of the
    one CopyToContainer. Each exec ~50-100ms in practice; install
    end-to-end was already seconds-scale, this rounds to noise.

REFS
  molecule-core#114 — this issue
  Companion: molecule-core#112 (hot-reload classifier — depends on .complete marker)
  Companion: molecule-core#113 (version subscription — uses install machinery)
  EIC follow-up: separate issue to be filed for SaaS path parity

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 08:22:52 -07:00
claude-ceo-assistant d543138bde Merge pull request 'chore: promote 5 staging-only feature PRs to main (Phase 3 of internal#81)' (#108) from chore/promote-staging-features-to-main into main
CodeQL / Analyze (${{ matrix.language }}) (go) (push) Successful in 9s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (push) Successful in 9s
CodeQL / Analyze (${{ matrix.language }}) (python) (push) Successful in 9s
Block internal-flavored paths / Block forbidden paths (push) Successful in 12s
CI / Detect changes (push) Successful in 13s
Auto-sync main → staging / sync-staging (push) Failing after 15s
E2E Staging Canvas (Playwright) / detect-changes (push) Successful in 12s
Secret scan / Scan diff for credential-shaped strings (push) Successful in 11s
Handlers Postgres Integration / detect-changes (push) Successful in 12s
E2E API Smoke Test / detect-changes (push) Successful in 14s
Runtime PR-Built Compatibility / detect-changes (push) Successful in 12s
CI / Shellcheck (E2E scripts) (push) Successful in 4s
CI / Python Lint & Test (push) Successful in 6s
CI / Platform (Go) (push) Successful in 7s
CI / Canvas (Next.js) (push) Successful in 6s
Handlers Postgres Integration / Handlers Postgres Integration (push) Successful in 6s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (push) Successful in 7s
CI / Canvas Deploy Reminder (push) Has been skipped
E2E API Smoke Test / E2E API Smoke Test (push) Successful in 7s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (push) Successful in 6s
2026-05-08 15:22:12 +00:00
claude-ceo-assistant bfcb0fc445 Merge branch 'main' into chore/promote-staging-features-to-main
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Successful in 6s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Successful in 6s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Successful in 6s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 12s
pr-guards / disable-auto-merge-on-push (pull_request) Successful in 7s
CI / Detect changes (pull_request) Successful in 16s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 17s
E2E API Smoke Test / detect-changes (pull_request) Successful in 18s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 17s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 19s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 19s
CI / Platform (Go) (pull_request) Successful in 8s
CI / Canvas (Next.js) (pull_request) Successful in 10s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 8s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
CI / Python Lint & Test (pull_request) Successful in 8s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 7s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 6s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 7s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 8s
2026-05-08 15:21:18 +00:00
claude-ceo-assistant 2752a217c8 Merge pull request 'fix(pendinguploads): wait for error metric before test exit' (#111) from fix/pendinguploads-test-isolation into main
CodeQL / Analyze (${{ matrix.language }}) (go) (push) Successful in 2s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (push) Successful in 4s
CodeQL / Analyze (${{ matrix.language }}) (python) (push) Successful in 4s
Block internal-flavored paths / Block forbidden paths (push) Successful in 11s
E2E API Smoke Test / detect-changes (push) Successful in 14s
Auto-sync main → staging / sync-staging (push) Failing after 19s
E2E Staging Canvas (Playwright) / detect-changes (push) Successful in 16s
publish-workspace-server-image / build-and-push (push) Failing after 18s
Secret scan / Scan diff for credential-shaped strings (push) Successful in 16s
Handlers Postgres Integration / detect-changes (push) Successful in 19s
Harness Replays / detect-changes (push) Successful in 20s
Runtime PR-Built Compatibility / detect-changes (push) Successful in 18s
CI / Detect changes (push) Successful in 24s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (push) Successful in 12s
Handlers Postgres Integration / Handlers Postgres Integration (push) Successful in 10s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (push) Successful in 10s
CI / Shellcheck (E2E scripts) (push) Successful in 7s
CI / Canvas (Next.js) (push) Successful in 10s
CI / Python Lint & Test (push) Successful in 8s
CI / Canvas Deploy Reminder (push) Has been skipped
Harness Replays / Harness Replays (push) Failing after 21s
CI / Platform (Go) (push) Has been cancelled
E2E API Smoke Test / E2E API Smoke Test (push) Has been cancelled
2026-05-08 15:21:08 +00:00
claude-ceo-assistant c3686a4bb3 Merge branch 'main' into fix/pendinguploads-test-isolation
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Successful in 0s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Successful in 0s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Successful in 1s
pr-guards / disable-auto-merge-on-push (pull_request) Successful in 1s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 5s
CI / Detect changes (pull_request) Successful in 8s
E2E API Smoke Test / detect-changes (pull_request) Successful in 7s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 6s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 6s
Harness Replays / detect-changes (pull_request) Successful in 7s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 5s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 6s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 2s
CI / Python Lint & Test (pull_request) Successful in 3s
CI / Canvas (Next.js) (pull_request) Successful in 4s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 4s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 4s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 4s
Harness Replays / Harness Replays (pull_request) Failing after 5s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 1m59s
CI / Platform (Go) (pull_request) Successful in 4m39s
2026-05-08 15:20:36 +00:00
claude-ceo-assistant e37a289eb6 Merge pull request 'feat(org-import): inject per-role persona env from operator-host bootstrap dir' (#110) from feat/persona-env-injection into main
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (push) Successful in 7s
Block internal-flavored paths / Block forbidden paths (push) Successful in 8s
CodeQL / Analyze (${{ matrix.language }}) (go) (push) Successful in 7s
CodeQL / Analyze (${{ matrix.language }}) (python) (push) Successful in 7s
Auto-sync main → staging / sync-staging (push) Failing after 12s
CI / Detect changes (push) Successful in 11s
E2E API Smoke Test / detect-changes (push) Successful in 12s
Harness Replays / detect-changes (push) Successful in 12s
Handlers Postgres Integration / detect-changes (push) Successful in 13s
E2E Staging Canvas (Playwright) / detect-changes (push) Successful in 13s
publish-workspace-server-image / build-and-push (push) Failing after 13s
Runtime PR-Built Compatibility / detect-changes (push) Successful in 12s
Secret scan / Scan diff for credential-shaped strings (push) Successful in 11s
CI / Shellcheck (E2E scripts) (push) Successful in 3s
CI / Python Lint & Test (push) Successful in 4s
CI / Canvas (Next.js) (push) Successful in 6s
CI / Canvas Deploy Reminder (push) Has been skipped
Runtime PR-Built Compatibility / PR-built wheel + import smoke (push) Successful in 4s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (push) Successful in 5s
Harness Replays / Harness Replays (push) Failing after 6s
E2E API Smoke Test / E2E API Smoke Test (push) Successful in 54s
Handlers Postgres Integration / Handlers Postgres Integration (push) Successful in 1m18s
CI / Platform (Go) (push) Successful in 2m23s
2026-05-08 15:17:17 +00:00
claude-ceo-assistant 61166f8848 Merge pull request 'feat(local-dev): air-based hot-reload for workspace-server in docker-compose dev mode' (#118) from feat/air-hot-reload-dev into main
Runtime PR-Built Compatibility / PR-built wheel + import smoke (push) Blocked by required conditions
CodeQL / Analyze (${{ matrix.language }}) (go) (push) Successful in 0s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (push) Successful in 1s
CodeQL / Analyze (${{ matrix.language }}) (python) (push) Successful in 1s
Block internal-flavored paths / Block forbidden paths (push) Successful in 4s
CI / Detect changes (push) Successful in 9s
E2E API Smoke Test / detect-changes (push) Successful in 8s
Auto-sync main → staging / sync-staging (push) Failing after 10s
E2E Staging Canvas (Playwright) / detect-changes (push) Successful in 8s
Secret scan / Scan diff for credential-shaped strings (push) Successful in 7s
Harness Replays / detect-changes (push) Successful in 8s
publish-workspace-server-image / build-and-push (push) Failing after 8s
Handlers Postgres Integration / detect-changes (push) Successful in 8s
CI / Shellcheck (E2E scripts) (push) Successful in 2s
CI / Python Lint & Test (push) Successful in 4s
CI / Canvas (Next.js) (push) Successful in 4s
Handlers Postgres Integration / Handlers Postgres Integration (push) Successful in 3s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (push) Successful in 5s
CI / Canvas Deploy Reminder (push) Has been skipped
Harness Replays / Harness Replays (push) Failing after 6s
CI / Platform (Go) (push) Has been cancelled
E2E API Smoke Test / E2E API Smoke Test (push) Has been cancelled
Runtime PR-Built Compatibility / detect-changes (push) Has been cancelled
2026-05-08 15:16:58 +00:00
claude-ceo-assistant 9d50a6dae4 feat(local-dev): air-based hot-reload for workspace-server
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Successful in 1s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Successful in 1s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Successful in 1s
Retarget main PRs to staging / Retarget to staging (pull_request) Has been skipped
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 4s
CI / Detect changes (pull_request) Successful in 7s
E2E API Smoke Test / detect-changes (pull_request) Successful in 6s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 7s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 7s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 7s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 7s
Harness Replays / detect-changes (pull_request) Successful in 7s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 3s
CI / Python Lint & Test (pull_request) Successful in 3s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 2s
CI / Canvas (Next.js) (pull_request) Successful in 6s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 5s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 3s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
Harness Replays / Harness Replays (pull_request) Failing after 6s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 43s
CI / Platform (Go) (pull_request) Successful in 2m1s
Closes core#116. Brings local-dev iteration parity with the canvas's
Turbopack HMR — edit a Go file, see the platform restart in <5s
instead of running 'docker compose up --build' (~30s) per change.

USAGE
  make dev   # docker compose with air-driven live reload
  make up    # production-shape stack (no air, normal Dockerfile)

WHAT THIS ADDS
  workspace-server/.air.toml      — air watch config
  workspace-server/Dockerfile.dev — air-on-golang:1.25-alpine, dev-only
  docker-compose.dev.yml          — overlay swapping platform service
                                    to Dockerfile.dev + bind-mounting
                                    workspace-server/ source
  Makefile                        — make {dev,up,down,logs,build,test}

WHAT THIS DOES NOT TOUCH
  workspace-server/Dockerfile (production multi-stage build)
  docker-compose.yml          (prod-shape stack)
  CI workflows                (build prod image directly)
  Tenant deployment / SaaS    (image swap stays the model)

Pure additive. Existing 'docker compose up' path unchanged; production
stays on the static binary. Air install pinned via go install at image
build time so the dev image is reproducible-enough for local use (we
don't pin air to a SHA — the dev image is rebuilt locally and updates
opportunistically).

PHASE 4 SELF-REVIEW (FIVE-AXIS)
  Correctness: No finding — additive change, no existing path modified.
    .air.toml watches .go + .yaml under workspace-server/, excludes
    _test.go and tests dir so test edits don't trigger rebuild.
    Dockerfile.dev mirrors prod's 'go mod download' so first rebuild
    is fast.
  Readability: No finding — three small files plus a Makefile, each
    with header comments explaining the WHY, not just the WHAT. The
    Makefile uses the standard ## help-target pattern.
  Architecture: No finding — overlay pattern (docker-compose.dev.yml
    on top of docker-compose.yml) is the standard compose convention
    for env-specific overrides. Doesn't fork the prod path.
  Security: No finding because no production code path; dev-only image
    isn't built in CI and isn't published to ECR.
  Performance: No finding — air debounce=500ms, exclude_unchanged=true
    so a save that doesn't change content is a no-op rebuild.

REFS
  core#116 — this issue
  Companion: core#117 (workspace-side config-watcher for hot-reload of
  config.yaml) — different scope; this issue is platform-only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 08:10:50 -07:00
dev-lead 9e18ab4620 fix(pendinguploads): wait for error metric before test exit
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Successful in 0s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Successful in 0s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Successful in 1s
Retarget main PRs to staging / Retarget to staging (pull_request) Has been skipped
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 5s
CI / Detect changes (pull_request) Successful in 8s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 6s
E2E API Smoke Test / detect-changes (pull_request) Successful in 8s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 7s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 8s
Harness Replays / detect-changes (pull_request) Successful in 7s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 7s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 3s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 3s
CI / Python Lint & Test (pull_request) Successful in 5s
CI / Canvas (Next.js) (pull_request) Successful in 5s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 4s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 5s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
Harness Replays / Harness Replays (pull_request) Failing after 6s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 1m0s
CI / Platform (Go) (pull_request) Successful in 4m34s
TestStartSweeper_TransientErrorDoesNotCrashLoop leaks an in-flight
metric write across the test boundary: cycleDone fires inside the
fake's Sweep defer (before Sweep returns), waitForCycle returns
immediately after, cancel() lands, but the goroutine still has
metrics.PendingUploadsSweepError() to execute. Whether that write
happens before or after the next test's metricDelta() baseline read
is a coin-flip on slow CI hosts.

Outcome: TestStartSweeper_RecordsMetricsOnSuccess fails with
"error counter delta = 1, want 0" — looks like a real bug, isn't.
Instrumented analysis (per the file's existing waitForMetricDelta
docstring covering the same shape) confirms the metric IS getting
recorded, just AFTER the next test reads its baseline.

The Records* tests already use waitForMetricDelta to close this race
on their own assertions. This change extends the same shape to
TransientErrorDoesNotCrashLoop so it doesn't poison subsequent tests'
baselines.

Verified by running `go test -race -count=20 ./internal/pendinguploads/...`
locally — passes deterministically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 07:37:45 -07:00
claude-ceo-assistant 08e8d325e2 chore(workflows): delete obsolete promote/sync workflows (Phase 3C of internal#81)
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Successful in 3s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Successful in 4s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Successful in 4s
Retarget main PRs to staging / Retarget to staging (pull_request) Has been skipped
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 9s
Check merge_group trigger on required workflows / Required workflows have merge_group trigger (pull_request) Successful in 9s
branch-protection drift check / Branch protection drift (pull_request) Successful in 13s
CI / Detect changes (pull_request) Successful in 13s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 13s
E2E API Smoke Test / detect-changes (pull_request) Successful in 13s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 11s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 14s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 12s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 12s
Harness Replays / detect-changes (pull_request) Successful in 14s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 4s
CI / Python Lint & Test (pull_request) Successful in 7s
CI / Canvas (Next.js) (pull_request) Successful in 8s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 8s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 9s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 6s
Harness Replays / Harness Replays (pull_request) Successful in 6s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 25s
CI / Platform (Go) (pull_request) Successful in 7m2s
Trunk-based migration final cleanup for molecule-core. The 6 workflows
deleted here all existed to manage the staging↔main branch dance that
trunk-based makes obsolete:

  - auto-promote-staging.yml         fast-forward staging→main on green
  - auto-promote-on-e2e.yml          alt promote path on E2E green
  - auto-promote-stale-alarm.yml     alarm if staging promotion stalls
  - auto-sync-main-to-staging.yml    sync main→staging after UI merges
  - auto-sync-canary.yml             dry-run probe of the auto-sync
                                     token+push path
  - retarget-main-to-staging.yml     rebase open PRs onto staging

After Phase 3A (PR #108 promoted 5 staging-only feature PRs to main)
and Phase 3B (PR #109 dropped staging-branch triggers from the 4 e2e
workflows), main is the only branch the CI cares about. None of the
above workflows have anything to do; they're 1977 lines of dead Go-time-
no-Gitea-time-yes code.

Rollback: `git revert` this commit to restore the workflows. They still
work mechanically; trunk-based just doesn't need them.

The `staging` branch on the remote is deleted in a follow-up step
(`git push origin --delete staging`) after this PR merges, so reviewers
can confirm CI runs cleanly on the new shape before the ref disappears.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 14:18:35 +00:00
claude-ceo-assistant ff8cc48340 ci: retrigger after AUTO_SYNC_TOKEN rotated to devops-engineer (was 401 against any repo)
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Successful in 5s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Successful in 5s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Successful in 10s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 20s
Check merge_group trigger on required workflows / Required workflows have merge_group trigger (pull_request) Successful in 20s
branch-protection drift check / Branch protection drift (pull_request) Successful in 27s
CI / Detect changes (pull_request) Successful in 26s
pr-guards / disable-auto-merge-on-push (pull_request) Successful in 7s
E2E API Smoke Test / detect-changes (pull_request) Successful in 22s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 23s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 19s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 16s
Harness Replays / detect-changes (pull_request) Successful in 23s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 11s
CI / Canvas (Next.js) (pull_request) Successful in 13s
CI / Python Lint & Test (pull_request) Successful in 10s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 24s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 23s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 14s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 9s
Harness Replays / Harness Replays (pull_request) Failing after 24s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 6m4s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 6m26s
CI / Platform (Go) (pull_request) Failing after 9m31s
2026-05-08 14:16:27 +00:00
claude-ceo-assistant 43b33bcaa5 feat(org-import): inject per-role persona env from operator-host bootstrap dir
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Successful in 0s
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Successful in 1s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Successful in 0s
Retarget main PRs to staging / Retarget to staging (pull_request) Has been skipped
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 6s
E2E API Smoke Test / detect-changes (pull_request) Successful in 7s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 8s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 8s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 8s
Harness Replays / detect-changes (pull_request) Successful in 8s
CI / Detect changes (pull_request) Successful in 9s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 8s
CI / Python Lint & Test (pull_request) Successful in 4s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 4s
CI / Canvas (Next.js) (pull_request) Successful in 4s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 3s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 5s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
Harness Replays / Harness Replays (pull_request) Failing after 5s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 2m16s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2m24s
CI / Platform (Go) (pull_request) Successful in 3m23s
Wires the 28 dev-tree persona credentials minted 2026-05-08 into the
workspace-secrets path used by org_import. When a workspace.yaml carries
`role: <name>`, the importer now reads
$MOLECULE_PERSONA_ROOT/<role>/env (default
/etc/molecule-bootstrap/personas/<role>/env, populated by the bootstrap
kit on the tenant host) and merges the role's GITEA_USER /
GITEA_TOKEN / GITEA_TOKEN_SCOPES / GITEA_USER_EMAIL /
GITEA_SSH_KEY_PATH into the same envVars map that already feeds
workspace_secrets via parseEnvFile + crypto.Encrypt + INSERT.

PRECEDENCE
  Persona env is the LOWEST layer:
    0. Persona env (per-role)
    1. Org root .env (shared)
    2. Workspace .env (per-workspace)
  Each later layer overrides the previous, so a workspace .env can
  pin a different GITEA_TOKEN if it ever needs to (testing, override).

WHY THIS LAYERING
  Workspaces should boot with the role's identity by default. .env
  files stay the explicit-override mechanism for the (rare) case where
  a workspace needs to deviate. No new behavior for workspaces with no
  role: persona load is silent no-op when ws.Role is empty or unsafe.

SECURITY
  isSafeRoleName accepts only [A-Za-z0-9_-]+ (no '..', '/', or
  separators) — admin-only construct, but defense-in-depth keeps the
  persona dir shape invariant. Test
  TestLoadPersonaEnvFile_RejectsTraversal pins the rejection set against
  a planted target file.

OPERATOR-HOST CONTRACT
  The 28 persona env files live at /etc/molecule-bootstrap/personas/<role>/env
  (mode 600, owner root:root) with the per-role token-scope tailoring
  Hongming approved 2026-05-08 (D5). Synced via task #241. Override via
  MOLECULE_PERSONA_ROOT for tests + non-prod hosts.

TESTS (7 new, all green)
  TestLoadPersonaEnvFile_HappyPath        — typical persona-env shape
  TestLoadPersonaEnvFile_MissingDir       — silent no-op when file absent
  TestLoadPersonaEnvFile_EmptyRole        — silent no-op when role empty
  TestLoadPersonaEnvFile_RejectsTraversal — planted file unreachable
                                            via '../../etc/passwd' etc.
  TestLoadPersonaEnvFile_DefaultRoot      — falls back to /etc/...
  TestLoadPersonaEnvFile_OverwritesEmptyMap
  TestIsSafeRoleName_Acceptance           — positive + negative role names

PHASE 4 SELF-REVIEW (FIVE-AXIS)
  Correctness: No finding — additive change, silent no-op on the ws.Role==''
    path covers every existing workspace; tests cover happy path + each
    rejection mode + missing-dir.
  Readability: No finding — helper sits next to parseEnvFile in
    org_helpers.go with a comment block explaining WHY persona is
    lowest precedence.
  Architecture: No finding — fits the existing 'merge .env into envVars
    then INSERT INTO workspace_secrets' pattern that's been in place
    since the .env-driven workspace secrets feature; no new dependencies,
    no new tables.
  Security: Required (addressed) — path traversal blocked by
    isSafeRoleName. No finding beyond that since persona files are
    admin-managed and the helper does not log token values.
  Performance: No finding — one extra os.ReadFile per workspace at
    import time; amortized over workspace lifetime, cost is negligible.

REFS
  internal#85 — RFC for SOP Phase 4 + structured Five-Axis (parent context)
  Saved memories: feedback_per_agent_gitea_identity_default,
                  feedback_unified_credentials_file
  Task #241 — operator-host sync (already DONE; populated 28 dirs)
  Task #242 — this PR

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 07:09:40 -07:00
claude-ceo-assistant c5669aa304 ci: retrigger after operator disk freed (was ENOSPC during harness boot)
Check merge_group trigger on required workflows / Required workflows have merge_group trigger (pull_request) Successful in 10s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 13s
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Successful in 4s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Successful in 4s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Successful in 4s
branch-protection drift check / Branch protection drift (pull_request) Successful in 17s
pr-guards / disable-auto-merge-on-push (pull_request) Successful in 4s
CI / Detect changes (pull_request) Successful in 15s
E2E API Smoke Test / detect-changes (pull_request) Successful in 13s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 12s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 14s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 12s
Harness Replays / detect-changes (pull_request) Successful in 13s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 13s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 13s
CI / Canvas (Next.js) (pull_request) Successful in 10s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 11s
CI / Python Lint & Test (pull_request) Successful in 12s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 16s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 12s
Harness Replays / Harness Replays (pull_request) Failing after 25s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 5m25s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 5m46s
CI / Platform (Go) (pull_request) Successful in 8m55s
2026-05-08 14:00:14 +00:00
claude-ceo-assistant bbfcaedece ci: retrigger after harness-tenant-alpha unhealthy on first run
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Successful in 8s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Successful in 8s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Successful in 8s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 15s
Check merge_group trigger on required workflows / Required workflows have merge_group trigger (pull_request) Successful in 18s
branch-protection drift check / Branch protection drift (pull_request) Successful in 23s
CI / Detect changes (pull_request) Successful in 24s
pr-guards / disable-auto-merge-on-push (pull_request) Successful in 8s
E2E API Smoke Test / detect-changes (pull_request) Successful in 22s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 23s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 24s
Harness Replays / detect-changes (pull_request) Successful in 23s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 20s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 25s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 17s
CI / Canvas (Next.js) (pull_request) Successful in 19s
CI / Python Lint & Test (pull_request) Successful in 18s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 30s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 22s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 15s
CI / Platform (Go) (pull_request) Failing after 2m24s
Harness Replays / Harness Replays (pull_request) Failing after 2m8s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2m35s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Failing after 2m19s
Harness Replays job failed at "dependency failed to start: container
harness-tenant-alpha-1 is unhealthy" — that is not caused by this
merge (which adds workspace-server/internal/handlers code, not
container infra). Retry to confirm it was a transient environmental
issue (likely operator-host load/disk per internal#78).
2026-05-08 13:31:27 +00:00
devops-engineer 7d3a6a46c5 chore: sync main → staging (auto, ae2d9eab)
CodeQL / Analyze (${{ matrix.language }}) (go) (push) Successful in 9s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (push) Successful in 9s
Check merge_group trigger on required workflows / Required workflows have merge_group trigger (push) Successful in 16s
Block internal-flavored paths / Block forbidden paths (push) Successful in 17s
CodeQL / Analyze (${{ matrix.language }}) (python) (push) Successful in 8s
CI / Detect changes (push) Successful in 20s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (push) Successful in 18s
Secret scan / Scan diff for credential-shaped strings (push) Successful in 21s
E2E API Smoke Test / detect-changes (push) Successful in 24s
Handlers Postgres Integration / detect-changes (push) Successful in 24s
Runtime PR-Built Compatibility / detect-changes (push) Successful in 25s
CI / Shellcheck (E2E scripts) (push) Successful in 11s
CI / Platform (Go) (push) Successful in 13s
CI / Canvas (Next.js) (push) Successful in 13s
CI / Python Lint & Test (push) Successful in 11s
CI / Canvas Deploy Reminder (push) Has been skipped
Runtime PR-Built Compatibility / PR-built wheel + import smoke (push) Successful in 9s
E2E API Smoke Test / E2E API Smoke Test (push) Failing after 3m30s
Handlers Postgres Integration / Handlers Postgres Integration (push) Failing after 3m29s
2026-05-08 13:30:46 +00:00
claude-ceo-assistant ae2d9eabf6 Merge pull request 'chore(workflows): drop staging-branch triggers (Phase 3b of internal#81)' (#109) from chore/trunk-based-drop-staging-from-workflow-triggers into main
CodeQL / Analyze (${{ matrix.language }}) (go) (push) Successful in 8s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (push) Successful in 8s
CodeQL / Analyze (${{ matrix.language }}) (python) (push) Successful in 8s
Block internal-flavored paths / Block forbidden paths (push) Successful in 14s
Check merge_group trigger on required workflows / Required workflows have merge_group trigger (push) Successful in 16s
CI / Detect changes (push) Successful in 19s
Auto-sync main → staging / sync-staging (push) Successful in 22s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (push) Successful in 14s
Handlers Postgres Integration / detect-changes (push) Successful in 17s
E2E API Smoke Test / detect-changes (push) Successful in 19s
E2E Staging Canvas (Playwright) / detect-changes (push) Successful in 20s
Runtime PR-Built Compatibility / detect-changes (push) Successful in 18s
Secret scan / Scan diff for credential-shaped strings (push) Successful in 18s
CI / Shellcheck (E2E scripts) (push) Successful in 11s
CI / Platform (Go) (push) Successful in 14s
CI / Canvas (Next.js) (push) Successful in 14s
CI / Python Lint & Test (push) Successful in 13s
Handlers Postgres Integration / Handlers Postgres Integration (push) Successful in 12s
CI / Canvas Deploy Reminder (push) Has been skipped
E2E API Smoke Test / E2E API Smoke Test (push) Successful in 12s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (push) Successful in 11s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (push) Successful in 6m3s
branch-protection drift check / Branch protection drift (push) Failing after 12s
Canary — staging SaaS smoke (every 30 min) / Canary smoke (push) Failing after 3m38s
2026-05-08 13:30:24 +00:00
claude-ceo-assistant 2fac4b61b4 chore(workflows): drop staging-branch triggers (Phase 3b of internal#81)
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Successful in 9s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Successful in 18s
Check merge_group trigger on required workflows / Required workflows have merge_group trigger (pull_request) Successful in 19s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Successful in 18s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 21s
Retarget main PRs to staging / Retarget to staging (pull_request) Has been skipped
branch-protection drift check / Branch protection drift (pull_request) Successful in 23s
CI / Detect changes (pull_request) Successful in 23s
E2E API Smoke Test / detect-changes (pull_request) Successful in 25s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 18s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 25s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 26s
CI / Platform (Go) (pull_request) Successful in 10s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 8s
CI / Canvas (Next.js) (pull_request) Successful in 11s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 16s
CI / Python Lint & Test (pull_request) Successful in 9s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 17s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 10s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 12s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 6s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 4m32s
Trunk-based migration: main is the only branch. Update 4 workflows
that fired on staging-branch pushes to fire on main instead.

  - e2e-staging-canvas.yml: drop staging from push + pull_request
  - e2e-staging-external.yml: drop staging from push + pull_request
  - e2e-staging-saas.yml: drop staging from push + pull_request,
    update header comment that references the (now-obsolete)
    staging→main auto-promote flow
  - redeploy-tenants-on-staging.yml: workflow_run.branches changes
    from [staging] to [main] so the tenant redeploy fires when
    publish-workspace-server-image runs on main

Workflows that target the staging tenant FLEET (canary-staging.yml,
e2e-staging-sanity.yml) are not changed — they fire on cron, the word
"staging" in their filenames refers to the deployment target environ-
ment, not the git branch.

Lands as Phase 3b after #108 promotes the 5 staging-only feature PRs
(Phase 3a). Phase 3c deletes the obsolete promote/sync workflows
(auto-promote-staging, auto-sync-main-to-staging, etc.) plus the
staging branch itself, after we no-op-verify both Phase 3a and 3b
green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 13:08:51 +00:00
claude-ceo-assistant 2597511d7b chore: promote 5 staging-only feature PRs to main (Phase 3 of internal#81)
Retarget main PRs to staging / Retarget to staging (pull_request) Has been skipped
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Successful in 3s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Successful in 6s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Successful in 7s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 9s
pr-guards / disable-auto-merge-on-push (pull_request) Successful in 4s
CI / Detect changes (pull_request) Successful in 11s
E2E API Smoke Test / detect-changes (pull_request) Successful in 13s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 13s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 14s
Harness Replays / detect-changes (pull_request) Successful in 13s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 11s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 11s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 5s
CI / Canvas (Next.js) (pull_request) Successful in 9s
CI / Python Lint & Test (pull_request) Successful in 7s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 5s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 8s
Harness Replays / Harness Replays (pull_request) Failing after 57s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 3m8s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 3m10s
CI / Platform (Go) (pull_request) Successful in 4m36s
This was supposed to fast-forward when each PR merged on staging,
but auto-promote-staging.yml has not been firing reliably on Gitea
since the GitHub suspension. Result: main is missing 5 substantive
feature PRs that landed on staging between 2026-04-29 and 2026-05-07:

  - #102: test(org-include) symlink-based subtree composition contract
  - #103: test(local-e2e) dev-department extraction end-to-end
  - #104: fix(provisioner)+test EvalSymlinks templatePath; stage-2 e2e
  - #105: feat(org-import) !external cross-repo subtree resolver (#222)
  - #106: test(org-external) integration + e2e for !external resolver

Each PR was independently reviewed and CI-green at staging-merge time;
this commit promotes the merged state atomically. Use git log on main
after the merge to see the original PR-merge commits preserved.

Sister work: Phase 3 of internal#81 (trunk-based migration). Workflow
trigger updates land in a follow-up PR; staging-branch deletion happens
after a no-op verification deploy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 13:07:22 +00:00
claude-ceo-assistant 5abc4f74ca harden(org-external): token via http.extraHeader, .complete cache marker, ref .. deny, naming cleanup (#107)
CodeQL / Analyze (${{ matrix.language }}) (go) (push) Successful in 1s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (push) Successful in 1s
CodeQL / Analyze (${{ matrix.language }}) (python) (push) Successful in 0s
Block internal-flavored paths / Block forbidden paths (push) Successful in 4s
CI / Detect changes (push) Successful in 9s
Secret scan / Scan diff for credential-shaped strings (push) Successful in 7s
E2E API Smoke Test / detect-changes (push) Successful in 9s
E2E Staging Canvas (Playwright) / detect-changes (push) Successful in 10s
Harness Replays / detect-changes (push) Successful in 9s
Handlers Postgres Integration / detect-changes (push) Successful in 9s
Runtime PR-Built Compatibility / detect-changes (push) Successful in 10s
CI / Shellcheck (E2E scripts) (push) Successful in 3s
CI / Python Lint & Test (push) Successful in 4s
CI / Canvas (Next.js) (push) Successful in 6s
CI / Canvas Deploy Reminder (push) Has been skipped
E2E Staging Canvas (Playwright) / Canvas tabs E2E (push) Successful in 6s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (push) Successful in 5s
Harness Replays / Harness Replays (push) Successful in 1m7s
publish-workspace-server-image / build-and-push (push) Successful in 1m58s
Handlers Postgres Integration / Handlers Postgres Integration (push) Successful in 2m4s
E2E API Smoke Test / E2E API Smoke Test (push) Successful in 2m10s
CI / Platform (Go) (push) Successful in 3m13s
Five-Axis self-review pass on the !external resolver work (PRs #105+#106) caught three real issues that the unstructured 3-weakest review missed:

1. Cache validity gap — partial cache writes looked complete
2. Token persistence — token in URL userinfo got persisted to .git/config
3. Misleading function name post-refactor

This PR fixes all three:
- .complete marker file written atomically; wipe-and-refetch on partial cache
- Token via -c http.extraHeader, never embedded in URL
- Defense-in-depth ref .. deny (was already validated by repoSafeRefRegex but explicit + tested)
- Renamed buildCloneURL -> buildExternalCloneURL (collision with artifacts.go), rewriteFilesDirAndIncludes -> rewriteFilesDir
- Removed unused redactToken/shortHash helpers and crypto/sha1, encoding/hex, fmt imports

Approved by platform-engineer 2026-05-08T12:55Z.
2026-05-08 13:04:00 +00:00
29 changed files with 1433 additions and 2003 deletions
-467
View File
@@ -1,467 +0,0 @@
name: Auto-promote :latest after main image build
# Retags `ghcr.io/molecule-ai/{platform,platform-tenant}:staging-<sha>`
# → `:latest` after either the image build or E2E completes on a `main`
# push, gated on E2E Staging SaaS not being red for that SHA.
#
# Why two triggers:
#
# `publish-workspace-server-image` and `e2e-staging-saas` are both
# paths-filtered, but with DIFFERENT path sets:
#
# publish-workspace-server-image:
# workspace-server/**, canvas/**, manifest.json
#
# e2e-staging-saas (full lifecycle):
# workspace-server/internal/handlers/{registry,workspace_provision,
# a2a_proxy}.go, workspace-server/internal/middleware/**,
# workspace-server/internal/provisioner/**, tests/e2e/test_staging_full_saas.sh
#
# The E2E set is a strict SUBSET of the publish set. So:
# - canvas/** changes → publish fires, E2E does not
# - workspace-server/cmd/** changes → publish fires, E2E does not
# - workspace-server/internal/sweep/** → publish fires, E2E does not
#
# The previous version triggered ONLY on E2E completion, which meant
# non-E2E-path changes (canvas, cmd, sweep, etc.) rebuilt the image
# but never advanced `:latest`. Result: as of 2026-04-28 this workflow
# had run zero times since merge despite eight main pushes — `:latest`
# was ~7 hours / 9 PRs behind main with no human realising. See
# `molecule-core` Slack discussion 2026-04-28.
#
# Adding `publish-workspace-server-image` as a second trigger closes
# the gap: any image rebuild on main eligibly advances `:latest`.
#
# Why E2E remains a kill-switch (not the trigger):
#
# When E2E DID run for this SHA and ended red, we abort — `:latest`
# stays on the prior known-good digest. When E2E didn't run (paths
# filtered out), we proceed: pre-merge gates already validated this
# SHA on staging via auto-promote-staging requiring CI + E2E Canvas +
# E2E API + CodeQL all green. Image content for non-E2E-paths
# (canvas, cmd, sweep) is exercised by those staging gates.
#
# Why `main` only:
#
# `:latest` is what prod tenants pull. We only want SHAs that have
# reached main (via auto-promote-staging) to advance `:latest`.
# Triggering on staging would let a staging-only revert advance
# `:latest` to a SHA that never reaches main, breaking the "production
# runs what's on main" invariant.
#
# Idempotency:
#
# When a SHA touches paths that match BOTH publish and E2E, both
# workflows fire and complete. Both trigger this workflow on
# completion → two runs race. Both retag `:staging-<sha>` →
# `:latest`. crane tag is idempotent (re-tagging the same digest is a
# no-op), so the second run is harmless. concurrency group serializes
# them anyway.
on:
workflow_run:
workflows:
- 'E2E Staging SaaS (full lifecycle)'
- 'publish-workspace-server-image'
types: [completed]
branches: [main]
workflow_dispatch:
inputs:
sha:
description: 'Short sha to promote (override; defaults to upstream workflow_run head_sha)'
required: false
type: string
permissions:
contents: read
packages: write
concurrency:
# Serialize promotes per-SHA so the publish+E2E both-fired race lands
# cleanly. Different SHAs can promote in parallel.
group: auto-promote-latest-${{ github.event.workflow_run.head_sha || github.event.inputs.sha || github.sha }}
cancel-in-progress: false
env:
IMAGE_NAME: ghcr.io/molecule-ai/platform
TENANT_IMAGE_NAME: ghcr.io/molecule-ai/platform-tenant
jobs:
promote:
# Proceed if upstream succeeded OR manual dispatch. Upstream-failure
# paths are filtered here; the E2E-was-red kill-switch lives in the
# gate-check step below (covers the case where upstream is publish
# success but E2E for the same SHA failed).
if: |
github.event_name == 'workflow_dispatch' ||
(github.event_name == 'workflow_run' && github.event.workflow_run.conclusion == 'success')
runs-on: ubuntu-latest
steps:
- name: Compute short sha
id: sha
run: |
set -euo pipefail
if [ -n "${{ github.event.inputs.sha }}" ]; then
FULL="${{ github.event.inputs.sha }}"
else
FULL="${{ github.event.workflow_run.head_sha }}"
fi
echo "short=${FULL:0:7}" >> "$GITHUB_OUTPUT"
echo "full=${FULL}" >> "$GITHUB_OUTPUT"
- name: Gate — E2E Staging SaaS state for this SHA
# When upstream IS E2E success, we know it's green (filtered by
# the job-level `if` already). When upstream is publish, look up
# E2E state for the same SHA. Four buckets:
#
# - completed/success: E2E confirmed safe → proceed
# - completed/failure|cancelled|timed_out: E2E found a
# regression → ABORT (exit 1), `:latest` stays put
# - in_progress|queued|requested: E2E is RACING with publish
# for a runtime-touching SHA. publish typically completes
# ~5-10min before E2E (~10-15min). If we promote on the
# publish signal here, a later E2E failure can't roll back
# `:latest` — it'd already be wrongly advanced. So we DEFER:
# skip subsequent steps (proceed=false) and let E2E's own
# completion event re-fire this workflow, which then takes
# the upstream-is-E2E path. exit 0 so the run shows as
# success rather than a noisy fake-failure.
# - none/none: E2E was paths-filtered out for this SHA (the
# change touched canvas/cmd/sweep/etc. — paths covered by
# publish but not by E2E). pre-merge gates on staging
# already validated this SHA → proceed.
#
# Manual dispatch skips this check — operator override.
id: gate
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
REPO: ${{ github.repository }}
SHA: ${{ steps.sha.outputs.full }}
UPSTREAM_NAME: ${{ github.event.workflow_run.name }}
EVENT_NAME: ${{ github.event_name }}
run: |
set -euo pipefail
if [ "$EVENT_NAME" = "workflow_dispatch" ]; then
echo "proceed=true" >> "$GITHUB_OUTPUT"
echo "::notice::Manual dispatch — skipping E2E gate (operator override)"
exit 0
fi
if [ "$UPSTREAM_NAME" = "E2E Staging SaaS (full lifecycle)" ]; then
echo "proceed=true" >> "$GITHUB_OUTPUT"
echo "::notice::Upstream is E2E itself (success per job-level if) — gate trivially satisfied"
exit 0
fi
# Upstream is publish-workspace-server-image. Check E2E state
# for the same SHA via Gitea's commit-status API.
#
# GitHub-era this was `gh run list --workflow=X --commit=SHA
# --json status,conclusion` returning either `[]` (no run on
# this SHA) or `[{status, conclusion}]` (the run's state).
# Gitea has NO workflow-runs API at all — `/api/v1/repos/.../
# actions/runs` returns 404 (verified 2026-05-07, issue #75).
# However Gitea Actions DOES emit a commit status per workflow
# job, with `context = "<Workflow Name> / <Job Name> (<event>)"`,
# which is exactly what we need: each E2E run leg becomes one
# status row on the SHA, and the aggregate state encodes the
# run's outcome.
#
# Mapping:
# 0 matched contexts → "none/none" (E2E paths-
# filtered
# out — same
# semantic
# as before)
# any context = pending → "in_progress/none" (defer)
# any context = error|failure → "completed/failure" (abort)
# all contexts = success → "completed/success" (proceed)
#
# The "completed/cancelled" and "completed/timed_out" buckets
# don't have direct Gitea analogs (Gitea statuses are
# success / failure / error / pending / warning). Per-SHA
# concurrency cancellation surfaces as `error` on Gitea, which
# we map to "completed/failure" rather than "completed/cancelled"
# — losing the soft-defer semantic of the cancelled bucket on
# this fleet. Tradeoff: the staleness alarm (auto-promote-stale-
# alarm.yml) still catches a stuck :latest within 4h, and a
# legitimate cancel is rare enough that aborting + manual
# re-dispatch is acceptable. If we measure cancel frequency
# > 1/week, revisit by reading the run-step-summary text via
# a follow-up script.
#
# Network or auth blips collapse to "none/none" via the curl
# `|| true` fallback, matching the pre-Gitea behaviour where
# an empty list also degenerated to none/none.
GITEA_API_URL="${GITHUB_SERVER_URL:-https://git.moleculesai.app}/api/v1"
STATUSES_JSON=$(curl --fail-with-body -sS \
-H "Authorization: token ${GH_TOKEN}" \
-H "Accept: application/json" \
"${GITEA_API_URL}/repos/${REPO}/commits/${SHA}/statuses?limit=100" \
2>/dev/null || echo "[]")
RESULT=$(printf '%s' "$STATUSES_JSON" | jq -r '
# Filter to E2E Staging SaaS (full lifecycle) statuses.
# Match by leading workflow-name prefix so the "<job>
# (<event>)" tail is irrelevant. Gitea emits the workflow
# name verbatim from the YAML `name:` field.
[.[] | select(.context | startswith("E2E Staging SaaS (full lifecycle) /"))] as $rows
| if ($rows | length) == 0 then
"none/none"
elif any($rows[]; .status == "pending") then
"in_progress/none"
elif any($rows[]; .status == "failure" or .status == "error") then
"completed/failure"
elif all($rows[]; .status == "success") then
"completed/success"
else
# Mixed / unknown — fall through to *) bucket below.
"completed/" + ($rows[0].status // "unknown")
end
' 2>/dev/null || echo "none/none")
echo "E2E Staging SaaS for ${SHA:0:7}: $RESULT"
case "$RESULT" in
completed/success)
echo "proceed=true" >> "$GITHUB_OUTPUT"
echo "::notice::E2E green for this SHA — proceeding with promote"
;;
completed/failure|completed/timed_out)
echo "proceed=false" >> "$GITHUB_OUTPUT"
{
echo "## ❌ Auto-promote aborted — E2E Staging SaaS failed"
echo
echo "E2E Staging SaaS for \`${SHA:0:7}\`: \`$RESULT\`"
echo "\`:latest\` stays on the prior known-good digest."
echo
echo "If the failure was a flake, manually dispatch this workflow with the same sha to override."
} >> "$GITHUB_STEP_SUMMARY"
exit 1
;;
completed/cancelled)
# GitHub-era only: cancelled ≠ failure. Gitea statuses
# don't expose a "cancelled" state — a per-SHA concurrency
# cancellation surfaces as `failure` or `error` on Gitea
# and is now handled by the failure branch above. This
# arm is kept for backwards compatibility / dual-host
# operation (if we ever add a non-Gitea fallback) but
# under the post-#75 flow it's unreachable.
echo "proceed=false" >> "$GITHUB_OUTPUT"
{
echo "## ⏭ Auto-promote deferred — E2E Staging SaaS was cancelled"
echo
echo "E2E Staging SaaS for \`${SHA:0:7}\`: \`$RESULT\`"
echo "Likely per-SHA concurrency (newer push superseded this E2E run)."
echo "The newer SHA's E2E will fire its own promote when it lands."
echo "If you need this specific SHA promoted, manually dispatch."
} >> "$GITHUB_STEP_SUMMARY"
;;
in_progress/*|queued/*|requested/*|waiting/*|pending/*)
echo "proceed=false" >> "$GITHUB_OUTPUT"
{
echo "## ⏳ Auto-promote deferred — E2E Staging SaaS still running"
echo
echo "Publish completed before E2E for \`${SHA:0:7}\` (state: \`$RESULT\`)."
echo "Skipping retag here — E2E's own completion event will re-fire this workflow."
echo "If E2E ends green, that run promotes \`:latest\`. If red, it aborts."
} >> "$GITHUB_STEP_SUMMARY"
;;
none/none)
echo "proceed=true" >> "$GITHUB_OUTPUT"
echo "::notice::E2E paths-filtered out for this SHA — pre-merge staging gates carry"
;;
*)
echo "proceed=false" >> "$GITHUB_OUTPUT"
{
echo "## ❓ Auto-promote aborted — unexpected E2E state"
echo
echo "E2E Staging SaaS for \`${SHA:0:7}\`: \`$RESULT\` (unhandled)"
echo "Manual investigation needed; re-dispatch with the same sha once resolved."
} >> "$GITHUB_STEP_SUMMARY"
exit 1
;;
esac
- if: steps.gate.outputs.proceed == 'true'
uses: imjasonh/setup-crane@6da1ae018866400525525ce74ff892880c099987 # v0.5
- name: GHCR login
if: steps.gate.outputs.proceed == 'true'
run: |
echo "${{ secrets.GITHUB_TOKEN }}" | \
crane auth login ghcr.io -u "${{ github.actor }}" --password-stdin
- name: Verify :staging-<sha> exists for both images
# Better to fail fast with a clear message than to half-tag
# (platform retagged but platform-tenant missing → tenants pull
# a stale image).
if: steps.gate.outputs.proceed == 'true'
run: |
set -euo pipefail
for img in "${IMAGE_NAME}" "${TENANT_IMAGE_NAME}"; do
tag="${img}:staging-${{ steps.sha.outputs.short }}"
if ! crane manifest "$tag" >/dev/null 2>&1; then
echo "::error::Missing tag: $tag"
echo "::error::publish-workspace-server-image must complete on this SHA before auto-promote can retag :latest."
exit 1
fi
echo " ok: $tag exists"
done
- name: Ancestry check — refuse to promote :latest backwards
# #2244: workflow_run completions arrive in arbitrary order. If
# SHA-A and SHA-B both reach main within ~10 min and SHA-B's E2E
# completes before SHA-A's, this workflow can fire for SHA-A
# AFTER it already promoted SHA-B → :latest goes backwards. The
# orphan-reconciler "next run corrects it" doesn't apply: there's
# no auto-corrective re-promote, :latest stays wrong until the
# next main push lands.
#
# Detection: read current :latest's `org.opencontainers.image.revision`
# label (set by publish-workspace-server-image.yml at build time)
# and ask the GitHub compare API whether the candidate SHA is
# ahead-of / identical-to / behind / diverged-from current.
# Hard-fail on `behind` and `diverged` per the approved design —
# silent-bypass is the class we're moving away from. Workflow
# goes red, oncall sees it, operator decides how to recover
# (manual dispatch with the right SHA, force-promote, etc.).
#
# Manual dispatch skips this check — operator override semantics
# match the gate-check step above.
#
# Backward-compat: when current :latest carries no revision
# label (legacy image pre-publish-with-label), skip-with-warning.
# All :latest images on main are post-label as of 2026-04-29, so
# this branch will be dead within 90 days; remove then.
if: steps.gate.outputs.proceed == 'true' && github.event_name != 'workflow_dispatch'
id: ancestry
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
REPO: ${{ github.repository }}
TARGET_SHA: ${{ steps.sha.outputs.full }}
run: |
set -euo pipefail
# Read the current :latest config and pull the revision label.
# `crane config` returns the OCI image config blob (not the manifest);
# labels live under `.config.Labels`. `// empty` makes jq return ""
# rather than the literal "null" so the test below works.
CURRENT_REVISION=$(crane config "${IMAGE_NAME}:latest" 2>/dev/null \
| jq -r '.config.Labels["org.opencontainers.image.revision"] // empty' \
|| true)
if [ -z "$CURRENT_REVISION" ]; then
echo "decision=skip-no-label" >> "$GITHUB_OUTPUT"
{
echo "## ⚠ Ancestry check skipped — current :latest has no revision label"
echo
echo "Likely a legacy image built before \`org.opencontainers.image.revision\` was set."
echo "Falling through to retag. After all \`:latest\` images are post-label (TODO 90 days), this branch is dead and should be removed."
} >> "$GITHUB_STEP_SUMMARY"
echo "::warning::Current :latest carries no revision label — skipping ancestry check (legacy image)"
exit 0
fi
if [ "$CURRENT_REVISION" = "$TARGET_SHA" ]; then
echo "decision=identical" >> "$GITHUB_OUTPUT"
echo "::notice:::latest already at ${TARGET_SHA:0:7} — retag will be a no-op"
exit 0
fi
# Ask GitHub which side of the merge graph TARGET_SHA sits on
# relative to CURRENT_REVISION. Returns one of: ahead | identical
# | behind | diverged. Network or auth errors collapse to "error"
# via the explicit fallback so the case below always matches.
STATUS=$(gh api \
"repos/${REPO}/compare/${CURRENT_REVISION}...${TARGET_SHA}" \
--jq '.status' 2>/dev/null || echo "error")
echo "ancestry compare ${CURRENT_REVISION:0:7} → ${TARGET_SHA:0:7}: $STATUS"
case "$STATUS" in
ahead)
echo "decision=ahead" >> "$GITHUB_OUTPUT"
echo "::notice::Target ${TARGET_SHA:0:7} is ahead of current :latest (${CURRENT_REVISION:0:7}) — proceeding with retag"
;;
identical)
echo "decision=identical" >> "$GITHUB_OUTPUT"
echo "::notice::Target identical to :latest — retag will be a no-op"
;;
behind)
echo "decision=behind" >> "$GITHUB_OUTPUT"
{
echo "## ❌ Auto-promote refused — target is BEHIND current :latest"
echo
echo "| Field | Value |"
echo "|---|---|"
echo "| Target SHA | \`$TARGET_SHA\` |"
echo "| Current :latest revision | \`$CURRENT_REVISION\` |"
echo "| GitHub compare status | \`behind\` |"
echo
echo "This guard catches the workflow_run-completion-order race (#2244):"
echo "two rapid main pushes whose E2Es complete out-of-order can otherwise"
echo "promote \`:latest\` backwards. \`:latest\` stays on \`${CURRENT_REVISION:0:7}\`."
echo
echo "**Recovery:** if this is a legitimate revert that should land on \`:latest\`,"
echo "manually dispatch this workflow with the target sha as input — the manual-dispatch"
echo "path skips the ancestry check (operator override)."
} >> "$GITHUB_STEP_SUMMARY"
exit 1
;;
diverged)
echo "decision=diverged" >> "$GITHUB_OUTPUT"
{
echo "## ❓ Auto-promote refused — history diverged"
echo
echo "| Field | Value |"
echo "|---|---|"
echo "| Target SHA | \`$TARGET_SHA\` |"
echo "| Current :latest revision | \`$CURRENT_REVISION\` |"
echo "| GitHub compare status | \`diverged\` |"
echo
echo "Likely cause: force-push rewrote main's history, leaving the previous"
echo "\`:latest\` revision orphaned. Needs human review before \`:latest\` advances."
} >> "$GITHUB_STEP_SUMMARY"
exit 1
;;
error|*)
echo "decision=error" >> "$GITHUB_OUTPUT"
{
echo "## ❌ Auto-promote aborted — ancestry-check API error"
echo
echo "\`gh api repos/${REPO}/compare/${CURRENT_REVISION}...${TARGET_SHA}\` returned unexpected status: \`$STATUS\`"
echo
echo "Manual dispatch with the target sha bypasses this check."
} >> "$GITHUB_STEP_SUMMARY"
exit 1
;;
esac
- name: Retag platform :staging-<sha> → :latest
if: steps.gate.outputs.proceed == 'true'
run: |
crane tag "${IMAGE_NAME}:staging-${{ steps.sha.outputs.short }}" latest
- name: Retag tenant :staging-<sha> → :latest
if: steps.gate.outputs.proceed == 'true'
run: |
crane tag "${TENANT_IMAGE_NAME}:staging-${{ steps.sha.outputs.short }}" latest
- name: Summary
if: steps.gate.outputs.proceed == 'true'
run: |
{
echo "## :latest promoted to ${{ steps.sha.outputs.short }}"
echo
if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
echo "- Trigger: manual dispatch"
else
echo "- Upstream: \`${{ github.event.workflow_run.name }}\` ([run](${{ github.event.workflow_run.html_url }}))"
fi
echo "- platform:staging-${{ steps.sha.outputs.short }} → :latest"
echo "- platform-tenant:staging-${{ steps.sha.outputs.short }} → :latest"
echo
echo "Tenant fleet auto-pulls within 5 min via IMAGE_AUTO_REFRESH=true."
echo "Force immediate fanout: dispatch redeploy-tenants-on-main.yml."
} >> "$GITHUB_STEP_SUMMARY"
-492
View File
@@ -1,492 +0,0 @@
name: Auto-promote staging → main
# Fires after any of the staging-branch quality gates complete. When ALL
# required gates are green on the same staging SHA, opens (or re-uses)
# a PR `staging → main` and schedules Gitea auto-merge so the PR lands
# automatically once approval + status checks are satisfied.
#
# ============================================================
# What this workflow does
# ============================================================
#
# 1. On a workflow_run completion event for one of the staging gate
# workflows (CI, E2E Staging Canvas, E2E API Smoke, CodeQL),
# checks if the combined status on the staging head SHA is green.
# 2. If green, opens (or re-uses) a PR `head: staging → base: main`
# via Gitea REST `POST /api/v1/repos/.../pulls`.
# 3. Schedules auto-merge via `POST /api/v1/repos/.../pulls/{index}/merge`
# with `merge_when_checks_succeed: true`. Gitea waits for the
# approval requirement on `main` (`required_approvals: 1`) and
# the status-check gates, then merges.
# 4. The merge commit lands on `main` and fires
# `publish-workspace-server-image.yml` naturally via its
# `on: push: branches: [main]` trigger — no explicit dispatch
# needed (see "Why no workflow_dispatch tail" below).
#
# `auto-sync-main-to-staging.yml` is the reverse-direction
# counterpart (main → staging, fast-forward push). Together they
# keep the staging-superset-of-main invariant tight.
#
# ============================================================
# Why Gitea REST (and not `gh pr create`)
# ============================================================
#
# Pre-2026-05-06 this workflow used `gh pr create`, `gh pr merge --auto`,
# `gh run list`, and `gh workflow run` against GitHub. After the
# GitHub→Gitea cutover those calls fail because:
#
# - `gh pr create / merge / view / list` route to GitHub GraphQL
# (`/api/graphql`). Gitea does not expose a GraphQL endpoint;
# every call returns `HTTP 405 Method Not Allowed` — same root
# cause as #65 (auto-sync) which PR #66 fixed by dropping `gh`
# entirely.
# - `gh run list --workflow=...` GitHub-shape; Gitea has the
# simpler `GET /repos/.../commits/{ref}/status` combined-status
# endpoint instead.
# - `gh workflow run X.yml` calls `POST /repos/.../actions/workflows/{id}/dispatches`,
# which does NOT exist on Gitea 1.22.6 (verified via swagger.v1.json).
#
# So this workflow uses direct `curl` calls to Gitea REST. No `gh`
# CLI dependency, no GraphQL, no missing-endpoint footgun.
#
# ============================================================
# Why no workflow_dispatch tail (was load-bearing on GitHub, dead on Gitea)
# ============================================================
#
# The GitHub-era version had a 60-line polling step that waited for
# the promote PR to merge, then explicitly dispatched
# `publish-workspace-server-image.yml` on `--ref main`. That step
# existed because GitHub's GITHUB_TOKEN-initiated merges suppress
# downstream `on: push` workflows (the documented "no recursion" rule
# — https://docs.github.com/en/actions/using-workflows/triggering-a-workflow#triggering-a-workflow-from-a-workflow).
# The explicit dispatch was the workaround.
#
# Gitea Actions does NOT have this no-recursion rule. PR #66's auto-
# sync merge to main fired `auto-promote-staging` on the next push
# trigger naturally. So the cascade fires on the natural push event;
# the explicit dispatch is dead code. (And even if we wanted to
# preserve it, Gitea has no `workflow_dispatch` REST endpoint.)
#
# Removed in this rewrite. If we ever observe the cascade misfire,
# operator can push an empty commit to `main` to wake it.
#
# ============================================================
# Why open a PR (and not direct push)
# ============================================================
#
# `main` branch protection has `enable_push: false` with NO
# `push_whitelist_usernames`. Direct push is impossible for any
# persona, including admins. PR-mediated merge is the only path,
# which is intentional: prod state mutations (and staging→main IS a
# prod mutation, since the next deploy fans out to tenants) require
# Hongming's approval per `feedback_prod_apply_needs_hongming_chat_go`.
#
# The auto-merge schedule preserves this gate: `merge_when_checks_succeed`
# does NOT bypass `required_approvals: 1`. Gitea waits for BOTH
# approval AND green checks before merging. Hongming reviews via the
# canvas/chat-handle of the PR notification, approves, and Gitea
# auto-merges within seconds.
#
# ============================================================
# Identity + token (anti-bot-ring per saved-memory
# `feedback_per_agent_gitea_identity_default`)
# ============================================================
#
# This workflow uses `secrets.AUTO_SYNC_TOKEN` — a personal access
# token issued to the `devops-engineer` Gitea persona. NOT the
# founder PAT. The bot-ring fingerprint that triggered the GitHub
# org suspension on 2026-05-06 was characterised by founder PAT
# acting as CI at machine speed.
#
# Token scope: `push: true` (read+write) on this repo. The persona
# can: open PRs, comment on PRs, schedule auto-merge. The persona
# CANNOT bypass main's branch protection (`required_approvals: 1`
# still applies — only Hongming's review unblocks merge).
#
# Authorship: the PR is opened by `devops-engineer`; the merge
# commit credits Hongming-as-approver and `devops-engineer` as
# the merger.
#
# ============================================================
# Failure modes & operational notes
# ============================================================
#
# A — staging gates not all green at trigger time:
# - The combined-status check returns `state: pending|failure`.
# Workflow exits 0 with a step-summary "not all green; staying
# on current main". Re-fires on the next gate completion.
#
# B — Gitea PR-create returns non-201 (e.g. 422 already-exists):
# - Idempotent: the workflow first GETs the existing open
# staging→main PR. If found, reuse it; if not, POST a new one.
# 422 should never surface; if it does (race), step summary
# captures the body and the next workflow_run picks up.
#
# C — `merge_when_checks_succeed` schedule fails:
# - 422 with "Pull request is not mergeable" if there are
# conflicts or stale base. Step summary surfaces it; operator
# (or `auto-sync-main-to-staging`) needs to bring staging up
# to date with main first. Workflow exits 1 to surface red.
#
# D — `AUTO_SYNC_TOKEN` rotated / wrong scope:
# - 401/403 on first REST call. Step summary surfaces it.
# Re-issue the token from `~/.molecule-ai/personas/` on the
# operator host and update the repo Actions secret.
#
# ============================================================
# Loop safety
# ============================================================
#
# When the promote PR merges to main, `auto-sync-main-to-staging.yml`
# fires (on:push:main) and pushes the merge commit back to staging.
# That push to staging is by `devops-engineer`, NOT this workflow's
# token, and triggers the staging gate workflows. When they all
# complete, we end up back here — but the tree-diff guard catches
# it: staging tree == main tree (the merge commit changes nothing),
# so we skip and the cycle terminates.
on:
workflow_run:
workflows:
- CI
- E2E Staging Canvas (Playwright)
- E2E API Smoke Test
- CodeQL
types: [completed]
workflow_dispatch:
inputs:
force:
description: "Force promote even when AUTO_PROMOTE_ENABLED is unset (manual override)"
required: false
default: "false"
permissions:
contents: read
pull-requests: write
# Serialize auto-promote runs. Multiple staging gate completions can land
# in quick succession (CI + E2E + CodeQL all finish within seconds of
# each other on a green PR) — without this, two parallel runs both:
# 1. Would race the GET-or-POST PR step.
# 2. Would both call merge-schedule (idempotent — fine on Gitea).
# cancel-in-progress: false because the second run on a fresh staging
# tip should NOT kill the first which has already opened the PR.
concurrency:
group: auto-promote-staging
cancel-in-progress: false
jobs:
check-all-gates-green:
# Only consider staging pushes. PRs into staging don't promote.
if: >
(github.event_name == 'workflow_run' &&
github.event.workflow_run.head_branch == 'staging' &&
github.event.workflow_run.event == 'push')
|| github.event_name == 'workflow_dispatch'
runs-on: ubuntu-latest
outputs:
all_green: ${{ steps.gates.outputs.all_green }}
head_sha: ${{ steps.gates.outputs.head_sha }}
steps:
# Skip empty-tree promotes (the perpetual auto-promote↔auto-sync
# cycle observed pre-cutover on GitHub). On Gitea the cycle shape
# is different (auto-sync uses fast-forward, no merge commit),
# but the tree-diff guard is cheap insurance and protects against
# any future merge-style regression.
- name: Checkout for tree-diff check
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
fetch-depth: 0
ref: staging
- name: Skip if staging tree == main tree (cycle-break safety)
id: tree-diff
env:
HEAD_SHA: ${{ github.event.workflow_run.head_sha || github.sha }}
run: |
set -eu
git fetch origin main --depth=50 || { echo "::warning::git fetch main failed — proceeding (fail-open)"; exit 0; }
if git diff --quiet origin/main "$HEAD_SHA" -- 2>/dev/null; then
{
echo "## Skipped — no code to promote"
echo
echo "staging tip (\`${HEAD_SHA:0:8}\`) and \`main\` have identical trees."
echo "Skipping to avoid opening an empty promote PR."
} >> "$GITHUB_STEP_SUMMARY"
echo "::notice::auto-promote: staging tree == main tree — no code to promote, skipping"
echo "skip=true" >> "$GITHUB_OUTPUT"
else
echo "skip=false" >> "$GITHUB_OUTPUT"
fi
- name: Check combined status on staging head
if: steps.tree-diff.outputs.skip != 'true'
id: gates
env:
GITEA_TOKEN: ${{ secrets.AUTO_SYNC_TOKEN }}
HEAD_SHA: ${{ github.event.workflow_run.head_sha || github.sha }}
REPO: ${{ github.repository }}
GITEA_HOST: ${{ vars.GITEA_HOST || 'https://git.moleculesai.app' }}
run: |
set -euo pipefail
# Gitea-native combined-status endpoint aggregates every
# check context attached to a SHA. This is structurally
# cleaner than the GitHub-era per-workflow `gh run list`
# loop because:
#
# 1. There's no risk of "workflow name collision" (the
# GitHub-era code had to switch from `--workflow=NAME`
# to `--workflow=FILE.YML` to disambiguate "CodeQL"
# between the explicit workflow and GitHub's UI-
# configured default setup; Gitea has no such
# duplicate-name surface).
# 2. Gitea's combined state already encodes the AND
# across all contexts: success only if EVERY context
# is success. Pending or failure on any context
# produces non-success state.
#
# See https://docs.gitea.com/api/1.22 for the schema —
# `state` is one of: success, pending, failure, error.
echo "head_sha=${HEAD_SHA}" >> "$GITHUB_OUTPUT"
echo "Checking combined status on SHA ${HEAD_SHA}"
# `set +o pipefail` for the http-code capture pattern; restore
# immediately. Pattern hardened per `feedback_curl_status_capture_pollution`.
BODY_FILE=$(mktemp)
set +e
STATUS=$(curl -sS \
-H "Authorization: token ${GITEA_TOKEN}" \
-H "Accept: application/json" \
-o "${BODY_FILE}" \
-w "%{http_code}" \
"${GITEA_HOST}/api/v1/repos/${REPO}/commits/${HEAD_SHA}/status")
CURL_RC=$?
set -e
if [ "${CURL_RC}" -ne 0 ] || [ "${STATUS}" != "200" ]; then
echo "::error::combined-status fetch failed: curl=${CURL_RC} http=${STATUS}"
cat "${BODY_FILE}" | head -c 500 || true
rm -f "${BODY_FILE}"
echo "all_green=false" >> "$GITHUB_OUTPUT"
exit 0
fi
STATE=$(jq -r '.state // "missing"' < "${BODY_FILE}")
TOTAL=$(jq -r '.total_count // 0' < "${BODY_FILE}")
rm -f "${BODY_FILE}"
echo "Combined status: state=${STATE} total_count=${TOTAL}"
if [ "${STATE}" = "success" ] && [ "${TOTAL}" -gt 0 ]; then
echo "all_green=true" >> "$GITHUB_OUTPUT"
echo "::notice::All gates green on ${HEAD_SHA} (${TOTAL} contexts)"
else
echo "all_green=false" >> "$GITHUB_OUTPUT"
{
echo "## Not promoting — combined status not green"
echo
echo "- SHA: \`${HEAD_SHA:0:8}\`"
echo "- Combined state: \`${STATE}\`"
echo "- Context count: ${TOTAL}"
echo
echo "Will re-fire on the next gate completion. Investigate any red gate via the Actions UI."
} >> "$GITHUB_STEP_SUMMARY"
echo "::notice::auto-promote: combined status is ${STATE} on ${HEAD_SHA} — staying on current main"
fi
promote:
needs: check-all-gates-green
if: needs.check-all-gates-green.outputs.all_green == 'true'
runs-on: ubuntu-latest
steps:
- name: Check rollout gate
env:
AUTO_PROMOTE_ENABLED: ${{ vars.AUTO_PROMOTE_ENABLED }}
FORCE_INPUT: ${{ github.event.inputs.force }}
run: |
set -eu
# Repo variable AUTO_PROMOTE_ENABLED=true flips this on. While
# it's unset, the workflow dry-runs (logs what it would have
# done) but doesn't open the promote PR. Set the variable in
# Settings → Actions → Variables.
if [ "${AUTO_PROMOTE_ENABLED:-}" != "true" ] && [ "${FORCE_INPUT:-false}" != "true" ]; then
{
echo "## Auto-promote disabled"
echo
echo "Repo variable \`AUTO_PROMOTE_ENABLED\` is not set to \`true\`."
echo "All gates are green on staging; would have opened a promote PR to \`main\`."
echo
echo "To enable: Settings → Actions → Variables → \`AUTO_PROMOTE_ENABLED=true\`."
echo "To test once manually: workflow_dispatch with \`force=true\`."
} >> "$GITHUB_STEP_SUMMARY"
echo "::notice::auto-promote disabled — dry run only"
exit 0
fi
- name: Open or reuse promote PR + schedule auto-merge
if: ${{ vars.AUTO_PROMOTE_ENABLED == 'true' || github.event.inputs.force == 'true' }}
env:
GITEA_TOKEN: ${{ secrets.AUTO_SYNC_TOKEN }}
REPO: ${{ github.repository }}
TARGET_SHA: ${{ needs.check-all-gates-green.outputs.head_sha }}
GITEA_HOST: ${{ vars.GITEA_HOST || 'https://git.moleculesai.app' }}
run: |
set -euo pipefail
API="${GITEA_HOST}/api/v1/repos/${REPO}"
AUTH=(-H "Authorization: token ${GITEA_TOKEN}" -H "Accept: application/json")
# http_status_get RESULT_VAR URL
# Sets RESULT_VAR to "<http_code>:<body_file>". Curl status
# capture pattern per `feedback_curl_status_capture_pollution`:
# http_code goes to its own tempfile-equivalent (-w), body to
# another tempfile, set +e/-e bracket protects pipeline state.
http_get() {
local body_file="$1"; shift
local url="$1"; shift
set +e
local code
code=$(curl -sS "${AUTH[@]}" -o "${body_file}" -w "%{http_code}" "${url}")
local rc=$?
set -e
if [ "${rc}" -ne 0 ]; then
echo "::error::curl GET failed (rc=${rc}) on ${url}"
return 99
fi
echo "${code}"
}
http_post_json() {
local body_file="$1"; shift
local data="$1"; shift
local url="$1"; shift
set +e
local code
code=$(curl -sS "${AUTH[@]}" -H "Content-Type: application/json" \
-X POST -d "${data}" -o "${body_file}" -w "%{http_code}" "${url}")
local rc=$?
set -e
if [ "${rc}" -ne 0 ]; then
echo "::error::curl POST failed (rc=${rc}) on ${url}"
return 99
fi
echo "${code}"
}
# Step 1: look for an existing open staging→main promote PR
# (idempotent on workflow re-run). Gitea doesn't have a
# head/base filter on the list endpoint that's as ergonomic
# as gh's, but the dedicated `/pulls/{base}/{head}` lookup
# works.
BODY=$(mktemp)
STATUS=$(http_get "${BODY}" "${API}/pulls/main/staging") || true
PR_NUM=""
if [ "${STATUS}" = "200" ]; then
STATE=$(jq -r '.state // "missing"' < "${BODY}")
if [ "${STATE}" = "open" ]; then
PR_NUM=$(jq -r '.number // ""' < "${BODY}")
echo "::notice::Re-using existing open promote PR #${PR_NUM}"
fi
fi
rm -f "${BODY}"
# Step 2: if no open PR, create one.
if [ -z "${PR_NUM}" ]; then
TITLE="staging → main: auto-promote ${TARGET_SHA:0:7}"
BODY_TEXT=$(cat <<EOFBODY
Automated promotion of \`staging\` (\`${TARGET_SHA:0:8}\`) to \`main\`. All required staging gates are green at this SHA (combined status reported success).
This PR is auto-generated by \`.github/workflows/auto-promote-staging.yml\` whenever every required gate completes green on the same staging SHA.
**Approval gate:** \`main\` branch protection requires 1 approval before this can land. Once approved, Gitea will auto-merge (the workflow scheduled \`merge_when_checks_succeed: true\` immediately after open).
The reverse-direction sync (the merge commit on \`main\` → \`staging\`) is handled automatically by \`auto-sync-main-to-staging.yml\` after this PR lands.
---
- Source: staging at \`${TARGET_SHA}\`
- Opened by: \`devops-engineer\` persona (anti-bot-ring; never founder PAT)
- Refs: #65, #73, #195
EOFBODY
)
REQ=$(jq -n \
--arg title "${TITLE}" \
--arg body "${BODY_TEXT}" \
--arg base "main" \
--arg head "staging" \
'{title:$title, body:$body, base:$base, head:$head}')
BODY=$(mktemp)
STATUS=$(http_post_json "${BODY}" "${REQ}" "${API}/pulls")
if [ "${STATUS}" = "201" ]; then
PR_NUM=$(jq -r '.number // ""' < "${BODY}")
echo "::notice::Opened promote PR #${PR_NUM}"
else
echo "::error::Failed to create promote PR: HTTP ${STATUS}"
jq -r '.message // .' < "${BODY}" | head -c 500
rm -f "${BODY}"
exit 1
fi
rm -f "${BODY}"
fi
# Step 3: schedule auto-merge. merge_when_checks_succeed
# tells Gitea to wait for both:
# - all required status checks to pass
# - the required-approvals gate (1 approval on main)
# before merging. On approval+green, Gitea merges within
# seconds. On any check failing or approval being denied,
# the schedule stays armed but doesn't fire.
#
# Idempotent: re-arming on an already-armed PR is a no-op.
REQ=$(jq -n '{Do:"merge", merge_when_checks_succeed:true}')
BODY=$(mktemp)
STATUS=$(http_post_json "${BODY}" "${REQ}" "${API}/pulls/${PR_NUM}/merge")
# Gitea returns:
# - 200/204 on successful immediate merge (gates already green AND approved)
# - 405 "Please try again later" when scheduled successfully but waiting
# - 422 on "Pull request is not mergeable" (conflict, stale base, etc.)
#
# 405 here is benign — Gitea's way of saying "scheduled, not merging now".
# We treat 200/204/405 as success, anything else as failure.
case "${STATUS}" in
200|204)
MERGE_OUTCOME="merged-immediately"
echo "::notice::Promote PR #${PR_NUM} merged immediately (gates+approval already green)"
;;
405)
MERGE_OUTCOME="auto-merge-scheduled"
echo "::notice::Promote PR #${PR_NUM}: auto-merge scheduled (Gitea will land on approval+green)"
;;
422)
MERGE_OUTCOME="not-mergeable"
echo "::warning::Promote PR #${PR_NUM}: not mergeable (conflict, stale base, or already merging)."
jq -r '.message // .' < "${BODY}" | head -c 500
;;
*)
echo "::error::Unexpected status ${STATUS} on merge schedule"
jq -r '.message // .' < "${BODY}" | head -c 500
rm -f "${BODY}"
exit 1
;;
esac
rm -f "${BODY}"
{
echo "## Auto-promote PR opened"
echo
echo "- Source: staging at \`${TARGET_SHA:0:8}\`"
echo "- PR: #${PR_NUM}"
echo "- Outcome: \`${MERGE_OUTCOME}\`"
echo
if [ "${MERGE_OUTCOME}" = "auto-merge-scheduled" ]; then
echo "Gitea will auto-merge once Hongming approves and all checks are green. No human action needed beyond approval."
elif [ "${MERGE_OUTCOME}" = "merged-immediately" ]; then
echo "Merged immediately. \`publish-workspace-server-image.yml\` will fire naturally on the resulting \`main\` push."
else
echo "PR is not auto-merging. Operator may need to bring staging up to date with main, then re-trigger this workflow via workflow_dispatch."
fi
} >> "$GITHUB_STEP_SUMMARY"
@@ -1,83 +0,0 @@
name: auto-promote-stale-alarm
# Hourly cron + on-demand alarm for the silent-block failure mode that
# motivated issue #2975:
# - The auto-promote-staging.yml workflow opened a PR + armed
# auto-merge, but main's branch protection requires a human review
# (reviewDecision=REVIEW_REQUIRED). The PR sat BLOCKED with no
# surface-up-the-stack for 12+ hours, holding 25 commits hostage
# including the Memory v2 redesign and a reno-stars data-loss fix.
#
# This workflow runs `scripts/check-stale-promote-pr.sh` against the
# repo's open auto-promote PRs (base=main head=staging). When a PR has
# been BLOCKED on REVIEW_REQUIRED for >4h, it:
# 1. Emits a workflow-level warning (visible in run summary + the
# Actions UI feed).
# 2. Posts a comment on the PR (idempotent — one alarm per PR).
#
# The detection logic lives in scripts/check-stale-promote-pr.sh so
# it's unit-testable with stubbed `gh` (see test-check-stale-promote-pr.sh).
# This file is the schedule + invocation surface only — SSOT for the
# detector itself.
on:
schedule:
# Hourly. Cheap (one `gh pr list` + jq), and 1h granularity is
# plenty for a 4h staleness threshold — operators see the alarm
# within at most 1h of crossing the threshold.
- cron: "27 * * * *" # at :27 to dodge the cron herd at :00
workflow_dispatch:
inputs:
stale_hours:
description: "Hours after which a BLOCKED+REVIEW_REQUIRED PR is stale (default 4)"
required: false
default: "4"
post_comment:
description: "Post a comment on stale PRs (default true)"
required: false
default: "true"
permissions:
contents: read
pull-requests: write # post comments on stale PRs
# Serialize so the on-demand and scheduled runs don't double-comment
# the same PR. cancel-in-progress=false because the script is idempotent
# (existing comment marker prevents dupes), but a scheduled run firing
# while a manual one runs would just re-list the same PR set.
concurrency:
group: auto-promote-stale-alarm
cancel-in-progress: false
jobs:
scan:
runs-on: ubuntu-latest
steps:
- name: Checkout (need scripts/ only)
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
sparse-checkout: |
scripts/check-stale-promote-pr.sh
sparse-checkout-cone-mode: false
- name: Run stale-PR detector
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
GITHUB_REPOSITORY: ${{ github.repository }}
STALE_HOURS: ${{ inputs.stale_hours || '4' }}
POST_COMMENT: ${{ inputs.post_comment || 'true' }}
run: |
# The script's exit code reflects the count of stale PRs.
# We don't want a stale finding to fail the workflow run —
# the warning + comment are the signal, the green/red is
# noise. So convert any non-zero exit to a workflow notice
# and exit 0.
set +e
bash scripts/check-stale-promote-pr.sh
rc=$?
set -e
if [ "$rc" -ne 0 ]; then
echo "::notice::Stale PR detector found $rc PR(s) needing attention. See warnings above + comments on the PRs."
fi
# Always succeed — operator-facing surface is the warning,
# not the workflow status.
exit 0
-404
View File
@@ -1,404 +0,0 @@
name: Auto-sync canary — AUTO_SYNC_TOKEN rotation drift
# Synthetic health check for the AUTO_SYNC_TOKEN secret consumed by
# auto-sync-main-to-staging.yml (PR #66) and publish-workspace-server-image.yml.
#
# ============================================================
# Why this workflow exists
# ============================================================
#
# PR #66 fixed auto-sync (replaced GitHub-era `gh pr create` — which
# 405s on Gitea's GraphQL endpoint — with a direct git push from the
# `devops-engineer` persona's `AUTO_SYNC_TOKEN`). Hostile self-review
# weakest spot #3 of that PR:
#
# "Token rotation silently breaks auto-sync. If AUTO_SYNC_TOKEN is
# rotated without updating the repo secret, every push to main
# fails red on the auto-sync push step. The workflow surfaces the
# failure mode in the step summary (failure mode B in the header),
# but there's no proactive monitoring."
#
# Detection latency under the status quo: rotation is only caught on
# the next push to `main`. During quiet periods (no main push for
# many hours) the staging-superset-of-main invariant silently breaks.
#
# This workflow closes the gap: every 6 hours, it fires the auth
# surface that auto-sync depends on and emits a red workflow status
# if AUTO_SYNC_TOKEN has drifted out of validity.
#
# ============================================================
# What this checks (Option B — read-only verify)
# ============================================================
#
# 1. `GET /api/v1/user` against Gitea with the token → validates the
# token authenticates AND resolves to `devops-engineer` (catches
# the case where the token was regenerated under a different
# persona by mistake).
# 2. `GET /api/v1/repos/molecule-ai/molecule-core` with the token →
# validates the token has `read:repository` scope on this repo
# (the v2 scope contract — see saved memory
# `reference_persona_token_v2_scope`).
# 3. `git push --dry-run` of the current staging SHA back to
# `refs/heads/staging` via `https://oauth2:<token>@<gitea>/...`
# → validates the EXACT HTTPS basic-auth path that
# `actions/checkout` + `git push origin staging` use inside
# auto-sync-main-to-staging.yml. NOP by construction (push the
# current tip to itself = "Everything up-to-date"); auth is
# checked at the smart-protocol handshake BEFORE the empty-diff
# computation, so bad token → exit 128 with "Authentication
# failed". `git ls-remote` is NOT used here because Gitea
# falls back to anonymous read on public repos and would
# silently green-light a rotated token.
#
# Each step exits non-zero with an actionable error message if it
# fails. The workflow status itself is the operator-facing surface.
#
# ============================================================
# What this does NOT check (intentional)
# ============================================================
#
# - **Branch-protection authz** (failure mode C in auto-sync header):
# would require an actual write to staging. Already monitored by
# `branch-protection-drift.yml` daily. Don't duplicate.
# - **Conflict resolution** (failure mode A): a real conflict is data-
# driven, not auth-driven; can't synthesise it without polluting
# staging. Already surfaces immediately on the next main push.
# - **Concurrency** (failure mode D): handled by workflow concurrency
# group on auto-sync, not a credential issue.
#
# ============================================================
# Why Option B (read-only) and not the alternatives
# ============================================================
#
# Considered + rejected (see issue #72 for full write-up):
#
# - **Option A — full auto-sync on schedule**: every run creates a
# no-op merge commit on staging when main hasn't advanced. 4 noise
# commits/day. And races the real `push:` trigger when main has
# advanced. Rejected.
#
# - **Option C — push to dedicated `auto-sync-canary` branch**: would
# exercise authz too, but adds branch noise on Gitea AND requires
# maintaining a second branch protection (or expanding staging's
# whitelist to a junk branch). Authz already covered by
# `branch-protection-drift.yml`. Rejected.
#
# Prior art for the chosen Option B shape:
# - Cloudflare's `/user/tokens/verify` endpoint (read-only auth
# probe explicitly designed for credential canaries).
# - AWS Secrets Manager rotation Lambda's `testSecret` step (auth
# probe before promoting AWSPENDING → AWSCURRENT).
# - HashiCorp Vault's `vault token lookup` for renewal canaries.
#
# ============================================================
# Operator runbook — what to do when this workflow goes RED
# ============================================================
#
# 1. **Identify which step failed**:
# - Step "Verify token authenticates as devops-engineer" red →
# token is invalid OR resolves to wrong persona.
# - Step "Verify token has repo read scope" red → token valid but
# stripped of `read:repository` scope (or repo perms changed).
# - Step "Verify git HTTPS auth path via no-op dry-run push to
# staging" red → token rotated/revoked OR Gitea git-HTTPS
# surface is broken (rare). Auth check happens on the
# smart-protocol handshake, separate from the API path.
#
# 2. **Re-issue the token** on the operator host:
# ```
# ssh root@5.78.80.188 'docker exec --user git molecule-gitea-1 \
# gitea admin user generate-access-token \
# --username devops-engineer \
# --token-name persona-devops-engineer-vN \
# --scopes "read:repository,write:repository,read:user,read:organization,read:issue,write:issue,read:notification,read:misc"'
# ```
# Update `/etc/molecule-bootstrap/agent-secrets.env` in place
# (per `feedback_unified_credentials_file`). The previous token
# file lands at `.bak.<date>`.
#
# 3. **Update the repo Actions secret** at:
# Settings → Secrets and variables → Actions → AUTO_SYNC_TOKEN
# Paste the new token. (Don't echo it in chat — but per
# `feedback_passwords_in_chat_are_burned`, a paste in a 1:1
# Claude session is within trust boundary.)
#
# 4. **Re-run this canary** via workflow_dispatch. Confirm GREEN.
#
# 5. **Backfill any missed main → staging syncs** by re-running
# `auto-sync-main-to-staging.yml` from its workflow_dispatch
# surface, OR by pushing an empty commit to main (if you'd
# rather force a real trigger).
#
# ============================================================
# Security notes
# ============================================================
#
# - Token usage: read-only (`GET /api/v1/user`, `GET /api/v1/repos/...`,
# `git ls-remote`). No write paths. Same blast-radius profile as
# `actions/checkout` on a public repo.
# - The token NEVER appears in logs: every `curl` uses a header
# variable, never inline; the `git ls-remote` URL builds the
# `oauth2:$TOKEN@host` form into a single env var that's not
# echoed. GitHub Actions secret-masking covers anything that does
# slip through.
# - No new token introduced — same `AUTO_SYNC_TOKEN` the workflow
# under monitor uses. Per least-privilege we deliberately do NOT
# broaden scope for the canary.
on:
schedule:
# Every 6 hours at :17 (offsets the cron herd at :00). Justification
# from issue #72: cheap to run (~5s wall-clock, no quota), 3h average
# detection latency, 6h max. 1h would be 24× the runs for marginal
# benefit; daily would be 6× longer latency and worse than status
# quo on a quiet-main day.
- cron: '17 */6 * * *'
workflow_dispatch:
# No concurrency group needed — the canary is read-only and idempotent.
# Two parallel runs (e.g. operator dispatch during a scheduled tick) are
# harmless: same result, doubled HTTPS calls, no shared state.
permissions:
contents: read
jobs:
verify-token:
name: Verify AUTO_SYNC_TOKEN validity
runs-on: ubuntu-latest
# 2 min surfaces hangs (Gitea API stall, DNS issue) within one
# cron interval. Realistic worst case is ~10s: 2 curls + 1 git
# ls-remote, each capped by the explicit timeouts below.
timeout-minutes: 2
env:
# Pinned in env so individual steps can read it without
# repeating the secret reference. GitHub masks the value in
# logs automatically.
AUTO_SYNC_TOKEN: ${{ secrets.AUTO_SYNC_TOKEN }}
# MUST stay in sync with auto-sync-main-to-staging.yml's
# `git config user.name "devops-engineer"` line. Renaming the
# devops-engineer persona requires updating both files (and
# the staging branch protection's `push_whitelist_usernames`).
EXPECTED_PERSONA: devops-engineer
GITEA_HOST: git.moleculesai.app
REPO_PATH: molecule-ai/molecule-core
steps:
- name: Verify AUTO_SYNC_TOKEN secret is configured
# Schedule-vs-dispatch behaviour split, per
# `feedback_schedule_vs_dispatch_secrets_hardening`:
#
# - schedule: hard-fail when the secret is missing. The
# whole point of the canary is to surface drift; soft-
# skipping on missing-secret would make the canary
# itself drift-invisible (sweep-cf-orphans #2088 lesson).
# - workflow_dispatch: hard-fail too — there's no scenario
# where an operator wants this canary to silently no-op.
# The workflow has no other ad-hoc utility; if you ran
# it, you wanted the answer.
run: |
if [ -z "${AUTO_SYNC_TOKEN}" ]; then
echo "::error::AUTO_SYNC_TOKEN secret is not set on this repo." >&2
echo "::error::Set it at Settings → Secrets and variables → Actions." >&2
echo "::error::Without it, auto-sync-main-to-staging.yml will fail every push to main." >&2
exit 1
fi
echo "AUTO_SYNC_TOKEN is configured (value masked)."
- name: Verify token authenticates as ${{ env.EXPECTED_PERSONA }}
# Calls Gitea's `/api/v1/user` — the canonical
# auth-probe-with-no-side-effects endpoint (mirrors
# Cloudflare's /user/tokens/verify).
#
# Failure surfaces:
# - HTTP 401: token invalid (rotated, revoked, or never
# correctly registered).
# - HTTP 200 but username != devops-engineer: token was
# regenerated under the wrong persona — this would let
# auth pass but commit attribution would be wrong, and
# branch-protection authz would fail because only
# `devops-engineer` is whitelisted.
run: |
set -euo pipefail
response_file="$(mktemp)"
code_file="$(mktemp)"
# `--max-time 30`: full call ceiling. `--connect-timeout 10`:
# DNS + TCP. `-w "%{http_code}"` routed to a tempfile so curl's
# exit code can't pollute the captured status — see
# feedback_curl_status_capture_pollution + the
# `lint-curl-status-capture.yml` gate that rejects the unsafe
# `$(curl ... || echo "000")` shape.
set +e
curl -sS -o "$response_file" \
--max-time 30 --connect-timeout 10 \
-w "%{http_code}" \
-H "Authorization: token ${AUTO_SYNC_TOKEN}" \
-H "Accept: application/json" \
"https://${GITEA_HOST}/api/v1/user" >"$code_file" 2>/dev/null
set -e
status=$(cat "$code_file" 2>/dev/null || true)
[ -z "$status" ] && status="000"
if [ "$status" != "200" ]; then
echo "::error::Token rotation suspected: GET /api/v1/user returned HTTP $status (expected 200)." >&2
echo "::error::Likely cause: AUTO_SYNC_TOKEN has been rotated/revoked on Gitea but the repo Actions secret was not updated." >&2
echo "::error::Runbook: see header comment of this workflow file." >&2
# Print response body but redact anything that looks like a token.
sed -E 's/[A-Fa-f0-9]{32,}/<redacted>/g' "$response_file" >&2 || true
exit 1
fi
username=$(python3 -c "import json,sys; print(json.load(open(sys.argv[1])).get('login',''))" "$response_file")
if [ "$username" != "${EXPECTED_PERSONA}" ]; then
echo "::error::Token resolves to user '$username', expected '${EXPECTED_PERSONA}'." >&2
echo "::error::AUTO_SYNC_TOKEN must be the devops-engineer persona PAT (not founder PAT, not another persona)." >&2
echo "::error::Auto-sync push will fail because only 'devops-engineer' is whitelisted on staging branch protection." >&2
exit 1
fi
echo "Token authenticates as: $username ✓"
- name: Verify token has repo read scope
# `GET /api/v1/repos/<owner>/<repo>` requires `read:repository`
# on the persona's v2 scope contract. If the scope was
# narrowed/dropped on rotation we catch it here, before the
# next main push reveals it via a checkout failure.
run: |
set -euo pipefail
response_file="$(mktemp)"
code_file="$(mktemp)"
# See first probe step for the rationale on the tempfile-routed
# `-w "%{http_code}"` pattern — the unsafe `|| echo "000"` shape
# is rejected by lint-curl-status-capture.yml.
set +e
curl -sS -o "$response_file" \
--max-time 30 --connect-timeout 10 \
-w "%{http_code}" \
-H "Authorization: token ${AUTO_SYNC_TOKEN}" \
-H "Accept: application/json" \
"https://${GITEA_HOST}/api/v1/repos/${REPO_PATH}" >"$code_file" 2>/dev/null
set -e
status=$(cat "$code_file" 2>/dev/null || true)
[ -z "$status" ] && status="000"
if [ "$status" != "200" ]; then
echo "::error::Token lacks read:repository scope on ${REPO_PATH}: HTTP $status." >&2
echo "::error::Auto-sync's actions/checkout step will fail with this token." >&2
echo "::error::Re-issue with v2 scope contract: read:repository,write:repository,read:user,read:organization,read:issue,write:issue,read:notification,read:misc" >&2
sed -E 's/[A-Fa-f0-9]{32,}/<redacted>/g' "$response_file" >&2 || true
exit 1
fi
echo "Token has read:repository on ${REPO_PATH} ✓"
- name: Verify git HTTPS auth path via no-op dry-run push to staging
# Final probe: exercise the EXACT auth path that
# `actions/checkout` + `git push origin staging` use in
# auto-sync-main-to-staging.yml. Gitea's API and git-HTTPS
# surfaces share the token-lookup code path internally but
# the wire-level error shapes differ — historically (#173)
# the API path was healthy while git-HTTPS rejected, so
# checking only the API would have given false-green.
#
# IMPORTANT: `git ls-remote` on a public repo (which
# molecule-core is) succeeds even with a junk token because
# Gitea falls back to anonymous-read. `ls-remote` therefore
# CANNOT validate auth on this surface. We use
# `git push --dry-run` instead — push is auth-gated even on
# public repos.
#
# NOP shape: read the current staging SHA via authenticated
# ls-remote (the SHA itself is public; auth is incidental
# here, used only to colocate the discovery in one step), then
# `git push --dry-run <SHA>:refs/heads/staging`. Pushing the
# current tip back to itself is "Everything up-to-date" with
# exit 0 when auth succeeds. With a bad token Gitea returns
# HTTP 401 in the smart-protocol handshake and git exits 128
# with "Authentication failed".
#
# The dry-run never reaches Gitea's pre-receive hook (which
# is where branch-protection authz runs), so this probe does
# not validate failure mode C. That's intentional —
# branch-protection-drift.yml owns authz monitoring; this
# canary owns auth.
env:
# Don't hang waiting for password prompt if auth fails on a
# terminal-attached run. (In Actions there's no terminal,
# but the env-var hardens against an interactive runner
# config.)
GIT_TERMINAL_PROMPT: "0"
run: |
set -euo pipefail
# Token is in $AUTO_SYNC_TOKEN (job-level env). Compose the
# URL as a local var that's never echoed.
url="https://oauth2:${AUTO_SYNC_TOKEN}@${GITEA_HOST}/${REPO_PATH}"
# Step a: read current staging SHA. ~1KB; auth-gated only
# on private repos but always works on public — used here
# only to discover the SHA, not to validate auth.
staging_ref=$(timeout 30s git ls-remote --refs "$url" refs/heads/staging 2>&1) || {
redacted=$(echo "$staging_ref" | sed -E "s|oauth2:[^@]+@|oauth2:<redacted>@|g")
echo "::error::ls-remote against staging failed (network/DNS issue):" >&2
echo "$redacted" >&2
exit 1
}
if ! echo "$staging_ref" | grep -qE '^[0-9a-f]{40}[[:space:]]+refs/heads/staging$'; then
echo "::error::ls-remote returned unexpected shape:" >&2
echo "$staging_ref" | sed -E "s|oauth2:[^@]+@|oauth2:<redacted>@|g" >&2
exit 1
fi
staging_sha=$(echo "$staging_ref" | awk '{print $1}')
# Step b: spin up an ephemeral local repo. `git push` always
# requires a local repo even when pushing a remote SHA that
# isn't in the local object DB (the protocol negotiates and
# discovers we don't need to send any objects). We don't use
# `actions/checkout` for this — it would clone the whole
# repo (~hundreds of MB) for what's essentially `git init`.
tmp_repo="$(mktemp -d)"
trap 'rm -rf "$tmp_repo"' EXIT
git -C "$tmp_repo" init -q
# Author config required for any git operation; values are
# arbitrary because nothing gets committed here.
git -C "$tmp_repo" config user.email canary@auto-sync.local
git -C "$tmp_repo" config user.name auto-sync-canary
# Step c: dry-run push the current staging SHA back to
# staging. NOP by construction — the remote tip equals the
# SHA we're pushing, so "Everything up-to-date" is the
# success path.
#
# Authentication is checked at the smart-protocol handshake,
# BEFORE the dry-run can compute an empty diff. Bad token
# → "Authentication failed", exit 128. Good token → exit 0.
set +e
push_out=$(timeout 30s git -C "$tmp_repo" push --dry-run "$url" "${staging_sha}:refs/heads/staging" 2>&1)
push_rc=$?
set -e
if [ "$push_rc" -ne 0 ]; then
redacted=$(echo "$push_out" | sed -E "s|oauth2:[^@]+@|oauth2:<redacted>@|g")
echo "::error::Token rotation suspected: git push --dry-run against staging failed via the AUTO_SYNC_TOKEN HTTPS auth path (exit $push_rc)." >&2
echo "::error::This is the EXACT auth path that actions/checkout + git push use in auto-sync-main-to-staging.yml." >&2
echo "::error::Likely cause: AUTO_SYNC_TOKEN was rotated/revoked on Gitea but the repo Actions secret was not updated. Runbook: see header." >&2
echo "$redacted" >&2
exit 1
fi
echo "git HTTPS auth path: NOP push --dry-run to staging → ${staging_sha:0:8} ✓"
- name: Summarise canary result
# Everything passed — surface a green summary. (Failures
# already wrote ::error:: lines and exited above; if we got
# here, all three probes passed.)
run: |
{
echo "## Auto-sync canary: GREEN"
echo ""
echo "AUTO_SYNC_TOKEN is healthy:"
echo "- Authenticates as \`${EXPECTED_PERSONA}\` ✓"
echo "- Has \`read:repository\` scope on \`${REPO_PATH}\` ✓"
echo "- Git HTTPS auth path: no-op dry-run push to \`refs/heads/staging\` succeeds ✓"
echo ""
echo "Auto-sync main → staging will succeed on the next push to main."
echo "If this canary ever goes RED, see the runbook in this workflow's header."
} >> "$GITHUB_STEP_SUMMARY"
@@ -1,255 +0,0 @@
name: Auto-sync main → staging
# Reflects every push to `main` back onto `staging` so the
# staging-as-superset-of-main invariant holds.
#
# ============================================================
# What this workflow does
# ============================================================
#
# On every push to `main`:
# 1. Checks if staging already contains main → no-op.
# 2. Fetches both branches, merges main into staging in the
# runner workspace (fast-forward if possible, else
# `--no-ff` merge commit).
# 3. Pushes staging directly to origin via the
# `devops-engineer` persona's `AUTO_SYNC_TOKEN`.
#
# Authoritative path: a single `git push origin staging` from
# inside this workflow is the SSOT for advancing staging after
# a main push. No PR, no merge queue, no human approval —
# staging is mechanically maintained as a superset of main.
#
# `auto-promote-staging.yml` is the reverse-direction
# counterpart (staging → main, gated on green CI). Together
# they keep the staging-superset-of-main invariant tight.
#
# ============================================================
# Why direct push (and not "open a PR")
# ============================================================
#
# Pre-2026-05-06 the canonical SCM was GitHub.com, where:
# - The `staging` branch had a `merge_queue` ruleset that
# blocked ALL direct pushes (no bypass even for org
# admins or the GitHub Actions integration).
# - Therefore this workflow opened a PR via `gh pr create`
# and let auto-merge land it through the queue.
#
# Post-2026-05-06 the canonical SCM is Gitea
# (`git.moleculesai.app/molecule-ai/molecule-core`). Gitea:
# - Has no `merge_queue` concept.
# - Allows direct push to protected branches via per-user
# `push_whitelist_usernames` on the branch protection.
# - Does not expose a GraphQL endpoint, so `gh pr create`
# returns `HTTP 405 Method Not Allowed
# (https://git.moleculesai.app/api/graphql)` — the
# pre-suspension architecture cannot work on Gitea.
#
# The molecule-ai/molecule-core staging branch protection
# (verified via `GET /api/v1/repos/.../branch_protections`)
# whitelists `devops-engineer` for direct push. So the
# correct Gitea-shape architecture is: authenticate as
# `devops-engineer`, merge locally, push staging directly.
#
# This is structurally simpler than the GitHub-era PR dance
# and removes the dependence on `gh` CLI / GraphQL entirely.
#
# ============================================================
# Identity + token (anti-bot-ring per saved-memory
# `feedback_per_agent_gitea_identity_default`)
# ============================================================
#
# This workflow uses `secrets.AUTO_SYNC_TOKEN`, which is a
# personal access token issued to the `devops-engineer`
# persona on Gitea — NOT the founder PAT. The bot-ring
# fingerprint that triggered the GitHub org suspension on
# 2026-05-06 was characterised by founder PAT acting as CI
# at machine speed; per-persona identities split the
# attribution honestly.
#
# Token scope on Gitea: repo write. Push target restricted
# to `staging` (this workflow is the only writer; main is
# untouched). Compromise blast radius: bounded to staging
# branch + this repo's read surface.
#
# Commits are authored by the persona email
# `devops-engineer@agents.moleculesai.app` so commit history
# reflects which automation produced the merge.
#
# ============================================================
# Failure modes & operational notes
# ============================================================
#
# A — staging has commits main doesn't, and the merge
# conflicts:
# - The `--no-ff` merge step exits non-zero. Workflow
# fails red. Operator (devops-engineer or human)
# resolves manually:
# git fetch origin
# git checkout staging
# git merge --no-ff origin/main
# # resolve conflicts
# git push origin staging
# - Step summary surfaces the conflict so the failed run
# is self-explanatory.
#
# B — `AUTO_SYNC_TOKEN` rotated / wrong scope:
# - `git push` step exits non-zero with `HTTP 401` /
# `403`. Step summary surfaces the failed push.
# - Re-issue the token from `~/.molecule-ai/personas/`
# on the operator host and update the repo Actions
# secret. Re-run the workflow.
#
# C — staging branch protection no longer whitelists
# `devops-engineer`:
# - `git push` exits non-zero with a Gitea protected-
# branch rejection. Step summary surfaces it.
# - Re-add `devops-engineer` to
# `push_whitelist_usernames` on the staging
# protection (Settings → Branches → staging).
#
# D — concurrent push to main while a sync is in flight:
# - The `concurrency` group below serialises runs.
# The second waits for the first; if main advances
# again while we're syncing, the second run picks
# up the new tip on its own fetch.
#
# ============================================================
# Loop safety
# ============================================================
#
# The push to staging from this workflow does NOT itself
# fire a `push: branches: [main]` event (different branch),
# so there's no risk of self-recursion. `auto-promote-staging.yml`
# fires on `workflow_run` of CI etc. — it sees the new
# staging tip on its next gate-completion event, NOT on this
# push directly. No loop.
on:
push:
branches: [main]
# workflow_dispatch lets operators manually backfill a
# missed sync (e.g. if AUTO_SYNC_TOKEN was rotated and a
# main push slipped through while the secret was stale).
workflow_dispatch:
permissions:
contents: write
concurrency:
group: auto-sync-main-to-staging
cancel-in-progress: false
jobs:
sync-staging:
runs-on: ubuntu-latest
steps:
- name: Checkout staging (with devops-engineer push token)
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
fetch-depth: 0
ref: staging
# AUTO_SYNC_TOKEN authenticates as the
# `devops-engineer` Gitea persona — the only
# identity whitelisted for direct push to
# staging. See header comment for context.
token: ${{ secrets.AUTO_SYNC_TOKEN }}
- name: Configure git author
run: |
# Per-persona identity, NOT founder PAT.
# `feedback_per_agent_gitea_identity_default`.
git config user.name "devops-engineer"
git config user.email "devops-engineer@agents.moleculesai.app"
- name: Check if staging already contains main
id: check
run: |
set -euo pipefail
git fetch origin main
if git merge-base --is-ancestor origin/main HEAD; then
echo "needs_sync=false" >> "$GITHUB_OUTPUT"
{
echo "## No-op"
echo
echo "staging already contains \`origin/main\` ($(git rev-parse --short=8 origin/main))."
} >> "$GITHUB_STEP_SUMMARY"
else
echo "needs_sync=true" >> "$GITHUB_OUTPUT"
MAIN_SHORT=$(git rev-parse --short=8 origin/main)
echo "main_short=${MAIN_SHORT}" >> "$GITHUB_OUTPUT"
echo "::notice::staging is missing main's tip (${MAIN_SHORT}) — merging in-runner and pushing"
fi
- name: Merge main into staging (in-runner)
if: steps.check.outputs.needs_sync == 'true'
id: merge
run: |
set -euo pipefail
# Already on staging from checkout. Try fast-forward
# first (cleanest history); fall back to merge commit
# if staging has commits main doesn't.
if git merge --ff-only origin/main; then
echo "did_ff=true" >> "$GITHUB_OUTPUT"
echo "::notice::Fast-forwarded staging to origin/main"
else
echo "did_ff=false" >> "$GITHUB_OUTPUT"
if ! git merge --no-ff origin/main \
-m "chore: sync main → staging (auto, ${{ steps.check.outputs.main_short }})"; then
# Hygiene: leave the work tree clean before failing.
git merge --abort || true
{
echo "## Conflict"
echo
echo "Auto-merge \`main → staging\` failed with conflicts."
echo "A human (or devops-engineer persona) needs to resolve manually:"
echo
echo '```'
echo "git fetch origin"
echo "git checkout staging"
echo "git merge --no-ff origin/main"
echo "# resolve conflicts"
echo "git push origin staging"
echo '```'
} >> "$GITHUB_STEP_SUMMARY"
exit 1
fi
fi
- name: Push staging to origin
if: steps.check.outputs.needs_sync == 'true'
run: |
set -euo pipefail
# Direct push to staging. devops-engineer persona is
# whitelisted for direct push on the staging branch
# protection (Settings → Branches → staging).
#
# No --force / --force-with-lease: a fast-forward or
# legitimate merge commit on top of current staging
# is the only thing we'd ever push. If origin/staging
# advanced under us (concurrent merge), the push
# legitimately rejects and the next run picks up the
# new state.
if ! git push origin staging; then
{
echo "## Push rejected"
echo
echo "Direct push to \`staging\` failed. Likely causes:"
echo "- \`AUTO_SYNC_TOKEN\` rotated / wrong scope (HTTP 401/403)"
echo "- \`devops-engineer\` no longer in"
echo " \`push_whitelist_usernames\` on the staging"
echo " branch protection (HTTP 422)"
echo "- staging advanced concurrently — re-running this"
echo " workflow on the new main tip will pick it up"
} >> "$GITHUB_STEP_SUMMARY"
exit 1
fi
{
echo "## Auto-sync succeeded"
echo
echo "- staging advanced to: \`$(git rev-parse --short=8 HEAD)\`"
echo "- main tip: \`${{ steps.check.outputs.main_short }}\`"
echo "- Strategy: $([ "${{ steps.merge.outputs.did_ff }}" = "true" ] && echo "fast-forward" || echo "merge commit")"
echo "- Pushed by: \`devops-engineer\` (per-agent persona, anti-bot-ring)"
} >> "$GITHUB_STEP_SUMMARY"
+2 -2
View File
@@ -22,9 +22,9 @@ on:
# spending CI cycles. See e2e-api.yml for the rationale on why this
# is a single job rather than two-jobs-sharing-name.
push:
branches: [main, staging]
branches: [main]
pull_request:
branches: [main, staging]
branches: [main]
workflow_dispatch:
schedule:
# Weekly on Sunday 08:00 UTC — catches Chrome / Playwright / Next.js
+2 -2
View File
@@ -32,7 +32,7 @@ name: E2E Staging External Runtime
on:
push:
branches: [staging, main]
branches: [main]
paths:
- 'workspace-server/internal/handlers/workspace.go'
- 'workspace-server/internal/handlers/registry.go'
@@ -44,7 +44,7 @@ on:
- 'tests/e2e/test_staging_external_runtime.sh'
- '.github/workflows/e2e-staging-external.yml'
pull_request:
branches: [staging, main]
branches: [main]
paths:
- 'workspace-server/internal/handlers/workspace.go'
- 'workspace-server/internal/handlers/registry.go'
+6 -7
View File
@@ -20,13 +20,12 @@ name: E2E Staging SaaS (full lifecycle)
# via the same paths watcher that e2e-api.yml uses)
on:
# Fire on staging push too — previously this only ran on main, which
# meant the most thorough end-to-end test caught regressions AFTER
# they shipped to staging (and then to the auto-promote PR). Running
# on staging push catches them BEFORE the staging→main promotion
# opens, so a green canary into auto-promote is more meaningful.
# Trunk-based (Phase 3 of internal#81): main is the only branch.
# Previously this fired on staging push too because staging was a
# superset of main and ran the gate ahead of auto-promote; with no
# staging branch, main is where E2E gates the deploy.
push:
branches: [staging, main]
branches: [main]
paths:
- 'workspace-server/internal/handlers/registry.go'
- 'workspace-server/internal/handlers/workspace_provision.go'
@@ -36,7 +35,7 @@ on:
- 'tests/e2e/test_staging_full_saas.sh'
- '.github/workflows/e2e-staging-saas.yml'
pull_request:
branches: [staging, main]
branches: [main]
paths:
- 'workspace-server/internal/handlers/registry.go'
- 'workspace-server/internal/handlers/workspace_provision.go'
@@ -36,7 +36,7 @@ on:
workflow_run:
workflows: ['publish-workspace-server-image']
types: [completed]
branches: [staging]
branches: [main]
workflow_dispatch:
inputs:
target_tag:
@@ -1,276 +0,0 @@
name: Retarget main PRs to staging
# Mechanical enforcement of SHARED_RULES rule 8 ("Staging-first
# workflow, no exceptions"). When a bot opens a PR against `main`,
# retarget it to `staging` automatically and leave an explanatory
# comment. Human / CEO-authored PRs (the staging→main promotion
# PRs, etc.) are left alone — they're the authorised exception
# to the rule.
#
# ============================================================
# What this workflow does
# ============================================================
#
# On `pull_request_target` opened/reopened against `main`:
# 1. If the PR head is `staging`, skip (the auto-promote PRs
# MUST stay base=main).
# 2. If the PR author is a bot, retarget the PR base to
# `staging` via Gitea REST `PATCH /pulls/{N}` body
# `{"base":"staging"}`.
# 3. If the retarget returns 422 "pull request already exists
# for base branch 'staging'" (issue #1884 case: another PR
# on the same head already targets staging), close the
# now-redundant main-PR via Gitea REST instead of failing
# red.
# 4. Post an explainer comment on the retargeted PR via
# Gitea REST `POST /issues/{N}/comments`.
#
# ============================================================
# Why Gitea REST (and not `gh api / gh pr close / gh pr comment`)
# ============================================================
#
# Pre-2026-05-06 this workflow used `gh api -X PATCH "repos/{owner}/{repo}/pulls/{N}" -f base=staging`
# plus `gh pr close` and `gh pr comment`. After the GitHub→Gitea
# cutover those calls fail because:
#
# - `gh` CLI defaults to `api.github.com`. Even with `GH_HOST`
# pointing at Gitea, `gh pr close / comment` route through
# GraphQL (`/api/graphql`) which Gitea does not expose.
# Empirical: every `gh pr *` call returns
# `HTTP 405 Method Not Allowed (https://git.moleculesai.app/api/graphql)`
# — same root cause as #65 (auto-sync, fixed in PR #66) and
# #73/#195 (auto-promote, fixed in PR #78).
# - `gh api -X PATCH /pulls/{N}` happens to use a REST path
# that Gitea also has, but the `gh` host-resolution layer
# and pagination/retry logic don't always hit Gitea cleanly,
# and the cost of switching to direct `curl` is one extra
# line of code.
#
# So this workflow uses direct `curl` calls to Gitea REST. No
# `gh` CLI dependency, no GraphQL, no flaky host-resolution.
#
# ============================================================
# Identity + token (anti-bot-ring per saved-memory
# `feedback_per_agent_gitea_identity_default`)
# ============================================================
#
# Pre-fix this workflow used the per-job ephemeral
# `secrets.GITHUB_TOKEN`. On Gitea Actions that token has
# narrow scope and unpredictable cross-PR write capability.
#
# Post-fix: `secrets.AUTO_SYNC_TOKEN` (the `devops-engineer`
# Gitea persona). Same persona used by `auto-sync-main-to-staging.yml`
# (PR #66) and `auto-promote-staging.yml` (PR #78). Token scope:
# `push: true` repo write, sufficient for PR-edit + close + comment.
#
# Why this token does NOT need branch-protection bypass:
# patching a PR's base ref is a PR-level operation that does not
# require push perms on either branch (the PR's own commits stay
# put; only the metadata changes).
#
# ============================================================
# Failure modes & operational notes
# ============================================================
#
# A — PATCH base→staging returns 422 "pull request already exists"
# (issue #1884 case):
# - Detected by string-match on response body. Workflow
# falls through to closing the now-redundant main-PR
# (Gitea REST `PATCH /pulls/{N}` with `state: closed`)
# and posts an explanation comment. Step summary surfaces.
#
# B — `AUTO_SYNC_TOKEN` rotated / wrong scope:
# - First REST call returns 401/403. Step summary surfaces.
# Re-issue token from `~/.molecule-ai/personas/` on the
# operator host and update repo Actions secret.
#
# C — PR was deleted between trigger and run:
# - REST call returns 404. Workflow exits 0 with a notice
# (the rule was already enforced or the PR is gone).
#
# D — author is not actually a bot but the filter mis-fires:
# - Filter is conservative: only triggers on
# `user.type == 'Bot'`, `login` ends with `[bot]`, or
# known bot logins (`molecule-ai[bot]`, `app/molecule-ai`).
# Human PRs slip through unaffected. If a NEW bot login
# starts shipping main-PRs, add it to the filter.
on:
pull_request_target:
types: [opened, reopened]
branches: [main]
permissions:
pull-requests: write
jobs:
retarget:
name: Retarget to staging
runs-on: ubuntu-latest
# Only fire for bot-authored PRs. Human CEO PRs (staging→main
# promotion) are intentional and pass through.
#
# Head-ref guard: never retarget a PR whose head IS `staging`
# — those are the auto-promote staging→main PRs (opened by
# `devops-engineer` since PR #78 / #195 fix). Retargeting
# head=staging onto base=staging fails with HTTP 422 "no new
# commits between base 'staging' and head 'staging'", which
# would surface as a noisy red workflow run on every
# auto-promote (caught 2026-05-03 on the GitHub-era PR #2588).
if: >-
github.event.pull_request.head.ref != 'staging'
&& (
github.event.pull_request.user.type == 'Bot'
|| endsWith(github.event.pull_request.user.login, '[bot]')
|| github.event.pull_request.user.login == 'app/molecule-ai'
|| github.event.pull_request.user.login == 'molecule-ai[bot]'
|| github.event.pull_request.user.login == 'devops-engineer'
)
steps:
- name: Retarget PR base to staging via Gitea REST
id: retarget
env:
GITEA_TOKEN: ${{ secrets.AUTO_SYNC_TOKEN }}
GITEA_HOST: ${{ vars.GITEA_HOST || 'https://git.moleculesai.app' }}
REPO: ${{ github.repository }}
PR_NUMBER: ${{ github.event.pull_request.number }}
PR_AUTHOR: ${{ github.event.pull_request.user.login }}
# Issue #1884 case: when the bot opens a PR against main
# and there's already another PR on the same head branch
# targeting staging, Gitea's PATCH returns 422 with a
# body mentioning "pull request already exists for base
# branch 'staging'" (the Gitea message wording is
# slightly different from GitHub's; the substring match
# below covers both for forward/back compat).
# The retarget can't proceed — but the right response is
# to close the now-redundant main-PR, not to fail the
# workflow noisily. Detect that specific 422 and close
# instead.
run: |
set -euo pipefail
API="${GITEA_HOST}/api/v1/repos/${REPO}"
AUTH=(-H "Authorization: token ${GITEA_TOKEN}" -H "Accept: application/json")
echo "Retargeting PR #${PR_NUMBER} (author: ${PR_AUTHOR}) from main → staging"
# Curl-status-capture pattern per `feedback_curl_status_capture_pollution`:
# http_code via -w to its own scalar, body to a tempfile, set +e/-e
# bracket so curl's non-zero-on-4xx doesn't pollute the script's exit chain.
BODY_FILE=$(mktemp)
REQ='{"base":"staging"}'
set +e
STATUS=$(curl -sS "${AUTH[@]}" -H "Content-Type: application/json" \
-X PATCH -d "${REQ}" \
-o "${BODY_FILE}" -w "%{http_code}" \
"${API}/pulls/${PR_NUMBER}")
CURL_RC=$?
set -e
if [ "${CURL_RC}" -ne 0 ]; then
echo "::error::curl PATCH failed (rc=${CURL_RC})"
rm -f "${BODY_FILE}"
exit 1
fi
if [ "${STATUS}" = "201" ] || [ "${STATUS}" = "200" ]; then
NEW_BASE=$(jq -r '.base.ref // "?"' < "${BODY_FILE}")
rm -f "${BODY_FILE}"
if [ "${NEW_BASE}" = "staging" ]; then
echo "::notice::Retargeted PR #${PR_NUMBER} → staging"
echo "outcome=retargeted" >> "$GITHUB_OUTPUT"
exit 0
fi
echo "::error::PATCH returned ${STATUS} but base.ref is '${NEW_BASE}', not 'staging'"
exit 1
fi
# Specifically match the 422 duplicate-base/head error so
# any OTHER PATCH failure (auth, deleted PR, etc.) still
# surfaces as a real workflow failure.
BODY=$(cat "${BODY_FILE}" || true)
rm -f "${BODY_FILE}"
if [ "${STATUS}" = "422" ] && echo "${BODY}" | grep -qE "(pull request already exists for base branch 'staging'|already exists.*base.*staging)"; then
echo "::notice::PR #${PR_NUMBER}: duplicate target-staging PR exists on same head — closing this main-PR as redundant."
# Close the now-redundant main-PR via Gitea REST
# (PATCH state=closed). Post comment explaining
# rationale BEFORE close so the comment lands on the
# PR (commenting on a closed PR works on Gitea, but
# historically caused notification ordering surprises).
CLOSE_BODY_FILE=$(mktemp)
CMT_REQ=$(jq -n '{body:"[retarget-bot] Closing — another PR on the same head branch already targets `staging`. This PR is redundant. See issue #1884 for the rationale."}')
set +e
CMT_STATUS=$(curl -sS "${AUTH[@]}" -H "Content-Type: application/json" \
-X POST -d "${CMT_REQ}" \
-o "${CLOSE_BODY_FILE}" -w "%{http_code}" \
"${API}/issues/${PR_NUMBER}/comments")
set -e
if [ "${CMT_STATUS}" != "201" ]; then
echo "::warning::dup-close comment POST returned ${CMT_STATUS}; continuing to close anyway"
cat "${CLOSE_BODY_FILE}" | head -c 300 || true
fi
rm -f "${CLOSE_BODY_FILE}"
CLOSE_REQ='{"state":"closed"}'
CLOSE_RESP=$(mktemp)
set +e
CL_STATUS=$(curl -sS "${AUTH[@]}" -H "Content-Type: application/json" \
-X PATCH -d "${CLOSE_REQ}" \
-o "${CLOSE_RESP}" -w "%{http_code}" \
"${API}/pulls/${PR_NUMBER}")
set -e
if [ "${CL_STATUS}" = "201" ] || [ "${CL_STATUS}" = "200" ]; then
echo "::notice::Closed PR #${PR_NUMBER} as redundant"
echo "outcome=closed-as-duplicate" >> "$GITHUB_OUTPUT"
rm -f "${CLOSE_RESP}"
exit 0
fi
echo "::error::Failed to close redundant PR: HTTP ${CL_STATUS}"
cat "${CLOSE_RESP}" | head -c 300 || true
rm -f "${CLOSE_RESP}"
exit 1
fi
echo "::error::Retarget PATCH failed and was NOT a duplicate-base error: HTTP ${STATUS}"
echo "${BODY}" | head -c 500 >&2
exit 1
- name: Post explainer comment
if: steps.retarget.outputs.outcome == 'retargeted'
env:
GITEA_TOKEN: ${{ secrets.AUTO_SYNC_TOKEN }}
GITEA_HOST: ${{ vars.GITEA_HOST || 'https://git.moleculesai.app' }}
REPO: ${{ github.repository }}
PR_NUMBER: ${{ github.event.pull_request.number }}
run: |
set -euo pipefail
API="${GITEA_HOST}/api/v1/repos/${REPO}"
AUTH=(-H "Authorization: token ${GITEA_TOKEN}" -H "Accept: application/json")
# PR comments live on the issue endpoint in Gitea
# (PRs ARE issues — same endpoint, different sub-resources
# for diffs/files/etc.). The body uses jq to safely
# encode the multi-line markdown without shell-quote
# nightmares.
REQ=$(jq -n '{body:"[retarget-bot] This PR was opened against `main` and has been retargeted to `staging` automatically.\n\n**Why:** per [SHARED_RULES rule 8](https://git.moleculesai.app/molecule-ai/molecule-ai-org-template-molecule-dev/src/branch/main/SHARED_RULES.md), all feature work targets `staging` first; the CEO promotes `staging → main` separately.\n\n**What changed:** just the base branch — no code change. CI will re-run against `staging`. If you get merge conflicts, rebase on `staging`.\n\n**If this PR is the CEO`s staging→main promotion:** the Action skipped you (only bot-authored PRs are retargeted, head=staging is also exempted). If you see this comment on your CEO PR, that`s a bug — please tag @hongmingwang."}')
BODY_FILE=$(mktemp)
set +e
STATUS=$(curl -sS "${AUTH[@]}" -H "Content-Type: application/json" \
-X POST -d "${REQ}" \
-o "${BODY_FILE}" -w "%{http_code}" \
"${API}/issues/${PR_NUMBER}/comments")
set -e
if [ "${STATUS}" = "201" ]; then
echo "::notice::Posted explainer comment on PR #${PR_NUMBER}"
else
echo "::warning::Failed to post explainer (HTTP ${STATUS}) — retarget itself succeeded"
cat "${BODY_FILE}" | head -c 300 || true
fi
rm -f "${BODY_FILE}"
+28
View File
@@ -0,0 +1,28 @@
# Top-level Makefile — convenience wrappers around docker compose.
#
# Most molecule-core dev work happens via these shortcuts. CI doesn't
# use this Makefile; CI calls docker compose / go test directly so the
# Makefile can evolve without breaking the build.
.PHONY: help dev up down logs build test
help: ## Show this help.
@grep -E '^[a-zA-Z_-]+:.*?## ' $(MAKEFILE_LIST) | awk 'BEGIN {FS = ":.*?## "}; {printf "\033[36m%-12s\033[0m %s\n", $$1, $$2}'
dev: ## Start the full stack with air hot-reload for the platform service.
docker compose -f docker-compose.yml -f docker-compose.dev.yml up
up: ## Start the full stack in production-shape mode (no air, normal Dockerfile).
docker compose up
down: ## Stop the stack and remove containers (volumes preserved).
docker compose down
logs: ## Tail logs from all services (Ctrl-C to detach).
docker compose logs -f
build: ## Force a fresh build of the platform image (no cache).
docker compose build --no-cache platform
test: ## Run Go unit tests in workspace-server/.
cd workspace-server && go test -race ./...
+43
View File
@@ -0,0 +1,43 @@
# docker-compose.dev.yml — overlay over docker-compose.yml for local dev
# with air-driven live reload of the platform (workspace-server) service.
#
# Usage:
# docker compose -f docker-compose.yml -f docker-compose.dev.yml up
# (or `make dev` shorthand from repo root)
#
# What this overlay changes vs docker-compose.yml alone:
# - Platform service uses workspace-server/Dockerfile.dev (air on top of
# golang:1.25-alpine) instead of the multi-stage prod Dockerfile.
# - Platform service bind-mounts the host's workspace-server/ source
# into /app/workspace-server so air sees source edits live.
# - Other services (postgres, redis, langfuse, etc.) inherit unchanged
# from docker-compose.yml.
#
# What stays the same:
# - All env vars, volumes, depends_on, healthchecks from docker-compose.yml.
# - Network topology + ports.
# - Postgres/Redis as service containers (no in-process replacements).
services:
platform:
build:
context: .
dockerfile: workspace-server/Dockerfile.dev
# Rebind source: edits under host's workspace-server/ propagate live.
# The named volume on go-build-cache speeds up first build per container.
volumes:
- ./workspace-server:/app/workspace-server
- go-build-cache:/root/.cache/go-build
- go-mod-cache:/go/pkg/mod
# Air signals the running binary on rebuild; ensure shell stops cleanly.
init: true
# Mark the service as dev-mode so the platform can short-circuit any
# behavior that's incompatible with hot-reload (e.g. background
# cron-style watchers that don't survive process restart). No-op
# today; reserved for future flag use.
environment:
MOLECULE_DEV_HOT_RELOAD: "1"
volumes:
go-build-cache:
go-mod-cache:
+49
View File
@@ -0,0 +1,49 @@
# air.toml — live-reload config for local docker-compose dev mode.
#
# Active when the platform service runs from workspace-server/Dockerfile.dev
# (selected via docker-compose.dev.yml overlay). In production, the regular
# Dockerfile builds a static binary; air is dev-only.
#
# Reference: https://github.com/air-verse/air
root = "."
testdata_dir = "testdata"
tmp_dir = "tmp"
[build]
# Same build invocation as Dockerfile's builder stage minus the
# CGO_ENABLED=0 toggle (CGO ok in dev for richer race detector output).
cmd = "go build -o ./tmp/server ./cmd/server"
bin = "tmp/server"
full_bin = ""
args_bin = []
# Watch every .go and .yaml file under workspace-server/.
include_ext = ["go", "yaml", "tmpl"]
# Don't watch tests, build artifacts, vendored deps, or migration .sql
# (migrations need a clean DB anyway — handled by docker-compose down/up).
exclude_dir = ["assets", "tmp", "vendor", "testdata", "node_modules"]
exclude_file = []
# _test.go and *_mock.go shouldn't trigger a rebuild — saves cycles.
exclude_regex = ["_test\\.go$", "_mock\\.go$"]
exclude_unchanged = true
follow_symlink = false
log = "build-errors.log"
# Kill running binary 1s before starting new one.
kill_delay = "1s"
send_interrupt = true
stop_on_error = true
# Debounce: wait this long after last change before triggering rebuild.
delay = 500
[log]
time = false
[color]
main = "magenta"
watcher = "cyan"
build = "yellow"
runner = "green"
[misc]
# Don't keep the tmp/ dir around between runs.
clean_on_exit = true
+38
View File
@@ -0,0 +1,38 @@
# Dockerfile.dev — local-development image with air-driven live reload.
#
# Selected by docker-compose.dev.yml (overlay over docker-compose.yml).
# Production stays on workspace-server/Dockerfile (static binary, no air).
#
# Workflow:
# 1. docker compose -f docker-compose.yml -f docker-compose.dev.yml up
# 2. Edit any .go file under workspace-server/
# 3. air detects, rebuilds, kills old binary, starts new one (~3-5s)
# 4. No `docker compose up --build` needed
#
# Templates + plugins are NOT pre-cloned here — air-mode assumes the
# developer's filesystem has the workspace-configs-templates/ + plugins/
# dirs available, mounted at runtime via docker-compose.dev.yml.
FROM golang:1.25-alpine
# air + git (for go mod) + ca-certs (for TLS) + tzdata (for time-zone DB).
RUN apk add --no-cache git ca-certificates tzdata wget \
&& go install github.com/air-verse/air@latest
WORKDIR /app/workspace-server
# Pre-fetch deps so the first `air` rebuild on a fresh container is fast.
# These are bind-mount-overridden at runtime, so the COPY here is just
# to warm the module cache.
COPY workspace-server/go.mod workspace-server/go.sum ./
RUN go mod download
# Source is bind-mounted at runtime (see docker-compose.dev.yml volumes
# block) so the Dockerfile doesn't need to COPY it. air watches the
# bind-mounted dir for changes.
ENV CGO_ENABLED=1
ENV GOFLAGS="-buildvcs=false"
# Run air with the .air.toml in the bind-mounted source dir.
CMD ["air", "-c", ".air.toml"]
@@ -6,6 +6,7 @@ package handlers
import (
"fmt"
"log"
"os"
"path/filepath"
"regexp"
@@ -102,6 +103,56 @@ func loadWorkspaceEnv(orgBaseDir, filesDir string) map[string]string {
return envVars
}
// loadPersonaEnvFile merges per-role persona credentials into out. The file
// lives at $MOLECULE_PERSONA_ROOT/<role>/env (default
// /etc/molecule-bootstrap/personas) and is populated by the operator-host
// bootstrap kit — one persona per dev-tree role, each carrying the role's
// Gitea identity (GITEA_USER, GITEA_TOKEN, GITEA_TOKEN_SCOPES,
// GITEA_USER_EMAIL, GITEA_SSH_KEY_PATH).
//
// Lower precedence than the org and workspace .env files: callers should
// invoke this BEFORE parseEnvFile on those, so a workspace .env can
// override a persona-default value when needed.
//
// Silent no-op when role is empty, when the role name fails the safe-segment
// check, or when the env file does not exist (workspaces without a role —
// or running on hosts that don't ship the bootstrap dir — keep their old
// behavior).
func loadPersonaEnvFile(role string, out map[string]string) {
if !isSafeRoleName(role) {
if role != "" {
log.Printf("Org import: refusing persona env load for unsafe role name %q", role)
}
return
}
root := os.Getenv("MOLECULE_PERSONA_ROOT")
if root == "" {
root = "/etc/molecule-bootstrap/personas"
}
parseEnvFile(filepath.Join(root, role, "env"), out)
}
// isSafeRoleName accepts a single path segment of [A-Za-z0-9_-]+. Rejects
// empty, ".", "..", and anything containing a path separator — even though
// the construct is admin-only, defense-in-depth keeps the persona dir
// shape invariant: one flat directory per role, no climbing out.
func isSafeRoleName(s string) bool {
if s == "" || s == "." || s == ".." {
return false
}
for _, c := range s {
switch {
case c >= 'a' && c <= 'z':
case c >= 'A' && c <= 'Z':
case c >= '0' && c <= '9':
case c == '-' || c == '_':
default:
return false
}
}
return true
}
// parseEnvFile reads a .env file and adds KEY=VALUE pairs to the map.
// Skips comments (#) and empty lines. Values can be quoted.
func parseEnvFile(path string, out map[string]string) {
@@ -443,10 +443,18 @@ func (h *OrgHandler) createWorkspaceTree(ws OrgWorkspace, parentID *string, absX
configFiles["system-prompt.md"] = []byte(ws.SystemPrompt)
}
// Inject secrets from .env files as workspace secrets.
// Resolution: workspace .env → org root .env (workspace overrides org root).
// Inject secrets from persona env + .env files as workspace secrets.
// Resolution (later overrides earlier):
// 0. Persona env (per-role bootstrap creds; only when ws.Role is set
// and the operator-host bootstrap dir ships a matching file)
// 1. Org root .env (shared defaults)
// 2. Workspace-specific .env (per-workspace overrides)
// Each line: KEY=VALUE → stored as encrypted workspace secret.
envVars := map[string]string{}
// 0. Persona env (lowest precedence; injects the role's Gitea identity:
// GITEA_USER, GITEA_TOKEN, GITEA_TOKEN_SCOPES, GITEA_USER_EMAIL,
// GITEA_SSH_KEY_PATH). Workspace and org .env can override.
loadPersonaEnvFile(ws.Role, envVars)
if orgBaseDir != "" {
// 1. Org root .env (shared defaults)
parseEnvFile(filepath.Join(orgBaseDir, ".env"), envVars)
@@ -0,0 +1,171 @@
package handlers
import (
"os"
"path/filepath"
"testing"
)
// TestLoadPersonaEnvFile_HappyPath: the standard case — a persona-shaped
// env file exists at <root>/<role>/env and its KEY=VALUE pairs land in
// the out map. Mirrors what the operator-host bootstrap kit ships:
// GITEA_USER, GITEA_TOKEN, GITEA_TOKEN_SCOPES, GITEA_USER_EMAIL,
// GITEA_SSH_KEY_PATH.
func TestLoadPersonaEnvFile_HappyPath(t *testing.T) {
root := t.TempDir()
roleDir := filepath.Join(root, "dev-lead")
if err := os.MkdirAll(roleDir, 0o755); err != nil {
t.Fatal(err)
}
envBody := `# Persona env file — mode 600
GITEA_USER=dev-lead
GITEA_USER_EMAIL=dev-lead@agents.moleculesai.app
GITEA_TOKEN=abc123
GITEA_TOKEN_SCOPES=write:repository,write:issue,read:user
GITEA_SSH_KEY_PATH=/etc/molecule-bootstrap/personas/dev-lead/ssh_priv
`
if err := os.WriteFile(filepath.Join(roleDir, "env"), []byte(envBody), 0o600); err != nil {
t.Fatal(err)
}
t.Setenv("MOLECULE_PERSONA_ROOT", root)
out := map[string]string{}
loadPersonaEnvFile("dev-lead", out)
want := map[string]string{
"GITEA_USER": "dev-lead",
"GITEA_USER_EMAIL": "dev-lead@agents.moleculesai.app",
"GITEA_TOKEN": "abc123",
"GITEA_TOKEN_SCOPES": "write:repository,write:issue,read:user",
"GITEA_SSH_KEY_PATH": "/etc/molecule-bootstrap/personas/dev-lead/ssh_priv",
}
if len(out) != len(want) {
t.Fatalf("got %d keys, want %d: %#v", len(out), len(want), out)
}
for k, v := range want {
if out[k] != v {
t.Errorf("out[%q] = %q; want %q", k, out[k], v)
}
}
}
// TestLoadPersonaEnvFile_MissingDir: when the persona dir doesn't exist
// (e.g. dev-only host without the bootstrap kit, or a workspace whose
// role isn't a known persona), it's a silent no-op — out stays empty,
// no panic, no log noise that would break callers.
func TestLoadPersonaEnvFile_MissingDir(t *testing.T) {
t.Setenv("MOLECULE_PERSONA_ROOT", t.TempDir()) // empty dir
out := map[string]string{}
loadPersonaEnvFile("nonexistent-role", out)
if len(out) != 0 {
t.Errorf("expected empty out, got %#v", out)
}
}
// TestLoadPersonaEnvFile_EmptyRole: empty role string is the common case
// for non-dev workspaces (research/marketing/etc.). Skip silently.
func TestLoadPersonaEnvFile_EmptyRole(t *testing.T) {
t.Setenv("MOLECULE_PERSONA_ROOT", t.TempDir())
out := map[string]string{}
loadPersonaEnvFile("", out)
if len(out) != 0 {
t.Errorf("empty role should produce empty out; got %#v", out)
}
}
// TestLoadPersonaEnvFile_RejectsTraversal: even though role names come
// from server-side admin-only org templates, defense-in-depth — refuse
// any role string with path separators or "..". Verifies that a maliciously
// crafted template can't read /etc/passwd by setting role: "../../etc".
func TestLoadPersonaEnvFile_RejectsTraversal(t *testing.T) {
root := t.TempDir()
// Plant a file at /tmp/.../env so a bad traversal would reach it
if err := os.WriteFile(filepath.Join(root, "env"), []byte("STOLEN=yes\n"), 0o600); err != nil {
t.Fatal(err)
}
t.Setenv("MOLECULE_PERSONA_ROOT", filepath.Join(root, "personas"))
for _, bad := range []string{"..", "../personas", "../etc/passwd", "/abs", "with/slash", "dot.in.middle", "with space", "back\\slash", ".", ""} {
out := map[string]string{}
loadPersonaEnvFile(bad, out)
if len(out) != 0 {
t.Errorf("role %q should have been rejected; got %#v", bad, out)
}
}
}
// TestLoadPersonaEnvFile_DefaultRoot: when MOLECULE_PERSONA_ROOT is unset,
// the helper falls back to /etc/molecule-bootstrap/personas. We don't
// touch real /etc — just verify the function doesn't panic and produces
// empty out (since the test box isn't expected to ship that path).
func TestLoadPersonaEnvFile_DefaultRoot(t *testing.T) {
t.Setenv("MOLECULE_PERSONA_ROOT", "") // explicit empty
out := map[string]string{}
loadPersonaEnvFile("dev-lead", out)
// Don't assert content — production CI might or might not have the
// /etc dir mounted. Just verify the call returns cleanly.
_ = out
}
// TestLoadPersonaEnvFile_PrecedenceCallerOverrides: the contract is "lower
// precedence than later .env files." The helper writes into out without
// removing existing keys, so a caller pre-populating out simulates a
// later layer overriding persona defaults. We verify the helper does NOT
// clobber pre-existing entries… actually, parseEnvFile DOES overwrite,
// so the caller-side ordering (persona → org → workspace) is what enforces
// precedence. This test pins that contract: persona is loaded into a
// fresh map, then later layers can override.
func TestLoadPersonaEnvFile_OverwritesEmptyMap(t *testing.T) {
root := t.TempDir()
roleDir := filepath.Join(root, "core-be")
if err := os.MkdirAll(roleDir, 0o755); err != nil {
t.Fatal(err)
}
if err := os.WriteFile(filepath.Join(roleDir, "env"),
[]byte("GITEA_TOKEN=persona-value\n"), 0o600); err != nil {
t.Fatal(err)
}
t.Setenv("MOLECULE_PERSONA_ROOT", root)
out := map[string]string{"GITEA_TOKEN": "preset"}
loadPersonaEnvFile("core-be", out)
// Persona helper is meant to populate a FRESH map first in the
// caller's flow; calling it on a pre-populated map and seeing the
// value get overwritten is consistent with parseEnvFile semantics.
if out["GITEA_TOKEN"] != "persona-value" {
t.Errorf("loadPersonaEnvFile did not write into existing map; got %q", out["GITEA_TOKEN"])
}
}
// TestIsSafeRoleName_Acceptance: positive + negative cases for the
// validator. Pinned because every dev-tree role name must pass.
func TestIsSafeRoleName_Acceptance(t *testing.T) {
good := []string{
"dev-lead", "core-be", "cp-security", "infra-runtime-be",
"sdk-dev", "plugin-dev", "documentation-specialist",
"triage-operator", "fullstack-engineer", "release-manager",
"core_underscore_ok", "X", "a1", "Z9-0",
}
for _, s := range good {
if !isSafeRoleName(s) {
t.Errorf("isSafeRoleName(%q) = false; want true", s)
}
}
bad := []string{
"", ".", "..", "with/slash", "/abs", "dot.in.middle",
"with space", "back\\slash", "trailing-", // trailing-hyphen is fine actually
"with$dollar", "with?question", "newline\nsplit",
}
// trailing-hyphen IS allowed; remove from "bad" list:
bad = []string{
"", ".", "..", "with/slash", "/abs", "dot.in.middle",
"with space", "back\\slash", "with$dollar", "with?question",
"newline\nsplit",
}
for _, s := range bad {
if isSafeRoleName(s) {
t.Errorf("isSafeRoleName(%q) = true; want false", s)
}
}
}
@@ -0,0 +1,207 @@
package handlers
// plugins_atomic.go — atomic install pattern for plugin delivery into a
// running workspace container. Closes molecule-core#114.
//
// Replaces the prior "tar + docker.CopyToContainer to /configs/plugins/<name>"
// single-step write (no atomicity, no marker, no rollback) with a 4-step
// dance:
//
// 1. STAGE — extract tar into /configs/plugins/.staging/<name>.<ts>/
// 2. SNAPSHOT — if /configs/plugins/<name>/ exists, mv to .previous/<name>.<ts>/
// 3. SWAP — mv /configs/plugins/.staging/<name>.<ts>/ → /configs/plugins/<name>/
// 4. MARKER — touch /configs/plugins/<name>/.complete
//
// On any post-snapshot failure we attempt a best-effort rollback by mv-ing
// the previous snapshot back into place. The .complete marker is the
// canonical "this install is fully landed" signal — workspace-side plugin
// loaders should refuse to load a plugin dir without it.
//
// Scope: docker path only (workspace running as a local container). The
// SaaS path (deliverViaEIC, SSH-into-EC2) is unchanged in this PR; tracked
// as a follow-up. The same stage-then-swap shape applies but the exec
// primitives differ (ssh vs docker exec), and shipping both paths in one
// PR doubles the test surface.
import (
"bytes"
"context"
"fmt"
"path"
"strings"
"time"
"github.com/docker/docker/api/types/container"
)
const (
pluginsRoot = "/configs/plugins"
pluginsStagingDir = "/configs/plugins/.staging"
pluginsPrevDir = "/configs/plugins/.previous"
completeMarker = ".complete"
)
// installVersion identifies one install attempt — the plugin name plus a
// monotonic-ish UTC timestamp suffix. Used to namespace the staging dir
// and any snapshot of the previous version, so a reinstall mid-flight
// can't collide with a concurrent reinstall.
type installVersion struct {
plugin string
stamp string // e.g. 20260508T141530Z
}
func newInstallVersion(plugin string) installVersion {
return installVersion{
plugin: plugin,
stamp: time.Now().UTC().Format("20060102T150405Z"),
}
}
// stagedPath is the container path where the new content lands during fetch.
// e.g. /configs/plugins/.staging/molecule-skill-foo.20260508T141530Z
func (v installVersion) stagedPath() string {
return path.Join(pluginsStagingDir, v.plugin+"."+v.stamp)
}
// previousPath is where the prior live version is moved before swap.
// e.g. /configs/plugins/.previous/molecule-skill-foo.20260508T141530Z
func (v installVersion) previousPath() string {
return path.Join(pluginsPrevDir, v.plugin+"."+v.stamp)
}
// livePath is the destination after swap.
// e.g. /configs/plugins/molecule-skill-foo
func (v installVersion) livePath() string {
return path.Join(pluginsRoot, v.plugin)
}
// markerPath is the .complete file inside the live dir written last.
func (v installVersion) markerPath() string {
return path.Join(v.livePath(), completeMarker)
}
// atomicCopyToContainer does a stage→snapshot→swap→marker install of a
// host-side staged plugin tree into a running container's
// /configs/plugins/<name>/. Returns nil on success.
//
// On post-snapshot failure (swap or marker write), best-effort rollback
// restores the previous snapshot to the live path. Returns the original
// error wrapped — the caller should surface it; rollback success is
// logged separately.
func (h *PluginsHandler) atomicCopyToContainer(
ctx context.Context, containerName, hostDir, pluginName string,
) error {
v := newInstallVersion(pluginName)
// Step 0a: ensure staging + previous root dirs exist (idempotent).
if _, err := h.execAsRoot(ctx, containerName, []string{
"mkdir", "-p", pluginsStagingDir, pluginsPrevDir,
}); err != nil {
return fmt.Errorf("atomic install: mkdir staging/previous: %w", err)
}
// Step 0b: tar the host content with a path prefix that lands it in the
// staging dir — NOT directly into the live name. The prefix has no
// leading "/" because docker.CopyToContainer extracts paths relative
// to the dstPath argument we pass below.
stagedRel := strings.TrimPrefix(v.stagedPath(), "/")
tarBuf, err := tarHostDirWithPrefix(hostDir, stagedRel)
if err != nil {
return fmt.Errorf("atomic install: tar host dir: %w", err)
}
// Step 1: STAGE — extract tar into /configs/plugins/.staging/<name>.<ts>/
if err := h.docker.CopyToContainer(ctx, containerName, "/", &tarBuf,
container.CopyToContainerOptions{}); err != nil {
// Best-effort: clean up any partial staging extract before returning.
_, _ = h.execAsRoot(ctx, containerName, []string{
"rm", "-rf", v.stagedPath(),
})
return fmt.Errorf("atomic install: copy to container: %w", err)
}
// Step 2: SNAPSHOT — if a live version exists, move it aside.
// `test -d` exits 0 if the dir exists, non-zero otherwise; the helper
// returns a non-nil error in the non-zero case which we treat as
// "no previous version" rather than a real failure.
snapshotted := false
if _, err := h.execAsRoot(ctx, containerName, []string{
"test", "-d", v.livePath(),
}); err == nil {
if _, err := h.execAsRoot(ctx, containerName, []string{
"mv", v.livePath(), v.previousPath(),
}); err != nil {
// Snapshot failure: roll back the staged extract before failing.
_, _ = h.execAsRoot(ctx, containerName, []string{
"rm", "-rf", v.stagedPath(),
})
return fmt.Errorf("atomic install: snapshot previous version: %w", err)
}
snapshotted = true
}
// Step 3: SWAP — atomic rename of the staged dir into the live name.
// `mv` on the same filesystem is a single rename(2), atomic at the FS level.
if _, err := h.execAsRoot(ctx, containerName, []string{
"mv", v.stagedPath(), v.livePath(),
}); err != nil {
// Swap failure: roll back if we had a snapshot.
if snapshotted {
if _, rbErr := h.execAsRoot(ctx, containerName, []string{
"mv", v.previousPath(), v.livePath(),
}); rbErr != nil {
return fmt.Errorf("atomic install: swap failed AND rollback failed: swap=%w, rollback=%v", err, rbErr)
}
}
// Best-effort cleanup of the still-staged dir.
_, _ = h.execAsRoot(ctx, containerName, []string{
"rm", "-rf", v.stagedPath(),
})
return fmt.Errorf("atomic install: swap to live path: %w", err)
}
// Step 4: MARKER — touch .complete inside the live dir as the last write.
// Workspace-side plugin loaders treat a plugin dir without this marker
// as half-installed and skip it (or surface a clear error to the
// operator instead of loading a possibly-partial tree).
if _, err := h.execAsRoot(ctx, containerName, []string{
"touch", v.markerPath(),
}); err != nil {
// Marker write failure with the new content already in place is a
// weird state — content is fine on disk, but the plugin loader
// will refuse to use it. Log loudly; do NOT roll back, since the
// content is the latest, just unmarked. Operator can manually
// `touch <plugin>/.complete` to recover.
return fmt.Errorf("atomic install: write .complete marker (content landed but unmarked, manual recovery: touch %s): %w", v.markerPath(), err)
}
// Step 5: GC — best-effort delete the previous snapshot. Failures here
// just leave a directory; not load-bearing for correctness, the next
// install or a separate sweeper will reclaim the space.
if snapshotted {
_, _ = h.execAsRoot(ctx, containerName, []string{
"rm", "-rf", v.previousPath(),
})
}
return nil
}
// tarHostDirWithPrefix walks hostDir and writes a tar to a buffer with
// every entry's name prefixed by `prefix`. Mirrors the prior streaming
// shape used in copyPluginToContainer but with a configurable prefix
// (the prior version hardcoded "plugins/<name>/"; we use a full
// staging path so the extracted layout is the staging dir directly).
//
// Symlinks are skipped — same posture as streamDirAsTar elsewhere in
// this file. Skipping prevents a hostile plugin from injecting a
// symlink that, post-extract, points outside the plugin's own dir.
func tarHostDirWithPrefix(hostDir, prefix string) (bytes.Buffer, error) {
var buf bytes.Buffer
tw := newTarWriter(&buf)
defer tw.Close()
if err := tarWalk(hostDir, prefix, tw); err != nil {
return bytes.Buffer{}, err
}
return buf, nil
}
@@ -0,0 +1,70 @@
package handlers
// plugins_atomic_tar.go — tar-walk helpers split out so the main atomic
// install flow stays readable. The prefix argument lets the caller
// arrange where the tar's contents land at extract time.
import (
"archive/tar"
"io"
"os"
"path/filepath"
)
// newTarWriter is a thin wrapper so atomic_test.go can swap the writer
// destination if it needs to.
func newTarWriter(w io.Writer) *tar.Writer {
return tar.NewWriter(w)
}
// tarWalk walks hostDir and writes every regular file + dir to the tar
// writer with paths of the form `<prefix>/<relative>`. Symlinks are
// skipped — same posture as streamDirAsTar in plugins_install_pipeline.go.
//
// The trailing-slash on prefix is normalized away: prefix "foo" and
// prefix "foo/" produce identical archives.
func tarWalk(hostDir, prefix string, tw *tar.Writer) error {
prefix = filepath.Clean(prefix)
return filepath.Walk(hostDir, func(p string, info os.FileInfo, err error) error {
if err != nil {
return err
}
if info.Mode()&os.ModeSymlink != 0 {
return nil // skip symlinks; see doc above
}
rel, err := filepath.Rel(hostDir, p)
if err != nil {
return err
}
if rel == "." {
// Emit the prefix dir itself once, with the source dir's mode.
hdr, err := tar.FileInfoHeader(info, "")
if err != nil {
return err
}
hdr.Name = prefix + "/"
return tw.WriteHeader(hdr)
}
hdr, err := tar.FileInfoHeader(info, "")
if err != nil {
return err
}
hdr.Name = filepath.Join(prefix, rel)
if info.IsDir() {
hdr.Name += "/"
}
if err := tw.WriteHeader(hdr); err != nil {
return err
}
if !info.Mode().IsRegular() {
return nil
}
f, err := os.Open(p)
if err != nil {
return err
}
defer f.Close()
_, err = io.Copy(tw, f)
return err
})
}
@@ -0,0 +1,193 @@
package handlers
import (
"archive/tar"
"bytes"
"io"
"os"
"path/filepath"
"sort"
"strings"
"testing"
"time"
)
// TestInstallVersion_Paths: the path helpers must produce a stable shape
// the in-container exec calls depend on. Pinning the layout here
// catches a future refactor that accidentally changes where staging /
// previous / live dirs live, which would break the swap atomicity.
func TestInstallVersion_Paths(t *testing.T) {
v := installVersion{plugin: "molecule-skill-foo", stamp: "20260508T141530Z"}
if got, want := v.stagedPath(), "/configs/plugins/.staging/molecule-skill-foo.20260508T141530Z"; got != want {
t.Errorf("stagedPath = %q; want %q", got, want)
}
if got, want := v.previousPath(), "/configs/plugins/.previous/molecule-skill-foo.20260508T141530Z"; got != want {
t.Errorf("previousPath = %q; want %q", got, want)
}
if got, want := v.livePath(), "/configs/plugins/molecule-skill-foo"; got != want {
t.Errorf("livePath = %q; want %q", got, want)
}
if got, want := v.markerPath(), "/configs/plugins/molecule-skill-foo/.complete"; got != want {
t.Errorf("markerPath = %q; want %q", got, want)
}
}
// TestInstallVersion_StampUniqueness: two newInstallVersion calls within
// the same second produce the same stamp (we use second precision); the
// caller relies on the mv-rename being atomic, so collision-free
// stamping is NOT a correctness requirement — but a regression that
// changes stamp shape (e.g. RFC3339 with colons) would break the path
// helpers since path.Join treats a colon as a regular char but ssh +
// docker exec generally don't. Pin the no-colon shape.
func TestInstallVersion_StampShape(t *testing.T) {
v := newInstallVersion("anything")
if strings.Contains(v.stamp, ":") {
t.Errorf("stamp must not contain colons (breaks shell-quoting in exec): %q", v.stamp)
}
if strings.Contains(v.stamp, " ") {
t.Errorf("stamp must not contain spaces: %q", v.stamp)
}
// Sanity: stamp parses as the documented format.
if _, err := time.Parse("20060102T150405Z", v.stamp); err != nil {
t.Errorf("stamp %q does not parse as 20060102T150405Z: %v", v.stamp, err)
}
}
// TestTarHostDirWithPrefix_HappyPath: walks a host dir, builds a tar with
// the configured prefix, verifies every entry's name is rooted under
// the prefix, and the file contents survive round-trip.
func TestTarHostDirWithPrefix_HappyPath(t *testing.T) {
hostDir := t.TempDir()
// Plant: <host>/plugin.yaml + <host>/skills/foo/SKILL.md + <host>/.complete
files := map[string]string{
"plugin.yaml": "name: foo\nversion: 1.0.0\n",
"skills/foo/SKILL.md": "# Foo skill\n",
".complete": "", // upstream may already have a marker
}
for rel, body := range files {
full := filepath.Join(hostDir, rel)
if err := os.MkdirAll(filepath.Dir(full), 0o755); err != nil {
t.Fatal(err)
}
if err := os.WriteFile(full, []byte(body), 0o644); err != nil {
t.Fatal(err)
}
}
prefix := "configs/plugins/.staging/foo.20260508T141530Z"
buf, err := tarHostDirWithPrefix(hostDir, prefix)
if err != nil {
t.Fatalf("tar: %v", err)
}
// Read back the tar; collect names + body for regular files.
got := map[string]string{}
tr := tar.NewReader(&buf)
for {
hdr, err := tr.Next()
if err == io.EOF {
break
}
if err != nil {
t.Fatalf("tar reader: %v", err)
}
// Every entry must start with the prefix
if !strings.HasPrefix(hdr.Name, prefix) {
t.Errorf("entry %q does not start with prefix %q", hdr.Name, prefix)
}
if hdr.Typeflag == tar.TypeReg {
body, err := io.ReadAll(tr)
if err != nil {
t.Fatal(err)
}
rel := strings.TrimPrefix(hdr.Name, prefix+"/")
got[rel] = string(body)
}
}
for rel, want := range files {
if got[rel] != want {
t.Errorf("body[%q] = %q; want %q", rel, got[rel], want)
}
}
}
// TestTarHostDirWithPrefix_SkipsSymlinks: a hostile plugin shouldn't be
// able to ship a symlink that, post-extract, points outside its own
// dir. The walker silently skips symlinks (same posture as
// streamDirAsTar). Verify a planted symlink doesn't appear in the tar.
func TestTarHostDirWithPrefix_SkipsSymlinks(t *testing.T) {
hostDir := t.TempDir()
// Plant a real file + a symlink pointing outside hostDir.
if err := os.WriteFile(filepath.Join(hostDir, "real.txt"), []byte("ok"), 0o644); err != nil {
t.Fatal(err)
}
target := filepath.Join(t.TempDir(), "outside")
if err := os.WriteFile(target, []byte("SHOULD NOT APPEAR"), 0o644); err != nil {
t.Fatal(err)
}
if err := os.Symlink(target, filepath.Join(hostDir, "evil")); err != nil {
t.Fatal(err)
}
buf, err := tarHostDirWithPrefix(hostDir, "p")
if err != nil {
t.Fatal(err)
}
names := []string{}
tr := tar.NewReader(&buf)
for {
hdr, err := tr.Next()
if err == io.EOF {
break
}
if err != nil {
t.Fatal(err)
}
names = append(names, hdr.Name)
}
sort.Strings(names)
for _, n := range names {
if strings.Contains(n, "evil") {
t.Errorf("symlink leaked into tar: %q", n)
}
}
// real.txt should be present
found := false
for _, n := range names {
if strings.HasSuffix(n, "real.txt") {
found = true
break
}
}
if !found {
t.Errorf("real.txt missing from tar; got names: %v", names)
}
}
// TestTarHostDirWithPrefix_PrefixNormalization: trailing slash on prefix
// should not change the archive shape. Pinning this so a future caller
// passing "foo/" instead of "foo" doesn't double-slash entry names.
func TestTarHostDirWithPrefix_PrefixNormalization(t *testing.T) {
hostDir := t.TempDir()
if err := os.WriteFile(filepath.Join(hostDir, "x"), []byte("y"), 0o644); err != nil {
t.Fatal(err)
}
a, err := tarHostDirWithPrefix(hostDir, "foo")
if err != nil {
t.Fatal(err)
}
b, err := tarHostDirWithPrefix(hostDir, "foo/")
if err != nil {
t.Fatal(err)
}
if !bytes.Equal(a.Bytes(), b.Bytes()) {
t.Errorf("trailing-slash on prefix changed archive shape; tarHostDirWithPrefix should be slash-insensitive")
}
}
@@ -0,0 +1,214 @@
package handlers
// plugins_classifier.go — diff classifier for plugin updates.
//
// Closes molecule-core#112. Composes with #114 (atomic install) so the
// platform can decide *before* triggering restartFunc whether the
// update is content-only (SKILL.md text changed; agent re-reads at next
// Skill invocation) or structural (hooks/settings/plugin.yaml/file added
// or removed; agent must restart to pick up the new state).
//
// SKILL.md content is hot-reloadable because Claude Code reads the file
// on each Skill invocation — no in-memory cache. Hooks and settings.json
// are loaded at session start and need a session restart. plugin.yaml
// changes are structural by definition (manifest controls everything
// else).
//
// CLASSIFICATION RULE
// classify(staged, live) → "skill-content-only" if and only if
// every file present in either tree is one of:
// - identical between staged and live, OR
// - a **/SKILL.md file with content change (text body modified)
// AND no files were added or removed.
// Anything else → "cold" (the safe default).
//
// The classifier reads live-tree files from inside the container via
// `docker exec cat`. Comparison is by SHA-256 over file content, not
// mtime — mtime changes on every install regardless of content.
import (
"context"
"crypto/sha256"
"encoding/hex"
"fmt"
"io/fs"
"os"
"path/filepath"
"strings"
)
const (
// classifyKindSkillContentOnly: install can skip restartFunc; the
// only changes are SKILL.md body text.
classifyKindSkillContentOnly = "skill-content-only"
// classifyKindCold: must restart the workspace container; structural
// or hook/settings change.
classifyKindCold = "cold"
)
// classifyInstallChanges compares the staged plugin tree (host filesystem)
// against the currently-live plugin tree inside the container. Returns
// classifyKindSkillContentOnly when the only diff is SKILL.md content
// changes, classifyKindCold otherwise (added/removed files, hooks/
// settings.json edits, plugin.yaml edits, anything else).
//
// `noLive` is the sentinel returned when /configs/plugins/<name> doesn't
// exist (first install for this plugin). Treated as cold — no live state
// to hot-reload into.
func (h *PluginsHandler) classifyInstallChanges(
ctx context.Context, containerName, hostStagedDir, pluginName string,
) (string, error) {
livePath := "/configs/plugins/" + pluginName
// Probe: does live exist? If not, this is a first install — cold.
if _, err := h.execAsRoot(ctx, containerName, []string{
"test", "-d", livePath,
}); err != nil {
return classifyKindCold, nil
}
// Build hash maps for both trees.
stagedHashes, err := hashLocalTree(hostStagedDir)
if err != nil {
return classifyKindCold, fmt.Errorf("classifier: hash staged: %w", err)
}
liveHashes, err := h.hashContainerTree(ctx, containerName, livePath)
if err != nil {
// Live tree read failure: be conservative, cold-restart.
return classifyKindCold, nil
}
// Drop the .complete marker from comparison — its mtime/atime can
// vary across installs but content is empty/trivial; including it
// would force-cold every reinstall.
delete(stagedHashes, ".complete")
delete(liveHashes, ".complete")
// Set difference: any file in one but not the other → cold.
for rel := range stagedHashes {
if _, ok := liveHashes[rel]; !ok {
return classifyKindCold, nil // file added
}
}
for rel := range liveHashes {
if _, ok := stagedHashes[rel]; !ok {
return classifyKindCold, nil // file removed
}
}
// Same set of files. Walk the diff.
for rel, stagedHash := range stagedHashes {
liveHash := liveHashes[rel]
if stagedHash == liveHash {
continue
}
// Content differs. Allow if and only if it's a SKILL.md.
if !isSkillMarkdown(rel) {
return classifyKindCold, nil
}
}
return classifyKindSkillContentOnly, nil
}
// isSkillMarkdown returns true for any path whose basename is SKILL.md
// (case-sensitive, matches Claude Code's skill discovery rule).
func isSkillMarkdown(rel string) bool {
return filepath.Base(rel) == "SKILL.md"
}
// hashLocalTree walks a host directory and returns rel-path → sha256-hex.
// Symlinks are skipped (same posture as the tar walker).
func hashLocalTree(root string) (map[string]string, error) {
out := map[string]string{}
err := filepath.WalkDir(root, func(p string, d fs.DirEntry, err error) error {
if err != nil {
return err
}
if d.IsDir() {
return nil
}
info, err := d.Info()
if err != nil {
return err
}
if info.Mode()&os.ModeSymlink != 0 {
return nil
}
if !info.Mode().IsRegular() {
return nil
}
rel, err := filepath.Rel(root, p)
if err != nil {
return err
}
body, err := os.ReadFile(p)
if err != nil {
return err
}
sum := sha256.Sum256(body)
out[filepath.ToSlash(rel)] = hex.EncodeToString(sum[:])
return nil
})
if err != nil {
return nil, err
}
return out, nil
}
// hashContainerTree reads every regular file under livePath via docker
// exec sh -c 'cd <livePath> && find . -type f -not -name .complete | xargs -I {} sh -c "echo {}; sha256sum {}"'.
//
// The output is parsed line-by-line into rel-path → sha256-hex.
func (h *PluginsHandler) hashContainerTree(
ctx context.Context, containerName, livePath string,
) (map[string]string, error) {
out, err := h.execAsRoot(ctx, containerName, []string{
"sh", "-c",
// Find regular files, hash each, output `<hex> ./<relpath>`.
// `cd` then `find .` keeps paths relative to livePath.
fmt.Sprintf("cd %s && find . -type f -print0 | xargs -0 -r sha256sum 2>/dev/null", shQuote(livePath)),
})
if err != nil {
return nil, fmt.Errorf("hash container tree: %w", err)
}
hashes := map[string]string{}
for _, line := range strings.Split(out, "\n") {
line = strings.TrimSpace(line)
if line == "" {
continue
}
// sha256sum output: "<hex> <path>" (two spaces). Path starts with "./".
parts := strings.SplitN(line, " ", 2)
if len(parts) != 2 {
continue
}
hash := parts[0]
rel := strings.TrimPrefix(parts[1], "./")
hashes[rel] = hash
}
return hashes, nil
}
// shQuote single-quotes a string for safe insertion into a shell command.
// Returns the input unchanged if it's already shell-safe (alphanumeric +
// /._-). Otherwise wraps in single quotes and escapes inner '.
func shQuote(s string) string {
safe := true
for _, c := range s {
switch {
case c >= 'a' && c <= 'z':
case c >= 'A' && c <= 'Z':
case c >= '0' && c <= '9':
case c == '/' || c == '.' || c == '_' || c == '-':
default:
safe = false
}
if !safe {
break
}
}
if safe {
return s
}
return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
}
@@ -0,0 +1,121 @@
package handlers
import (
"os"
"path/filepath"
"testing"
)
// TestIsSkillMarkdown: pin which paths the classifier considers
// hot-reloadable. SKILL.md by basename only — case-sensitive.
func TestIsSkillMarkdown(t *testing.T) {
yes := []string{
"SKILL.md",
"skills/foo/SKILL.md",
"deeply/nested/SKILL.md",
}
no := []string{
"plugin.yaml",
"hooks.json",
"settings.json",
"README.md",
"skill.md", // case-sensitive
"SKILLS.md", // not a skill file
"skills/foo/extra.md",
}
for _, s := range yes {
if !isSkillMarkdown(s) {
t.Errorf("isSkillMarkdown(%q) = false; want true", s)
}
}
for _, s := range no {
if isSkillMarkdown(s) {
t.Errorf("isSkillMarkdown(%q) = true; want false", s)
}
}
}
// TestHashLocalTree_StableHash: hashing the same content twice must
// produce identical maps. Pinned because if hashLocalTree ever picks up
// mtime/inode (e.g. via a refactor to use os.Lstat metadata), every
// install would classify as cold and we'd lose the hot-reload.
func TestHashLocalTree_StableHash(t *testing.T) {
dir := t.TempDir()
if err := os.MkdirAll(filepath.Join(dir, "skills/foo"), 0o755); err != nil {
t.Fatal(err)
}
if err := os.WriteFile(filepath.Join(dir, "plugin.yaml"), []byte("name: foo\n"), 0o644); err != nil {
t.Fatal(err)
}
if err := os.WriteFile(filepath.Join(dir, "skills/foo/SKILL.md"), []byte("# Foo\n"), 0o644); err != nil {
t.Fatal(err)
}
h1, err := hashLocalTree(dir)
if err != nil {
t.Fatal(err)
}
h2, err := hashLocalTree(dir)
if err != nil {
t.Fatal(err)
}
if len(h1) != len(h2) {
t.Fatalf("hash count differs: %d vs %d", len(h1), len(h2))
}
for k, v := range h1 {
if h2[k] != v {
t.Errorf("hash[%q] differs: %q vs %q", k, v, h2[k])
}
}
}
// TestHashLocalTree_SymlinkSkipped: symlinks should not appear in the
// hash map — same posture as the tar walker. Otherwise a hostile plugin
// could include a symlink whose hash changes when its target changes,
// silently flipping classification.
func TestHashLocalTree_SymlinkSkipped(t *testing.T) {
dir := t.TempDir()
if err := os.WriteFile(filepath.Join(dir, "real.txt"), []byte("ok"), 0o644); err != nil {
t.Fatal(err)
}
target := filepath.Join(t.TempDir(), "target")
if err := os.WriteFile(target, []byte("outside"), 0o644); err != nil {
t.Fatal(err)
}
if err := os.Symlink(target, filepath.Join(dir, "link")); err != nil {
t.Fatal(err)
}
h, err := hashLocalTree(dir)
if err != nil {
t.Fatal(err)
}
if _, exists := h["link"]; exists {
t.Errorf("symlink leaked into hash map: %v", h)
}
if _, exists := h["real.txt"]; !exists {
t.Errorf("real.txt missing from hash map: %v", h)
}
}
// TestShQuote: the classifier injects livePath into a shell command via
// docker exec. Path must be quoted to handle pluginName entries with
// hyphens (which are safe but exercised here) and any future special-
// character edge case. Pin the safe-vs-quoted boundary.
func TestShQuote(t *testing.T) {
cases := []struct {
in, want string
}{
{"foo", "foo"},
{"/configs/plugins/foo-bar", "/configs/plugins/foo-bar"},
{"with space", "'with space'"},
{"with'quote", "'with'\\''quote'"},
{"$envvar", "'$envvar'"},
{"path/with/dots.txt", "path/with/dots.txt"},
}
for _, tc := range cases {
if got := shQuote(tc.in); got != tc.want {
t.Errorf("shQuote(%q) = %q; want %q", tc.in, got, tc.want)
}
}
}
@@ -91,6 +91,14 @@ func (h *PluginsHandler) Install(c *gin.Context) {
return
}
// Record the install in workspace_plugins (core#113 — version-subscription
// foundation). Best-effort: DB write failure is logged but doesn't fail
// the install — the plugin IS in the container; surfacing a 500 here
// would mislead the caller about the install state.
if err := recordWorkspacePluginInstall(ctx, workspaceID, result.PluginName, result.Source.Raw(), req.Track); err != nil {
log.Printf("Plugin install: failed to record %s for %s in workspace_plugins: %v (install succeeded; tracking row missing)", result.PluginName, workspaceID, err)
}
log.Printf("Plugin install: %s via %s → workspace %s (restarting)", result.PluginName, result.Source.Scheme, workspaceID)
c.JSON(http.StatusOK, gin.H{
"status": "installed",
@@ -114,6 +114,15 @@ type installRequest struct {
// When present, resolveAndStage verifies the fetched content matches
// before allowing the install to proceed (SAFE-T1102 supply-chain hardening).
SHA256 string `json:"sha256,omitempty"`
// Track is the version-subscription mode for this install (core#113):
// "none" — no auto-update tracking (default)
// "tag:vX.Y.Z" — track a specific version tag
// "tag:latest" — track latest tag, drift on every new tag
// "sha:<full>" — pinned, no drift ever
// The drift detector (separate component, follow-up) reads
// workspace_plugins rows where tracked_ref != 'none' and queues
// updates when upstream resolves to a different SHA.
Track string `json:"track,omitempty"`
}
// stageResult bundles the outputs of resolveAndStage for the caller.
@@ -276,7 +285,22 @@ func (h *PluginsHandler) resolveAndStage(ctx context.Context, req installRequest
// using NewPluginsHandler without a DB; production wires it in router.go.
func (h *PluginsHandler) deliverToContainer(ctx context.Context, workspaceID string, r *stageResult) error {
if containerName := h.findRunningContainer(ctx, workspaceID); containerName != "" {
if err := h.copyPluginToContainer(ctx, containerName, r.StagedDir, r.PluginName); err != nil {
// Hot-reload classifier (molecule-core#112) — decide BEFORE the
// install whether this update can skip restartFunc. SKILL.md
// content changes are filesystem-visible to Claude Code on the
// next Skill invocation; hooks / settings.json / plugin.yaml /
// added-or-removed files need a container restart.
// Classifier reads live tree from container; on any read error
// it returns kindCold so we never hot-reload speculatively.
kind, _ := h.classifyInstallChanges(ctx, containerName, r.StagedDir, r.PluginName)
// Atomic stage→snapshot→swap→marker (molecule-core#114).
// Replaces the prior single docker.CopyToContainer write that
// left a partially-extracted tree on mid-install failure with
// no rollback path. atomicCopyToContainer writes a .complete
// marker as the last step; workspace-side plugin loaders should
// refuse to load a plugin dir without it.
if err := h.atomicCopyToContainer(ctx, containerName, r.StagedDir, r.PluginName); err != nil {
log.Printf("Plugin install: failed to copy %s to %s: %v", r.PluginName, workspaceID, err)
return newHTTPErr(http.StatusInternalServerError, gin.H{"error": "failed to copy plugin to container"})
}
@@ -284,7 +308,11 @@ func (h *PluginsHandler) deliverToContainer(ctx context.Context, workspaceID str
"chown", "-R", "1000:1000", "/configs/plugins/" + r.PluginName,
})
if h.restartFunc != nil {
go h.restartFunc(workspaceID)
if kind == classifyKindSkillContentOnly {
log.Printf("Plugin install: %s → workspace %s — SKILL-content-only update, SKIPPING restart", r.PluginName, workspaceID)
} else {
go h.restartFunc(workspaceID)
}
}
return nil
}
@@ -0,0 +1,78 @@
package handlers
// plugins_tracking.go — workspace_plugins DB tracking for the
// version-subscription model (core#113).
//
// Schema lives in migration 20260508160000_workspace_plugins_tracking.up.sql.
// This file is the Go-side write surface used at install time to record
// each plugin's install record. Drift detection / queue / apply are
// follow-up scope (filed as a separate issue once this lands).
import (
"context"
"errors"
"fmt"
"strings"
"github.com/Molecule-AI/molecule-monorepo/platform/internal/db"
)
// trackedRefValues is the closed set of bare-string values the
// workspace_plugins.tracked_ref column accepts. Prefixed values
// ("tag:..." / "sha:...") are validated structurally below.
var trackedRefValues = map[string]bool{
"none": true,
}
// validateTrackedRef returns the canonical form of a track string, or
// an error if the input is malformed. Empty input → "none" (default).
//
// Accepted shapes:
//
// "" — defaults to "none"
// "none" — no tracking
// "tag:vX.Y.Z" — track a specific tag
// "tag:latest" — track latest tag, drift on every new tag
// "sha:<full-sha>" — pinned to commit SHA
func validateTrackedRef(s string) (string, error) {
s = strings.TrimSpace(s)
if s == "" {
return "none", nil
}
if trackedRefValues[s] {
return s, nil
}
if strings.HasPrefix(s, "tag:") && len(s) > 4 {
return s, nil
}
if strings.HasPrefix(s, "sha:") && len(s) > 4 {
return s, nil
}
return "", fmt.Errorf("invalid track value %q: expected 'none' | 'tag:vX.Y.Z' | 'tag:latest' | 'sha:<full>'", s)
}
// recordWorkspacePluginInstall upserts the workspace_plugins row for a
// plugin install. ON CONFLICT (workspace_id, plugin_name) DO UPDATE so
// reinstalling the same plugin name (with a possibly-different source or
// track value) updates the existing row rather than failing.
func recordWorkspacePluginInstall(
ctx context.Context, workspaceID, pluginName, sourceRaw, track string,
) error {
if workspaceID == "" || pluginName == "" || sourceRaw == "" {
return errors.New("recordWorkspacePluginInstall: missing required field")
}
canonicalTrack, err := validateTrackedRef(track)
if err != nil {
return err
}
_, err = db.DB.ExecContext(ctx, `
INSERT INTO workspace_plugins (workspace_id, plugin_name, source_raw, tracked_ref)
VALUES ($1, $2, $3, $4)
ON CONFLICT (workspace_id, plugin_name)
DO UPDATE SET
source_raw = EXCLUDED.source_raw,
tracked_ref = EXCLUDED.tracked_ref,
updated_at = NOW()
`, workspaceID, pluginName, sourceRaw, canonicalTrack)
return err
}
@@ -0,0 +1,54 @@
package handlers
import "testing"
// TestValidateTrackedRef: pin the exact set of accepted track values
// the install endpoint stores. Drift detector reads this column; any
// value that slips through here without structural validation would
// silently fail at drift-check time.
func TestValidateTrackedRef(t *testing.T) {
cases := []struct {
in string
want string
err bool
}{
// Defaults
{"", "none", false},
{" ", "none", false},
{"none", "none", false},
// Tag shape
{"tag:v1.0.0", "tag:v1.0.0", false},
{"tag:v0.4.0-gitea.1", "tag:v0.4.0-gitea.1", false},
{"tag:latest", "tag:latest", false},
// SHA shape
{"sha:abc123", "sha:abc123", false},
{"sha:0123456789abcdef0123456789abcdef01234567", "sha:0123456789abcdef0123456789abcdef01234567", false},
// Reject malformed
{"tag:", "", true}, // empty after prefix
{"sha:", "", true}, // empty after prefix
{"latest", "", true}, // bare 'latest' is ambiguous (tag? branch?)
{"main", "", true}, // bare branch name not allowed
{"v1.0.0", "", true}, // missing tag: prefix
{"random", "", true}, // not in allowlist
{"tag", "", true}, // prefix without separator
}
for _, tc := range cases {
got, err := validateTrackedRef(tc.in)
if tc.err {
if err == nil {
t.Errorf("validateTrackedRef(%q) = (%q, nil); want error", tc.in, got)
}
continue
}
if err != nil {
t.Errorf("validateTrackedRef(%q) error: %v", tc.in, err)
continue
}
if got != tc.want {
t.Errorf("validateTrackedRef(%q) = %q; want %q", tc.in, got, tc.want)
}
}
}
@@ -207,20 +207,25 @@ func TestStartSweeper_TransientErrorDoesNotCrashLoop(t *testing.T) {
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
// 50ms ticker so the second cycle fires quickly enough for the test.
// We re-export SweepInterval as a const, but tests use the public
// StartSweeper that takes its own interval — wait, the public
// StartSweeper signature uses the package-level SweepInterval. Hmm,
// this means the test takes ~5 minutes. Let me reconsider.
//
// (We patch the test below to just look at the immediate-sweep call
// + an error path, since the immediate call is enough to prove the
// "error doesn't crash" contract — the loop continues afterward
// regardless of timing.)
// Capture metric baseline so we can wait for the error counter to
// settle before returning — otherwise this test's leaked metric
// write races with the next test's metricDelta() baseline read and
// causes a non-deterministic +1 leak (manifests as
// TestStartSweeper_RecordsMetricsOnSuccess: "error counter delta=1,
// want 0"). cycleDone fires inside the fake's Sweep defer, BEFORE
// sweepOnce records the error metric — so cancel() right after
// waitForCycle is too early.
_, _, deltaError := metricDelta(t)
go pendinguploads.StartSweeper(ctx, store, time.Hour)
// Wait for the first (errored) cycle.
store.waitForCycle(t, 1, 2*time.Second)
// Wait for the goroutine to record the error metric. After this
// returns, sweepOnce has fully completed and a subsequent cancel()
// stops the loop on the next select pass with no in-flight metric
// writes outstanding.
waitForMetricDelta(t, deltaError, 1, 2*time.Second)
// Cancel — the goroutine returns cleanly, proving the error path
// didn't crash the loop. Without this fix the goroutine would have
// either panicked (process abort visible at exit) or stuck (this
@@ -0,0 +1,3 @@
DROP INDEX IF EXISTS workspace_plugins_tracked_not_none;
DROP INDEX IF EXISTS workspace_plugins_workspace_name;
DROP TABLE IF EXISTS workspace_plugins;
@@ -0,0 +1,39 @@
-- workspace_plugins: per-workspace record of installed plugins, with the
-- tracked-ref needed for the version-subscription model (core#113).
--
-- Today plugin install state is filesystem-only — `/configs/plugins/<name>/`
-- inside the workspace container. There's no DB record of "what's installed
-- where, from what source, pinned to what." That's fine until you want
-- drift detection (compare upstream tag's resolved SHA vs the installed
-- one) and that's the foundation this table provides.
--
-- This migration is purely additive: existing install paths keep working;
-- they'll write to this table on next install. Workspaces with plugins
-- already installed before this migration won't have rows until they're
-- re-installed (acceptable — the tracking is forward-looking).
--
-- tracked_ref values:
-- 'none' — no auto-update tracking (default)
-- 'tag:vX.Y.Z' — track a specific version tag
-- 'tag:latest' — track the latest tag (drift on every new tag)
-- 'sha:<full>' — pinned to a specific commit SHA (no drift ever)
--
-- A subsequent migration adds the plugin_update_queue table once drift
-- detection lands.
CREATE TABLE IF NOT EXISTS workspace_plugins (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
workspace_id UUID NOT NULL REFERENCES workspaces(id) ON DELETE CASCADE,
plugin_name TEXT NOT NULL,
source_raw TEXT NOT NULL,
tracked_ref TEXT NOT NULL DEFAULT 'none',
installed_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE UNIQUE INDEX IF NOT EXISTS workspace_plugins_workspace_name
ON workspace_plugins(workspace_id, plugin_name);
-- Partial index for the drift detector: only scan rows opted into tracking.
CREATE INDEX IF NOT EXISTS workspace_plugins_tracked_not_none
ON workspace_plugins(tracked_ref) WHERE tracked_ref != 'none';