Files

T

History

Hongming Wang caf19e8980 feat(ops): hourly alarm for auto-promote PR stuck on REVIEW_REQUIRED (#2975 )

Closes the silent-block failure mode that left 25 commits — including
the Memory v2 redesign and the reno-stars data-loss fix — wedged on
staging for 12+ hours behind a single missing review. The auto-promote
workflow opened the PR + armed auto-merge, but main's branch protection
required a human review and nobody noticed until a user reported
"still seeing old memory tab".

## Detection logic — `scripts/check-stale-promote-pr.sh`

Reads open PRs `base=main head=staging` and alarms on:
  - `mergeStateStatus == BLOCKED`
  - `reviewDecision == REVIEW_REQUIRED`
  - createdAt older than `STALE_HOURS` (default 4h)

Other BLOCKED reasons (DIRTY, BEHIND, failed checks) are NOT alarmed —
those are the author's signal-to-fix. This script targets the specific
"no human reviewed yet" wedge.

Output:
  - `::warning` per stale PR (visible in workflow summary + Actions UI)
  - PR comment (idempotent via marker-string detection; one alarm
    per PR, never re-spammed)
  - Exit code = count of stale PRs (capped at 125)

Logic in a script (not inline workflow YAML) so it's:
  - **Unit-testable** — tests/test-check-stale-promote-pr.sh exercises
    every branch with stubbed fixture JSON + frozen clock. 23 tests
    covering: empty list, single stale, just-under-threshold, wrong
    reviewDecision, wrong mergeStateStatus, mixed list (only matching
    PRs alarm), custom threshold via --stale-hours, exit-code-counts-
    matching-PRs, --help, unknown arg → 64, missing repo → 2.
  - **Operator-runnable ad-hoc** — `scripts/check-stale-promote-pr.sh`
    works from any shell with `gh` + `jq`.
  - **SSOT** — one detector, the workflow YAML is just schedule +
    invocation surface. Future sibling workflows that need the same
    check call the same script.

## Workflow — `.github/workflows/auto-promote-stale-alarm.yml`

Triggers:
  - cron `27 * * * *` (hourly, off-the-hour to dodge cron herd)
  - workflow_dispatch with `stale_hours` + `post_comment` overrides

Concurrency: `auto-promote-stale-alarm` group, cancel-in-progress=false
(idempotent script; no benefit to cancelling a running scan).

Permissions: `contents: read` + `pull-requests: write` (post comments).

Sparse checkout — only fetches `scripts/check-stale-promote-pr.sh`.
No node_modules, no go modules, no slow setup steps. Workflow runs
in <30s on a clean repo.

## Why "alarm + comment" not "auto-approve"

Considered options in issue #2975:
  1. Slack/email alert — picked.
  2. Bot-account auto-approve via molecule-ops — circumvents the
     human-review gate that branch protection encodes.
  3. Trusted-promote bypass via CODEOWNERS — needs Org Admin config
     change; out of scope for a workflow PR.

The comment-on-PR pattern picks (1) without external dependencies
(no Slack token, no email config). Subscribers get notified via
GitHub's existing PR notification delivery; the warning shows up in
the Actions feed.

## Why this won't false-positive on legitimate slow reviews

Threshold is 4h. Most legitimate gates clear in <1h, so 4× headroom
is plenty for slow CI. The comment is idempotent (one alarm per PR,
never re-posted) — adding noise stops at 1 comment regardless of
how long the PR sits.

## Test plan

- [x] `bash scripts/test-check-stale-promote-pr.sh` — 23/23 pass
- [x] `python3 -c 'yaml.safe_load(...)'` clean
- [x] `bash -n` clean on both scripts
- [ ] Live verification: dispatch the workflow once main has caught up,
      confirm it correctly reports zero stale PRs

2026-05-05 17:55:27 -07:00

demo-freeze-snapshots

ops: demo-day freeze + rollback runbook

2026-05-01 12:04:30 -07:00

ops

feat(ops): add sweep-aws-secrets janitor — orphan tenant bootstrap secrets

2026-05-03 02:38:08 -07:00

build_runtime_package.py

fix(build): register a2a_response in TOP_LEVEL_MODULES

2026-05-05 17:34:05 -07:00

build-images.sh

initial commit — Molecule AI platform

2026-04-13 11:55:37 -07:00

bundle-compile.sh

initial commit — Molecule AI platform

2026-04-13 11:55:37 -07:00

canary-smoke.sh

feat(canary): smoke harness + GHA verification workflow (Phase 2)

2026-04-19 03:30:19 -07:00

check-cascade-list-vs-manifest.sh

feat(ci): structural drift gate for cascade list vs manifest (RFC #388 PR-3)

2026-05-03 03:52:39 -07:00

check-stale-promote-pr.sh

feat(ops): hourly alarm for auto-promote PR stuck on REVIEW_REQUIRED (#2975 )

2026-05-05 17:55:27 -07:00

cleanup-rogue-workspaces.sh

fix(provisioner): stop rogue config-missing restart loop (#17 )

2026-04-14 07:32:58 -07:00

clone-manifest.sh

fix(quickstart): wire up template/plugin registry via manifest.json

2026-04-23 14:55:34 -07:00

demo-day-runbook.md

ops: demo-day freeze + rollback runbook

2026-05-01 12:04:30 -07:00

demo-freeze.sh

ops: demo-day freeze + rollback runbook

2026-05-01 12:04:30 -07:00

demo-thaw.sh

ops: demo-day freeze + rollback runbook

2026-05-01 12:04:30 -07:00

dev-start.sh

fix(dev-start): detect missing Go and fall back to docker-compose platform

2026-04-29 20:04:37 -07:00

import-agent.sh

initial commit — Molecule AI platform

2026-04-13 11:55:37 -07:00

lockdown-tenant-sg.sh

feat(security): Phase 35.1 — SG lockdown script for tenant EC2 instances

2026-04-18 12:01:41 -07:00

measure-coordinator-task-bounds-runner.sh

fix(harness-runner): switch from non-existent /heartbeat-history to /activity

2026-04-28 23:12:51 -07:00

measure-coordinator-task-bounds.sh

docs: registry pattern + harness scripts READMEs

2026-04-28 22:19:40 -07:00

nuke-and-rebuild.sh

fix(scripts): nuke-and-rebuild self-bootstraps templates; add E2E test

2026-04-26 14:37:04 -07:00

post-rebuild-setup.sh

security: remove hardcoded API keys from post-rebuild-setup.sh

2026-04-20 13:02:52 -07:00

README.md

docs(scripts): rename /heartbeat-history → /activity in README

2026-04-29 02:23:00 -07:00

refresh-workspace-images.sh

feat(platform/admin): /admin/workspace-images/refresh + Docker SDK + GHCR auth

2026-04-26 10:17:21 -07:00

rollback-latest.sh

fix(scripts): correct platform dir path + add ROOT isolation (shellcheck clean)

2026-04-22 15:42:24 +00:00

test_build_runtime_package.py

chore: rewriter unit tests + drop misleading noqa on import inbox

2026-04-30 20:45:32 -07:00

test-a2a-cross-runtime.sh

initial commit — Molecule AI platform

2026-04-13 11:55:37 -07:00

test-all-adapters.sh

initial commit — Molecule AI platform

2026-04-13 11:55:37 -07:00

test-all-runtimes-a2a-e2e.sh

test(e2e): wire SaaS auth headers (TENANT_ADMIN_TOKEN + TENANT_ORG_ID)

2026-05-02 04:36:23 -07:00

test-all.sh

initial commit — Molecule AI platform

2026-04-13 11:55:37 -07:00

test-check-stale-promote-pr.sh

feat(ops): hourly alarm for auto-promote PR stuck on REVIEW_REQUIRED (#2975 )

2026-05-05 17:55:27 -07:00

test-cross-agent-chat.sh

initial commit — Molecule AI platform

2026-04-13 11:55:37 -07:00

test-hermes-plugin-e2e.sh

test(e2e): unified A2A round-trip parity harness across all 4 runtimes

2026-05-02 04:36:23 -07:00

test-nuke-and-rebuild.sh

fix(scripts): nuke-and-rebuild self-bootstraps templates; add E2E test

2026-04-26 14:37:04 -07:00

test-team-e2e.sh

initial commit — Molecule AI platform

2026-04-13 11:55:37 -07:00

wheel_smoke.py

feat(mcp): notifications/claude/channel for push-feel inbox UX

2026-04-30 20:10:01 -07:00

README.md

scripts/

Operational and one-off scripts for molecule-core. Most are self-documenting — see the header comments in each file.

RFC #2251 coordinator task-bound harnesses

There are three related scripts; pick the right one:

Script	Purpose	Targets
`measure-coordinator-task-bounds.sh`	Canonical v1 harness for the RFC #2251 / Issue 4 reproduction. Provisions a PM coordinator + Researcher child via `claude-code-default` + `langgraph` templates, sends a synthesis-heavy A2A kickoff, observes elapsed time + activity trace.	OSS-shape platform — localhost or any `/workspaces`-shaped endpoint. Has tenant/admin-token guards for non-localhost runs.
`measure-coordinator-task-bounds-runner.sh`	Generalised runner for the same measurement contract but with arbitrary template + secret + model combinations (Hermes/MiniMax, etc.). Useful for cross-runtime variants without modifying the canonical harness.	Same as above (local or SaaS via `MODE=saas`).
`measure-coordinator-task-bounds.sh` (in molecule-controlplane)	Production-shape variant that bootstraps a real staging tenant via `POST /cp/admin/orgs`, then runs the same measurement against `<slug>.staging.moleculesai.app`.	Staging controlplane only — refuses to run against production.

See reference_harness_pair_pattern (auto-memory) for when to use which and the cross-repo design rationale.

Common safety pattern across all three

Cleanup trap on EXIT/INT/TERM auto-deletes provisioned resources.
DRY_RUN=1 prints plan + auth fingerprint, exits before any state mutation. Run this before pointing at staging or any shared infrastructure.
Non-target guard refuses arbitrary endpoints (the controlplane variant is locked to staging-api.moleculesai.app; the OSS variant requires explicit auth + tenant scoping for non-localhost PLATFORM).
Cleanup failures emit cleanup_*_failed events with remediation hints; no silenced curl. ADMIN_TOKEN expiring mid-run surfaces as a structured event rather than a silent leak.

Activity trace caveat

If activity_trace.raw == "<endpoint_unavailable>", the per-workspace /activity endpoint isn't wired on the target build — the bound measurement is INCONCLUSIVE on the platform-ceiling question. Either wire the endpoint or replace with the equivalent Datadog query. Note that /activity accepts a since_secs query parameter; see the endpoint handler for the supported range.

Other scripts

cleanup-rogue-workspaces.sh — emergency teardown for leaked workspaces. Prompts for confirmation. Pair with the harnesses if a cleanup trap fails (see cleanup_*_failed events).
canary-smoke.sh — quick smoke test for canary releases.
dev-start.sh — local-dev platform bring-up.

The rest are self-documenting in their header comments.