molecule-core

Author	SHA1	Message	Date
claude-ceo-assistant	0be89053e8	Merge pull request 'chore(observability): edge-429 probe + ratelimit runbook (unblocks #62 , #64 )' (#85 ) from chore/edge-429-probe-and-ratelimit-runbook into main	2026-05-07 22:53:48 +00:00
claude-ceo-assistant	d81fb98163	Merge branch 'main' into fix/issue-72-auto-sync-token-canary-v2	2026-05-07 22:53:32 +00:00
claude-ceo-assistant	4d5c9a6646	Merge branch 'main' into fix/issue75-class-F-gh-run-list-to-statuses	2026-05-07 22:53:26 +00:00
claude-ceo-assistant	9ecee78782	Merge branch 'main' into fix/issue75-class-A-gh-pr-to-gitea-rest	2026-05-07 22:53:11 +00:00
claude-ceo-assistant	141dfdae52	Merge branch 'main' into feat/issue-63-local-build-from-gitea-v2	2026-05-07 22:53:04 +00:00
claude-ceo-assistant	d21c09babe	Merge branch 'main' into fix/195-auto-promote-staging-gitea-rest	2026-05-07 22:53:00 +00:00
claude-ceo-assistant	2b3a8f2e4d	Merge branch 'main' into fix/196-retarget-main-to-staging-gitea-rest	2026-05-07 22:52:35 +00:00
security-auditor	9eb530bbf0	Merge remote-tracking branch 'origin/main' into chore/edge-429-probe-and-ratelimit-runbook	2026-05-07 15:52:29 -07:00
security-auditor	62e793040e	chore(observability): edge-429 probe + ratelimit observability runbook Two artifacts that unblock the parked follow-ups from #59: 1. scripts/edge-429-probe.sh (closes the "operator-blocked" status of #62). An operator without CF/Vercel dashboard access can reproduce a canvas-sized burst against a tenant subdomain and read each 429's response shape — workspace-server bucket overflow (JSON body + X-RateLimit-* headers) is distinguishable from CF (cf-ray) and Vercel (x-vercel-id) by inspection of the report. Read-only, parallel via background subshells (no GNU parallel dependency), no credential use. Smoke-tested against example.com end-to-end. 2. docs/engineering/ratelimit-observability.md (closes the "metric-blocked" status of #64). The existing molecule_http_requests_total{path,status} counter + X-RateLimit-* response headers already cover #64's acceptance criterion ("watch metrics for two weeks"). The runbook collects the PromQL queries, a decision tree for the re-tune (keep / per-tenant override / change default), an alert rule template, and a hard "do not roll ad-hoc per-bucket-key exposure" note (in-memory map includes SHA-256 of bearer tokens — exposing it is a security review surface, file a follow-up if needed). Neither artifact changes runtime behaviour. Pure operational tooling. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 15:48:34 -07:00
devops-engineer	34e05c35b9	chore: sync main → staging (auto, `6946cd12`)	2026-05-07 22:45:14 +00:00
claude-ceo-assistant (Claude Opus 4.7 on Hongming's MacBook)	16868c4ec1	fix(plugins): SaaS (EC2-per-workspace) install/uninstall via EIC SSH Closes the 🔴 docker-only row in docs/architecture/backends.md. Plugin install on every SaaS tenant currently 503s with "workspace container not running" because the handler is hardcoded to Docker exec but SaaS workspaces live on per-workspace EC2s. Caught on hongming.moleculesai.app when canvas POST /workspaces/<id>/plugins surfaced the error. Mirrors the Files API PR #1702 pattern: dispatch on workspaces.instance_id in deliverToContainer (and Uninstall). When set, push the staged plugin tarball to the EC2 over the existing withEICTunnel primitive (template_files_eic.go) and unpack into the runtime's bind-mounted config dir (/configs for claude-code, /home/ubuntu/.hermes for hermes — see workspaceFilePathPrefix). chown 1000:1000 to match the docker path's agent-uid contract; restart via the existing dispatcher. Direct host write rather than docker-cp via SSH because the runtime's config dir is already bind-mounted into the workspace container — the runtime sees the files on next start with no additional plumbing. Adds InstanceIDLookup (parallel to RuntimeLookup) so unit tests don't need a DB; production wires it in router.go like templates.go does. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 15:42:51 -07:00
claude-ceo-assistant	6946cd12c5	ci(branch-protection): check-name parity gate (#144 ) (#56 ) Adds tools/branch-protection/check_name_parity.sh regression guard + 6 shell tests + branch-protection-drift.yml wire-up. Closed #144. Approved by security-auditor.	2026-05-07 22:42:08 +00:00
devops-engineer	e43bd7ceb0	chore: 2nd verification trigger for #75 class A (per Phase 4 ≥2 green runs) Empty commit to trigger CI a second consecutive time per the SOP 'verify ≥1 representative workflow per class via workflow_dispatch or push event ... ≥2 consecutive successful runs per class'. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 15:41:00 -07:00
claude-ceo-assistant	85140f1c72	Merge branch 'main' into fix/issue-72-auto-sync-token-canary-v2	2026-05-07 22:40:56 +00:00
devops-engineer	5b3ce5c818	fix(ci): replace gh run list with Gitea commit-status query (#75 class F) Part of the post-#66 sweep to remove `gh` CLI dependencies that fail silently against Gitea. Class F covers `gh run list --workflow=X --commit=SHA` shapes — querying whether a specific workflow ran (and how it finished) for a specific SHA. Why this is the only call site in class F: `gh run list` hits GitHub's `/repos/.../actions/runs` REST endpoint. Gitea exposes ZERO endpoints under `/repos/.../actions/runs` — verified 2026-05-07 via swagger inspection: only secrets, variables, and runner-registration tokens live under /actions/. There's no way to query workflow run state via the Gitea v1 API directly. However, every Gitea Actions job DOES emit a commit status with `context = "<Workflow Name> / <Job Name> (<event>)"` (verified 2026-05-07 by reading /repos/.../commits/{sha}/statuses on a recent main SHA). That surface is exactly what we need: each workflow run leg is one status row, the aggregate state encodes the run outcome, and Gitea exposes it under `/api/v1/repos/.../commits/{sha}/statuses` which IS available. Affected: `auto-promote-on-e2e.yml` (lines 172-180): Old: `gh run list --workflow e2e-staging-saas.yml --commit $SHA --json status,conclusion --jq ...` returning a 5-bucket string like `completed/success` \| `in_progress/none` \| `none/none` \| `completed/failure` \| `completed/cancelled`. New: `curl /api/v1/repos/.../commits/$SHA/statuses` + jq filter on contexts whose name starts with `"E2E Staging SaaS (full lifecycle) /"`. Mapping: 0 matched contexts → "none/none" (E2E paths- filtered out — same as before) any context = pending → "in_progress/none" (defer) any context = error\|failure → "completed/failure" (abort) all contexts = success → "completed/success" (proceed) The `completed/cancelled` arm of the case statement becomes unreachable: Gitea status API doesn't expose a `cancelled` state (it has success/failure/error/pending/warning), so per-SHA concurrency cancellations now surface as `failure` and are handled by the failure branch. Documented in-place; the cancelled arm is kept as defense-in-depth for any future dual-host operation. Verification: - Live curl against the current main SHA returns `none/none` (E2E was paths-filtered for that change set — expected). - Synthetic-input jq tests verify all four mapping buckets: no contexts → "none/none" one context = pending → "in_progress/none" success + success → "completed/success" success + failure → "completed/failure" - YAML syntax validates. Token: continues to use act_runner's GITHUB_TOKEN (per-run, repo read scope). The `/commits/{sha}/statuses` endpoint is repo-scoped, no extra perms needed. Closes part of #75. Master tracking issue at #75; companion PRs: #80 (class A — `gh pr ...`), #81 (class D — `gh api ...`). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 15:38:57 -07:00
claude-ceo-assistant	bcc72419ce	Merge branch 'main' into fix/144-branch-protection-check-name-parity-audit	2026-05-07 22:35:33 +00:00
claude-ceo-assistant	e4e1bf4080	ci(canary): annotate EXPECTED_PERSONA dual-update constraint Hostile-self-review weakest-spot #2: if the devops-engineer persona is ever renamed, the canary will go red even if everything else is fine. Add an inline comment pointing the next editor at both files that must update together (auto-sync-main-to-staging.yml's git config + this canary's EXPECTED_PERSONA + the staging branch protection's push_whitelist_usernames). No behaviour change — comment-only.	2026-05-07 15:35:22 -07:00
claude-ceo-assistant	62629eda4a	ci(canary): rewrite Probe 3 to actually validate auth (NOP push --dry-run) While verifying Phase 4, found a real flaw in Probe 3 (`git ls-remote refs/heads/staging`). On a public repo (which molecule-core is), Gitea falls back to anonymous read on bad auth, so `ls-remote` succeeds even with a junk token. The probe was therefore green-lighting rotated tokens — false-green, the worst possible canary failure mode. Rewritten to use `git push --dry-run` of the current staging SHA back to `refs/heads/staging`: - Push always authenticates (auth-gated on smart-protocol handshake, before the dry-run can compute the empty-diff). - NOP by construction: pushing the current tip back to itself is "Everything up-to-date" with exit 0. - Bad token → "Authentication failed", exit 128. - Doesn't reach pre-receive (where branch-protection authz runs), so scope is "auth only" — matches the design intent (failure mode B); authz already covered daily by branch-protection-drift.yml. Implementation note: `git push` requires a local repo. Spinning up a fresh `git init` in a tempdir (~1KB, ~50ms) instead of pulling the full repo via actions/checkout — actions/checkout would clone ~hundreds of MB for what amounts to "a place to run git from." Local mutation tests pass: - Real token: "Everything up-to-date" exit 0 - Junk token: "Authentication failed" exit 128 with actionable ::error:: messages pointing at the runbook Header comment + runbook step-mapping updated to reflect new probe shape. Refs: #72	2026-05-07 15:34:34 -07:00
devops-engineer	224b65764d	chore: sync main → staging (auto, `050cb035`)	2026-05-07 22:34:17 +00:00
claude-ceo-assistant	050cb035d6	fix(ci): pre-clone manifest deps in harness-replays workflow (#50 ) Mirrors PR #66/#173 pre-clone-manifest pattern. Closes #173 (followup). Approved by security-auditor.	2026-05-07 22:33:51 +00:00
devops-engineer	e075557b19	fix(ci): replace gh pr CLI with Gitea v1 REST in workflows + scripts (#75 class A) Part of the post-#66 sweep to remove `gh` CLI dependencies that fail silently against Gitea (which exposes /api/v1 only — no GraphQL → 405, no /api/v3 → 404). Class A covers `gh pr list / view / diff / comment` shapes. Affected: - `.github/workflows/auto-tag-runtime.yml` Replaced `gh pr list --search SHA --json number,labels` with a curl to `/api/v1/repos/.../pulls?state=closed&sort=newest&limit=50` + jq filter on `merge_commit_sha == github.sha`. Same end-to-end behaviour: locate the merged PR for this push, read its labels, pick the bump kind. Defensive `?.name // empty` jq guard handles unlabelled PRs without erroring. The 50-PR window is comfortably larger than the volume of staging→main promotes that close in any reasonable detection window. - `scripts/check-stale-promote-pr.sh` Rewrote `fetch_prs` and `post_comment` to call Gitea's REST API directly. Gitea doesn't expose GitHub's compound `mergeStateStatus` / `reviewDecision` fields, so the new fetcher pulls `/pulls?state=open&base=main` then for each PR pulls `/pulls/{n}/reviews` and synthesizes the GitHub-shape JSON the rest of the script (and the existing fixture-based unit tests) consume: BLOCKED + REVIEW_REQUIRED ↔ mergeable=true AND 0 APPROVED reviews DIRTY ↔ mergeable=false (alarm doesn't fire) CLEAN + APPROVED ↔ mergeable=true AND ≥1 APPROVED review Comment-posting moves to `POST /repos/.../issues/{n}/comments` (Gitea treats PRs as issues for the comment surface, same as GitHub's REST). All 23 fixture-driven unit tests still pass — fixtures pass GitHub-shape JSON via PR_FIXTURE which short-circuits the live fetch path. - `scripts/ops/check_migration_collisions.py` Replaced `gh pr list` + `gh pr diff` calls with stdlib `urllib` against /api/v1. Helper `_gitea_get` centralizes auth + error handling; uses GITEA_TOKEN env, falling back to GITHUB_TOKEN (act_runner) and GH_TOKEN. Return shape from `open_prs_with_migration_prefix` mimics the historical `--json number,headRefName` so the call sites are unchanged. All 9 regex-classifier unit tests still pass; live integration test against the production Gitea API returns 0 collisions for prefix=999 as expected. curl invocation pattern is `curl --fail-with-body -sS` (NOT `-fsS` — the two short-fail flags are mutually exclusive in modern curl; caught by `curl: You must select either --fail or --fail-with-body, not both` during local verification). Token model: workflows pass act_runner's GITHUB_TOKEN (per-run, repo read scope) — same surface used by the auto-sync fix in PR #66 plus the surrounding workflows. No new repo secrets required. Verification: bash unit tests (23/23 pass), python unittest (9/9 pass), live curl call against production Gitea returns 200 with the expected shape, YAML / shell / Python syntax all validate. Closes part of #75. Other classes (D — `gh api`; F — `gh run list`) land in follow-up PRs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 15:29:26 -07:00
devops-engineer	fab65c78d6	fix(ci): rewrite retarget-main-to-staging for Gitea REST API Root cause: same as #65/#73 — gh CLI calls Gitea GraphQL (/api/graphql) which returns HTTP 405. Specifically: - gh api -X PATCH /pulls/{N} sometimes works but is flaky on Gitea (depends on gh's host-resolution layer) - gh pr close / gh pr comment route through GraphQL → 405 Fix: replace all gh calls with direct curl REST calls to Gitea: - PATCH /api/v1/repos/{owner}/{repo}/pulls/{index} body {"base": "staging"} — retarget the PR base - POST /api/v1/repos/{owner}/{repo}/issues/{index}/comments — post the explainer comment (PRs are issues in Gitea, comments share the issue endpoint) - PATCH /api/v1/repos/{owner}/{repo}/pulls/{index} body {"state": "closed"} — close redundant PR for #1884 case Identity: switch from secrets.GITHUB_TOKEN (per-job ephemeral, narrow scope on Gitea) to secrets.AUTO_SYNC_TOKEN (devops-engineer persona). Same persona used by auto-sync (#66) and auto-promote (#78). Per feedback_per_agent_gitea_identity_default. PR-edit and comment do not need branch-protection bypass. Curl-status-capture pattern hardened per feedback_curl_status_capture_pollution: http_code via -w to its own scalar, body to a tempfile, set +e/-e bracket so curl's non-zero-on-4xx doesn't pollute the script's exit chain. Header comment block fully rewritten with 4 failure-mode runbooks (A: 422 dup-base, B: token rotated, C: PR deleted, D: filter mis-fire) per PR #66/#78's pattern. Refs: #65, #74, #196, PR #66 + #78 (canonical reference) Closes #74 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 15:28:26 -07:00
claude-ceo-assistant	0cef033a6a	ci(canary): route curl -w to tempfile to satisfy status-capture lint The two API probes used the unsafe shape rejected by lint-curl-status-capture.yml (per feedback_curl_status_capture_pollution): status=$(curl ... -w '%{http_code}' ... \|\| echo "000") When curl exits non-zero (transport error, --fail-with-body 4xx/5xx), the `-w` already wrote a code; the `\|\| echo "000"` then APPENDS another "000", yielding "000000" or "409000" — passes shape checks while looking right. Switch to the canonical safe shape (set +e + tempfile + cat): set +e curl ... -w '%{http_code}' >code_file 2>/dev/null set -e status=$(cat code_file 2>/dev/null \|\| true) [ -z "$status" ] && status="000" Inline comment in both probe steps explains the lint constraint so the next editor doesn't re-introduce the bad pattern. Refs: #72, lint failure on PR #77 (1/22 red → 22/22 expected green)	2026-05-07 15:26:22 -07:00
claude-ceo-assistant	b83b533381	Merge branch 'main' into fix/144-branch-protection-check-name-parity-audit	2026-05-07 22:24:45 +00:00
claude-ceo-assistant	a23cf6a6bb	Merge branch 'main' into fix/harness-replays-pre-clone-manifest	2026-05-07 22:24:42 +00:00
devops-engineer	6acd63fa5a	fix(ci): rewrite auto-promote staging→main for Gitea REST API Root cause: same as #65/PR-#66 — gh CLI calls Gitea GraphQL (/api/graphql) which returns HTTP 405. Additionally, gh workflow run calls /actions/workflows/{id}/dispatches which does not exist on Gitea 1.22.6 (verified via swagger.v1.json). Fix: - Replace gh run list with Gitea REST combined-status endpoint (GET /repos/{owner}/{repo}/commits/{ref}/status). Combined state encodes the AND across every check context — simpler than the per-workflow loop and immune to workflow-name collisions. - Replace gh pr create / merge --auto with direct curl calls to POST /pulls and POST /pulls/{N}/merge with merge_when_checks_succeed. - Remove the post-merge polling tail entirely. The GitHub-era GITHUB_TOKEN no-recursion rule does not apply on Gitea Actions (verified empirically: PR #66 merge fired downstream pushes naturally). Even if we wanted to dispatch, Gitea has no workflow_dispatch REST endpoint. Critical constraint: main has enable_push: false with no whitelist; direct push is impossible for any persona. PR-mediated merge is the only path. main has required_approvals: 1 — auto-merge waits for Hongming's approval before landing, preserving the feedback_prod_apply_needs_hongming_chat_go contract. Identity: AUTO_SYNC_TOKEN (devops-engineer persona). Not founder PAT. Per feedback_per_agent_gitea_identity_default. Same persona used by auto-sync (PR #66) — keeps identity model coherent. Header comment block fully rewritten with 4 failure-mode runbooks (A: gates not green, B: PR-create non-201, C: merge schedule fails, D: token rotated/scope wrong) per PR #66's pattern. Refs: #65, #73, #195, PR #66 (canonical reference) Closes #73 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 15:24:28 -07:00
claude-ceo-assistant	bfc393c065	ci: add AUTO_SYNC_TOKEN rotation drift canary (#72 ) Adds a 6h-cron synthetic check that fires the auth surface used by auto-sync-main-to-staging.yml (PR #66) and emits a red workflow status when AUTO_SYNC_TOKEN has drifted out of validity. Closes hostile-self-review weakest-spot #3 from PR #66 (token-rotation detection latency). Read-only verification — no writes, no synthetic merge commits, no canary branch noise. Three probes: 1. GET /api/v1/user → token authenticates as devops-engineer 2. GET /api/v1/repos/molecule-ai/molecule-core → read:repository scope 3. git ls-remote refs/heads/staging → exact HTTPS auth path used by actions/checkout in the real auto-sync workflow Hard-fail on missing AUTO_SYNC_TOKEN secret on both schedule and workflow_dispatch — per feedback_schedule_vs_dispatch_secrets_hardening, a silent soft-skip would make the canary itself drift-invisible (the sweep-cf-orphans #2088 lesson). Operator runbook in workflow header. Token reuse: same AUTO_SYNC_TOKEN as the workflow under monitor; no new credential introduced. Read-only paths only. Refs: #72, hostile-self-review #66	2026-05-07 15:23:03 -07:00
security-auditor	c0f4c16cc9	feat(canvas): ActivityTab subscribes to ACTIVITY_LOGGED — drop 5s polling Stage 3 of #61 (final stage). Replaces the 5s setInterval poll with: 1. Initial bootstrap on mount + on filter-change + on workspaceId- change (preserved from existing useEffect on loadActivities). 2. Manual Refresh button (preserved — still triggers loadActivities). 3. useSocketEvent subscription to ACTIVITY_LOGGED — every event for THIS workspace prepends to the list, gated on the user's autoRefresh toggle and current filter selection. No interval poll. Steady-state HTTP traffic from this tab drops from 12 req/min (5s × 1 active workspace) to 0 outside of bootstraps and manual refreshes. Live update latency drops from up to 5s to ~10ms. The autoRefresh ("Live" / "Paused") toggle now gates LIVE updates instead of polling cadence — semantically the same (paused = list stays frozen), implementationally simpler. The filter selection is honoured by the WS handler so a user filtering to "Tasks" doesn't see live a2a_send rows trickle in. Same shape the server-side `?type=<filter>` enforces on the bootstrap. Test changes: - 27 existing tests pass unchanged (filter / autoRefresh / Refresh / loading / error / empty / count / row-content all preserved) - 7 new WS-subscription tests: - WS push for matching workspace prepends with NO HTTP call - WS push for different workspace ignored - WS push respects active filter (non-matching ignored) - WS push respects active filter (matching renders) - WS push while autoRefresh paused ignored - WS push for already-in-list row deduped (no double-render) - NO 5s interval polling after mount Mutation-tested: - drop workspace_id filter → "different workspace" test fails - drop autoRefresh gate → "paused" test fails - drop filter gate → "non-matching activity_type" test fails - drop dedup-by-id → "already in list deduped" test fails Full canvas suite: 1396 passing, 0 failing. tsc clean. No API or schema change. /workspaces/:id/activity HTTP endpoint stays — used for bootstrap + manual refresh + filter-change reload. ACTIVITY_LOGGED event shape unchanged. Hostile self-review (three weakest spots): 1. Server-side activity_logs row UPDATES (status flips, etc.) are not reflected post-#61 — the dedup-by-id check skips a re-fired ACTIVITY_LOGGED for an existing row. Acceptable: activity_logs is append-only by design (audit trail); status updates surface as new task_update rows, not as in-place mutations. If a future server change adds in-place updates, fire ACTIVITY_UPDATED as a distinct event so this dedup logic stays simple. 2. WS handler is recreated on every render (filter / autoRefresh / workspaceId state changes). useSocketEvent's ref-based pattern keeps the bus subscription stable, but the handler closure re-captures each render. Side effect: fine — handler call cost is negligible. 3. The "error" filter matches activity_type === "error" (mirrors server semantics). It does NOT match status === "error" rows of other activity types — same as the polling version. Worth re-evaluating in a separate PR if users expect the broader semantic. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 15:21:38 -07:00
security-auditor	7194b08987	feat(canvas): A2ATopologyOverlay subscribes to ACTIVITY_LOGGED — drop 60s polling Stage 2 of #61. Replaces the 60s setInterval poll that fanned out across every visible workspace fetching `?type=delegation&limit=500` with: 1. One bootstrap fan-out on mount (or on visible-ID-set change), same shape as before — preserves the 60-min look-back history. 2. useSocketEvent subscription to ACTIVITY_LOGGED — every event with activity_type=delegation + method=delegate from a visible workspace appends to a local rolling buffer, edges are re-derived via the existing buildA2AEdges helper. 3. showA2AEdges toggle off: clears edges + buffer. No interval poll. The visibleIdsKey selector gate that fixed the 2026-05-04 render-loop incident is preserved — peer-discovery / status-flip writes still don't trigger a wasteful re-bootstrap. Steady-state HTTP traffic from this overlay drops from N req/min (N visible workspaces × 1 cycle/min) to 0 outside of mount + visible- ID-set-change bootstraps. Live update latency drops from up to 60s to ~10ms. Bootstrap race-aware: any WS arrivals that landed in the buffer during the fetch await are preserved by id-dedup-with-fetched-first ordering. No row is double-counted; no row is lost during in-flight updates. Test changes: - 27 existing tests pass unchanged (buildA2AEdges purity preserved, component visibility/visibleIdsKey/error-swallow behaviour preserved). - 6 new WS-subscription tests: - NO 60s polling after bootstrap (clock advance fires nothing) - WS push for delegation updates edges with NO HTTP call - WS push for non-delegation activity_type ignored - WS push for delegate_result ignored (mirrors buildA2AEdges method filter) - WS push from hidden workspace ignored - WS push while showA2AEdges=false ignored Mutation-tested: - drop activity_type filter → "non-delegation" test fails - drop method===delegate filter → "delegate_result" test fails - drop visible-ws membership filter → "hidden workspace" test fails Full canvas suite: 1395 passing, 0 failing. tsc clean. No API or schema change. ACTIVITY_LOGGED event shape unchanged. The /workspaces/:id/activity HTTP endpoint stays — used for bootstrap. Hostile self-review (three weakest spots): 1. Bootstrap fetches up to 500 rows × N workspaces. Worst-case buffer ~3000 entries before window-prune. Acceptable: window- prune runs on every recomputeAndPush, buildA2AEdges aggregates to at most N² edges. Real-world usage stays well under both. 2. WS handler re-arms on every bootstrap dependency change (visibleIds change). useSocketEvent's ref-based pattern means the bus subscription stays stable across renders, but the handler closure re-captures bootstrap each time. Side effect: fine — handler invocation just calls recomputeAndPush which is idempotent. 3. delegate_result rows arriving over WS are silently dropped. Acceptable: the existing buildA2AEdges already filters them out at aggregation time (avoids double-counting); pre-filtering at the WS handler is the correct mirror — keeps the bus path and the bootstrap path consistent. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 15:17:19 -07:00
claude-ceo-assistant	d9e380c5bc	feat(workspace-server): local-dev provisioner builds from Gitea source when MOLECULE_IMAGE_REGISTRY is unset (#63 , Task #194 ) OSS contributors who clone molecule-core and `go run ./workspace-server/cmd/server` now get a working end-to-end provision without authenticating to GHCR or AWS ECR. Pre-fix: with MOLECULE_IMAGE_REGISTRY unset, the provisioner attempted to pull ghcr.io/molecule-ai/workspace-template-<runtime>:latest, which has been returning 403 since the 2026-05-06 GitHub-org suspension. Post-fix: when MOLECULE_IMAGE_REGISTRY is unset, the provisioner switches to local-build mode — looks up the workspace-template-<runtime> repo's HEAD sha on Gitea via a single API call, shallow-clones into ~/.cache/molecule/, and runs `docker build --platform=linux/amd64`. SHA-pinned cache key skips the clone+build entirely on subsequent provisions. Production tenants are unaffected: every prod tenant sets the var to its private ECR mirror, so the SaaS pull path is byte-for-byte identical. SSOT for mode detection lives in Resolve() (registry_mode.go) returning a discriminated RegistrySource{Mode, Prefix} so call sites that branch on mode get a compile-time push instead of a string-equality footgun. Coverage: * registry_mode.go — new SSOT (Resolve, RegistryMode, IsKnownRuntime) * registry_mode_test.go — 8 tests pinning mode-decision contract * localbuild.go — clone+build pipeline (570 LOC, fully unit-tested) * localbuild_test.go — 22 tests covering happy/sad paths, fail-closed * provisioner.go — Start() inserts ensureLocalImageHook in local mode * docs/adr/ADR-002 — design rationale + alternatives + security review * docs/development/local-development.md — local-build flow + env overrides Security: * Allowlist-only runtime names (knownRuntimes) gate the clone path. * Repo prefix hardcoded to git.moleculesai.app/molecule-ai/molecule-ai-workspace-template-; forks via opt-in MOLECULE_LOCAL_TEMPLATE_REPO_PREFIX. * MOLECULE_GITEA_TOKEN masked in every log line via maskTokenInURL/maskTokenInString. * Fail-closed: Gitea unreachable / runtime not mirrored → clear error, never silently fall back to GHCR/ECR. * docker build invocation passes no --build-arg from external input. * HTTP body cap 64KB on Gitea API responses (defence vs malicious upstream). Closes #63 / Task #194. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 15:16:51 -07:00
devops-engineer	249dbc6ac9	chore: sync main → staging (auto, `f8a238df`)	2026-05-07 22:11:39 +00:00
devops-engineer	f8a238dfdd	chore: second auto-sync verification (post-#66/#67) (#68 )	2026-05-07 22:11:30 +00:00
security-auditor	830de70e84	feat(canvas): CommunicationOverlay subscribes to ACTIVITY_LOGGED — drop 30s polling Stage 1 of #61. Replaces the 30s setInterval poll with: 1. One bootstrap fan-out on mount (cap of 3 retained from the 2026-05-04 fix), gives the initial recent-comms window without waiting for live events. 2. useSocketEvent subscription to ACTIVITY_LOGGED — every event with a comm-overlay-relevant activity_type from a visible online workspace prepends to the rendered list. 3. Re-bootstrap on visibility-toggle re-open so the snapshot is fresh after a long collapsed period. No interval poll. Inherits the singleton ReconnectingSocket's reconnect / backoff / health-check guarantees via useSocketEvent. Steady-state HTTP traffic from this overlay drops from ~6 req/min (3 ws × 2 cycles/min) to 0 outside of mount/visibility-toggle bootstraps. Live updates arrive within ~10ms of the server insert instead of after up to 30s. Test changes: - Bootstrap fan-out cap of 3 — kept (was the cadence test's role pre-#61) - 30s cadence test — replaced with "no interval polling" test that pins the absence of any cadence-driven HTTP after bootstrap - Visibility gate test — extended to verify both: no fetches while closed, AND re-bootstrap on re-open - WS subscription tests (new): - WS push extends rendered list with NO HTTP call - WS push for offline workspace ignored - WS push for non-comm activity_type ignored - WS push while collapsed ignored - non-ACTIVITY_LOGGED events ignored Mutation-tested: - drop visibility gate → visibility test fails - drop activity_type filter → "non-comm activity_type" test fails - drop workspace online-set filter → "offline workspace" test fails Full canvas suite: 1393 passing, 0 failing. tsc clean. No API or schema change. ACTIVITY_LOGGED event shape pinned by existing socket-events tests. Hostile self-review (three weakest spots): 1. Sustained WS outage shows stale comms until visibility-toggle re-bootstrap. Acceptable: the singleton socket already auto- reconnects and the comm overlay isn't a critical-path surface. 2. Bootstrap on visibility-toggle costs another 3 HTTP calls each re-open. Acceptable: visibility-toggle is a deliberate user action, not a tight loop. 3. The WS handler reads the latest `nodes` via nodesRef rather than re-subscribing on node changes. By design — the bus listener stays bound for the component lifetime to avoid the "tear-down storm" pattern A2ATopologyOverlay's comment warns about (ref-based current-state lookup, stable subscription). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 15:11:02 -07:00
devops-engineer	3f68ac1fcb	chore: second consecutive trigger for auto-sync verification (post-#66/#67)	2026-05-07 15:10:40 -07:00
devops-engineer	da7baee2a3	chore: sync main → staging (auto, `5efa92fb`)	2026-05-07 22:10:12 +00:00
devops-engineer	5efa92fbc6	chore: verify auto-sync main→staging post-#66 (#67 )	2026-05-07 22:10:04 +00:00
devops-engineer	f0664264cb	chore: empty commit to verify auto-sync main→staging post-#66	2026-05-07 15:09:18 -07:00
devops-engineer	2679fdd01a	chore: sync main → staging (manual, resolve auto-sync workflow conflict, post-#66) # Conflicts: # .github/workflows/auto-sync-main-to-staging.yml	2026-05-07 15:08:20 -07:00
devops-engineer	7b194eb1aa	fix(ci): rewrite auto-sync main→staging for Gitea direct push (#66 , closes #65 )	2026-05-07 22:07:00 +00:00
devops-engineer	6235ef7461	fix(ci): rewrite auto-sync main→staging for Gitea direct push Root cause of `Auto-sync main → staging / sync-staging (push)` failing every push to main since the GitHub→Gitea migration: The workflow assumed a GitHub `merge_queue` ruleset on staging (blocking direct push) and used `gh pr create` + `gh pr merge --auto` to land sync via the queue. On Gitea this fails at the `gh pr create` step with `HTTP 405 Method Not Allowed (https://git.moleculesai.app/api/graphql)` — Gitea exposes no GraphQL endpoint, and the GitHub-CLI cannot ship PRs against Gitea. Verified failure mode in run 1117/job 0 (token logs at /tmp/log2.txt, run target /molecule-ai/molecule-core/actions/ runs/1117/jobs/0). The merge step succeeded and pushed auto-sync/main-1e1f4d63; the PR step failed with the 405. So every main push left an orphan auto-sync/* branch and a red CI status, with no PR to land it. Fix: the staging branch protection on Gitea (`enable_push: true`, `push_whitelist_usernames: [devops-engineer]`) already permits direct push from the devops-engineer persona. Drop the entire merge-queue PR architecture and replace with: 1. Checkout staging with secrets.AUTO_SYNC_TOKEN (devops-engineer persona token, NOT founder PAT — `feedback_per_agent_gitea_identity_default`). 2. `git fetch origin main` + ff-merge or no-ff merge. 3. `git push origin staging` directly. The AUTO_SYNC_TOKEN repo secret already exists (created 2026-05-07 14:00 alongside the staging push_whitelist update). Workflow name + job name unchanged → required-check name `Auto-sync main → staging / sync-staging (push)` keeps the same context, no branch-protection edits needed. Rejected alternatives (documented in workflow header): - Reuse PR architecture via Gitea REST: ~80 LOC of API plumbing for no benefit; direct push works. - GH_HOST=git.moleculesai.app: still calls /api/graphql, same 405; doesn't fix the root issue. - Custom JS action: external dep for a 5-line `git push`. Header comment in the workflow now documents: - What this workflow does (SSOT for staging advancing). - Why direct push (GitHub merge_queue → Gitea push_whitelist). - Identity and token (anti-bot-ring per saved memory). - Failure modes A–D with operator runbook for each. - Loop safety (push to staging doesn't fire push:main → no recursion). Verification plan: this fix-PR's merge to main is itself the trigger; watch the workflow run on the merge commit and on one follow-up trigger commit, expect both green. Refs: failing run https://git.moleculesai.app/molecule-ai/ molecule-core/actions/runs/1117/jobs/0 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 15:04:12 -07:00
security-auditor	5b7b669b4c	docs(ratelimit): tighten dev-mode comment after keyFor refactor The previous comment said "all share one IP bucket" — accurate before the keyFor refactor, slightly stale after it. The dev-mode rationale (bucket fills fast, blanks the page on a single-user dev box) is unchanged; only the bucket-key flavour text needed updating. Doc-only follow-up from #60's hostile self-review #3. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 14:57:21 -07:00
security-auditor	9dda84d671	fix(ratelimit): tenant-aware bucket keying — close canvas 429 storm Closes #59. Symptom: /workspaces/:id/activity returns 429 with rate-limit-exceeded on hongming.moleculesai.app whenever multiple workspaces are visible in the canvas. Single-tab, single-user, well within the documented 600 req/min budget — but every request collapsed into one bucket. Root cause: workspace-server's RateLimiter keyed buckets on c.ClientIP(). After issue #179 turned off proxy-header trust (SetTrustedProxies(nil), correctly closing the XFF spoofing hole), c.ClientIP() returns the TCP RemoteAddr — which in production is the upstream proxy (Caddy on per-tenant EC2; CP/Vercel on the SaaS plane). Every browser tab + every canvas consumer + every poll loop for every tenant collapsed into one bucket. Fix: bucket key derivation moves into a single keyFor helper that mirrors the SSOT pattern of: - molecule-controlplane/internal/middleware/ratelimit.go (org > user > IP) - this package's own MCPRateLimiter (token-hash via tokenKey) Priority: X-Molecule-Org-Id header → SHA-256(Authorization Bearer) → ClientIP. Token values are kept hashed in the bucket map so the in-memory state can't become a token dump. Tests: - TestKeyFor_OrgIdHeaderTrumpsBearerAndIP — priority order - TestKeyFor_BearerTokenWhenNoOrgId — middle tier + raw-token leak pin - TestKeyFor_IPFallbackWhenNoOrgIdNoBearer — anon probe path - TestRateLimit_TwoOrgsSameIP_IndependentBuckets — load-bearing regression (issue #59) — two tenants behind same upstream proxy must not share a bucket - TestRateLimit_TwoTokensSameIP_IndependentBuckets — same shape for the per-tenant Caddy box - TestRateLimit_SameOrgDifferentTokens_SharedBucket — counter-pin: rotating tokens within one org must NOT bypass the org's quota - TestRateLimit_Middleware_RoutesThroughKeyFor — AST gate, mirrors the SSOT gates established in #36/#10/#12 Mutation-tested: - strip org-id branch in keyFor → 3 tests fail - strip bearer-token branch → 2 tests fail - reintroduce direct c.ClientIP() in Middleware → 3 tests fail (including the AST gate) Existing tests pass unchanged: dev-mode fail-open, X-RateLimit-* headers (#105), Retry-After on 429 (#105), XFF anti-spoofing (#179). No schema/API change. 429 response body and X-RateLimit-* headers unchanged. RATE_LIMIT env var semantics unchanged. Hostile self-review (three weakest spots) is in the issue body: 1. one-shot Docker-inspect cost is now bucket-key derivation cost (string compare + SHA-256 of bearer); single-digit microseconds. 2. X-Molecule-Org-Id is unvalidated at the rate-limiter layer — spoofing is closed by tenant SG + CP front; documented in keyFor's docstring with the conditions under which to revisit. 3. cpProv-style SaaS surface is out of scope; CP's own limiter handles that hop. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 14:51:08 -07:00
Hongming Wang	7c6acc18ae	ci(branch-protection): check-name parity gate (#144 ) Audit finding: every workflow that emits a required-status-check name on molecule-core's branch protection (apply.sh's STAGING_CHECKS + MAIN_CHECKS) ALREADY uses the safe always-runs-with-conditional-steps shape — Platform/Canvas/Python/Shellcheck in ci.yml, Canvas tabs E2E in e2e-staging-canvas.yml, E2E API Smoke in e2e-api.yml, PR-built wheel in runtime-prbuild-compat.yml, the codeql Analyze matrix, and the always-on Secret scan + Detect changes. No production drift to fix today. Adds a regression-guard so the next path-filter / matrix refactor / workflow rename can't silently re-introduce the bug shape called out in saved memory feedback_branch_protection_check_name_parity: "Path filters … silently break branch protection because no job emits the protected sentinel status when path-filter returns false." New tools: - tools/branch-protection/check_name_parity.sh — extracts every required check name from apply.sh's heredocs, then for each name classifies the owning workflow as safe (no top-level paths:) / safe (per-step if-gates without top-level paths:) / unsafe (top-level paths: without per-step if-gates) / unsafe-mix (top-level paths: WITH per-step if-gates — the workflow may still skip entirely on path exclusion, leaving the gates dormant) / missing (no emitter at all). Special-cases codeql.yml's matrix- expanded `Analyze (${{ matrix.language }})`. - tools/branch-protection/test_check_name_parity.sh — 6 unit tests covering each classification: safe, unsafe-path-filter, missing, safe-with-per-step-gates, unsafe-mix, matrix-expansion. Each test builds a synthetic apply.sh + workflow file in a tmpdir, invokes the script, and asserts on exit code + stderr substring. Per feedback_assert_exact_not_substring the assertions pin specific classifications, not just non-zero exit. Wired into branch-protection-drift.yml so every PR touching .github/workflows/** runs the parity check; the existing daily schedule covers between-PR drift. The check is cheap (~1s) and runs without the admin token — only reads files in the checkout. Self- test step runs the unit tests on every invocation, so a regression in the script can't false-pass on production. Per BSD-vs-GNU portability hygiene: heredoc-marker extraction stays in plain awk + sed (no gawk-only `match()` array form), grep regex avoids `^` anchor for `if:` lines because real workflows use ` - if:` with the `-` step-marker between leading spaces and `if:` (the original anchor missed every workflow's per-step gates). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 14:42:50 -07:00
claude-ceo-assistant	1e1f4d635b	fix(ci): convert CodeQL workflow to no-op stub on Gitea (#156 ) (#51 ) Closes #156. Touches #142. Approved-by: security-auditor	2026-05-07 21:37:04 +00:00
claude-ceo-assistant	3a00dd236f	fix(ci): convert CodeQL workflow to no-op stub on Gitea (#156 ) Why --- PR #35 marked `continue-on-error: true` at the JOB level (correct YAML), but Gitea Actions 1.22.6 does NOT propagate job-level continue-on-error to the commit-status API — every matrix leg still posts `failure`. That keeps OVERALL=failure on every push to main + staging and blocks the auto-promote signal even when every other gate is green. Worse: the underlying CodeQL run never actually worked on Gitea. The github/codeql-action/init@v4 step calls api.github.com bundle endpoints (CLI download + query packs + telemetry) that Gitea does NOT proxy. Confirmed via live-tested run 1d/3101 on operator host: 2026-05-07T20:55:17 ::group::Run Initialize CodeQL with: languages: ${{ matrix.language }} queries: security-extended 2026-05-07T20:55:36 ::error::404 page not found 2026-05-07T20:55:50 Failure - Main Initialize CodeQL 2026-05-07T20:55:51 skipping Perform CodeQL Analysis (main skipped) 2026-05-07T20:55:51 :⚠️:No files were found at sarif-results/go/ The SARIF artifact upload was already a no-op (warning above) — the analyze step never wrote anything because init failed. So nothing of value is being lost by stubbing this out. What ---- - Convert the workflow to a single-step stub that emits success per matrix language (go, javascript-typescript, python). - Keep workflow `name: CodeQL` exactly (auto-promote-staging.yml line 67 keys on it as a workflow_run gate). - Keep job name template `Analyze (${{ matrix.language }})` and the 3-leg matrix exactly (commit-status context names + branch protection + #144 required-check-name parity). - Keep all four triggers (push / pull_request / merge_group / schedule) so merge_group required-checks parity holds. - Drop the codeql-action steps, the Autobuild step, the SARIF parse step, and the upload-artifact step — all four of those are now dead code (init can never succeed against Gitea's API surface). Policy ------ Per Hongming decision 2026-05-07 (#156): CodeQL is ADVISORY, not blocking, until a Gitea-compatible SAST pipeline lands. The header of the new workflow file documents this decision + lists the three re-enable options (self-hosted Semgrep, Sonatype, GitHub mirror) plus the compensating controls in place (secret-scan, block-internal- paths, lint-curl-status-capture, branch-protection-drift). Closes #156. Touches #142 (no capital-M Molecule-AI refs in this file — already lowercase per `e01077be`). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 14:26:57 -07:00
devops-engineer	229b1a902a	fix(ci): pre-clone manifest deps in harness-replays workflow (#173 followup) harness-replays.yml builds tenant-alpha + tenant-beta via tests/harness/ compose.yml using workspace-server/Dockerfile.tenant. Post-#173, that Dockerfile expects .tenant-bundle-deps/{workspace-configs-templates, org-templates,plugins} pre-cloned at the build context root. Sister PR #38 added the pre-clone step to publish-workspace-server-image.yml but missed harness-replays.yml. Symptoms: - main run #892 (2026-05-07T20:28:53Z): COPY .tenant-bundle-deps/plugins -> failed to calculate checksum ... not found. - staging run #964 (2026-05-07T20:41:52Z): hits the OLD in-image clone path (staging hasn't picked up the Dockerfile.tenant refactor yet via auto-sync) and fails on 'fatal: could not read Username for https://git.moleculesai.app' when cloning the first private workspace-template-* repo. Fix: add the same Pre-clone step to harness-replays.yml, mirroring publish-workspace-server-image.yml. Uses AUTO_SYNC_TOKEN (devops-engineer persona PAT) per feedback_per_agent_gitea_identity_default. Once auto-sync main->staging unblocks (sister agent fixing the 7-file conflict in flight), staging will inherit both this workflow fix AND the Dockerfile.tenant refactor atomically. Refs: #168, #173	2026-05-07 14:26:52 -07:00
devops-engineer	e3904ebb42	chore: reconcile main → staging post-suspension divergence (Task #165 followup) (#48 )	2026-05-07 21:26:41 +00:00
claude-ceo-assistant (Claude Opus 4.7 on Hongming's MacBook)	25fb696965	chore: reconcile main → staging post-suspension divergence Refs Task #165 (Class D AUTO_SYNC_TOKEN plumbing). main and staging diverged after the 2026-05-06 GitHub-org suspension because Class D / Class G / feature work landed on staging while unrelated CI fixes (#34-47, ECR auth-inline, buildx→docker, pre-clone manifest deps) landed straight on main. Both branches edited the same workflow files, so every push to main triggered an Auto-sync run that aborted at `git merge --no-ff origin/main` with 7 content conflicts: - .github/workflows/canary-verify.yml (URL: github.com → Gitea) - .github/workflows/ci.yml (3 URL refs) - .github/workflows/publish-runtime.yml (cascade: HTTP repo-dispatch → Gitea push) - .github/workflows/publish-workspace-server-image.yml (drop AWS-action steps; ECR auth is inline) - .github/workflows/retarget-main-to-staging.yml (URL) - manifest.json (lowercase org slug + add mock-bigorg from main) - scripts/clone-manifest.sh (keep main's MOLECULE_GITEA_TOKEN auth path + drop awk-tolower since manifest is now lowercase) Resolution: union — staging's post-suspension Gitea/ECR migrations win on URL/policy edits; main's additive work (mock-bigorg manifest entry, inline ECR auth, MOLECULE_GITEA_TOKEN basic-auth) is preserved on top. After this lands, staging is a strict superset of main, so the next auto-sync run on a push to main will be a clean fast-forward / no-op. The auto-sync workflow on main also picks up staging's AUTO_SYNC_TOKEN swap (Class D #26) for free, fixing the latent layer-2 push-auth issue. Verified locally: - bash -n scripts/clone-manifest.sh - python -c 'yaml.safe_load(...)' on each touched workflow - python -c 'json.load(open(manifest.json))' (21 plugins, 9 templates, 7 org_templates) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 14:24:37 -07:00
claude-ceo-assistant	0276b295cc	Merge pull request 'chore(ci): retrigger publish-workspace-server-image after ECR repo create (#173 )' (#47 ) from chore/issue173-retrigger-after-ecr-repo-create into main	2026-05-07 20:54:53 +00:00
devops-engineer	194cdf012b	chore(ci): retrigger publish-workspace-server-image after ECR repo create (#173 ) Run #1010 (post-#46) succeeded all the way to push but failed with "repository molecule-ai/platform does not exist" — the platform image ECR repo had never been created (only platform-tenant existed). Created the repo via: aws ecr create-repository --region us-east-2 \ --repository-name molecule-ai/platform \ --image-scanning-configuration scanOnPush=true This is a one-line workflow comment to satisfy the path-filter and re-run the publish workflow against the now-existing repo. Closes #173 properly this time — pre-clone + inline ECR auth + ECR repo all in place. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 13:54:11 -07:00

1 2 3 4 5 ...

4721 Commits