Commit Graph

4690 Commits

Author SHA1 Message Date
29da0882a7 Merge branch 'main' into fix/175-env-matched-pair-guard 2026-05-08 02:46:40 +00:00
8798b316f6 chore: promote accumulated staging fixes to main (vitest, postgres, e2e-api, eic, plugins) (#99)
Brings staging tip b11044f8 to main. Includes #97 (vitest) + #98 (handlers-postgres) + #100 (e2e-api) + EIC race fix + #84 (SaaS plugin EIC) + accumulated staging commits. Triggers Vercel + Railway + ECR production deploy chain. Approved by security-auditor.
2026-05-08 02:46:39 +00:00
b11044f885 fix(plugins): SaaS (EC2-per-workspace) install/uninstall via EIC SSH (#84)
Closes docker-only row in backends.md. Approved by security-auditor.
2026-05-08 02:15:49 +00:00
d201b13b93 Merge branch 'staging' into fix/saas-plugin-install-eic 2026-05-08 02:03:53 +00:00
a4ab623bbf fix(ci): e2e-api — parallel-safe postgres/redis containers (#100)
Closes #94. Mirrors PR #98 pattern. Approved by security-auditor.
2026-05-08 02:02:57 +00:00
b9d2786f45 fix(ci): e2e-api — parallel-safe postgres/redis containers + provisioner setup
Class B Hongming-owned CICD red sweep, e2e-api leg. Same substrate
hazard as PR #98 (handlers-postgres-integration) — Gitea act_runner
configures `container.network: host` operator-wide, so:

  * Two concurrent e2e-api runs both attempted to bind `-p 15432:5432`
    and `-p 16379:6379` on the operator host. Verified in run a7/2727
    on 2026-05-07: `docker: Error response from daemon: Conflict. The
    container name "/molecule-ci-redis" is already in use by container
    af10f438...` — exit 125, job fails before any test runs.
  * Hardcoded container names `molecule-ci-postgres` / `-redis` plus
    the leading `docker rm -f` step meant a second job's startup also
    KILLED the first job's still-running services.

Fix shape (mirrors PR #98 bridge-net pattern, adapted because the
platform-server is a Go binary on the host, not a containerised step):

  1. Per-run unique container names: `pg-e2e-api-${RUN_ID}-${RUN_ATTEMPT}`,
     `redis-e2e-api-${RUN_ID}-${RUN_ATTEMPT}`. Unique even across reruns
     of the same run_id.
  2. Ephemeral host port per run via `-p 0:5432` / `-p 0:6379` and
     `docker port` lookup, exported as `DATABASE_URL` / `REDIS_URL` to
     `$GITHUB_ENV`. No fixed host-port → no collision.
  3. `127.0.0.1` (NOT `localhost`) in URLs — IPv6 first-resolve flake
     fixed in #92 stays fixed.
  4. `if: always()` cleanup so containers don't leak when test steps
     fail.

Issue #94 items #2 + #3 also addressed:

  * Pre-pull `alpine:latest` (provisioner uses it for ephemeral
    token-write containers in `internal/handlers/container_files.go`).
  * Idempotent `docker network create molecule-monorepo-net` (the
    provisioner attaches workspace containers via that bridge —
    `internal/provisioner/provisioner.go::DefaultNetwork`).

Issue #94 item #1 (timeouts) NOT bumped — recent log evidence shows
postgres ready in 3s, redis in 1s, platform in 1s when they DO come
up. Timeouts are not the bottleneck on the current substrate.

NOT addressed here (out of scope, separate change required):

  * `Run E2E API tests` step has been failing on `Status back online`
    because the platform's langgraph workspace template image
    (`ghcr.io/molecule-ai/workspace-template-langgraph:latest`)
    returns 403 Forbidden post-2026-05-06 GitHub org suspension. That
    is a template-registry resolution issue (ADR-002 / local-build
    mode) and belongs in a workspace-server change, not this workflow
    file. This PR fixes the parallel-collision class and the workflow
    setup hygiene; the langgraph-403 failure will still surface on
    runs after this lands until template resolution is fixed
    separately.

Verified manually on operator host 2026-05-08: docker now hands out
ephemeral ports on `-p 0:5432`, two parallel runs land on different
ports, both reach pg_isready GREEN.

Closes #94 (items #2 and #3; item #1 documented as not-bottleneck;
langgraph-template-403 referenced for follow-up).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 18:59:56 -07:00
576166c8c3 Merge branch 'staging' into fix/saas-plugin-install-eic 2026-05-08 01:29:11 +00:00
8a3141a763 fix(ci): handlers-postgres — sidestep port collision under host-network runner (#98)
Switches from services: block to --network molecule-monorepo-net with unique per-run container names. Avoids port-5432 collision when parallel Handlers-Postgres jobs run on host-network act_runner. Approved by security-auditor.
2026-05-08 01:29:06 +00:00
78f77532ea Merge branch 'main' into fix/175-env-matched-pair-guard 2026-05-08 01:27:44 +00:00
dccd1aa1ba fix(canvas-tests): bump vitest testTimeout to 30000ms on CI for cold-start overhead (#97)
Closes molecule-core#96. Unblocks Canvas (Next.js) on PRs #82/#81/#54/#53 after rebase. Approved by security-auditor.
2026-05-08 01:27:43 +00:00
a302d75129 chore(ci): retrigger Handlers Postgres Integration for second-green proof
Class B verification — second consecutive green run to demonstrate the
fix isn't flaky.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 18:23:05 -07:00
241859b552 fix(ci): handlers-postgres — sidestep port collision under host-network runner
Class B Hongming-owned CICD red sweep. The Handlers Postgres Integration
workflow has been silently failing on staging push and PRs ever since
#92 fixed the IPv6 flake — the IPv6 fix correctly pinned 127.0.0.1, but
unmasked a deeper issue: with our act_runner global container.network=host
config, multiple concurrent runs of this workflow each tried to bind
0.0.0.0:5432 on the operator host. The first wins; subsequent postgres
service containers exit with `FATAL: could not create any TCP/IP sockets`
+ `Address in use`. Docker auto-removes them (act_runner sets
AutoRemove:true), so by the time `Apply migrations` runs `psql`, the
container is gone — Connection refused, then `failed to remove container:
No such container` at cleanup time.

Per-job container.network override is silently ignored by act_runner
(`--network and --net in the options will be ignored.`), so we sidestep
`services:` entirely. The job container still uses host-net (required
for cache server discovery on the operator's bridge IP). We launch a
sibling postgres on the existing molecule-monorepo-net bridge with a
unique name per run (run_id+run_attempt) and connect via the bridge IP
read from `docker inspect`.

Verified manually on operator host 2026-05-08: 2× postgres on host-net
collides, but on the bridge with unique names + different IPs, both
succeed and each is reachable from a host-net job container.

Adds:
- always()-cleanup step so containers don't leak on test failure
- Diagnostic dump now includes the postgres container's docker logs
- Runbook at docs/runbooks/ documenting the substrate behavior + the
  pattern future workflows should adopt for any `services:`-shaped need.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 18:21:12 -07:00
da1a5af7a4 fix(canvas): bump vitest testTimeout to 30s on CI for v8-coverage cold start (#96)
Class A red sweep — 3 first-tests timing out at the 5000ms default on the
self-hosted Gitea Actions Docker runner across 4 unrelated PRs (#82, #81,
#54, #53). The PRs share zero canvas/ surface — same 3 tests, same
cold-start signature, same shape on every run.

Root cause: `npx vitest run --coverage` cold-start cost (v8 coverage
instrumentation init + JSDOM bootstrap + heavy @/components/* and @/lib/*
import + first React render) consumes 5-7 seconds for the first
synchronous test in a heavyweight test file. Empirically:

  ActivityTab "renders all 7 filter options"           5230ms (FAIL)
  CreateWorkspaceDialog "opens the dialog ..."         6453ms (FAIL)
  ConfigTab.provider "PUTs the new provider on Save"   5605ms (FAIL)

vs subsequent tests in the same files at 100-1500ms each. The component
code is correct (e.g. ActivityTab.FILTERS has 7 entries matching the
test). 1407 tests pass locally with --coverage in 9-15s; CI runs at 200s
under the same flag — the gap is import/transform/environment overhead,
not test logic.

Fix: CI-conditional `testTimeout: process.env.CI ? 30000 : 5000` in
canvas/vitest.config.ts. Local-dev sensitivity to genuine waitFor races
preserved; CI gets ~5x headroom over the worst observed first-test
(6453ms). Same shape Vitest documents at
<https://vitest.dev/config/testtimeout> and
<https://vitest.dev/guide/coverage#profiling-test-performance>.

Verification:
  - Local: 5x runs of the 3 failing test files, all 74 tests green
    (process.env.CI unset → 5000ms applies).
  - Local: 7s sleep probe FAILS at 5000ms default and PASSES under
    CI=true → ternary takes effect as written.
  - Local: full canvas suite under CI=true with --coverage:
    "Test Files 98 passed (98) | Tests 1407 passed (1407)".

Closes #96.
Refs: #82, #81, #54, #53.

Hostile self-review (3 weakest spots):
1. 30000ms is a guess, not a measurement. Mitigation: vitest still
   emits per-test duration; a real 25s+ test will surface as a
   duration regression and we dial down.
2. Doesn't fix the Docker-runner-overhead root-root-cause. True. That
   is a multi-week perf project. The right trade today is unblocking 4
   PRs from this single class.
3. Local-default of 5000ms means a real 8s race that flies on CI's
   30000ms could pass without local sensitivity. Mitigation: dev-time
   waitFor races are caught at the per-test level; suite-level cold-
   start is the only legitimate >5s case here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 18:19:58 -07:00
09ec0b1b4a chore: sync main → staging (auto, 068c9682) 2026-05-08 01:17:12 +00:00
068c968206 docs(hermes): hermes-agent fork moved to Gitea (#90)
Doc update reflecting #160 hermes-agent migration. Approved by security-auditor.
2026-05-08 01:17:03 +00:00
devops-engineer
419c109f1d chore: sync main → staging (auto-resolved workflow conflicts via main-wins)
Conflicted files in .github/workflows/ taken from main:
  .github/workflows/ci.yml
  .github/workflows/e2e-staging-canvas.yml
  .github/workflows/retarget-main-to-staging.yml

Conflicts arose from main advancing through PR #66/#79/#89 (CI workflow rewrites)
while staging hadn't picked up the changes yet. Main is the source of truth for
CI workflows; staging is downstream.

Co-authored-by: Claude (orchestrator)
2026-05-08 01:00:48 +00:00
97c042f666 Merge branch 'main' into fix/hermes-agent-doc-gitea-migration 2026-05-08 00:54:30 +00:00
7492d9661c Merge branch 'main' into fix/175-env-matched-pair-guard 2026-05-08 00:54:14 +00:00
3d6303afcc fix(ci): rewrite retarget-main-to-staging for Gitea REST API (#79)
Closes #74. Approved by security-auditor.
2026-05-08 00:26:27 +00:00
3fcaa1fcc5 Merge branch 'main' into fix/hermes-agent-doc-gitea-migration 2026-05-08 00:21:17 +00:00
7f61206a18 Merge branch 'staging' into fix/saas-plugin-install-eic 2026-05-08 00:21:10 +00:00
6c823cf673 Merge branch 'main' into fix/196-retarget-main-to-staging-gitea-rest 2026-05-08 00:20:49 +00:00
36a509abfb Merge branch 'main' into fix/175-env-matched-pair-guard 2026-05-08 00:20:43 +00:00
12ff797d12 fix(ci): close 3 chronic Gitea-Actions workflow flakes (#92)
Closes #88. Bundles localhost→127.0.0.1 + 2 other Gitea-act_runner flakes per feedback_gitea_actions_migration_audit_pattern. Approved by security-auditor.
2026-05-08 00:20:42 +00:00
4193d54852 fix(ci): pin actions/upload-artifact + download-artifact to @v3 (#89)
Closes #210. Unblocks 5 stuck PRs (#53/#54/#69/#71/#76/#81). Approved by security-auditor.
2026-05-08 00:20:00 +00:00
7eb348536b fix(harness): bake cf-proxy nginx.conf at build time, not via configs:
The previous configs:-based fix (87b971a2) didn't actually fix the DinD
issue — Compose v2 falls back to bind mounts for `configs:` when swarm
mode is not active, so the resulting runc invocation still tries to
mount /workspace/.../cf-proxy/nginx.conf from the OUTER host filesystem
that the act_runner-vs-host-docker socket-mount can't see. Same
"not a directory" error returned.

Switch to a thin Dockerfile (cf-proxy/Dockerfile) that COPYs nginx.conf
into nginx:1.27-alpine. The build context is uploaded to the daemon as
a tarball, not bind-mounted from the host filesystem, so the path
translation gap doesn't apply. Verified locally: `docker build` +
`docker run cf-proxy nginx -T` reproduces the baked config end-to-end.

Trade-off: ~2-3s build cost on every harness up. Acceptable for the
Gitea CI gate; local-dev re-builds the image only when nginx.conf
changes (Docker layer cache).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 17:09:08 -07:00
87b971a292 fix(ci): close 3 chronic Gitea-Actions workflow flakes (closes #88)
Three workflows have been failing on every push to this Gitea repo for
GitHub-shaped reasons that don't translate to act_runner. Surfaced
while landing #84; bundled per `feedback_gitea_actions_migration_audit_pattern`
("bundle per-repo, not per-finding") instead of three separate PRs.

1) handlers-postgres-integration: localhost → 127.0.0.1
   - lib/pq tries to dial localhost → ::1 first; the postgres service
     container only listens on IPv4 → ECONNREFUSED → all
     TestIntegration_* fail. Pin IPv4 to make the job deterministic.

2) pr-guards / disable-auto-merge-on-push: Gitea no-op
   - The previous reusable-workflow caller invoked `gh pr merge
     --disable-auto`, which calls GitHub's GraphQL API. Gitea returns
     HTTP 405 on /api/graphql → step always fails. Inline the step so
     it can detect Gitea (GITEA_ACTIONS=true OR repo url under
     moleculesai.app) and no-op with a notice. Auto-merge gating is
     moot on Gitea anyway: there's no `--auto` primitive being
     touched. Job stays ALWAYS-RUN so branch protection's required
     check still lands SUCCESS (avoids the SKIPPED-in-set trap from
     `feedback_branch_protection_check_name_parity`).

3) Harness Replays: cf-proxy nginx.conf via docker `configs:` (not bind)
   - act_runner runs the workflow inside a runner container; runc in
     the docker daemon below resolves bind-mount source paths on the
     OUTER host, not inside the runner. The path
     `/workspace/.../cf-proxy/nginx.conf` is invisible there → "not a
     directory" runc error. Switching to compose `configs:` packages
     the file as content rather than a host bind, sidestepping the
     DinD path-translation gap.

Local validation:
  - YAML parsed clean for all 3 files.
  - cf-proxy nginx.conf: standalone `docker compose run cf-proxy
    nginx -T` reproduced the configs: mount end-to-end and dumped the
    config correctly. The full harness compose still renders via
    `docker compose config`.

Real-CI verification will land on this branch's first push.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 17:06:09 -07:00
devops-engineer
0bcf195fbc docs(hermes): hermes-agent fork moved to Gitea (post-suspension)
The `HongmingWang-Rabbit/hermes-agent` fork is no longer reachable on
github.com (account suspended 2026-05-06). The patched fork now lives
at https://git.moleculesai.app/molecule-ai/hermes-agent. Same SHAs,
same branches — pure URL flip.

See molecule-ai/internal#72 for the github.com fork shell decision.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 16:57:57 -07:00
8885f7cd12 fix(ci): pin actions/upload-artifact + download-artifact to @v3 for Gitea compatibility
actions/upload-artifact@v4+ and download-artifact@v4+ use the GHES 3.10+
artifact protocol that Gitea Actions (act_runner v0.6 / Gitea 1.22.x)
does NOT implement. Failure cite from PR #54 run 1325 jobs/2:

  ::error::@actions/artifact v2.0.0+, upload-artifact@v4+ and
  download-artifact@v4+ are not currently supported on GHES.

Pinned all 3 references to v3.2.2 (latest v3) at SHA-pinned form for
supply-chain hygiene, matching the existing `uses:` style in this repo.
Affected workflows:
  - ci.yml (Canvas Next.js coverage upload, blocks `CI / Canvas (Next.js)`
    required check on every PR — was the merge-queue blocker for #53,
    #54, #69, #71, #76, #81)
  - e2e-staging-canvas.yml (Playwright report + screenshots on failure)

No download-artifact callers in the repo, so v3-pin doesn't compose-break
anywhere. Drop these pins post-Gitea-1.23+ when the v4 artifact protocol
ships, or migrate to a Gitea-native action.

Closes #210.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 16:54:44 -07:00
0c7f3c8909 chore: sync main → staging (auto, cdbf28fd) 2026-05-07 23:45:36 +00:00
cdbf28fd76 ci(canary): synthetic-check cron for AUTO_SYNC_TOKEN rotation drift (#77)
6h cron probes auth + scope + git-push --dry-run. Closes #72. Approved by security-auditor.
2026-05-07 23:45:25 +00:00
3f9ba90672 chore: sync main → staging (auto, 07bd91e4) 2026-05-07 23:44:31 +00:00
4b82db72a7 Merge branch 'main' into fix/issue-72-auto-sync-token-canary-v2 2026-05-07 23:44:22 +00:00
07bd91e436 fix(ci): replace gh run list with Gitea commit-status query (#83)
Class F of #75 sweep. /commits/{sha}/statuses replaces unavailable workflow-runs API. 4 mapping buckets verified against synthetic+real Gitea data. Approved by security-auditor.
2026-05-07 23:44:21 +00:00
ed0874504e Merge branch 'main' into fix/issue75-class-F-gh-run-list-to-statuses 2026-05-07 23:44:00 +00:00
6656862870 chore: sync main → staging (auto, e39fc920) 2026-05-07 23:39:46 +00:00
e39fc92074 fix(ci): replace gh pr CLI with Gitea v1 REST in workflows + scripts (#80)
Class A of #75 sweep. 23 bash + 9 python tests pass. Live-integration verified against prod Gitea. Approved by security-auditor.
2026-05-07 23:39:22 +00:00
6d7554d282 chore: sync main → staging (auto, d84d88ad) 2026-05-07 23:38:08 +00:00
1819ac21f4 Merge branch 'main' into fix/issue75-class-A-gh-pr-to-gitea-rest 2026-05-07 23:37:57 +00:00
d84d88ad70 feat(workspace-server): local-dev provisioner builds from Gitea source (#70)
Hongming-locked Option C: MOLECULE_IMAGE_REGISTRY presence as mode marker. ADR-002 captures rationale. 30 new tests + 64 existing preserved. Hostile-review weakest 3 filed as #204/#205/#206 follow-ups. Closes #63 (Task #194). Approved by security-auditor.
2026-05-07 23:37:56 +00:00
ae49b184f6 chore: sync main → staging (auto, 1f1ead18) 2026-05-07 23:33:25 +00:00
6bb272360d Merge branch 'main' into feat/issue-63-local-build-from-gitea-v2 2026-05-07 23:33:03 +00:00
1f1ead1833 fix(ci): rewrite auto-promote-staging for Gitea (#78)
Removes ~60 lines polling+dispatch (Gitea fires on:push naturally on token-merge). Uses Gitea merge_when_checks_succeed; preserves required_approvals=1 on main. Closes #73. Approved by security-auditor.
2026-05-07 23:32:58 +00:00
c5f40de585 Merge branch 'main' into fix/195-auto-promote-staging-gitea-rest 2026-05-07 23:30:09 +00:00
a234ed5c51 chore: sync main → staging (auto, 330a5842) 2026-05-07 23:29:14 +00:00
330a5842ab Merge pull request 'feat(canvas): ActivityTab → ACTIVITY_LOGGED subscriber (#61 stage 3, final)' (#76) from feat/canvas-activity-tab-ws-subscribe into main 2026-05-07 23:27:32 +00:00
cd55ce10d2 chore: sync main → staging (auto, 502aa082) 2026-05-07 23:25:49 +00:00
2505b36a2c Merge branch 'main' into fix/195-auto-promote-staging-gitea-rest 2026-05-07 23:22:24 +00:00
security-auditor
e0feae18f4 Merge remote-tracking branch 'origin/main' into feat/canvas-activity-tab-ws-subscribe 2026-05-07 16:18:34 -07:00
502aa082bc Merge pull request 'feat(canvas): A2ATopologyOverlay → ACTIVITY_LOGGED subscriber (#61 stage 2)' (#71) from feat/canvas-topology-overlay-ws-subscribe into main 2026-05-07 23:18:24 +00:00