Part of the post-#66 sweep to remove `gh` CLI dependencies that fail
silently against Gitea (which exposes /api/v1 only — no GraphQL → 405,
no /api/v3 → 404). Class D covers `gh api` REST passthroughs that
either have a Gitea v1 equivalent at a different path/shape or no
equivalent at all.
Three files in this class, each with a different fix shape because
each underlying Gitea capability is different:
`auto-promote-on-e2e.yml` (compute SHA ancestry):
Old: `gh api repos/.../compare/A...B` returning `.status`
(ahead|behind|identical|diverged).
Gitea: `/api/v1/repos/.../compare/A...B` accepts only branch / tag
refs — full commit SHAs return `BaseNotExist`. So even a
"translate the URL" rewrite would fail. Verified empirically
2026-05-07: branches/tags work, SHAs don't.
Fix: Add `actions/checkout@v6 fetch-depth=200` + use
`git merge-base --is-ancestor` locally. Exact same four-bucket
semantics (ahead | behind | diverged | error), zero cross-host
API dependency. Same pattern PR #66 used for auto-sync. The
200-commit depth comfortably covers any realistic divergence
between :latest and a candidate retag (promotes are minutes
apart, not hundreds of commits).
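The four-bucket classification reduces to two `git merge-base --is-ancestor` probes plus an explicit identical check (function name is illustrative):

```shell
# Local replacement for the GitHub compare API's .status field, using
# only git plumbing. --is-ancestor exit-code contract: 0 = ancestor,
# 1 = not an ancestor.
compare_status() {
  base="$1" head="$2"
  if [ "$(git rev-parse "$base")" = "$(git rev-parse "$head")" ]; then
    echo identical
  elif git merge-base --is-ancestor "$base" "$head"; then
    echo ahead      # head is strictly ahead of base
  elif git merge-base --is-ancestor "$head" "$base"; then
    echo behind     # head is strictly behind base
  else
    echo diverged   # neither is an ancestor of the other
  fi
}
```

The identical check must come first, since a commit counts as its own ancestor.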
`ci.yml` (canvas-deploy-reminder commit comment):
Old: `gh api -X POST repos/.../commits/{sha}/comments` posting a
deploy-reminder body for the operator.
Gitea: NO commit-comments endpoint exists —
`/repos/.../commits/{sha}/comments` returns 404 (verified
2026-05-07). Gitea only exposes `/commits/{sha}/statuses` for
commit-level surface, which is the wrong shape for a free-form
reminder.
Fix: Drop the API call. Write the reminder body to
`$GITHUB_STEP_SUMMARY` instead. The reminder is entirely
operator-facing and is just as discoverable on the run summary
page (which an operator naturally lands on when they need to
action a deploy). Commit comments were a stale UI artefact of
the GitHub era, not a load-bearing automation surface.
Permission: drop `contents: write` (no longer needed) → `read`,
smallest scope per least-privilege.
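The replacement step body is a plain append to the summary file (the reminder text here is illustrative, not the workflow's actual wording):

```shell
# Write the operator-facing reminder to the run-summary page instead of
# POSTing a commit comment. GITHUB_STEP_SUMMARY is a file path the
# runner provides; markdown appended to it renders on the run summary.
write_deploy_reminder() {
  {
    echo "## Canvas deploy reminder"
    echo "Commit ${GITHUB_SHA:-unknown} touched canvas paths — deploy when ready."
  } >> "$GITHUB_STEP_SUMMARY"
}
```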
`check-merge-group-trigger.yml` (merge_group: trigger linter):
Old: `gh api .../branches/staging/protection/required_status_checks`
reading the contexts list, then walking workflow files.
Gitea: branch-protection API is at
/api/v1/repos/.../branch_protections/{name} (different path)
with `status_check_contexts` (different field name) — but the
entire workflow only existed to lint that workflows producing
a required check declare a `merge_group:` trigger, which is
needed because GitHub's merge queue dead-locks at
AWAITING_CHECKS when the trigger is missing. Gitea has NO
merge queue, NO gh-readonly-queue/... ref shape, NO
merge_group event semantics. The dead-lock pattern this
linter catches cannot occur on Gitea by construction.
Fix: Convert to no-op stub (same pattern as the CodeQL stub
landed in PR #51). Workflow name + trigger surface preserved
so any external referrer (none confirmed via the 2026-05-07
branch-protection audit) keeps resolving. Re-enable path
documented in the file header for if/when Gitea grows a
merge queue.
curl invocation pattern: `curl --fail-with-body -sS` (NOT `-fsS` —
`-f`/`--fail` and `--fail-with-body` are mutually exclusive in
modern curl, and `-fsS` bundles `-f`).
Token model: workflows continue to use act_runner's GITHUB_TOKEN
where they still need API access (`auto-promote-on-e2e.yml`'s
checkout uses the runner's default token; `ci.yml` no longer
needs any API auth for the deploy-reminder step; and
`check-merge-group-trigger.yml` no longer makes any API calls).
Verification:
- YAML syntax validates for all three files.
- Live curl against Gitea confirms `/compare/A...B` accepts branch
refs (200, total_commits=N) and refuses full SHAs (404,
BaseNotExist) — justifying the local-git approach.
- `/repos/.../commits/{sha}/comments` confirmed 404 on Gitea.
- `git merge-base --is-ancestor` exit-code semantics match the
GitHub compare API status semantics exactly: ahead = current is
ancestor of target; behind = target is ancestor of current;
diverged = neither.
Closes part of #75. Class A landed in #80; class F (gh run list →
no Gitea workflow-runs API at all) lands in a separate PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause of `Auto-sync main → staging / sync-staging (push)`
failing every push to main since the GitHub→Gitea migration:
The workflow assumed a GitHub `merge_queue` ruleset on staging
(blocking direct push) and used `gh pr create` + `gh pr merge
--auto` to land sync via the queue. On Gitea this fails at the
`gh pr create` step with `HTTP 405 Method Not Allowed
(https://git.moleculesai.app/api/graphql)` — Gitea exposes no
GraphQL endpoint, and the GitHub-CLI cannot ship PRs against
Gitea.
Verified failure mode in run 1117/job 0 (token logs at
/tmp/log2.txt, run target /molecule-ai/molecule-core/actions/
runs/1117/jobs/0). The merge step succeeded and pushed
auto-sync/main-1e1f4d63; the PR step failed with the 405. So
every main push left an orphan auto-sync/* branch and a red CI
status, with no PR to land it.
Fix: the staging branch protection on Gitea
(`enable_push: true`, `push_whitelist_usernames:
[devops-engineer]`) already permits direct push from the
devops-engineer persona. Drop the entire merge-queue PR
architecture and replace with:
1. Checkout staging with secrets.AUTO_SYNC_TOKEN
(devops-engineer persona token, NOT founder PAT —
`feedback_per_agent_gitea_identity_default`).
2. `git fetch origin main` + ff-merge or no-ff merge.
3. `git push origin staging` directly.
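Steps 1–3 reduce to this shape (branch and remote names from the description above; the token wiring happens in the checkout step and is elided here):

```shell
# Sync staging with main by direct push — no PR, no merge queue.
# Assumes the checkout used a token whose user is on the staging
# push whitelist.
sync_staging() {
  git fetch origin main
  git checkout staging
  # --no-edit keeps the merge-commit message deterministic; the merge
  # fast-forwards automatically when staging hasn't diverged.
  git merge --no-edit origin/main
  git push origin staging
}
```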
The AUTO_SYNC_TOKEN repo secret already exists (created
2026-05-07 14:00 alongside the staging push_whitelist update).
Workflow name + job name unchanged → required-check name
`Auto-sync main → staging / sync-staging (push)` keeps the
same context, no branch-protection edits needed.
Rejected alternatives (documented in workflow header):
- Reuse PR architecture via Gitea REST: ~80 LOC of API
plumbing for no benefit; direct push works.
- GH_HOST=git.moleculesai.app: still calls /api/graphql,
same 405; doesn't fix the root issue.
- Custom JS action: external dep for a 5-line `git push`.
Header comment in the workflow now documents:
- What this workflow does (SSOT for staging advancing).
- Why direct push (GitHub merge_queue → Gitea push_whitelist).
- Identity and token (anti-bot-ring per saved memory).
- Failure modes A–D with operator runbook for each.
- Loop safety (push to staging doesn't fire push:main → no
recursion).
Verification plan: this fix-PR's merge to main is itself the
trigger; watch the workflow run on the merge commit and on
one follow-up trigger commit, expect both green.
Refs: failing run https://git.moleculesai.app/molecule-ai/
molecule-core/actions/runs/1117/jobs/0
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why
---
PR #35 marked `continue-on-error: true` at the JOB level (correct YAML),
but Gitea Actions 1.22.6 does NOT propagate job-level continue-on-error
to the commit-status API — every matrix leg still posts `failure`. That
keeps OVERALL=failure on every push to main + staging and blocks the
auto-promote signal even when every other gate is green.
Worse: the underlying CodeQL run never actually worked on Gitea. The
github/codeql-action/init@v4 step calls api.github.com bundle endpoints
(CLI download + query packs + telemetry) that Gitea does NOT proxy.
Confirmed via live-tested run 1d/3101 on operator host:
2026-05-07T20:55:17 ::group::Run Initialize CodeQL
with: languages: ${{ matrix.language }}
queries: security-extended
2026-05-07T20:55:36 ::error::404 page not found
2026-05-07T20:55:50 Failure - Main Initialize CodeQL
2026-05-07T20:55:51 skipping Perform CodeQL Analysis (main skipped)
2026-05-07T20:55:51 ::warning::No files were found at sarif-results/go/
The SARIF artifact upload was already a no-op (warning above) — the
analyze step never wrote anything because init failed. So nothing of
value is being lost by stubbing this out.
What
----
- Convert the workflow to a single-step stub that emits success per
matrix language (go, javascript-typescript, python).
- Keep workflow `name: CodeQL` exactly (auto-promote-staging.yml
line 67 keys on it as a workflow_run gate).
- Keep job name template `Analyze (${{ matrix.language }})` and the
3-leg matrix exactly (commit-status context names + branch
protection + #144 required-check-name parity).
- Keep all four triggers (push / pull_request / merge_group /
schedule) so merge_group required-checks parity holds.
- Drop the codeql-action steps, the Autobuild step, the SARIF parse
step, and the upload-artifact step — all four of those are now
dead code (init can never succeed against Gitea's API surface).
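A minimal sketch of the stub shape described above (step wording and the cron cadence are placeholders; the real file also carries the policy header):

```yaml
name: CodeQL                      # exact name — auto-promote keys on it
on:
  push:
  pull_request:
  merge_group:
  schedule:
    - cron: "0 6 * * 1"           # placeholder cadence, not the real one
jobs:
  analyze:
    name: Analyze (${{ matrix.language }})   # commit-status context parity
    runs-on: ubuntu-latest
    strategy:
      matrix:
        language: [go, javascript-typescript, python]
    steps:
      - run: echo "CodeQL is advisory (stub) — success for ${{ matrix.language }}"
```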
Policy
------
Per Hongming decision 2026-05-07 (#156): CodeQL is ADVISORY, not
blocking, until a Gitea-compatible SAST pipeline lands. The header
of the new workflow file documents this decision + lists the three
re-enable options (self-hosted Semgrep, Sonatype, GitHub mirror)
plus the compensating controls in place (secret-scan,
block-internal-paths, lint-curl-status-capture,
branch-protection-drift).
Closes #156. Touches #142 (no capital-M Molecule-AI refs in this
file — already lowercase per e01077be).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Run #1010 (post-#46) succeeded all the way to push but failed with
"repository molecule-ai/platform does not exist" — the platform image
ECR repo had never been created (only platform-tenant existed).
Created the repo via:
aws ecr create-repository --region us-east-2 \
--repository-name molecule-ai/platform \
--image-scanning-configuration scanOnPush=true
This is a one-line workflow comment to satisfy the path-filter and
re-run the publish workflow against the now-existing repo.
Closes #173 properly this time — pre-clone + inline ECR auth +
ECR repo all in place.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI run #987 (post-#45) showed `docker push` from shell still hits
"no basic auth credentials" — `aws-actions/amazon-ecr-login@v2`
writes auth to a step-scoped DOCKER_CONFIG that doesn't carry across
to the next shell step on Gitea Actions.
Fix: drop both `aws-actions/configure-aws-credentials@v4` and
`aws-actions/amazon-ecr-login@v2`. Run `aws ecr get-login-password |
docker login` inline in the same shell step as `docker build` +
`docker push`. AWS creds come from secrets via env vars, ECR token
is fresh per-step (12h validity is plenty), config.json lives in the
same shell process — auth state is guaranteed.
This is the operator-host manual approach mapped 1:1 into CI.
runner-base image already has aws-cli + docker (verified locally).
Closes #173 (fifth piece — and final; this matches the manual
flow exactly).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI run #946 (post-#43) confirmed `driver: docker` doesn't fix the ECR
push 401 either: buildx CLI inside the runner container talks to the
operator-host docker daemon (mounted socket), but the daemon doesn't
see the runner's ECR auth state, and the runner's buildx CLI doesn't
attach the auth header in a way the daemon accepts.
Drop buildx + build-push-action entirely. Plain `docker build` +
`docker push` from the runner container works because both use the
SAME docker socket + the SAME runner-container config.json (populated
by `aws ecr get-login-password | docker login` from amazon-ecr-login).
Trade-off: lose multi-arch support. We only ship linux/amd64 tenant
images today, so this is fine. If multi-arch becomes a requirement
later, we can revisit (likely with `docker buildx create
--driver=remote` pointing at an external buildkit, but that's
substantial infra work; not worth it for a single-arch shop).
Closes #173 (fourth piece — and hopefully the last; this matches
the operator-host manual approach exactly).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #38 + #41 fixed the Dockerfile-side clone issue. CI run #893 then
revealed two Gitea-Actions-specific issues with the unchanged buildx
config:
1. `failed to push: 401 Unauthorized` to ECR. Root cause: default
buildx driver `docker-container` spawns a buildkit container that
doesn't share the host's `~/.docker/config.json`, so the ECR auth
set up by amazon-ecr-login doesn't reach the push. Fix: pin
`driver: docker` so buildx delegates to the host daemon, which
already has the ECR creds.
2. `dial tcp ...:41939: i/o timeout` on `_apis/artifactcache/cache`.
Root cause: `cache-from/cache-to: type=gha` is GitHub-specific;
Gitea Actions has no compatible artifact-cache backend, so every
cache lookup fails after a 30s timeout. Fix: remove the cache-*
options. Cold-build cost is <10min for 37-repo clone + Go/Node
compile, acceptable. Could revisit with type=registry inline cache
later if rebuilds get painful.
With this + #38/#41, the workflow should run end-to-end on Gitea
Actions: pre-clone -> docker build (host daemon) -> ECR push.
Closes #173 (third and final piece).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The first PR (#38) only patched Dockerfile.tenant — but the workflow
also builds the platform image from workspace-server/Dockerfile, which
had the SAME in-image `git clone` stage. Build run #794 caught this:
"process clone-manifest.sh ... exit code 128" on the platform image.
Apply the same pre-clone shape to the platform Dockerfile: drop the
`templates` stage, COPY from .tenant-bundle-deps/ instead. The
workflow's existing "Pre-clone manifest deps" step (added in #38)
already populates .tenant-bundle-deps/ before either build runs, so no
workflow change needed.
Self-review note: the missed-platform-Dockerfile is a Phase 1 quality
miss — I read both files but only registered the tenant one as
in-scope. Saved memory `feedback_orchestrator_must_verify_before_declaring_fixed`
applies: should have grepped the whole workspace-server/ for "templates"
stages before claiming Task #173 done. CI run #794 caught it within
~6 minutes; net cost: one followup commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TestPooledWithEICTunnel_PreservesFnErr (and any sqlmock-using neighbour
test) was at risk of inheriting stale INSERT calls from a previous
test's coalesceRestart goroutine that survived its t.Cleanup boundary.
The production callsite shape is `go h.RestartByID(...)` from
a2a_proxy.go, a2a_proxy_helpers.go and main.go. When that goroutine's
runRestartCycle panics, coalesceRestart's deferred recover swallows it
to keep the platform process alive — but in tests, nothing waits for
the goroutine to fully exit. If it's still draining LogActivity-shaped
work after the test returns, those INSERTs land in the next test's
sqlmock connection as kind=DELEGATION_FAILED /
kind=WORKSPACE_PROVISION_FAILED, surfacing as "INSERT-not-expected".
Fix: introduce drainCoalesceGoroutine(t, wsID, cycle) test helper that
spawns coalesceRestart on a goroutine (matching production) and
registers a t.Cleanup with sync.WaitGroup.Wait so the test can't
declare itself done while a goroutine is still alive.
Convert TestCoalesceRestart_PanicInCycleClearsState to use the helper
(previously it called coalesceRestart synchronously, which never
exercised the production goroutine-survival contract).
Add TestCoalesceRestart_DrainHelperWaitsForGoroutineExit as the
regression guard: cycle blocks 150ms then panics; the test asserts
t.Run elapsed >= 150ms (proving the Wait barrier engaged) AND the
deferred close ran (proving the panic-recovery defer chain executed)
AND state.running was cleared. Verified the assertion is real by
mutation-testing: removing t.Cleanup(wg.Wait) makes this test FAIL
deterministically with elapsed <300µs.
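The wait-barrier contract (spawn background work, refuse to finish until it has fully exited) has a direct shell analogue — helper names here are hypothetical; the real helper is Go's drainCoalesceGoroutine using sync.WaitGroup plus t.Cleanup:

```shell
# spawn_drained: run a command in the background and remember its PID.
# drain_all: block until every spawned job has exited — the shell
# equivalent of t.Cleanup(wg.Wait) guarding against work that outlives
# the "test".
drain_pids=""
spawn_drained() {
  "$@" &
  drain_pids="$drain_pids $!"
}
drain_all() {
  for p in $drain_pids; do
    wait "$p" || true   # a failing job must not abort the drain itself
  done
  drain_pids=""
}
```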
Per saved memory feedback_assert_exact_not_substring: the regression
test asserts an exact-shape contract (elapsed >= blockFor) rather than
a substring-in-output, so it discriminates between "drain works" and
"drain skipped".
Per Phase 3: 10/10 race-detector runs pass for all TestCoalesceRestart_*
tests. Full ./internal/handlers/... suite green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
publish-workspace-server-image.yml could not run on Gitea Actions because
Dockerfile.tenant's stage 3 ran `git clone` against private Gitea repos
from inside the Docker build context, where no auth path exists. Every
workspace-server rebuild required a manual operator-host push.
Move cloning to the trusted CI context (where AUTO_SYNC_TOKEN — the
devops-engineer persona PAT — is naturally available). Dockerfile.tenant
now COPYs from .tenant-bundle-deps/, populated by the workflow's new
"Pre-clone manifest deps" step. The Gitea token never enters the image.
- scripts/clone-manifest.sh: optional MOLECULE_GITEA_TOKEN env embeds
basic-auth in the clone URL; redacted in log output. Anonymous fallback
preserved for future public-repo path.
- .github/workflows/publish-workspace-server-image.yml: new pre-clone
step before docker build; injects AUTO_SYNC_TOKEN. Fail-fast if the
secret is empty.
- workspace-server/Dockerfile.tenant: drop stage 3 (templates), COPY
from .tenant-bundle-deps/ instead. Header documents the prereq.
- .gitignore: ignore /.tenant-bundle-deps/ so a local build can't
accidentally commit cloned repos.
Verified locally: clone-manifest.sh with the devops-engineer persona
token cloned all 37 repos (9 ws + 7 org + 21 plugins, 4.9MB after
.git strip).
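The token-embed + redact shape can be sketched as follows (host, basic-auth username, and helper names are illustrative — the real script's details may differ):

```shell
# Build a clone URL, embedding basic-auth only when a token is present,
# and redact the token from anything that reaches the log.
clone_url() {
  repo="$1"
  if [ -n "${MOLECULE_GITEA_TOKEN:-}" ]; then
    # "oauth2" as the basic-auth username is an assumption here.
    echo "https://oauth2:${MOLECULE_GITEA_TOKEN}@git.example.com/${repo}.git"
  else
    echo "https://git.example.com/${repo}.git"   # anonymous fallback
  fi
}
redact() {
  # Assumes the token contains no sed metacharacters.
  if [ -n "${MOLECULE_GITEA_TOKEN:-}" ]; then
    sed "s/${MOLECULE_GITEA_TOKEN}/***/g"
  else
    cat
  fi
}
```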
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same SSOT-divergence shape as #10 / fixed in #12, but on the a2a-proxy
code path. The plugin handler was routed through `provisioner.RunningContainerName`;
a2a-proxy was forwarding optimistically and only catching missing containers
REACTIVELY via `maybeMarkContainerDead` after the network call timed out.
Result on tenants whose agent containers had been recycled (e.g. post-EC2
replace from molecule-controlplane#20): canvas waits 2-30s for the network
forward to fail before getting a 503, and the workspace-server logs only
"ProxyA2A forward error" without the "container is dead" signal.
This PR adds a proactive `Provisioner.IsRunning` check in `proxyA2ARequest`
between `resolveAgentURL` and `dispatchA2A`, gated on the conditions where
we know we're talking to a sibling Docker container we own (`h.provisioner
!= nil` AND `platformInDocker` AND the URL was rewritten to Docker-DNS form).
Three outcomes via the SSOT helper:
(true, nil) → forward as today
(false, nil) → fast-503 with `error="workspace container not running —
restart triggered"`, `restarting=true`, `preflight=true`,
plus the same offline-flip + WORKSPACE_OFFLINE broadcast +
async restart that `maybeMarkContainerDead` produces
(true, err) → fall through to optimistic forward (matches IsRunning's
"fail-soft as alive" contract — flaky daemon must not
trigger a restart cascade)
The `preflight=true` flag in the response distinguishes the proactive
short-circuit from the reactive `maybeMarkContainerDead` path so canvas
or downstream callers can render distinct messages later.
* `internal/handlers/a2a_proxy.go` — preflight call site between
resolveAgentURL and dispatchA2A; gated on `h.provisioner != nil &&
platformInDocker && url == http://<ContainerName(id)>:port`.
* `internal/handlers/a2a_proxy_helpers.go` — `preflightContainerHealth`
helper. Routes through `h.provisioner.IsRunning` (which itself wraps
`RunningContainerName`). Identical offline-flip side-effects as
`maybeMarkContainerDead` for the dead-container case.
* `internal/handlers/a2a_proxy_preflight_test.go` — 4 tests: running →
nil; not-running → structured 503 + sqlmock expectations on the
offline-flip + structure_events insert; transient error → nil
(fail-soft); AST gate pinning the SSOT routing (mirror of #12's gate).
Mutation-tested: removing the `if running { return nil }` guard makes
the production code fail to compile (unused var). A subtler mutation
(replacing the !running branch with `return nil`) would make
TestPreflight_ContainerNotRunning_StructuredFastFail fail at runtime
with sqlmock's "expected DB call did not occur."
Refs: molecule-core#36. Companion to #12 (issue #10).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a 'mock' runtime: virtual workspaces with no container, no EC2,
no LLM. Every A2A reply is synthesised from a small canned-variant
pool ('On it!', 'Got it, on it now.', etc.) deterministically seeded
by (workspace_id, request_id).
Built for funding-demo "200-workspace mock org" — renders an
enterprise-scale org chart on the canvas (CEO/VPs/Managers/ICs)
without burning real LLM credits or provisioning 200 EC2 instances.
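The deterministic pick reduces to hashing the (workspace_id, request_id) pair into a stable pool index — sketched here in shell with cksum as a stand-in for the real (Go) seeding; the last two pool entries are invented for illustration:

```shell
# Deterministic canned-reply pick: the same (workspace, request) pair
# always yields the same variant, with no RNG state.
pick_reply() {
  seed=$(printf '%s:%s' "$1" "$2" | cksum | awk '{print $1}')
  set -- "On it!" "Got it, on it now." "Working on it." "Looking at it now."
  shift $(( seed % $# ))
  printf '%s\n' "$1"
}
```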
Surfaces:
- workspace-server/internal/handlers/mock_runtime.go: A2A proxy
short-circuit, canned-reply pool, deterministic variant pick.
- workspace-server/internal/handlers/a2a_proxy.go: gate the
short-circuit before resolveAgentURL (mock has no URL).
- workspace-server/internal/handlers/org_import.go: skip Docker
provisioning for mock workspaces, set status='online' directly,
drop the per-sibling 2s pacing for mock children (collapses
a 200-workspace import from ~7min → ~1s).
- workspace-server/internal/handlers/runtime_registry.go: register
'mock' in the runtime allowlist (manifest + fallback set).
- workspace-server/internal/registry/healthsweep.go +
orphan_sweeper.go: skip mock workspaces in container-health and
stale-token sweeps (no container by design).
- workspace-server/internal/handlers/workspace_restart.go: mirror
the 'external' Restart no-op for mock.
- manifest.json: register the new
Molecule-AI/molecule-ai-org-template-mock-bigorg repo.
Tests: 5 new in mock_runtime_test.go covering happy-path, non-mock
regression guard, determinism, IsMockRuntime trim/case, JSON-RPC
id echo. All existing handler + registry tests still pass.
Local-verified: imported the 200-workspace template against a fresh
postgres+redis, confirmed all 200 land in 'online' and stay there
through the 30s health-sweep window, exercised A2A on CEO + VPs +
Managers + ICs and saw the variant pool rotate.
Org template lives at
Molecule-AI/molecule-ai-org-template-mock-bigorg (created today)
and is imported via the existing /org/import flow on the canvas
Template Palette.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Funding-demo Mock #1: when the canvas loads with `?purchase_success=1`,
show a centred success modal in the warm-paper theme. Auto-dismisses
after 5s; Close button + Esc + backdrop click also dismiss; URL params
are stripped on first paint so a refresh after dismiss does not
re-trigger.
Mounted in `app/layout.tsx` (not `app/page.tsx`) so the modal persists
across the canvas page-state transitions (loading → hydrated → error)
without unmounting and losing its open-state.
No real billing logic — the marketplace "Purchase" button on the
landing page redirects here with the flag; this modal is the only
thing the user sees of the "transaction".
Local-verified end-to-end via playwright (5/5 tests pass): redirect
URL shape, modal visibility, URL cleanup, close button,
refresh-after-dismiss behaviour, 5s auto-dismiss.
Pairs with the Purchase button added to landingpage Marketplace
section.
scripts/clone-manifest.sh runs inside the platform Dockerfile build,
so a change to that script needs to retrigger publish. Without that
filter entry, the prior fix (clone via Gitea + lowercase org) didn't
trigger this workflow, because scripts/ wasn't in the path filter.
Also serves as the file change to satisfy the path filter for THIS
push, retriggering publish-workspace-server-image now.
Post-2026-05-06 GitHub-org suspension: scripts/clone-manifest.sh
was still pointing at https://github.com/${repo}.git, so the
Docker build for workspace-server's platform image fails at:
fatal: could not read Username for 'https://github.com':
No such device or address
with no credentials available in the build container.
Fix: clone from https://git.moleculesai.app/${repo}.git instead.
manifest.json's repo paths still read 'Molecule-AI/...' (the
historic GitHub slug, mixed-case); Gitea lowercases the org
component to 'molecule-ai/...'. Lowercase the org segment on
the fly with awk so we don't need to rewrite every manifest
entry.
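The transform is a one-liner of this shape (helper name illustrative; the script's actual awk program may differ):

```shell
# Lowercase only the org segment of an org/repo slug:
#   Molecule-AI/foo -> molecule-ai/foo (repo segment untouched)
lower_org() {
  echo "$1" | awk -F/ 'BEGIN { OFS = "/" } { $1 = tolower($1); print }'
}
```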
Local verify: bash -n passes, lowercase transform produces correct
Gitea paths, anonymous git clone of one of the manifest plugins
over HTTPS to git.moleculesai.app succeeds.
Class G in the prod-ship CI sweep — same shape as the github.com
refs Harness Replays hits; this is the second instance found.
Two coupled cleanups for the post-2026-05-06 stack:
============================================
The plugin injected GITHUB_TOKEN/GH_TOKEN via the App's
installation-access flow (~hourly rotation). Per-agent Gitea
identities replaced this approach after the 2026-05-06 suspension —
workspaces now provision with a per-persona Gitea PAT from .env
instead of an App-rotated token. The plugin code itself lived on
github.com/Molecule-AI/molecule-ai-plugin-github-app-auth which is
also unreachable post-suspension; checking it out at CI build time
was already failing.
Removed:
- workspace-server/cmd/server/main.go: githubappauth import + the
`if os.Getenv("GITHUB_APP_ID") != ""` block that called
BuildRegistry. gh-identity remains as the active mutator.
- workspace-server/Dockerfile + Dockerfile.tenant: COPY of the
sibling repo + injection of the `replace
github.com/Molecule-AI/molecule-ai-plugin-github-app-auth =>
/plugin` directive.
- workspace-server/go.mod + go.sum: github-app-auth dep entry
(cleaned up by `go mod tidy`).
- 3 workflows: actions/checkout steps for the sibling plugin repo:
- .github/workflows/codeql.yml (Go matrix path)
- .github/workflows/harness-replays.yml
- .github/workflows/publish-workspace-server-image.yml
Verified `go build ./cmd/server` + `go vet ./...` pass post-removal.
=======================================================
Same workflow used to push to ghcr.io/molecule-ai/platform +
platform-tenant. ghcr.io/molecule-ai is gone post-suspension. The
operator's ECR org (153263036946.dkr.ecr.us-east-2.amazonaws.com/
molecule-ai/) already hosts platform-tenant + workspace-template-*
+ runner-base images and is the post-suspension SSOT for container
images. This PR aligns publish-workspace-server-image with that
stack.
- env.IMAGE_NAME + env.TENANT_IMAGE_NAME repointed to ECR URL.
- docker/login-action swapped for the
aws-actions/configure-aws-credentials@v4 +
aws-actions/amazon-ecr-login@v2 chain (the standard ECR auth
pattern; uses AWS_ACCESS_KEY_ID/SECRET secrets bound to the
molecule-cp IAM user).
The :staging-<sha> + :staging-latest tag policy is unchanged —
staging-CP's TENANT_IMAGE pin still points at :staging-latest, just
with the new registry prefix.
Refs molecule-core#157, #161; parallel to org-wide CI-green sweep.
Gitea is case-sensitive on owner slugs; canonical is lowercase
`molecule-ai/...`. Mixed-case `Molecule-AI/...` refs fail-at-0s
when the runner tries to resolve the cross-repo workflow / checkout.
Same fix as molecule-controlplane#12. Mechanical case-correction;
no behavior change beyond making CI resolve again.
Refs: internal#46
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two coupled fixes for molecule-core#10 (plugin install 503 vs
status=online split-state):
1. SSOT for "is this workspace's container running" — `findRunningContainer`
in plugins.go used to carry its own copy of `cli.ContainerInspect`, which
collapsed transient daemon errors into the same `""` return as a
genuinely-stopped container. Healthsweep's `Provisioner.IsRunning`
handled the same input correctly (defensive). Promote the inspect logic
to `provisioner.RunningContainerName`, route both consumers through it.
Transient errors get a distinct log line on the plugins side so triage
doesn't confuse a flaky daemon with a stopped container.
2. Runtime-aware Install/Uninstall — `runtime='external'` workspaces have
no local container; push-install via docker exec is meaningless. They
pull plugins via the download endpoint instead (Phase 30.3). Without a
guard they fell through to `findRunningContainer` and 503'd with a
misleading "container not running." Add an early 422 with a hint
pointing at the download endpoint.
The two fixes are independent: (1) preserves correctness when the SSOT
helper is later modified; (2) eliminates the persistent split-state on
the 5 external persona-agent workspaces in this DB (and on tenant
deployments hitting the same shape).
* `internal/provisioner/provisioner.go` — new `RunningContainerName(ctx,
cli, id) (string, error)` with three documented outcomes (running /
stopped / transient). `Provisioner.IsRunning` now wraps it; behavior
preserved.
* `internal/handlers/plugins.go` — `findRunningContainer` shimmed onto
`RunningContainerName`; new `isExternalRuntime(id)` predicate.
* `internal/handlers/plugins_install.go` — Install + Uninstall reject
external runtimes with 422 + hint, before the source-fetch step.
* `internal/handlers/plugins_install_external_test.go` — 5 cases:
external→422, uninstall-external→422, container-backed-falls-through,
no-runtime-lookup-fails-open, lookup-error-fails-open.
* `internal/handlers/plugins_findrunning_ssot_test.go` — two AST gates
pin the SSOT routing so future PRs can't silently re-introduce the
parallel impl. Mutation-tested: reverting either consumer to a direct
`ContainerInspect` makes the gate fail.
Refs: molecule-core#10
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>