Part of the post-#66 sweep to remove `gh` CLI dependencies that fail
silently against Gitea (which exposes /api/v1 only — no GraphQL → 405,
no /api/v3 → 404). Class D covers `gh api` REST passthroughs that
either have a Gitea v1 equivalent at a different path/shape or no
equivalent at all.
Three files in this class, each with a different fix shape because
each underlying Gitea capability is different:
`auto-promote-on-e2e.yml` (compute SHA ancestry):
Old: `gh api repos/.../compare/A...B` returning `.status`
(ahead|behind|identical|diverged).
Gitea: `/api/v1/repos/.../compare/A...B` accepts only branch / tag
refs — full commit SHAs return `BaseNotExist`. So even a
"translate the URL" rewrite would fail. Verified empirically
2026-05-07: branches/tags work, SHAs don't.
Fix: Add `actions/checkout@v6 fetch-depth=200` + use
`git merge-base --is-ancestor` locally. Exact same four-bucket
semantics (ahead | behind | diverged | error), zero cross-host
API dependency. Same pattern PR #66 used for auto-sync. The
200-commit depth comfortably covers any realistic divergence
between :latest and a candidate retag (promotes are minutes
apart, not hundreds of commits).
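The four-bucket classification reduces to two `git merge-base --is-ancestor` probes plus an explicit identical check (function name is illustrative):

```shell
# Local replacement for the GitHub compare API's .status field, using
# only git plumbing. --is-ancestor exit-code contract: 0 = ancestor,
# 1 = not an ancestor.
compare_status() {
  base="$1" head="$2"
  if [ "$(git rev-parse "$base")" = "$(git rev-parse "$head")" ]; then
    echo identical
  elif git merge-base --is-ancestor "$base" "$head"; then
    echo ahead      # head is strictly ahead of base
  elif git merge-base --is-ancestor "$head" "$base"; then
    echo behind     # head is strictly behind base
  else
    echo diverged   # neither is an ancestor of the other
  fi
}
```

The identical check must come first, since a commit counts as its own ancestor.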
`ci.yml` (canvas-deploy-reminder commit comment):
Old: `gh api -X POST repos/.../commits/{sha}/comments` posting a
deploy-reminder body for the operator.
Gitea: NO commit-comments endpoint exists —
`/repos/.../commits/{sha}/comments` returns 404 (verified
2026-05-07). Gitea only exposes `/commits/{sha}/statuses` for
commit-level surface, which is the wrong shape for a free-form
reminder.
Fix: Drop the API call. Write the reminder body to
`$GITHUB_STEP_SUMMARY` instead. The reminder is entirely
operator-facing and is just as discoverable on the run summary
page (which an operator naturally lands on when they need to
action a deploy). Commit comments were a stale UI artefact of
the GitHub era, not a load-bearing automation surface.
Permission: drop `contents: write` (no longer needed) → `read`,
smallest scope per least-privilege.
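The replacement step body is a plain append to the summary file (the reminder text here is illustrative, not the workflow's actual wording):

```shell
# Write the operator-facing reminder to the run-summary page instead of
# POSTing a commit comment. GITHUB_STEP_SUMMARY is a file path the
# runner provides; markdown appended to it renders on the run summary.
write_deploy_reminder() {
  {
    echo "## Canvas deploy reminder"
    echo "Commit ${GITHUB_SHA:-unknown} touched canvas paths — deploy when ready."
  } >> "$GITHUB_STEP_SUMMARY"
}
```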
`check-merge-group-trigger.yml` (merge_group: trigger linter):
Old: `gh api .../branches/staging/protection/required_status_checks`
reading the contexts list, then walking workflow files.
Gitea: branch-protection API is at
/api/v1/repos/.../branch_protections/{name} (different path)
with `status_check_contexts` (different field name) — but the
entire workflow only existed to lint that workflows producing
a required check declare a `merge_group:` trigger, which is
needed because GitHub's merge queue dead-locks at
AWAITING_CHECKS when the trigger is missing. Gitea has NO
merge queue, NO gh-readonly-queue/... ref shape, NO
merge_group event semantics. The dead-lock pattern this
linter catches cannot occur on Gitea by construction.
Fix: Convert to no-op stub (same pattern as the CodeQL stub
landed in PR #51). Workflow name + trigger surface preserved
so any external referrer (none confirmed via the 2026-05-07
branch-protection audit) keeps resolving. Re-enable path
documented in the file header for if/when Gitea grows a
merge queue.
curl invocation pattern: `curl --fail-with-body -sS` (NOT `-fsS` —
`-f`/`--fail` and `--fail-with-body` are mutually exclusive in
modern curl, and `-fsS` bundles `-f`).
Token model: workflows continue to use act_runner's GITHUB_TOKEN
where they still need API access (`auto-promote-on-e2e.yml`'s
checkout uses the runner's default token; `ci.yml` no longer
needs any API auth for the deploy-reminder step; and
`check-merge-group-trigger.yml` no longer makes any API calls).
Verification:
- YAML syntax validates for all three files.
- Live curl against Gitea confirms `/compare/A...B` accepts branch
refs (200, total_commits=N) and refuses full SHAs (404,
BaseNotExist) — justifying the local-git approach.
- `/repos/.../commits/{sha}/comments` confirmed 404 on Gitea.
- `git merge-base --is-ancestor` exit-code semantics match the
GitHub compare API status semantics exactly: ahead = current is
ancestor of target; behind = target is ancestor of current;
diverged = neither.
Closes part of #75. Class A landed in #80; class F (gh run list →
no Gitea workflow-runs API at all) lands in a separate PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause of `Auto-sync main → staging / sync-staging (push)`
failing every push to main since the GitHub→Gitea migration:
The workflow assumed a GitHub `merge_queue` ruleset on staging
(blocking direct push) and used `gh pr create` + `gh pr merge
--auto` to land sync via the queue. On Gitea this fails at the
`gh pr create` step with `HTTP 405 Method Not Allowed
(https://git.moleculesai.app/api/graphql)` — Gitea exposes no
GraphQL endpoint, and the GitHub-CLI cannot ship PRs against
Gitea.
Verified failure mode in run 1117/job 0 (token logs at
/tmp/log2.txt, run target /molecule-ai/molecule-core/actions/
runs/1117/jobs/0). The merge step succeeded and pushed
auto-sync/main-1e1f4d63; the PR step failed with the 405. So
every main push left an orphan auto-sync/* branch and a red CI
status, with no PR to land it.
Fix: the staging branch protection on Gitea
(`enable_push: true`, `push_whitelist_usernames:
[devops-engineer]`) already permits direct push from the
devops-engineer persona. Drop the entire merge-queue PR
architecture and replace with:
1. Checkout staging with secrets.AUTO_SYNC_TOKEN
(devops-engineer persona token, NOT founder PAT —
`feedback_per_agent_gitea_identity_default`).
2. `git fetch origin main` + ff-merge or no-ff merge.
3. `git push origin staging` directly.
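Steps 1–3 reduce to this shape (branch and remote names from the description above; the token wiring happens in the checkout step and is elided here):

```shell
# Sync staging with main by direct push — no PR, no merge queue.
# Assumes the checkout used a token whose user is on the staging
# push whitelist.
sync_staging() {
  git fetch origin main
  git checkout staging
  # --no-edit keeps the merge-commit message deterministic; the merge
  # fast-forwards automatically when staging hasn't diverged.
  git merge --no-edit origin/main
  git push origin staging
}
```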
The AUTO_SYNC_TOKEN repo secret already exists (created
2026-05-07 14:00 alongside the staging push_whitelist update).
Workflow name + job name unchanged → required-check name
`Auto-sync main → staging / sync-staging (push)` keeps the
same context, no branch-protection edits needed.
Rejected alternatives (documented in workflow header):
- Reuse PR architecture via Gitea REST: ~80 LOC of API
plumbing for no benefit; direct push works.
- GH_HOST=git.moleculesai.app: still calls /api/graphql,
same 405; doesn't fix the root issue.
- Custom JS action: external dep for a 5-line `git push`.
Header comment in the workflow now documents:
- What this workflow does (SSOT for staging advancing).
- Why direct push (GitHub merge_queue → Gitea push_whitelist).
- Identity and token (anti-bot-ring per saved memory).
- Failure modes A–D with operator runbook for each.
- Loop safety (push to staging doesn't fire push:main → no
recursion).
Verification plan: this fix-PR's merge to main is itself the
trigger; watch the workflow run on the merge commit and on
one follow-up trigger commit, expect both green.
Refs: failing run https://git.moleculesai.app/molecule-ai/
molecule-core/actions/runs/1117/jobs/0
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why
---
PR #35 marked `continue-on-error: true` at the JOB level (correct YAML),
but Gitea Actions 1.22.6 does NOT propagate job-level continue-on-error
to the commit-status API — every matrix leg still posts `failure`. That
keeps OVERALL=failure on every push to main + staging and blocks the
auto-promote signal even when every other gate is green.
Worse: the underlying CodeQL run never actually worked on Gitea. The
github/codeql-action/init@v4 step calls api.github.com bundle endpoints
(CLI download + query packs + telemetry) that Gitea does NOT proxy.
Confirmed via live-tested run 1d/3101 on operator host:
2026-05-07T20:55:17 ::group::Run Initialize CodeQL
with: languages: ${{ matrix.language }}
queries: security-extended
2026-05-07T20:55:36 ::error::404 page not found
2026-05-07T20:55:50 Failure - Main Initialize CodeQL
2026-05-07T20:55:51 skipping Perform CodeQL Analysis (main skipped)
2026-05-07T20:55:51 ::warning::No files were found at sarif-results/go/
The SARIF artifact upload was already a no-op (warning above) — the
analyze step never wrote anything because init failed. So nothing of
value is being lost by stubbing this out.
What
----
- Convert the workflow to a single-step stub that emits success per
matrix language (go, javascript-typescript, python).
- Keep workflow `name: CodeQL` exactly (auto-promote-staging.yml
line 67 keys on it as a workflow_run gate).
- Keep job name template `Analyze (${{ matrix.language }})` and the
3-leg matrix exactly (commit-status context names + branch
protection + #144 required-check-name parity).
- Keep all four triggers (push / pull_request / merge_group /
schedule) so merge_group required-checks parity holds.
- Drop the codeql-action steps, the Autobuild step, the SARIF parse
step, and the upload-artifact step — all four of those are now
dead code (init can never succeed against Gitea's API surface).
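A minimal sketch of the stub shape described above (step wording and the cron cadence are placeholders; the real file also carries the policy header):

```yaml
name: CodeQL                      # exact name — auto-promote keys on it
on:
  push:
  pull_request:
  merge_group:
  schedule:
    - cron: "0 6 * * 1"           # placeholder cadence, not the real one
jobs:
  analyze:
    name: Analyze (${{ matrix.language }})   # commit-status context parity
    runs-on: ubuntu-latest
    strategy:
      matrix:
        language: [go, javascript-typescript, python]
    steps:
      - run: echo "CodeQL is advisory (stub) — success for ${{ matrix.language }}"
```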
Policy
------
Per Hongming decision 2026-05-07 (#156): CodeQL is ADVISORY, not
blocking, until a Gitea-compatible SAST pipeline lands. The header
of the new workflow file documents this decision + lists the three
re-enable options (self-hosted Semgrep, Sonatype, GitHub mirror)
plus the compensating controls in place (secret-scan,
block-internal-paths, lint-curl-status-capture,
branch-protection-drift).
Closes #156. Touches #142 (no capital-M Molecule-AI refs in this
file — already lowercase per e01077be).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Run #1010 (post-#46) succeeded all the way to push but failed with
"repository molecule-ai/platform does not exist" — the platform image
ECR repo had never been created (only platform-tenant existed).
Created the repo via:
aws ecr create-repository --region us-east-2 \
--repository-name molecule-ai/platform \
--image-scanning-configuration scanOnPush=true
This is a one-line workflow comment to satisfy the path-filter and
re-run the publish workflow against the now-existing repo.
Closes #173 properly this time — pre-clone + inline ECR auth +
ECR repo all in place.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI run #987 (post-#45) showed `docker push` from shell still hits
"no basic auth credentials" — `aws-actions/amazon-ecr-login@v2`
writes auth to a step-scoped DOCKER_CONFIG that doesn't carry across
to the next shell step on Gitea Actions.
Fix: drop both `aws-actions/configure-aws-credentials@v4` and
`aws-actions/amazon-ecr-login@v2`. Run `aws ecr get-login-password |
docker login` inline in the same shell step as `docker build` +
`docker push`. AWS creds come from secrets via env vars, ECR token
is fresh per-step (12h validity is plenty), config.json lives in the
same shell process — auth state is guaranteed.
This is the operator-host manual approach mapped 1:1 into CI.
runner-base image already has aws-cli + docker (verified locally).
Closes #173 (fifth piece — and final; this matches the manual
flow exactly).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI run #946 (post-#43) confirmed `driver: docker` doesn't fix the ECR
push 401 either: buildx CLI inside the runner container talks to the
operator-host docker daemon (mounted socket), but the daemon doesn't
see the runner's ECR auth state, and the runner's buildx CLI doesn't
attach the auth header in a way the daemon accepts.
Drop buildx + build-push-action entirely. Plain `docker build` +
`docker push` from the runner container works because both use the
SAME docker socket + the SAME runner-container config.json (populated
by `aws ecr get-login-password | docker login` from amazon-ecr-login).
Trade-off: lose multi-arch support. We only ship linux/amd64 tenant
images today, so this is fine. If multi-arch becomes a requirement
later, we can revisit (likely with `docker buildx create
--driver=remote` pointing at an external buildkit, but that's
substantial infra work; not worth it for a single-arch shop).
Closes #173 (fourth piece — and hopefully the last; this matches
the operator-host manual approach exactly).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #38 + #41 fixed the Dockerfile-side clone issue. CI run #893 then
revealed two Gitea-Actions-specific issues with the unchanged buildx
config:
1. `failed to push: 401 Unauthorized` to ECR. Root cause: default
buildx driver `docker-container` spawns a buildkit container that
doesn't share the host's `~/.docker/config.json`, so the ECR auth
set up by amazon-ecr-login doesn't reach the push. Fix: pin
`driver: docker` so buildx delegates to the host daemon, which
already has the ECR creds.
2. `dial tcp ...:41939: i/o timeout` on `_apis/artifactcache/cache`.
Root cause: `cache-from/cache-to: type=gha` is GitHub-specific;
Gitea Actions has no compatible artifact-cache backend, so every
cache lookup fails after a 30s timeout. Fix: remove the cache-*
options. Cold-build cost is <10min for 37-repo clone + Go/Node
compile, acceptable. Could revisit with type=registry inline cache
later if rebuilds get painful.
With this + #38/#41, the workflow should run end-to-end on Gitea
Actions: pre-clone -> docker build (host daemon) -> ECR push.
Closes #173 (third and final piece).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The first PR (#38) only patched Dockerfile.tenant — but the workflow
also builds the platform image from workspace-server/Dockerfile, which
had the SAME in-image `git clone` stage. Build run #794 caught this:
"process clone-manifest.sh ... exit code 128" on the platform image.
Apply the same pre-clone shape to the platform Dockerfile: drop the
`templates` stage, COPY from .tenant-bundle-deps/ instead. The
workflow's existing "Pre-clone manifest deps" step (added in #38)
already populates .tenant-bundle-deps/ before either build runs, so no
workflow change needed.
Self-review note: the missed-platform-Dockerfile is a Phase 1 quality
miss — I read both files but only registered the tenant one as
in-scope. Saved memory `feedback_orchestrator_must_verify_before_declaring_fixed`
applies: should have grepped the whole workspace-server/ for "templates"
stages before claiming Task #173 done. CI run #794 caught it within
~6 minutes; net cost: one followup commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TestPooledWithEICTunnel_PreservesFnErr (and any sqlmock-using neighbour
test) was at risk of inheriting stale INSERT calls from a previous
test's coalesceRestart goroutine that survived its t.Cleanup boundary.
The production callsite shape is `go h.RestartByID(...)` from
a2a_proxy.go, a2a_proxy_helpers.go and main.go. When that goroutine's
runRestartCycle panics, coalesceRestart's deferred recover swallows it
to keep the platform process alive — but in tests, nothing waits for
the goroutine to fully exit. If it's still draining LogActivity-shaped
work after the test returns, those INSERTs land in the next test's
sqlmock connection as kind=DELEGATION_FAILED /
kind=WORKSPACE_PROVISION_FAILED, surfacing as "INSERT-not-expected".
Fix: introduce drainCoalesceGoroutine(t, wsID, cycle) test helper that
spawns coalesceRestart on a goroutine (matching production) and
registers a t.Cleanup with sync.WaitGroup.Wait so the test can't
declare itself done while a goroutine is still alive.
Convert TestCoalesceRestart_PanicInCycleClearsState to use the helper
(previously it called coalesceRestart synchronously, which never
exercised the production goroutine-survival contract).
Add TestCoalesceRestart_DrainHelperWaitsForGoroutineExit as the
regression guard: cycle blocks 150ms then panics; the test asserts
t.Run elapsed >= 150ms (proving the Wait barrier engaged) AND the
deferred close ran (proving the panic-recovery defer chain executed)
AND state.running was cleared. Verified the assertion is real by
mutation-testing: removing t.Cleanup(wg.Wait) makes this test FAIL
deterministically with elapsed <300µs.
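The wait-barrier contract (spawn background work, refuse to finish until it has fully exited) has a direct shell analogue — helper names here are hypothetical; the real helper is Go's drainCoalesceGoroutine using sync.WaitGroup plus t.Cleanup:

```shell
# spawn_drained: run a command in the background and remember its PID.
# drain_all: block until every spawned job has exited — the shell
# equivalent of t.Cleanup(wg.Wait) guarding against work that outlives
# the "test".
drain_pids=""
spawn_drained() {
  "$@" &
  drain_pids="$drain_pids $!"
}
drain_all() {
  for p in $drain_pids; do
    wait "$p" || true   # a failing job must not abort the drain itself
  done
  drain_pids=""
}
```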
Per saved memory feedback_assert_exact_not_substring: the regression
test asserts an exact-shape contract (elapsed >= blockFor) rather than
a substring-in-output, so it discriminates between "drain works" and
"drain skipped".
Per Phase 3: 10/10 race-detector runs pass for all TestCoalesceRestart_*
tests. Full ./internal/handlers/... suite green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
publish-workspace-server-image.yml could not run on Gitea Actions because
Dockerfile.tenant's stage 3 ran `git clone` against private Gitea repos
from inside the Docker build context, where no auth path exists. Every
workspace-server rebuild required a manual operator-host push.
Move cloning to the trusted CI context (where AUTO_SYNC_TOKEN — the
devops-engineer persona PAT — is naturally available). Dockerfile.tenant
now COPYs from .tenant-bundle-deps/, populated by the workflow's new
"Pre-clone manifest deps" step. The Gitea token never enters the image.
- scripts/clone-manifest.sh: optional MOLECULE_GITEA_TOKEN env embeds
basic-auth in the clone URL; redacted in log output. Anonymous fallback
preserved for future public-repo path.
- .github/workflows/publish-workspace-server-image.yml: new pre-clone
step before docker build; injects AUTO_SYNC_TOKEN. Fail-fast if the
secret is empty.
- workspace-server/Dockerfile.tenant: drop stage 3 (templates), COPY
from .tenant-bundle-deps/ instead. Header documents the prereq.
- .gitignore: ignore /.tenant-bundle-deps/ so a local build can't
accidentally commit cloned repos.
Verified locally: clone-manifest.sh with the devops-engineer persona
token cloned all 37 repos (9 ws + 7 org + 21 plugins, 4.9MB after
.git strip).
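The token-embed + redact shape can be sketched as follows (host, basic-auth username, and helper names are illustrative — the real script's details may differ):

```shell
# Build a clone URL, embedding basic-auth only when a token is present,
# and redact the token from anything that reaches the log.
clone_url() {
  repo="$1"
  if [ -n "${MOLECULE_GITEA_TOKEN:-}" ]; then
    # "oauth2" as the basic-auth username is an assumption here.
    echo "https://oauth2:${MOLECULE_GITEA_TOKEN}@git.example.com/${repo}.git"
  else
    echo "https://git.example.com/${repo}.git"   # anonymous fallback
  fi
}
redact() {
  # Assumes the token contains no sed metacharacters.
  if [ -n "${MOLECULE_GITEA_TOKEN:-}" ]; then
    sed "s/${MOLECULE_GITEA_TOKEN}/***/g"
  else
    cat
  fi
}
```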
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same SSOT-divergence shape as #10 / fixed in #12, but on the a2a-proxy
code path. The plugin handler was routed through `provisioner.RunningContainerName`;
a2a-proxy was forwarding optimistically and only catching missing containers
REACTIVELY via `maybeMarkContainerDead` after the network call timed out.
Result on tenants whose agent containers had been recycled (e.g. post-EC2
replace from molecule-controlplane#20): canvas waits 2-30s for the network
forward to fail before getting a 503, and the workspace-server logs only
"ProxyA2A forward error" without the "container is dead" signal.
This PR adds a proactive `Provisioner.IsRunning` check in `proxyA2ARequest`
between `resolveAgentURL` and `dispatchA2A`, gated on the conditions where
we know we're talking to a sibling Docker container we own (`h.provisioner
!= nil` AND `platformInDocker` AND the URL was rewritten to Docker-DNS form).
Three outcomes via the SSOT helper:
(true, nil) → forward as today
(false, nil) → fast-503 with `error="workspace container not running —
restart triggered"`, `restarting=true`, `preflight=true`,
plus the same offline-flip + WORKSPACE_OFFLINE broadcast +
async restart that `maybeMarkContainerDead` produces
(true, err) → fall through to optimistic forward (matches IsRunning's
"fail-soft as alive" contract — flaky daemon must not
trigger a restart cascade)
The `preflight=true` flag in the response distinguishes the proactive
short-circuit from the reactive `maybeMarkContainerDead` path so canvas
or downstream callers can render distinct messages later.
* `internal/handlers/a2a_proxy.go` — preflight call site between
resolveAgentURL and dispatchA2A; gated on `h.provisioner != nil &&
platformInDocker && url == http://<ContainerName(id)>:port`.
* `internal/handlers/a2a_proxy_helpers.go` — `preflightContainerHealth`
helper. Routes through `h.provisioner.IsRunning` (which itself wraps
`RunningContainerName`). Identical offline-flip side-effects as
`maybeMarkContainerDead` for the dead-container case.
* `internal/handlers/a2a_proxy_preflight_test.go` — 4 tests: running →
nil; not-running → structured 503 + sqlmock expectations on the
offline-flip + structure_events insert; transient error → nil
(fail-soft); AST gate pinning the SSOT routing (mirror of #12's gate).
Mutation-tested: removing the `if running { return nil }` guard makes
the production code fail to compile (unused var). A subtler mutation
(replacing the !running branch with `return nil`) would make
TestPreflight_ContainerNotRunning_StructuredFastFail fail at runtime
with sqlmock's "expected DB call did not occur."
Refs: molecule-core#36. Companion to #12 (issue #10).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a 'mock' runtime: virtual workspaces with no container, no EC2,
no LLM. Every A2A reply is synthesised from a small canned-variant
pool ('On it!', 'Got it, on it now.', etc.) deterministically seeded
by (workspace_id, request_id).
Built for funding-demo "200-workspace mock org" — renders an
enterprise-scale org chart on the canvas (CEO/VPs/Managers/ICs)
without burning real LLM credits or provisioning 200 EC2 instances.
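The deterministic pick reduces to hashing the (workspace_id, request_id) pair into a stable pool index — sketched here in shell with cksum as a stand-in for the real (Go) seeding; the last two pool entries are invented for illustration:

```shell
# Deterministic canned-reply pick: the same (workspace, request) pair
# always yields the same variant, with no RNG state.
pick_reply() {
  seed=$(printf '%s:%s' "$1" "$2" | cksum | awk '{print $1}')
  set -- "On it!" "Got it, on it now." "Working on it." "Looking at it now."
  shift $(( seed % $# ))
  printf '%s\n' "$1"
}
```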
Surfaces:
- workspace-server/internal/handlers/mock_runtime.go: A2A proxy
short-circuit, canned-reply pool, deterministic variant pick.
- workspace-server/internal/handlers/a2a_proxy.go: gate the
short-circuit before resolveAgentURL (mock has no URL).
- workspace-server/internal/handlers/org_import.go: skip Docker
provisioning for mock workspaces, set status='online' directly,
drop the per-sibling 2s pacing for mock children (collapses
a 200-workspace import from ~7min → ~1s).
- workspace-server/internal/handlers/runtime_registry.go: register
'mock' in the runtime allowlist (manifest + fallback set).
- workspace-server/internal/registry/healthsweep.go +
orphan_sweeper.go: skip mock workspaces in container-health and
stale-token sweeps (no container by design).
- workspace-server/internal/handlers/workspace_restart.go: mirror
the 'external' Restart no-op for mock.
- manifest.json: register the new
Molecule-AI/molecule-ai-org-template-mock-bigorg repo.
Tests: 5 new in mock_runtime_test.go covering happy-path, non-mock
regression guard, determinism, IsMockRuntime trim/case, JSON-RPC
id echo. All existing handler + registry tests still pass.
Local-verified: imported the 200-workspace template against a fresh
postgres+redis, confirmed all 200 land in 'online' and stay there
through the 30s health-sweep window, exercised A2A on CEO + VPs +
Managers + ICs and saw the variant pool rotate.
Org template lives at
Molecule-AI/molecule-ai-org-template-mock-bigorg (created today)
and is imported via the existing /org/import flow on the canvas
Template Palette.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Funding-demo Mock #1: when the canvas loads with `?purchase_success=1`,
show a centred success modal in the warm-paper theme. Auto-dismisses
after 5s; Close button + Esc + backdrop click also dismiss; URL params
are stripped on first paint so a refresh after dismiss does not
re-trigger.
Mounted in `app/layout.tsx` (not `app/page.tsx`) so the modal persists
across the canvas page-state transitions (loading → hydrated → error)
without unmounting and losing its open-state.
No real billing logic — the marketplace "Purchase" button on the
landing page redirects here with the flag; this modal is the only
thing the user sees of the "transaction".
Local-verified end-to-end via playwright (5/5 tests pass): redirect
URL shape, modal visibility, URL cleanup, close button,
refresh-after-dismiss behaviour, 5s auto-dismiss.
Pairs with the Purchase button added to landingpage Marketplace
section.
scripts/clone-manifest.sh runs inside the platform Dockerfile build,
so a change to that script needs to retrigger publish. Without that
filter entry, the prior fix (clone via Gitea + lowercase org) didn't
trigger this workflow, because scripts/ wasn't in the path filter.
Also serves as the file change to satisfy the path filter for THIS
push, retriggering publish-workspace-server-image now.
Post-2026-05-06 GitHub-org suspension: scripts/clone-manifest.sh
was still pointing at https://github.com/${repo}.git, so the
Docker build for workspace-server's platform image fails at:
fatal: could not read Username for 'https://github.com':
No such device or address
with no credentials available in the build container.
Fix: clone from https://git.moleculesai.app/${repo}.git instead.
manifest.json's repo paths still read 'Molecule-AI/...' (the
historic GitHub slug, mixed-case); Gitea lowercases the org
component to 'molecule-ai/...'. Lowercase the org segment on
the fly with awk so we don't need to rewrite every manifest
entry.
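The transform is a one-liner of this shape (helper name illustrative; the script's actual awk program may differ):

```shell
# Lowercase only the org segment of an org/repo slug:
#   Molecule-AI/foo -> molecule-ai/foo (repo segment untouched)
lower_org() {
  echo "$1" | awk -F/ 'BEGIN { OFS = "/" } { $1 = tolower($1); print }'
}
```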
Local verify: bash -n passes, lowercase transform produces correct
Gitea paths, anonymous git clone of one of the manifest plugins
over HTTPS to git.moleculesai.app succeeds.
Class G in the prod-ship CI sweep — same shape as the github.com
refs Harness Replays hits; this is the second instance found.
Two coupled cleanups for the post-2026-05-06 stack:
============================================
The plugin injected GITHUB_TOKEN/GH_TOKEN via the App's
installation-access flow (~hourly rotation). Per-agent Gitea
identities replaced this approach after the 2026-05-06 suspension —
workspaces now provision with a per-persona Gitea PAT from .env
instead of an App-rotated token. The plugin code itself lived on
github.com/Molecule-AI/molecule-ai-plugin-github-app-auth which is
also unreachable post-suspension; checking it out at CI build time
was already failing.
Removed:
- workspace-server/cmd/server/main.go: githubappauth import + the
`if os.Getenv("GITHUB_APP_ID") != ""` block that called
BuildRegistry. gh-identity remains as the active mutator.
- workspace-server/Dockerfile + Dockerfile.tenant: COPY of the
sibling repo + injection of the `replace
github.com/Molecule-AI/molecule-ai-plugin-github-app-auth =>
/plugin` directive.
- workspace-server/go.mod + go.sum: github-app-auth dep entry
(cleaned up by `go mod tidy`).
- 3 workflows: actions/checkout steps for the sibling plugin repo:
- .github/workflows/codeql.yml (Go matrix path)
- .github/workflows/harness-replays.yml
- .github/workflows/publish-workspace-server-image.yml
Verified `go build ./cmd/server` + `go vet ./...` pass post-removal.
=======================================================
Same workflow used to push to ghcr.io/molecule-ai/platform +
platform-tenant. ghcr.io/molecule-ai is gone post-suspension. The
operator's ECR org (153263036946.dkr.ecr.us-east-2.amazonaws.com/
molecule-ai/) already hosts platform-tenant + workspace-template-*
+ runner-base images and is the post-suspension SSOT for container
images. This PR aligns publish-workspace-server-image with that
stack.
- env.IMAGE_NAME + env.TENANT_IMAGE_NAME repointed to ECR URL.
- docker/login-action swapped for the
aws-actions/configure-aws-credentials@v4 +
aws-actions/amazon-ecr-login@v2 chain (the standard ECR auth
pattern; uses AWS_ACCESS_KEY_ID/SECRET secrets bound to the
molecule-cp IAM user).
The :staging-<sha> + :staging-latest tag policy is unchanged —
staging-CP's TENANT_IMAGE pin still points at :staging-latest, just
with the new registry prefix.
Refs molecule-core#157, #161; parallel to org-wide CI-green sweep.
Gitea is case-sensitive on owner slugs; canonical is lowercase
`molecule-ai/...`. Mixed-case `Molecule-AI/...` refs fail-at-0s
when the runner tries to resolve the cross-repo workflow / checkout.
Same fix as molecule-controlplane#12. Mechanical case-correction;
no behavior change beyond making CI resolve again.
Refs: internal#46
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two coupled fixes for molecule-core#10 (plugin install 503 vs
status=online split-state):
1. SSOT for "is this workspace's container running" — `findRunningContainer`
in plugins.go used to carry its own copy of `cli.ContainerInspect`, which
collapsed transient daemon errors into the same `""` return as a
genuinely-stopped container. Healthsweep's `Provisioner.IsRunning`
handled the same input correctly (defensive). Promote the inspect logic
to `provisioner.RunningContainerName`, route both consumers through it.
Transient errors get a distinct log line on the plugins side so triage
doesn't confuse a flaky daemon with a stopped container.
2. Runtime-aware Install/Uninstall — `runtime='external'` workspaces have
no local container; push-install via docker exec is meaningless. They
pull plugins via the download endpoint instead (Phase 30.3). Without a
guard they fell through to `findRunningContainer` and 503'd with a
misleading "container not running." Add an early 422 with a hint
pointing at the download endpoint.
The two fixes are independent: (1) preserves correctness when the SSOT
helper is later modified; (2) eliminates the persistent split-state on
the 5 external persona-agent workspaces in this DB (and on tenant
deployments hitting the same shape).
* `internal/provisioner/provisioner.go` — new `RunningContainerName(ctx,
cli, id) (string, error)` with three documented outcomes (running /
stopped / transient). `Provisioner.IsRunning` now wraps it; behavior
preserved.
* `internal/handlers/plugins.go` — `findRunningContainer` shimmed onto
`RunningContainerName`; new `isExternalRuntime(id)` predicate.
* `internal/handlers/plugins_install.go` — Install + Uninstall reject
external runtimes with 422 + hint, before the source-fetch step.
* `internal/handlers/plugins_install_external_test.go` — 5 cases:
external→422, uninstall-external→422, container-backed-falls-through,
no-runtime-lookup-fails-open, lookup-error-fails-open.
* `internal/handlers/plugins_findrunning_ssot_test.go` — two AST gates
pin the SSOT routing so future PRs can't silently re-introduce the
parallel impl. Mutation-tested: reverting either consumer to a direct
`ContainerInspect` makes the gate fail.
Refs: molecule-core#10
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>