The GitHub org Molecule-AI was suspended on 2026-05-06; canonical SCM
is now Gitea at https://git.moleculesai.app/molecule-ai/. Stale
github.com/Molecule-AI/... URLs return 404 and break tooling that
clones / pip-installs / curls them.
This bundles all non-Go-module URL fixes for this repo into a single PR.
Go module path references (in *.go, go.mod, go.sum) are out of scope
here -- tracked separately under Task #140.
Token-auth clone URLs also flip ${GITHUB_TOKEN} -> ${GITEA_TOKEN} since
the GitHub token does not auth against Gitea.
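
A minimal Python sketch of the rewrite described above, assuming a plain regex substitution over file contents is sufficient. The helper name `rewrite_scm_url` and the pattern are illustrative and are not the tooling used for this PR; a caller would skip *.go, go.mod and go.sum itself.

```python
import re

# Matches https://github.com/Molecule-AI/ in any letter case, with an
# optional "<user>:<token>@" userinfo part in front of the host.
_GITHUB_ORG = re.compile(
    r"https://(?P<userinfo>[^/@\s]+@)?github\.com/molecule-ai/",
    re.IGNORECASE,
)


def rewrite_scm_url(text: str) -> str:
    """Point Molecule-AI GitHub URLs at the lowercase Gitea org."""

    def _swap(match: re.Match) -> str:
        userinfo = match.group("userinfo") or ""
        # The GitHub token does not authenticate against Gitea.
        userinfo = userinfo.replace("${GITHUB_TOKEN}", "${GITEA_TOKEN}")
        return f"https://{userinfo}git.moleculesai.app/molecule-ai/"

    return _GITHUB_ORG.sub(_swap, text)
```

URLs for other GitHub owners are left untouched, and a prefix such as `git+` in front of `https://` survives because only the matched span is replaced.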
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 3-line wrapper at .github/workflows/secret-scan.yml referenced
`uses: molecule-ai/molecule-core/.github/workflows/secret-scan.yml@staging`.
molecule-core is private; act_runner clones cross-repo reusable
workflows anonymously, so resolution fails at 0s with no logs.
Same root cause + same fix that molecule-controlplane already shipped
(see its secret-scan.yml comment block lines 10-22). Inlining keeps
the gate functional until Gitea is upgraded or the canonical scanner
moves to a public repo. When either lands, this file reverts to the
3-line wrapper.
Refs: internal#46 Phase 3 Class 2.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Gitea is case-sensitive on owner slugs; canonical is lowercase
`molecule-ai/...`. Mixed-case `Molecule-AI/...` refs fail at 0s
when the runner tries to resolve the cross-repo workflow / checkout.
Same fix as molecule-controlplane#12. Mechanical case-correction;
no behavior change beyond making CI resolve again.
Refs: internal#46
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The bundled pre-commit hook is the runtime-side mirror of
molecule-core's canonical .github/workflows/secret-scan.yml SECRET_PATTERNS
array. They drifted: canonical added the MiniMax sk-cp- pattern
(F1088 vector — caught only after the fact) but this side wasn't
updated. Result: a workspace developer's local pre-commit would let
through a sk-cp- token that the org-wide CI scan would then refuse —
useless friction.
This brings the pattern lists on the two sides back into byte-for-byte
alignment. The drift is exactly the maintenance gap that task #139's
upcoming molecule-core CI lint is designed to surface automatically;
this PR clears the gap so the lint passes from day 1.
Refs: task #139.
Calls the canonical workflow shipped in
Molecule-AI/molecule-monorepo#2109. Defense against the #2090-class
leak: a hosted-agent commit slipping a credential-shaped string into
a PR — caught at the PR layer, before merge.
Higher stakes here than most repos: this package publishes to PyPI,
so a leaked credential on a release tag would propagate to every
downstream tenant on next pip install.
Pattern set lives in molecule-monorepo so we don't maintain a
parallel copy here. Pairs with the runtime-side pre-commit hook
(scripts/pre-commit-checks.sh) which catches local commits before
they reach a PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This repo is now a publish artifact of Molecule-AI/molecule-core/workspace/.
Runtime code edits go to the monorepo; the publish-runtime workflow
regenerates this mirror + uploads to PyPI on every runtime-v* tag.
Changes:
- Delete .github/workflows/publish.yml. PyPI publishing now happens only
from the monorepo's publish-runtime workflow. Without removing this,
two different code shapes could reach PyPI depending on which workflow
fired (the drift this lockdown is preventing).
- Delete .github/workflows/auto-promote-staging.yml. The staging→main
fast-forward dance has no purpose on a mirror repo — the mirror is
rebuilt wholesale on each release.
- Replace .github/workflows/ci.yml with a 'mirror-guard' job that fails
on any pull_request event with a clear redirect message. Push events
are still allowed (so existing in-flight branches don't all turn red
while the migration finishes); that allowance becomes a follow-up
removal once the auto-sync from monorepo is wired up.
- Rewrite README.md with a prominent ⚠ banner pointing at the monorepo.
- Add CONTRIBUTING.md with the explicit redirect table.
What this does NOT do:
- Wire up the auto-sync from monorepo → this repo. The
publish-runtime workflow currently uploads to PyPI but doesn't push
the rewritten tree back here. As a follow-up, extend that workflow
with a step that commits the build dir to this repo's main. Until
then this repo's contents will go stale relative to PyPI — but
that's fine because no one should be reading code from here anyway.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Adds a secret-scan gate alongside the existing internal-paths block in
the runtime's bundled pre-commit hook. Runs on every commit in every
repo (not scoped to Molecule-AI public repos like the internal-paths
block) — refuses any staged addition matching a high-value credential
shape and prints a recovery message that does NOT echo the secret value.
Pattern set covers GitHub family (ghp_, ghs_, gho_, ghu_, ghr_,
github_pat_), Anthropic / OpenAI / Slack / AWS — same shape as the
tenant-proxy CI scanner; keep aligned when either side adds a pattern.
Single hook file dispatches both checks (renamed
pre-commit-block-internal-paths.sh → pre-commit-checks.sh) so each
agent commit pays one git-config + one hook-install surface, not two.
Both checks share the existing fast-paths (skip if GIT_AUTHOR_NAME
unset; skip during rebase / cherry-pick / merge / revert).
End-to-end test exercises a real bash subprocess against a real temp
git repo with real staged content. Three cases:
- ghs_-prefixed token in package.json (the actual #2090 vector) → refuse
- clean README → pass through
- sk-ant- key in a non-Molecule-AI repo → refuse (secret scan is universal,
internal-paths block is not)
Skipped when bash is not on PATH so Windows test environments without
WSL stay green.
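
The bundled hook itself is bash; the following Python sketch only illustrates the scan logic described above. The regexes are assumptions (lengths and character classes are guessed, and the real SECRET_PATTERNS list is longer), and the input is assumed to be the text of `git diff --cached -U0`.

```python
import re

# Illustrative shapes only; not the hook's real pattern list.
SECRET_PATTERNS = {
    "github-token": re.compile(r"\bgh[pousr]_[A-Za-z0-9]{20,}"),
    "github-fine-grained-pat": re.compile(r"\bgithub_pat_[A-Za-z0-9_]{20,}"),
    "anthropic-key": re.compile(r"\bsk-ant-[A-Za-z0-9_-]{20,}"),
    "aws-access-key-id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}


def scan_added_lines(diff_text):
    """Return (pattern_name, diff_line_number) for each matching added line."""
    findings = []
    for lineno, line in enumerate(diff_text.splitlines(), start=1):
        # Only staged additions count; "+++" lines are file headers.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((name, lineno))
    return findings


def format_refusal(findings):
    """Build the recovery message without echoing any matched value."""
    lines = ["Commit refused: staged additions look like credentials."]
    for name, lineno in findings:
        lines.append(f"  {name} at diff line {lineno}")
    lines.append("Unstage the file, remove the value, and rotate the credential.")
    return "\n".join(lines)
```

Reporting only the pattern name and line number keeps the secret out of terminal scrollback and any log that captures hook output.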
Bumps version 0.1.15 → 0.1.16.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #52 fixed the empty '[A2A_ERROR] ' suffix but didn't bump the
version — the fix landed on main without a corresponding PyPI
release, so workspace-template rebuilds keep pulling 0.1.14 and the
fix never reaches running agents.
Bump to 0.1.15 to trigger the publish-on-tag workflow (maintainer
pushes v0.1.15 tag after staging→main promotion).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI does not install pytest-asyncio — follow test_shared_runtime.py's
_run(coro) helper pattern. Tests still cover the same two paths (bare
exception class-name fallback + message passthrough) but no longer
require the async pytest plugin.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When an exception's str() is empty (bare TimeoutError(), BrokenPipeError(),
some httpx transport errors) `f"{_A2A_ERROR_PREFIX}{e}"` produced
`"[A2A_ERROR] "` with a trailing space and zero diagnostic context,
masking the real cause of peer-delegation failures in activity_logs.
Observed on main monorepo: 22+ occurrences in 75 min across 7 leads
during the MiniMax M2.7 trial rate-limit episode — zero breadcrumbs
to route the debug from.
Fix:
- Exception branch: fall back to `type(e).__name__` when str(e) is empty
- Error branch: include JSON-RPC `error.code` alongside message when present
Tests: test_a2a_error_observability.py covers both the bare-exception
path (must surface class name) and the message-passthrough path (must
preserve existing useful messages).
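
A minimal sketch of the two branches, assuming standalone helper functions. The names `format_exception` and `format_rpc_error` and the exact `(code N)` layout are hypothetical; only `_A2A_ERROR_PREFIX` and the `type(e).__name__` fallback come from this change.

```python
_A2A_ERROR_PREFIX = "[A2A_ERROR] "


def format_exception(e: BaseException) -> str:
    # Bare exceptions such as TimeoutError() stringify to "", so fall
    # back to the class name to keep some diagnostic context.
    detail = str(e) or type(e).__name__
    return f"{_A2A_ERROR_PREFIX}{detail}"


def format_rpc_error(error: dict) -> str:
    # Include the JSON-RPC error code alongside the message when present.
    message = error.get("message") or "unknown error"
    code = error.get("code")
    if code is not None:
        return f"{_A2A_ERROR_PREFIX}{message} (code {code})"
    return f"{_A2A_ERROR_PREFIX}{message}"
```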
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trace from molecule-core cycle 107 (2026-04-24): 15 staging PRs stuck
DIRTY (real merge conflicts) with 0 merges in 1+ hours. Authors couldn't
rebase to fix the conflicts because the pre-commit hook (shipped in
0.1.11) refuses ANY commit that includes forbidden paths in the diff —
including rebase replays of historical commits that pre-date the gate.
Specifically, agents trying to `git rebase staging` on a PR like
"docs(marketing): Phase 30 social copy" fail at the first commit replay
because that commit added marketing/* files. Working around this would
require an interactive rebase, manual file deletion, and a commit amend.
Agents don't do that, so the PR stays DIRTY indefinitely.
Detection: check .git for rebase-merge/, rebase-apply/, CHERRY_PICK_HEAD,
MERGE_HEAD, or REVERT_HEAD. These state markers exist only during the
corresponding git operation. Skip the hook silently when present.
The hook still blocks fresh `git commit` (the failure mode it was
designed for). It just doesn't try to police what was already in git
history.
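
The hook is a bash script; this Python sketch only illustrates the detection rule. It assumes `git_dir` is the already-resolved git directory (for example the output of `git rev-parse --git-dir`), and the function name is hypothetical.

```python
from pathlib import Path

# State markers that exist only while the corresponding operation runs.
_IN_PROGRESS_MARKERS = (
    "rebase-merge",
    "rebase-apply",
    "CHERRY_PICK_HEAD",
    "MERGE_HEAD",
    "REVERT_HEAD",
)


def git_operation_in_progress(git_dir: Path) -> bool:
    """True during a rebase, cherry-pick, merge, or revert."""
    return any((git_dir / marker).exists() for marker in _IN_PROGRESS_MARKERS)
```

A hook built on this would exit 0 immediately when the function returns True and run the path checks otherwise.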
Bumped to 0.1.14.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sister fix to 0.1.12 (root mounting). After fixing the route mount,
every inbound A2A still returned `-32601 Method not found` because the
1.x dispatcher's method table doesn't recognize v0.3-shaped names
(`message/send`, `tasks/get`) that the platform's ProxyA2A still sends.
Reproduces in the SDK on a minimal handler:
    create_jsonrpc_routes(h, "/") → "Method not found"
    create_jsonrpc_routes(h, "/", enable_v0_3_compat=True) → dispatches OK
Bumped to 0.1.13. Both 0.1.12 and 0.1.13 are needed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Baseline restart 2026-04-24: every workspace came up healthy (uvicorn
listening, agent-card serving) but produced zero delegations for two
maintenance cycles. Tracing revealed that the platform's ProxyA2A POSTs to
`http://ws-<id>:8000/` (no path suffix, see
workspace-server/internal/provisioner.InternalURL) while the runtime's
JSON-RPC routes were mounted at `/api/v1/jsonrpc/` under the a2a-sdk
1.x API migration.
The failure was silent. The SDK's jsonrpc route factory doesn't respond
at root when mounted at a subpath, so every inbound A2A returned 404 Not
Found; the platform logged "Not Found" at INFO level and no error
bubbled up. Agents stayed warm, crons fired, but no work flowed.
Fix: `create_jsonrpc_routes(handler, "/")` — matches platform
expectation and the agent-card self-advertisement (which also shows
root as the JSON-RPC URL). Agent-card route keeps its hard-coded
`/.well-known/agent-card.json` path so there's no collision.
Bumped to 0.1.12.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Anti-leak proposal item A. Companion to D (decision tree in role
prompts, separate PR on org-templates).
Why a local pre-commit hook
===========================
Agents try to `git add /research/foo.md` despite SHARED_RULES, the
.gitignore patterns, and the CI gate. Each leak attempt costs ~5 cycles
(PR opens, CI fails, agent retries with workaround) and pollutes git
history with reverts.
A pre-commit hook converts the failure from "PR opens then fails" →
"commit refused immediately, with the recovery command printed in the
same error message the agent reads." Agents act on what's in the
current response context — putting the redirect command literally in
the failure output is the highest-density feedback we can provide.
What changes
============
- molecule_runtime/scripts/pre-commit-block-internal-paths.sh:
  bash hook. Checks `git remote get-url origin`, only enforces in
  Molecule-AI/molecule-monorepo + molecule-core. In every other repo
  (internal, plugins, templates, third-party) it's a no-op.
  When forbidden paths are staged, refuses the commit with the redirect
  recipe + the alternative public-facing paths + the workflow-edit path
  for legitimate exceptions.
- molecule_runtime/precommit_hook.py, install_pre_commit_hook():
  1. Extracts bundled hook to ~/.molecule-runtime/git-hooks/pre-commit
  2. chmod +x
  3. Sets core.hooksPath globally, UNLESS already set by an operator
     (then logs a warning + skips, doesn't clobber)
- molecule_runtime/main.py: calls install_pre_commit_hook() at
  step 0.2, right after install_credential_helper()
- pyproject.toml bumped to 0.1.11
Both A and D together close the loop: D ensures the agent knows the
right path before writing; A enforces it at the local git boundary if
the agent forgets. CI gate remains the third backstop for anything
that gets pushed.
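
A minimal sketch of the three install steps, assuming the hook text is passed in as a string and that git config access is injectable for testing. The name `install_hook` and its parameters are hypothetical and do not reproduce the signature of install_pre_commit_hook().

```python
import logging
import stat
import subprocess
from pathlib import Path

logger = logging.getLogger(__name__)


def _git_config_get(key):
    result = subprocess.run(
        ["git", "config", "--global", "--get", key],
        capture_output=True, text=True, check=False,
    )
    return result.stdout.strip() or None


def _git_config_set(key, value):
    subprocess.run(["git", "config", "--global", key, value], check=True)


def install_hook(hook_source, hooks_dir: Path,
                 get_config=_git_config_get, set_config=_git_config_set) -> bool:
    # 1. Extract the bundled hook.
    hooks_dir.mkdir(parents=True, exist_ok=True)
    hook_path = hooks_dir / "pre-commit"
    hook_path.write_text(hook_source)
    # 2. chmod +x
    mode = hook_path.stat().st_mode
    hook_path.chmod(mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    # 3. Set core.hooksPath globally unless an operator already set it.
    existing = get_config("core.hooksPath")
    if existing and existing != str(hooks_dir):
        logger.warning("core.hooksPath already set to %s; not overriding", existing)
        return False
    set_config("core.hooksPath", str(hooks_dir))
    return True
```

Returning False instead of raising keeps an operator's own hooks directory intact and lets runtime startup continue.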
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lifts the per-template wiring (Dockerfile COPY + entrypoint.sh git config
+ nohup daemon launch) into the Python runtime. Templates that depend
on molecule-ai-workspace-runtime get the behavior automatically — they
no longer need to maintain their own copy of the helper scripts or
remember to write the right git config in their entrypoint.
Background:
- GitHub App installation tokens (ghs_…) expire ~60min after issue
- claude-code-default template shipped without wiring → 39 workspaces
lost their tokens, three PMs' A2A queues filled with retry-status
messages, manual fleet restart required (cycle 62-66 incident)
This commit:
- Adds molecule_runtime/scripts/{molecule-git-token-helper.sh,
  molecule-gh-token-refresh.sh} as package data (copies from canonical
  workspace/scripts/ in molecule-monorepo)
- Adds molecule_runtime/credential_helper.py with
  install_credential_helper() that:
  1. Extracts bundled scripts to ~/.molecule-runtime/scripts/
  2. Configures git credential.helper for github.com
  3. Creates ~/.molecule-token-cache/ mode 0700
  4. Spawns refresh daemon under respawn loop (PID file dedup)
  5. Runs initial gh auth login --with-token
- Hooks call site early in main.py (step 0.1, before config load)
- Fails-soft: each step independently fault-tolerant; missing git/gh
  binary doesn't block runtime startup
Bumped to 0.1.10. Templates can drop their entrypoint.sh credential
helper setup once they update the runtime pin (separate PRs per template).
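
The fail-soft behaviour can be sketched as a loop that isolates each step. This is an illustration only: `run_steps_fail_soft` is a hypothetical name, and the real install_credential_helper() may structure its steps differently.

```python
import logging

logger = logging.getLogger(__name__)


def run_steps_fail_soft(steps):
    """Run each (name, callable) step; log failures and keep going.

    Returns the names of the steps that failed so the caller can report
    them without blocking runtime startup.
    """
    failed = []
    for name, step in steps:
        try:
            step()
        except Exception:
            # e.g. FileNotFoundError when the git or gh binary is missing
            logger.exception("credential helper step %r failed; continuing", name)
            failed.append(name)
    return failed
```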
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Lower _PROCESS_ERROR_STDERR_MAX_CHARS to 1024 (was 4096) so A2A
responses stay bounded — the full context is already in workspace logs
via logger.error/exception.
- Add stderr= kwarg to sanitize_agent_error() so callers can surface
subprocess stderr verbatim in A2A responses.
- In _execute_locked() non-retryable error path, extract the first 1 KB
of exc.stderr and pass it to sanitize_agent_error() so the A2A
response carries actionable context (rate limit message, auth error,
etc.) instead of just a class name.
- Add test_executor_helpers.py unit tests for the new stderr= kwarg.
CI doesn't have pytest-asyncio installed, and the async wrapping was
incidental — the production retry pattern (refresh-on-401) is identical
in sync and async forms. Switching to httpx.Client + MockTransport keeps
the same coverage without the async dep.
6/6 still pass locally.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Problem
Auto-restart rotates the workspace's auth token in two non-atomic steps:
1. Platform issues new token via wsauth.IssueToken
2. Provisioner writes the new token to /configs/.auth_token AFTER
ContainerStart returns
Between steps 1 and 2, the new container has booted and the runtime has
already loaded the OLD cached value of .auth_token (or no value if the
file was empty during boot). The runtime's first /registry/heartbeat
call sends the stale token, gets 401, but the loop never re-reads the
on-disk token — so subsequent heartbeats also send the stale value.
Each 401 means the platform never sees the workspace as alive →
status stays 'provisioning' → scheduler won't dispatch → workspace
looks dead from every angle even though the container is actually
running.
The existing code comment in workspace_provision.go acknowledges this:
"the workspace will get 401 on its first heartbeat and can recover on
the next restart." That recovery only worked because workspaces used
to crash for unrelated reasons and get restarted. After PR #1861
(provisioner empty-volume auto-recover) removed those crashes,
workspaces get stuck in the 401 loop with no exit.
## Fix
Two-part runtime-side fix in molecule-ai-workspace-runtime:
1. **platform_auth.refresh_from_disk()** — new helper that clears the
in-memory cache and re-reads /configs/.auth_token. Returns the
fresh value (or None if missing). Updates the cache as a side effect.
2. **HeartbeatLoop._loop()** — on 401 from /registry/heartbeat, calls
refresh_from_disk() and retries the request ONCE with the new token.
Same pattern in _check_delegations(). Bounded retry budget — if the
on-disk token is also stale (bug elsewhere), no infinite loop.
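
A minimal sketch of the refresh-and-retry-once pattern, assuming a small cache class and a `send` callable that returns an HTTP status code. `TokenCache` and `post_with_refresh` are hypothetical names; only `refresh_from_disk` mirrors the helper added here, and its real implementation may differ.

```python
from pathlib import Path
from typing import Callable, Optional


class TokenCache:
    def __init__(self, token_path: Path):
        self._path = token_path
        self._cached: Optional[str] = None

    def get(self) -> Optional[str]:
        if self._cached is None:
            self.refresh_from_disk()
        return self._cached

    def refresh_from_disk(self) -> Optional[str]:
        """Clear the cache, re-read the file, return the fresh value or None."""
        self._cached = None
        try:
            value = self._path.read_text().strip()
        except OSError:
            return None
        self._cached = value or None
        return self._cached


def post_with_refresh(send: Callable[[Optional[str]], int], cache: TokenCache) -> int:
    status = send(cache.get())
    if status == 401:
        fresh = cache.refresh_from_disk()
        if fresh is not None:
            # Retry exactly once; a second 401 is returned to the caller.
            status = send(fresh)
    return status
```

Because the retry is not a loop, a stale on-disk token costs one extra request per call and nothing more.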
## Tests
6/6 new tests in tests/test_token_refresh_1877.py:
- refresh_picks_up_rotated_token — happy path
- refresh_returns_none_when_file_missing — defensive
- refresh_clears_stale_cache_when_file_disappears
- refresh_is_idempotent
- 401_retry_pattern_uses_refreshed_token — the production fix path
- 401_retry_no_loop_when_disk_token_also_stale — bounded retry budget
All pass locally on Python 3.13 + pytest 9.
## Why this fix and not the alternatives
- **Alternative B (platform writes token before ContainerStart):**
Right architecturally but invasive — needs provisioner refactor to
prep volumes before docker run.
- **Alternative C (skip rotation on auto-restart):** Breaks the
multi-instance-safety invariant the existing code calls out
(revoke prevents stale tokens from sister deployments).
- **This fix (A):** 3-line core change + helper. Self-healing for any
timing edge case, not just the post-restart one. Costs nothing in
the happy path (only triggers on 401).
## Version
Bumped to 0.1.9. Once published to PyPI + workspace template image
rebuilt, deployed workspaces auto-recover from token-rotation races
without operator intervention.
Closes #1877.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-review findings on #38:
1. **Token substring leak**: the "unknown prefix" warning included the
first 12 chars of the token in the log message. Logs get shipped to
Langfuse / CloudWatch / slack-firehose — 12 bytes of a secret in a
log is still 12 bytes too many. Warning no longer references the
token value at all.
2. **Base-URL substring match was too loose**: `"anthropic.com" not in
base` would accept `https://proxy.anthropic.com.evil.example/` as
"looks like Anthropic, keep the URL." Replaced with an allowlist of
exact hostnames parsed via urllib.parse.urlparse.
3. **Whitespace in pasted tokens**: operators frequently paste tokens
from terminals with a trailing newline. The token would flow through
startswith() detection but then fail downstream auth with a
confusing "malformed token" error. Strip and persist the cleaned
value.
4. **Malformed base URL crash guard**: if someone sets ANTHROPIC_BASE_URL
to something urlparse can't handle, don't crash — fall through to
clearing it, which is the safe choice in OAuth mode.
Added 5 new tests covering each of the above. 16/16 tests pass.
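
A minimal sketch of findings 2 and 4, assuming an allowlist with a single host. The set contents and the function name are assumptions; the real allowlist may contain other hostnames.

```python
from urllib.parse import urlparse

# Assumption: the actual allowlist may differ.
_ANTHROPIC_HOSTS = frozenset({"api.anthropic.com"})


def is_anthropic_base_url(base_url: str) -> bool:
    """Exact hostname match instead of a substring test."""
    try:
        host = urlparse(base_url.strip()).hostname
    except ValueError:
        # Malformed URL (e.g. unbalanced IPv6 brackets): treat as not
        # Anthropic so the caller clears it rather than crashing.
        return False
    return host is not None and host in _ANTHROPIC_HOSTS
```

`hostname` is already lowercased by urllib, so `https://API.Anthropic.com/` still matches while `https://proxy.anthropic.com.evil.example/` does not.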
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Platform stores per-workspace LLM credentials under a single key
(ANTHROPIC_AUTH_TOKEN in workspace_secrets). But downstream tools
expect different env var names depending on the token type:
  sk-ant-oat01-* → CLAUDE_CODE_OAUTH_TOKEN (Claude Code OAuth session)
  sk-ant-api03-* → ANTHROPIC_API_KEY       (direct Anthropic API)
  sk-cp-*        → ANTHROPIC_AUTH_TOKEN    (proxy: MiniMax, gateways)
Without normalisation, an OAuth token under ANTHROPIC_AUTH_TOKEN gets
sent as a bearer to api.anthropic.com, which responds:
  401 authentication_error: OAuth authentication is currently not
  supported.
This was a platform-wide footgun: anyone rotating LLM keys had to
know the exact env var for each token type, AND make sure stale
overrides were cleared, AND set ANTHROPIC_BASE_URL correctly for
proxies (or NOT set for native Claude). Nothing downstream could
help — the SDK just saw the wrong var.
Fix:
- New molecule_runtime/llm_auth.py — normalise_llm_env() mutates
os.environ (or any dict) to the correct shape based on token
prefix. Returns a NormalisationResult for logging.
- main.py calls it as step 0, before any adapter/executor import.
Every adapter (claude-code, langgraph, crewai, autogen, hermes,
…) benefits automatically — no per-adapter branching needed.
- 11 unit tests covering all prefix paths, edge cases, and the
"operator deliberately set CLAUDE_CODE_OAUTH_TOKEN" precedence
rule.
Operationally: this means operators can keep using one
ANTHROPIC_AUTH_TOKEN slot in platform settings and just paste
whatever token the agent needs. No env-var-name awareness required.
Tested locally: 11/11 new tests pass. 83 other tests unchanged
(pre-existing failures on staging are all unrelated:
test_workspace_id_validation, test_a2a_mcp_server RBAC, the
test_imports.main module-walker — same signature as on staging
HEAD before this PR).
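
A minimal sketch of the prefix dispatch, assuming the token arrives under ANTHROPIC_AUTH_TOKEN and that an operator-set target variable wins over the normalised value. The function name is hypothetical, it returns a plain string rather than a NormalisationResult, and ANTHROPIC_BASE_URL handling is deliberately left out.

```python
from typing import MutableMapping, Optional

_PREFIX_TO_VAR = (
    ("sk-ant-oat01-", "CLAUDE_CODE_OAUTH_TOKEN"),
    ("sk-ant-api03-", "ANTHROPIC_API_KEY"),
    ("sk-cp-", "ANTHROPIC_AUTH_TOKEN"),
)


def normalise_llm_env_sketch(env: MutableMapping[str, str]) -> Optional[str]:
    """Move the token to the variable its prefix calls for.

    Returns the target variable name, or None when nothing was changed.
    """
    raw = env.get("ANTHROPIC_AUTH_TOKEN")
    if not raw:
        return None
    token = raw.strip()  # pasted tokens often carry a trailing newline
    target = next(
        (var for prefix, var in _PREFIX_TO_VAR if token.startswith(prefix)),
        None,
    )
    if target is None:
        return None  # unknown prefix: leave env untouched
    if target == "ANTHROPIC_AUTH_TOKEN":
        env["ANTHROPIC_AUTH_TOKEN"] = token
        return target
    del env["ANTHROPIC_AUTH_TOKEN"]
    if not env.get(target):  # assumed precedence: keep an operator-set value
        env[target] = token
    return target
```

Passing a plain dict instead of `os.environ` keeps the function easy to unit test.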
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fixes #1372 (phantom busy): canvas showed workspace as active for up
to 30s after task completion because set_current_task("") returned
early without posting the updated heartbeat.
Before: clearing only updated the heartbeat object; the next 30s
scheduled heartbeat cycle propagated the clear. Quick tasks would leave
a phantom-busy indicator.
After: both SET and CLEAR push immediately to /registry/heartbeat.
active_tasks=0 on clear, active_tasks=1 on set. Heartbeat object
update and HTTP post are now unconditional.
Tests: 5 new cases covering SET/CLEAR HTTP body, error resilience,
None heartbeat, and missing env vars.
Co-authored-by: Molecule AI Infra-Runtime-BE <infra-runtime-be@agents.moleculesai.app>
Both set_current_task() implementations (shared_runtime.py + executor_helpers.py):
- Increment active_tasks on task start, decrement on completion (was binary 0/1)
- Push heartbeat immediately on BOTH increment AND decrement
- Only clear current_task when active_tasks reaches 0 (preserves description
for still-running tasks)
Fixes phantom-busy: the old code returned early on clear, leaving
active_tasks=1 in the platform DB until the next 30s heartbeat cycle.
If a new cron fired before the heartbeat, the workspace appeared
permanently busy — required manual DB reset every 30 min.
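
A minimal sketch of the counter semantics, assuming an empty description means "task finished" and that the heartbeat push is an injected callable. The class name `TaskTracker` and the `push_heartbeat` parameter are hypothetical.

```python
class TaskTracker:
    def __init__(self, push_heartbeat):
        self.active_tasks = 0
        self.current_task = ""
        self._push = push_heartbeat  # called as push(active_tasks, current_task)

    def set_current_task(self, description: str) -> None:
        if description:
            # Task start: increment and record the latest description.
            self.active_tasks += 1
            self.current_task = description
        else:
            # Task completion: decrement, never below zero.
            self.active_tasks = max(0, self.active_tasks - 1)
            if self.active_tasks == 0:
                self.current_task = ""
        # Push immediately on both increment and decrement.
        self._push(self.active_tasks, self.current_task)
```

With two overlapping tasks, the first completion pushes `active_tasks=1` and keeps the description; only the second completion clears it.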
Bump: 0.1.2 → 0.1.3
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>