Commit Graph

94 Commits

Author SHA1 Message Date
3a9c76eeef Merge pull request 'fix(post-suspension): migrate github.com/Molecule-AI refs to git.moleculesai.app (Class G #168)' (#2) from fix/post-suspension-github-urls into main
All checks were successful
ci / mirror-guard (push) Successful in 10s
Secret scan / Scan diff for credential-shaped strings (push) Successful in 13s
2026-05-07 20:02:39 +00:00
91eac4b611 fix(post-suspension): migrate github.com/Molecule-AI refs to git.moleculesai.app (Class G #168)
Some checks failed
ci / mirror-guard (pull_request) Failing after 7s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 11s
The GitHub org Molecule-AI was suspended on 2026-05-06; canonical SCM
is now Gitea at https://git.moleculesai.app/molecule-ai/. Stale
github.com/Molecule-AI/... URLs return 404 and break tooling that
clones / pip-installs / curls them.

This bundles all non-Go-module URL fixes for this repo into a single PR.
Go module path references (in *.go, go.mod, go.sum) are out of scope
here -- tracked separately under Task #140.

Token-auth clone URLs also flip ${GITHUB_TOKEN} -> ${GITEA_TOKEN} since
the GitHub token does not auth against Gitea.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 13:02:35 -07:00
security-auditor
a96f696ffb fix(ci): inline secret-scan body, drop cross-repo uses: of private molecule-core
All checks were successful
ci / mirror-guard (push) Successful in 4s
Secret scan / Scan diff for credential-shaped strings (push) Successful in 6s
The 3-line wrapper at .github/workflows/secret-scan.yml referenced
`uses: molecule-ai/molecule-core/.github/workflows/secret-scan.yml@staging`.
molecule-core is private; act_runner clones cross-repo reusable
workflows anonymously, so resolution fails at 0s with no logs.

Same root cause + same fix that molecule-controlplane already shipped
(see its secret-scan.yml comment block lines 10-22). Inlining keeps
the gate functional until Gitea is upgraded or the canonical scanner
moves to a public repo. When either lands, this file reverts to the
3-line wrapper.

Refs: internal#46 Phase 3 Class 2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 02:29:03 -07:00
c266664f12 Merge pull request 'fix(ci): lowercase 'molecule-ai/' in cross-repo workflow refs' (#1) from fix/lowercase-org-slug into main
Some checks failed
Secret scan / secret-scan (push) Failing after 0s
ci / mirror-guard (push) Successful in 5s
2026-05-07 08:59:04 +00:00
security-auditor
d7ea277ce4 fix(ci): lowercase 'molecule-ai/' in cross-repo workflow refs
Some checks failed
Secret scan / secret-scan (pull_request) Failing after 0s
ci / mirror-guard (pull_request) Failing after 3s
Gitea is case-sensitive on owner slugs; the canonical form is lowercase
`molecule-ai/...`. Mixed-case `Molecule-AI/...` refs fail at 0s
when the runner tries to resolve the cross-repo workflow / checkout.

Same fix as molecule-controlplane#12. Mechanical case-correction;
no behavior change beyond making CI resolve again.

Refs: internal#46

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 01:00:01 -07:00
Hongming Wang
05486e193c
Merge pull request #56 from Molecule-AI/chore/lockdown-as-mirror
Some checks failed
Secret scan / secret-scan (push) Failing after 0s
ci / mirror-guard (push) Successful in 9s
chore: lock down as publish artifact; source-of-truth is monorepo
2026-04-29 01:58:40 -07:00
Hongming Wang
0fb1038724
Merge pull request #58 from Molecule-AI/chore/precommit-add-minimax-pattern
chore(precommit): add sk-cp- MiniMax pattern (F1088 retroactive fix); bump 0.1.16 → 0.1.17
2026-04-29 00:54:13 -07:00
Hongming Wang
d517deea72
Merge pull request #57 from Molecule-AI/chore/enroll-secret-scan
chore(ci): enroll in org-wide secret-scan reusable workflow (#2109 rollout)
2026-04-29 00:54:09 -07:00
Hongming Wang
b8903eac09
Merge pull request #59 from Molecule-AI/fix/secret-scan-add-sk-cp-pattern
fix(pre-commit): align SECRET_PATTERNS with molecule-core canonical (add sk-cp-)
2026-04-28 15:25:44 -07:00
Hongming Wang
2d514612a2 fix(pre-commit): align SECRET_PATTERNS with molecule-core canonical (add sk-cp-)
The bundled pre-commit hook is the runtime-side mirror of
molecule-core's canonical .github/workflows/secret-scan.yml SECRET_PATTERNS
array. They drifted: canonical added the MiniMax sk-cp- pattern
(F1088 vector — caught only after the fact) but this side wasn't
updated. Result: a workspace developer's local pre-commit would let
through a sk-cp- token that the org-wide CI scan would then refuse —
useless friction.

This brings the pattern lists on the two sides back into byte-for-byte
alignment. The drift is exactly the maintenance gap that task #139's
upcoming molecule-core CI lint is designed to surface automatically;
this PR clears the gap so the lint passes from day 1.

Refs: task #139.
2026-04-28 15:21:13 -07:00
rabbitblood
e927d3b281 chore(precommit): add sk-cp- MiniMax pattern (F1088 retroactive fix); bump 0.1.16 → 0.1.17
2026-04-26 21:43:24 -07:00
rabbitblood
d381f20779 fix(ci): use molecule-core@staging — repo was renamed from molecule-monorepo, workflow lives on staging
2026-04-26 15:44:29 -07:00
rabbitblood
0b11d669b5 chore(ci): enroll in org-wide secret-scan reusable workflow
Calls the canonical workflow shipped in
Molecule-AI/molecule-monorepo#2109. Defense against the #2090-class
leak: a hosted-agent commit slipping a credential-shaped string into
a PR — caught at the PR layer, before merge.

Higher stakes here than most repos: this package publishes to PyPI,
so a leaked credential on a release tag would propagate to every
downstream tenant on next pip install.

Pattern set lives in molecule-monorepo so we don't maintain a
parallel copy here. Pairs with the runtime-side pre-commit hook
(scripts/pre-commit-checks.sh) which catches local commits before
they reach a PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 15:14:17 -07:00
Hongming Wang
01b818d1c8
Merge pull request #55 from Molecule-AI/feat/precommit-secret-scan
feat(precommit): add secret scan to bundled pre-commit hook (defense-in-depth for #2090-style leaks)
2026-04-26 12:40:50 -07:00
Hongming Wang
96864263bb chore: lock down as publish artifact; source-of-truth is monorepo
This repo is now a publish artifact of Molecule-AI/molecule-core/workspace/.
Runtime code edits go to the monorepo; the publish-runtime workflow
regenerates this mirror + uploads to PyPI on every runtime-v* tag.

Changes:

- Delete .github/workflows/publish.yml. PyPI publishing now happens only
  from the monorepo's publish-runtime workflow. Without removing this,
  two different code shapes could reach PyPI depending on which workflow
  fired (the drift this lockdown is preventing).

- Delete .github/workflows/auto-promote-staging.yml. The staging→main
  fast-forward dance has no purpose on a mirror repo — the mirror is
  rebuilt wholesale on each release.

- Replace .github/workflows/ci.yml with a 'mirror-guard' job that fails
  on any pull_request event with a clear redirect message. Push events
  are still allowed (so existing in-flight branches don't all turn red
  while the migration finishes); that allowance becomes a follow-up
  removal once the auto-sync from monorepo is wired up.

- Rewrite README.md with a prominent ⚠ banner pointing at the monorepo.

- Add CONTRIBUTING.md with the explicit redirect table.

What this does NOT do:

- Wire up the auto-sync from monorepo → this repo. The
  publish-runtime workflow currently uploads to PyPI but doesn't push
  the rewritten tree back here. As a follow-up, extend that workflow
  with a step that commits the build dir to this repo's main. Until
  then this repo's contents will go stale relative to PyPI — but
  that's fine because no one should be reading code from here anyway.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-04-26 12:03:12 -07:00
rabbitblood
f1bede31a8 feat(precommit): add secret scan to bundled pre-commit hook (defense-in-depth for #2090-style leaks)
Adds a secret-scan gate alongside the existing internal-paths block in
the runtime's bundled pre-commit hook. Runs on every commit in every
repo (not scoped to Molecule-AI public repos like the internal-paths
block) — refuses any staged addition matching a high-value credential
shape and prints a recovery message that does NOT echo the secret value.

Pattern set covers GitHub family (ghp_, ghs_, gho_, ghu_, ghr_,
github_pat_), Anthropic / OpenAI / Slack / AWS — same shape as the
tenant-proxy CI scanner; keep aligned when either side adds a pattern.
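
A minimal Python sketch of the scan described above, for illustration only. The real hook is a bash script and its pattern list is not reproduced in this log, so the regexes below (including the minimum lengths) and the name `find_secret_lines` are assumptions. The point it shows is that matches are reported by line number without returning the matched text, so a recovery message cannot echo the secret.

```python
import re

# Illustrative patterns only (assumed shapes and lengths); the hook's real
# SECRET_PATTERNS list is not shown in this log.
SECRET_PATTERNS = [
    re.compile(r"gh[pousr]_[A-Za-z0-9]{20,}"),    # GitHub token family
    re.compile(r"github_pat_[A-Za-z0-9_]{20,}"),  # fine-grained GitHub PAT
    re.compile(r"sk-ant-[A-Za-z0-9_-]{20,}"),     # Anthropic-style key
]


def find_secret_lines(added_lines):
    """Return (line_number, pattern_index) pairs for lines matching a pattern.

    The matched text is deliberately not returned, so a caller cannot
    echo the secret back into terminal output or logs.
    """
    hits = []
    for number, line in enumerate(added_lines, start=1):
        for index, pattern in enumerate(SECRET_PATTERNS):
            if pattern.search(line):
                hits.append((number, index))
                break  # one hit per line is enough to refuse the commit
    return hits
```

A caller would refuse the commit whenever the returned list is non-empty and print only the line numbers.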

Single hook file dispatches both checks (renamed
pre-commit-block-internal-paths.sh → pre-commit-checks.sh) so each
agent commit pays one git-config + one hook-install surface, not two.
Both checks share the existing fast-paths (skip if GIT_AUTHOR_NAME
unset; skip during rebase / cherry-pick / merge / revert).

End-to-end test exercises a real bash subprocess against a real temp
git repo with real staged content. Three cases:
 - ghs_-prefixed token in package.json (the actual #2090 vector) → refuse
 - clean README → pass through
 - sk-ant- key in a non-Molecule-AI repo → refuse (secret scan is universal,
   internal-paths block is not)

Skipped when bash is not on PATH so Windows test environments without
WSL stay green.

Bumps version 0.1.15 → 0.1.16.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 11:57:39 -07:00
Hongming Wang
7fc3537b2f
Merge pull request #42 from Molecule-AI/fix/stderr-capture-a2a-response
fix(runtime): capture stderr in A2A error response (closes #66)
2026-04-24 13:25:15 -07:00
Hongming Wang
1759e221e9
Merge pull request #53 from Molecule-AI/chore/bump-0.1.15
Some checks failed
Publish to PyPI / build-and-publish (push) Failing after 41s
chore: bump to 0.1.15 — ship A2A_ERROR observability fix (#51)
2026-04-24 12:03:36 -07:00
rabbitblood
84f3faea8a chore: bump to 0.1.15 — ship A2A_ERROR observability fix (#51)
PR #52 fixed the empty '[A2A_ERROR] ' suffix but didn't bump the
version — the fix landed on main without a corresponding PyPI
release, so workspace-template rebuilds keep pulling 0.1.14 and the
fix never reaches running agents.

Bump to 0.1.15 to trigger the publish-on-tag workflow (maintainer
pushes v0.1.15 tag after staging→main promotion).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 11:54:13 -07:00
Hongming Wang
0d71ee8345
Merge pull request #52 from Molecule-AI/fix/a2a-error-observability-51
fix(a2a): include exception class + error code in [A2A_ERROR] (#51)
2026-04-24 11:35:42 -07:00
rabbitblood
4940abdc68 fix(tests): remove pytest-asyncio dependency from #51 regression tests
CI does not install pytest-asyncio — follow test_shared_runtime.py's
_run(coro) helper pattern. Tests still cover the same two paths (bare
exception class-name fallback + message passthrough) but no longer
require the async pytest plugin.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 11:34:30 -07:00
rabbitblood
6ead3b433e fix(a2a): include exception class + error code in [A2A_ERROR] (#51)
When an exception's str() is empty (bare TimeoutError(), BrokenPipeError(),
some httpx transport errors) `f"{_A2A_ERROR_PREFIX}{e}"` produced
`"[A2A_ERROR] "` with a trailing space and zero diagnostic context,
masking the real cause of peer-delegation failures in activity_logs.

Observed on main monorepo: 22+ occurrences in 75 min across 7 leads
during the MiniMax M2.7 trial rate-limit episode — zero breadcrumbs
to route the debug from.

Fix:
- Exception branch: fall back to `type(e).__name__` when str(e) is empty
- Error branch: include JSON-RPC `error.code` alongside message when present
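
The two branches above can be sketched as follows. This is a hedged illustration, not the runtime's code: `_A2A_ERROR_PREFIX` comes from the commit message, while the function names and the `(code N)` output format are assumptions.

```python
_A2A_ERROR_PREFIX = "[A2A_ERROR] "


def format_exception_error(exc):
    """Use the exception text, or the class name when str(exc) is empty."""
    detail = str(exc) or type(exc).__name__
    return f"{_A2A_ERROR_PREFIX}{detail}"


def format_rpc_error(error):
    """Format a JSON-RPC error object, including its code when present."""
    message = error.get("message", "")
    code = error.get("code")
    if code is not None:
        # The "(code N)" layout is an assumed format for illustration.
        return f"{_A2A_ERROR_PREFIX}{message} (code {code})"
    return f"{_A2A_ERROR_PREFIX}{message}"
```

With this shape, a bare `TimeoutError()` surfaces its class name instead of an empty suffix, and an existing message is passed through unchanged.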

Tests: test_a2a_error_observability.py covers both the bare-exception
path (must surface class name) and the message-passthrough path (must
preserve existing useful messages).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 11:22:57 -07:00
Hongming Wang
d75a161ee8 fix(ci): sync auto-promote workflow (ff-only, no-gates mode)
2026-04-24 08:35:15 -07:00
Hongming Wang
ae624a1f6a
Merge pull request #50 from Molecule-AI/chore/add-auto-promote-staging
chore(ci): add auto-promote-staging workflow
2026-04-24 08:18:43 -07:00
Hongming Wang
f58d12bee2 chore(ci): add auto-promote-staging workflow
2026-04-24 07:43:56 -07:00
Hongming Wang
a80294766c
Merge pull request #49 from Molecule-AI/fix/precommit-skip-rebase
Some checks failed
Publish to PyPI / build-and-publish (push) Failing after 40s
fix(precommit): skip during rebase/cherry-pick/merge/revert — unblocks DIRTY PR rebase
2026-04-24 04:35:19 -07:00
rabbitblood
c43df7f947 fix(precommit): skip during rebase/cherry-pick/merge/revert — unblocks DIRTY PR rebase
Trace from molecule-core cycle 107 (2026-04-24): 15 staging PRs stuck
DIRTY (real merge conflicts) with 0 merges in 1+ hours. Authors couldn't
rebase to fix the conflicts because the pre-commit hook (shipped in
0.1.11) refuses ANY commit that includes forbidden paths in the diff —
including rebase replays of historical commits that pre-date the gate.

Specifically, agents trying to `git rebase staging` on a PR like
"docs(marketing): Phase 30 social copy" fail at the first commit replay
because that commit added marketing/* files. The fix would require
interactive rebase + manual file deletion + commit amend — agents don't
do that, so the PR stays DIRTY indefinitely.

Detection: check .git for rebase-merge/, rebase-apply/, CHERRY_PICK_HEAD,
MERGE_HEAD, or REVERT_HEAD. These state markers exist only during the
corresponding git operation. Skip the hook silently when present.
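
The detection step can be sketched in Python as below. The shipped hook is bash, so this is only an equivalent illustration; the function name is hypothetical, and it assumes `git_dir` is the resolved git directory (for example the output of `git rev-parse --git-dir`), not the work tree.

```python
from pathlib import Path

# State markers named in the commit message above.
_IN_PROGRESS_MARKERS = (
    "rebase-merge",
    "rebase-apply",
    "CHERRY_PICK_HEAD",
    "MERGE_HEAD",
    "REVERT_HEAD",
)


def git_operation_in_progress(git_dir):
    """Return True if any rebase/cherry-pick/merge/revert marker exists."""
    git_dir = Path(git_dir)
    return any((git_dir / name).exists() for name in _IN_PROGRESS_MARKERS)
```

A hook built on this would exit successfully without running its checks when the function returns True, and run them as usual otherwise.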

The hook still blocks fresh `git commit` (the failure mode it was
designed for). It just doesn't try to police what was already in git
history.

Bumped to 0.1.14.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 04:34:55 -07:00
Hongming Wang
faa5b42aa4
Merge pull request #47 from Molecule-AI/fix/enable-v0-3-compat
Some checks failed
Publish to PyPI / build-and-publish (push) Failing after 42s
fix: enable v0_3 compat in JSON-RPC dispatcher
2026-04-24 02:37:20 -07:00
rabbitblood
19f0033222 fix: enable v0_3 compat in JSON-RPC dispatcher — platform sends old method names
Sister fix to 0.1.12 (root mounting). After fixing the route mount,
every inbound A2A still returned `-32601 Method not found` because the
1.x dispatcher's method table doesn't recognize v0.3-shaped names
(`message/send`, `tasks/get`) that the platform's ProxyA2A still sends.

Reproduces in the SDK on a minimal handler:
    create_jsonrpc_routes(h, "/")                          → "Method not found"
    create_jsonrpc_routes(h, "/", enable_v0_3_compat=True) → dispatches OK

Bumped to 0.1.13. Both 0.1.12 and 0.1.13 are needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 02:37:07 -07:00
Hongming Wang
d22c19ad31
Merge pull request #46 from Molecule-AI/fix/jsonrpc-mount-at-root
Some checks failed
Publish to PyPI / build-and-publish (push) Failing after 40s
fix: mount JSON-RPC at root — fixes silent fleet productivity loss
2026-04-24 02:06:58 -07:00
rabbitblood
30ebe9baf3 fix: mount JSON-RPC at root — platform POSTs to /, not /api/v1/jsonrpc/
Baseline restart 2026-04-24: every workspace came up healthy (uvicorn
listening, agent-card serving) but produced zero delegations for two
maintenance cycles. Tracing revealed platform's ProxyA2A POSTs to
`http://ws-<id>:8000/` (no path suffix, see
workspace-server/internal/provisioner.InternalURL) while the runtime's
JSON-RPC routes were mounted at `/api/v1/jsonrpc/` under the a2a-sdk
1.x API migration.

Result was silent — every inbound A2A returned 404 Not Found, the
platform logged "Not Found" at INFO level, but no error bubbled up
because the SDK's jsonrpc route factory doesn't respond to root when
mounted at a subpath. Agents stayed warm, crons fired, but no work
flowed.

Fix: `create_jsonrpc_routes(handler, "/")` — matches platform
expectation and the agent-card self-advertisement (which also shows
root as the JSON-RPC URL). Agent-card route keeps its hard-coded
`/.well-known/agent-card.json` path so there's no collision.

Bumped to 0.1.12.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 02:06:04 -07:00
Hongming Wang
64720c0fc6
Merge pull request #45 from Molecule-AI/feat/precommit-hook-block-internal-paths
Some checks failed
Publish to PyPI / build-and-publish (push) Failing after 41s
feat: pre-commit hook to block internal paths in public monorepo (A)
2026-04-24 00:49:42 -07:00
rabbitblood
89739bf848 feat: pre-commit hook to block internal paths in public monorepo (A)
Anti-leak proposal item A. Companion to D (decision tree in role
prompts, separate PR on org-templates).

Why a local pre-commit hook
===========================

Agents try to `git add /research/foo.md` despite SHARED_RULES, the
.gitignore patterns, and the CI gate. Each leak attempt costs ~5 cycles
(PR opens, CI fails, agent retries with workaround) and pollutes git
history with reverts.

A pre-commit hook converts the failure from "PR opens then fails" →
"commit refused immediately, with the recovery command printed in the
same error message the agent reads." Agents act on what's in the
current response context — putting the redirect command literally in
the failure output is the highest-density feedback we can provide.

What changes
============

- molecule_runtime/scripts/pre-commit-block-internal-paths.sh —
  bash hook. Checks `git remote get-url origin`, only enforces in
  Molecule-AI/molecule-monorepo + molecule-core. In every other repo
  (internal, plugins, templates, third-party) it's a no-op.

  When forbidden paths are staged, refuses the commit with the redirect
  recipe + the alternative public-facing paths + the workflow-edit path
  for legitimate exceptions.

- molecule_runtime/precommit_hook.py — install_pre_commit_hook():
  1. Extracts bundled hook to ~/.molecule-runtime/git-hooks/pre-commit
  2. chmod +x
  3. Sets core.hooksPath globally — UNLESS already set by an operator
     (then logs a warning + skips, doesn't clobber)

- molecule_runtime/main.py — calls install_pre_commit_hook() at
  step 0.2, right after install_credential_helper()

- pyproject.toml bumped to 0.1.11
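
Step 3 of `install_pre_commit_hook()` (set `core.hooksPath` globally unless an operator already set it) might look roughly like the sketch below. The function name `configure_hooks_path` and the injectable `run` parameter are assumptions made so the logic can be exercised without touching real git config; the actual implementation also logs a warning when it skips.

```python
import subprocess


def configure_hooks_path(hooks_dir, run=subprocess.run):
    """Set global core.hooksPath unless a value is already configured.

    Returns True when the setting was written, False when it was skipped.
    """
    current = run(
        ["git", "config", "--global", "--get", "core.hooksPath"],
        capture_output=True,
        text=True,
    )
    # `git config --get` exits non-zero when the key is unset.
    if current.returncode == 0 and current.stdout.strip():
        return False  # existing operator value wins; do not clobber it
    run(
        ["git", "config", "--global", "core.hooksPath", str(hooks_dir)],
        check=True,
    )
    return True
```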

Both A and D together close the loop: D ensures the agent knows the
right path before writing; A enforces it at the local git boundary if
the agent forgets. CI gate remains the third backstop for anything
that gets pushed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 00:48:47 -07:00
Hongming Wang
f334872d56
Merge pull request #44 from Molecule-AI/feat/inline-credential-helper
feat: ship GitHub credential-helper inline in runtime (fixes #1933 class)
2026-04-24 00:42:32 -07:00
rabbitblood
f1329fe230 feat: ship GitHub credential-helper inline in runtime (fixes #1933 class)
Lifts the per-template wiring (Dockerfile COPY + entrypoint.sh git config
+ nohup daemon launch) into the Python runtime. Templates that depend
on molecule-ai-workspace-runtime get the behavior automatically — they
no longer need to maintain their own copy of the helper scripts or
remember to write the right git config in their entrypoint.

Background:
- GitHub App installation tokens (ghs_…) expire ~60min after issue
- claude-code-default template shipped without wiring → 39 workspaces
  lost their tokens, three PMs' A2A queues filled with retry-status
  messages, manual fleet restart required (cycle 62-66 incident)

This commit:
- Adds molecule_runtime/scripts/{molecule-git-token-helper.sh,
  molecule-gh-token-refresh.sh} as package data (copies from canonical
  workspace/scripts/ in molecule-monorepo)
- Adds molecule_runtime/credential_helper.py with
  install_credential_helper() that:
    1. Extracts bundled scripts to ~/.molecule-runtime/scripts/
    2. Configures git credential.helper for github.com
    3. Creates ~/.molecule-token-cache/ mode 0700
    4. Spawns refresh daemon under respawn loop (PID file dedup)
    5. Runs initial gh auth login --with-token
- Hooks call site early in main.py (step 0.1, before config load)
- Fail-soft: each step is independently fault-tolerant; a missing git/gh
  binary doesn't block runtime startup

Bumped to 0.1.10. Templates can drop their entrypoint.sh credential
helper setup once they update the runtime pin (separate PRs per template).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 00:41:32 -07:00
19fde6f466 fix(runtime): capture stderr in A2A error response (closes #66)
- Lower _PROCESS_ERROR_STDERR_MAX_CHARS to 1024 (was 4096) so A2A
  responses stay bounded — the full context is already in workspace logs
  via logger.error/exception.
- Add stderr= kwarg to sanitize_agent_error() so callers can surface
  subprocess stderr verbatim in A2A responses.
- In _execute_locked() non-retryable error path, extract the first 1 KB
  of exc.stderr and pass it to sanitize_agent_error() so the A2A
  response carries actionable context (rate limit message, auth error,
  etc.) instead of just a class name.
- Add test_executor_helpers.py unit tests for the new stderr= kwarg.
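
A rough sketch of the bounded-stderr idea above. `_PROCESS_ERROR_STDERR_MAX_CHARS` and its value come from the commit message; `summarize_error` is a hypothetical stand-in and does not reproduce the real `sanitize_agent_error()` signature or output format.

```python
_PROCESS_ERROR_STDERR_MAX_CHARS = 1024  # value named in the commit message


def summarize_error(exc, stderr=None):
    """Hypothetical stand-in, not the runtime's sanitize_agent_error().

    Returns the exception class name, plus at most the first
    _PROCESS_ERROR_STDERR_MAX_CHARS characters of stderr when provided.
    """
    summary = type(exc).__name__
    if stderr:
        snippet = stderr[:_PROCESS_ERROR_STDERR_MAX_CHARS].strip()
        if snippet:
            summary = f"{summary}: {snippet}"
    return summary
```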
2026-04-24 05:00:51 +00:00
molecule-ai[bot]
d5cf872311
feat: migrate a2a-sdk 1.x (KI-009) (#39)
- Replace a2a.utils.new_agent_text_message → a2a.helpers.new_text_message
- Replace Part(root=TextPart(...)) → Part(text=...) (flat Part API)
- Replace A2AStarletteApplication → Starlette route factories
  (create_agent_card_routes, create_jsonrpc_routes)
- Update conftest stubs: remove a2a.server.apps/a2a.utils,
  add a2a.server.routes/a2a.helpers/AgentInterface
- Add AgentInterface to AgentCard supported_interfaces
- Rename snake_case AgentCard fields per 1.x schema

Co-authored-by: Molecule AI Infra-Runtime-BE <infra-runtime-be@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 01:54:33 +00:00
Hongming Wang
1b04da2061
Merge pull request #38 from Molecule-AI/fix/auto-detect-llm-token-type
feat(runtime): auto-detect LLM token type, normalise env on boot
2026-04-23 13:53:06 -07:00
Hongming Wang
e562b7a03e
Merge branch 'staging' into fix/auto-detect-llm-token-type
2026-04-23 13:52:25 -07:00
Hongming Wang
3556244725
Merge pull request #40 from Molecule-AI/fix/heartbeat-401-token-refresh-1877
fix(heartbeat): refresh on-disk auth token on 401 + retry once (#1877)
2026-04-23 13:51:42 -07:00
rabbitblood
a78b9f229e test(1877): convert async tests to sync httpx.Client to unblock CI
CI doesn't have pytest-asyncio installed, and the async wrapping was
incidental — the production retry pattern (refresh-on-401) is identical
in sync and async forms. Switching to httpx.Client + MockTransport keeps
the same coverage without the async dep.

6/6 still pass locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 13:35:45 -07:00
rabbitblood
050c2412b3 fix(heartbeat): refresh on-disk auth token on 401 + retry once (#1877)
## Problem

Auto-restart rotates the workspace's auth token in two non-atomic steps:
  1. Platform issues new token via wsauth.IssueToken
  2. Provisioner writes the new token to /configs/.auth_token AFTER
     ContainerStart returns

Between steps 1 and 2, the new container has booted and the runtime has
already loaded the OLD cached value of .auth_token (or no value if the
file was empty during boot). The runtime's first /registry/heartbeat
call sends the stale token, gets 401, but the loop never re-reads the
on-disk token — so subsequent heartbeats also send the stale value.

Each 401 means the platform never sees the workspace as alive →
status stays 'provisioning' → scheduler won't dispatch → workspace
looks dead from every angle even though the container is actually
running.

The existing code comment in workspace_provision.go acknowledges this:
"the workspace will get 401 on its first heartbeat and can recover on
the next restart." That recovery only worked because workspaces used
to crash for unrelated reasons and get restarted. After PR #1861
(provisioner empty-volume auto-recover) removed those crashes,
workspaces get stuck in the 401 loop with no exit.

## Fix

Two-part runtime-side fix in molecule-ai-workspace-runtime:

1. **platform_auth.refresh_from_disk()** — new helper that clears the
   in-memory cache and re-reads /configs/.auth_token. Returns the
   fresh value (or None if missing). Updates the cache as a side effect.

2. **HeartbeatLoop._loop()** — on 401 from /registry/heartbeat, calls
   refresh_from_disk() and retries the request ONCE with the new token.
   Same pattern in _check_delegations(). Bounded retry budget — if the
   on-disk token is also stale (bug elsewhere), no infinite loop.
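
The refresh-on-401 pattern in the two steps above can be sketched as follows. All three callables are hypothetical stand-ins for the runtime's request and auth helpers; only the control flow (one refresh, one retry, no loop) is taken from the description.

```python
def post_with_refresh(send, get_token, refresh_from_disk):
    """Send once; on 401, refresh the token and retry exactly once.

    send(token) must return an HTTP status code. get_token() returns the
    cached token; refresh_from_disk() returns the fresh token or None.
    """
    status = send(get_token())
    if status != 401:
        return status
    fresh = refresh_from_disk()
    if fresh is None:
        return status  # nothing on disk to retry with
    return send(fresh)  # single retry; a second 401 is returned as-is
```

Because the retry is not recursive, a stale on-disk token costs exactly one extra request per heartbeat and cannot spin.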

## Tests

6/6 new tests in tests/test_token_refresh_1877.py:

  - refresh_picks_up_rotated_token              — happy path
  - refresh_returns_none_when_file_missing      — defensive
  - refresh_clears_stale_cache_when_file_disappears
  - refresh_is_idempotent
  - 401_retry_pattern_uses_refreshed_token      — the production fix path
  - 401_retry_no_loop_when_disk_token_also_stale — bounded retry budget

All pass locally on Python 3.13 + pytest 9.

## Why this fix and not the alternatives

- **Alternative B (platform writes token before ContainerStart):**
  Right architecturally but invasive — needs provisioner refactor to
  prep volumes before docker run.
- **Alternative C (skip rotation on auto-restart):** Breaks the
  multi-instance-safety invariant the existing code calls out
  (revoke prevents stale tokens from sister deployments).
- **This fix (A):** 3-line core change + helper. Self-healing for any
  timing edge case, not just the post-restart one. Costs nothing in
  the happy path (only triggers on 401).

## Version

Bumped to 0.1.9. Once published to PyPI + workspace template image
rebuilt, deployed workspaces auto-recover from token-rotation races
without operator intervention.

Closes #1877.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 13:26:36 -07:00
rabbitblood
4bafea58ae fix(llm_auth): tighten base-URL hostname match + strip whitespace + no token in logs
Self-review findings on #38:

1. **Token substring leak**: the "unknown prefix" warning included the
   first 12 chars of the token in the log message. Logs get shipped to
   Langfuse / CloudWatch / slack-firehose — 12 bytes of a secret in a
   log is still 12 bytes too many. Warning no longer references the
   token value at all.

2. **Base-URL substring match was too loose**: `"anthropic.com" not in
   base` would accept `https://proxy.anthropic.com.evil.example/` as
   "looks like Anthropic, keep the URL." Replaced with an allowlist of
   exact hostnames parsed via urllib.parse.urlparse.

3. **Whitespace in pasted tokens**: operators frequently paste tokens
   from terminals with a trailing newline. The token would flow through
   startswith() detection but then fail downstream auth with a
   confusing "malformed token" error. Strip and persist the cleaned
   value.

4. **Malformed base URL crash guard**: if someone sets ANTHROPIC_BASE_URL
   to something urlparse can't handle, don't crash — fall through to
   clearing it, which is the safe choice in OAuth mode.
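
Findings 2 and 4 can be illustrated with a small sketch that uses `urllib.parse.urlparse` and an exact-hostname allowlist. The allowlist contents and the function name are assumptions; the runtime's actual list is not shown in this log.

```python
from urllib.parse import urlparse

# Hypothetical allowlist; the runtime's actual hostnames are not shown here.
_ALLOWED_HOSTS = {"api.anthropic.com"}


def is_allowed_base_url(base_url):
    """Return True only when the URL's hostname is exactly allowlisted."""
    try:
        host = urlparse(base_url.strip()).hostname
    except ValueError:
        return False  # malformed URL: treat as not allowed instead of crashing
    return host is not None and host.lower() in _ALLOWED_HOSTS
```

An exact match on the parsed hostname rejects `https://proxy.anthropic.com.evil.example/`, which a substring check on `"anthropic.com"` would accept.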

Added 5 new tests covering each of the above. 16/16 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 10:46:07 -07:00
rabbitblood
0a0f11b41f feat(runtime): auto-detect LLM token type, normalise env on boot
Platform stores per-workspace LLM credentials under a single key
(ANTHROPIC_AUTH_TOKEN in workspace_secrets). But downstream tools
expect different env var names depending on the token type:

  sk-ant-oat01-*  → CLAUDE_CODE_OAUTH_TOKEN  (Claude Code OAuth session)
  sk-ant-api03-*  → ANTHROPIC_API_KEY        (direct Anthropic API)
  sk-cp-*         → ANTHROPIC_AUTH_TOKEN     (proxy: MiniMax, gateways)

Without normalisation, an OAuth token under ANTHROPIC_AUTH_TOKEN gets
sent as a bearer to api.anthropic.com, which responds:

    401 authentication_error: OAuth authentication is currently not
    supported.

This was a platform-wide footgun: anyone rotating LLM keys had to
know the exact env var for each token type, AND make sure stale
overrides were cleared, AND set ANTHROPIC_BASE_URL correctly for
proxies (or NOT set for native Claude). Nothing downstream could
help — the SDK just saw the wrong var.

Fix:

- New molecule_runtime/llm_auth.py — normalise_llm_env() mutates
  os.environ (or any dict) to the correct shape based on token
  prefix. Returns a NormalisationResult for logging.
- main.py calls it as step 0, before any adapter/executor import.
  Every adapter (claude-code, langgraph, crewai, autogen, hermes,
  …) benefits automatically — no per-adapter branching needed.
- 11 unit tests covering all prefix paths, edge cases, and the
  "operator deliberately set CLAUDE_CODE_OAUTH_TOKEN" precedence
  rule.
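
As a hedged sketch of the prefix-based normalisation described above: the prefix-to-variable table is taken from the commit message, but the function names, the removal of the source variable, and the "existing explicit value wins" handling are assumptions, not the behavior of the real `normalise_llm_env()`.

```python
# Prefix table as listed in the commit message.
_PREFIX_TO_ENV = (
    ("sk-ant-oat01-", "CLAUDE_CODE_OAUTH_TOKEN"),
    ("sk-ant-api03-", "ANTHROPIC_API_KEY"),
    ("sk-cp-", "ANTHROPIC_AUTH_TOKEN"),
)


def target_env_var(token):
    """Return the env var name implied by the token prefix, or None."""
    for prefix, name in _PREFIX_TO_ENV:
        if token.startswith(prefix):
            return name
    return None


def normalise(env):
    """Move ANTHROPIC_AUTH_TOKEN to the variable its prefix implies."""
    token = env.get("ANTHROPIC_AUTH_TOKEN", "")
    name = target_env_var(token)
    if name is None or name == "ANTHROPIC_AUTH_TOKEN":
        return env  # unknown prefix or already in the right slot
    env.setdefault(name, token)  # assumed rule: an explicit value is kept
    del env["ANTHROPIC_AUTH_TOKEN"]  # assumed: drop the mismatched slot
    return env
```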

Operationally: this means operators can keep using one
ANTHROPIC_AUTH_TOKEN slot in platform settings and just paste
whatever token the agent needs. No env-var-name awareness required.

Tested locally: 11/11 new tests pass. 83 other tests unchanged
(pre-existing failures on staging are all unrelated:
test_workspace_id_validation, test_a2a_mcp_server RBAC, the
test_imports.main module-walker — same signature as on staging
HEAD before this PR).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 10:41:47 -07:00
molecule-ai[bot]
dcb6edd1a1
fix(shared_runtime): push heartbeat on CLEAR in set_current_task() (#37)
Fixes #1372 — phantom busy: canvas showed workspace as active for up
to 30s after task completion because set_current_task("") returned
early without posting the updated heartbeat.

Before: clearing only updated the heartbeat object; the next 30s
scheduled heartbeat cycle propagated the clear. Quick tasks would leave
a phantom-busy indicator.

After: both SET and CLEAR push immediately to /registry/heartbeat.
active_tasks=0 on clear, active_tasks=1 on set. Heartbeat object
update and HTTP post are now unconditional.

Tests: 5 new cases covering SET/CLEAR HTTP body, error resilience,
None heartbeat, and missing env vars.

Co-authored-by: Molecule AI Infra-Runtime-BE <infra-runtime-be@agents.moleculesai.app>
2026-04-22 17:33:42 +00:00
rabbitblood
1e545ed6ba chore: bump 0.1.8 — executor_helpers phantom-busy fix confirmed in tree
Some checks failed
Publish to PyPI / build-and-publish (push) Failing after 8s
2026-04-21 07:16:47 -07:00
rabbitblood
5a1990552d chore: bump 0.1.7 — ensure executor_helpers phantom-busy fix in PyPI build
Some checks failed
Publish to PyPI / build-and-publish (push) Failing after 7s
2026-04-21 07:07:17 -07:00
rabbitblood
59f54560a0 Merge branch 'main' of https://github.com/Molecule-AI/molecule-ai-workspace-runtime into fix/507-mcp-server-path-absolute-imports
Some checks failed
Publish to PyPI / build-and-publish (push) Failing after 6s
# Conflicts:
#	pyproject.toml
2026-04-21 06:37:38 -07:00
rabbitblood
d3235cc564 fix(heartbeat): increment/decrement active_tasks + push on clear (#1372, #1408)
Both set_current_task() implementations (shared_runtime.py + executor_helpers.py):
- Increment active_tasks on task start, decrement on completion (was binary 0/1)
- Push heartbeat immediately on BOTH increment AND decrement
- Only clear current_task when active_tasks reaches 0 (preserves description
  for still-running tasks)

Fixes phantom-busy: the old code returned early on clear, leaving
active_tasks=1 in the platform DB until the next 30s heartbeat cycle.
If a new cron fired before the heartbeat, the workspace appeared
permanently busy — required manual DB reset every 30 min.
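
The counter behavior described above can be sketched like this. `TaskTracker` and its `push` callable are hypothetical stand-ins for the runtime's heartbeat state and HTTP post, and clamping the counter at zero is an assumption added for safety.

```python
class TaskTracker:
    """Hypothetical stand-in for the runtime's heartbeat state."""

    def __init__(self, push):
        self.active_tasks = 0
        self.current_task = ""
        self._push = push  # callable that posts the heartbeat

    def set_current_task(self, description):
        if description:
            # Task start: count it and record the description.
            self.active_tasks += 1
            self.current_task = description
        else:
            # Task end: decrement (clamped at 0, an assumption) and only
            # clear the description once nothing is still running.
            self.active_tasks = max(0, self.active_tasks - 1)
            if self.active_tasks == 0:
                self.current_task = ""
        # Push immediately on both start and end, never return early.
        self._push(self.active_tasks, self.current_task)
```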

Bump: 0.1.2 → 0.1.3

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-21 06:37:12 -07:00
Hongming Wang
7febb51382
Merge pull request #36 from Molecule-AI/chore/bump-0.1.5
Some checks failed
Publish to PyPI / build-and-publish (push) Failing after 6s
chore: bump to 0.1.5 for X-Molecule-Org-Id header fix
2026-04-20 20:30:54 -07:00