molecule-ai-workspace-runtime/pyproject.toml
rabbitblood 050c2412b3 fix(heartbeat): refresh on-disk auth token on 401 + retry once (#1877)
## Problem

Auto-restart rotates the workspace's auth token in two non-atomic steps:
  1. Platform issues new token via wsauth.IssueToken
  2. Provisioner writes the new token to /configs/.auth_token AFTER
     ContainerStart returns

Between steps 1 and 2, the new container has booted and the runtime has
already loaded the OLD cached value of .auth_token (or no value if the
file was empty during boot). The runtime's first /registry/heartbeat
call sends the stale token, gets 401, but the loop never re-reads the
on-disk token — so subsequent heartbeats also send the stale value.

Each 401 means the platform never sees the workspace as alive →
status stays 'provisioning' → scheduler won't dispatch → workspace
looks dead from every angle even though the container is actually
running.

The existing code comment in workspace_provision.go acknowledges this:
"the workspace will get 401 on its first heartbeat and can recover on
the next restart." That recovery only worked because workspaces used
to crash for unrelated reasons and get restarted. After PR #1861
(provisioner empty-volume auto-recover) removed those crashes,
workspaces get stuck in the 401 loop with no exit.

## Fix

Two-part runtime-side fix in molecule-ai-workspace-runtime:

1. **platform_auth.refresh_from_disk()** — new helper that clears the
   in-memory cache and re-reads /configs/.auth_token. Returns the
   fresh value (or None if missing). Updates the cache as a side effect.

2. **HeartbeatLoop._loop()** — on 401 from /registry/heartbeat, calls
   refresh_from_disk() and retries the request ONCE with the new token.
   Same pattern in _check_delegations(). The retry budget is bounded: if
   the on-disk token is also stale (a bug elsewhere), the single retry
   fails and there is no infinite loop.
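
The two parts above can be sketched roughly as follows. This is a minimal illustration, not the code in molecule-ai-workspace-runtime: `TokenCache`, `send_with_single_refresh`, and the `send` callable (which takes a token and returns an HTTP status code) are hypothetical names, and skipping the retry when the token file is missing is an assumption.

```python
from pathlib import Path
from typing import Callable, Optional


class TokenCache:
    """Hypothetical stand-in for the cache inside platform_auth."""

    def __init__(self, path: Path) -> None:
        self.path = Path(path)
        self.token: Optional[str] = None

    def refresh_from_disk(self) -> Optional[str]:
        """Clear the cached token, re-read the file, and update the cache."""
        self.token = None
        try:
            value = self.path.read_text().strip()
        except FileNotFoundError:
            return None
        # Treat an empty file the same as a missing one.
        self.token = value or None
        return self.token


def send_with_single_refresh(
    send: Callable[[Optional[str]], int], cache: TokenCache
) -> int:
    """Send once; on 401, refresh the token from disk and retry exactly once."""
    status = send(cache.token)
    if status == 401:
        fresh = cache.refresh_from_disk()
        if fresh is not None:
            # Bounded budget: one retry, whatever the second status is.
            status = send(fresh)
    return status
```

Because the retry is a single conditional rather than a loop, a stale on-disk token costs at most two requests per heartbeat tick, and the happy path is unchanged.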

## Tests

6/6 new tests in tests/test_token_refresh_1877.py:

  - refresh_picks_up_rotated_token              — happy path
  - refresh_returns_none_when_file_missing      — defensive
  - refresh_clears_stale_cache_when_file_disappears
  - refresh_is_idempotent
  - 401_retry_pattern_uses_refreshed_token      — the production fix path
  - 401_retry_no_loop_when_disk_token_also_stale — bounded retry budget

All pass locally on Python 3.13 + pytest 9.

## Why this fix and not the alternatives

- **Alternative B (platform writes token before ContainerStart):**
  Right architecturally but invasive — needs provisioner refactor to
  prep volumes before docker run.
- **Alternative C (skip rotation on auto-restart):** Breaks the
  multi-instance-safety invariant the existing code calls out
  (revoke prevents stale tokens from sister deployments).
- **This fix (A):** 3-line core change + helper. Self-healing for any
  timing edge case, not just the post-restart one. Costs nothing in
  the happy path (only triggers on 401).

## Version

Bumped to 0.1.9. Once this version is published to PyPI and the
workspace template image is rebuilt, deployed workspaces will
auto-recover from token-rotation races without operator intervention.

Closes #1877.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 13:26:36 -07:00

[build-system]
requires = ["setuptools>=68.0", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "molecule-ai-workspace-runtime"
version = "0.1.9"
description = "Molecule AI workspace runtime — shared infrastructure for all agent adapters"
requires-python = ">=3.11"
license = {text = "BSL-1.1"}
readme = "README.md"
# Don't pin heavy deps — each adapter adds its own
dependencies = [
    # Upper bound: a2a-sdk 1.0.0 dropped the a2a.server.apps module we import
    # in main.py. Keep on the 0.3.x line until we migrate to the 1.x API.
    "a2a-sdk[http-server]>=0.3.25,<1.0",
    "httpx>=0.27.0",
    "uvicorn>=0.30.0",
    "starlette>=0.38.0",
    "websockets>=12.0",
    "pyyaml>=6.0",
    "langchain-core>=0.3.0",
    "opentelemetry-api>=1.24.0",
    "opentelemetry-sdk>=1.24.0",
    "opentelemetry-exporter-otlp-proto-http>=1.24.0",
    "temporalio>=1.7.0",
]

[project.scripts]
molecule-runtime = "molecule_runtime.main:main_sync"

[tool.setuptools.packages.find]
where = ["."]
include = ["molecule_runtime*"]

[tool.setuptools.package-data]
"molecule_runtime" = ["py.typed"]