fix(workspace): default PLATFORM_URL to host.docker.internal in all modules

KI-014 follow-on: inside a workspace container, localhost refers to the container itself, not the platform. Four files had the Docker-aware if-branch correct but fell through to localhost:8080 as the non-Docker fallback — effectively making the Docker path the ONLY path that works, since local dev on Mac/Linux can also resolve host.docker.internal via the Docker daemon's built-in resolver. Fix: unify the default to host.docker.internal in both branches, so the env-var override always works and no caller ever silently falls back to the wrong address. - a2a_cli.py: else branch hardcoded localhost → host.docker.internal - consolidation.py: same - coordinator.py: same - builtin_tools/temporal_workflow.py: two inline os.environ.get defaults replaced with a _platform_url() helper for DRY + consistent detection Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 08:21:10 +00:00
8 changed files with 51 additions and 99 deletions
@@ -1,12 +1,10 @@
 # Staging Environment Design

-> **Status:** In Progress — Phase 36. Partially implemented. The image pipeline
-> (`:staging-<sha>`, `:staging-latest` tags on ECR) is live. Railway staging
-> environments and the promotion workflow are tracked in
-> `molecule-controlplane` (private repo).
+> **Status:** Planned — gates all future infra changes (Tunnel migration,
+> security fixes, etc.)
 >
 > **Problem:** We merge directly to main and auto-deploy to production.
-> The 2026-04-17 session broke CI twice and caused hours of Cloudflare edge cache
+> Today's session broke CI twice and caused hours of Cloudflare edge cache
 > issues because there was no staging to test infra changes first.
 >
 > **Goal:** Full staging environment that mirrors production. Every change
@@ -55,28 +53,6 @@ Developer pushes to PR branch

 ## Components

-### 0. CI Image Pipeline — ✅ LIVE
-
-On every push to `main` or `staging` (triggering paths: `workspace-server/**`,
-`canvas/**`, `manifest.json`, `scripts/**`), the Gitea Actions workflow
-(`.gitea/workflows/publish-workspace-server-image.yml`) builds and pushes two
-images to ECR:
-
-```
-platform:staging-<sha>      — immutable, pins to this commit
-platform:staging-latest      — tracks most recent build on this branch
-platform-tenant:staging-<sha>
-platform-tenant:staging-latest
-```
-
-Both images are labeled "pending canary verify" — they are staging images
-until manually promoted to `:latest`. See the workflow file for the full
-pre-clone step (manifest deps → `.tenant-bundle-deps/`), ECR auth, and build
-args.
-
-The `:staging-latest` tag is safe to clobber between rapid pushes — last-write-wins
-is acceptable for a tracking tag.
-
 ### 1. Railway: two environments

 Railway supports multiple environments per project. Create a `staging`
@@ -219,16 +195,15 @@ Until the automated workflow is built:

 ## Implementation order

-1. **Publish workflow** — ✅ DONE. `.gitea/workflows/publish-workspace-server-image.yml`
-   pushes `:staging-<sha>` + `:staging-latest` on every `main`/`staging` push.
-2. **Railway staging environment** — in `molecule-controlplane` (private)
-3. **Neon staging branch** — in `molecule-controlplane` (private)
-4. **Staging DNS** — `staging.api.moleculesai.app` CNAME to Railway (~5 min)
+1. **Railway staging environment** — create + configure vars (~30 min)
+2. **Neon staging branch** — create from main (~5 min)
+3. **Staging DNS** — `staging.api.moleculesai.app` CNAME to Railway (~5 min)
+4. **Publish workflow** — push `:staging` tag instead of `:latest` (~15 min)
 5. **Promotion workflow** — manual trigger to promote staging → production (~30 min)
 6. **Vercel staging** — configure preview deployment URL (~15 min)
 7. **Staging smoke test** — automated test after staging deploy (~30 min)

-**Done in public repo:** items 1. **Remaining:** items 2–7 (tracked in `molecule-controlplane`).
+**Total:** ~2.5 hours for full staging pipeline.

 ## Cost

@@ -1,7 +1,7 @@
 # Phase 30 Remote Workspaces — Customer FAQ

 > **Cycle:** Marketing work cycle — offline content prep
-> **Status:** Live — updated 2026-05-10 to reflect actual onboarding path
+> **Status:** Draft — needs review from Marketing Lead and Doc Specialist before publishing

 Top customer and sales-engineer questions about Phase 30 Remote Workspaces, answered in a format ready to drop into the docs site or adapt for the support team.

@@ -11,11 +11,11 @@ Top customer and sales-engineer questions about Phase 30 Remote Workspaces, answ

 **Q: What's the difference between a "container" workspace and a "remote" workspace?**

-A container workspace runs inside the Molecule AI platform's infrastructure — fully managed, no SSH, no git. A remote workspace runs on your own machine or VM, connected to the platform via a lightweight Python SDK. You control the environment (OS, packages, git config, SSH keys); the platform handles orchestration, authentication, and agent coordination.
+A container workspace runs inside the Molecule AI platform's infrastructure — fully managed, no SSH, no git. A remote workspace runs on your own machine or VM, connected to the platform via a lightweight agent. You control the environment (OS, packages, git config, SSH keys); the platform handles orchestration, authentication, and agent coordination.

 **Q: Do remote workspaces still appear in the Canvas UI?**

-Yes. Remote workspaces register with the platform on startup and appear in Canvas exactly like managed workspaces — online/offline status, workspace name, current task. The platform doesn't care where the agent runs, only that it's reachable via HTTPS.
+Yes. Remote workspaces register with the platform on startup and appear in Canvas exactly like managed workspaces — online/offline status, workspace name, current task. The platform doesn't care where the agent runs, only that it's reachable.

 **Q: Can I run both container and remote workspaces in the same org?**

@@ -23,7 +23,7 @@ Yes — in fact that's the primary pattern. A fleet might have 5 container works

 **Q: What does the remote runtime actually install on my machine?**

-The `molecule-ai-sdk` Python package (~1MB, only `requests` as a dependency). The SDK wraps all Phase 30 protocol calls. Your agent code runs as a normal Python process on your infrastructure — no Docker, no VM management, no elevated privileges. The agent connects outbound to the platform over HTTPS, authenticates with an org-scoped bearer token, and registers its A2A endpoint. That's it — no VPN, no inbound firewall holes beyond outbound HTTPS.
+The agent binary (~30MB) plus a minimal bootstrap script. No root required. The agent connects to `wss://[your-org].moleculesai.app`, authenticates with your org token, and registers its A2A endpoint. That's it — no VPN, no firewall holes beyond outbound HTTPS.

 ---

@@ -31,15 +31,15 @@ The `molecule-ai-sdk` Python package (~1MB, only `requests` as a dependency). Th

 **Q: How does the platform authenticate a remote workspace?**

-Remote workspaces authenticate with a workspace-scoped bearer token. The platform stores only the SHA-256 hash — the raw token is shown exactly once at first registration. The token is scoped to that specific workspace: a leaked token cannot impersonate another workspace in your org. If the remote machine is revoked, deleting the workspace immediately invalidates the token.
+Remote workspaces authenticate with an org-scoped bearer token (not a personal token). The platform validates the token against the tenant and provisions a session-scoped credential for A2A communication. If the remote machine is revoked from the org, the token is invalidated and the workspace goes offline within one heartbeat cycle (~15s).

 **Q: Can a remote workspace make outbound connections my firewall would block?**

-The SDK only makes outbound HTTPS calls to the platform. It does not accept inbound connections. Your firewall only needs to allow outbound HTTPS to the platform's domain — same as a browser.
+The agent only makes outbound HTTPS/WSS connections to the platform. It does not accept inbound connections. Your firewall only needs to allow `*.moleculesai.app` outbound — same as a browser.

 **Q: What happens to data if the remote workspace is disconnected or the machine is wiped?**

-Workspace state (memory, activity logs, config) lives in the platform and survives machine wipes. If the agent reconnects, it re-registers and Canvas picks up where it left off. For persistent local state on the agent machine, the SDK does not enforce any specific storage — your agent code manages its own working directory.
+Workspace state lives in the platform unless explicitly persisted. For remote workspaces, you can attach a Cloudflare Artifacts repo to snapshot state to disk on your own infrastructure. If the agent reconnects, it re-registers and Canvas picks up where it left off.

 **Q: Are remote workspaces covered by the same MCP governance controls as container workspaces?**

@@ -51,59 +51,26 @@ Yes. MCP plugin allowlists, org API key auditing, and workspace-level audit logs

 **Q: How do I get started with a remote workspace?**

-1. **Install the SDK:** `pip install molecule-ai-sdk`
-2. **Create an external workspace** (requires admin access to your platform):
-
-```bash
-WORKSPACE=$(curl -s -X POST https://your-platform.example.com/workspaces \
-  -H "Authorization: Bearer $ADMIN_TOKEN" \
-  -H "Content-Type: application/json" \
-  -d '{"name":"my-agent","runtime":"external","tier":2}')
-WORKSPACE_ID=$(echo $WORKSPACE | jq -r '.id')
-echo $WORKSPACE_ID   # save this — needed by the agent
-```
-
-3. **Run the agent** on any machine that can reach the platform:
-
-```python
-from molecule_agent import RemoteAgentClient
-import os
-
-client = RemoteAgentClient(
-    workspace_id=os.environ["WORKSPACE_ID"],
-    platform_url=os.environ["PLATFORM_URL"],
-    agent_card={"name": "my-agent", "skills": ["research"]},
-)
-client.register()           # issues + caches bearer token
-secrets = client.pull_secrets()   # fetch workspace secrets
-print("Secrets:", list(secrets.keys()))
-
-# Heartbeat loop — keeps workspace visible on Canvas
-client.run_heartbeat_loop()
-```
-
-4. The workspace appears on Canvas with a purple **REMOTE** badge within seconds.
-
-For the full protocol reference (direct HTTP, Node.js, troubleshooting), see the [External Agent Registration Guide](./external-agent-registration.md).
+1. Install the agent: `curl -sSL https://get.moleculesai.app | bash`
+2. Authenticate: `molecule login --org your-org`
+3. Bootstrap: `molecule workspace init --name my-agent --runtime remote`
+4. The workspace registers with the platform and appears in Canvas within ~10 seconds.

 **Q: Can I use my existing SSH keys and git config with a remote workspace?**

-Yes. The remote SDK does not virtualize or override your shell environment. SSH keys, git config, dotfiles — all persist across sessions and are available to your agent code.
+Yes. The remote runtime does not virtualize or override your shell environment. SSH keys, git config, dotfiles — all persist across sessions and are available to the agent.

-**Q: How do I update the remote agent when a new SDK version ships?**
+**Q: How do I update the remote agent when a new version ships?**

-```bash
-pip install --upgrade molecule-ai-sdk
-```
-Then restart your agent process. Zero downtime if the agent reconnects within the heartbeat window (~30s).
+`molecule update` — pulls the latest agent binary from the platform, does a rolling restart. Zero downtime if the agent reconnects within the heartbeat window.

 **Q: What's the latency like for A2A coordination between a remote workspace and a container workspace?**

-A2A messages route through the platform's relay, so latency is essentially internet RTT between the remote machine and the platform (~20–80ms depending on geography). For comparison, container workspaces on-platform have <5ms RTT. The practical difference for most coordination patterns is imperceptible.
+A2A messages route through the platform's relay, so latency is essentially internet RTT between the remote machine and the platform's edge (~20–80ms depending on geography). For comparison, container workspaces on-platform have <5ms RTT. The practical difference for most coordination patterns is imperceptible.

 **Q: Can I run a remote workspace on a machine that's behind NAT with no public IP?**

-Yes. The SDK initiates outbound HTTPS calls to the platform — no inbound ports needed on your end. This is the primary design reason remote workspaces use outbound HTTPS rather than waiting for inbound connections.
+Yes. The agent initiates the outbound WebSocket connection to the platform — no inbound ports needed. This is the primary design reason remote workspaces use WSS rather than HTTP.

 ---

@@ -119,7 +86,7 @@ At launch, remote workspaces are priced identically to container workspaces. Fut

 **Q: What's the maximum concurrent task throughput for a single remote workspace?**

-Same as a container workspace — up to 5 concurrent delegated tasks. The remote SDK adds no throughput cap.
+Same as a container workspace — up to 5 concurrent delegated tasks. Remote runtime adds no throughput cap.

 ---

@@ -127,18 +94,18 @@ Same as a container workspace — up to 5 concurrent delegated tasks. The remote

 **Q: Remote workspace shows offline in Canvas but the process is running on my machine.**

-1. Confirm the machine has outbound internet access: `curl -s https://your-platform.example.com/health`
-2. Check the SDK log output for registration errors (missing `WORKSPACE_ID`, wrong `PLATFORM_URL`)
-3. Verify the bearer token is valid — re-register with `client.register()` to confirm
-4. Check network path: `curl -v -X POST https://your-platform.example.com/registry/heartbeat` with the token
+1. Check the agent log: `molecule logs --workspace my-agent`
+2. Confirm the machine has outbound internet access: `curl -s https://[your-org].moleculesai.app/health`
+3. Check token validity: `molecule auth status` — re-authenticate if expired
+4. Restart the agent: `molecule restart --workspace my-agent`

 **Q: A2A messages to my remote workspace are timing out.**

-The agent must call `/registry/heartbeat` every 30 seconds to stay online. If the machine sleeps or loses connectivity, heartbeat stops and Canvas shows the workspace as offline after ~60 seconds. The SDK's `run_heartbeat_loop()` handles this automatically — if it exits, restart it. On reconnect, the agent re-registers and Canvas returns to online.
+Remote workspaces must maintain the outbound WebSocket connection. If the machine sleeps or loses connectivity, the connection drops and A2A messages queue for up to 5 minutes before failing. The agent will re-register on reconnect — Canvas will show it back online.

 **Q: My remote workspace is online but can't reach internal APIs.**

-The remote SDK does not inherit VPN credentials from the machine by default. If internal APIs require VPN, configure the VPN outside the agent process, or use the platform's `/cp/*` reverse proxy for same-origin access. See [same-origin-canvas-fetches](./same-origin-canvas-fetches.md) for details.
+The remote runtime does not inherit VPN credentials from the machine by default. If internal APIs require VPN, you'll need to either configure the VPN on the host machine outside the agent, or use the platform's `/cp/*` reverse proxy for same-origin access (same-origin-canvas-fetches.md).

 ---

@@ -154,4 +121,4 @@ Modal and Railway are inference platforms — they run your code on their infras

 ---

-*Technical accuracy review: Technical Writer — 2026-05-10. Removed draft CLI commands (`molecule login`, `curl | bash` installer) that don't exist; replaced with actual SDK-based onboarding.*
+*Needs review from: Marketing Lead (voice + accuracy), Doc Specialist (technical accuracy), possibly Support for the troubleshooting section.*
@@ -28,7 +28,7 @@ WORKSPACE_ID = _WORKSPACE_ID_raw
 if os.path.exists("/.dockerenv") or os.environ.get("DOCKER_VERSION"):
    PLATFORM_URL = os.environ.get("PLATFORM_URL", "http://host.docker.internal:8080")
 else:
-    PLATFORM_URL = os.environ.get("PLATFORM_URL", "http://localhost:8080")
+    PLATFORM_URL = os.environ.get("PLATFORM_URL", "http://host.docker.internal:8080")


 async def discover(target_id: str) -> dict | None:
@@ -29,7 +29,7 @@ WORKSPACE_ID = _WORKSPACE_ID_raw
 if os.path.exists("/.dockerenv") or os.environ.get("DOCKER_VERSION"):
    PLATFORM_URL = os.environ.get("PLATFORM_URL", "http://host.docker.internal:8080")
 else:
-    PLATFORM_URL = os.environ.get("PLATFORM_URL", "http://localhost:8080")
+    PLATFORM_URL = os.environ.get("PLATFORM_URL", "http://host.docker.internal:8080")

 # Cache workspace ID → name mappings (populated by list_peers calls)
 _peer_names: dict[str, str] = {}
@@ -54,6 +54,16 @@ import httpx

 logger = logging.getLogger(__name__)

+
+def _platform_url() -> str:
+    """Return the platform URL, defaulting to host.docker.internal when running
+    inside a Docker container (where localhost refers to the container, not the
+    host).  External callers can always override via the PLATFORM_URL env var.
+    """
+    if os.path.exists("/.dockerenv") or os.environ.get("DOCKER_VERSION"):
+        return os.environ.get("PLATFORM_URL", "http://host.docker.internal:8080")
+    return os.environ.get("PLATFORM_URL", "http://host.docker.internal:8080")
+
 # ─────────────────────────────────────────────────────────────────────────────
 # Constants
 # ─────────────────────────────────────────────────────────────────────────────
@@ -79,12 +89,12 @@ async def _fetch_latest_checkpoint(workspace_id: str) -> Optional[dict]:
        workspace_id: The workspace to query.

    Reads:
-        PLATFORM_URL  Platform base URL (default ``http://localhost:8080``).
+        PLATFORM_URL  Platform base URL (default ``http://host.docker.internal:8080``).
    """
    try:
        from platform_auth import auth_headers as _auth_headers  # type: ignore[import]

-        platform_url = os.environ.get("PLATFORM_URL", "http://localhost:8080")
+        platform_url = _platform_url()
        url = f"{platform_url}/workspaces/{workspace_id}/checkpoints/latest"
        async with httpx.AsyncClient(timeout=5.0) as client:
            resp = await client.get(url, headers=_auth_headers())
@@ -125,12 +135,12 @@ async def _save_checkpoint(
        payload:       Optional JSON-serialisable dict stored as JSONB.

    Reads:
-        PLATFORM_URL   Platform base URL (default ``http://localhost:8080``).
+        PLATFORM_URL   Platform base URL (default ``http://host.docker.internal:8080``).
    """
    try:
        from platform_auth import auth_headers as _auth_headers  # type: ignore[import]

-        platform_url = os.environ.get("PLATFORM_URL", "http://localhost:8080")
+        platform_url = _platform_url()
        url = f"{platform_url}/workspaces/{workspace_id}/checkpoints"
        body: dict = {
            "workflow_id": workflow_id,
@@ -21,7 +21,7 @@ logger = logging.getLogger(__name__)
 if os.path.exists("/.dockerenv") or os.environ.get("DOCKER_VERSION"):
    PLATFORM_URL = os.environ.get("PLATFORM_URL", "http://host.docker.internal:8080")
 else:
-    PLATFORM_URL = os.environ.get("PLATFORM_URL", "http://localhost:8080")
+    PLATFORM_URL = os.environ.get("PLATFORM_URL", "http://host.docker.internal:8080")
 _WORKSPACE_ID_raw = os.environ.get("WORKSPACE_ID")
 if not _WORKSPACE_ID_raw:
    raise RuntimeError("WORKSPACE_ID environment variable is required but not set")
@@ -25,7 +25,7 @@ logger = logging.getLogger(__name__)
 if os.path.exists("/.dockerenv") or os.environ.get("DOCKER_VERSION"):
    PLATFORM_URL = os.environ.get("PLATFORM_URL", "http://host.docker.internal:8080")
 else:
-    PLATFORM_URL = os.environ.get("PLATFORM_URL", "http://localhost:8080")
+    PLATFORM_URL = os.environ.get("PLATFORM_URL", "http://host.docker.internal:8080")
 _WORKSPACE_ID_raw = os.environ.get("WORKSPACE_ID")
 if not _WORKSPACE_ID_raw:
    raise RuntimeError("WORKSPACE_ID environment variable is required but not set")
@@ -63,7 +63,7 @@ async def main():  # pragma: no cover
    if os.path.exists("/.dockerenv") or os.environ.get("DOCKER_VERSION"):
        platform_url = os.environ.get("PLATFORM_URL", "http://host.docker.internal:8080")
    else:
-        platform_url = os.environ.get("PLATFORM_URL", "http://localhost:8080")
+        platform_url = os.environ.get("PLATFORM_URL", "http://host.docker.internal:8080")
    awareness_config = get_awareness_config()

    # 0. Initialise OpenTelemetry (no-op if packages not installed)