fix(registry): allow pending-DNS platform tunnel URL at register (#36 register half) #2425

Merged
devops-engineer merged 1 commit from fix/validate-agent-url-pending-tunnel into main 2026-06-08 04:44:04 +00:00
Member

The REGISTER half of #36 (the provision half, Hetzner location failover, is cp#619).

Cross-cloud workspaces register advertising their per-workspace Cloudflare tunnel hostname (`ws-<id>.<appDomain>`). That DNS record is eventually consistent, so a fast-booting box (Hetzner, ~1s) registers before it propagates: validateAgentURL's net.LookupIP fails, the handler returns 400, the runtime does NOT retry a 4xx, and agent_card never lands. AWS/GCP boot slowly enough to miss the race, which is why only the fast cloud broke.

Diagnosed live: faithful Hetzner repro boxes get a 400 against a WARM tenant with 'hostname ... cannot be resolved (DNS error)'.

Fix: on DNS failure, allow the hostname in SaaS mode if and only if it is a platform-tunnel hostname (ws- prefix under the platform domain; MOLECULE_APP_DOMAIN, default moleculesai.app). This is not an SSRF vector: only the platform controls that domain, and the metadata/loopback blocks still apply once the name resolves. Self-hosted keeps the strict block.

SECURITY-sensitive (SSRF validator): see the scoped rationale and tests.

Generated with Claude Code
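
For reviewers, a minimal sketch of what "platform-tunnel hostname" could mean as a predicate. It is an illustration, not the code in this PR: the helper name `isPlatformTunnelHost` is hypothetical, and the rule that the name must be exactly one `ws-<id>` label directly under the app domain is an assumption about how strictly the check is scoped.

```go
package main

import (
	"fmt"
	"strings"
)

// isPlatformTunnelHost reports whether host is a single "ws-<id>" label
// directly under appDomain. Hypothetical helper for illustration only;
// the real check may be looser or stricter.
func isPlatformTunnelHost(host, appDomain string) bool {
	// Normalise case and a trailing root dot before comparing.
	host = strings.ToLower(strings.TrimSuffix(host, "."))
	appDomain = strings.ToLower(strings.Trim(appDomain, "."))
	if appDomain == "" {
		return false
	}
	suffix := "." + appDomain
	if !strings.HasSuffix(host, suffix) {
		return false
	}
	label := strings.TrimSuffix(host, suffix)
	// Require a non-empty id after "ws-" and no extra labels, so that
	// "x.ws-1.<appDomain>" and "ws-.<appDomain>" are both rejected.
	return strings.HasPrefix(label, "ws-") &&
		len(label) > len("ws-") &&
		!strings.Contains(label, ".")
}

func main() {
	hosts := []string{
		"ws-abc123.moleculesai.app",
		"ws-abc123.moleculesai.app.evil.example",
		"ws-.moleculesai.app",
		"api.moleculesai.app",
	}
	for _, h := range hosts {
		fmt.Printf("%-42s %v\n", h, isPlatformTunnelHost(h, "moleculesai.app"))
	}
}
```

Anchoring on the full `.<appDomain>` suffix, rather than a substring match, is what keeps look-alike names such as `ws-abc123.moleculesai.app.evil.example` out of the allow path.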
devops-engineer added 1 commit 2026-06-08 04:31:36 +00:00
fix(registry): allow pending-DNS platform tunnel URL at register (#36/#2421)
ci-arm64-advisory / fast-checks (pull_request) Waiting to run
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 5s
CI / Python Lint & Test (pull_request) Successful in 4s
E2E API Smoke Test / detect-changes (pull_request) Successful in 9s
CI / Detect changes (pull_request) Successful in 13s
E2E Chat / detect-changes (pull_request) Successful in 12s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 8s
Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 4s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 4s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 9s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 2s
CI / Canvas (Next.js) (pull_request) Successful in 3s
Harness Replays / detect-changes (pull_request) Successful in 13s
Lint shellcheck (arm64 pilot) / shellcheck-arm64 (pilot) (pull_request) Successful in 12s
E2E Chat / E2E Chat (pull_request) Successful in 2s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 3s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 9s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (local) (pull_request) Successful in 35s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (pull_request) Successful in 10s
E2E Staging SaaS (full lifecycle) / pr-validate (pull_request) Successful in 30s
gate-check-v3 / gate-check (pull_request_target) Successful in 6s
sop-checklist / all-items-acked (pull_request) acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4 — body-unfilled: comprehensive-testing, local-postgres-e2
sop-checklist / na-declarations (pull_request) N/A: (none)
CI / Canvas Deploy Status (pull_request) Successful in 1s
Harness Replays / Harness Replays (pull_request) Successful in 1s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m2s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 1m5s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 1m59s
sop-checklist / review-refire (pull_request_target) Has been skipped
sop-checklist / all-items-acked (pull_request_target) Has been cancelled
qa-review / approved (pull_request_target) Refired via /qa-recheck; qa-review failed
E2E Staging SaaS (full lifecycle) / E2E Staging Platform Boot (pull_request) Failing after 5m30s
E2E Staging External Runtime / E2E Staging External Runtime (pull_request) Successful in 5m35s
security-review / approved (pull_request_target) Refired via /security-recheck; security-review failed
CI / Platform (Go) (pull_request) Successful in 6m58s
CI / all-required (pull_request) Successful in 1s
E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (pull_request) Failing after 9m51s
audit-force-merge / audit (pull_request_target) Successful in 6s
644734bb7c
Cross-cloud workspaces (e.g. Hetzner under a GCP tenant) register
advertising their per-workspace Cloudflare tunnel hostname
ws-<id>.<appDomain>. That DNS record is eventually-consistent, and a
FAST-booting box (a Hetzner cpx reports 'workspace ready after ~1s')
registers BEFORE it propagates → validateAgentURL's net.LookupIP fails →
the handler returns 400 → and the runtime does NOT retry a 4xx → so
agent_card never lands and the agent never comes online. AWS/GCP boot
slowly enough to miss the race, which is why ONLY the fast cloud broke.

Diagnosed live: faithful Hetzner repro boxes register against a warm
tenant and still 400 with
  {"error":"hostname \"ws-...\" cannot be resolved (DNS error)..."}

Fix: when DNS resolution fails, allow the hostname through in SaaS mode iff
it is a platform-tunnel hostname (ws-<id> under the platform's own domain,
MOLECULE_APP_DOMAIN default moleculesai.app). Such a hostname is NOT an
SSRF vector — only the platform controls DNS there, so an attacker cannot
point it at 169.254/127/private space, and the unconditional metadata/
loopback blocks still apply once it resolves. Restores the pre-#1130
'let an unresolvable platform URL through' behaviour, scoped to the
trusted tunnel domain. Self-hosted keeps the strict block.
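
The decision flow described above can be sketched as follows. This is an illustration under stated assumptions, not the actual validateAgentURL: `validateHost`, `lookupFunc`, `failingLookup` and `staticLookup` are hypothetical names, the resolver is injected so the pending-DNS case can be exercised without real DNS, and the post-resolution block is reduced to loopback and link-local (169.254/16) addresses.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// lookupFunc mirrors the shape of net.LookupIP so a stub can replace it.
type lookupFunc func(host string) ([]net.IP, error)

// validateHost is a hypothetical sketch of the register-time check.
// trusted decides whether host is a platform-controlled tunnel name.
func validateHost(host string, saas bool, lookup lookupFunc, trusted func(string) bool) error {
	ips, err := lookup(host)
	if err != nil {
		// Pending DNS: let it through only in SaaS mode and only for
		// names under the platform's own domain.
		if saas && trusted(host) {
			return nil
		}
		return fmt.Errorf("hostname %q cannot be resolved (DNS error)", host)
	}
	// Once the name resolves, the address blocks apply regardless of trust.
	for _, ip := range ips {
		if ip.IsLoopback() || ip.IsLinkLocalUnicast() {
			return fmt.Errorf("hostname %q resolves to blocked address %s", host, ip)
		}
	}
	return nil
}

// failingLookup simulates a record that has not propagated yet.
func failingLookup(string) ([]net.IP, error) {
	return nil, errors.New("no such host")
}

// staticLookup simulates a record that resolves to a fixed address.
func staticLookup(addr string) lookupFunc {
	return func(string) ([]net.IP, error) {
		return []net.IP{net.ParseIP(addr)}, nil
	}
}

func main() {
	trusted := func(h string) bool { return h == "ws-abc123.moleculesai.app" }
	host := "ws-abc123.moleculesai.app"

	fmt.Println("saas, pending DNS:       ", validateHost(host, true, failingLookup, trusted))
	fmt.Println("self-hosted, pending DNS:", validateHost(host, false, failingLookup, trusted))
	fmt.Println("saas, resolves to 169.254:", validateHost(host, true, staticLookup("169.254.169.254"), trusted))
}
```

Keeping the trust check inside the DNS-error branch means a trusted name gets no special treatment once it resolves, which matches the claim that the metadata/loopback blocks are unconditional.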

This is the register half of #36; the provision half (Hetzner location
capacity failover) shipped in cp#619.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
devops-engineer added the tier:low label 2026-06-08 04:35:50 +00:00
Author
Member

/qa-recheck
Author
Member

/security-recheck
devops-engineer merged commit dbdced6aa9 into main 2026-06-08 04:44:04 +00:00
Reference: molecule-ai/molecule-core#2425