Hard gate Tier 2 item 2 of 4. Cron-driven full-lifecycle E2E that
catches regressions visible only at runtime — schema drift,
deployment-pipeline gaps, vendor outages, env-var rotations,
DNS / CF / Railway side-effects.
Empirical motivation from today:
- #2345 (A2A v0.2 silent drop) — passed unit tests, broke at JSON-RPC
parse layer between sender + receiver. Visible only when a sender
exercises the full path. Now-fixed by PR #2349, but a continuous
E2E would have surfaced it within 20 min of the regression.
- RFC #2312 chat upload — landed staging-branch but never reached
staging tenants because publish-workspace-server-image was main-
only. Caught by manual dogfooding hours after deploy. Same pattern.
Both classes are invisible to PR-time CI. The continuous gate fires
every 20 min against a real staging tenant and surfaces regressions
within minutes.
Cadence: cron `0,20,40 * * * *` (3x/hour). Offsets the existing
sweep-cf-orphans (:15) and sweep-cf-tunnels (:45) so the three ops
don't burst CF/AWS APIs at the same minute. Concurrency group
prevents overlapping runs if one hangs.
Cost: ~$0.50-1/day GHA + pennies of staging tenant lifecycle.
Reuses existing tests/e2e/test_staging_full_saas.sh — no new harness
to maintain. Bounded at 10 min wall-clock (vs 15 min default) so
stuck runs fail fast rather than holding up the next firing.
Defaults to E2E_RUNTIME=langgraph (fastest cold start; the regression
classes this gate catches don't need hermes-specific paths). Operators
can dispatch with runtime=hermes when they want SDK-native coverage.
Schedule-vs-dispatch hardening: hard-fail on missing
CP_STAGING_ADMIN_API_TOKEN for cron firing (silent-skip would mask
real outages); soft-skip for operator dispatch.
Refs:
- #2342 hard-gates Tier 2 item 2
- #2345 (A2A v0.2 fix that this gate would have caught earlier)
- #2335 / #2337 (deployment-pipeline gaps that this gate also catches)
Closes#2332 item 1 (workspace awareness — agents don't surface
platform-native tools up front).
The dogfooding session surfaced that agents weren't using A2A
delegation, persistent memory, or send_message_to_user. The tools
were registered AND documented in the system prompt — but only in
sections #8 (Inter-Agent Communication) and #9 (Hierarchical Memory),
which agents read AFTER they've already started reasoning about a
plan from earlier sections.
This adds a tight inventory at section #1.5 (immediately after
Platform Instructions, before role-specific prompt files) — every
tool name + its short description in a bulleted block. Detailed
when_to_use docs in sections #8/#9 stay; this preamble is the
elevator pitch ("you have these"), the later sections are the
manual ("here's when and how").
Generated from `platform_tools.registry` ToolSpecs — every tool's
`name` + `short` flow through automatically, no manual sync. A new
`get_capabilities_preamble(mcp: bool)` helper in executor_helpers
mirrors the existing get_a2a_instructions / get_hma_instructions
pattern.
CLI-runtime agents (mcp=False) get an empty preamble — they see
_A2A_INSTRUCTIONS_CLI's hand-written subcommand vocabulary further
down, and the registry's MCP tool names would conflict.
Tests:
- test_capabilities_preamble_appears_in_mcp_prompt: header present
- test_capabilities_preamble_lists_every_registry_tool: every
a2a + memory tool from registry shows up (drift catches at test
time — adding a new tool to registry surfaces here automatically)
- test_capabilities_preamble_precedes_prompt_files: ordering
invariant (toolkit before role docs)
- test_capabilities_preamble_skipped_for_cli_runtime: empty when
mcp=False
All 40 prompt + platform_tools tests pass.
Parity with #2337's redeploy-tenants-on-staging.yml. Both prod and
staging redeploys now have explicit serialization:
group: redeploy-tenants-on-main (per-workflow, global)
group: redeploy-tenants-on-staging (per-workflow, global)
cancel-in-progress: false on both — aborting a half-rolled-out fleet
would leave tenants stuck on whatever image they happened to be on
when cancelled. Better to finish the in-flight rollout before starting
the next one.
Pre-fix this workflow relied on GitHub's implicit workflow_run queueing,
which is "probably fine" but not defensible — explicit > implicit for
load-bearing pipeline behavior. Picked up as a #2337 review nit
(architecture finding 1: concurrency asymmetry between the two
redeploy workflows).
No behavior change in the common case. The change matters only when
two main pushes land within seconds AND the first redeploy is still
mid-rollout — currently rare; will become more common once #2335
(staging-trigger publish) feeds main more frequently via auto-promote.
Two follow-ups from #2335 review (tracked in #2336):
1. Add `concurrency:` block to publish-workspace-server-image.yml so
two rapid staging pushes don't race the same :staging-latest retag.
Group is per-branch (`${{ github.ref }}`) so staging and main can
build in parallel — they produce different :staging-<sha> tags and
last-write-wins on :staging-latest is acceptable across branches.
`cancel-in-progress: false` keeps in-flight builds — partially-pushed
images would break canary-fleet pin consistency.
2. Add redeploy-tenants-on-staging.yml. After #2335, every staging push
produces a fresh :staging-latest, but existing tenants only pick it
up on next reprovision. This workflow mirrors redeploy-tenants-on-
main but for staging:
- workflow_run-gated to branches: [staging]
- target_tag default 'staging-latest' (vs 'latest' for prod)
- CP_URL default https://staging-api.moleculesai.app
- CP_STAGING_ADMIN_API_TOKEN repo secret (operator must set)
- canary_slug empty by default — staging is itself the canary; no
sub-canary needed inside it. Soak still applies if operator
specifies a tenant for blast-radius control.
Schedule-vs-dispatch hardening matches sweep-cf-orphans/sweep-cf-
tunnels: hard-fail on auto-trigger when secret missing so misconfig
doesn't silently leave staging tenants on stale code; soft-skip on
operator dispatch.
Operator action required after merge:
Add CP_STAGING_ADMIN_API_TOKEN repo secret. Pull value from staging-
CP's CP_ADMIN_API_TOKEN env in Railway controlplane / staging
environment. Until set, the auto-trigger will fail the workflow run
(visible as red CI), surfacing the misconfiguration. Workflow runs
only on staging publish-workspace-server-image success, so no extra
load while it sits unconfigured.
Verification:
- YAML lint clean on both workflows.
- Reviewed redeploy-tenants-on-main as template; differences are scoped
to staging-specific values (URL, tag, secret name) + harden-on-missing-
secret pattern.
Refs #2335, #2336.
Root cause: this workflow only triggered on `branches: [main]`, but
staging-CP pins TENANT_IMAGE=:staging-latest (verified via Railway).
:staging-latest was only retagged on main push, so:
staging-branch code → never built → never reaches staging tenants
staging-CP serves → "yesterday's main" indefinitely
When staging→main was wedged (path-filter parity bug, canvas teardown
race — both fixed earlier today), :staging-latest stopped updating
entirely. RFC #2312 (chat upload HTTP-forward) landed on staging but
freshly-provisioned staging tenants kept failing chat upload because
they pulled pre-RFC-#2312 image. Verified by tearing down a fresh
tenant and observing the legacy "workspace container not running"
error from the docker-exec code path that RFC #2312 deleted.
Pre-2026-04-24 there was a related-but-different incident: TENANT_IMAGE
was a static :staging-<sha> pin that drifted 10 days behind. This new
incident is "the dynamic pin still drifts when its update workflow
doesn't fire."
Fix: add `staging` to the branches trigger. Tag policy is unchanged
(:staging-<sha> + :staging-latest on every push). canary-verify.yml
still runs on main push (workflow_run-gated to `branches: [main]`),
preserving the canary-verified :latest promotion for prod tenants.
Steady state after this:
- staging push → :staging-latest = staging-branch code → staging-CP
- main push → :staging-<sha> for canary, :staging-latest retag
(post-promote main code), and after canary green
→ :latest for prod tenants
What this does NOT change:
- canary-verify.yml flow (still main-only)
- redeploy-tenants-on-main.yml (still rolls prod fleet on main push)
- publish-canvas-image.yml (self-hosted standalone canvas; orthogonal)
- The :latest tag (canary-verified main, unchanged)
What this does fix:
- RFC #2312-class fixes that land on staging now actually reach
staging tenants without waiting for staging→main promote.
- The dogfooding observation "staging tenants seem to be running
yesterday's code" disappears as a class.
Drive-by: also fixed the typo in the path-filter list (was
`publish-platform-image.yml`, the actual file is
`publish-workspace-server-image.yml`).
The header comment claimed:
"file upload (HTTP-forward) + download (Docker-exec)"
and:
"Download still uses the v1 docker-cp path; migrating it lives in
the next PR in this stack"
Both wrong now. RFC #2312 PR-D landed the Download HTTP-forward path:
chat_files.go:336 builds an http.NewRequestWithContext to
${wsURL}/internal/file/read?path=<abs>, with the response streamed
back to the caller. The workspace-side Starlette handler is at
workspace/internal_file_read.py, mounted at workspace/main.py:440.
Update the header to reflect actual code: both upload + download are
HTTP-forward, share the same per-workspace platform_inbound_secret
auth, and work uniformly on local Docker and SaaS EC2.
Pure docs change — no behavior, no build/test impact.
Closes the observability gap surfaced in #2329 item 5: callers received
queue_id in the 202 enqueue response but had no public lookup. The only
existing observability path was check_task_status (delegation-flavored
A2A only — joins via request_body->>'delegation_id'). Cross-workspace
peer-direct A2A had no observability after enqueue.
This PR ships RFC #2331's Tier 1: minimum viable observability + caller-
specified TTL. No schema migration — expires_at column already exists
(migration 042); only DequeueNext was honoring it, with no caller path
to populate it.
Two changes:
1. extractExpiresInSeconds(body) — new helper mirroring
extractIdempotencyKey/extractDelegationIDFromBody. Pulls
params.expires_in_seconds from the JSON-RPC body. Zero (the unset
default) preserves today's infinite-TTL semantics. EnqueueA2A grew
an expiresAt *time.Time parameter; the proxy callsite computes
*time.Time from the extracted seconds and threads it through to
the INSERT.
2. GET /workspaces/:id/a2a/queue/:queue_id — new public handler.
Auth: caller's workspace token must match queue.caller_id OR
queue.workspace_id, OR be an org-level token. 404 (not 403) on
auth failure to avoid leaking queue_id existence. Response
includes status/attempts/last_error/timestamps/expires_at; embeds
response_body via LEFT JOIN against activity_logs when status=
completed for delegation-flavored items.
What this does NOT change:
- Drain semantics (heartbeat-driven dispatch).
- Native-session bypass (claude-agent-sdk, hermes still skip queue).
- Schema (column already exists).
- MCP tools (delegate_task_async / check_task_status keep their
contract; this is a parallel queue-id surface).
Tests:
- 7 cases on extractExpiresInSeconds covering absent/positive/
zero/negative/invalid-JSON/wrong-type/empty-params.
- go vet + go build clean.
- Full handlers test suite passes (no regressions from the
EnqueueA2A signature change — only one production caller).
Tier 2 (cross-workspace stitch + webhook callback) and Tier 3
(controllerized lifecycle) deferred per RFC #2331.
Issue: scripts/dev-start.sh assumed `go` was on PATH; on a fresh dev
box without Go installed, line 111 (`go run ./cmd/server`) failed
with `go: not found` and the script bailed before printing the
readiness banner. The script's own prerequisite list (line 13-21)
said "Go 1.25+" but there was no signpost between "open the doc" and
"command not found."
Fix: detect `go` via `command -v`. If present, keep the existing
`go run` path (fast iteration, attaches to local log). If not,
fall back to `docker compose up -d --build platform` which uses the
published platform container — slower first run but the script
still works without forcing the dev to install Go just to read logs.
Either path leaves /health on :8080 so the rest of the script's
wait loop is unchanged.
If both paths fail, the error message names the install URL
(https://go.dev/dl/) and the fallback diagnostic (`/tmp/molecule-platform.log`)
so the dev has a single, actionable next step.
Verified: `sh -n` syntax check passes.
Closes#2329 item 2.
CP's tenant-delete cascade removes the DNS record (with sweep-cf-orphans
as a backstop) but does NOT delete the underlying Cloudflare Tunnel.
Each E2E provision creates one Tunnel named `tenant-<slug>`; without
cleanup these accumulate indefinitely on the account, consuming the
tunnel quota and cluttering the dashboard.
Observed 2026-04-30: dozens of `tenant-e2e-canvas-*` tunnels in Down
state with zero replicas, weeks past their tenant's deletion. Same
class of bug as the DNS-records leak that drove sweep-cf-orphans
(controlplane#239).
Parallel-shape to sweep-cf-orphans:
- Same dry-run-by-default + --execute pattern
- Same MAX_DELETE_PCT safety gate (default 90% — higher than DNS
sweep's 50% because tenant-shaped tunnels are orphans by design)
- Same schedule/dispatch hardening (hard-fail on missing secrets
when scheduled, soft-skip when dispatched)
- Cron offset to :45 to avoid CF API bursts colliding with the DNS
sweep at :15
Decision rules (in order):
1. Name doesn't match `tenant-<slug>` → keep (unknown — never sweep
tunnels that might belong to platform infra).
2. Tunnel has active connections (status=healthy or non-empty
connections array) → keep (defense-in-depth: don't kill a live
tunnel even if CP forgot the org).
3. Slug ∈ {prod_slugs ∪ staging_slugs} → keep.
4. Otherwise → delete (orphan).
Verified by:
- shell syntax check (bash -n)
- YAML lint
- Decide-logic offline smoke (7 cases, all pass)
- End-to-end dry-run smoke with stubbed CP + CF APIs
Required secrets (added to existing org-secrets):
CF_API_TOKEN must include account:cloudflare_tunnel:edit
scope (separate from zone:dns:edit used by
sweep-cf-orphans — same token if scope is
broad, or a new token if narrowly scoped).
CF_ACCOUNT_ID account that owns the tunnels (visible in
dash.cloudflare.com URL path).
CP_PROD_ADMIN_TOKEN reused from sweep-cf-orphans.
CP_STAGING_ADMIN_TOKEN reused from sweep-cf-orphans.
Note: CP-side root cause (tenant-delete should cascade to tunnel
delete) is in molecule-controlplane and worth fixing separately. This
janitor is the operational backstop in the meantime — same pattern
applied to DNS records when the same root cause was unaddressed.
Setup wrote .playwright-staging-state.json at the END (step 7), only
after org create + provision-wait + TLS + workspace create + workspace-
online all succeeded. If setup crashed at steps 1-6, the org existed in
CP but the state file did not, so Playwright's globalTeardown bailed
out ("nothing to tear down") and the workflow safety-net pattern-swept
every e2e-canvas-<today>-* org to compensate. That sweep deleted
concurrent runs' live tenants — including their CF DNS records —
causing victims' next fetch to die with `getaddrinfo ENOTFOUND`.
Race observed 2026-04-30 on PR #2264 staging→main: three real-test
runs killed each other mid-test, blocking 68 commits of staging→main
promotion.
Fix: write the state file as setup's first action, right after slug
generation, before any CP call. Now:
- Crash before slug gen → no state file, no orphan to clean
- Crash during steps 1-6 → state file has slug; teardown deletes
it (DELETE 404s if org never created)
- Setup completes → state file has full state; teardown
deletes the slug
The workflow safety-net no longer pattern-sweeps; it reads the state
file and deletes only the recorded slug. Concurrent canvas-E2E runs no
longer poison each other.
Verified by:
- tsc --noEmit on staging-setup.ts + staging-teardown.ts
- YAML lint on e2e-staging-canvas.yml
- Code review: state file write moved to line 113 (post-makeSlug,
pre-CP) with the original line-249 write retained as a "promote
to full state" overwrite at the end
Acceptance criterion 3 of #2001 ("CI check that fails if TENANT_IMAGE
contains a SHA-shaped suffix") was deferred from PR #2168 because
querying Railway from a GitHub Actions runner needs RAILWAY_TOKEN
plumbed as a repo secret. The detection script + regression test in
#2168 cover detection; this is the automation-cadence layer.
Daily 13:00 UTC schedule (06:00 PT) + workflow_dispatch. Daily is the
right cadence for variables-tier config — Railway env var changes are
deliberate operator actions, low-frequency. Hourly would risk Railway
API rate-limit surprises.
Issue-on-failure pattern mirrors e2e-staging-sanity.yml — drift opens
a `railway-drift` priority-high issue (or comments on the open one),
and a subsequent clean run auto-closes it with a "drift resolved"
comment. No human-in-the-loop needed for the close.
Schedule-vs-dispatch secret hardening per
feedback_schedule_vs_dispatch_secrets_hardening:
- Schedule trigger HARD-FAILS on missing RAILWAY_AUDIT_TOKEN
(silent-success was the failure mode that bit us before)
- workflow_dispatch SOFT-SKIPS so an operator can dry-run the
workflow shape during initial token provisioning
Operator action required before this gate is live:
- Provision a Railway API token, read-only `variables` scope on the
molecule-platform project (id 7ccc8c68-61f4-42ab-9be5-586eeee11768)
- Store as repo secret RAILWAY_AUDIT_TOKEN
- Rotate per the standard 90-day schedule
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Branch protection treats matching-name check runs as a SET — any SKIPPED
member fails the required-check eval, even with SUCCESS siblings. The
two-jobs-sharing-name pattern (no-op + real-job) emits one SKIPPED + one
SUCCESS check run per workflow run; with multiple runs at the same SHA
(detect-changes triggers + auto-promote re-runs) the SET fills with
SKIPPED entries that block branch protection.
Verified live on PR #2264 (staging→main auto-promote): mergeStateStatus
stayed BLOCKED for 18+ hours despite APPROVED + MERGEABLE + all gates
green at the workflow level. `gh pr merge` returned "base branch policy
prohibits the merge"; `enqueuePullRequest` returned "No merge queue
found for branch 'main'". The check-runs API showed `E2E API Smoke
Test` and `Canvas tabs E2E` each had 2 SKIPPED + 2 SUCCESS at head SHA
66142c1e.
Fix: collapse no-op + real-job into ONE job with no job-level `if:`,
gating real work via per-step `if: needs.detect-changes.outputs.X ==
'true'`. The job always runs and emits exactly one SUCCESS check run
under the required-check name regardless of paths-filter outcome —
branch-protection-clean.
Same pattern as ci.yml's earlier conversion of Canvas/Platform/Python/
Shellcheck (PR #2322). Closes the parity-fix that should have been
applied to all four path-filtered required checks at once.
Two rapid main pushes whose E2Es complete out-of-order can promote
:latest backwards: SHA-A merges, SHA-B merges, SHA-B's E2E completes
first → :latest = staging-B → SHA-A's E2E completes → :latest = staging-A.
Now :latest is older than main's tip and stays wrong until the next
main push lands. The orphan-reconciler "next run corrects it" pattern
doesn't apply because there's no auto-corrective re-promote.
Detection: read the current :latest's `org.opencontainers.image.revision`
label (set by publish-workspace-server-image.yml at build time) and ask
the GitHub compare API how the candidate SHA relates to current. Branch
on `.status`:
ahead → retag (target newer)
identical → retag is a no-op
behind → HARD FAIL (this is the race we're catching)
diverged → HARD FAIL (force-push or unusual history)
error → fail; manual dispatch can override
Hard-fail rather than soft-skip per the approved design — silent-bypass
is the class we're moving away from per
feedback_schedule_vs_dispatch_secrets_hardening. Workflow goes red,
oncall sees it, operator decides whether to retry, force-promote, or
investigate. Manual dispatch skips the check (operator override),
matching the gate-step's existing semantics.
Backward-compat: when current :latest carries no revision label
(legacy image), skip-with-warning. All :latest images on main are
post-label as of 2026-04-29, so this branch becomes dead within 90 days
— TODO note in the step explains the cleanup.
No tests — the race is hypothetical at our scale (<1 occurrence/year
expected for a fleet of ≤20 paying tenants), and the only way to
exercise the new branches is to construct production-shape image
state. The dry-fall path lands behind the existing E2E gate-check, so
a regression in this step would surface as a failed promote (visible),
not a silent advance (invisible).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Supersedes #2321 + #2322. Applies the same shape uniformly across every
required check that uses a path filter: Canvas (Next.js), Platform (Go),
Python Lint & Test, Shellcheck (E2E scripts).
The bug + fix in one paragraph:
GitHub registers a check run for every job whose `name:` matches the
required-check context, regardless of whether the job actually executed.
A job-level `if:` that evaluates false produces a SKIPPED check run.
Branch protection's "required check" rule looks at the SET of check
runs with the matching context name on the latest commit and treats
any conclusion other than SUCCESS as not-passed — including SKIPPED.
Adding a sibling no-op job under the same `name:` (PR #2321 / #2322
attempt) doesn't help: branch protection still sees the SKIPPED
sibling and stays BLOCKED.
The shape that works: ONE job per required check name, no job-level
`if:`, all real work gated per-step. The job always runs and reports
SUCCESS regardless of which paths changed.
This patch:
* Canvas (Next.js): drops the `canvas-build-noop` shadow added in
#2321 (which didn't actually clear merge state — verified live on
PR #2314). Refactors `canvas-build` to always run; gates checkout/
setup-node/install/build/test on `if: needs.changes.outputs.canvas
== 'true'`. Coverage upload step also gated.
* Platform (Go): drops job-level `if:`. Gates checkout/setup-go/
download/build/vet/lint/test/coverage-report/threshold-check on
per-step `if:`.
* Python Lint & Test: drops job-level `if:`. Gates checkout/setup-
python/install/pytest on per-step `if:`.
* Shellcheck (E2E scripts): drops job-level `if:`. Gates checkout/
shellcheck-run on per-step `if:`.
Each refactored job adds a leading no-op echo step with `working-directory: .`
override so the always-running spin-up doesn't fail when the path-
filter-true working-directory (workspace, workspace-server, canvas)
doesn't exist after no-op checkout.
Why all four in one PR: the bug shape is identical across all four,
and a future PR that only touches workspace-server (passing platform
filter, missing canvas/python/scripts) would hit the same BLOCKED state
on whichever filter it missed. PR-A and PR-2321 merged because their
diffs happened to trigger every filter; PR-B (#2314) only missed
canvas. Fixing one at a time means re-living this debugging cycle three
more times.
Cost: ~10s of always-on CI runtime per PR per job (the ubuntu-latest
spin-up + the no-op echo). 40s aggregate, negligible vs. the manual-
merge cost when BLOCKED catches us.
Memory `feedback_branch_protection_check_name_parity` already updated
(2026-04-29) to mark the original two-jobs-sharing-name pattern as
DO NOT FOLLOW and document the working shape this PR uses.
Refs PR #2321 (the misguided fix-attempt that this supersedes).
External callers (third-party SDKs, the channel plugin) authenticate
purely via bearer and frequently don't set the X-Workspace-ID header.
Without this, activity_logs.source_id ends up NULL — breaking the
peer_id signal on notifications, the "Agent Comms by peer" canvas tab,
and any analytics that breaks down inbound A2A by sender.
The bearer is the authoritative caller identity per the wsauth contract
(it's what proves who you are); the header is a display/routing hint
that must agree with it. So we derive callerID from the bearer's owning
workspace whenever the header is absent. The existing validateCallerToken
guard fires after this and enforces token-to-callerID binding the same
way it always has.
Org-token requests are skipped — those grant org-wide access and don't
bind to a single workspace, so the canvas-class semantics (callerID="")
are preserved. Bearer-resolution failures (revoked, removed workspace)
fall through to canvas-class as well, never 401.
New wsauth.WorkspaceFromToken exposes the bearer→workspace lookup as a
modular interface; mirrors ValidateAnyToken's defense-in-depth JOIN on
workspaces.status != 'removed'.
Tests: 4 unit tests on WorkspaceFromToken + 3 integration tests on
ProxyA2A covering the three observable paths (bearer-derived,
org-token skipped, derive-failure fallthrough).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Supersedes PR #2321's two-jobs-sharing-a-name approach, which didn't
actually clear branch-protection's required-check evaluation. Live
test on PR #2314: GraphQL `isRequired` confirmed BOTH check runs
under "Canvas (Next.js)" name (one SUCCESS via no-op, one SKIPPED via
real job) registered, and the SKIPPED one kept mergeStateStatus =
BLOCKED despite the SUCCESS sibling. Branch protection's "set of
matching contexts" semantic is stricter than the durable feedback
memory documented — at least one passing isn't enough; SKIPPED
counts as not-passed regardless.
Real fix: ONE job that always runs (no job-level `if:`), with all
real work gated on the path filter via per-step `if:`. Produces
exactly one "Canvas (Next.js)" check run per commit, always SUCCEEDS,
regardless of which paths changed. Costs ~10s of always-on CI runtime
per PR — negligible vs. the manual-merge cost when the BLOCKED state
catches us.
This same anti-pattern probably affects Platform (Go) (`platform`
filter), Python Lint & Test (`python` filter), and Shellcheck (E2E
scripts) (`scripts` filter) — all required, all path-gated. PR-A and
PR-2321 merged because they happened to trigger every filter; PR-B
only missed canvas. File a follow-up issue to apply the same
single-job-conditional-steps pattern across those required jobs to
remove the latent merge-blocker.
Updates feedback memory: branch_protection_check_name_parity is wrong
about "two jobs sharing name + at-least-one-success works." Need to
correct the note.
PRs that don't touch canvas/** paths skip the Canvas (Next.js) job via
its `if: needs.changes.outputs.canvas == 'true'` guard. GitHub reports
SKIPPED for that conclusion. Branch protection on staging requires
Canvas (Next.js) — and treats SKIPPED as not-passed, blocking merge
on every workspace-server-only or migration-only PR.
This is the design pattern documented in feedback memory
"branch_protection_check_name_parity": split into a real job + a
no-op shadow that share the same `name:`. Exactly one runs per PR;
both report the same check context, and at least one always reports
SUCCESS, satisfying the required check.
The no-op job runs in a few seconds (single `echo` step) and produces
the right check context for any PR that has changes outside canvas/**.
Concrete blocker that prompted this: PR #2314 (RFC #2312 PR-B) sat
APPROVED + CI-green + UP-TO-DATE for half an hour with mergeStateStatus
BLOCKED, traced via the GraphQL `isRequired` field to a single
SKIPPED Canvas (Next.js) check. PRs #2319 (PR-F) and the rest of the
RFC #2312 stack would have hit the same wall.
Conflict between PR #2311's revert of PR #2309's external-runtime gate
(which kept the original docker-exec Upload) and PR-B's branch (which
contains PR-C's HTTP-forward rewrite via stack consolidation).
PR-C supersedes both — the docker-exec path is gone entirely. Taking
HEAD (PR-B+C combined) is the correct resolution.
Build + chat_files tests both green after resolution.
Mirrors PR-C's Upload migration: replaces the docker-cp tar-stream
extraction with a streaming HTTP GET to the workspace's own
/internal/file/read endpoint. Closes the SaaS gap for downloads —
without this PR, GET /workspaces/:id/chat/download still returns 503
on Railway-hosted SaaS even after A+B+C+F land.
Stacks: PR-A #2313 → PR-B #2314 → PR-C #2315 → PR-F #2319 → this PR.
Why a single broad /internal/file/read instead of /internal/chat/download:
Today's chat_files.go::Download already accepts paths under any of the
four allowed roots {/configs, /workspace, /home, /plugins} — it's not
strictly chat. Future PRs (template export, etc.) will reuse this
endpoint via the same forward pattern; reusing avoids three near-
identical handlers (one per domain) with duplicated path-safety logic.
Path safety is duplicated on platform + workspace sides — defence in
depth via two parallel checks, not "trust the workspace."
Changes:
* workspace/internal_file_read.py — Starlette handler. Validates path
(must be absolute, under allowed roots, no traversal, canonicalises
cleanly). lstat (not stat) so a symlink at the path doesn't redirect
the read. Streams via FileResponse (no buffering). Mirrors Go's
contentDispositionAttachment for Content-Disposition header.
* workspace/main.py — registers GET /internal/file/read alongside the
POST /internal/chat/uploads/ingest from PR-B.
* scripts/build_runtime_package.py — adds internal_file_read to
TOP_LEVEL_MODULES so the publish-runtime cascade rewrites its
imports correctly. Also includes the PR-B additions
(internal_chat_uploads, platform_inbound_auth) since this branch
was rooted before PR-B's drift-gate fix; merge-clean alphabetic
additions.
* workspace-server/internal/handlers/chat_files.go — Download
rewritten as streaming HTTP GET forward. Resolves workspace URL +
platform_inbound_secret (same shape as Upload), builds GET request
with path query param, propagates response headers (Content-Type /
Content-Length / Content-Disposition) + body. Drops archive/tar
+ mime imports (no longer needed). Drops Docker-exec branch entirely
— Download is now uniform across self-hosted Docker and SaaS EC2.
* workspace-server/internal/handlers/chat_files_test.go — replaces
TestChatDownload_DockerUnavailable (stale post-rewrite) with 4
new tests:
- TestChatDownload_WorkspaceNotInDB → 404 on missing row
- TestChatDownload_NoInboundSecret → 503 on NULL column
(with RFC #2312 detail in body)
- TestChatDownload_ForwardsToWorkspace_HappyPath → forward shape
(auth header, GET method, /internal/file/read path) + headers
propagated + body byte-for-byte
- TestChatDownload_404FromWorkspacePropagated → 404 from
workspace propagates (NOT remapped to 500)
Existing TestChatDownload_InvalidPath path-safety tests preserved.
* workspace/tests/test_internal_file_read.py — 21 tests covering
_validate_path matrix (absolute, allowed roots, traversal, double-
slash, exact-match-on-root), 401 on missing/wrong/no-secret-file
bearer, 400 on missing path/outside-root/traversal, 404 on missing
file, happy-path streaming with correct Content-Type +
Content-Disposition, special-char escaping in Content-Disposition,
symlink-redirect-rejection (lstat-not-stat protection).
Test results:
* go test ./internal/handlers/ ./internal/wsauth/ — green
* pytest workspace/tests/ — 1292 passed (was 1272 before PR-D)
Refs #2312 (parent RFC), #2308 (chat upload+download 503 incident).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the SaaS-side gap that PR-A acknowledged but didn't fix: SaaS
workspaces have no persistent /configs volume, so the platform_inbound_secret
that PR-A's provisioner wrote at workspace creation never reaches the
runtime. Without this, even after the entire RFC #2312 stack lands,
SaaS chat upload would 401 (workspace fails-closed when /configs/.platform_inbound_secret
is missing).
Solution: return the secret in the /registry/register response body
on every register call. The runtime extracts it and persists to
/configs/.platform_inbound_secret at mode 0600. Idempotent — Docker-
mode workspaces also receive it and overwrite the value the provisioner
already wrote (same value until rotation).
Why on every register, not just first-register:
* SaaS containers can be restarted (deploys, drains, EBS detach/
re-attach) — /configs is rebuilt empty on each fresh start.
* The auth_token is "issue once" because re-issuing rotates and
invalidates the previous one. The inbound secret has no rotation
flow yet (#2318) so re-sending the same value is harmless.
* Eliminates the bootstrap window where a restarted SaaS workspace
has no inbound secret on disk and would 401 every platform call.
Changes:
* workspace-server/internal/handlers/registry.go — Register handler
reads workspaces.platform_inbound_secret via wsauth.ReadPlatformInboundSecret
and includes it in the response body. Legacy workspaces (NULL
column) get a successful registration with the field omitted.
* workspace-server/internal/handlers/registry_test.go — two new tests:
- TestRegister_ReturnsPlatformInboundSecret_RFC2312_PRF: secret
present in DB → secret in response, alongside auth_token.
- TestRegister_NoInboundSecret_OmitsField: NULL column → field
omitted, registration still 200.
* workspace/platform_inbound_auth.py — adds save_inbound_secret(secret).
Atomic write via tmp + os.replace, mode 0600 from os.open(O_CREAT,
0o600) so a concurrent reader never sees 0644-default. Resets the
in-process cache after write so the next get_inbound_secret() returns
the freshly-written value (rotation-safe when it lands).
* workspace/main.py — register-response handler extracts
platform_inbound_secret alongside auth_token and persists via
save_inbound_secret. Mirrors the existing save_token pattern.
* workspace/tests/test_platform_inbound_auth.py — 6 new tests for
save_inbound_secret: writes file, mode 0600, overwrite-existing,
cache invalidation after save, empty-input no-op, parent-dir creation
for fresh installs.
Test results:
* go test ./internal/handlers/ ./internal/wsauth/ — all green
* pytest workspace/tests/ — 1272 passed (was 1266 before this PR)
Refs #2312 (parent RFC), #2308 (chat upload 503 incident).
Stacks: PR-A #2313 → PR-B #2314 → PR-C #2315 → this PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drift gate at scripts/build_runtime_package.py asserts every workspace/*.py
appears in the TOP_LEVEL_MODULES allowlist before publishing. Without
this commit, the publish-runtime cascade would have failed on PR-B's
merge with:
in workspace/ but NOT in TOP_LEVEL_MODULES (will ship un-rewritten):
['internal_chat_uploads', 'platform_inbound_auth']
This is the same incident class as the 0.1.16 transcript_auth outage
(per memory: feedback_runtime_publish_pipeline_gates.md): a new module
shipped with un-rewritten flat imports → ModuleNotFoundError on every
workspace boot.
Verified locally:
$ python3 scripts/build_runtime_package.py --version 0.0.0-test --out /tmp/runtime-build-test
[build] copied 66 .py files
[build] rewrote imports in 40 files
[build] done.
$ grep "from molecule_runtime\." /tmp/runtime-build-test/molecule_runtime/internal_chat_uploads.py
from molecule_runtime.platform_inbound_auth import get_inbound_secret, inbound_authorized
Refs #2312.
Self-review found the original draft of this PR added forward-time
validateAgentURL() as defense-in-depth — paranoia layer on top of the
existing register-time gate. The validator unconditionally blocks
loopback (127.0.0.1/8), which makes httptest-based proxy tests
impossible without an env-var hatch I'd rather not add to a security-
critical path on first pass.
Trust note kept inline pointing at the upstream gate + tracking issue
so the gap is explicit, not invisible.
Refs #2312.
Closes the SaaS upload gap (#2308) with the unified architecture from
RFC #2312: same code path on local Docker and SaaS, no Docker socket
dependency, no `dockerCli == nil` cliff. Stacked on PR-A (#2313) +
PR-B (#2314).
Before:
Upload → findContainer (nil in SaaS) → 503
After:
Upload → resolve workspaces.url + platform_inbound_secret
→ stream multipart to <url>/internal/chat/uploads/ingest
→ forward response back unchanged
Same call site whether the workspace runs on local docker-compose
("http://ws-<id>:8000") or SaaS EC2 ("https://<id>.<tenant>...").
The bug behind #2308 cannot exist by construction.
Why streaming, not parse-then-re-encode:
* No 50 MB intermediate buffer on the platform
* Per-file size + path-safety enforcement is the workspace's job
(see workspace/internal_chat_uploads.py, PR-B)
* Workspace's error responses (413 with offending filename, 400 on
missing files field, etc.) propagate through unchanged
Changes:
* workspace-server/internal/handlers/chat_files.go — Upload rewritten
as a streaming HTTP proxy. Drops sanitizeFilename, copyFlatToContainer,
and the entire docker-exec path. ChatFilesHandler gains an httpClient
(broken out for test injection). Download stays docker-exec for now;
follow-up PR will migrate it to the same shape.
* workspace-server/internal/handlers/chat_files_external_test.go —
deleted. Pinned the wrong-headed runtime=external 422 gate from
#2309 (already reverted in #2311). Superseded by the proxy tests.
* workspace-server/internal/handlers/chat_files_test.go — replaced
sanitize-filename tests (now in workspace/tests/test_internal_chat_uploads.py)
with sqlmock + httptest proxy tests:
- 400 invalid workspace id
- 404 workspace row missing
- 503 platform_inbound_secret NULL (with RFC #2312 detail)
- 503 workspaces.url empty
- happy-path forward (asserts auth header, content-type forwarded,
body streamed, response propagated back)
- 413 from workspace propagated unchanged (NOT remapped to 500)
- 502 on workspace unreachable (connect refused)
Existing Download + ContentDisposition tests preserved.
* tests/e2e/test_chat_upload_e2e.sh — single-script-everywhere E2E.
Takes BASE as env (default http://localhost:8080). Creates a
workspace, waits for online, mints a test token, uploads a fixture,
reads it back via /chat/download, asserts content matches +
bearer-required. Same script runs against staging tenants (set
BASE=https://<id>.<tenant>.staging.moleculesai.app).
Test plan:
* go build ./... — green
* go test ./internal/handlers/ ./internal/wsauth/ — green (full suite)
* tests/e2e/test_chat_upload_e2e.sh against local docker-compose
after PR-A + PR-B + this PR all merge — TODO before merge
Refs #2312 (parent RFC), #2308 (chat upload 503 incident).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stacked on PR-A (#2313). The platform-side rewrite that actually calls
this endpoint lands in PR-C; this PR adds the workspace-side consumer
+ hardening so PR-C is a small Go-only diff.
What this adds:
* platform_inbound_auth.py — auth gate mirroring transcript_auth.py.
Reads /configs/.platform_inbound_secret (delivered by the PR-A
provisioner). Fail-closed when the file is missing/empty/unreadable.
Constant-time compare via hmac.compare_digest.
* internal_chat_uploads.py — POST /internal/chat/uploads/ingest.
Multipart parse → sanitize each filename → write to
/workspace/.molecule/chat-uploads/<random>-<name> with
O_CREAT|O_EXCL|O_NOFOLLOW. Same response shape (uri/name/mimeType/
size + workspace: URI scheme) as the legacy Go handler — canvas /
agent code that resolves "workspace:..." paths keeps working.
* Wired into workspace/main.py via starlette_app.add_route alongside
the existing /transcript route.
* python-multipart>=0.0.18 added to requirements.txt (Starlette's
Request.form() needs it; ≥ 0.0.18 closes CVE-2024-53981).
Test coverage (36 tests, all green; full workspace suite 1266 passed):
* test_platform_inbound_auth.py — 14 tests:
happy path, fail-closed on missing file, empty file, whitespace-
only file, missing/case-wrong/empty Bearer prefix, in-process
cache, default CONFIGS_DIR fallback, end-to-end file → authorized.
* test_internal_chat_uploads.py — 22 tests:
sanitize_filename matrix (incl. ../traversal, CJK chars, length
truncation), 401 on missing/wrong/no-secret-file bearer, single +
batch upload happy paths, unique random prefix on duplicate names,
mimetype guess fallback, 400 on missing files field, 413 on per-
file + total-body oversize, symlink-at-target refusal (with
sentinel-content unchanged assertion).
Why this is safe to ship before PR-C:
* No platform-side caller yet → no behavior change visible to users.
* Auth fails closed; nothing on the network can hit a write path
until the platform forwards with the matching bearer.
* Workspace's existing routes (/health, /transcript, /handle/*) are
unchanged.
Refs #2312 (parent RFC), #2308 (chat upload 503 incident).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Foundation for the HTTP-forward architecture that replaces Docker-exec
in chat upload + 5 follow-on handlers. This PR is intentionally scoped
to schema + token mint + provisioner wiring; no caller reads the secret
yet so behavior is unchanged.
Why a second per-workspace bearer (not reuse the existing
workspace_auth_tokens row):
workspace_auth_tokens workspaces.platform_inbound_secret
───────────────────── ─────────────────────────────────
workspace → platform platform → workspace
hash stored, plaintext gone plaintext stored (platform reads back)
workspace presents bearer platform presents bearer
platform validates by hash workspace validates by file compare
Distinct roles, distinct rotation lifecycle, distinct audit signal —
splitting later would require a fleet-wide rolling rotation, so paying
the schema cost up front.
Changes:
* migration 044: ADD COLUMN workspaces.platform_inbound_secret TEXT
* wsauth.IssuePlatformInboundSecret + ReadPlatformInboundSecret
* issueAndInjectInboundSecret hook in workspace_provision: mints
on every workspace create / re-provision; Docker mode writes
plaintext to /configs/.platform_inbound_secret alongside .auth_token,
SaaS mode persists to DB only (workspace will receive via
/registry/register response in a follow-up PR)
* 8 unit tests against sqlmock — covers happy path, rotation, NULL
column, empty string, missing workspace row, empty workspaceID
PR-B (next) wires up workspace-side `/internal/chat/uploads/ingest`
that validates the bearer against /configs/.platform_inbound_secret.
Refs #2312 (parent RFC), #2308 (chat upload 503 incident).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>