The previous soft-skip-on-dispatch path used `exit 0`, which only
ends the STEP — the rest of the workflow continued with empty
secrets. Caught 2026-05-04 by dispatched run 25296530706:
- E2E_MINIMAX_API_KEY: empty
- verify-secrets printed warning + exit 0
- Install required tools: ran
- Run synthetic E2E: ran with empty MiniMax key
- SECRETS_JSON branched to OpenAI shape (MINIMAX empty → fall through)
- But model slug stayed MiniMax-M2.7-highspeed (workflow env)
- Workspace booted with OpenAI keys + MiniMax model
- 5 min later: "Agent error (Exception)" — claude SDK 401'd
against api.minimax.io with the OpenAI key
The confusing failure mode silently masked the real problem (missing
secret) under a runtime-error label. Fix: drop both soft-skip paths
and exit 1 always. Operators who want to verify a YAML change without
setting up secrets can read the verify-secrets step's stderr — the
failure IS the verification signal.
Pure visibility fix; preserves the cron hard-fail path (now also the
dispatch hard-fail path). No mechanism change beyond the exit code.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GitHub Actions scheduler de-prioritises :00 cron firings under load.
Empirical 2026-05-03: the canary's cron was '0,20,40 * * * *' but
actual firings landed at :08, :03, :01, :03 — :20 and :40 silently
dropped. Detection latency degraded from claimed 20 min to actual
~60 min worst case.
Move to '10,30,50 * * * *':
- :10/:30/:50 sit 10 min off the top-of-hour load peak
- Still 5 min from :15 sweep-cf-orphans and :45 sweep-cf-tunnels
(the original constraint that kept us off :15/:45)
- Same 20-min cadence; only the phase changes
No code change beyond the cron expression + comment refresh.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cuts the per-run LLM cost ~10x (MiniMax M2.7 vs gpt-4.1-mini) and
removes the recurring OpenAI-quota-exhaustion failure mode that took
the canary down on 2026-05-03 (#265 — staging quota burnt for ~16h).
Path:
E2E_RUNTIME=claude-code (default)
→ workspace-configs-templates/claude-code-default/config.yaml's
`minimax` provider (lines 64-69)
→ ANTHROPIC_BASE_URL auto-set to api.minimax.io/anthropic
→ reads MINIMAX_API_KEY (per-vendor env, no collision with
GLM/Z.ai etc.)
Workflow changes (continuous-synth-e2e.yml):
- Default runtime: langgraph → claude-code
- New env: E2E_MODEL_SLUG (defaults to MiniMax-M2.7-highspeed,
overridable via workflow_dispatch)
- New secret wire: E2E_MINIMAX_API_KEY ←
secrets.MOLECULE_STAGING_MINIMAX_API_KEY
- Per-runtime missing-secret guard: claude-code requires MINIMAX,
langgraph/hermes require OPENAI. Cron firing hard-fails on missing
key for the active runtime; dispatch soft-skips so operators can
ad-hoc test without setting up the secret first
- Operators can still pick langgraph/hermes via workflow_dispatch;
the OpenAI fallback path stays wired
Script changes (tests/e2e/test_staging_full_saas.sh):
- SECRETS_JSON branches on which key is set:
E2E_MINIMAX_API_KEY → {MINIMAX_API_KEY: <key>} (claude-code path)
E2E_OPENAI_API_KEY → {OPENAI_API_KEY, HERMES_*, MODEL_PROVIDER} (legacy)
MiniMax wins when both are present — claude-code default canary
must not accidentally consume the OpenAI key
Tests (new tests/e2e/test_secrets_dispatch.sh):
- 10 cases pinning the precedence + payload shape per branch
- Discipline check verified: 5 of 10 FAIL on a swapped if/elif
(precedence inversion), all 10 PASS on the fix
- Anchors on the section-comment header so a structural refactor
fails loudly rather than silently sourcing nothing
The model_slug dispatcher (lib/model_slug.sh) needs no change:
E2E_MODEL_SLUG override path is already wired (line 41), and
claude-code template's `minimax-` prefix matcher catches
"MiniMax-M2.7-highspeed" via lowercase-on-lookup.
Operator action required to land green:
- Set MOLECULE_STAGING_MINIMAX_API_KEY in repo secrets
(Settings → Secrets and Variables → Actions). Use
`gh secret set MOLECULE_STAGING_MINIMAX_API_KEY -R Molecule-AI/molecule-core`
to avoid leaking the value into shell history.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The synth-E2E (#2342) provisions a langgraph tenant whose default
model `openai:gpt-4.1-mini` requires OPENAI_API_KEY for the first LLM
call. Sibling workflows already wire this:
- e2e-staging-saas.yml:89
- canary-staging.yml:63
continuous-synth-e2e.yml just forgot. Result: tenant boots, accepts
a2a messages, then returns:
Agent error: "Could not resolve authentication method. Expected
either api_key or auth_token to be set."
This was masked since 2026-04-29 (workflow creation) by a2a-sdk v0→v1
contract violations — PR #2558 (Task-enqueue) and #2563
(TaskUpdater.complete/failed terminal events) cleared those, exposing
the underlying auth gap on the synth-E2E firing at 11:39 UTC today.
The script tests/e2e/test_staging_full_saas.sh:325 already reads
E2E_OPENAI_API_KEY and persists it as a workspace_secret on tenant
create — only the workflow wiring was missing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hard gate Tier 2 item 2 of 4. Cron-driven full-lifecycle E2E that
catches regressions visible only at runtime — schema drift,
deployment-pipeline gaps, vendor outages, env-var rotations,
DNS / CF / Railway side-effects.
Empirical motivation from today:
- #2345 (A2A v0.2 silent drop) — passed unit tests, broke at JSON-RPC
parse layer between sender + receiver. Visible only when a sender
exercises the full path. Now-fixed by PR #2349, but a continuous
E2E would have surfaced it within 20 min of the regression.
- RFC #2312 chat upload — landed staging-branch but never reached
staging tenants because publish-workspace-server-image was main-
only. Caught by manual dogfooding hours after deploy. Same pattern.
Both classes are invisible to PR-time CI. The continuous gate fires
every 20 min against a real staging tenant and surfaces regressions
within minutes.
Cadence: cron `0,20,40 * * * *` (3x/hour). Offsets the existing
sweep-cf-orphans (:15) and sweep-cf-tunnels (:45) so the three ops
don't burst CF/AWS APIs at the same minute. Concurrency group
prevents overlapping runs if one hangs.
Cost: ~$0.50-1/day GHA + pennies of staging tenant lifecycle.
Reuses existing tests/e2e/test_staging_full_saas.sh — no new harness
to maintain. Bounded at 10 min wall-clock (vs 15 min default) so
stuck runs fail fast rather than holding up the next firing.
Defaults to E2E_RUNTIME=langgraph (fastest cold start; the regression
classes this gate catches don't need hermes-specific paths). Operators
can dispatch with runtime=hermes when they want SDK-native coverage.
Schedule-vs-dispatch hardening: hard-fail on missing
CP_STAGING_ADMIN_API_TOKEN for cron firing (silent-skip would mask
real outages); soft-skip for operator dispatch.
Refs:
- #2342 hard-gates Tier 2 item 2
- #2345 (A2A v0.2 fix that this gate would have caught earlier)
- #2335 / #2337 (deployment-pipeline gaps that this gate also catches)