obs/infra: control-plane + some tenant boxes not observable — wire Railway CP→Loki log drain + grant fleet ssm:SendCommand (prod) #3214

Open
opened 2026-06-24 07:53:39 +00:00 by hongming-ceo-delegated · 0 comments

Observability gap (root enabler of the ADK debug stall)

Our Loki (obs.moleculesai.app, ds P8E80F9AEF21F6940) ingests logs from the operator host, CI, and those AWS tenant boxes that ship logs. It does NOT cover (a) the Railway-hosted control-plane or (b) some org boxes: molecule-adk-demo's box is absent from the tenant label set. This is why the ADK provision failure could not be diagnosed via obs and was mis-escalated as 'need RAILWAY_TOKEN_PRODUCTION'. That requirement was a phantom: the CP DB + org API gave the answer.
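One way to confirm which org boxes are missing is to compare the expected org list against the values Loki reports for the tenant label. The sketch below is only illustrative: it assumes the Loki HTTP API (`/loki/api/v1/label/<name>/values`) is reachable under obs.moleculesai.app, which may not hold if that host fronts Grafana or requires auth, and the expected-org list and sample response are hypothetical. The comparison is kept as a pure function so it can be run against a saved response body.

```python
import json
from urllib.parse import quote


def label_values_url(base_url: str, label: str) -> str:
    """Build the Loki label-values endpoint URL for a given label name."""
    return f"{base_url.rstrip('/')}/loki/api/v1/label/{quote(label, safe='')}/values"


def missing_tenants(expected, response_body: str):
    """Return expected tenants that do not appear in a Loki label-values response.

    `response_body` is the raw JSON text returned by the label-values endpoint,
    which has the shape {"status": "success", "data": ["value1", ...]}.
    """
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"unexpected Loki response status: {payload.get('status')!r}")
    present = set(payload.get("data") or [])
    return sorted(set(expected) - present)


# Hypothetical inputs: "tenant-a" stands in for any org that already ships logs.
expected_orgs = ["molecule-adk-demo", "tenant-a"]
sample_body = '{"status": "success", "data": ["tenant-a"]}'

url = label_values_url("https://obs.moleculesai.app", "tenant")  # assumed base URL
gaps = missing_tenants(expected_orgs, sample_body)
print(url)
print(gaps)
```

Running the same comparison on a schedule against the real org list (for example, from the CP DB) would surface a box that never started shipping before an incident depends on it.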

Asks (owner/infra)

  1. Railway CP → Loki log drain (HTTP/syslog drain → Alloy/Loki) so the control-plane is observable without per-incident token pulls.
  2. Ensure every tenant org box ships to Loki (vector/alloy wired at provision; molecule-adk-demo's box currently does not ship).
  3. Grant the fleet/operator identity ssm:SendCommand (+ access to the AWS-RunShellScript document) on prod tenant instances so box-level docker logs are readable for incident response (calls currently fail with InvalidDocument).
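For ask 3, a scoped IAM policy might look like the sketch below. It is a starting point under stated assumptions, not the actual fleet policy: the region, account ID, and the `env=prod` instance tag are hypothetical placeholders, and the document is assumed to be the AWS-managed AWS-RunShellScript. The `ssm:GetCommandInvocation` statement is an addition not mentioned in the ask; it is included because reading command output back normally needs it.

```python
import json


def build_send_command_policy(region: str, account_id: str, tag_key: str, tag_value: str) -> dict:
    """Build an IAM policy allowing ssm:SendCommand on tagged instances only."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Target: EC2 instances carrying the given tag (hypothetical tag).
                "Sid": "SendCommandToTaggedInstances",
                "Effect": "Allow",
                "Action": "ssm:SendCommand",
                "Resource": f"arn:aws:ec2:{region}:{account_id}:instance/*",
                "Condition": {
                    "StringEquals": {f"ssm:resourceTag/{tag_key}": tag_value}
                },
            },
            {
                # SendCommand must also be allowed on the document itself.
                # AWS-managed documents have an empty account field in the ARN.
                "Sid": "AllowRunShellScriptDocument",
                "Effect": "Allow",
                "Action": "ssm:SendCommand",
                "Resource": f"arn:aws:ssm:{region}::document/AWS-RunShellScript",
            },
            {
                # Assumption: the operator also needs to read the command output.
                "Sid": "ReadCommandOutput",
                "Effect": "Allow",
                "Action": "ssm:GetCommandInvocation",
                "Resource": "*",
            },
        ],
    }


# Placeholder values for illustration only.
policy = build_send_command_policy("us-east-1", "111122223333", "env", "prod")
print(json.dumps(policy, indent=2))
```

Limiting the instance statement by tag keeps the grant confined to prod tenant boxes rather than every instance in the account.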
Reference: molecule-ai/molecule-core#3214