Commit Graph

3996 Commits

Author SHA1 Message Date
rabbitblood
e4535560cf fix(platform): panic-recovering supervisor for every background goroutine (#92)
Yesterday's scheduler-died incident (#85) was one instance of a systemic
bug: every long-running goroutine in the platform lacks panic recovery
and exposes no liveness signal. In a multi-tenant SaaS deployment, a
single tenant's bad data panicking any subsystem takes down the
subsystem for every tenant, silently, with all standard health probes
still green. That is a scale-of-one sev-1.

This PR:

1. Introduces `platform/internal/supervised/` with two primitives:

   a. RunWithRecover(ctx, name, fn) — runs fn in a recover wrapper.
      On panic, logs the stack and restarts fn with exponential
      backoff (1s → 2s → 4s → … → 30s cap). On clean return (fn
      decided to stop), it exits. On ctx.Done, it shuts down cleanly.

   b. Heartbeat(name) + LastTick(name) + Snapshot() + IsHealthy(names,
      staleThreshold) — shared in-memory liveness registry. Every
      subsystem calls Heartbeat(name) at the end of each tick so
      operators can distinguish "goroutine alive and healthy" from
      "alive but stuck inside a single tick".

2. Wraps every `go X.Start(ctx)` in main.go:
   - broadcaster.Subscribe   (Redis pub/sub relay → WebSocket)
   - registry.StartLivenessMonitor
   - registry.StartHealthSweep
   - scheduler.Start         (the one that died yesterday)
   - channelMgr.Start        (Telegram / Slack)

3. Adds `supervised.Heartbeat("scheduler")` inside the scheduler tick
   loop as the first end-to-end demonstration. Follow-up PRs will add
   heartbeats to the other four subsystems.

4. Adds `GET /admin/liveness` endpoint returning per-subsystem
   last_tick_at + seconds_ago. Operators can poll this and alert on
   any subsystem whose seconds_ago exceeds 2x its cron/tick interval.

5. Unit tests for RunWithRecover (clean return no restart; panic
   restarts with backoff; ctx cancel stops restart loop) and for the
   liveness registry.

Net new code: ~160 lines + ~100 lines of tests. Refactor of main.go:
~10 lines changed. No behavior change on the happy path; only what
happens on a panic changes.

Closes #92. Supersedes the local recover added to scheduler.go in
#90 (kept conceptually, but now via the shared helper).
2026-04-14 20:34:18 -07:00
Backend Engineer
602dcb6283 fix(security): C6 — block loopback IP literals in /registry/register
A workspace that self-registers with a 127.0.0.x URL on first INSERT
could redirect A2A proxy traffic back to the platform itself (SSRF).
The previous fix only blocked 169.254.0.0/16 (cloud metadata).

Add 127.0.0.0/8 to validateAgentURL's blocklist. RFC-1918 private
ranges (10.x, 172.16.x, 192.168.x) remain allowed — Docker container
networking depends on them.
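
Sketched in Go, the rule looks roughly like this — the helper name
blockedIPLiteral is hypothetical (the real check lives in
validateAgentURL); net.IP classifies both blocked ranges directly:

```go
package main

import (
	"fmt"
	"net"
	"net/url"
)

// blockedIPLiteral reports whether rawURL's host is an IP literal in a
// blocked range: loopback (127.0.0.0/8, ::1) or link-local
// (169.254.0.0/16). Hostnames like "localhost" and RFC-1918 private
// IPs pass through.
func blockedIPLiteral(rawURL string) (bool, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false, err
	}
	ip := net.ParseIP(u.Hostname())
	if ip == nil {
		return false, nil // a hostname, not an IP literal
	}
	return ip.IsLoopback() || ip.IsLinkLocalUnicast(), nil
}

func main() {
	for _, u := range []string{
		"http://127.0.0.1:8080",      // blocked: loopback literal
		"http://[::1]:8080",          // blocked: IPv6 loopback
		"http://169.254.169.254/",    // blocked: cloud metadata range
		"http://localhost:8080",      // allowed: hostname, not literal
		"http://10.0.0.5:8080",       // allowed: RFC-1918
	} {
		blocked, _ := blockedIPLiteral(u)
		fmt.Println(u, "blocked:", blocked)
	}
}
```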

Safe because the provisioner writes 127.0.0.1 URLs via direct SQL
UPDATE, not through /registry/register, so the UPSERT CASE that
preserves provisioner URLs is unaffected. Local-dev agents can still
register using "localhost" by name (hostname, not IP literal).

Tests: removed "valid localhost http" case (now correctly rejected),
added "valid localhost name" + three loopback-block assertions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 03:34:14 +00:00
Hongming Wang
8019adb811 Merge pull request #90 from Molecule-AI/fix/scheduler-watchdog-recover
fix(scheduler): recover from panics + add liveness watchdog (#85)
2026-04-14 20:30:31 -07:00
Hongming Wang
e33ff21cad Merge pull request #87 from Molecule-AI/chore/template-evolution-crons
chore(template): add 4 evolution crons — ecosystem / plugins / template / channels
2026-04-14 20:30:26 -07:00
Hongming Wang
857218ad35 Merge pull request #81 from Molecule-AI/docs/sync-2026-04-15-tick-9
QA verified: docs-only change (PLAN.md + edit-history). CI green (all 6 checks pass). No code changes. Safe to merge.
2026-04-14 20:30:18 -07:00
Hongming Wang
e2ba3e3256 Merge pull request #91 from Molecule-AI/feat/canvas-saas-cross-origin
feat(canvas): SaaS cross-origin — slug header + cookie credentials (Phase F)
2026-04-14 20:10:46 -07:00
Hongming Wang
3abcee11b3 feat(canvas): SaaS cross-origin — slug header + cookie credentials (Phase F)
Canvas will be served at <slug>.moleculesai.app (Vercel). API calls go
cross-origin to https://app.moleculesai.app. This commit wires the
client side:

- canvas/src/lib/tenant.ts: getTenantSlug() derives the slug from
  window.location.hostname, case-insensitive, matching the control
  plane's reservedSubdomains list (app/www/api/admin/…). Server-side
  + localhost + vercel preview URLs + apex all return "" so local dev
  keeps working.

- canvas/src/lib/api.ts: adds X-Molecule-Org-Slug header + sets
  credentials:"include" on every fetch. The control plane's CORS
  middleware allows the origin + credentials; the session cookie has
  Domain=.moleculesai.app so the browser ships it.

- canvas/src/lib/api/secrets.ts: same treatment (secrets API uses its
  own fetch helper — shared slug+credentials logic applied).
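
The derivation rule reads well as code. The real helper is
getTenantSlug() in canvas/src/lib/tenant.ts (TypeScript); this is a
hedged Go port for illustration only, with the reserved list
abbreviated to the names the commit mentions:

```go
package main

import (
	"fmt"
	"strings"
)

// Abbreviated stand-in for the control plane's reservedSubdomains list.
var reservedSubdomains = map[string]bool{
	"app": true, "www": true, "api": true, "admin": true,
}

// tenantSlug derives the org slug from a hostname, case-insensitively.
// Localhost, vercel preview URLs, the apex domain, nested subdomains,
// and reserved names all yield "" so non-SaaS environments keep working.
func tenantSlug(hostname string) string {
	h := strings.ToLower(hostname)
	const base = ".moleculesai.app"
	if !strings.HasSuffix(h, base) {
		return "" // localhost, vercel preview, anything off-domain
	}
	slug := strings.TrimSuffix(h, base)
	if slug == "" || strings.Contains(slug, ".") || reservedSubdomains[slug] {
		return "" // apex, nested subdomain, or reserved name
	}
	return slug
}

func main() {
	fmt.Println(tenantSlug("Acme.moleculesai.app")) // acme
	fmt.Println(tenantSlug("app.moleculesai.app"))  // (reserved → empty)
	fmt.Println(tenantSlug("localhost"))            // (off-domain → empty)
}
```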

Tests: +6 (tenant.test.ts covers slug / reserved / case / non-SaaS /
preview URL / apex). Full canvas suite 447/447 green.

Not in this PR:
- WS URL derivation for terminal/socket.ts (separate follow-up; WS
  needs its own slug-aware URL and the canvas terminal isn't used in
  SaaS launch day-one).
- Next.js rewrites (decided against; cross-origin with credentials
  is cleaner than path-level rewrites for session cookies).

Deploys to Vercel once merged — no manual config needed (env already set).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 20:08:39 -07:00
rabbitblood
ef7f482593 fix(scheduler): recover from panics + add liveness watchdog (#85)
The scheduler died silently on 2026-04-14 14:21 UTC and stayed dead for
12+ hours. Platform restart didn't recover it. Root cause: tick() and
fireSchedule() goroutines have no panic recovery. A single bad row, bad
cron expression, DB blip, or transient panic anywhere in the chain
permanently kills the scheduler goroutine — and the only signal to an
operator is "no crons firing", which is invisible if you're not watching.

Specifically:

  func (s *Scheduler) Start(ctx context.Context) {
      for {
          select {
          case <-ticker.C:
              s.tick(ctx)   // <- if this panics, the for-loop exits forever
          }
      }
  }

And inside tick:

  go func(s2 scheduleRow) {
      defer wg.Done()
      defer func() { <-sem }()
      s.fireSchedule(ctx, s2)   // <- panic here propagates up wg.Wait()
  }(sched)

Two `defer recover()` additions:

1. In Start's tick wrapper — a panic in tick() (DB scan, cron parse,
   row processing) is logged and the next tick fires normally.
2. In each fireSchedule goroutine — a single bad workspace can't take
   the rest of the batch down.

Plus a liveness watchdog:

- Scheduler now records `lastTickAt` after each successful tick.
- New methods `LastTickAt()` and `Healthy()` (true if last tick within
  2× pollInterval = 60s).
- Initialised at Start so Healthy() returns true on a fresh process.

Endpoint plumbing for /admin/scheduler/health is a follow-up — needs
threading the scheduler instance through router.Setup(). Documented
on #85.

Closes the silent-outage failure mode of #85. The other proposed
fixes (force-kill on /restart hang, active_tasks watchdog) are
separate concerns tracked in #85's comments.
2026-04-14 19:32:01 -07:00
Hongming Wang
d81d15537e Merge pull request #89 from Molecule-AI/docs/sync-saas-progress
docs(plan): add Phase 32 current-state snapshot
2026-04-14 18:17:36 -07:00
Hongming Wang
73887948b2 Merge pull request #88 from Molecule-AI/fix/tenant-guard-state-no-prefix
fix(middleware): tenant guard reads bare UUID from state= (pair with cp #8)
2026-04-14 18:14:14 -07:00
Hongming Wang
c24c7bdb97 docs(plan): add Phase 32 current-state block
Point-in-time snapshot of the live SaaS infrastructure + which phases
are done vs in-flight vs not started. Links to molecule-controlplane's
own PLAN for deeper operator detail.
2026-04-14 18:13:47 -07:00
Hongming Wang
7af4f10226 fix(middleware): tenant guard reads bare UUID from state= (no prefix)
Pair to molecule-controlplane PR #8. Fly's proxy returns 502 if the
fly-replay state value contains '=', so the control plane now puts the
bare UUID in state= (no 'org-id=' prefix). TenantGuard now treats the
whole 'state=...' value as the org id.
2026-04-14 18:09:44 -07:00
rabbitblood
4f2b28c060 chore(template): add 4 evolution crons — ecosystem / plugins / template / channels
Today's crons are all REVIEW (Security audit, UIUX audit, QA tests). Nothing
actively pushes the team to EVOLVE the four levers CEO named: templates,
plugins, channels, watchlist. The team-runs-24/7 goal needs both — defensive
reviews AND offensive evolution.

Adds 4 new schedules:

1. Research Lead — Daily ecosystem watch (0 8 * * *)
   Survey github.com/trending + HN + AI-blogs for new agent-infra projects
   from the last 24h. Add 1-3 entries to docs/ecosystem-watch.md per day,
   commit to chore/eco-watch-YYYY-MM-DD branch + push + PR. Re-enables
   the watchlist pipeline that was paused earlier today.

2. Technical Researcher — Weekly plugin curation (0 9 * * 1, Mondays)
   Inventory plugins/ + builtin_tools/ + recent landings. Identify gaps
   (builtin not exposed as plugin; role missing extras; rarely-used plugin
   in defaults). Survey upstream (claude.ai cookbook, MCP servers,
   anthropic/openai/langchain blogs). File 1-3 plugin proposals per week
   as GH issues with concrete integration sketches.

3. Dev Lead — Daily template fitness audit (30 8 * * *)
   Health-check the template itself: stale system prompts, schedules not
   firing (catches the #85 scheduler-died failure mode), roles missing
   plugins they should have, missing crons, channel gaps. File issues for
   any drift. Designed to catch the silent-stall pattern from today's
   incident.

4. DevOps Engineer — Weekly channel expansion survey (0 10 * * 1, Mondays)
   PM is the only role with a channel today (Telegram). Survey what
   channel infra the platform supports + what role-channel pairings would
   actually help (Security→email-on-critical, DevOps→Slack-on-CI-break,
   etc). File channel-proposal issues.

All four crons end with the structured audit_summary routing per #51/#75
(category, severity, issues, top_recommendation) so they integrate with
the platform-level category_routing PM uses to fan out work. The template's
existing category_routing block already maps research / plugins / template /
channels — these new crons consume exactly those slots.

Also drops three stale "# UNION with defaults (#71)" comments left from
the cleanup PR — those plugins lists are now self-documenting after #71.

Aligns with north-star goal: team should run 24/7 AND keep getting better
across templates / plugins / channels / watchlist. This PR closes the gap
where the "review" half of the loop was running but the "evolve" half had
no active driver.
2026-04-14 18:04:00 -07:00
Hongming Wang
e523ca9b20 Merge pull request #86 from Molecule-AI/docs/plugin-adaptor-header-fix
docs(plan): plugin adaptor system is shipped, not future work
2026-04-14 18:03:28 -07:00
Hongming Wang
2db410cccb Merge pull request #84 from Molecule-AI/fix/tenant-guard-fly-replay-src
fix(middleware): TenantGuard accepts org id via Fly-Replay-Src state
2026-04-14 18:03:19 -07:00
Hongming Wang
c442d79aac docs(plan): rename 'Future Work — Plugin Adaptor System' to reflect shipped state
Header implied the whole system was future work, but the section body
says the core (per-runtime adapters, hybrid resolver, AgentskillsAdaptor,
/plugins filter, SDK, agentskills.io spec compliance) all landed. Only
the bullets under 'Deferred, not blocking' are actually open.

Rename + lead with 'The system is done.' so a skim reader doesn't
misfile the whole topic as unshipped.
2026-04-14 18:02:28 -07:00
Hongming Wang
f1dd7cc367 fix(middleware): TenantGuard accepts org id via Fly-Replay-Src state
Phase B.3 pair-fix to the control plane's fly-replay state change.

Background: the private molecule-controlplane's router emits
`fly-replay: app=X;instance=Y;state=org-id=<uuid>`. Fly's edge replays
the request to the tenant and injects `Fly-Replay-Src: instance=Z;...;
state=org-id=<uuid>` on the replayed request. But response headers from
the cp (like X-Molecule-Org-Id) never travel to the replayed tenant —
only the state= param does.

TenantGuard now checks both paths in order:
  1. Primary: X-Molecule-Org-Id header (direct-access path, e.g. molecli)
  2. Secondary: Fly-Replay-Src's `state=org-id=<uuid>` segment
     (production fly-replay path)

Either matching configured MOLECULE_ORG_ID → allow. Neither matches →
404 (still don't leak tenant existence).

New helper orgIDFromReplaySrc parses the semicolon-separated
Fly-Replay-Src header per Fly's format. Covered by a table-driven test
with 7 cases including malformed + empty-header + wrong-state-key.
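
A sketch of that parse, under the stated format — the helper name comes
from the commit, the internals are illustrative:

```go
package main

import (
	"fmt"
	"strings"
)

// orgIDFromReplaySrc extracts the org id from a Fly-Replay-Src header,
// a semicolon-separated key=value list whose state= segment carries
// org-id=<uuid>. Returns "" for missing, malformed, or wrong-key input.
func orgIDFromReplaySrc(header string) string {
	for _, part := range strings.Split(header, ";") {
		v, ok := strings.CutPrefix(strings.TrimSpace(part), "state=")
		if !ok {
			continue
		}
		if id, ok := strings.CutPrefix(v, "org-id="); ok {
			return id
		}
	}
	return ""
}

func main() {
	fmt.Println(orgIDFromReplaySrc("instance=xyz;t=123;state=org-id=abc-123"))
	fmt.Println(orgIDFromReplaySrc("instance=xyz;state=other") == "")
}
```

Note that the later pair-fix (#88, above in this log) switches the
state= value to a bare UUID because Fly's proxy 502s on '=' inside it,
so the org-id= prefix handled here goes away in that follow-up.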

Tests: +3 new TestTenantGuard_* (FlyReplaySrc match, mismatch, table).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 17:54:13 -07:00
Hongming Wang
8a0ebace8e Merge pull request #83 from Molecule-AI/fix/fly-registry-username
fix(ci): revert Fly registry username to 'x' — 401 on any other value
2026-04-14 17:26:12 -07:00
Hongming Wang
6f785f0b5a fix(ci): revert Fly registry username to 'x' — 'molecule-ai' gets 401
Post-mortem on the failed publish-platform-image run on main (PR #82):

Fly's Docker registry requires username EXACTLY equal to "x". My
code-review "readability fix" changing it to "molecule-ai" caused
every push to return 401 Unauthorized. Verified locally:

  echo $FLY_API_TOKEN | docker login registry.fly.io -u x --password-stdin
  → Login Succeeded

  echo $FLY_API_TOKEN | docker login registry.fly.io -u molecule-ai --password-stdin
  → 401 Unauthorized

Lesson: don't second-guess docs that specify a literal value. Comment
now says "MUST be literal 'x'" with a 2026-04-15 verification note to
prevent future regressions.

Code-review process improvement: when reviewing a change against a
vendor API, prefer "preserve exact doc-specified values" over readability
suggestions. Logged as a cron-learning.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 17:21:53 -07:00
Hongming Wang
cd73f8fc72 Merge pull request #82 from Molecule-AI/feat/mirror-to-fly-registry
feat(ci): mirror platform image to registry.fly.io/molecule-tenant
2026-04-14 17:16:04 -07:00
Hongming Wang
855d423f6c review: split push steps, runbook for secret rotation, username clarity
Addresses PR #82 code review: 🟡×3 + 🔵×5.

- Fly registry login username: 'x' → 'molecule-ai' + explanatory comment.
- Build & push split into two steps (GHCR / Fly registry) so a single-
  registry outage can't fail the other. Second step uses 'if: always()'
  to ensure Fly mirror runs even if GHCR push flakes.
- docs/runbooks/saas-secrets.md: full secret map + rotation procedures
  for every SaaS credential, with danger-case callouts. Documents the
  coupled FLY_API_TOKEN (lives in GHA secret AND fly secrets — must be
  rotated in both).
- CLAUDE.md: new 'SaaS ops' section linking to the runbook.
2026-04-14 17:09:11 -07:00
Hongming Wang
b811b47334 feat(ci): mirror platform image to registry.fly.io/molecule-tenant
Keeps ghcr.io/molecule-ai/platform private (per CEO direction —
open-source when full SaaS ships) while still letting the private
control plane's Fly provisioner boot tenant machines: Fly
auto-authenticates same-org machines against registry.fly.io, no
per-tenant pull credentials to wire.

Workflow now logs into both GHCR (using built-in GITHUB_TOKEN) and
Fly registry (using FLY_API_TOKEN secret) and pushes the same image to
four tags total:
- ghcr.io/molecule-ai/platform:latest
- ghcr.io/molecule-ai/platform:sha-<short>
- registry.fly.io/molecule-tenant:latest
- registry.fly.io/molecule-tenant:sha-<short>

Secret added via `gh secret set FLY_API_TOKEN` on the public repo.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 17:05:36 -07:00
Hongming Wang
ba184dea5f docs: sync documentation with 2026-04-15 tick-9 merges (#79, #80)
- PLAN.md: new "Recently launched (2026-04-15 tick-9)" block covering
  Phase 32 Phase B.2 image pipeline (PR #80) + tick-8 docs (PR #79).
- docs/edit-history/2026-04-15.md: new file for today's merges.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 16:43:00 -07:00
Hongming Wang
292eb71c52 Merge pull request #80 from Molecule-AI/feat/ghcr-platform-image
feat(ci): publish-platform-image → ghcr.io/molecule-ai/platform (Phase B.2)
2026-04-14 16:41:59 -07:00
Hongming Wang
cf5abf8f63 Merge pull request #79 from Molecule-AI/docs/sync-2026-04-14-tick-8
docs: sync documentation with 2026-04-14 tick-8 merge (#78)
2026-04-14 16:40:27 -07:00
Hongming Wang
035287df38 feat(ci): publish-platform-image workflow → ghcr.io/molecule-ai/platform
Phase B.2 companion to the private molecule-controlplane provisioner PR.
On every push to main that touches platform/**, builds platform/Dockerfile
and pushes to GHCR with two tags:

- :latest              (floating, always main's tip)
- :sha-<short-commit>  (immutable, pin-friendly)

Cache via GitHub Actions cache (cache-from: type=gha). Workflow_dispatch
trigger so we can re-publish after a docs-only merge if needed.

The private molecule-controlplane sets TENANT_IMAGE=ghcr.io/molecule-ai/platform:<tag>
and the provisioner creates each tenant Fly Machine from this image. Staying
on the same base image across tenants keeps upgrades atomic.

CLAUDE.md updated to document the new workflow in the CI pipeline section.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 16:37:49 -07:00