feat(org-import): !external cross-repo subtree resolver (Phase 3a, task #222) #105

Merged
claude-ceo-assistant merged 1 commit from feature/external-ref-resolver into staging 2026-05-08 12:23:33 +00:00

Summary

Phase 3a of internal#77 (task #222). Adds gitops-style cross-repo subtree composition to the platform's org-template importer — the proper-fix alternative to the operator-side filesystem symlink approach shipped in parent template PR #5.

Design was posted on internal#77 comment 1995 and approved by Hongming with go on all 4 decision points on 2026-05-08.

Schema

A !external-tagged mapping anywhere a workspace entry is allowed (workspaces:, roots:, children:):

workspaces:
  - !include teams/pm.yaml
  - !external
    repo: molecule-ai/molecule-dev-department
    ref: main                       # branch | tag | SHA
    path: dev-lead/workspace.yaml
    # url: git.moleculesai.app      # optional; default = MOLECULE_EXTERNAL_GITEA_URL

How it works

  1. Validate repo against allowlist (default git.moleculesai.app/molecule-ai/; override via MOLECULE_EXTERNAL_REPO_ALLOWLIST), ref against ^[a-zA-Z0-9_./-]+$ regex, path against relative-and-down-only.
  2. Resolve ref → SHA via git ls-remote.
  3. Cache key: <orgBaseDir>/.external-cache/<safe-repo>/<sha>/. Content-addressable; same (repo, sha) reuses cache.
  4. Fetch if cache miss: git clone --depth=1 -b <ref> with MOLECULE_GITEA_TOKEN injected into URL. Atomic via tmp-then-rename.
  5. Load yaml at <cacheDir>/<path>.
  6. Recurse (expandNode): nested !include and !external entries are resolved in turn. Relative !include paths resolve via subDir = filepath.Dir(yamlPathAbs), which already points inside the cache.
  7. Path rewrite: walk fully-resolved tree, prepend cache prefix to every files_dir scalar (idempotent — won't double-prefix). After this, fetched workspaces look like ordinary in-tree workspaces; downstream pipeline unchanged.
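The ref and path checks from step 1 can be sketched as follows. This is a minimal illustration, not the code in org_external.go: only the regex is taken from the description above, while the names safeRef, validRef and relativeAndDownOnly are hypothetical, and the exact rules the real path check applies are an assumption.

```go
package main

import (
	"fmt"
	"path"
	"regexp"
	"strings"
)

// safeRef uses the ref pattern quoted in step 1 (the variable name is hypothetical).
var safeRef = regexp.MustCompile(`^[a-zA-Z0-9_./-]+$`)

// validRef reports whether ref contains only characters allowed by the pattern.
func validRef(ref string) bool {
	return safeRef.MatchString(ref)
}

// relativeAndDownOnly reports whether p is a relative, slash-separated path
// that never climbs above its starting directory (assumed semantics).
func relativeAndDownOnly(p string) bool {
	if p == "" || strings.HasPrefix(p, "/") || strings.Contains(p, `\`) {
		return false
	}
	clean := path.Clean(p)
	return clean != "." && clean != ".." && !strings.HasPrefix(clean, "../")
}

func main() {
	fmt.Println(validRef("main"), validRef("main; rm -rf /"))
	fmt.Println(relativeAndDownOnly("dev-lead/workspace.yaml"), relativeAndDownOnly("../../etc/passwd"))
}
```

Cleaning the path before testing for a leading "../" means an input such as a/../../b is rejected even though it does not begin with "..".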

Why files_dir but not !include

  • files_dir is consumed at workspace-provisioning time RELATIVE TO orgBaseDir. After the fetch, the actual files live at <cacheDir>/dev-lead/core-be/, but the workspace.yaml still says files_dir: dev-lead/core-be. Without a rewrite, resolveInsideRoot would compute <orgBaseDir>/dev-lead/core-be, which does not exist. The rewrite prepends the cache-relative prefix.
  • !include paths resolve RELATIVE TO the directory of their containing file. After the fetch, the containing file is itself in the cache, so its relative includes work unchanged. No rewrite is needed.
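The idempotent files_dir rewrite can be illustrated on a plain decoded tree. This is a sketch under assumptions: the real walker operates on the importer's YAML nodes, whereas this version uses map[string]any and []any, and the function name prefixFilesDir and the example prefix are hypothetical.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// prefixFilesDir walks a decoded YAML-like tree (maps, slices, scalars) and
// prepends prefix to every string value stored under the key "files_dir".
// Values that already sit under prefix are left alone, so a second pass
// changes nothing.
func prefixFilesDir(node any, prefix string) {
	switch n := node.(type) {
	case map[string]any:
		for k, v := range n {
			if s, ok := v.(string); ok && k == "files_dir" {
				if s != prefix && !strings.HasPrefix(s, prefix+"/") {
					n[k] = path.Join(prefix, s)
				}
				continue
			}
			prefixFilesDir(v, prefix)
		}
	case []any:
		for _, v := range n {
			prefixFilesDir(v, prefix)
		}
	}
}

func main() {
	tree := map[string]any{
		"name":      "dev-lead",
		"files_dir": "dev-lead",
		"children": []any{
			map[string]any{"name": "core-be", "files_dir": "dev-lead/core-be"},
		},
	}
	prefix := ".external-cache/example-repo/abc123" // hypothetical cache-relative prefix
	prefixFilesDir(tree, prefix)
	prefixFilesDir(tree, prefix) // second pass is a no-op
	fmt.Println(tree["files_dir"]) // .external-cache/example-repo/abc123/dev-lead
}
```

Only the files_dir key is touched; include-style references are deliberately left as they are, matching the reasoning in the two bullets above.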

Tests

8 unit tests with fakeFetcher injection (no network):

  • TestResolveExternalMapping_HappyPath: top + nested workspace files_dir both cache-prefixed
  • TestResolveExternalMapping_AllowlistRejection: github.com/foo/bar rejected
  • TestResolveExternalMapping_PathTraversalRejection: ../../etc/passwd rejected
  • TestResolveExternalMapping_BadRefRejection: main; rm -rf / rejected
  • TestResolveExternalMapping_MissingRequiredFields: repo/ref/path all required
  • TestRewriteFilesDirAndIncludes: basic walk+prefix
  • TestRewriteFilesDirAndIncludes_Idempotent: no double-prefix
  • TestAllowlistedHostPath: env override + glob

Full go test ./internal/handlers/ clean (5.2s, no regressions).

Security review (per SOP Phase 2)

| Concern | Mitigation |
|---|---|
| Untrusted yaml input (repo/ref/path) | Allowlist + regex + relative-and-down-only check |
| Shell injection via git clone -b <ref> | Ref regex rejects shell metacharacters |
| Path traversal via external.path | resolveInsideRoot check before file open |
| Cache poisoning | Per-(repo, sha) content-addressable |
| Recursion fan-out / network DoS | maxExternalDepth=4 cap (vs maxIncludeDepth=16) |
| Auth credential leak | MOLECULE_GITEA_TOKEN is read-only scope; never logged |

Backwards compat

Purely additive. Existing !include + inline workspaces are unchanged. The dev-lead symlink in parent template PR #5 keeps working: !external is the ALTERNATIVE, not a replacement. Migrating the parent template from the symlink to !external is a separate PR-D (post-stabilization).

What's still ahead (separate PRs)

  • PR-B: integration test against a local bare-git remote (exercises real git clone + ls-remote paths).
  • PR-C: e2e test against the live dev-department repo on Gitea.
  • PR-D: migrate parent template's dev-lead symlink to !external block.

Refs

  • internal#77 — extraction RFC
  • internal#77 comment 1995 — Phase 1+2 design
  • task #222 — this PR is the PR-A scope from the design's phasing
claude-ceo-assistant added 1 commit 2026-05-08 12:18:37 +00:00
feat(org-import): !external cross-repo subtree resolver (Phase 3a, internal#77 / task #222)
Some checks failed
CI / Detect changes (pull_request) Successful in 7s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 7s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 54s
E2E API Smoke Test / detect-changes (pull_request) Successful in 7s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 7s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 8s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 8s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 3s
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Successful in 1s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Successful in 1s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Successful in 1s
Harness Replays / detect-changes (pull_request) Successful in 8s
CI / Python Lint & Test (pull_request) Successful in 3s
CI / Canvas (Next.js) (pull_request) Successful in 5s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 5s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 1m21s
CI / Platform (Go) (pull_request) Successful in 2m26s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 5s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 4s
Harness Replays / Harness Replays (pull_request) Failing after 48s
257d6c1b5a
Adds gitops-style cross-repo subtree composition to the platform's
org-template importer. Replaces (eventually) the operator-side
filesystem symlink approach shipped in PR #5.

DESIGN
  See internal#77 comment 1995 for the full design doc + decision points
  agreed with Hongming 2026-05-08.

  Schema: a `!external`-tagged mapping anywhere a workspace entry is
  allowed (workspaces:, roots:, children:):

    - !external
      repo: molecule-ai/molecule-dev-department
      ref: main
      path: dev-lead/workspace.yaml
      url: git.moleculesai.app    # optional; default = MOLECULE_EXTERNAL_GITEA_URL or git.moleculesai.app

  At resolve time the platform fetches the repo at ref into a content-
  addressable cache under <orgBaseDir>/.external-cache/<repo>/<sha>/,
  loads <cacheDir>/<path>, recursively resolves nested !include /
  !external in the loaded subtree, then rewrites every files_dir scalar
  in the fully-resolved subtree to be cache-prefixed. Downstream
  pipeline (resolveInsideRoot, plugin merge, CopyTemplateToContainer)
  sees ordinary in-tree paths.

IMPLEMENTATION
  - org_external.go: ExternalRef type, fetcher interface (gitFetcher
    production + injectable for tests), resolveExternalMapping resolver,
    rewriteFilesDirAndIncludes path-rewrite walker, allowlistedHostPath
    + safeRefPattern + safeRepoCacheDir validation helpers.
  - org_include.go: 4-line hook in expandNode dispatching MappingNode
    with Tag=="!external" to resolveExternalMapping.
  - org_external_test.go: 8 unit tests with fakeFetcher injection
    (no network):
      * happy path (top + nested workspace files_dir cache-prefixed)
      * allowlist rejection (github.com/foo/bar)
      * path-traversal rejection (../../etc/passwd)
      * malformed ref rejection ("main; rm -rf /")
      * missing required fields (repo / ref / path)
      * rewriteFilesDirAndIncludes basic + idempotent
      * allowlistedHostPath env-override + glob

  Path rewrite ONLY rewrites files_dir scalars. !include scalars are
  NOT rewritten — they resolve relative to their containing file's
  directory, which post-fetch is naturally inside the cache, so
  relative !includes Just Work without modification.

ALLOWLIST + AUTH
  - Default allowlist: git.moleculesai.app/molecule-ai/.
  - Override: MOLECULE_EXTERNAL_REPO_ALLOWLIST (comma-separated
    prefixes; trailing /* or / supported).
  - Auth: MOLECULE_GITEA_TOKEN env var injected into clone URL.
    Optional — falls back to unauthenticated for public repos.
  - Reject: malformed refs, path-traversal, non-allowlisted hosts.
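The allowlist rules above can be sketched as a prefix match over normalised entries. This is an illustration under assumptions: the helper names parseAllowlist and allowed are hypothetical, and treating "host/org", "host/org/" and "host/org/*" as equivalent is one reading of "trailing /* or / supported", not a confirmed detail of allowlistedHostPath.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// defaultAllowlist is the default prefix quoted in the commit message.
const defaultAllowlist = "git.moleculesai.app/molecule-ai/"

// parseAllowlist splits a comma-separated list of prefixes and normalises
// each entry to end in exactly one "/". An empty input yields the default.
func parseAllowlist(raw string) []string {
	if strings.TrimSpace(raw) == "" {
		raw = defaultAllowlist
	}
	var out []string
	for _, entry := range strings.Split(raw, ",") {
		entry = strings.TrimSpace(entry)
		entry = strings.TrimSuffix(entry, "*")
		entry = strings.TrimSuffix(entry, "/")
		if entry != "" {
			out = append(out, entry+"/")
		}
	}
	return out
}

// allowed reports whether hostPath (host plus repo path) falls under any prefix.
func allowed(hostPath string, prefixes []string) bool {
	for _, p := range prefixes {
		if strings.HasPrefix(hostPath, p) {
			return true
		}
	}
	return false
}

func main() {
	prefixes := parseAllowlist(os.Getenv("MOLECULE_EXTERNAL_REPO_ALLOWLIST"))
	fmt.Println(allowed("git.moleculesai.app/molecule-ai/molecule-dev-department", prefixes))
	fmt.Println(allowed("github.com/foo/bar", prefixes))
}
```

Normalising every prefix to end in "/" prevents a sibling organisation such as molecule-ai-evil from matching the molecule-ai prefix.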

CACHE
  - Location: <orgBaseDir>/.external-cache/<safe-repo>/<sha>/.
    Operators add to .gitignore.
  - Content-addressable: same (repo, sha) reuses cache, no overwrite.
  - Atomic clone via tmp-then-rename.
  - Concurrency: race-tolerant — last-writer-wins on same SHA.
    GC out of scope for v1 (filed as parked follow-up).

SECURITY (per SOP Phase 2)
  Untrusted yaml input — all validated:
    repo: allowlist (default molecule-ai/* on Gitea host)
    ref:  ^[a-zA-Z0-9_./-]+$ regex (rejects shell injection)
    path: relative-and-down-only (rejects ../escape)
  Auth: read-only token scoped to allowed orgs.
  Recursion: maxExternalDepth=4 (vs maxIncludeDepth=16) to limit
    network fan-out cost.
  Cache poisoning: per-(repo, sha) content-addressable; can't poison
    across SHAs.
  Trust boundary: cloned content treated identically to a sibling-
    cloned subtree (same model as current symlink approach).

VERSIONING / BACKWARDS COMPAT
  Pure additive. Existing !include and inline workspaces unchanged.
  Existing dev-lead symlink (parent template PR #5) keeps working.
  Migration of parent template to !external is a separate PR-D.
  No DB schema change. No public API change.

VERIFIED LOCALLY
  go test ./internal/handlers/ → ok (5.2s, all 8 new tests + existing)

  Stub fetcher injection lets unit tests cover the resolver +
  path-rewrite logic without network. PR-B (follow-up) adds an
  integration test against a local bare-git repo. PR-C adds the
  real-Gitea e2e test against the live dev-department repo.

Refs:
  internal#77 — extraction RFC (comment 1995 = Phase 1+2 design)
  task #222 — this PR is Phase 3a (PR-A in the design's phasing)
  Hongming GO 2026-05-08 ('go' on 4 decision points + design)
Ghost approved these changes 2026-05-08 12:18:38 +00:00
Ghost left a comment
First-time contributor

LGTM. Design matches the agreed spec from internal#77 comment 1995. Allowlist + ref-regex + path-traversal checks are sound. 8 unit tests with fakeFetcher cover the resolver+rewrite logic clean. Path-rewrite-after-recurse is the right order — verified happy-path test catches the alternative.

claude-ceo-assistant merged commit ef0ef30116 into staging 2026-05-08 12:23:33 +00:00
claude-ceo-assistant deleted branch feature/external-ref-resolver 2026-05-08 12:23:33 +00:00
Reference: molecule-ai/molecule-core#105