molecule-core/docs/engineering/testing-strategy.md
Hongming Wang a56b765b2d
docs: testing strategy + PR hygiene + backend parity matrix + boot-event postmortem (#1824)
Bundles the documentation and lightweight tooling landed during the
2026-04-23 ops/triage session. Pure additions — no behavior changes.

## Added

### docs/architecture/backends.md
Parity matrix for Docker vs EC2 (SaaS) workspace backends. 18 features
tabulated with current status; 6 ranked drift risks; enforcement
hooks (parity-lint + contract tests). Living document — owners are
workspace-server + controlplane teams.

### docs/engineering/testing-strategy.md
Tiered test-coverage floors instead of a blanket 100% target. Seven
tiers by code class (auth/crypto → generated DTOs). Per-package
current-state snapshot + targets. Tracks the 3 biggest coverage gaps
(tokens.go 0%, workspace_provision.go 0%, wsauth ~48%) against their
tier-1/2 floors.

### docs/engineering/pr-hygiene.md
Captures the patterns that keep diffs reviewable. Motivated by the
2026-04-23 backlog audit where 8 of 23 open PRs had 70-380-file bloat
from stale branch drift. Covers: small-PR sizing, rebase-not-merge,
cherry-pick-onto-fresh-base for recovery, targeting staging first,
describing why-not-what.

### docs/engineering/postmortem-2026-04-23-boot-event-401.md
Postmortem for the /cp/tenants/boot-event 401 race. Root cause (DB
INSERT ordered AFTER readiness check), detection path (E2E + manual
log inspection), lessons (write-before-read pattern, integration
tests needed, E2E alerting gap, invariants-as-comments).

### tools/check-template-parity.sh
CI lint for template repos — diffs the `${VAR:+VAR=${VAR}}` provider-
key forwarders between install.sh (bare-host / EC2 path) and start.sh
(Docker path). Catches the #5 drift risk from backends.md before it
ships.

### workspace-server/internal/provisioner/backend_contract_test.go
Shared behavioral contract scaffold for Provisioner + CPProvisioner.
Compile-time assertions catch method-signature drift today; scenario-
level runs are t.Skip'd pending backend nil-hardening (drift risk #6,
see backends.md).

## Updated

### README.md
Links the new engineering docs + backends parity matrix into the
Documentation Map so agents and humans can actually find them.

## Related issues

- #1814 — unblock workspace_provision_test.go (broadcaster interface)
- #1813 — nil-client panic hardening (drift risk #6)
- #1815 — Canvas vitest coverage instrumentation
- #1816 — tokens.go 0% → 85%
- #1817 — 5 sqlmock column-drift failures
- #1818 — Python pytest-cov setup
- #1819 — wsauth middleware coverage gap
- #1821 — tiered coverage policy (meta)
- #1822 — backend parity drift tracker

Co-authored-by: Hongming Wang <hongmingwang.rabbit@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: molecule-ai[bot] <276602405+molecule-ai[bot]@users.noreply.github.com>
2026-04-23 19:59:38 +00:00


Testing Strategy

Status: Policy. Update when tier definitions or thresholds change.
Audience: Everyone writing or reviewing code in this repo.
Cross-refs: backends.md, pr-hygiene.md, postmortem-2026-04-23-boot-event-401.md

The short version

  • Don't chase 100% coverage. The last 15-20% costs as much as the first 80% and mostly adds brittle tests of trivial getters, error branches that can't fire, and stdlib wrappers.
  • Different code classes have different floors. Auth at 80% is scarier than a DTO at 50%. Match the test investment to the risk.
  • Tests should pay rent. A test that runs lines but asserts nothing meaningful isn't catching bugs — it's just dragging refactors down.

Tiered coverage floors

Every Go package, every TypeScript module, every Python module fits one of these tiers. The tier determines the minimum acceptable coverage — and the review standard.

| Tier | Examples | Line floor | Branch floor | Review standard |
|---|---|---|---|---|
| 1. Auth / secrets / crypto | tokens, session_auth, wsauth_middleware, crypto/envelope, cp_tenant_auth | 90% | 85% | Every branch tested. Adversarial scenarios (cross-tenant, expired token, null origin, malformed header). Timing considered. |
| 2. Handlers with side effects | workspace_provision, workspace_crud, container_files, terminal, registry | 75% | 70% | Happy + main error paths. DB mocks. Ownership / tenant-isolation checks. |
| 3. State machines + workers | scheduler, provisioner, healthsweep, orphan-sweeper, boot_ready | 75% | 70% | Every state transition tested, plus the transitions that shouldn't fire. |
| 4. Config / business logic | budget, orgtoken (validation), templates, derive-provider, redaction | 70% | 65% | Standard unit-test territory. Table-driven preferred. |
| 5. Plain DTOs / generated | models/*, proto-generated Go, TypeScript interfaces | none | none | Writing tests here is theatre. Don't. |
| 6. CLI glue / cmd/* | cmd/server, cmd/molecli | smoke only | | Integration tests / E2E cover these. One startup-smoke test per binary. |
| 7. Third-party wrappers | awsapi, cloudflareapi, stripeapi, neonapi | integration | | Unit tests mock vendor shape, not behavior. Real behavior covered by staging integration. |
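
To make the floors concrete, here is a minimal sketch of how the table could be encoded and checked in Go. The names Floor, tierFloors, and MeetsFloor are hypothetical and do not exist in the repo. Note that go test -cover reports statement coverage only, so the branch figure is assumed to come from a separate tool.

```go
package main

import "fmt"

// Floor holds the minimum line and branch coverage for a tier, in percent.
// Enforced is false for tiers that the table exempts from a numeric floor.
type Floor struct {
	Line, Branch float64
	Enforced     bool
}

// tierFloors mirrors the tier table. Tiers 5-7 carry no numeric floor
// (none / smoke only / integration), so they are left unenforced.
var tierFloors = map[int]Floor{
	1: {90, 85, true},
	2: {75, 70, true},
	3: {75, 70, true},
	4: {70, 65, true},
	5: {},
	6: {},
	7: {},
}

// MeetsFloor reports whether measured coverage satisfies the tier's floor.
// Unknown tiers are an error rather than a silent pass.
func MeetsFloor(tier int, line, branch float64) (bool, error) {
	f, ok := tierFloors[tier]
	if !ok {
		return false, fmt.Errorf("unknown tier %d", tier)
	}
	if !f.Enforced {
		return true, nil
	}
	return line >= f.Line && branch >= f.Branch, nil
}

func main() {
	// Tier 1 at 91% line but 80% branch: the line floor passes, the branch floor does not.
	ok, err := MeetsFloor(1, 91, 80)
	fmt.Println(ok, err)
}
```

Keeping the floors in one data structure means a threshold change touches one place, which matches the policy note that this document is updated when tier definitions change.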

Why a blanket percentage is wrong

  • A models/ package at 90% means you wrote tests for func (w Workspace) ID() string { return w.id }. No bugs caught, but the coverage number is green.
  • A tokens package at 75% means some rejection branch isn't covered. Maybe the exact branch that lets a revoked token still authenticate.
  • Blanket targets make the first case look equivalent to the second. They aren't.

Current state (as of 2026-04-23)

Run go test ./... -cover in each repo for up-to-date numbers. Snapshot:

workspace-server (Go)

| Package | Actual | Tier | Target | Gap |
|---|---|---|---|---|
| internal/handlers/tokens.go | 0% | 1 | 90% | 90 |
| internal/handlers/workspace_provision.go | 0% | 2 | 75% | 75 |
| internal/middleware/wsauth_middleware.go | ~48% | 1 | 90% | 42 |
| internal/provisioner | 45% | 3 | 75% | 30 |
| internal/scheduler | 49% | 3 | 75% | 26 |
| internal/channels | 40% | 4 | 70% | 30 |
| internal/orgtoken | 88% | 4 | 70% | |
| internal/crypto | 91% | 1 | 90% | |
| internal/supervised | 93% | 3 | 75% | |
| internal/plugins | 94% | 4 | 70% | |
| internal/envx | 100% | 5 | none | |

molecule-controlplane (Go)

| Package | Actual | Tier | Target | Gap |
|---|---|---|---|---|
| internal/awsapi | 18% | 7 | integration | |
| internal/provisioner | 48% | 3 | 75% | 27 |
| internal/handlers | 60% | 2 | 75% | 15 |
| internal/billing | 60% | 4 | 70% | 10 |
| internal/crypto | 68-80% | 1 | 90% | 10-22 |
| internal/auth | 96% | 1 | 90% | |
| internal/middleware | 97% | 1 | 90% | |
| internal/reserved | 100% | 5 | none | |
| internal/httpx | 100% | 4 | 70% | |
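
The Actual and Gap columns in the Go snapshots can be regenerated rather than hand-edited. The sketch below parses the per-package summary lines that go test ./... -cover prints and computes the gap against a target. ParseCoverage and Gap are hypothetical helpers, and the package paths are made up. Per-file figures such as tokens.go would need go tool cover -func on a coverage profile instead, which this sketch does not cover.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// coverLine matches the per-package summary printed by `go test -cover`, e.g.
// "ok  <tab>example.com/internal/scheduler<tab>0.412s<tab>coverage: 49.0% of statements".
var coverLine = regexp.MustCompile(`^ok\s+(\S+)\s.*coverage: ([0-9.]+)% of statements`)

// ParseCoverage maps package path to statement coverage in percent.
// Lines that do not match (failures, "[no test files]") are skipped.
func ParseCoverage(out string) map[string]float64 {
	result := make(map[string]float64)
	sc := bufio.NewScanner(strings.NewReader(out))
	for sc.Scan() {
		m := coverLine.FindStringSubmatch(sc.Text())
		if m == nil {
			continue
		}
		pct, err := strconv.ParseFloat(m[2], 64)
		if err != nil {
			continue
		}
		result[m[1]] = pct
	}
	return result
}

// Gap returns how many points a package sits below its target, or 0 if it meets it.
func Gap(actual, target float64) float64 {
	if actual >= target {
		return 0
	}
	return target - actual
}

func main() {
	// Sample output with made-up package paths.
	out := "ok  \texample.com/internal/scheduler\t0.412s\tcoverage: 49.0% of statements\n" +
		"?   \texample.com/internal/models\t[no test files]\n"
	for pkg, pct := range ParseCoverage(out) {
		fmt.Printf("%s actual=%.1f gap=%.1f\n", pkg, pct, Gap(pct, 75))
	}
}
```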

canvas (TypeScript)

No coverage instrumentation today. 900 tests / 58 files pass, but coverage isn't measured. See issue #1815 for the fix: set a 70% line floor in vitest.config.ts and gate CI on it.

workspace (Python)

No pytest/coverage config. See issue #1818: set up pytest-cov with --cov-fail-under=75 (ratchet from current baseline over 2-3 weeks).

Writing a good test

A good test:

  • Asserts a specific outcome, not that a function runs without error.
  • Covers the exact branch that bugs would live in — cross-tenant access, revoked-but-cached token, race on state transition.
  • Uses table-driven patterns when the code is a dispatch with N cases. One test row per case.
  • Mocks at system boundaries (DB, HTTP, time), not at internal package boundaries.
  • Survives refactors — tests behavior, not internal state.

A bad test:

  • Tests a getter that just returns a field.
  • Mocks the function under test itself.
  • Relies on time.Sleep or clock timing to assert order.
  • Asserts nil == nil to boost coverage.

Enforcement

CI gates

  • Go: go test ./... -cover, plus a pre-commit script that compares coverage to .coverage-baseline and fails when a tier-1 package drops by more than 2 points.
  • TypeScript: vitest --coverage with thresholds in vitest.config.ts. CI fails if coverage falls below them.
  • Python: pytest --cov-fail-under=75 in the Python CI job.
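
The Go baseline comparison could look roughly like the sketch below. The format of .coverage-baseline is not specified here, so this assumes one "<package> <percent>" pair per line. ParseBaseline, Regressions, and the tier-1 set are hypothetical names for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// maxDrop is the allowed coverage drop, in percentage points, for tier-1 packages.
const maxDrop = 2.0

// ParseBaseline reads "<package> <percent>" lines (an assumed file format).
func ParseBaseline(s string) (map[string]float64, error) {
	out := make(map[string]float64)
	sc := bufio.NewScanner(strings.NewReader(s))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 0 {
			continue // skip blank lines
		}
		if len(fields) != 2 {
			return nil, fmt.Errorf("malformed baseline line %q", sc.Text())
		}
		pct, err := strconv.ParseFloat(fields[1], 64)
		if err != nil {
			return nil, fmt.Errorf("bad percentage in %q: %w", sc.Text(), err)
		}
		out[fields[0]] = pct
	}
	return out, sc.Err()
}

// Regressions lists tier-1 packages whose coverage fell by more than maxDrop.
// A tier-1 package missing from current counts as 0%.
func Regressions(baseline, current map[string]float64, tier1 map[string]bool) []string {
	var bad []string
	for pkg, was := range baseline {
		if !tier1[pkg] {
			continue
		}
		if drop := was - current[pkg]; drop > maxDrop {
			bad = append(bad, fmt.Sprintf("%s: %.1f%% -> %.1f%%", pkg, was, current[pkg]))
		}
	}
	sort.Strings(bad) // stable output regardless of map iteration order
	return bad
}

func main() {
	baseline, err := ParseBaseline("internal/crypto 91.0\ninternal/plugins 94.0\n")
	if err != nil {
		fmt.Println("baseline error:", err)
		return
	}
	current := map[string]float64{"internal/crypto": 88.5, "internal/plugins": 80.0}
	tier1 := map[string]bool{"internal/crypto": true}
	// A real gate would exit non-zero when this list is non-empty.
	fmt.Println(Regressions(baseline, current, tier1))
}
```

In this sample only internal/crypto is reported: internal/plugins drops further, but it is not in the tier-1 set, so the gate as described ignores it.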

Review expectations

  • Any PR touching a tier-1 package that lowers its coverage needs an explicit reviewer sign-off and justification.
  • New code should arrive at or above its tier's floor.
  • Untested files in tier-1 or tier-2 should be flagged in review, not waved through.