audit: phase 1 structured audit-log — emit pkg + secrets wire-in #1572

Merged
core-devops merged 3 commits from infra-sre/audit-log-phase1-emit-secrets into main 2026-05-19 20:30:26 +00:00
Member

Phase 1 of the structured audit-log to Loki (see internal RFC rfcs/audit-log-to-loki.md).

What

  • New workspace-server/internal/audit package:

    • Emit(ctx, event_type, fields) — single entrypoint, never errors, never blocks.
    • HashValuePrefix(v, n) — sha256 prefix for value identification without exposure.
    • Context helpers WithUserID / WithActorKind / WithCorrelationID / WithWorkspaceID.
    • Two transports per emit: (1) audit:-prefixed stdout (ships via existing tenant Vector docker-logs source → Loki, no obs-stack change), (2) best-effort append to /var/log/molecule-audit.jsonl (durable forensic copy, Phase-2 file-source target).
  • Wired into internal/handlers/secrets.go:

    • Set (PUT/POST /workspaces/:id/secrets) → secret.set event, scope=workspace.
    • Delete (DELETE /workspaces/:id/secrets/:key) → secret.delete event, scope=workspace (only on rows>0).
    • SetGlobalsecret.set event, scope=global, actor=admin.
    • DeleteGlobalsecret.delete event, scope=global, actor=admin.

Why these handlers first

Secret rotation is the highest-security-value handler on the surface — "who rotated STRIPE_LIVE_KEY at 03:00Z" is unanswerable today.

Schema (v1, stable)

{"ts":"2026-05-19T20:00:00Z","event_type":"secret.set","workspace_id":"<uuid>","user_id":"<uuid|empty>","actor_kind":"user|admin|agent|cron","correlation_id":"<req-id|empty>","fields":{"key":"ANTHROPIC_API_KEY","value_hash":"3f8a1b2c","scope":"workspace","operation":"set"}}

Cardinality

Fleet active streams ≈ tenants(~10) × event_types(~20) = ~200. workspace_id / user_id / correlation_id stay INSIDE the JSON body (queryable via | json), not labels.

Security contract

TestEmit_NeverIncludesSecretValues_OnlyHash asserts raw values never reach the audit line. Reviewers checking new event_types must verify this.

Tests

  • TestEmit_WritesAuditPrefixedLineToStdout
  • TestEmit_AppendsToJSONLFile
  • TestEmit_DefaultsActorToUserWhenUnset
  • TestEmit_FieldsWorkspaceIDOverridesContext
  • TestEmit_NeverIncludesSecretValues_OnlyHash
  • TestEmit_FileAppendFailureDoesNotBlockStdout
  • TestEmit_Concurrent_NoInterleavedLines
  • TestHashValuePrefix_StableAndBounded

go vet clean. Existing secret-handler tests pass unchanged.

Reviewers

  • core-security: schema + value-hash contract
  • core-be: package shape + handler wire-in
  • core-devops: Vector pipeline assumption (existing tenant docker_logs source captures molecule-tenant stdout — no obs-stack PR needed for Phase 1)

Phase 1 only — DO NOT MERGE

Orchestrator 30-min cap. Surfacing for genuine non-author review per CTO feedback_never_admin_merge_bypass.

Phase 1 of the structured audit-log to Loki (see internal RFC `rfcs/audit-log-to-loki.md`). ## What - New `workspace-server/internal/audit` package: - `Emit(ctx, event_type, fields)` — single entrypoint, never errors, never blocks. - `HashValuePrefix(v, n)` — sha256 prefix for value identification without exposure. - Context helpers `WithUserID` / `WithActorKind` / `WithCorrelationID` / `WithWorkspaceID`. - Two transports per emit: (1) `audit:`-prefixed stdout (ships via existing tenant Vector docker-logs source → Loki, **no obs-stack change**), (2) best-effort append to `/var/log/molecule-audit.jsonl` (durable forensic copy, Phase-2 file-source target). - Wired into `internal/handlers/secrets.go`: - `Set` (PUT/POST `/workspaces/:id/secrets`) → `secret.set` event, scope=workspace. - `Delete` (DELETE `/workspaces/:id/secrets/:key`) → `secret.delete` event, scope=workspace (only on rows>0). - `SetGlobal` → `secret.set` event, scope=global, actor=admin. - `DeleteGlobal` → `secret.delete` event, scope=global, actor=admin. ## Why these handlers first Secret rotation is the highest-security-value handler on the surface — "who rotated `STRIPE_LIVE_KEY` at 03:00Z" is unanswerable today. ## Schema (v1, stable) ```json {"ts":"2026-05-19T20:00:00Z","event_type":"secret.set","workspace_id":"<uuid>","user_id":"<uuid|empty>","actor_kind":"user|admin|agent|cron","correlation_id":"<req-id|empty>","fields":{"key":"ANTHROPIC_API_KEY","value_hash":"3f8a1b2c","scope":"workspace","operation":"set"}} ``` ## Cardinality Fleet active streams ≈ tenants(~10) × event_types(~20) = ~200. `workspace_id` / `user_id` / `correlation_id` stay INSIDE the JSON body (queryable via `| json`), not labels. ## Security contract `TestEmit_NeverIncludesSecretValues_OnlyHash` asserts raw values never reach the audit line. Reviewers checking new event_types must verify this. ## Tests - `TestEmit_WritesAuditPrefixedLineToStdout` - `TestEmit_AppendsToJSONLFile` - `TestEmit_DefaultsActorToUserWhenUnset` - `TestEmit_FieldsWorkspaceIDOverridesContext` - `TestEmit_NeverIncludesSecretValues_OnlyHash` - `TestEmit_FileAppendFailureDoesNotBlockStdout` - `TestEmit_Concurrent_NoInterleavedLines` - `TestHashValuePrefix_StableAndBounded` `go vet` clean. Existing secret-handler tests pass unchanged. ## Reviewers - core-security: schema + value-hash contract - core-be: package shape + handler wire-in - core-devops: Vector pipeline assumption (existing tenant `docker_logs` source captures `molecule-tenant` stdout — no obs-stack PR needed for Phase 1) ## Phase 1 only — DO NOT MERGE Orchestrator 30-min cap. Surfacing for genuine non-author review per CTO `feedback_never_admin_merge_bypass`.
infra-sre added 1 commit 2026-05-19 19:24:01 +00:00
audit: phase 1 structured audit-log — emit pkg + secrets wire-in
Lint shellcheck (arm64 pilot) / shellcheck-arm64 (pilot) (pull_request) Waiting to run
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 4s
CI / Detect changes (pull_request) Successful in 9s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 10s
E2E API Smoke Test / detect-changes (pull_request) Successful in 6s
E2E Chat / detect-changes (pull_request) Successful in 9s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 7s
Harness Replays / detect-changes (pull_request) Successful in 5s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 5s
CI / Platform (Go) (pull_request) Failing after 2m33s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 3s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 3s
CI / all-required (pull_request) Failing after 50s
Lint pre-flip continue-on-error / Verify continue-on-error flips have run-log proof (pull_request) Successful in 27s
lint-required-workflows-docker-host-pinned / Lint docker-host pin on docker-touching workflows (pull_request) Successful in 8s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 25s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 9s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 4s
gate-check-v3 / gate-check (pull_request) Successful in 5s
qa-review / approved (pull_request) Failing after 4s
sop-checklist / na-declarations (pull_request) N/A: (none)
sop-checklist / all-items-acked (pull_request) Successful in 4s
security-review / approved (pull_request) Failing after 4s
sop-tier-check / tier-check (pull_request) Successful in 4s
lint-continue-on-error-tracking / lint-continue-on-error-tracking (pull_request) Successful in 1m21s
lint-required-context-exists-in-bp / lint-required-context-exists-in-bp (pull_request) Successful in 1m25s
Lint workflow YAML (Gitea-1.22.6-hostile shapes) / Lint workflow YAML for Gitea-1.22.6-hostile shapes (pull_request) Successful in 1m25s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 16s
Harness Replays / Harness Replays (pull_request) Successful in 4s
CI / Canvas (Next.js) (pull_request) Successful in 6m9s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 3s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 52s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Failing after 1m27s
CI / Python Lint & Test (pull_request) Successful in 7m11s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
E2E Chat / E2E Chat (pull_request) Failing after 6m11s
7c751ef675
Add internal/audit with single Emit(ctx, event_type, fields) entrypoint
that ships JSON-encoded records via two transports:

  1. audit:-prefixed stdout line — tenant Vector docker-logs source
     already ships this to Loki. No obs-stack change required.
  2. Best-effort append to /var/log/molecule-audit.jsonl — durable
     forensic copy, target for the dedicated Vector file source in
     Phase 2.

Schema is stable v1 (ts, event_type, workspace_id, user_id, actor_kind,
correlation_id, fields). Cardinality budget keeps workspace_id +
user_id + correlation_id OUT of Loki labels (JSON body only) — fleet
active-stream count ~200, well within Loki headroom.

Phase 1 wires secret.set and secret.delete on the workspace-scoped
(POST/PUT/DELETE /workspaces/:id/secrets) and admin-scoped (POST/DELETE
/admin/secrets, /settings/secrets) handlers. value_hash is the first 8
hex chars of sha256(value) — never the raw value.

Tests cover: stdout emit, JSONL append, file-failure fallback,
concurrent integrity, hash bounds, raw-value-never-emitted contract.
Vet + handler-secret tests pass.

See: rfc internal/rfcs/audit-log-to-loki.md
core-devops added 1 commit 2026-05-19 20:00:29 +00:00
test(audit): bind HashValuePrefix calls to vars to satisfy staticcheck SA4000
Lint shellcheck (arm64 pilot) / shellcheck-arm64 (pilot) (pull_request) Waiting to run
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 16s
CI / Detect changes (pull_request) Successful in 20s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 24s
E2E API Smoke Test / detect-changes (pull_request) Successful in 10s
E2E Chat / detect-changes (pull_request) Successful in 9s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 9s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 4s
Harness Replays / detect-changes (pull_request) Successful in 8s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 10s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 7s
lint-continue-on-error-tracking / lint-continue-on-error-tracking (pull_request) Successful in 34s
lint-required-context-exists-in-bp / lint-required-context-exists-in-bp (pull_request) Successful in 37s
lint-required-workflows-docker-host-pinned / Lint docker-host pin on docker-touching workflows (pull_request) Successful in 5s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 26s
Lint pre-flip continue-on-error / Verify continue-on-error flips have run-log proof (pull_request) Successful in 1m19s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 9s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 6s
gate-check-v3 / gate-check (pull_request) Successful in 5s
qa-review / approved (pull_request) Failing after 4s
security-review / approved (pull_request) Failing after 3s
sop-checklist / na-declarations (pull_request) N/A: (none)
sop-checklist / all-items-acked (pull_request) Successful in 4s
sop-tier-check / tier-check (pull_request) Successful in 4s
Lint workflow YAML (Gitea-1.22.6-hostile shapes) / Lint workflow YAML for Gitea-1.22.6-hostile shapes (pull_request) Successful in 1m20s
CI / Platform (Go) (pull_request) Successful in 5m2s
CI / Canvas (Next.js) (pull_request) Successful in 6m40s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 7s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Failing after 1m26s
Harness Replays / Harness Replays (pull_request) Successful in 7s
E2E Chat / E2E Chat (pull_request) Failing after 1m20s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 2s
CI / Python Lint & Test (pull_request) Successful in 7m27s
CI / all-required (pull_request) Successful in 7m29s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 1m30s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
b5b95de19a
Staticcheck SA4000 flagged the stability assertion as tautological (identical expressions on both sides of !=). Bind both calls to local vars to preserve test intent (call-stability) and silence the linter. No functional change.

Follow-up to mc#1572 review (core-devops lens).
Member

/qa-recheck

/qa-recheck
Member

/security-recheck

/security-recheck
core-devops added 1 commit 2026-05-19 20:17:32 +00:00
Merge branch 'main' into infra-sre/audit-log-phase1-emit-secrets
Lint shellcheck (arm64 pilot) / shellcheck-arm64 (pilot) (pull_request) Waiting to run
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 4s
CI / Detect changes (pull_request) Successful in 9s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 7s
E2E API Smoke Test / detect-changes (pull_request) Successful in 9s
E2E Chat / detect-changes (pull_request) Successful in 9s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 8s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 6s
Harness Replays / detect-changes (pull_request) Successful in 6s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 9s
Lint no tenant GITEA/GITHUB token write / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 11s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 15s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 15s
gate-check-v3 / gate-check (pull_request) Successful in 8s
qa-review / approved (pull_request) Successful in 5s
security-review / approved (pull_request) Successful in 4s
sop-checklist / na-declarations (pull_request) N/A: (none)
sop-tier-check / tier-check (pull_request) Successful in 6s
sop-checklist / all-items-acked (pull_request) Successful in 8s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m15s
CI / Platform (Go) (pull_request) Successful in 5m45s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 10s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Failing after 1m56s
Harness Replays / Harness Replays (pull_request) Successful in 6s
CI / Python Lint & Test (pull_request) Successful in 7m8s
CI / Canvas (Next.js) (pull_request) Successful in 8m40s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 2m53s
CI / all-required (pull_request) Successful in 8m42s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 10s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
audit-force-merge / audit (pull_request) Successful in 13s
E2E Chat / E2E Chat (pull_request) Failing after 7m43s
dd4bba8913
core-security approved these changes 2026-05-19 20:18:14 +00:00
core-security left a comment
Member

5-axis (core-security):

  1. Secrets: This is the entire POINT of the PR — structured audit emit MUST NOT log raw secret values. Verified:

    • secrets.go Set/SetGlobal call audit.HashValuePrefix(body.Value, 8) → sha256 prefix only.
    • Delete events emit key (the name) but not any value.
    • TestEmit_NeverIncludesSecretValues_OnlyHash is a contract test guarding against regression.
    • Audit lines pass through log.Printf("audit: %s", payload) — no separate transport that bypasses the hash.
  2. Privilege: New audit package: no new I/O endpoints, no DB writes, no auth-related changes. Best-effort emission (never errors out of request path) — emit cannot become a DoS vector if Loki/filesystem are unreachable. mode 0o640 on the JSONL file.

  3. Input handling: Fields map carries event-specific payload; the per-event fields go via json.Marshal (no string interpolation). marshal_err fallback emits a degraded marker so the event boundary stays visible even on malformed input. No reflection-via-untrusted-data.

  4. Tenant isolation: All four wired sites (Set / Delete / SetGlobal / DeleteGlobal) emit workspace_id (workspace scope) or admin actor_kind (global scope). Loki labels stay LOW-cardinality (event_type top-20); workspace_id + user_id are inside JSON body, NOT labels (avoids the 100k-stream chunk limit class incident). Aligned with reference_obs_system_access label posture.

  5. Supply-chain: New imports: crypto/sha256, encoding/hex (stdlib) and the package itself. No third-party additions, no go.mod changes (verified — no go.mod diff in PR).

Staticcheck SA4000 fixup landed at b5b95de via core-devops (test/var rebinding, behavior-preserving).

APPROVE.

5-axis (core-security): 1. **Secrets**: This is the entire POINT of the PR — structured audit emit MUST NOT log raw secret values. Verified: - `secrets.go` Set/SetGlobal call `audit.HashValuePrefix(body.Value, 8)` → sha256 prefix only. - Delete events emit `key` (the name) but not any value. - `TestEmit_NeverIncludesSecretValues_OnlyHash` is a contract test guarding against regression. - Audit lines pass through `log.Printf("audit: %s", payload)` — no separate transport that bypasses the hash. 2. **Privilege**: New `audit` package: no new I/O endpoints, no DB writes, no auth-related changes. Best-effort emission (never errors out of request path) — emit cannot become a DoS vector if Loki/filesystem are unreachable. mode 0o640 on the JSONL file. 3. **Input handling**: `Fields` map carries event-specific payload; the per-event fields go via `json.Marshal` (no string interpolation). `marshal_err` fallback emits a degraded marker so the event boundary stays visible even on malformed input. No reflection-via-untrusted-data. 4. **Tenant isolation**: All four wired sites (Set / Delete / SetGlobal / DeleteGlobal) emit workspace_id (workspace scope) or admin actor_kind (global scope). Loki labels stay LOW-cardinality (event_type top-20); workspace_id + user_id are inside JSON body, NOT labels (avoids the 100k-stream chunk limit class incident). Aligned with `reference_obs_system_access` label posture. 5. **Supply-chain**: New imports: `crypto/sha256`, `encoding/hex` (stdlib) and the package itself. No third-party additions, no go.mod changes (verified — no go.mod diff in PR). Staticcheck SA4000 fixup landed at b5b95de via core-devops (test/var rebinding, behavior-preserving). APPROVE.
core-qa approved these changes 2026-05-19 20:18:14 +00:00
core-qa left a comment
Member

5-axis (core-qa, test/verification lens):

  1. Test coverage: 8 tests in emit_test.go covering:

    • Audit-prefix line on stdout (basic shape)
    • JSONL append + integrity
    • Default actor=user
    • Fields workspace_id overrides ctx + dedup
    • Secret-value NEVER in line (contract test, the SA-relevant one)
    • File-append failure does NOT block stdout (best-effort split-transport)
    • Concurrent emit no interleaved lines (50 goroutines)
    • HashValuePrefix stable + bounded (clamp lo/hi)
  2. Schema stability: Wire fields documented as a stable contract ("extend by appending; never rename"). RFC3339Nano UTC timestamps. record struct keeps field order stable.

  3. Failure-mode tests: TestEmit_FileAppendFailureDoesNotBlockStdout points at /proc/this/is/not/writable and asserts stdout still fires. This is critical because the package documents "MUST NOT fail the user request".

  4. Concurrency: TestEmit_Concurrent_NoInterleavedLines (50-goroutine fan-out) validates the fileMu lock against half-line interleave. Asserts every line is valid JSON post-write — direct test of the JSONL invariant.

  5. Staticcheck SA4000: Original HashValuePrefix("x", 8) != HashValuePrefix("x", 8) was flagged tautological. Fixup commit b5b95de1 binds both calls to vars (a,b) before comparing — test intent (call-stability) preserved, lint silenced. CI/Platform (Go) green at b5b95de.

APPROVE.

5-axis (core-qa, test/verification lens): 1. **Test coverage**: 8 tests in emit_test.go covering: - Audit-prefix line on stdout (basic shape) - JSONL append + integrity - Default actor=user - Fields workspace_id overrides ctx + dedup - Secret-value NEVER in line (contract test, the SA-relevant one) - File-append failure does NOT block stdout (best-effort split-transport) - Concurrent emit no interleaved lines (50 goroutines) - HashValuePrefix stable + bounded (clamp lo/hi) 2. **Schema stability**: Wire fields documented as a stable contract ("extend by appending; never rename"). RFC3339Nano UTC timestamps. record struct keeps field order stable. 3. **Failure-mode tests**: `TestEmit_FileAppendFailureDoesNotBlockStdout` points at `/proc/this/is/not/writable` and asserts stdout still fires. This is critical because the package documents "MUST NOT fail the user request". 4. **Concurrency**: `TestEmit_Concurrent_NoInterleavedLines` (50-goroutine fan-out) validates the `fileMu` lock against half-line interleave. Asserts every line is valid JSON post-write — direct test of the JSONL invariant. 5. **Staticcheck SA4000**: Original `HashValuePrefix("x", 8) != HashValuePrefix("x", 8)` was flagged tautological. Fixup commit b5b95de1 binds both calls to vars (a,b) before comparing — test intent (call-stability) preserved, lint silenced. CI/Platform (Go) green at b5b95de. APPROVE.
core-devops merged commit 01226cfc73 into main 2026-05-19 20:30:26 +00:00
Sign in to join this conversation.
4 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#1572