Slack adapter: adds chat.postMessage mode alongside legacy webhooks.
When bot_token is configured, uses chat:write.customize for per-agent
display name + emoji on every message. Each of the 15 active agents
posts with a distinct identity (PM 💼, Backend ⚙️, etc.).
5 channels configured:
#mol-engineering — PM, Dev Lead, Frontend, Backend, QA, Security, UIUX, Docs
#mol-research — Research Lead, Market Analyst, Tech Researcher, Competitive Intel
#mol-ops — DevOps, Triage, Offensive Security
#mol-ceo-feed — PM synthesized rollup (CEO-facing)
#mol-firehose — all agents (raw feed)
Tested live: 5 test messages across 4 channels, all ok=true.
pgvector migration: moved ALTER TABLE + CREATE INDEX inside the DO
block so the entire migration is skipped when pgvector extension is
unavailable (was crashing platform on restart — the guard caught
CREATE EXTENSION but execution continued to ALTER TABLE which used
the non-existent vector type).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The migration SQL is read as raw SQL (not through Go fmt.Sprintf),
so %% is two parameters, not an escaped percent. Postgres RAISE
uses single % for parameter substitution.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The ALTER TABLE and CREATE INDEX referenced vector(1536) outside the
exception-handling DO block, so when pgvector wasn't installed they
crashed the migration runner — blocking ALL E2E runs on main.
Fix: move all DDL inside the single DO block so the EXCEPTION handler
catches any pgvector-related failure and skips the entire migration.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds step-level checkpoint storage so workflows can resume from the
last completed step after a crash or restart without replaying prior work.
- Migration: `workflow_checkpoints` table — workspace_id (FK + CASCADE),
workflow_id, step_name, step_index, completed_at, payload JSONB.
UNIQUE(workspace_id, workflow_id, step_name) + covering index on
(workspace_id, workflow_id, completed_at DESC).
- Handlers (platform/internal/handlers/checkpoints.go):
POST /workspaces/:id/checkpoints — upsert via ON CONFLICT DO UPDATE
GET /workspaces/:id/checkpoints/:wfid — list steps ordered step_index DESC
DELETE /workspaces/:id/checkpoints/:wfid — clear on clean shutdown (404 if none)
- Router: all three routes on the wsAuth group (WorkspaceAuth middleware);
workspace A's token cannot reach workspace B's checkpoints.
- Tests (11 cases, sqlmock + race-safe): upsert-insert, upsert-update,
payload forwarding, list-ordered, list-not-found, rows.Err() → 500,
delete-success, delete-not-found, callerMismatch 403 on all 3 endpoints.
Closes#788. Parent: #583-1.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Post-mortem fix: UIUX Designer ran 22 cron fires over 23 hours with
every single response being empty or '(no response generated)'. The
scheduler reported status=ok because the HTTP call succeeded — nobody
caught it until the CEO asked.
Changes:
- Migration 032: adds consecutive_empty_runs INT to workspace_schedules
- scheduler.go: captures response body from ProxyA2ARequest (was _),
checks for empty/sentinel markers via isEmptyResponse(), increments
consecutive_empty_runs on empty ok responses, resets on non-empty.
When consecutive_empty_runs >= 3, sets last_status='stale' with a
descriptive error message.
The 'stale' status is surfaced via:
- GET /admin/schedules/health (merged in #671)
- PM's silence detector (companion fix in org-template PR)
- Maintenance loop response-body sampling (operator-side fix)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rebase of feat/issue-576-pgvector-semantic-memory onto current main,
preserving the #767 security layer (globalMemoryDelimiter + GLOBAL audit
log) that predates this branch.
Changes layered on top of main:
- Migration 031: embedding vector(1536) column + ivfflat cosine-ops index
(renumbered from 029 — 029/030 were taken by workspace-hibernation and
audit-events)
- Commit: embed-on-write after INSERT, non-fatal on embedding failure
- Search: semantic cosine-distance path when EmbeddingFunc is wired up;
falls back to FTS/ILIKE; GLOBAL delimiter wrapping applies on both paths
- EmbeddingFunc injection pattern; WithEmbedding chainable builder
All security invariants preserved:
- globalMemoryDelimiter wrapping on GLOBAL scope in both semantic + FTS
- GLOBAL write audit log (SHA-256 forensic trail) in Commit
- TestRecallMemory_GlobalScope_HasDelimiter passes
- TestMemoriesCommit_Global_AsRoot passes
- 3 new pgvector tests pass
Co-authored-by: molecule-ai[bot] <276602405+molecule-ai[bot]@users.noreply.github.com>
PR #724 (workspace hibernation) claimed migration number 029.
Renaming to 030 to resolve the sequence collision before merging #651.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements automatic workspace hibernation for workspaces that have been idle
longer than their configured hibernation_idle_minutes threshold.
Changes:
- migrations/029: Add hibernation_idle_minutes INT DEFAULT NULL column +
partial index on workspaces table
- registry/hibernation.go: New StartHibernationMonitor goroutine that ticks
every 2 min and calls hibernateIdleWorkspaces via the HibernateHandler
callback (same import-cycle-prevention pattern as OfflineHandler)
- registry/hibernation_test.go: 5 unit tests covering handler calls, no-rows,
DB error, tick behaviour, and context-cancel shutdown
- handlers/workspace_restart.go: New Hibernate() HTTP handler (POST
/workspaces/:id/hibernate) + HibernateWorkspace(ctx, id) method — stops
container, sets status='hibernated', clears Redis keys, broadcasts event
- handlers/a2a_proxy.go: Auto-wake in resolveAgentURL — when status='hibernated'
and URL is empty, triggers async RestartByID and returns 503 + Retry-After: 15
so callers can retry transparently
- registry/liveness.go: Exclude 'hibernated' workspaces from offline detection
- router.go: Register POST /workspaces/:id/hibernate under wsAuth group
- cmd/server/main.go: Wire hibernation monitor via supervised.RunWithRecover
Closes#711
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Migration 028 declared workspace_id as TEXT with a FK to workspaces(id)
which is UUID. Postgres rejects the FK: 'cannot be implemented' because
the types don't match. Same class of bug as #646 (which fixed 025).
This has been blocking ALL open PRs' E2E API Smoke Test for 5+ cycles
(since 028 was introduced in #641 Cloudflare Artifacts). Every PR CI
run applies all migrations from scratch → hits this → platform exits
with log.Fatalf → /health never responds → 30s timeout → FAIL.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PR #641 (workspace_artifacts) already claimed 028 on main.
Rename both .up.sql and .down.sql to 029_audit_events.* to avoid
the collision when this branch merges.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Replace == HMAC comparisons with hmac.compare_digest (Python) and
hmac.Equal (Go) in ledger.py, verify.py, and audit.go to prevent
timing oracle attacks (Fixes 1-6)
- Increase PBKDF2 iterations from 100K to 210K in both ledger.py and
audit.go — must match for cross-language verification (Fix 7)
- Return chain_valid: null when offset > 0 (paginated views cannot
verify a truncated chain; null means "not computed") (Fix 8)
- Remove module-level AUDIT_LEDGER_SALT attribute from ledger.py; read
the secret exclusively from os.environ inside _get_hmac_key() so the
salt is not exposed in the module namespace (Fix 9)
- Update tests: use monkeypatch.setenv/delenv instead of setattr on the
removed AUDIT_LEDGER_SALT attribute; update testAuditKey helper to
use 210K iterations; add TestAuditQuery_PaginatedOffsetReturnsNullChainValid
- Fix migration 028: workspace_id column type TEXT → UUID to match
workspaces.id UUID primary key
All tests pass: 1043 pytest + 0 Go test failures.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Migrations 025 + 026 declared workspace_id/org_id as TEXT but
workspaces.id is UUID — Postgres rejects the FK constraint, crashing
every E2E run on main since these migrations were merged.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add a minimal but complete integration with the Cloudflare Artifacts API
(private beta Apr 2026, public beta May 2026) — "Git for agents" versioned
workspace-snapshot storage.
## What's included
**`platform/internal/artifacts/client.go`** — typed Go HTTP client for the
CF Artifacts REST API:
- CreateRepo, GetRepo, ForkRepo, ImportRepo, DeleteRepo
- CreateToken, RevokeToken
- CF v4 response-envelope decoding; *APIError with StatusCode + Message
**`platform/internal/handlers/artifacts.go`** — four workspace-scoped
Gin handlers (all behind WorkspaceAuth middleware):
- POST /workspaces/:id/artifacts — attach or import a CF Artifacts repo
- GET /workspaces/:id/artifacts — get linked repo info (DB + live CF)
- POST /workspaces/:id/artifacts/fork — fork the workspace's repo
- POST /workspaces/:id/artifacts/token — mint a short-lived git credential
**`platform/migrations/028_workspace_artifacts.up.sql`** — `workspace_artifacts`
table: one-to-one link between a workspace and its CF Artifacts repo.
Credentials are never stored; only the credential-stripped remote URL.
**`platform/internal/router/router.go`** — wire the four routes into the
existing wsAuth group.
## Configuration
Two env vars gate the feature (returns 503 when either is absent):
- CF_ARTIFACTS_API_TOKEN — Cloudflare API token with Artifacts write perms
- CF_ARTIFACTS_NAMESPACE — Cloudflare Artifacts namespace name
## Tests
- 10 client-level tests (httptest.Server + CF v4 envelope mocks)
- 14 handler-level tests (sqlmock DB + mock CF server)
- Helper unit tests for stripCredentials, cfErrToHTTP
All 21 packages pass (go test ./...).
Closes#595
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Rebase on origin/fix/issue-631-migration-gap which inserts token_usage
(025) and org_plugin_allowlist (026); bump workspace_budget from 025 to
027 so the sequential runner applies all three in the correct order.
Update workspace_budget_test.go and workspace_test.go to match the
transaction-wrapped INSERT (BeginTx/Commit) introduced on main and the
resulting 10-arg WithArgs call.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Budget limit exceeded on A2A proxy now returns HTTP 402 PaymentRequired
instead of 429 TooManyRequests, matching the issue spec and the FE amber
banner check. Updates a2a_proxy.go, workspace_budget_test.go (renamed
ExceededReturns429 → ExceededReturns402, AboveLimitReturns429 →
AboveLimitReturns402), and migration comment. All go test ./... pass.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Migration 025: ADD COLUMN budget_limit BIGINT DEFAULT NULL and
monthly_spend BIGINT NOT NULL DEFAULT 0 to workspaces table
- Models: BudgetLimit *int64 in CreateWorkspacePayload;
MonthlySpend int64 in HeartbeatPayload
- workspace.go: scanWorkspaceRow, workspaceListQuery, Get, Create, and
Update all handle budget_limit/monthly_spend; budget_limit is gated
as a sensitiveUpdateField
- registry.go: heartbeat conditionally writes monthly_spend only when
payload.MonthlySpend > 0 (avoids overwriting with zero)
- a2a_proxy.go: checkWorkspaceBudget() returns 429 when
monthly_spend >= budget_limit (NULL = no limit; fail-open on DB error)
- Tests: 8 new workspace_budget_test.go tests + patched existing tests
for the 20-column scanWorkspaceRow and 10-param CREATE INSERT
Field type: BIGINT (int64), units: USD cents (budget_limit=500 = $5.00/month)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add an org-scoped allowlist table so org admins can restrict which plugins
workspace agents are allowed to install. An empty allowlist means
allow-all (backward-compatible with existing deployments).
• migrations/027_org_plugin_allowlist.{up,down}.sql — new table + unique
index on (org_id, plugin_name)
• handlers/org_plugin_allowlist.go — resolveOrgID, checkOrgPluginAllowlist
(fail-open on DB errors), GetAllowlist, PutAllowlist (atomic tx replace)
• handlers/org_plugin_allowlist_test.go — 23 unit tests covering all
handler paths, resolveOrgID, and all checkOrgPluginAllowlist branches
• handlers/plugins_install.go — allowlist gate between resolveAndStage and
deliverToContainer; returns 403 if plugin is blocked
• router/router.go — GET/PUT /orgs/:id/plugins/allowlist under AdminAuth
All tests pass; go build ./... clean; gosec Issues: 0
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Migration 026 adds workspace_token_usage table (uuid pk, workspace_id FK with
CASCADE, period_start TIMESTAMPTZ, input_tokens, output_tokens, call_count,
estimated_cost_usd NUMERIC(12,6), updated_at) with a UNIQUE index on
(workspace_id, period_start) for day-granularity upserts.
A2A proxy (proxyA2ARequest) now spawns a detached goroutine after each
successful call to extractAndUpsertTokenUsage, which:
1. Parses usage.input_tokens / usage.output_tokens from result.usage
(JSON-RPC wrapper) with fallback to top-level usage (direct Anthropic).
2. Calls upsertTokenUsage — INSERT ... ON CONFLICT DO UPDATE so multi-
call days accumulate correctly. Estimated cost = input×$0.000003 +
output×$0.000015 (Claude Sonnet default; adjustable in a later phase).
Token tracking never blocks the critical A2A path.
New endpoint: GET /workspaces/:id/metrics (wsAuth — WorkspaceAuth bearer
bound to :id). Returns:
{"input_tokens":N,"output_tokens":N,"total_calls":N,
"estimated_cost_usd":"0.000000","period_start":"...","period_end":"..."}
404 if workspace missing. Period is current UTC day.
11 new tests (4 handler + 7 parse-unit); 19/19 packages pass.
Closes#593
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
New ChannelAdapter implementation for Lark (international, open.larksuite.com)
and Feishu (China, open.feishu.cn). Both speak the same payload format —
only the host differs — so a single adapter covers both.
Outbound: POST text to a Custom Bot webhook URL with msg_type:"text".
Lark returns 200 OK even when delivery fails — the body's `code` field is
the truth. Adapter parses the response and returns a Go error when
code != 0 so callers don't think a revoked-webhook send succeeded.
Inbound: handles both v1 url_verification (handshake) and v2 event_callback
(im.message.receive_v1) shapes. Optional verify_token field — when set,
inbound payloads with mismatching tokens are rejected via constant-time
compare (#337 class — never raw == against a stored secret).
Sender ID resolution prefers user_id → falls back to open_id (open_id is
always present; user_id only when the bot has the contacts permission).
Non-text message types and non-message events return nil, nil so the
receiver responds 200 OK without dispatching.
Tests: 23 cases — identity, ValidateConfig (6 sub-cases incl. URL prefix
matrix), SendMessage (no URL / invalid prefix / happy-path body shape /
api-error-code surfacing), ParseWebhook (handshake + token mismatch +
text message + open_id fallback + non-message + non-text + token mismatch
+ malformed JSON + malformed content + empty text), StartPolling no-op,
registry presence.
Also: make migration 023 idempotent (ADD COLUMN IF NOT EXISTS) — the
platform's migration runner has no schema_migrations tracking table, so
every .up.sql replays on every boot. Without IF NOT EXISTS the second
boot against an existing volume crashes with "column already exists".
Followup issue to be filed for proper migration tracking.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add an optional channel_budget (INTEGER, nullable) to workspace_channels
via migration 024. When channel_budget IS NOT NULL and message_count has
reached the budget, the Send handler returns 429 {"error":"channel budget
exceeded"} and aborts before calling SendOutbound.
Implementation details:
- Single SELECT query reads both message_count and channel_budget in one
round-trip (avoids TOCTOU window between read and write)
- Fail-open on DB error: transient failures log but don't block sends
- Early-return on budget hit is before SendOutbound so message_count
cannot be incremented past the limit by a concurrent send that slips
through the window (best-effort; atomic enforcement requires DB-level CAS)
- NULL channel_budget = unlimited (default, backward-compatible)
Migration is idempotent (ADD COLUMN IF NOT EXISTS). Down migration drops
the column cleanly.
Four sqlmock tests cover: at-limit → 429, above-limit → 429, NULL budget
passes through, under-limit passes through.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Closes the silent-overwrite hole where two agents racing a read-modify-
write on the same memory key left only one agent's update. Relevant for
orchestrators (PM, Dev Lead, Marketing Lead) keeping structured running
state (delegation-result ledgers, task queues) in memory, and for the
``research-backlog:*`` keys that multiple idle loops write in parallel.
## Semantics
### Back-compat path (no if_match_version)
Unchanged: ``INSERT ... ON CONFLICT UPDATE`` last-write-wins. Every
existing agent tool, every existing ``commit_memory`` call, every
existing cron that writes memory — all continue to work with no edit.
### Optimistic-lock path (if_match_version set)
1. Client calls ``GET /memory/:key`` → ``{value, version: V}``
2. Client modifies value locally
3. Client ``POST /memory {key, value, if_match_version: V}``
4. Server: ``UPDATE ... WHERE version = V`` + RETURNING new version
5. On match → 200 + ``{version: V+1}``
6. On mismatch → 409 + ``{expected_version: V, current_version: <actual>}``
7. Client reads the actual version and retries.
### Create-only marker
``if_match_version: 0`` means "create iff the key doesn't exist yet".
Two agents simultaneously seeding a shared key will see exactly one
success + one 409 — no silent collision, no duplicate-init work.
### Schema
Migration 023 adds ``version BIGINT NOT NULL DEFAULT 1``. Existing rows
baseline at 1. New rows start at 1. Every successful write (both paths)
increments: ``version = version + 1`` on update, ``1`` on insert.
## Why version, not updated_at
``updated_at`` has second-granularity and can collide between concurrent
writers on a fast clock. A monotonic counter is collision-free and more
readable in the 409 response body ("expected 5, current is 7 — you
missed 2 writes" tells an agent exactly what to re-read).
## Why ``if_match_version`` and not an ETag header
JSON field keeps it in the request body, visible alongside the value
payload. Agents assembling requests programmatically don't have to
remember to thread a header through their HTTP client wrapper; the
existing ``commit_memory`` tool can grow one optional kwarg and match
the existing signature shape.
## Tests
11 memory-handler cases covering every path:
- GET list / get (with version in response shape)
- Set with no version (back-compat upsert, returns new version)
- Set with if_match_version match (happy path, increment)
- Set with if_match_version mismatch (409 + expected/current fields)
- Set with if_match_version=0 on absent key (create-only success)
- Set with if_match_version=N on absent key (409 — caller's mental
model is wrong)
- Bad inputs (missing key, malformed JSON)
- Delete happy + error path
Full ``go test ./internal/handlers/`` green.
## Follow-up (not in this PR)
- Workspace-template tool update: ``commit_memory(content, *,
if_match_version=None)`` surfaces the new option + on 409 surfaces
the current_version so agents can retry without manual re-read.
- Named checkpoints table (``workspace_checkpoints``) for durable
orchestrator state snapshots. Different concern than per-key locking;
separate PR.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Addresses code-review warnings on PR #76:
- Migration 022 now backfills pre-existing workspace_schedules rows to
source='template' before flipping NOT NULL + DEFAULT 'runtime'. Legacy
rows (all seeded via org/import historically) stay refreshable on
re-import. Down migration drops the CHECK constraint too.
- Extracted the import UPSERT into const orgImportScheduleSQL so the shape
test asserts against the const directly instead of file-scraping org.go.
Removed the os.ReadFile helper.
- scheduleResponse.Source gets json:\",omitempty\" so old clients that
predate the migration don't see an empty string they can't explain.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolves#24 per CEO direction.
DB is source of truth for workspace_schedules. POST /org/import becomes
idempotent — only touches rows it owns (source='template'); runtime-added
schedules (Canvas / API) are preserved across re-imports.
- Migration 022: adds source TEXT NOT NULL DEFAULT 'runtime' CHECK in
('template','runtime'); unique index on (workspace_id, name) so the
org/import upsert can use ON CONFLICT.
- org.go: schedule INSERT becomes
INSERT ... 'template' ON CONFLICT (workspace_id, name) DO UPDATE
SET ... WHERE workspace_schedules.source='template'.
Never DELETEs.
- schedules.go: runtime POST writes 'runtime' explicitly; List handler
surfaces the source field on the response so Canvas can render badges.
- 3 new unit tests assert source='runtime' default for runtime CRUD,
the SQL shape contract for org/import (additive + idempotent +
runtime-preserving + never-DELETE), and List response surface.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>