When a cron fires, the scheduler now fetches the last 10 messages from
the workspace's Slack channel via conversations.history and prepends them
to the cron prompt as '[Slack channel context — recent team messages]'.
This gives each agent ambient awareness of what peers are doing:
- Backend sees Frontend posted 'PR #840 ready for review' → can check
- Security Auditor sees Backend posted 'new endpoint added' → plans review
- PM sees all engineering activity → better synthesis in rollup
Implementation:
- slack.go: FetchChannelHistory() calls conversations.history, filters
out the bot's own messages, and returns the last N as SlackHistoryMessage structs
- manager.go: FetchWorkspaceChannelContext() looks up the workspace's
Slack config, fetches history, formats as readable context block
- scheduler.go: ChannelBroadcaster interface extended with
FetchWorkspaceChannelContext; fireSchedule injects context before
the cron prompt (prepended, not appended, so the agent sees team
context BEFORE its task instructions)
Best-effort: if the Slack API call fails or the workspace has no channels,
the prompt is left unchanged. Each message is truncated to 200 chars and at
most 10 messages are included, which keeps prompt overhead bounded.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Code review findings addressed:
Critical:
1. Bot echo loop: add bot_id + subtype='bot_message' check in ParseWebhook
to prevent outbound auto-posts from triggering inbound → infinite loop
2. Connection leak: close resp.Body immediately after reading instead of
defer inside loop (was holding N connections open for N chunks)
3. Cancelled context: auto-post goroutine now uses context.Background()
with 30s timeout instead of inheriting fireCtx (which gets cancelled
by deferred cancel() when fireSchedule returns)
4. Slug validation: regex ^[a-zA-Z0-9 _-]+$ rejects path traversal and
special chars in [slug] routing
Improvements:
5. Shared HTTP client (slackHTTPClient) for connection pooling instead of
per-request &http.Client{}
6. Rune-safe truncation in BroadcastToWorkspaceChannels for CJK/emoji
7. Log async HandleInbound errors instead of silently discarding
8. url_verification challenge properly returned (c.JSON with challenge)
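Finding 1 (the bot echo check) can be sketched roughly as follows. The struct
and function names are hypothetical and this is not the actual ParseWebhook
code; the JSON fields (bot_id, subtype, and the event_callback envelope type)
are the ones Slack's Events API sends.

```go
package main

import "encoding/json"

// slackEvent holds only the fields needed for the echo check.
type slackEvent struct {
	Type    string `json:"type"`
	Subtype string `json:"subtype"`
	BotID   string `json:"bot_id"`
	Text    string `json:"text"`
}

// slackEnvelope is the outer Events API payload.
type slackEnvelope struct {
	Type      string     `json:"type"`
	Challenge string     `json:"challenge"`
	Event     slackEvent `json:"event"`
}

// shouldRoute reports whether an Events API payload should be routed to a
// workspace. Anything posted by a bot is dropped so that the platform's own
// auto-posts cannot come back in as inbound messages.
func shouldRoute(body []byte) (bool, error) {
	var env slackEnvelope
	if err := json.Unmarshal(body, &env); err != nil {
		return false, err
	}
	if env.Type != "event_callback" {
		return false, nil // e.g. url_verification is handled separately
	}
	if env.Event.BotID != "" || env.Event.Subtype == "bot_message" {
		return false, nil // bot-originated: would cause an echo loop
	}
	return env.Event.Text != "", nil
}
```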
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Humans type [backend] what's #800? in a shared #mol-engineering channel
and the message routes specifically to Backend Engineer's workspace.
Matching logic (case-insensitive):
[pm] → PM
[backend] → Backend Engineer
[dev-lead] → Dev Lead
[security] → Security Auditor (prefix match on 'security-auditor')
Unknown slugs return the available agent list for that channel so the
user knows what slugs are valid.
Messages without a [slug] prefix route to the first matching workspace
(backward compat with Level 2).
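The matching rules above can be sketched as below. The helper names and the
hyphenated slug forms for the agents are assumptions (only 'security-auditor'
appears in this message), as is trying an exact match before the prefix match;
the character class is the one from the slug validation rule.

```go
package main

import (
	"regexp"
	"strings"
)

// slugPrefix matches a leading "[slug]" using the validated character class.
var slugPrefix = regexp.MustCompile(`^\s*\[([a-zA-Z0-9 _-]+)\]\s*`)

// splitSlug returns the lower-cased slug and the remaining message text.
// ok is false when the message has no valid [slug] prefix.
func splitSlug(msg string) (slug, rest string, ok bool) {
	m := slugPrefix.FindStringSubmatch(msg)
	if m == nil {
		return "", msg, false
	}
	return strings.ToLower(strings.TrimSpace(m[1])), msg[len(m[0]):], true
}

// matchAgent resolves a slug against the channel's agent slugs,
// case-insensitively: exact match first, then prefix match.
func matchAgent(slug string, agents []string) (string, bool) {
	if slug == "" {
		return "", false
	}
	for _, a := range agents {
		if strings.EqualFold(a, slug) {
			return a, true
		}
	}
	for _, a := range agents {
		if strings.HasPrefix(strings.ToLower(a), slug) {
			return a, true
		}
	}
	return "", false
}
```

When matchAgent reports no match, the caller would reply with the agent list
for that channel rather than routing anywhere.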
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Level 1 — Auto-post cron output to Slack:
- scheduler.go: captures A2A response body, extracts agent text via
extractResponseSummary(), broadcasts to workspace's configured Slack
channels on successful non-empty cron completions
- manager.go: adds BroadcastToWorkspaceChannels() — fans out to all
enabled channels for a workspace (engineering+firehose for eng agents,
research+firehose for research agents, etc.)
- main.go: wires scheduler → channel manager via SetChannels()
- Truncates output to 500 chars for Slack readability
Level 2 — Inbound Slack messages route to workspaces:
Already implemented by the existing webhook handler (POST /webhooks/slack)
+ the ParseWebhook method in slack.go which handles both Events API JSON
payloads and slash command form-encoded payloads. Requires the Slack App's
Events API Request URL to be set to: https://<platform-host>/webhooks/slack
Also in this commit:
- slack.go: dual-mode adapter (bot_token + webhook fallback)
- 031 migration: pgvector guard wraps entire DO block
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Slack adapter: adds chat.postMessage mode alongside legacy webhooks.
When bot_token is configured, uses chat:write.customize for per-agent
display name + emoji on every message. Each of the 15 active agents
posts with a distinct identity (PM 💼, Backend ⚙️, etc.).
5 channels configured:
#mol-engineering — PM, Dev Lead, Frontend, Backend, QA, Security, UIUX, Docs
#mol-research — Research Lead, Market Analyst, Tech Researcher, Competitive Intel
#mol-ops — DevOps, Triage, Offensive Security
#mol-ceo-feed — PM synthesized rollup (CEO-facing)
#mol-firehose — all agents (raw feed)
Tested live: 5 test messages across 4 channels, all ok=true.
pgvector migration: moved ALTER TABLE + CREATE INDEX inside the DO
block so the entire migration is skipped when pgvector extension is
unavailable (was crashing platform on restart — the guard caught
CREATE EXTENSION but execution continued to ALTER TABLE which used
the non-existent vector type).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces the racy SELECT-then-Stop two-step in HibernateWorkspace with a
three-step pattern whose first step is an atomic claim, eliminating the
TOCTOU window (SAFE-819):
1. Atomic claim: a single UPDATE to status='hibernating' WHERE id=$1
   AND status IN ('online','degraded')
   AND active_tasks = 0
   rowsAffected=0 means another caller already claimed it or tasks
   arrived; we abort immediately without calling Stop.
2. provisioner.Stop: safe because status='hibernating' blocks new task
routing between step 1 and step 2 (no new task can be dispatched).
3. Final UPDATE to 'hibernated': records the completed hibernation.
Also adds stopFnOverride func(ctx, id) to WorkspaceHandler (always nil in
production) so tests can count Stop calls without a running Docker daemon.
Tests added/updated (13 total across 2 files):
- TestHibernateWorkspace_ActiveTasksNotHibernated
- TestHibernateWorkspace_AlreadyHibernatingNotHibernated
- TestHibernateWorkspace_SuccessPath
- TestHibernateWorkspace_ConcurrentOnlyOneStop
- TestHibernateWorkspace_DBErrorOnClaim
- Updated 3 existing HibernateWorkspace tests + 1 HTTP handler test
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add role="separator" + aria-valuenow/min/max/orientation + tabIndex={0}
to make the resize handle focusable and discoverable by screen readers
(WAI-ARIA window splitter pattern). Add onKeyDown handler: ArrowLeft/Right moves
by 16px, Home/End snaps to min/max. Persist width to localStorage on
keyboard resize, matching the existing mouse behaviour.
Focus ring uses focus-visible:ring-2 to avoid showing on mouse click.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously loadMessagesFromDB swallowed all errors and returned [] — a
network failure was indistinguishable from an empty history, so the user
had no way to know loading failed. Now the function returns
{ messages, error } and the MyChatPanel renders a role="alert" banner
with the error message and a Retry button when messages are empty and
a load error occurred.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace title attribute (not reliably announced by screen readers for
truncated text) with aria-label, add role="status" so the error is announced
as a polite live region, and raise text color from text-amber-300/60 (~2.1:1)
to text-amber-400 (~10.6:1) to meet WCAG AA contrast (4.5:1 minimum).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add bodyId derived from entry.key, attach aria-controls={bodyId} to the
toggle button, and add id={bodyId} role="region" aria-label to the
collapsible body div. Screen readers can now announce the expand/collapse
relationship between the button and the region it controls (WCAG 4.1.2).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The migration SQL is executed as raw SQL (not passed through Go fmt.Sprintf),
so %% reaches Postgres unchanged. In RAISE, %% emits a literal percent sign
rather than acting as a placeholder; a single % is what substitutes the
next argument.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The ALTER TABLE and CREATE INDEX referenced vector(1536) outside the
exception-handling DO block, so when pgvector wasn't installed they
crashed the migration runner — blocking ALL E2E runs on main.
Fix: move all DDL inside the single DO block so the EXCEPTION handler
catches any pgvector-related failure and skips the entire migration.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes #790. Depends on feat/issue-583-1-checkpoint-persistence (PR #788).
Platform (Go) — checkpoints_integration_test.go (5 new tests):
1. ThreeStepPersistence: POST task_receive/llm_call/task_complete → GET returns
all 3 in step_index DESC order with correct names and payloads.
2. CrashResume_HighestStepIsResumptionPoint: POST steps 0+1 only (crash before
step 2) → GET shows step_index=1 as the resume point; task_complete absent.
3. UpsertIdempotency_LatestPayloadWins: POST same (wf_id, step_name) twice with
different payloads → List returns only the second payload (ON CONFLICT DO UPDATE).
4. PostCascadeDelete_Returns404: simulate post ON-DELETE-CASCADE state (empty
rows) → List returns 404 as expected after workspace deletion.
5. AuthGate_NoToken_Returns401: router-level test with WorkspaceAuth middleware;
POST/GET/DELETE all return 401 without a bearer token (no DB calls made).
workspace-template — _save_checkpoint + 4 Python tests:
- Add async _save_checkpoint() to temporal_workflow.py: POST to the platform
checkpoint endpoint after each activity stage; fully non-fatal (try/except
inside the function, plus defence-in-depth try/except at every call site).
- 4 new pytest cases (test_temporal_workflow.py):
- nonfatal_on_http_error: _save_checkpoint raises HTTPStatusError (500) →
task_receive_activity still returns {"status":"received"}.
- nonfatal_on_network_error: _save_checkpoint raises ConnectError →
llm_call_activity still returns success LLMResult.
- success_path: _save_checkpoint no-op → activity returns correctly;
checkpoint called with correct args.
- standalone_http_error_is_swallowed: real _save_checkpoint function swallows
HTTP 500 from a mocked httpx.AsyncClient; returns None.
All 36 temporal workflow Python tests pass.
Go tests not run: no Go binary in this container; the test file was checked
for syntax and against the sqlmock patterns used throughout the handlers package.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Documents three upgrade strategies for keeping tenant EC2 instances
current with platform-tenant:latest:
- Option A: Rolling restart via CP admin endpoint (coordinated)
- Option B: Sidecar auto-updater cron (implemented, 5 min interval)
- Option C: Blue-green via Worker (zero downtime, future)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Badge was always text-zinc-500; apply blue-500 (>=0.8), zinc-400 (0.5–0.8),
zinc-600 (<0.5) per spec. Add 3 vitest tests for each color tier (725 total).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds step-level checkpoint storage so workflows can resume from the
last completed step after a crash or restart without replaying prior work.
- Migration: `workflow_checkpoints` table — workspace_id (FK + CASCADE),
workflow_id, step_name, step_index, completed_at, payload JSONB.
UNIQUE(workspace_id, workflow_id, step_name) + covering index on
(workspace_id, workflow_id, completed_at DESC).
- Handlers (platform/internal/handlers/checkpoints.go):
POST /workspaces/:id/checkpoints — upsert via ON CONFLICT DO UPDATE
GET /workspaces/:id/checkpoints/:wfid — list steps ordered step_index DESC
DELETE /workspaces/:id/checkpoints/:wfid — clear on clean shutdown (404 if none)
- Router: all three routes on the wsAuth group (WorkspaceAuth middleware);
workspace A's token cannot reach workspace B's checkpoints.
- Tests (11 cases, sqlmock + race-safe): upsert-insert, upsert-update,
payload forwarding, list-ordered, list-not-found, rows.Err() → 500,
delete-success, delete-not-found, callerMismatch 403 on all 3 endpoints.
Closes #788. Parent: #583-1.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Post-mortem fix: UIUX Designer ran 22 cron fires over 23 hours with
every single response being empty or '(no response generated)'. The
scheduler reported status=ok because the HTTP call succeeded — nobody
caught it until the CEO asked.
Changes:
- Migration 032: adds consecutive_empty_runs INT to workspace_schedules
- scheduler.go: captures response body from ProxyA2ARequest (was _),
checks for empty/sentinel markers via isEmptyResponse(), increments
consecutive_empty_runs on empty ok responses, resets on non-empty.
When consecutive_empty_runs >= 3, sets last_status='stale' with a
descriptive error message.
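A sketch of the empty-run bookkeeping described above. isEmptyResponse is the
name used in this commit, but its body here is an assumption based on the two
cases mentioned (empty output and the '(no response generated)' sentinel);
nextRunState is a hypothetical helper standing in for the scheduler's DB update.

```go
package main

import "strings"

// staleThreshold is the number of consecutive empty runs that marks a
// schedule as stale.
const staleThreshold = 3

// isEmptyResponse reports whether an agent response carries no real content.
func isEmptyResponse(body string) bool {
	t := strings.TrimSpace(body)
	return t == "" || t == "(no response generated)"
}

// nextRunState returns the new consecutive_empty_runs value and last_status
// after an HTTP-successful cron fire.
func nextRunState(prevEmpty int, body string) (emptyRuns int, status string) {
	if !isEmptyResponse(body) {
		return 0, "ok" // any real output resets the counter
	}
	emptyRuns = prevEmpty + 1
	if emptyRuns >= staleThreshold {
		return emptyRuns, "stale"
	}
	return emptyRuns, "ok"
}
```

With this rule the 22 empty fires from the post-mortem would have flipped the
schedule to 'stale' on the third run.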
The 'stale' status is surfaced via:
- GET /admin/schedules/health (merged in #671)
- PM's silence detector (companion fix in org-template PR)
- Maintenance loop response-body sampling (operator-side fix)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rebase of feat/issue-576-pgvector-semantic-memory onto current main,
preserving the #767 security layer (globalMemoryDelimiter + GLOBAL audit
log) that predates this branch.
Changes layered on top of main:
- Migration 031: embedding vector(1536) column + ivfflat cosine-ops index
(renumbered from 029 — 029/030 were taken by workspace-hibernation and
audit-events)
- Commit: embed-on-write after INSERT, non-fatal on embedding failure
- Search: semantic cosine-distance path when EmbeddingFunc is wired up;
falls back to FTS/ILIKE; GLOBAL delimiter wrapping applies on both paths
- EmbeddingFunc injection pattern; WithEmbedding chainable builder
All security invariants preserved:
- globalMemoryDelimiter wrapping on GLOBAL scope in both semantic + FTS
- GLOBAL write audit log (SHA-256 forensic trail) in Commit
- TestRecallMemory_GlobalScope_HasDelimiter passes
- TestMemoriesCommit_Global_AsRoot passes
- 3 new pgvector tests pass
Co-authored-by: molecule-ai[bot] <276602405+molecule-ai[bot]@users.noreply.github.com>
Addresses all 4 review points from PR #786:
1. Worker resilience: 3-tier cache (in-memory → KV → CP API) with stale
fallback so CP outages are invisible to tenants
2. WebSocket proxying: documented upgradeHeader handling, fallback to
keep Caddy for WS-only if Workers WS is unreliable
3. SG automation: note to auto-update Cloudflare IP ranges, don't hardcode
4. Trusted proxy: X-Forwarded-For / CF-Connecting-IP trust chain documented
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add org-templates/molecule-dev/system-prompt.md as a canonical org-level
shared-context template for all molecule-dev org agents. The Communication
section explains that /workspace/AGENTS.md is auto-generated at startup from
config.yaml (via agents_md.py / PR #763), describes the AAIF format it
follows, explains the GET /workspace/AGENTS.md peer-discovery contract, and
tells agents to keep their config.yaml name/role/description accurate as the
sole source of truth.
Also restructure the /org-templates/ gitignore rule from a hard directory-ignore
to a content-glob pattern so this specific reference template can be tracked
while all other cloned standalone-repo content remains ignored.
Co-authored-by: Molecule AI Documentation Specialist <documentation-specialist@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds path-filter table so developers and agents know which files
trigger which CI jobs, and that docs-only PRs skip everything.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CI now detects which paths changed and skips irrelevant jobs:
- Platform (Go): only runs when platform/** changes
- Canvas (Next.js): only runs when canvas/** changes
- Python Lint: only runs when workspace-template/** changes
- Shellcheck: only runs when tests/e2e/** or scripts/** change
- E2E API: only runs when platform/** or tests/e2e/** change
Docs-only PRs (*.md, docs/**) skip all 5 jobs, saving ~15 min of
runner time per PR. Uses dorny/paths-filter for the CI workflow and
native paths: filter for the E2E workflow.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore(eco-watch): add BeeAI ACP + Claw Code — 2026-04-17
BeeAI ACP (i-am-bee/acp, IBM) — REST/OpenAPI agent comm protocol, direct
A2A alternative; Copilot CLI ACP support already in preview. GH #777 filed
for TR comparison vs A2A.
Claw Code (ultraworkers/claw-code) — 100k+★ Rust+Python clean-room rewrite
of Claude Code architecture; architectural reference + competitive signal for
molecule-ai-workspace-template-claude-code.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* chore(eco-watch): mark BeeAI ACP as archived — A2A won consolidation
IBM archived i-am-bee/acp on Aug 27, 2025; contributed to AAIF/A2A
working group. No bridge or shim needed — Molecule's A2A bet vindicated.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Molecule AI Research Lead <research-lead@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds Phase 33 plan and architecture doc for replacing per-tenant DNS
records with a wildcard DNS + Cloudflare Worker proxy pattern.
Eliminates: DNS propagation delays, NXDOMAIN caching, per-instance
Let's Encrypt, Caddy on EC2. Same pattern used by Vercel, Railway,
Fly.io, WordPress, n8n.
4-phase migration: deploy Worker → stop creating DNS records →
remove Caddy from EC2 → cleanup.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds two missing env vars to .env.example + docker-compose.yml platform block:
1. HIBERNATION_IDLE_MINUTES (default 60)
Source: issue #724 / workspace hibernation feature.
Note: currently configured per-workspace via the hibernation_idle_minutes
DB column. This placeholder documents the planned global-default env var;
the platform does not yet read it. Per-workspace DB column is active now.
2. PLUGIN_ALLOW_UNPINNED (empty = false)
Source: issue #768 / PR #775 (supply chain hardening, not yet merged).
Pre-emptive documentation — takes effect when PR #775 lands.
ADMIN_TOKEN (item 3): already present with clear generation instructions
(openssl rand -base64 32) and NEVER-commit reminder. No changes needed.
docker-compose.yml cross-check — vars present in .env.example but absent from
the platform service env block (flagged, not fixed in this PR — all have safe
compiled-in defaults and are optional):
SECRETS_ENCRYPTION_KEY, AWARENESS_URL, MOLECULE_ENV, MOLECULE_IN_DOCKER,
MOLECULE_ENABLE_TEST_TOKENS, MOLECULE_ORG_ID, CP_PROVISION_URL,
ACTIVITY_RETENTION_DAYS, ACTIVITY_CLEANUP_INTERVAL_HOURS,
REMOTE_LIVENESS_STALE_AFTER, PLUGIN_INSTALL_{BODY_MAX_BYTES,FETCH_TIMEOUT,
MAX_DIR_BYTES}, TIER{2,3,4}_{MEMORY_MB,CPU_SHARES}, WORKSPACE_DIR.
These are not forwarded by docker-compose because they either auto-detect or
have safe defaults — operators override them via .env on the host. Adding
all of them to docker-compose would be noisy; a separate cleanup issue tracks
this.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds platform/internal/plugins/supply_chain_test.go with 8 tests (7 from
the spec + 1 end-to-end combo) specifying both security controls.
Control 1 — SHA256 content integrity (tests 1-3 + end-to-end):
Tests call VerifyManifestIntegrity(stagedDir string) error, which does
NOT exist yet → 5 compile errors / build failure until supply_chain.go
is written. Once stubbed to nil, SHA256Mismatch test fails at runtime.
VerifyManifestIntegrity contract:
- manifest.json absent → nil (backward compat)
- manifest.json present, no sha256 field → nil (backward compat)
- sha256 matches computed stagedDirDigest → nil
- sha256 mismatch → error mentioning "sha256"
stagedDirDigest algorithm (canonical, test + impl must agree):
Walk all files except manifest.json, sorted by rel path,
format each as "<rel>\x00<content>", concatenate, SHA256, hex.
Control 2 — Pinned-ref enforcement (tests 4-7):
Tests call GithubResolver.Fetch with/without "#ref" fragment.
Fetch currently returns nil for bare (unpinned) refs, so
TestPluginInstall_UnpinnedRef_Rejected fails (GitRunner IS called; no
"pinned ref" in error message).
PLUGIN_ALLOW_UNPINNED=true escape hatch tested by test 7.
RED state summary (current):
go test ./internal/plugins/... -v -run TestPluginInstall
→ build failed: 5× undefined: VerifyManifestIntegrity
→ (with no-op stub) 2 runtime failures:
FAIL TestPluginInstall_SHA256Mismatch_AbortsInstall
FAIL TestPluginInstall_UnpinnedRef_Rejected
Backend Engineer implementation checklist:
[ ] Add supply_chain.go in package plugins with VerifyManifestIntegrity
[ ] Add pinned-ref gate to GithubResolver.Fetch in github.go
[ ] PLUGIN_ALLOW_UNPINNED=true check skips the gate
[ ] All 8 tests GREEN before merge
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Resolves 4 merge conflicts: Toolbar.tsx (2), Canvas.a11y.test.tsx (1),
Canvas.pan-to-node.test.tsx (1). All conflicts were additive — PR adds
selectedNodeId/setPanelTab selectors and the Audit toolbar button; main
didn't have them. Took PR additions throughout.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>