---
title: Troubleshooting
description: Common issues and how to fix them.
---
## Workspace Stuck in "Provisioning"
A workspace that stays in `provisioning` for more than 30 seconds usually indicates a container startup failure.
**Steps to diagnose:**
1. Check Docker logs for the workspace container:
```bash
docker logs <container-id>
```
2. Verify the workspace image exists locally:
```bash
docker images | grep workspace-template
```
3. Check tier resource limits -- the container may be OOM-killed on start (see the check after this list). Review `TIER2_MEMORY_MB` / `TIER3_MEMORY_MB` / `TIER4_MEMORY_MB` values.
4. Ensure the platform can reach the Docker daemon (Docker Desktop must be running).
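If OOM is the suspect, Docker records the kill on the container itself. A quick check with the standard Docker CLI (`<container-id>` is a placeholder):
```bash
# OOMKilled=true with exit code 137 means the memory limit was hit at start
docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' <container-id>
# And confirm the daemon itself is reachable (step 4)
docker info > /dev/null && echo "daemon reachable"
```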
## 401 Unauthorized on API Calls
Bearer tokens can expire or be revoked. Workspace tokens are also auto-revoked when a workspace is deleted.
**Resolution:**
- For workspace-scoped endpoints, mint a new token:
```bash
# Development/staging only (hidden when MOLECULE_ENV=production)
curl http://localhost:8080/admin/workspaces/:id/test-token
```
- For admin endpoints, verify your token is still valid against a known-good endpoint like `GET /health` (see the example below).
- Legacy workspaces (created before Phase 30.1) are grandfathered and do not require tokens on heartbeat/update-card routes.
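A minimal token sanity check, assuming the local default address used elsewhere in this guide:
```bash
# A 401 here means the token itself is expired or revoked
curl -i -H "Authorization: Bearer <token>" http://localhost:8080/health
```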
## WebSocket Shows "Reconnecting"
The canvas WebSocket connection (`/ws`) drops and retries.
**Common causes:**
- `CORS_ORIGINS` does not include your domain -- the WebSocket upgrade is rejected. Add your origin to the comma-separated list (example after this list).
- A reverse proxy or firewall is terminating the long-lived connection. Ensure WebSocket upgrade headers are forwarded.
- The platform process crashed or restarted. Check platform logs.
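A sketch of a working `CORS_ORIGINS` value (domains are hypothetical); every origin the canvas is served from must appear, scheme included:
```bash
export CORS_ORIGINS="https://app.example.com,https://staging.example.com"
```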
**Verify connectivity:**
```bash
# Quick check that the WS endpoint is reachable
curl -i -N \
  -H "Connection: Upgrade" \
  -H "Upgrade: websocket" \
  -H "Sec-WebSocket-Version: 13" \
  -H "Sec-WebSocket-Key: dGVzdA==" \
  http://localhost:8080/ws
```
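A healthy endpoint answers `HTTP/1.1 101 Switching Protocols`; a 4xx response usually points back at one of the causes above.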
## Agent Not Responding to A2A
When one agent cannot reach another via the A2A proxy (`POST /workspaces/:id/a2a`), check communication rules.
**The `CanCommunicate` access check allows:**
- Same workspace (self-call)
- Siblings (same parent)
- Root-level siblings (both have no parent)
- Parent to child or child to parent
**Everything else is denied.** If two agents need to communicate, they must be in one of the relationships above -- in practice, the same subtree.
**Also verify:**
- The target workspace is `online` (not `paused`, `offline`, or `provisioning`)
- The target's heartbeat is fresh (Redis TTL has not expired)
- The caller includes `X-Workspace-ID` and `Authorization: Bearer <token>` headers
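Putting those headers together, a minimal A2A call might look like this (IDs and token are placeholders, and the JSON body shape is an assumption, not a confirmed schema):
```bash
curl -X POST http://localhost:8080/workspaces/<target-id>/a2a \
  -H "X-Workspace-ID: <caller-workspace-id>" \
  -H "Authorization: Bearer <caller-token>" \
  -H "Content-Type: application/json" \
  -d '{"message": "ping"}'
```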
## Schedule Not Firing
Cron schedules are managed by the platform scheduler subsystem.
**Checklist:**
- Verify the cron expression is valid (standard 5-field cron syntax)
- Confirm the workspace is `online` -- paused workspaces skip all schedules
- Check if the schedule was `skipped` due to concurrency: the scheduler skips when `active_tasks > 0`. Review schedule history (runnable example after this checklist):
```
GET /workspaces/:id/schedules/:scheduleId/history
```
- Inspect `GET /admin/liveness` to ensure the scheduler subsystem is alive (age should be under 60 seconds)
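For example, a standard 5-field expression plus a concrete history call (host, token, and IDs are placeholders):
```bash
# "*/5 * * * *" = every 5 minutes (minute hour day-of-month month day-of-week)
curl -H "Authorization: Bearer <token>" \
  "http://localhost:8080/workspaces/<id>/schedules/<scheduleId>/history"
```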
## Channel Test Fails
Social channel integrations (Telegram, Slack, etc.) can fail for several reasons.
**Diagnose:**
- Verify the bot token is correct and has not been revoked by the external platform
- Check the allowlist config in the channel's JSONB settings -- messages from non-allowlisted chats are silently dropped
- Ensure the webhook URL is registered with the external platform:
```
POST /webhooks/:type
```
This is the endpoint the external platform (Telegram, Slack) should send events to.
- Test the connection explicitly:
```
POST /workspaces/:id/channels/:channelId/test
```
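For Telegram specifically, registration uses the Bot API's `setWebhook` method; the example assumes `:type` maps to the channel type (`telegram`), and the domain, token, and auth header are placeholders:
```bash
# Tell Telegram where to deliver updates
curl "https://api.telegram.org/bot<bot-token>/setWebhook" \
  -d "url=https://your-domain.example/webhooks/telegram"
# Then exercise the channel end-to-end through the platform
curl -X POST http://localhost:8080/workspaces/<id>/channels/<channelId>/test \
  -H "Authorization: Bearer <token>"
```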
## Migration Crash on Boot
The platform runs all `*.up.sql` migrations on every startup (there is no `schema_migrations` tracking table yet).
**Common issues:**
- Migrations must be idempotent (`CREATE TABLE IF NOT EXISTS`, `ALTER TABLE ... ADD COLUMN IF NOT EXISTS`). If a migration lacks these guards, the second boot fails.
- Before PR #212, the migration runner did not filter `.down.sql` files, causing tables to be dropped on every boot. Ensure you are running a platform version that includes this fix.
- If you see errors about duplicate columns or tables, the migration is not idempotent. Patch the `.up.sql` file to add `IF NOT EXISTS` guards.
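As a sketch, an idempotent `.up.sql` that survives re-runs (table and column names are illustrative):
```sql
-- Both statements are no-ops once applied, so every boot is safe
CREATE TABLE IF NOT EXISTS widgets (
  id BIGSERIAL PRIMARY KEY,
  name TEXT NOT NULL
);
ALTER TABLE widgets ADD COLUMN IF NOT EXISTS created_at TIMESTAMPTZ DEFAULT now();
```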
## Canvas Blank or 502 on Tenant Deploy
In tenant mode (`platform/Dockerfile.tenant`), the Go server proxies canvas requests.
**Verify:**
- `CANVAS_PROXY_URL` is set and points to the running Next.js process inside the container
- Both the Go server and the Node.js process are running (check container logs for both)
- The Next.js build completed successfully during `docker build`
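A quick way to check all three from the host (container name is a placeholder):
```bash
# Confirm the proxy target is set inside the container
docker exec <tenant-container> env | grep CANVAS_PROXY_URL
# Both processes share the container's log stream; look for Go and Next.js output
docker logs --tail 100 <tenant-container>
```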
## Plugin Install Timeout
Large plugins or slow network connections can exceed the default fetch deadline.
**Adjust limits:**
| Variable | Default | Description |
|----------|---------|-------------|
| `PLUGIN_INSTALL_FETCH_TIMEOUT` | `5m` | Increase for large or remote plugins |
| `PLUGIN_INSTALL_MAX_DIR_BYTES` | `104857600` (100 MiB) | Increase if the plugin tree exceeds 100 MiB |
| `PLUGIN_INSTALL_BODY_MAX_BYTES` | `65536` (64 KiB) | Increase if the install request body is large |
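For example, to install a large plugin over a slow link (values are illustrative, not recommendations):
```bash
export PLUGIN_INSTALL_FETCH_TIMEOUT=15m
export PLUGIN_INSTALL_MAX_DIR_BYTES=524288000  # 500 MiB
```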
## Memory or Disk Usage Growing
Activity logs and structure events accumulate over time.
**Tune retention:**
- `ACTIVITY_RETENTION_DAYS` (default `7`) -- reduce to 3 or even 1 for high-traffic deployments (example below)
- `ACTIVITY_CLEANUP_INTERVAL_HOURS` (default `6`) -- reduce to run cleanup more frequently
- Monitor the `activity_logs` and `structure_events` tables directly if disk usage is a concern:
```sql
SELECT pg_size_pretty(pg_total_relation_size('activity_logs'));
SELECT pg_size_pretty(pg_total_relation_size('structure_events'));
```
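An aggressive retention setup for a high-traffic deployment might look like this (illustrative values):
```bash
export ACTIVITY_RETENTION_DAYS=3
export ACTIVITY_CLEANUP_INTERVAL_HOURS=2
```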
## Container Health Detection
If workspaces go offline unexpectedly (e.g., Docker Desktop crash), three layers detect the failure:
1. **Passive (Redis TTL):** 60-second heartbeat key expires, liveness monitor triggers auto-restart
2. **Proactive (Health Sweep):** Docker API polled every 15 seconds, catches dead containers faster than TTL expiry
3. **Reactive (A2A Proxy):** On connection error to a workspace, checks `provisioner.IsRunning()` and triggers immediate offline + restart
If none of these are catching a dead container, check `GET /admin/liveness` to verify the health sweep and liveness monitor subsystems are running.
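Assuming the local default address, the check is a single request (the exact response shape is an assumption):
```bash
curl -s http://localhost:8080/admin/liveness
# Expect entries for the health sweep and liveness monitor with ages under 60s
```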