---
title: Troubleshooting
description: Common issues and how to fix them.
---

## Workspace Stuck in "Provisioning"

A workspace that stays in `provisioning` for more than 30 seconds usually indicates a container startup failure.

**Steps to diagnose:**

1. Check Docker logs for the workspace container:

   ```bash
   docker logs <container-id>
   ```

2. Verify the workspace image exists locally:

   ```bash
   docker images | grep workspace-template
   ```

3. Check tier resource limits -- the container may be OOM-killed on start. Review `TIER2_MEMORY_MB` / `TIER3_MEMORY_MB` / `TIER4_MEMORY_MB` values.

4. Ensure the platform can reach the Docker daemon (Docker Desktop must be running).
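
If the container starts and then dies immediately, the exit state usually says why. A quick sketch using standard Docker CLI commands (the container ID is a placeholder):

```bash
# List recent containers, including ones that already exited
docker ps -a --format 'table {{.Names}}\t{{.Status}}\t{{.Image}}' | head

# Check the exit code and whether the kernel OOM-killed the process
docker inspect --format 'exit={{.State.ExitCode}} oom={{.State.OOMKilled}}' <container-id>
```

`oom=true` points back at the tier memory limits from step 3.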

## 401 Unauthorized on API Calls

Bearer tokens can expire or be revoked. Workspace tokens are also auto-revoked when a workspace is deleted.

**Resolution:**

- For workspace-scoped endpoints, mint a new token:

  ```bash
  # Development/staging only (hidden when MOLECULE_ENV=production)
  curl http://localhost:8080/admin/workspaces/:id/test-token
  ```

- For admin endpoints, verify your token is still valid against a known-good endpoint like `GET /health`.
- Legacy workspaces (created before Phase 30.1) are grandfathered and do not require tokens on heartbeat/update-card routes.
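
For instance, a quick validity check against the health endpoint -- the port and token variable are placeholders, and whether `/health` itself enforces auth is deployment-specific:

```bash
# A 401 here means the token itself is bad; a 200 shifts suspicion to the route
curl -i "http://localhost:8080/health" \
  -H "Authorization: Bearer $TOKEN"
```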

## WebSocket Shows "Reconnecting"

The canvas WebSocket connection (`/ws`) drops and retries.

**Common causes:**

- `CORS_ORIGINS` does not include your domain -- the WebSocket upgrade is rejected. Add your origin to the comma-separated list (see the example after this list).
- A reverse proxy or firewall is terminating the long-lived connection. Ensure WebSocket upgrade headers are forwarded.
- The platform process crashed or restarted. Check platform logs.
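
A sketch of the expected `CORS_ORIGINS` shape -- the origins shown are placeholders for your own domains:

```bash
# Comma-separated list of origins allowed to complete the WebSocket upgrade
export CORS_ORIGINS="http://localhost:3000,https://canvas.example.com"
```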

**Verify connectivity:**

```bash
# Quick check that the WS endpoint is reachable
curl -i -N \
  -H "Connection: Upgrade" \
  -H "Upgrade: websocket" \
  -H "Sec-WebSocket-Version: 13" \
  -H "Sec-WebSocket-Key: dGVzdA==" \
  http://localhost:8080/ws
```
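
A healthy endpoint answers the handshake with `HTTP/1.1 101 Switching Protocols`; a 4xx response usually points at the origin check, a 502 at the proxy, and a refused connection at the platform process itself.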

## Agent Not Responding to A2A

When one agent cannot reach another via the A2A proxy (`POST /workspaces/:id/a2a`), check communication rules.

**The `CanCommunicate` access check allows:**

- Same workspace (self-call)
- Siblings (same parent)
- Root-level siblings (both have no parent)
- Parent to child or child to parent

**Everything else is denied.** If two agents need to communicate, they must be in the same subtree (or both sit at the root level).

**Also verify:**

- The target workspace is `online` (not `paused`, `offline`, or `provisioning`)
- The target's heartbeat is fresh (Redis TTL has not expired)
- The caller includes `X-Workspace-ID` and `Authorization: Bearer <token>` headers
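
Reproducing the proxy call by hand separates header problems from access-rule problems. A sketch with placeholder IDs and token; the JSON body shape is an assumption, not documented here:

```bash
# Call the A2A proxy as the source workspace
curl -i -X POST "http://localhost:8080/workspaces/<target-id>/a2a" \
  -H "X-Workspace-ID: <caller-workspace-id>" \
  -H "Authorization: Bearer $WORKSPACE_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"message": "ping"}'
```

A 403 suggests the `CanCommunicate` check; a 401 suggests the token (exact status codes may differ in your version).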

## Schedule Not Firing

Cron schedules are managed by the platform scheduler subsystem.

**Checklist:**

- Verify the cron expression is valid (standard 5-field cron syntax)
- Confirm the workspace is `online` -- paused workspaces skip all schedules
- Check if the schedule was `skipped` due to concurrency: the scheduler skips when `active_tasks > 0`. Review schedule history:

  ```
  GET /workspaces/:id/schedules/:scheduleId/history
  ```

- Inspect `GET /admin/liveness` to ensure the scheduler subsystem is alive (age should be under 60 seconds)
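
The history endpoint is the quickest way to tell "never fired" apart from "fired but skipped". A sketch with placeholder IDs, assuming a workspace token is accepted on this route:

```bash
# Recent runs and their outcomes (fired, skipped, failed)
curl -s "http://localhost:8080/workspaces/<id>/schedules/<scheduleId>/history" \
  -H "Authorization: Bearer $WORKSPACE_TOKEN"
```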

## Channel Test Fails

Social channel integrations (Telegram, Slack, etc.) can fail for several reasons.

**Diagnose:**

- Verify the bot token is correct and has not been revoked by the external platform
- Check the allowlist config in the channel's JSONB settings -- messages from non-allowlisted chats are silently dropped
- Ensure the webhook URL is registered with the external platform:

  ```
  POST /webhooks/:type
  ```

  This is the endpoint the external platform (Telegram, Slack) should send events to.

- Test the connection explicitly:

  ```
  POST /workspaces/:id/channels/:channelId/test
  ```
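
From the command line, the test endpoint surfaces the failure reason directly. Placeholder IDs and token; the response format is an assumption:

```bash
# Trigger an explicit connectivity test for one channel
curl -i -X POST "http://localhost:8080/workspaces/<id>/channels/<channelId>/test" \
  -H "Authorization: Bearer $TOKEN"
```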

## Migration Crash on Boot

The platform runs all `*.up.sql` migrations on every startup (there is no `schema_migrations` tracking table yet).

**Common issues:**

- Migrations must be idempotent (`CREATE TABLE IF NOT EXISTS`, `ALTER TABLE ... IF NOT EXISTS`). If a migration lacks this guard, the second boot fails.
- Before PR #212, the migration runner did not filter `.down.sql` files, causing tables to be dropped on every boot. Ensure you are running a platform version that includes this fix.
- If you see errors about duplicate columns or tables, the migration is not idempotent. Patch the `.up.sql` file to add `IF NOT EXISTS` guards.
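
A minimal sketch of an idempotent `.up.sql` in PostgreSQL -- the table and column are invented for illustration:

```sql
-- Safe to run on every boot: both statements are no-ops once the objects exist
CREATE TABLE IF NOT EXISTS example_settings (
    id         BIGSERIAL PRIMARY KEY,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

ALTER TABLE example_settings
    ADD COLUMN IF NOT EXISTS label TEXT;
```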

## Canvas Blank or 502 on Tenant Deploy

In tenant mode (`platform/Dockerfile.tenant`), the Go server proxies canvas requests.

**Verify:**

- `CANVAS_PROXY_URL` is set and points to the running Next.js process inside the container
- Both the Go server and the Node.js process are running (check container logs for both)
- The Next.js build completed successfully during `docker build`
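
A quick way to check the first two points from outside, assuming `curl` is available in the image (the container name is a placeholder):

```bash
# Confirm the env var is set and that the Next.js process answers on it
docker exec <container> sh -c 'echo "$CANVAS_PROXY_URL" && curl -sI "$CANVAS_PROXY_URL"'
```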

## Plugin Install Timeout

Large plugins or slow network connections can exceed the default fetch deadline.

**Adjust limits:**

| Variable | Default | Description |
|----------|---------|-------------|
| `PLUGIN_INSTALL_FETCH_TIMEOUT` | `5m` | Increase for large or remote plugins |
| `PLUGIN_INSTALL_MAX_DIR_BYTES` | `104857600` (100 MiB) | Increase if the plugin tree exceeds 100 MiB |
| `PLUGIN_INSTALL_BODY_MAX_BYTES` | `65536` (64 KiB) | Increase if the install request body is large |
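
For example, to give a large plugin more headroom before starting the platform (values are illustrative, not recommendations):

```bash
# Raise the fetch deadline to 15 minutes and the tree limit to 500 MiB
export PLUGIN_INSTALL_FETCH_TIMEOUT=15m
export PLUGIN_INSTALL_MAX_DIR_BYTES=$((500 * 1024 * 1024))
```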

## Memory or Disk Usage Growing

Activity logs and structure events accumulate over time.

**Tune retention:**

- `ACTIVITY_RETENTION_DAYS` (default `7`) -- reduce to 3 or even 1 for high-traffic deployments
- `ACTIVITY_CLEANUP_INTERVAL_HOURS` (default `6`) -- reduce to run cleanup more frequently
- Monitor the `activity_logs` and `structure_events` tables directly if disk usage is a concern:

  ```sql
  SELECT pg_size_pretty(pg_total_relation_size('activity_logs'));
  SELECT pg_size_pretty(pg_total_relation_size('structure_events'));
  ```
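
An aggressive-retention setup for a high-traffic deployment might look like this (example values):

```bash
# Keep 3 days of activity and run cleanup every 2 hours
export ACTIVITY_RETENTION_DAYS=3
export ACTIVITY_CLEANUP_INTERVAL_HOURS=2
```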

## Container Health Detection

If workspaces go offline unexpectedly (e.g., Docker Desktop crash), three layers detect the failure:

1. **Passive (Redis TTL):** 60-second heartbeat key expires, liveness monitor triggers auto-restart
2. **Proactive (Health Sweep):** Docker API polled every 15 seconds, catches dead containers faster than TTL expiry
3. **Reactive (A2A Proxy):** On connection error to a workspace, checks `provisioner.IsRunning()` and triggers immediate offline + restart

If none of these are catching a dead container, check `GET /admin/liveness` to verify the health sweep and liveness monitor subsystems are running.
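
A sketch for checking it, assuming the endpoint accepts an admin bearer token on your deployment:

```bash
# Each subsystem should report a recent age (the scheduler's under 60s)
curl -s "http://localhost:8080/admin/liveness" \
  -H "Authorization: Bearer $ADMIN_TOKEN"
```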