---
title: Troubleshooting
description: Common issues and how to fix them.
---
## Workspace Stuck in "Provisioning"
A workspace that stays in `provisioning` for more than 30 seconds usually indicates a container startup failure.
**Steps to diagnose:**
1. Check Docker logs for the workspace container:
```bash
docker logs <container-id>
```
2. Verify the workspace image exists locally:
```bash
docker images | grep workspace-template
```
3. Check tier resource limits -- the container may be OOM-killed on start (see the check after this list). Review `TIER2_MEMORY_MB` / `TIER3_MEMORY_MB` / `TIER4_MEMORY_MB` values.
4. Ensure the platform can reach the Docker daemon (Docker Desktop must be running).
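If OOM is the suspect, Docker records the kill on the container itself. A quick check with the standard Docker CLI (`<container-id>` is a placeholder):
```bash
# OOMKilled=true with exit code 137 means the memory limit was hit at start
docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' <container-id>
# And confirm the daemon itself is reachable (step 4)
docker info > /dev/null && echo "daemon reachable"
```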
## 401 Unauthorized on API Calls
Bearer tokens can expire or be revoked. Workspace tokens are also auto-revoked when a workspace is deleted.
**Resolution:**
- For workspace-scoped endpoints, mint a new token:
```bash
# Development/staging only (hidden when MOLECULE_ENV=production)
curl http://localhost:8080/admin/workspaces/:id/test-token
```
- For admin endpoints, verify your token is still valid against a known-good endpoint like `GET /health` (see the example below).
- Legacy workspaces (created before Phase 30.1) are grandfathered and do not require tokens on heartbeat/update-card routes.
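A minimal token sanity check, assuming the local default address used elsewhere in this guide:
```bash
# A 401 here means the token itself is expired or revoked
curl -i -H "Authorization: Bearer <token>" http://localhost:8080/health
```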
## WebSocket Shows "Reconnecting"
The canvas WebSocket connection (`/ws`) drops and retries.
**Common causes:**
- `CORS_ORIGINS` does not include your domain -- the WebSocket upgrade is rejected. Add your origin to the comma-separated list (example after this list).
- A reverse proxy or firewall is terminating the long-lived connection. Ensure WebSocket upgrade headers are forwarded.
- The platform process crashed or restarted. Check platform logs.
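A sketch of a working `CORS_ORIGINS` value (domains are hypothetical); every origin the canvas is served from must appear, scheme included:
```bash
export CORS_ORIGINS="https://app.example.com,https://staging.example.com"
```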
**Verify connectivity:**
```bash
# Quick check that the WS endpoint is reachable
curl -i -N \
  -H "Connection: Upgrade" \
  -H "Upgrade: websocket" \
  -H "Sec-WebSocket-Version: 13" \
  -H "Sec-WebSocket-Key: dGVzdA==" \
  http://localhost:8080/ws
```
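A healthy endpoint answers `HTTP/1.1 101 Switching Protocols`; a 4xx response usually points back at one of the causes above.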
## Agent Not Responding to A2A
When one agent cannot reach another via the A2A proxy (`POST /workspaces/:id/a2a`), check communication rules.
**The `CanCommunicate` access check allows:**
- Same workspace (self-call)
- Siblings (same parent)
- Root-level siblings (both have no parent)
- Parent to child or child to parent
**Everything else is denied.** If two agents need to communicate, they must be in one of the relationships above -- in practice, the same subtree.
**Also verify:**
- The target workspace is `online` (not `paused`, `offline`, or `provisioning`)
- The target's heartbeat is fresh (Redis TTL has not expired)
- The caller includes `X-Workspace-ID` and `Authorization: Bearer <token>` headers
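Putting those headers together, a minimal A2A call might look like this (IDs and token are placeholders, and the JSON body shape is an assumption, not a confirmed schema):
```bash
curl -X POST http://localhost:8080/workspaces/<target-id>/a2a \
  -H "X-Workspace-ID: <caller-workspace-id>" \
  -H "Authorization: Bearer <caller-token>" \
  -H "Content-Type: application/json" \
  -d '{"message": "ping"}'
```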
## Schedule Not Firing
Cron schedules are managed by the platform scheduler subsystem.
**Checklist:**
- Verify the cron expression is valid (standard 5-field cron syntax)
- Confirm the workspace is `online` -- paused workspaces skip all schedules
- Check if the schedule was `skipped` due to concurrency: the scheduler skips when `active_tasks > 0`. Review schedule history (runnable example after this checklist):
```
GET /workspaces/:id/schedules/:scheduleId/history
```
- Inspect `GET /admin/liveness` to ensure the scheduler subsystem is alive (age should be under 60 seconds)
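For example, a standard 5-field expression plus a concrete history call (host, token, and IDs are placeholders):
```bash
# "*/5 * * * *" = every 5 minutes (minute hour day-of-month month day-of-week)
curl -H "Authorization: Bearer <token>" \
  "http://localhost:8080/workspaces/<id>/schedules/<scheduleId>/history"
```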
## Channel Test Fails
Social channel integrations (Telegram, Slack, etc.) can fail for several reasons.
**Diagnose:**
- Verify the bot token is correct and has not been revoked by the external platform
- Check the allowlist config in the channel's JSONB settings -- messages from non-allowlisted chats are silently dropped
- Ensure the webhook URL is registered with the external platform:
```
POST /webhooks/:type
```
This is the endpoint the external platform (Telegram, Slack) should send events to.
- Test the connection explicitly:
```
POST /workspaces/:id/channels/:channelId/test
```
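For Telegram specifically, registration uses the Bot API's `setWebhook` method; the example assumes `:type` maps to the channel type (`telegram`), and the domain, token, and auth header are placeholders:
```bash
# Tell Telegram where to deliver updates
curl "https://api.telegram.org/bot<bot-token>/setWebhook" \
  -d "url=https://your-domain.example/webhooks/telegram"
# Then exercise the channel end-to-end through the platform
curl -X POST http://localhost:8080/workspaces/<id>/channels/<channelId>/test \
  -H "Authorization: Bearer <token>"
```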
## Migration Crash on Boot
The platform runs all `*.up.sql` migrations on every startup (there is no `schema_migrations` tracking table yet).
**Common issues:**
- Migrations must be idempotent (`CREATE TABLE IF NOT EXISTS`, `ALTER TABLE ... ADD COLUMN IF NOT EXISTS`). If a migration lacks these guards, the second boot fails.
- Before PR #212, the migration runner did not filter `.down.sql` files, causing tables to be dropped on every boot. Ensure you are running a platform version that includes this fix.
- If you see errors about duplicate columns or tables, the migration is not idempotent. Patch the `.up.sql` file to add `IF NOT EXISTS` guards.
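As a sketch, an idempotent `.up.sql` that survives re-runs (table and column names are illustrative):
```sql
-- Both statements are no-ops once applied, so every boot is safe
CREATE TABLE IF NOT EXISTS widgets (
  id BIGSERIAL PRIMARY KEY,
  name TEXT NOT NULL
);
ALTER TABLE widgets ADD COLUMN IF NOT EXISTS created_at TIMESTAMPTZ DEFAULT now();
```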
## Canvas Blank or 502 on Tenant Deploy
In tenant mode (`platform/Dockerfile.tenant`), the Go server proxies canvas requests.
**Verify:**
- `CANVAS_PROXY_URL` is set and points to the running Next.js process inside the container
- Both the Go server and the Node.js process are running (check container logs for both)
- The Next.js build completed successfully during `docker build`
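A quick way to check all three from the host (container name is a placeholder):
```bash
# Confirm the proxy target is set inside the container
docker exec <tenant-container> env | grep CANVAS_PROXY_URL
# Both processes share the container's log stream; look for Go and Next.js output
docker logs --tail 100 <tenant-container>
```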
## Plugin Install Timeout
Large plugins or slow network connections can exceed the default fetch deadline.
**Adjust limits:**
| Variable | Default | Description |
|----------|---------|-------------|
| `PLUGIN_INSTALL_FETCH_TIMEOUT` | `5m` | Increase for large or remote plugins |
| `PLUGIN_INSTALL_MAX_DIR_BYTES` | `104857600` (100 MiB) | Increase if the plugin tree exceeds 100 MiB |
| `PLUGIN_INSTALL_BODY_MAX_BYTES` | `65536` (64 KiB) | Increase if the install request body is large |
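For example, to install a large plugin over a slow link (values are illustrative, not recommendations):
```bash
export PLUGIN_INSTALL_FETCH_TIMEOUT=15m
export PLUGIN_INSTALL_MAX_DIR_BYTES=524288000  # 500 MiB
```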
## Memory or Disk Usage Growing
Activity logs and structure events accumulate over time.
**Tune retention:**
- `ACTIVITY_RETENTION_DAYS` (default `7`) -- reduce to 3 or even 1 for high-traffic deployments (example below)
- `ACTIVITY_CLEANUP_INTERVAL_HOURS` (default `6`) -- reduce to run cleanup more frequently
- Monitor the `activity_logs` and `structure_events` tables directly if disk usage is a concern:
```sql
SELECT pg_size_pretty(pg_total_relation_size('activity_logs'));
SELECT pg_size_pretty(pg_total_relation_size('structure_events'));
```
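An aggressive retention setup for a high-traffic deployment might look like this (illustrative values):
```bash
export ACTIVITY_RETENTION_DAYS=3
export ACTIVITY_CLEANUP_INTERVAL_HOURS=2
```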
## Container Health Detection
If workspaces go offline unexpectedly (e.g., Docker Desktop crash), three layers detect the failure:
1. **Passive (Redis TTL):** 60-second heartbeat key expires, liveness monitor triggers auto-restart
2. **Proactive (Health Sweep):** Docker API polled every 15 seconds, catches dead containers faster than TTL expiry
3. **Reactive (A2A Proxy):** On connection error to a workspace, checks `provisioner.IsRunning()` and triggers immediate offline + restart
If none of these are catching a dead container, check `GET /admin/liveness` to verify the health sweep and liveness monitor subsystems are running.
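Assuming the local default address, the check is a single request (the exact response shape is an assumption):
```bash
curl -s http://localhost:8080/admin/liveness
# Expect entries for the health sweep and liveness monitor with ages under 60s
```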