forked from molecule-ai/molecule-core
feat(workspace): persist CP-returned EC2 instance_id on provision
Foundation for the EIC-based terminal handler (#1528). The tenant's workspace-server needs to map workspace_id → EC2 instance_id to open an SSH session, but CPProvisioner.Start returned the instance id only for logging — it was never written anywhere. This PR adds the column and writes it at provision time.

Scope kept intentionally small: no terminal code yet. The follow-up PR will consume this column from the terminal handler.

What's here:

- migrations/038_workspace_instance_id — nullable TEXT column on workspaces, partial index on non-null values for fast lookup
- workspace_provision.go — UPDATE after CPProvisioner.Start; a failure is logged but does not fail provisioning (the row just lacks instance_id and the terminal falls back to the existing not-reachable error)
- docs/infra/workspace-terminal.md — full design for the terminal flow: EIC vs SSM comparison, IAM policy JSON, SG rules, key lifetime, failure modes, rollout checklist

Refs: #1528
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent 73464a21dd
commit 46a8d24b2d

159 docs/infra/workspace-terminal.md Normal file

@@ -0,0 +1,159 @@
# Workspace Terminal over EIC + SSH

Tracking: [molecule-core#1528](https://github.com/Molecule-AI/molecule-core/issues/1528)

## Problem

Canvas's Terminal tab calls `workspace-server /workspaces/:id/terminal`, which tries `docker.ContainerInspect` against the tenant's local Docker daemon. That works for locally-provisioned workspaces, but CP-provisioned (SaaS) workspaces run on **separate EC2 instances** — the tenant has no path to their Docker daemon. Users see "Failed to connect — is the workspace container running?" alongside `STATUS: online`, because A2A heartbeats come from the remote instance independently.
## Chosen approach: EC2 Instance Connect + SSH

`ec2-instance-connect:SendSSHPublicKey` pushes an ephemeral SSH public key (valid for 60 seconds) into the instance's metadata. A short-lived SSH connection authenticates with the matching private key, runs `docker exec -it ws-<id> /bin/bash`, and bridges stdin/stdout to the canvas WebSocket.
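The handler itself lands in the follow-up PR. As a rough sketch of the SSH leg (the helper name and flag set are illustrative assumptions, not shipped code), the invocation could be assembled like this:

```go
package main

import (
	"fmt"
	"os/exec"
)

// buildTerminalSSHCmd is a hypothetical sketch of assembling the SSH leg
// described above: authenticate with the ephemeral private key, skip host-key
// checks (workspace hosts are as ephemeral as their keys), and exec a shell
// inside the workspace container.
func buildTerminalSSHCmd(keyPath, osUser, publicDNS, workspaceID string) *exec.Cmd {
	return exec.Command("ssh",
		"-i", keyPath, // ephemeral private key; held in memory in the real handler
		"-o", "StrictHostKeyChecking=no",
		"-o", "BatchMode=yes", // fail fast instead of prompting for input
		"-tt", // force tty allocation so `docker exec -it` gets a terminal
		fmt.Sprintf("%s@%s", osUser, publicDNS),
		"docker", "exec", "-it", "ws-"+workspaceID, "/bin/bash",
	)
}

func main() {
	cmd := buildTerminalSSHCmd("/tmp/key", "ec2-user",
		"ec2-1-2-3-4.compute-1.amazonaws.com", "abc123")
	fmt.Println(cmd.Args)
}
```

The real handler would wire `cmd`'s stdin/stdout to the WebSocket rather than printing anything.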
### Why not SSM Session Manager

SSM would be the "right" answer in a mature infra, but it requires:

- An IAM instance profile with `AmazonSSMManagedInstanceCore` on every workspace EC2 (currently none have one — `aws ssm describe-instance-information` returns an empty list across the fleet)
- The SSM agent on the AMI (already present on AL2023/Ubuntu, but unverified)
- Outbound access to `ssm.*.amazonaws.com` (current VPC config unknown)

EIC short-circuits all three. The existing `molecule-cp` IAM user picks up a small policy addition and we're done — no per-instance identity to bootstrap.
### Comparison

| Axis | EIC + SSH | SSM Session Manager |
|---|---|---|
| Uses existing `molecule-cp` creds | Yes | No — needs instance profile |
| AMI changes | None (EIC in the OS since AL2 2019+, Ubuntu 20.04+) | Verify agent present |
| Infra changes | IAM policy + security group | IAM role + instance profile + maybe NAT/VPCe |
| Audit | CloudTrail for `SendSSHPublicKey` | CloudTrail + SSM session logs (richer) |
| Rotation | Every session (60s key lifetime) | Managed by AWS |
| Compliance story | "SSH with per-session keys, CloudTrailed" | "SSM Session Manager with recording available" |

Pick SSM later if compliance needs session recording. For now, EIC is strictly less work.
## Data flow

```
[Canvas]                    [Tenant workspace-server]                [Workspace EC2]
   │                             │                                         │
   │ WS /workspaces/:id/terminal │                                         │
   ├────────────────────────────▶│                                         │
   │                             │ SELECT instance_id                      │
   │                             │   FROM workspaces WHERE id=:id          │
   │                             │                                         │
   │                             │ ec2:DescribeInstances(instance_id)      │
   │                             │   → public_dns, availability_zone       │
   │                             │                                         │
   │                             │ ec2-instance-connect:SendSSHPublicKey   │
   │                             │   target: instance_id                   │
   │                             │   os_user: ec2-user|ubuntu              │
   │                             │   public_key: ephemeral (ed25519)       │
   │                             │                                         │
   │                             │ ssh ec2-user@public_dns                 │
   │                             │   -o StrictHostKeyChecking=no           │
   │                             ├────────────────────────────────────────▶│
   │                             │                                         │
   │                             │ docker exec -it ws-<id> /bin/bash       │
   │                             ├────────────────────────────────────────▶│
   │                             │                                         │
   │◀───── stdout bridge ────────┤◀──────────── stdout ────────────────────┤
   │───── stdin bridge ─────────▶├───────────── stdin ────────────────────▶│
```

`instance_id` is persisted at provision time by migration `038_workspace_instance_id`. The terminal handler branches on `instance_id IS NOT NULL`.
## IAM policy addition for `molecule-cp`

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DescribeInstancesForTerminalResolution",
      "Effect": "Allow",
      "Action": ["ec2:DescribeInstances"],
      "Resource": "*"
    },
    {
      "Sid": "PushEphemeralSSHKeyToWorkspaceInstances",
      "Effect": "Allow",
      "Action": ["ec2-instance-connect:SendSSHPublicKey"],
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "StringEquals": {
          "aws:ResourceTag/molecule:role": "workspace"
        }
      }
    }
  ]
}
```

Rationale for the tag condition: if CP tags workspace EC2s with `molecule:role=workspace` at launch, this policy cannot be weaponized to push keys into unrelated instances. **Action: CP launch code must set that tag** — otherwise drop the Condition block (broader blast radius; less safe).
## Security group rule

On each workspace EC2's security group, allow inbound `22/tcp` from the tenant VPC's CIDR only:

```
Inbound:
  Port: 22
  Protocol: TCP
  Source: <tenant-vpc-cidr> (NOT 0.0.0.0/0)
  Description: Tenant workspace-server → EIC-pushed SSH for canvas terminal
```

Two notes:

- EIC does NOT open 0.0.0.0/0 on port 22. The ingress rule must already allow the caller's IP (here: the tenant). EIC only pushes the key — you still need network reachability.
- If tenant and workspace EC2s live in different VPCs (likely), use VPC peering or an EC2 Instance Connect Endpoint instead of a direct CIDR rule.
## Key lifetime

- ed25519 keypair generated per session in the terminal handler
- Public half pushed via `SendSSHPublicKey` (valid for 60 seconds)
- Private half held in memory only, discarded when the WebSocket closes
- No keys on disk, no rotation cron, no secrets-rotation debt
## Failure modes and their user-visible messages

| Condition | Message | Actionable? |
|---|---|---|
| `instance_id IS NULL` (local workspace) | Falls through to the current local-Docker handler | n/a — existing behavior |
| `instance_id` set, DescribeInstances returns nothing | "workspace instance no longer exists — recreate the workspace" | Yes |
| `SendSSHPublicKey` returns 403 | "tenant lacks EIC permission — contact your admin" | Yes (requires IAM fix) |
| SSH connect timeout | "tenant cannot reach workspace instance — check security group" | Yes (SG fix) |
| `docker exec` fails (no container) | "workspace container is not running — try restart" | Yes (normal ops) |
## Rollout checklist

### 1. Infra prep (CP side)

- [ ] CP tags new workspace EC2s with `molecule:role=workspace` on launch (a 5-line change to the RunInstances call in the CP service)
- [ ] Backfill the tag on existing workspace EC2s:

  ```
  aws ec2 create-tags --resources $(aws ec2 describe-instances \
      --filters 'Name=tag:Component,Values=molecule-workspace' \
      --query 'Reservations[].Instances[].InstanceId' --output text) \
    --tags Key=molecule:role,Value=workspace
  ```

- [ ] Add the IAM policy above to `molecule-cp`
- [ ] Update the workspace security group to allow port 22 from the tenant VPC CIDR
### 2. Tenant code (this repo)

- [ ] PR 1 (this one): migration `038_workspace_instance_id` + persist `instance_id` on CP provision
- [ ] PR 2 (follow-up): terminal handler EIC + SSH branch + tests

### 3. Verification

- [ ] After PR 1 merges and deploys, provision a new CP workspace → verify `SELECT instance_id FROM workspaces` returns the EC2 id
- [ ] After PR 2 merges and deploys, open the Terminal tab on a CP workspace → a bash prompt appears
- [ ] Intentionally terminate the EC2 → the Terminal tab shows the "instance no longer exists" message
- [ ] Temporarily pull the `ec2-instance-connect` action from `molecule-cp` → the Terminal shows "tenant lacks EIC permission"

## Future work (not in scope)

- Session recording for compliance → SSM migration with instance profile
- Multi-user concurrent terminals → connection pooling per workspace
- Terminal for workspaces behind a private NAT with no EIC route → fall back to SSM
@@ -631,6 +631,18 @@ func (h *WorkspaceHandler) provisionWorkspaceCP(workspaceID, templatePath string
		return
	}

	// Persist the backing instance id so later operations (terminal via
	// EIC+SSH, live logs, debug introspection) can resolve workspace → EC2
	// without re-asking CP on every request.
	if _, err := db.DB.ExecContext(ctx,
		`UPDATE workspaces SET instance_id = $2, updated_at = now() WHERE id = $1`,
		workspaceID, machineID); err != nil {
		// Non-fatal: provisioning succeeded, the workspace will still run.
		// The row stays without instance_id — terminal falls back to the
		// "CP-provisioned but unreachable" error, not a silent failure.
		log.Printf("CPProvisioner: persist instance_id failed for %s: %v", workspaceID, err)
	}

	log.Printf("CPProvisioner: workspace %s started as machine %s via control plane", workspaceID, machineID)
	// Token issuance is deliberately deferred to the workspace's first
	// /registry/register call. Minting here without also delivering the
@@ -0,0 +1,2 @@
DROP INDEX IF EXISTS idx_workspaces_instance_id;
ALTER TABLE workspaces DROP COLUMN IF EXISTS instance_id;
12 workspace-server/migrations/038_workspace_instance_id.up.sql Normal file

@@ -0,0 +1,12 @@
-- Persist the CP-provisioner-returned EC2 instance id on the workspace row.
-- Needed so the tenant workspace-server can resolve a workspace to its
-- backing EC2 for operations like terminal (EIC + SSH), live logs, and
-- debug introspection without re-calling the control plane on every hit.
--
-- Nullable: local-Docker workspaces never populate this column.
ALTER TABLE workspaces
    ADD COLUMN IF NOT EXISTS instance_id TEXT;

CREATE INDEX IF NOT EXISTS idx_workspaces_instance_id
    ON workspaces(instance_id)
    WHERE instance_id IS NOT NULL;