From 46a8d24b2d90371fb5acd797d18b68a4cd2f3cdc Mon Sep 17 00:00:00 2001 From: Hongming Wang Date: Tue, 21 Apr 2026 17:56:15 -0700 Subject: [PATCH] feat(workspace): persist CP-returned EC2 instance_id on provision MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Foundation for the EIC-based terminal handler (#1528). The tenant's workspace-server needs to map workspace_id → EC2 instance_id to open an SSH session, but CPProvisioner.Start returned the instance id only for logging — it was never written anywhere. This PR adds the column and writes it at provision time. Scope kept intentionally small: no terminal code yet. The follow-up PR will consume this column from the terminal handler. What's here: - migrations/038_workspace_instance_id — nullable TEXT column on workspaces, partial index on non-null for fast lookup - workspace_provision.go — UPDATE after CPProvisioner.Start; failure logs but doesn't fail provisioning (row just lacks instance_id and terminal falls back to the existing not-reachable error) - docs/infra/workspace-terminal.md — full design for the terminal flow: EIC vs SSM comparison, IAM policy JSON, SG rules, key lifetime, failure modes, rollout checklist Refs: #1528 Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/infra/workspace-terminal.md | 159 ++++++++++++++++++ .../internal/handlers/workspace_provision.go | 12 ++ .../038_workspace_instance_id.down.sql | 2 + .../038_workspace_instance_id.up.sql | 12 ++ 4 files changed, 185 insertions(+) create mode 100644 docs/infra/workspace-terminal.md create mode 100644 workspace-server/migrations/038_workspace_instance_id.down.sql create mode 100644 workspace-server/migrations/038_workspace_instance_id.up.sql diff --git a/docs/infra/workspace-terminal.md b/docs/infra/workspace-terminal.md new file mode 100644 index 00000000..cb22f227 --- /dev/null +++ b/docs/infra/workspace-terminal.md @@ -0,0 +1,159 @@ +# Workspace Terminal over EIC + SSH + +Tracking: [molecule-core#1528](https://github.com/Molecule-AI/molecule-core/issues/1528) + +## Problem + +Canvas's Terminal tab calls `workspace-server /workspaces/:id/terminal` which tries `docker.ContainerInspect` on the tenant's local Docker daemon. That works for locally-provisioned workspaces, but CP-provisioned (SaaS) workspaces run on **separate EC2 instances** — the tenant has no path to their Docker. Users see "Failed to connect — is the workspace container running?" while `STATUS: online` because A2A heartbeats come from the remote instance independently. + +## Chosen approach: EC2 Instance Connect + SSH + +`ec2-instance-connect:SendSSHPublicKey` pushes an ephemeral SSH public key (valid 60s) into the instance's metadata. A short-lived SSH connection uses the matching private key, runs `docker exec -it ws- /bin/bash`, and bridges stdin/stdout to the canvas WebSocket. + +### Why not SSM Session Manager + +SSM would be the "right" answer in a mature infra but requires: +- An IAM instance profile with `AmazonSSMManagedInstanceCore` on every workspace EC2 (currently none have one — `aws ssm describe-instance-information` returns an empty list across the fleet) +- SSM agent on the AMI (already present on AL2023/Ubuntu, but unverified) +- Outbound to `ssm.*.amazonaws.com` (current VPC config unknown) + +EIC short-circuits all three. The existing `molecule-cp` IAM user picks up a small policy addition and we're done — no per-instance identity to bootstrap. + +### Comparison + +| Axis | EIC + SSH | SSM Session Manager | +|---|---|---| +| Uses existing `molecule-cp` creds | Yes | No — needs instance profile | +| AMI changes | None (EIC in OS since AL2 2019+, Ubuntu 20.04+) | Verify agent present | +| Infra changes | IAM policy + security group | IAM role + instance profile + maybe NAT/VPCe | +| Audit | CloudTrail for `SendSSHPublicKey` | CloudTrail + SSM session logs (richer) | +| Rotation | Every session (60s key lifetime) | Managed by AWS | +| Compliance story | "SSH with per-session keys, CloudTrailed" | "SSM Session Manager with recording available" | + +Pick SSM later if compliance needs session recording. For now EIC is strictly less work. + +## Data flow + +``` +[Canvas] [Tenant workspace-server] [Workspace EC2] + │ │ │ + │ WS /workspaces/:id/terminal │ │ + ├────────────────────────────▶│ │ + │ │ SELECT instance_id │ + │ │ FROM workspaces WHERE id=:id │ + │ │ │ + │ │ ec2:DescribeInstances(instance_id) │ + │ │ → public_dns, availability_zone, az │ + │ │ │ + │ │ ec2-instance-connect:SendSSHPublicKey │ + │ │ target: instance_id │ + │ │ os_user: ec2-user|ubuntu │ + │ │ public_key: ephemeral (ed25519) │ + │ │ │ + │ │ ssh ec2-user@public_dns │ + │ │ -o StrictHostKeyChecking=no │ + │ ├────────────────────────────────────────▶│ + │ │ │ + │ │ docker exec -it ws- /bin/bash │ + │ ├────────────────────────────────────────▶│ + │ │ │ + │◀───── stdout bridge ────────┤◀──────────── stdout ────────────────────┤ + │───── stdin bridge ─────────▶│───────────── stdin ─────────────────────▶│ +``` + +`instance_id` is persisted on provision by migration `038_workspace_instance_id`. Terminal handler branches on `instance_id IS NOT NULL`. + +## IAM policy addition for `molecule-cp` + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Sid": "DescribeInstancesForTerminalResolution", + "Effect": "Allow", + "Action": ["ec2:DescribeInstances"], + "Resource": "*" + }, + { + "Sid": "PushEphemeralSSHKeyToWorkspaceInstances", + "Effect": "Allow", + "Action": ["ec2-instance-connect:SendSSHPublicKey"], + "Resource": "arn:aws:ec2:*:*:instance/*", + "Condition": { + "StringEquals": { + "aws:ResourceTag/molecule:role": "workspace" + } + } + } + ] +} +``` + +Rationale for the tag condition: if CP tags workspace EC2s with `molecule:role=workspace` at launch, this policy can't be weaponized to push keys into unrelated instances. **Action: CP launch code must set that tag** — otherwise drop the Condition block (broader blast radius; less safe). + +## Security group rule + +On each workspace EC2's security group, allow inbound `22/tcp` from the tenant VPC's CIDR only: + +``` +Inbound: + Port: 22 + Protocol: TCP + Source: (NOT 0.0.0.0/0) + Description: Tenant workspace-server → EIC-pushed SSH for canvas terminal +``` + +Two notes: +- EIC does NOT open 0.0.0.0/0 on port 22. The ingress must already allow the caller IP (here: the tenant). EIC only pushes the key — you still need network reachability. +- If tenant and workspace EC2s live in different VPCs (likely), use VPC peering or EIC Endpoint instead of a direct CIDR rule. + +## Key lifetime + +- ed25519 keypair generated per-session in the terminal handler +- Public half pushed via `SendSSHPublicKey` (valid 60s) +- Private half held in-memory only, discarded when the WS closes +- No keys on disk, no rotation cron, no secrets rotation debt + +## Failure modes + their user-visible messages + +| Condition | Message | Actionable? | +|---|---|---| +| `instance_id IS NULL` (local workspace) | Falls through to current local-Docker handler | n/a — existing behavior | +| `instance_id` set, DescribeInstances returns nothing | "workspace instance no longer exists — recreate the workspace" | Yes | +| `SendSSHPublicKey` 403 | "tenant lacks EIC permission — contact your admin" | Yes (requires IAM fix) | +| SSH connect timeout | "tenant cannot reach workspace instance — check security group" | Yes (SG fix) | +| `docker exec` fails (no container) | "workspace container is not running — try restart" | Yes (normal ops) | + +## Rollout checklist + +### 1. Infra prep (CP side) + +- [ ] CP tags new workspace EC2s with `molecule:role=workspace` on launch (add a 5-line change to the RunInstances call in the CP service) +- [ ] Backfill tag on existing workspace EC2s: + ``` + aws ec2 create-tags --resources $(aws ec2 describe-instances \ + --filters 'Name=tag:Component,Values=molecule-workspace' \ + --query 'Reservations[].Instances[].InstanceId' --output text) \ + --tags Key=molecule:role,Value=workspace + ``` +- [ ] Add the IAM policy above to `molecule-cp` +- [ ] Update workspace security group to allow port 22 from tenant VPC CIDR + +### 2. Tenant code (this repo) + +- [ ] PR 1 (this one): migration `038_workspace_instance_id` + persist instance_id on CP provision +- [ ] PR 2 (follow-up): terminal handler EIC + SSH branch + tests + +### 3. Verification + +- [ ] After PR 1 merges + deploys, provision a new CP workspace → verify `SELECT instance_id FROM workspaces` returns the EC2 id +- [ ] After PR 2 merges + deploys, open Terminal tab on a CP workspace → bash prompt appears +- [ ] Intentionally terminate the EC2 → Terminal tab shows the "instance no longer exists" message +- [ ] Pull the `ec2-instance-connect` action from molecule-cp temporarily → Terminal shows "tenant lacks EIC permission" + +## Future work (not in scope) + +- Session recording for compliance → SSM migration with instance profile +- Multi-user concurrent terminals → connection pooling per workspace +- Terminal for workspaces behind a private NAT with no EIC route → fall back to SSM diff --git a/workspace-server/internal/handlers/workspace_provision.go b/workspace-server/internal/handlers/workspace_provision.go index b0117494..7a46d096 100644 --- a/workspace-server/internal/handlers/workspace_provision.go +++ b/workspace-server/internal/handlers/workspace_provision.go @@ -631,6 +631,18 @@ func (h *WorkspaceHandler) provisionWorkspaceCP(workspaceID, templatePath string return } + // Persist the backing instance id so later operations (terminal via + // EIC+SSH, live logs, debug introspection) can resolve workspace → EC2 + // without re-asking CP on every request. + if _, err := db.DB.ExecContext(ctx, + `UPDATE workspaces SET instance_id = $2, updated_at = now() WHERE id = $1`, + workspaceID, machineID); err != nil { + // Non-fatal: provisioning succeeded, the workspace will still run. + // The row stays without instance_id — terminal falls back to the + // "CP-provisioned but unreachable" error, not a silent failure. + log.Printf("CPProvisioner: persist instance_id failed for %s: %v", workspaceID, err) + } + log.Printf("CPProvisioner: workspace %s started as machine %s via control plane", workspaceID, machineID) // Token issuance is deliberately deferred to the workspace's first // /registry/register call. Minting here without also delivering the diff --git a/workspace-server/migrations/038_workspace_instance_id.down.sql b/workspace-server/migrations/038_workspace_instance_id.down.sql new file mode 100644 index 00000000..e185ae17 --- /dev/null +++ b/workspace-server/migrations/038_workspace_instance_id.down.sql @@ -0,0 +1,2 @@ +DROP INDEX IF EXISTS idx_workspaces_instance_id; +ALTER TABLE workspaces DROP COLUMN IF EXISTS instance_id; diff --git a/workspace-server/migrations/038_workspace_instance_id.up.sql b/workspace-server/migrations/038_workspace_instance_id.up.sql new file mode 100644 index 00000000..61c0b82a --- /dev/null +++ b/workspace-server/migrations/038_workspace_instance_id.up.sql @@ -0,0 +1,12 @@ +-- Persist the CP-provisioner-returned EC2 instance id on the workspace row. +-- Needed so the tenant workspace-server can resolve a workspace to its +-- backing EC2 for operations like terminal (EIC + SSH), live logs, and +-- debug introspection without re-calling the control plane on every hit. +-- +-- Nullable: local-Docker workspaces never populate this column. +ALTER TABLE workspaces + ADD COLUMN IF NOT EXISTS instance_id TEXT; + +CREATE INDEX IF NOT EXISTS idx_workspaces_instance_id + ON workspaces(instance_id) + WHERE instance_id IS NOT NULL;