From 456b8fd18403ae4ce43c1ea480b73e7387f610e7 Mon Sep 17 00:00:00 2001
From: Hongming Wang
Date: Tue, 21 Apr 2026 19:50:59 -0700
Subject: [PATCH] docs(infra): workspace-terminal runbook with verified commands
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Expanded the rollout section with the exact scripts + env vars that
landed to make Hermes workspace Terminal work on 2026-04-22. Points
at molecule-controlplane#227 (which adds bootstrap script +
EIC_ENDPOINT_SG_ID env var) so operators can reproduce the setup on
a new AWS account in one command.

Also documents the existing-workspace backfill for the instance_id
column — the CP only writes on new provisions, so pre-migration
workspaces need a manual UPDATE before Terminal routes to the remote
path.

Refs: #1528 (resolved)

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 docs/infra/workspace-terminal.md | 82 ++++++++++++++++++++++++++------
 1 file changed, 68 insertions(+), 14 deletions(-)

diff --git a/docs/infra/workspace-terminal.md b/docs/infra/workspace-terminal.md
index b0364753..25acb519 100644
--- a/docs/infra/workspace-terminal.md
+++ b/docs/infra/workspace-terminal.md
@@ -1,6 +1,8 @@
 # Workspace Terminal over EIC + SSH
 
-Tracking: [molecule-core#1528](https://github.com/Molecule-AI/molecule-core/issues/1528)
+Tracking: [molecule-core#1528](https://github.com/Molecule-AI/molecule-core/issues/1528) (resolved 2026-04-22)
+
+**Status: live in prod** on hongmingwang tenant as of 2026-04-22. Verified end-to-end against the Hermes workspace EC2.
 
 ## Problem
 
@@ -142,25 +144,77 @@ Three more failure modes + ongoing bookkeeping per tenant. Skip unless you have
 | SSH connect timeout | "tenant cannot reach workspace instance — check security group" | Yes (SG fix) |
 | `docker exec` fails (no container) | "workspace container is not running — try restart" | Yes (normal ops) |
 
-## Rollout checklist
+## Rollout (verified recipe)
 
-### 1. Infra prep (one-time)
+Each AWS account (staging + prod, etc.) needs this once. The CP repo
+ships `scripts/bootstrap-eic-terminal.sh` that automates everything
+below — what's here is what the script does, in case you want to run
+the steps by hand or audit it.
 
-- [ ] Add IAM policy above to `molecule-cp` user (tag key is `Role`, already set by CP at launch — no CP change needed)
-- [ ] Create one EIC Endpoint in the workspace VPC (see command above)
-- [ ] No change to `workspaceIngressRules()` — EIC Endpoint bypasses SG ingress
+### 1. Infra (one-shot)
 
-### 2. Tenant code (this repo)
+```bash
+# From molecule-controlplane checkout (needs IAM admin creds):
+./scripts/bootstrap-eic-terminal.sh
+```
 
-- [ ] PR 1 (this one): migration `038_workspace_instance_id` + persist instance_id on CP provision
-- [ ] PR 2 (follow-up): terminal handler EIC + SSH branch + tests
+Creates (idempotent):
+- EC2 Instance Connect **service-linked role** (`AWSServiceRoleForEC2InstanceConnect`)
+- **Managed IAM policy** `MoleculeEICTerminal` (DescribeInstances + SendSSHPublicKey + OpenTunnel + CreateInstanceConnectEndpoint + DescribeInstanceConnectEndpoints)
+- **IAM role + instance profile** `MoleculeTenantEICRole` / `MoleculeTenantEICProfile` (attach the managed policy) — this replaces env-var AWS creds on tenant EC2s
+- **EIC Endpoint** in the workspace VPC (uses the default VPC SG for egress, which is all EIC Endpoint needs)
 
-### 3. Verification
+Script prints the endpoint SG id + profile name to set on the CP:
 
-- [ ] After PR 1 merges + deploys, provision a new CP workspace → verify `SELECT instance_id FROM workspaces` returns the EC2 id
-- [ ] After PR 2 merges + deploys, open Terminal tab on a CP workspace → bash prompt appears
-- [ ] Intentionally terminate the EC2 → Terminal tab shows the "instance no longer exists" message
-- [ ] Pull the `ec2-instance-connect:OpenTunnel` action from molecule-cp temporarily → Terminal shows "tenant lacks EIC permission"
+```
+EIC_ENDPOINT_SG_ID=sg-xxxxxx
+EC2_TENANT_IAM_PROFILE=MoleculeTenantEICProfile
+```
+
+### 2. CP config + redeploy
+
+Set those two env vars on the CP service (Railway dashboard or equivalent). On redeploy, [molecule-controlplane#227](https://github.com/Molecule-AI/molecule-controlplane/pull/227) ensures every **newly-provisioned** workspace + tenant SG auto-carries a `22/tcp` ingress rule sourced from the EIC Endpoint SG.
+
+### 3. Backfill existing instances
+
+Pre-existing SGs need one-time ingress added. The bootstrap script's final output includes this loop; shown here for visibility:
+
+```bash
+for sg in $(aws ec2 describe-security-groups --region us-east-2 \
+  --filters 'Name=tag:ManagedBy,Values=molecule-cp' \
+  --query 'SecurityGroups[].GroupId' --output text); do
+  aws ec2 authorize-security-group-ingress --region us-east-2 \
+    --group-id $sg --protocol tcp --port 22 --source-group sg-xxxxxx \
+    2>&1 | grep -v DuplicatePermission || true
+done
+```
+
+### 4. Tenant code (this monorepo)
+
+Already merged:
+- [#1531](https://github.com/Molecule-AI/molecule-core/pull/1531) — migration `038_workspace_instance_id` + persist on CP provision
+- [#1533](https://github.com/Molecule-AI/molecule-core/pull/1533) — terminal handler remote branch (EIC open-tunnel + ssh + pty)
+
+Tenant image (`ghcr.io/molecule-ai/platform-tenant:latest`) ships with `aws-cli` + `openssh-client` as of 2026-04-22.
+
+### 5. Verification (how to confirm after deploy)
+
+- Provision a fresh CP workspace → `SELECT instance_id FROM workspaces WHERE id = ?` is non-null
+- Open canvas Terminal on that workspace → bash prompt (`ubuntu@ip-...`)
+- Terminate the workspace EC2 manually → Terminal shows "EIC tunnel didn't come up"
+- Temporarily remove `ec2-instance-connect:OpenTunnel` from `MoleculeEICTerminal` → Terminal shows "failed to push session key"
+
+### Existing-workspace backfill of `instance_id`
+
+Migrations run on tenant boot, but pre-existing workspace rows have NULL `instance_id`. The CP provisioner only writes `instance_id` on NEW provisions; old workspaces need:
+
+```sql
+-- Inside the tenant DB
+UPDATE workspaces SET instance_id = '', updated_at = now()
+WHERE id = '';
+```
+
+For a whole fleet, join CP's workspace table with the DescribeInstances result by `WorkspaceID` tag and batch-UPDATE.
 
 ## Future work (not in scope)
 