diff --git a/docs/superpowers/plans/2026-04-21-k3s-server11-reprovision.md b/docs/superpowers/plans/2026-04-21-k3s-server11-reprovision.md
new file mode 100644
index 0000000..1d72b50
--- /dev/null
+++ b/docs/superpowers/plans/2026-04-21-k3s-server11-reprovision.md
@@ -0,0 +1,339 @@
+# k3s-server11 Reprovision Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Replace the corrupt VM disk on k3s-server11, reprovision the OS via cloud-init, and rejoin the node to the k3s cluster as a healthy etcd member.
+
+**Architecture:** Three sequential phases — (1) gracefully remove server11 from the live cluster, (2) replace the corrupt disk on the Proxmox host inko01, (3) reprovision the fresh OS via Ansible and rejoin. etcd data is safe on server10 and server12 throughout.
+
+**Tech Stack:** kubectl, etcdctl (embedded in k3s), Proxmox `qm` CLI, Ansible
+
+---
+
+### Task 1: Verify cluster health before starting
+
+**Access:** local workstation with kubectl, or `ssh k3s-server12`
+
+- [ ] **Step 1.1: Confirm all 3 etcd members are present and healthy**
+
+```bash
+ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
+  --endpoints=https://192.168.20.43:2379,https://192.168.20.48:2379,https://192.168.20.56:2379 \
+  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
+  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
+  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
+  endpoint health -w table'
+```
+
+Expected output — all three endpoints show `true`:
+```
++----------------------------+--------+-------+-------+
+| ENDPOINT                   | HEALTH | TOOK  | ERROR |
++----------------------------+--------+-------+-------+
+| https://192.168.20.43:2379 | true   | ~8ms  |       |
+| https://192.168.20.56:2379 | true   | ~11ms |       |
+| https://192.168.20.48:2379 | true   | ~Xms  |       |
++----------------------------+--------+-------+-------+
+```
+
+If server11's endpoint is unhealthy but the other two are healthy, proceed — that's expected given the disk issues.
+
+- [ ] **Step 1.2: Confirm server11's current etcd member ID**
+
+```bash
+ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
+  --endpoints=https://127.0.0.1:2379 \
+  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
+  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
+  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
+  member list -w table'
+```
+
+Expected: server11's member ID is `e9f8fa983ff7f958`. If it differs, use the ID shown here in Task 2, Step 2.2.
+
+- [ ] **Step 1.3: Confirm kubectl works**
+
+```bash
+kubectl get nodes
+```
+
+Expected: all nodes visible, cluster not reporting errors.
+
+---
+
+### Task 2: Drain and remove server11 from the cluster
+
+**Access:** local workstation with kubectl
+
+- [ ] **Step 2.1: Drain the node**
+
+```bash
+kubectl drain k3s-server11 --ignore-daemonsets --delete-emptydir-data
+```
+
+Expected: pods are evicted and the command ends with `node/k3s-server11 drained`. DaemonSet pods are skipped (normal).
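If the recorded member ID turns out to be stale, it can be extracted from `member list` output by node name rather than copied by hand. A minimal sketch against etcdctl's default comma-separated output format; the `sample` variable is a stand-in for the real command output:

```shell
# Extract an etcd member ID by node name from `etcdctl member list` output.
# Default (non-table) format: ID, status, name, peer URLs, client URLs, learner.
# `sample` stands in for the live command output.
sample='e9f8fa983ff7f958, started, k3s-server11, https://192.168.20.48:2380, https://192.168.20.48:2379, false'
member_id=$(printf '%s\n' "$sample" | awk -F', ' '$3 == "k3s-server11" {print $1}')
echo "$member_id"   # -> e9f8fa983ff7f958
```

In practice, replace `printf '%s\n' "$sample"` with the real `etcdctl ... member list` invocation and feed `$member_id` to `member remove`.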
+
+- [ ] **Step 2.2: Remove server11 from the etcd member list**
+
+While server11 is still up, remove the member via its local etcd endpoint:
+
+```bash
+ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
+  --endpoints=https://127.0.0.1:2379 \
+  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
+  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
+  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
+  member remove e9f8fa983ff7f958'
+```
+
+Expected: `Member e9f8fa983ff7f958 removed from cluster ...`
+
+If server11's etcd is not reachable, run it from server12 instead:
+
+```bash
+ssh k3s-server12 'sudo ETCDCTL_API=3 etcdctl \
+  --endpoints=https://127.0.0.1:2379 \
+  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
+  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
+  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
+  member remove e9f8fa983ff7f958'
+```
+
+- [ ] **Step 2.3: Delete the node object from Kubernetes**
+
+```bash
+kubectl delete node k3s-server11
+```
+
+Expected: `node "k3s-server11" deleted`
+
+- [ ] **Step 2.4: Verify cluster is healthy with 2 etcd members**
+
+```bash
+ssh k3s-server12 'sudo ETCDCTL_API=3 etcdctl \
+  --endpoints=https://192.168.20.43:2379,https://192.168.20.56:2379 \
+  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
+  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
+  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
+  member list -w table'
+```
+
+Expected: exactly 2 members (server10 + server12), both `started`.
+
+```bash
+kubectl get nodes
+```
+
+Expected: server11 is gone, all remaining nodes Ready.
+
+---
+
+### Task 3: Replace the corrupt disk on inko01
+
+**Access:** `ssh inko01`
+
+- [ ] **Step 3.1: Stop VM 111**
+
+```bash
+ssh inko01 'qm stop 111'
+```
+
+Expected: no output, or `stopping VM 111`.
+Verify:
+
+```bash
+ssh inko01 'qm status 111'
+```
+
+Expected: `status: stopped`
+
+- [ ] **Step 3.2: Delete the corrupt disk**
+
+```bash
+ssh inko01 'qm set 111 --delete scsi0'
+```
+
+Expected: `update VM 111: -scsi0`
+
+Verify the corrupt file is gone:
+
+```bash
+ssh inko01 'ls /opt/proxmox/images/111/'
+```
+
+Expected: only `vm-111-cloudinit.qcow2` remains (no `vm-111-disk-0.raw`). Note: depending on the Proxmox version, `--delete scsi0` may only detach the disk as `unused0` rather than destroy the file; if `vm-111-disk-0.raw` still appears, remove the unused entry (e.g. with `qm disk unlink`) before proceeding.
+
+- [ ] **Step 3.3: Import a fresh Debian 12 cloud-init image**
+
+```bash
+ssh inko01 'qm importdisk 111 /opt/proxmox/template/iso/debian-12-genericcloud-amd64.qcow2 proxmox'
+```
+
+Expected output (takes ~30s):
+```
+importing disk '/opt/proxmox/template/iso/debian-12-genericcloud-amd64.qcow2' to VM 111 ...
+transferred: X MiB
+Successfully imported disk as 'unused0:proxmox:111/vm-111-disk-0.raw'
+```
+
+- [ ] **Step 3.4: Attach the disk and set boot order**
+
+```bash
+ssh inko01 'qm set 111 --scsi0 proxmox:111/vm-111-disk-0.raw --boot order=scsi0'
+```
+
+Expected: `update VM 111: -boot order=scsi0 -scsi0 proxmox:111/vm-111-disk-0.raw`
+
+- [ ] **Step 3.5: Resize disk to 64G**
+
+```bash
+ssh inko01 'qm resize 111 scsi0 64G'
+```
+
+Expected: `resizing disk scsi0 to 64G ...` or `size is already 64G` if the import was exact.
+
+- [ ] **Step 3.6: Start the VM**
+
+```bash
+ssh inko01 'qm start 111'
+```
+
+Expected: no output. Verify:
+
+```bash
+ssh inko01 'qm status 111'
+```
+
+Expected: `status: running`
+
+- [ ] **Step 3.7: Wait for cloud-init and SSH to be ready**
+
+Cloud-init configures the hostname, user, and SSH keys on first boot (~60s). Poll until SSH responds:
+
+```bash
+until ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no k3s-server11 'hostname' 2>/dev/null; do
+  echo "waiting for SSH..."; sleep 10
+done
+```
+
+Expected: prints `k3s-server11` when ready.
+
+- [ ] **Step 3.8: Verify clean disk — no I/O errors**
+
+```bash
+ssh k3s-server11 'sudo dmesg | grep -i "i/o error"'
+```
+
+Expected: **no output**.
+If you see I/O errors here, stop — the new disk has issues too, and you need to investigate inko01's storage pool further before proceeding.
+
+---
+
+### Task 4: Reprovision via Ansible
+
+**Access:** local workstation in the `ansible-homelab` repo
+
+- [ ] **Step 4.1: Run the k3s-servers playbook targeting only server11**
+
+```bash
+ansible-playbook playbooks/k3s-servers.yaml --limit k3s-server11
+```
+
+This runs the `common` and `k3s_server` roles. Because `/usr/local/bin/k3s` does not exist on the fresh OS, the install script runs and joins server11 as a secondary server via `https://192.168.20.47:6443` (the load balancer). k3s automatically registers the node as a new etcd member.
+
+Expected: playbook completes with no failed tasks.
+
+- [ ] **Step 4.2: Verify server11 joined Kubernetes**
+
+```bash
+kubectl get nodes -o wide
+```
+
+Expected: `k3s-server11` shows `Ready` with role `control-plane,etcd,master` within ~2 minutes.
+
+- [ ] **Step 4.3: Verify all 3 etcd endpoints are healthy again**
+
+```bash
+ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
+  --endpoints=https://192.168.20.43:2379,https://192.168.20.48:2379,https://192.168.20.56:2379 \
+  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
+  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
+  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
+  endpoint health -w table'
+```
+
+Expected: all 3 endpoints healthy, server11 responding in <100ms (not ~400ms as before).
+
+- [ ] **Step 4.4: Verify etcd has 3 members**
+
+```bash
+ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
+  --endpoints=https://127.0.0.1:2379 \
+  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
+  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
+  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
+  member list -w table'
+```
+
+Expected: 3 members, all `started`.
+
+- [ ] **Step 4.5: Uncordon the node**
+
+The drain in Task 2 cordoned the node.
+Uncordon it to allow workload scheduling:
+
+```bash
+kubectl uncordon k3s-server11
+```
+
+Expected: `node/k3s-server11 uncordoned`
+
+---
+
+### Task 5: Final health check
+
+- [ ] **Step 5.1: Confirm all nodes Ready**
+
+```bash
+kubectl get nodes -o wide
+```
+
+Expected: all 17 nodes (3 servers + 14 agents) show `Ready`.
+
+- [ ] **Step 5.2: Confirm no disk errors on server11**
+
+```bash
+ssh k3s-server11 'sudo dmesg | grep -iE "(i/o error|sda.*error|error.*sda)" | wc -l'
+```
+
+Expected: `0`
+
+- [ ] **Step 5.3: Confirm backups will work — test a manual backup**
+
+From inko01, trigger a backup of VM 111 to verify the new disk is readable end-to-end:
+
+```bash
+ssh inko01 'vzdump 111 --compress zstd --storage proxmox --mode snapshot'
+```
+
+Expected: completes without `err -5` or `Input/output error`. Backups had been failing since 2026-02-15, so a successful run here confirms the disk is fully healthy.
+
+- [ ] **Step 5.4: Update the runbook**
+
+In `docs/runbooks/k3s-cluster-outage-2026-04-20.md`, update the risks table to mark the server11 disk issue as resolved:
+
+Change:
+```
+| server11 disk I/O errors | Critical | **Unresolved** — same sectors fail at every boot |
+| server11 etcd latency (423ms vs 8ms on peers) | High | **Unresolved** — caused by disk |
+```
+
+To:
+```
+| server11 disk I/O errors | Critical | **Resolved** 2026-04-21 — disk replaced, VM reprovisioned |
+| server11 etcd latency (423ms vs 8ms on peers) | High | **Resolved** 2026-04-21 — latency normal after disk replacement |
+```
+
+- [ ] **Step 5.5: Commit**
+
+```bash
+git add docs/runbooks/k3s-cluster-outage-2026-04-20.md
+git commit -m "docs: mark server11 disk issue resolved in runbook"
+```
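The node checks in Steps 4.2 and 5.1 can be scripted as a single gate instead of eyeballed. A minimal sketch that parses `kubectl get nodes --no-headers` output; the `sample` variable is a stand-in for live command output, and the node ages and k3s version strings in it are illustrative:

```shell
# Hypothetical readiness gate: name any node whose STATUS column is not
# exactly "Ready" (this also flags cordoned nodes, which show
# "Ready,SchedulingDisabled"). `sample` stands in for live output.
sample='k3s-server10   Ready   control-plane,etcd,master   120d   v1.29.4+k3s1
k3s-server11   Ready   control-plane,etcd,master   5m     v1.29.4+k3s1
k3s-server12   Ready   control-plane,etcd,master   120d   v1.29.4+k3s1'
not_ready=$(printf '%s\n' "$sample" | awk '$2 != "Ready" {print $1}')
if [ -z "$not_ready" ]; then
  echo "all nodes Ready"
else
  echo "not ready: $not_ready"
fi
```

In practice, pipe the live command instead: `kubectl get nodes --no-headers | awk '$2 != "Ready" {print $1}'`.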