docs: add k3s-server11 reprovision spec and cluster outage runbook

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
Tuan-Dat Tran
2026-04-21 21:55:18 +02:00
parent f0a45e3fda
commit da7bd42f07
2 changed files with 396 additions and 0 deletions


@@ -0,0 +1,146 @@
# Design: Reprovision k3s-server11
**Date**: 2026-04-21
**Status**: Approved
## Background
k3s-server11 (Proxmox VM 111 on inko01) has a corrupted btrfs VM disk image
(`/opt/proxmox/images/111/vm-111-disk-0.raw`). The corruption has been present since
~2026-02-15 (when backups started failing with I/O errors). The VM's guest OS sees this
as bad sectors on `/dev/sda`, causing etcd to crash with SIGBUS when it mmap-reads those
sectors. This triggered a full cluster outage on 2026-04-20.
The physical SSD on inko01 is healthy (SMART PASSED). The corruption is at the btrfs
filesystem layer (3279+ corrupt blocks, single-device — no redundancy to recover from).
Since etcd data is fully replicated on server10 and server12, no data recovery is needed.
The correct fix is to replace the disk with a fresh OS image and rejoin the node.
## Architecture
Three sequential phases. Each phase must complete successfully before the next begins.
```
Phase 1: k8s cleanup   →   Phase 2: Proxmox disk   →   Phase 3: Ansible reprovision
(drain, etcd remove,       (stop VM, delete disk,      (common + k3s_server roles,
 delete node)               import fresh image,         joins as secondary server,
                            resize, start)              etcd re-adds member)
```
## Phase 1: Remove server11 from the cluster
Run from a machine with `kubectl` access (e.g. local workstation).
**1.1 Drain the node** — evicts all non-daemonset pods:
```bash
kubectl drain k3s-server11 --ignore-daemonsets --delete-emptydir-data
```
**1.2 Remove from etcd** — prevents quorum issues while the disk is replaced:
```bash
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
--cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
--key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
member remove e9f8fa983ff7f958'
```
**1.3 Delete the node object**:
```bash
kubectl delete node k3s-server11
```
**Verify**: `kubectl get nodes` shows only server10, server12, and the agents. Etcd member
list shows only 2 members (server10 + server12). Cluster remains healthy with quorum.
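The node-list check above can be scripted instead of eyeballed. A minimal sketch — `expect_node_absent` is a hypothetical helper name, and it only parses the plain-text output of `kubectl get nodes --no-headers`:

```shell
# Hypothetical helper: succeed only if the named node does NOT appear in
# the first column of `kubectl get nodes --no-headers` output.
expect_node_absent() {
  # $1 = node-list output, $2 = node name that must be gone
  ! printf '%s\n' "$1" | awk -v n="$2" '$1 == n { found=1 } END { exit !found }'
}

# Example (run after the delete):
#   expect_node_absent "$(kubectl get nodes --no-headers)" k3s-server11 \
#     || { echo "server11 still registered" >&2; exit 1; }
```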
## Phase 2: Replace the VM disk on inko01
Run directly on inko01 via SSH.
**2.1 Stop the VM**:
```bash
qm stop 111
```
**2.2 Delete the corrupt disk** (detaches and removes the raw file):
```bash
qm set 111 --delete scsi0
```
**2.3 Import a fresh Debian 12 cloud-init image as a new disk**:
```bash
qm importdisk 111 /opt/proxmox/template/iso/debian-12-genericcloud-amd64.qcow2 proxmox
```
This creates `/opt/proxmox/images/111/vm-111-disk-0.raw` from the clean base image.
**2.4 Attach the disk and set boot order**:
```bash
qm set 111 --scsi0 proxmox:111/vm-111-disk-0.raw --boot order=scsi0
```
**2.5 Resize to 64G** (matching original disk size):
```bash
qm resize 111 scsi0 64G
```
**2.6 Start the VM**:
```bash
qm start 111
```
Cloud-init runs on first boot and configures: hostname (`k3s-server11`), user (`tudattr`),
SSH keys, and DHCP networking. Wait ~60s for SSH to become available before Phase 3.
**Verify**: `ssh k3s-server11 hostname` returns `k3s-server11` and no disk I/O errors
appear in `dmesg`.
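The "wait ~60s" step can be made deterministic with a small polling loop. A sketch under assumptions: `wait_for` is a hypothetical helper, and `k3s-server11` is assumed to be resolvable via SSH config as elsewhere in this doc:

```shell
# Hypothetical helper: retry a command until it succeeds or a deadline passes.
# Usage: wait_for <timeout-seconds> <command...>
wait_for() {
  deadline=$(( $(date +%s) + $1 )); shift
  until "$@" >/dev/null 2>&1; do
    # Give up once the deadline is reached.
    [ "$(date +%s)" -ge "$deadline" ] && return 1
    sleep 2
  done
}

# Example: block (up to 5 min) until the reprovisioned VM answers over SSH:
#   wait_for 300 ssh -o ConnectTimeout=5 -o BatchMode=yes k3s-server11 true
```

`BatchMode=yes` makes the probe fail fast instead of prompting, so the loop keeps polling until cloud-init has installed the SSH keys.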
## Phase 3: Reprovision via Ansible
Run from local workstation in the ansible-homelab repo.
```bash
ansible-playbook playbooks/k3s-servers.yaml --limit k3s-server11
```
This runs the `common` and `k3s_server` roles against server11 only:
- `common`: installs base packages, configures SSH, hostname, etc.
- `k3s_server`: detects `/usr/local/bin/k3s` does not exist → runs install script with
`--server https://192.168.20.47:6443` (loadbalancer) → joins as a secondary server.
k3s fetches the cluster token from server10 (the primary) and registers as a new etcd
member automatically.
**Verify**:
```bash
kubectl get nodes # server11 shows Ready
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
--cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
--key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
member list -w table' # 3 members, all started
ssh k3s-server11 'dmesg | grep -i "i/o error"' # no output
```
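The "3 members, all started" expectation can also be checked mechanically. A sketch assuming etcdctl's default comma-separated `member list` output (the `-w table` format above is for humans); `count_started_members` is a hypothetical name:

```shell
# Hypothetical helper: count members whose status field is "started" in
# default `etcdctl member list` output (ID, status, name, peer URL, ...).
count_started_members() {
  printf '%s\n' "$1" | awk -F', ' '$2 == "started" { n++ } END { print n+0 }'
}

# Example: expect 3 after server11 rejoins:
#   [ "$(count_started_members "$members")" = "3" ] || echo "etcd not fully rejoined" >&2
```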
## Key Facts
| Item | Value |
|------|-------|
| VM ID | 111 |
| Proxmox host | inko01 |
| VM disk path | `/opt/proxmox/images/111/vm-111-disk-0.raw` |
| Base image | `/opt/proxmox/template/iso/debian-12-genericcloud-amd64.qcow2` |
| Proxmox storage pool | `proxmox` |
| server11 IP | 192.168.20.48 |
| server11 etcd member ID | `e9f8fa983ff7f958` |
| Loadbalancer IP | 192.168.20.47 |
| k3s primary server | server10 (192.168.20.43) |
## Risk
- **Between Phase 1 and Phase 3**: the cluster runs on 2 etcd members. It still has
  quorum but no redundancy. Avoid other disruptive changes until server11 is back.
- **etcd member ID**: `e9f8fa983ff7f958` was confirmed on 2026-04-21. Verify it matches
before running the remove command if time has passed.
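That re-verification can be done by looking the ID up by name rather than trusting the cached value. A sketch — `extract_member_id` is a hypothetical helper that parses etcdctl's default comma-separated `member list` output:

```shell
# Hypothetical helper: find a member's ID by name in default
# `etcdctl member list` output (fields: ID, status, name, peer URL, ...).
extract_member_id() {
  # $1 = member-list output, $2 = member name to look up
  printf '%s\n' "$1" | awk -F', ' -v name="$2" '$3 == name { print $1 }'
}

# Example: fetch the live list over SSH, then compare before removing:
#   id=$(extract_member_id "$members" k3s-server11)
#   [ "$id" = "e9f8fa983ff7f958" ] || echo "member ID changed: $id" >&2
```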