# Design: Reprovision k3s-server11

**Date**: 2026-04-21

**Status**: Approved

## Background

k3s-server11 (Proxmox VM 111 on inko01) has a corrupted btrfs VM disk image (`/opt/proxmox/images/111/vm-111-disk-0.raw`). The corruption has been present since ~2026-02-15 (when backups started failing with I/O errors). The VM's guest OS sees this as bad sectors on `/dev/sda`, causing etcd to crash with SIGBUS when it mmap-reads those sectors. This triggered a full cluster outage on 2026-04-20.

The physical SSD on inko01 is healthy (SMART PASSED). The corruption is at the btrfs filesystem layer (3279+ corrupt blocks, single-device — no redundancy to recover from).

Since etcd data is fully replicated on server10 and server12, no data recovery is needed. The correct fix is to replace the disk with a fresh OS image and rejoin the node.

## Architecture

Three sequential phases. Each phase must complete successfully before the next begins.

```
Phase 1: k8s cleanup   →  Phase 2: Proxmox disk   →  Phase 3: Ansible reprovision
(drain, etcd remove,      (stop VM, delete disk,     (common + k3s_server roles,
 delete node)              import fresh image,        joins as secondary server,
                           resize, start)             etcd re-adds member)
```

## Phase 1: Remove server11 from the cluster

Run from a machine with `kubectl` access (e.g. local workstation).

**1.1 Drain the node** — evicts all non-daemonset pods:

```bash
kubectl drain k3s-server11 --ignore-daemonsets --delete-emptydir-data
```

**1.2 Remove from etcd** — prevents quorum issues while the disk is replaced:

```bash
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  member remove e9f8fa983ff7f958'
```

**1.3 Delete the node object**:

```bash
kubectl delete node k3s-server11
```

**Verify**: `kubectl get nodes` shows only server10, server12, and the agents.
Etcd member list shows only 2 members (server10 + server12). Cluster remains healthy with quorum.

## Phase 2: Replace the VM disk on inko01

Run directly on inko01 via SSH.

**2.1 Stop the VM**:

```bash
qm stop 111
```

**2.2 Delete the corrupt disk** (`--force` removes the raw file; without it, Proxmox only detaches the disk and keeps it as `unused0`, so the import in 2.3 would not reuse the `vm-111-disk-0.raw` name):

```bash
qm set 111 --delete scsi0 --force
```

**2.3 Import a fresh Debian 12 cloud-init image as a new disk**:

```bash
qm importdisk 111 /opt/proxmox/template/iso/debian-12-genericcloud-amd64.qcow2 proxmox
```

This creates `/opt/proxmox/images/111/vm-111-disk-0.raw` from the clean base image.

**2.4 Attach the disk and set boot order**:

```bash
qm set 111 --scsi0 proxmox:111/vm-111-disk-0.raw --boot order=scsi0
```

**2.5 Resize to 64G** (matching original disk size):

```bash
qm resize 111 scsi0 64G
```

**2.6 Start the VM**:

```bash
qm start 111
```

Cloud-init runs on first boot and configures: hostname (`k3s-server11`), user (`tudattr`), SSH keys, and DHCP networking. Wait ~60s for SSH to become available before Phase 3.

**Verify**: `ssh k3s-server11 hostname` returns `k3s-server11` and no disk I/O errors appear in `dmesg`.

## Phase 3: Reprovision via Ansible

Run from local workstation in the ansible-homelab repo.

```bash
ansible-playbook playbooks/k3s-servers.yaml --limit k3s-server11
```

This runs the `common` and `k3s_server` roles against server11 only:

- `common`: installs base packages, configures SSH, hostname, etc.
- `k3s_server`: detects `/usr/local/bin/k3s` does not exist → runs install script with `--server https://192.168.20.47:6443` (loadbalancer) → joins as a secondary server. k3s fetches the cluster token from server10 (the primary) and registers as a new etcd member automatically.
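The `k3s_server` behaviour described above can be sketched in shell. This is only an illustration of the detect-then-join logic, not the actual role from the ansible-homelab repo, which is not shown here. The names `K3S_BIN`, `K3S_URL`, `DRY_RUN`, and `join_as_secondary` are hypothetical, and the sketch assumes the join token is passed in through the `K3S_TOKEN` environment variable that the upstream k3s install script reads.

```bash
#!/usr/bin/env bash
# Sketch of the join logic described above; NOT the actual k3s_server role.
# Assumptions: K3S_BIN, K3S_URL, and DRY_RUN are illustrative names, and the
# cluster join token is supplied by the caller in K3S_TOKEN.
set -euo pipefail

K3S_BIN="${K3S_BIN:-/usr/local/bin/k3s}"
K3S_URL="${K3S_URL:-https://192.168.20.47:6443}"  # loadbalancer from Key Facts
DRY_RUN="${DRY_RUN:-1}"                           # default: print, do not install

join_as_secondary() {
  # Already installed: nothing to do.
  if [ -x "$K3S_BIN" ]; then
    echo "k3s already installed at $K3S_BIN; nothing to do"
    return 0
  fi
  # Dry run: show the command that would be executed, with the token hidden.
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: curl -sfL https://get.k3s.io | K3S_TOKEN=<redacted> sh -s - server --server $K3S_URL"
    return 0
  fi
  # Real run: refuse to continue without a token, then join the cluster.
  : "${K3S_TOKEN:?K3S_TOKEN must be set to the cluster join token}"
  curl -sfL https://get.k3s.io | K3S_TOKEN="$K3S_TOKEN" sh -s - server --server "$K3S_URL"
}
```

Calling `join_as_secondary` with the default `DRY_RUN=1` only prints the install command, which makes it easy to confirm the target URL before touching the node.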
**Verify**:

```bash
kubectl get nodes
# server11 shows Ready

ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  member list -w table'
# 3 members, all started

# sudo: Debian restricts dmesg to root by default, and a permission error
# would otherwise look like "no output"
ssh k3s-server11 'sudo dmesg | grep -i "i/o error"'
# no output
```

## Key Facts

| Item | Value |
|------|-------|
| VM ID | 111 |
| Proxmox host | inko01 |
| VM disk path | `/opt/proxmox/images/111/vm-111-disk-0.raw` |
| Base image | `/opt/proxmox/template/iso/debian-12-genericcloud-amd64.qcow2` |
| Proxmox storage pool | `proxmox` |
| server11 IP | 192.168.20.48 |
| server11 etcd member ID | `e9f8fa983ff7f958` |
| Loadbalancer IP | 192.168.20.47 |
| k3s primary server | server10 (192.168.20.43) |

## Risk

- **During Phase 1–2**: cluster runs on 2 etcd members. Still has quorum but no redundancy. Avoid other disruptive changes until server11 is back.
- **etcd member ID**: `e9f8fa983ff7f958` was confirmed on 2026-04-21. Verify it matches before running the remove command if time has passed.
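
The member-ID check called for under Risk could be scripted roughly as follows. This is a minimal sketch: `member_id_for_ip` and `check_member_id` are hypothetical helper names, and it assumes the default (simple) output of `etcdctl member list`, where each line is a comma-separated record of ID, status, name, peer URLs, client URLs, and learner flag. It matches on the peer URL's IP rather than the member name.

```bash
#!/usr/bin/env bash
# Sketch: confirm the etcd member ID for a node before running `member remove`.
# Reads `etcdctl member list` output (default simple format) on stdin.
# Helper names are illustrative, not part of any existing tooling.
set -euo pipefail

# Print the member ID whose peer URL (4th field) points at the given IP.
member_id_for_ip() {
  local ip="$1"
  awk -F', ' -v ip="$ip" 'index($4, "//" ip ":") { print $1 }'
}

# Compare the ID found for the IP against the expected ID.
check_member_id() {
  local expected="$1" ip="$2" actual
  actual="$(member_id_for_ip "$ip")"
  if [ "$actual" = "$expected" ]; then
    echo "OK: $ip is member $actual"
  else
    echo "MISMATCH: expected $expected, found '${actual}'" >&2
    return 1
  fi
}
```

To use it, pipe the output of the step 1.2 `etcdctl` invocation, with `member list` in place of `member remove e9f8fa983ff7f958`, into `check_member_id e9f8fa983ff7f958 192.168.20.48`, and proceed with the removal only if it prints `OK`.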