Tuan-Dat Tran da7bd42f07 docs: add k3s-server11 reprovision spec and cluster outage runbook

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-04-21 21:55:18 +02:00

4.9 KiB

Design: Reprovision k3s-server11

Date: 2026-04-21
Status: Approved

Background

k3s-server11 (Proxmox VM 111 on inko01) has a corrupted VM disk image (/opt/proxmox/images/111/vm-111-disk-0.raw) stored on a btrfs filesystem. The corruption has been present since about 2026-02-15, when backups started failing with I/O errors. The guest OS sees the damaged blocks as bad sectors on /dev/sda, and etcd crashes with SIGBUS when it reads those sectors through its memory-mapped database file. This triggered a full cluster outage on 2026-04-20.

The physical SSD on inko01 is healthy (SMART PASSED). The corruption is at the btrfs filesystem layer (3279+ corrupt blocks, single-device — no redundancy to recover from).

Since etcd data is fully replicated on server10 and server12, no data recovery is needed. The correct fix is to replace the disk with a fresh OS image and rejoin the node.

Architecture

Three sequential phases. Each phase must complete successfully before the next begins.

Phase 1: k8s cleanup     →  Phase 2: Proxmox disk     →  Phase 3: Ansible reprovision
(drain, etcd remove,           (stop VM, delete disk,        (common + k3s_server roles,
 delete node)                   import fresh image,            joins as secondary server,
                                resize, start)                 etcd re-adds member)
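The sequencing rule above (stop as soon as a phase fails) can be sketched as a small shell driver. This is a minimal illustration, not part of the runbook: run_phases and the phase function names in the comment are hypothetical, and each phase function is assumed to wrap the commands from the corresponding section and to return non-zero on failure.

```bash
# Run each named phase function in order; stop at the first one that fails.
# Hypothetical usage, once the phase functions are defined:
#   run_phases phase1_k8s_cleanup phase2_replace_disk phase3_reprovision
run_phases() {
  local phase
  for phase in "$@"; do
    echo "==> ${phase}"
    if ! "${phase}"; then
      echo "phase failed: ${phase}" >&2
      return 1
    fi
  done
}
```

Because Phase 2 runs on inko01 while Phases 1 and 3 run from the workstation, such a driver would need to reach inko01 over SSH. Running the phases by hand and checking each Verify step serves the same purpose.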

Phase 1: Remove server11 from the cluster

Run from a machine with kubectl access (e.g. local workstation).

1.1 Drain the node — evicts all non-daemonset pods:

kubectl drain k3s-server11 --ignore-daemonsets --delete-emptydir-data

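1.2 Remove from etcd. This prevents quorum issues while the disk is replaced. The command below runs on server11 itself. If etcd there is no longer running, run the same command on server10 or server12 instead (assuming etcdctl is installed there), since any healthy member can process the removal: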

ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  member remove e9f8fa983ff7f958'

1.3 Delete the node object:

kubectl delete node k3s-server11

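Verify: kubectl get nodes lists only server10, server12, and the agents. etcdctl member list, run on server10 or server12 with the same TLS flags as in step 1.2 (server11 is no longer a member), shows exactly 2 members: server10 and server12. The cluster remains healthy with quorum.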
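The node check can be scripted rather than read by eye. The sketch below is one possible way to do it: node_absent is a hypothetical helper name, and it assumes the output of kubectl get nodes -o name, which prints one node/<name> entry per line.

```bash
# Succeed only if the named node is missing from `kubectl get nodes -o name`
# output (one "node/<name>" entry per line) supplied on stdin.
node_absent() {
  ! grep -qxF "node/$1"
}
```

A possible invocation is kubectl get nodes -o name | node_absent k3s-server11 && echo "server11 removed".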

Phase 2: Replace the VM disk on inko01

Run directly on inko01 via SSH.

2.1 Stop the VM:

qm stop 111

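2.2 Delete the corrupt disk. Without --force, deleting scsi0 only detaches the volume and leaves it in the VM config as an unusedN entry. Deleting that entry removes the raw file:

qm set 111 --delete scsi0
qm config 111 | grep unused    # shows the detached volume, typically as unused0
qm set 111 --delete unused0    # adjust the index if the previous command shows a different one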
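Before importing the new image, it may be worth confirming that the VM config no longer references the old volume. A detached disk can remain as an unusedN entry, in which case the import would create a differently numbered file and step 2.4 would reattach the corrupt one. The sketch below assumes the key: value layout printed by qm config; volume_refs is a hypothetical helper name.

```bash
# List config keys in `qm config <vmid>` output (stdin) whose value starts
# with the given volume ID, e.g. "scsi0" or "unused0".
# Prints nothing when no entry references the volume.
volume_refs() {
  awk -F': ' -v vol="$1" 'index($2, vol) == 1 { print $1 }'
}
```

A possible invocation on inko01 is qm config 111 | volume_refs proxmox:111/vm-111-disk-0.raw. Empty output suggests the old disk is gone from the config. Checking that the file itself no longer exists under /opt/proxmox/images/111/ is a useful second confirmation.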

2.3 Import a fresh Debian 12 cloud-init image as a new disk:

qm importdisk 111 /opt/proxmox/template/iso/debian-12-genericcloud-amd64.qcow2 proxmox

This creates /opt/proxmox/images/111/vm-111-disk-0.raw from the clean base image.

2.4 Attach the disk and set boot order:

qm set 111 --scsi0 proxmox:111/vm-111-disk-0.raw --boot order=scsi0

2.5 Resize to 64G (matching original disk size):

qm resize 111 scsi0 64G

2.6 Start the VM:

qm start 111

Cloud-init runs on first boot and configures: hostname (k3s-server11), user (tudattr), SSH keys, and DHCP networking. Wait ~60s for SSH to become available before Phase 3.

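Verify: ssh k3s-server11 hostname returns k3s-server11 and no disk I/O errors appear in dmesg. The reinstalled VM is likely to present a new SSH host key, so if ssh rejects the connection, remove the stale known_hosts entry first (ssh-keygen -R k3s-server11).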
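Instead of waiting a fixed 60 seconds, the workstation can poll until SSH answers. This is a minimal sketch under assumptions: wait_for_ssh is a hypothetical helper, and the default of 12 attempts 5 seconds apart is an arbitrary choice that roughly matches the 60-second estimate.

```bash
# Retry an SSH no-op until it succeeds or the attempts run out.
# $1 = host, $2 = max attempts (default 12), $3 = seconds between tries (default 5)
wait_for_ssh() {
  local host="$1" attempts="${2:-12}" delay="${3:-5}" i=1
  while [ "$i" -le "$attempts" ]; do
    # BatchMode avoids hanging on a password or host-key prompt.
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "ssh to $host not ready after $attempts attempts" >&2
  return 1
}
```

A possible invocation is wait_for_ssh k3s-server11 && ssh k3s-server11 hostname. If the function times out, running ssh once by hand usually shows whether the VM is still booting or the connection is being refused for another reason.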

Phase 3: Reprovision via Ansible

Run from local workstation in the ansible-homelab repo.

ansible-playbook playbooks/k3s-servers.yaml --limit k3s-server11

This runs the common and k3s_server roles against server11 only:

common: installs base packages, configures SSH, hostname, etc.
k3s_server: detects that /usr/local/bin/k3s does not exist, runs the install script with --server https://192.168.20.47:6443 (the loadbalancer), and joins as a secondary server. The join authenticates with the cluster token from server10 (the primary), and k3s then registers as a new etcd member automatically.

Verify:

kubectl get nodes                    # server11 shows Ready
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  member list -w table'              # 3 members, all started
ssh k3s-server11 'dmesg | grep -i "i/o error"'  # no output
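The member check above can also be made pass/fail. The sketch below assumes etcdctl's default output (comma-separated lines of ID, status, name, peer URLs, client URLs, learner flag) rather than the -w table format used above; members_ok is a hypothetical helper name.

```bash
# Succeed only if stdin (default `etcdctl member list` output) contains
# exactly $1 members and every one of them reports the status "started".
members_ok() {
  awk -F', ' -v want="$1" '
    NF >= 3 { total++; if ($2 == "started") started++ }
    END { exit !(total + 0 == want + 0 && started + 0 == want + 0) }
  '
}
```

To use it, run the member list command from the Verify step without -w table and pipe its output into members_ok 3. The same helper with an argument of 2 fits the check at the end of Phase 1.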

Key Facts

Item	Value
VM ID	111
Proxmox host	inko01
VM disk path	`/opt/proxmox/images/111/vm-111-disk-0.raw`
Base image	`/opt/proxmox/template/iso/debian-12-genericcloud-amd64.qcow2`
Proxmox storage pool	`proxmox`
server11 IP	192.168.20.48
server11 etcd member ID	`e9f8fa983ff7f958`
Loadbalancer IP	192.168.20.47
k3s primary server	server10 (192.168.20.43)

Risk

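During Phases 1 and 2 the cluster runs on 2 etcd members. Quorum for 2 members is 2, so the cluster stays available but cannot tolerate losing either remaining server. Avoid other disruptive changes until server11 is back.
etcd member ID: e9f8fa983ff7f958 was confirmed on 2026-04-21. Re-check it against the current member list before running the remove command, especially if the cluster has changed since then.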
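One way to re-check the member ID is to look it up by node name instead of trusting the recorded value. The sketch below rests on two assumptions: etcdctl's default comma-separated member list output, and k3s naming etcd members as the node name followed by a hyphen and a short suffix. member_id_for is a hypothetical helper name.

```bash
# Print the ID of the etcd member whose name is the given node name,
# optionally followed by "-<suffix>". Reads default `etcdctl member list`
# output (ID, status, name, peer URLs, client URLs, learner flag) on stdin.
member_id_for() {
  awk -F', ' -v node="$1" '$3 == node || index($3, node "-") == 1 { print $1 }'
}
```

Piping the member list from a healthy server into member_id_for k3s-server11 should print e9f8fa983ff7f958 if the value in Key Facts is still current. Empty output or a different ID means the remove command needs updating first.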
