# Design: Reprovision k3s-server11
**Date**: 2026-04-21
**Status**: Approved
## Background
k3s-server11 (Proxmox VM 111 on inko01) has a corrupted btrfs VM disk image
(`/opt/proxmox/images/111/vm-111-disk-0.raw`). The corruption has been present since
~2026-02-15 (when backups started failing with I/O errors). The VM's guest OS sees this
as bad sectors on `/dev/sda`, causing etcd to crash with SIGBUS when it mmap-reads those
sectors. This triggered a full cluster outage on 2026-04-20.
The physical SSD on inko01 is healthy (SMART PASSED). The corruption is at the btrfs
filesystem layer (3279+ corrupt blocks, single-device — no redundancy to recover from).
Since etcd data is fully replicated on server10 and server12, no data recovery is needed.
The correct fix is to replace the disk with a fresh OS image and rejoin the node.
## Architecture
Three sequential phases. Each phase must complete successfully before the next begins.
```
Phase 1: k8s cleanup   →  Phase 2: Proxmox disk   →  Phase 3: Ansible reprovision
(drain, etcd remove,      (stop VM, delete disk,     (common + k3s_server roles,
 delete node)              import fresh image,        joins as secondary server,
                           resize, start)             etcd re-adds member)
```
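The "each phase must complete before the next begins" rule can be sketched as a small fail-fast runner. This is only an illustration: the phase script names in the usage comment are hypothetical placeholders, not files that exist in the repo.

```bash
#!/usr/bin/env bash
# Run each phase command in order; stop at the first one that fails.
run_phases() {
  local n=0 phase
  for phase in "$@"; do
    n=$((n + 1))
    echo "==> Phase ${n}: ${phase}"
    if ! "$phase"; then
      echo "Phase ${n} failed; not continuing." >&2
      return 1
    fi
  done
  echo "All ${n} phases completed."
}

# Usage (hypothetical script names, one per phase):
#   run_phases ./phase1-k8s-cleanup.sh ./phase2-proxmox-disk.sh ./phase3-ansible.sh
```

Running the phases by hand and checking each **Verify** step works just as well; the runner only makes the ordering explicit.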
## Phase 1: Remove server11 from the cluster
Run from a machine with `kubectl` access (e.g. local workstation).
**1.1 Drain the node** — evicts all non-daemonset pods:
```bash
kubectl drain k3s-server11 --ignore-daemonsets --delete-emptydir-data
```
**1.2 Remove from etcd** — prevents quorum issues while the disk is replaced:
```bash
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
--cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
--key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
member remove e9f8fa983ff7f958'
```
**1.3 Delete the node object**:
```bash
kubectl delete node k3s-server11
```
**Verify**: `kubectl get nodes` shows only server10, server12, and the agents. Etcd member
list shows only 2 members (server10 + server12). Cluster remains healthy with quorum.
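As a quick sanity check on the quorum claim: etcd needs a majority of voting members, so quorum is floor(N/2) + 1. The sketch below (function names are illustrative) works this out for the 3-member and 2-member cases.

```bash
# Quorum for an etcd cluster of N voting members is floor(N/2) + 1.
etcd_quorum() {
  local members=$1
  echo $(( members / 2 + 1 ))
}

# Number of members that can fail while quorum is still held.
etcd_fault_tolerance() {
  local members=$1
  echo $(( members - (members / 2 + 1) ))
}

for n in 3 2; do
  echo "members=${n} quorum=$(etcd_quorum "$n") tolerated_failures=$(etcd_fault_tolerance "$n")"
done
# members=3 quorum=2 tolerated_failures=1
# members=2 quorum=2 tolerated_failures=0
```

With server11 removed, both remaining members are required for quorum, so losing either one would stop the cluster.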
## Phase 2: Replace the VM disk on inko01
Run directly on inko01 via SSH.
**2.1 Stop the VM**:
```bash
qm stop 111
```
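**2.2 Delete the corrupt disk** (`--force` removes the raw file; without it Proxmox only detaches the disk and keeps the volume as an `unused` entry, so the import in 2.3 would not produce `vm-111-disk-0.raw`):
```bash
qm set 111 --delete scsi0 --force
```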
**2.3 Import a fresh Debian 12 cloud-init image as a new disk**:
```bash
qm importdisk 111 /opt/proxmox/template/iso/debian-12-genericcloud-amd64.qcow2 proxmox
```
This creates `/opt/proxmox/images/111/vm-111-disk-0.raw` from the clean base image.
**2.4 Attach the disk and set boot order**:
```bash
qm set 111 --scsi0 proxmox:111/vm-111-disk-0.raw --boot order=scsi0
```
**2.5 Resize to 64G** (matching original disk size):
```bash
qm resize 111 scsi0 64G
```
**2.6 Start the VM**:
```bash
qm start 111
```
Cloud-init runs on first boot and configures: hostname (`k3s-server11`), user (`tudattr`),
SSH keys, and DHCP networking. Wait ~60s for SSH to become available before Phase 3.
**Verify**: `ssh k3s-server11 hostname` returns `k3s-server11` and no disk I/O errors
appear in `dmesg`.
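Instead of a fixed ~60s wait, SSH availability can be polled. The helper below is a minimal sketch; the attempt count and delay in the usage comment are assumptions, not measured values.

```bash
# Retry a command until it succeeds or the attempts run out.
# Arguments: max attempts, delay in seconds between attempts, then the command.
wait_for() {
  local attempts=$1 delay=$2
  shift 2
  local i
  for (( i = 1; i <= attempts; i++ )); do
    if "$@" >/dev/null 2>&1; then
      echo "ready after ${i} attempt(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "gave up after ${attempts} attempt(s)" >&2
  return 1
}

# Usage: poll SSH every 5 seconds, up to 24 times (about 2 minutes).
#   wait_for 24 5 ssh -o ConnectTimeout=5 -o BatchMode=yes k3s-server11 true
```

Because the VM boots from a fresh image, its SSH host key will likely change; clearing the stale entry first with `ssh-keygen -R k3s-server11` may be needed before the poll can succeed.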
## Phase 3: Reprovision via Ansible
Run from the local workstation, inside the ansible-homelab repo.
```bash
ansible-playbook playbooks/k3s-servers.yaml --limit k3s-server11
```
This runs the `common` and `k3s_server` roles against server11 only:
- `common`: installs base packages, configures SSH, hostname, etc.
- `k3s_server`: detects `/usr/local/bin/k3s` does not exist → runs install script with
`--server https://192.168.20.47:6443` (loadbalancer) → joins as a secondary server.
k3s fetches the cluster token from server10 (the primary) and registers as a new etcd
member automatically.
**Verify**:
```bash
kubectl get nodes # server11 shows Ready
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
--cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
--key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
member list -w table' # 3 members, all started
ssh k3s-server11 'dmesg | grep -i "i/o error"' # no output
```
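The manual checks above can be turned into pass/fail helpers. This sketch assumes the default comma-separated `etcdctl member list` output (ID, status, name, ...) rather than the `-w table` form; both function names are illustrative.

```bash
# Count etcd members reported as "started".
# Reads default `etcdctl member list` output on stdin, where the second
# comma-separated field is the member status.
count_started_members() {
  awk -F', ' '$2 == "started" { n++ } END { print n + 0 }'
}

# Succeed only if no kernel log line on stdin mentions an I/O error.
no_io_errors() {
  ! grep -qi 'i/o error'
}

# Usage (pipe in the output of the commands shown above):
#   <etcdctl ... member list> | count_started_members    # expect 3
#   ssh k3s-server11 dmesg | no_io_errors && echo "disk looks clean"
```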
## Key Facts
| Item | Value |
|------|-------|
| VM ID | 111 |
| Proxmox host | inko01 |
| VM disk path | `/opt/proxmox/images/111/vm-111-disk-0.raw` |
| Base image | `/opt/proxmox/template/iso/debian-12-genericcloud-amd64.qcow2` |
| Proxmox storage pool | `proxmox` |
| server11 IP | 192.168.20.48 |
| server11 etcd member ID | `e9f8fa983ff7f958` |
| Loadbalancer IP | 192.168.20.47 |
| k3s primary server | server10 (192.168.20.43) |
## Risk
- **During Phases 1-2**: cluster runs on 2 etcd members. Still has quorum but no
redundancy. Avoid other disruptive changes until server11 has rejoined in Phase 3.
- **etcd member ID**: `e9f8fa983ff7f958` was confirmed on 2026-04-21. Verify it matches
before running the remove command if time has passed.
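One way to re-check the member ID before removal is to look it up by node name in the live member list. The sketch below assumes the default comma-separated `etcdctl member list` output and assumes that k3s names etcd members `<node-name>-<suffix>`; adjust the match if the names look different. Both helper names are illustrative.

```bash
# Print the etcd member ID whose name starts with the given node name.
# Reads default `etcdctl member list` output on stdin (ID, status, name, ...).
# Assumption: the member name begins with the node name.
member_id_for() {
  local node=$1
  awk -F', ' -v node="$node" 'index($3, node) == 1 { print $1 }'
}

# Compare the live ID against the one recorded in this document.
confirm_member_id() {
  local expected=$1 actual=$2
  if [ "$actual" = "$expected" ]; then
    echo "match: ${actual}"
  else
    echo "MISMATCH: expected ${expected}, got ${actual:-<none>}" >&2
    return 1
  fi
}

# Usage:
#   live_id="$(<etcdctl ... member list> | member_id_for k3s-server11)"
#   confirm_member_id e9f8fa983ff7f958 "$live_id"
```

If the IDs differ, use the live ID in step 1.2 rather than the recorded one.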