From da7bd42f076dc816f4bfa5cacb9ee332f147108a Mon Sep 17 00:00:00 2001
From: Tuan-Dat Tran
Date: Tue, 21 Apr 2026 21:55:18 +0200
Subject: [PATCH] docs: add k3s-server11 reprovision spec and cluster outage runbook

Co-Authored-By: Claude Sonnet 4.6
---
 .../runbooks/k3s-cluster-outage-2026-04-20.md  | 250 ++++++++++++++++++
 ...6-04-21-k3s-server11-reprovision-design.md  | 146 ++++++++++
 2 files changed, 396 insertions(+)
 create mode 100644 docs/runbooks/k3s-cluster-outage-2026-04-20.md
 create mode 100644 docs/superpowers/specs/2026-04-21-k3s-server11-reprovision-design.md

diff --git a/docs/runbooks/k3s-cluster-outage-2026-04-20.md b/docs/runbooks/k3s-cluster-outage-2026-04-20.md
new file mode 100644
index 0000000..97276b3
--- /dev/null
+++ b/docs/runbooks/k3s-cluster-outage-2026-04-20.md
@@ -0,0 +1,250 @@
+# Runbook: k3s Cluster Outage (2026-04-20 / 2026-04-21)
+
+## Incident Summary
+
+- **Start**: ~22:43 CEST on 2026-04-20 (k3s-server10 stuck in activating state)
+- **Cluster down**: ~23:06 CEST on 2026-04-20 (API servers unreachable on all nodes)
+- **Recovery**: ~07:25 CEST on 2026-04-21 (both server11 and server12 rebooted, etcd reformed)
+- **Root cause**: Failing virtual disk on k3s-server11 combined with etcd overload from Longhorn orphan writes
+
+---
+
+## What Happened (Timeline)
+
+1. **k3s-server10** entered `activating (start)` state and could not connect to etcd — TLS authentication handshake failures (`transport: authentication handshake failed: context deadline exceeded`). server10 was not present in the etcd member list.
+
+2. **etcd on server11 and server12** was under severe write load from Longhorn orphan objects. Raft consensus was taking 480–780ms per request (expected <100ms). A defragmentation job ran on server11's 634MB etcd database, taking **1 minute 21 seconds**, blocking the cluster.
+
+3. **server11** crashed with **SIGBUS** — etcd had mmap'd the database file and a read hit a bad disk sector. The journal also showed `Input/output error` when opening journal files. Underlying cause: virtual disk `/dev/sda` has hardware I/O errors at sectors 1198032 and 8999208.
+
+4. With server11's etcd gone, the 2-member cluster lost quorum. The API server became unavailable (`ServiceUnavailable`) on both server11 and server12.
+
+5. Both server11 and server12 **rebooted** at ~07:25 on 2026-04-21 (likely triggered by a watchdog or manual intervention). After reboot, all 3 etcd members reformed and the cluster recovered.
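+
+The etcd write-pressure window can be bounded after the fact from the k3s journal (etcd runs embedded in k3s, so its logs land in the k3s unit). A minimal sketch, using the log phrase quoted under Symptoms and the incident window above:
+
+```bash
+# Count etcd slow-apply warnings during the outage window.
+# The phrase matches the 'apply request took too long' excerpts below.
+ssh k3s-server11 "sudo journalctl -u k3s \
+  --since '2026-04-20 22:00' --until '2026-04-20 23:06' \
+  | grep -c 'apply request took too long'"
+```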
+
+---
+
+## Symptoms
+
+### Cluster-level
+- `kubectl get nodes` returns `Error from server (ServiceUnavailable)`
+- All workloads stop responding
+- `k3s kubectl` on server nodes returns permission denied or ServiceUnavailable
+
+### k3s service (control plane nodes)
+- `systemctl status k3s` shows `activating (start)` for minutes with no progress
+- Or: `inactive (dead)` with `Duration: Xm Ys` (short-lived — crash loop)
+- k3s service exits with code 0/SUCCESS despite cluster being broken (graceful k3s shutdown due to etcd loss)
+
+### etcd
+- Repeated log lines: `Failed to test etcd connection: failed to get etcd status: rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: context deadline exceeded"`
+- etcd logs showing `apply request took too long` for requests >100ms
+- `waiting for ReadIndex response took too long, retrying`
+- Raft voting messages in a loop (`cast MsgPreVote for ...`) — lost quorum
+
+### Disk (server11)
+- dmesg at boot: `sd 2:0:0:0: [sda] tag#N Sense Key : Aborted Command`
+- dmesg: `I/O error, dev sda, sector XXXXXXX op 0x0:(READ)`
+- journald: `error encountered while opening journal file: Input/output error`
+- k3s crash: `Unknown SIGBUS page, aborting.`
+
+### Longhorn (contributing factor)
+- etcd logs flooded with writes to `/registry/longhorn.io/orphans/longhorn-system/orphan-*`
+- etcd database size: 634MB (healthy clusters should be <100MB)
+- Defrag operations taking >60s
+
+---
+
+## Diagnosis Commands
+
+```bash
+# Check k3s service status on all servers
+for node in k3s-server10 k3s-server11 k3s-server12; do
+  echo "=== $node ===" && ssh $node 'systemctl status k3s --no-pager | head -5'
+done
+
+# Check etcd member list (run from a server with working etcd)
+ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
+  --endpoints=https://127.0.0.1:2379 \
+  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
+  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
+  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
+  member list -w table'
+
+# Check etcd endpoint health across all 3 servers
+ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
+  --endpoints=https://192.168.20.43:2379,https://192.168.20.48:2379,https://192.168.20.56:2379 \
+  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
+  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
+  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
+  endpoint health -w table'
+
+# Check etcd endpoint status (DB size, leader)
+ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
+  --endpoints=https://192.168.20.43:2379,https://192.168.20.48:2379,https://192.168.20.56:2379 \
+  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
+  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
+  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
+  endpoint status -w table'
+
+# Check for disk I/O errors (VM disks)
+ssh k3s-server11 'sudo dmesg | grep -iE "(i/o error|sda|aborted command)" | tail -20'
+
+# Check recent k3s logs for errors
+ssh k3s-server11 'sudo journalctl -u k3s -n 100 --no-pager | grep -iE "(error|fail|sigbus|panic)" | tail -30'
+
+# Count Longhorn orphans in etcd
+ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
+  --endpoints=https://127.0.0.1:2379 \
+  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
+  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
+  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
+  get /registry/longhorn.io/orphans/ --prefix --keys-only | wc -l'
+```
+
+---
+
+## Root Causes
+
+### 1. Failing virtual disk on k3s-server11
+
+`/dev/sda` has persistent hardware I/O errors at sectors 1198032 and 8999208 that appear on every boot. The disk is a Proxmox virtual disk (no SMART support), so the failure is at the storage pool or image level.
+
+**Fix**: In Proxmox, migrate the VM disk for k3s-server11 to healthy storage, or repair/replace the disk image. Check the Proxmox storage pool for errors.
+
+```bash
+# On Proxmox host: check storage health
+pvesm status
+# Find the VM disk and move it to healthy storage
+qm move-disk <vmid> scsi0 <target-storage>
+```
+
+### 2. Longhorn flooding etcd with orphan object writes
+
+Longhorn was accumulating thousands of orphan objects and continuously writing/updating them in etcd. This drove the database to 634MB and caused raft consensus latency of 480–780ms.
+
+**Fix**: Clean up Longhorn orphans and compact/defrag etcd.
+
+```bash
+# Delete all Longhorn orphans
+kubectl delete orphan -n longhorn-system --all
+
+# Manually defrag etcd after cleanup
+ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
+  --endpoints=https://192.168.20.43:2379,https://192.168.20.48:2379,https://192.168.20.56:2379 \
+  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
+  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
+  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
+  defrag --cluster'
+
+# Verify DB size dropped
+ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
+  --endpoints=https://192.168.20.43:2379,https://192.168.20.48:2379,https://192.168.20.56:2379 \
+  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
+  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
+  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
+  endpoint status -w table'
+```
+
+---
+
+## Recovery Steps (if cluster goes down again)
+
+### Step 1: Identify which servers have working etcd
+
+```bash
+for node in k3s-server10 k3s-server11 k3s-server12; do
+  echo "=== $node ===" && ssh $node 'systemctl status k3s --no-pager | head -4'
+done
+```
+
+Look for: `active (running)` vs `activating (start)` vs `inactive (dead)`.
+
+### Step 2: Check etcd quorum from a running server
+
+```bash
+ssh <healthy-server> 'sudo ETCDCTL_API=3 etcdctl \
+  --endpoints=https://127.0.0.1:2379 \
+  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
+  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
+  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
+  endpoint health'
+```
+
+If all endpoints are healthy but the API is down, restart k3s:
+```bash
+ssh <server> 'sudo systemctl restart k3s'
+```
+
+### Step 3: If etcd has lost quorum (fewer than 2 of 3 members healthy)
+
+With 3-member etcd, you need at least 2 members to have quorum. If only 1 is healthy:
+
+```bash
+# Force a single-member etcd to become leader (DESTRUCTIVE - last resort)
+# Stop k3s on all servers first
+for node in k3s-server10 k3s-server11 k3s-server12; do
+  ssh $node 'sudo systemctl stop k3s'
+done
+
+# On the node with the most recent etcd data, reset etcd to a new
+# single-member cluster, then start k3s normally:
+#   sudo k3s server --cluster-reset
+#   sudo systemctl start k3s
+# Verify the cluster is up, then rejoin the other servers (stop k3s,
+# wipe /var/lib/rancher/k3s/server/db/etcd/ on each, and restart).
+```
+
+### Step 4: If a server has TLS auth failures connecting to etcd
+
+This means the server is not in the etcd member list. Check:
+
+```bash
+# Is the node actually in etcd?
+ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
+  --endpoints=https://127.0.0.1:2379 \
+  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
+  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
+  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
+  member list -w table'
+```
+
+If the failing server is missing: restart it — k3s will attempt to re-add it to the cluster.
+If it still fails after restart: the etcd data directory may be corrupt. Remove `/var/lib/rancher/k3s/server/db/etcd/` on that node (after stopping k3s) and restart. k3s will resync from peers.
+
+### Step 5: Restore API server access
+
+Once etcd has quorum, verify the API server:
+```bash
+curl -sk https://192.168.20.47:6443/healthz  # via loadbalancer
+```
+
+If still down after etcd is healthy, restart k3s on the servers:
+```bash
+for node in k3s-server10 k3s-server11 k3s-server12; do
+  ssh $node 'sudo systemctl restart k3s' && sleep 10
+done
+```
+
+---
+
+## Ongoing Risks (as of 2026-04-21)
+
+| Risk | Severity | Status |
+|------|----------|--------|
+| server11 disk I/O errors | Critical | **Unresolved** — same sectors fail at every boot |
+| server11 etcd latency (423ms vs 8ms on peers) | High | **Unresolved** — caused by disk |
+| Longhorn orphan accumulation | High | **Unresolved** — may re-fill etcd |
+| vaultwarden CrashLoopBackOff | Low | **Unresolved** — investigate separately |
+| k3s agent version skew (v1.33.5–v1.34.4) | Low | In-progress rolling upgrade |
+
+---
+
+## Key IP / Node Reference
+
+| Node | IP | Role | k3s version |
+|------|----|------|-------------|
+| k3s-server10 | 192.168.20.43 | control-plane, etcd | v1.34.6+k3s1 |
+| k3s-server11 | 192.168.20.48 | control-plane, etcd, master | v1.34.6+k3s1 |
+| k3s-server12 | 192.168.20.56 | control-plane, etcd, master | v1.34.6+k3s1 |
+| k3s-loadbalancer | 192.168.20.47 | API load balancer | — |
+| k3s-agent10–19 | 192.168.20.44–67 | workers | v1.33.5+k3s1 |
+| k3s-agent20–21 | 192.168.20.69–70 | workers | v1.34.3+k3s1 |
+| k3s-agent22–23 | 192.168.20.72–73 | workers | v1.34.4+k3s1 |
diff --git a/docs/superpowers/specs/2026-04-21-k3s-server11-reprovision-design.md b/docs/superpowers/specs/2026-04-21-k3s-server11-reprovision-design.md
new file mode 100644
index 0000000..4b62436
--- /dev/null
+++ b/docs/superpowers/specs/2026-04-21-k3s-server11-reprovision-design.md
@@ -0,0 +1,146 @@
+# Design: Reprovision k3s-server11
+
+**Date**: 2026-04-21
+**Status**: Approved
+
+## Background
+
+k3s-server11 (Proxmox VM 111 on inko01) has a corrupted btrfs VM disk image
+(`/opt/proxmox/images/111/vm-111-disk-0.raw`). The corruption has been present since
+~2026-02-15 (when backups started failing with I/O errors). The VM's guest OS sees this
+as bad sectors on `/dev/sda`, causing etcd to crash with SIGBUS when it mmap-reads those
+sectors. This triggered a full cluster outage on 2026-04-20.
+
+The physical SSD on inko01 is healthy (SMART PASSED). The corruption is at the btrfs
+filesystem layer (3279+ corrupt blocks, single-device — no redundancy to recover from).
+
+Since etcd data is fully replicated on server10 and server12, no data recovery is needed.
+The correct fix is to replace the disk with a fresh OS image and rejoin the node.
+
+## Architecture
+
+Three sequential phases. Each phase must complete successfully before the next begins.
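+
+Before Phase 1 begins, confirm the two surviving servers can carry the cluster alone. A minimal pre-flight sketch (endpoints are server10 and server12 from the runbook's node table; assumes your kubectl context points at this cluster):
+
+```bash
+# Pre-flight: server10 and server12 must be Ready with healthy etcd
+# before server11 is drained and removed.
+kubectl get nodes k3s-server10 k3s-server12
+ssh k3s-server10 'sudo ETCDCTL_API=3 etcdctl \
+  --endpoints=https://192.168.20.43:2379,https://192.168.20.56:2379 \
+  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
+  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
+  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
+  endpoint health -w table'
+```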
+
+```
+Phase 1: k8s cleanup → Phase 2: Proxmox disk → Phase 3: Ansible reprovision
+(drain, etcd remove,   (stop VM, delete disk,  (common + k3s_server roles,
+ delete node)           import fresh image,     joins as secondary server,
+                        resize, start)          etcd re-adds member)
+```
+
+## Phase 1: Remove server11 from the cluster
+
+Run from a machine with `kubectl` access (e.g. local workstation).
+
+**1.1 Drain the node** — evicts all non-daemonset pods:
+```bash
+kubectl drain k3s-server11 --ignore-daemonsets --delete-emptydir-data
+```
+
+**1.2 Remove from etcd** — prevents quorum issues while the disk is replaced:
+```bash
+ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
+  --endpoints=https://127.0.0.1:2379 \
+  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
+  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
+  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
+  member remove e9f8fa983ff7f958'
+```
+
+**1.3 Delete the node object**:
+```bash
+kubectl delete node k3s-server11
+```
+
+**Verify**: `kubectl get nodes` shows only server10, server12, and the agents. Etcd member
+list shows only 2 members (server10 + server12). Cluster remains healthy with quorum.
+
+## Phase 2: Replace the VM disk on inko01
+
+Run directly on inko01 via SSH.
+
+**2.1 Stop the VM**:
+```bash
+qm stop 111
+```
+
+**2.2 Delete the corrupt disk** (detaches and destroys the raw file; `--force` removes
+the volume instead of leaving it attached as `unused0`):
+```bash
+qm disk unlink 111 --idlist scsi0 --force
+```
+
+**2.3 Import a fresh Debian 12 cloud-init image as a new disk**:
+```bash
+qm importdisk 111 /opt/proxmox/template/iso/debian-12-genericcloud-amd64.qcow2 proxmox
+```
+This creates `/opt/proxmox/images/111/vm-111-disk-0.raw` from the clean base image.
+
+**2.4 Attach the disk and set boot order**:
+```bash
+qm set 111 --scsi0 proxmox:111/vm-111-disk-0.raw --boot order=scsi0
+```
+
+**2.5 Resize to 64G** (matching the original disk size):
+```bash
+qm resize 111 scsi0 64G
+```
+
+**2.6 Start the VM**:
+```bash
+qm start 111
+```
+
+Cloud-init runs on first boot and configures: hostname (`k3s-server11`), user (`tudattr`),
+SSH keys, and DHCP networking. Wait ~60s for SSH to become available before Phase 3.
+
+**Verify**: `ssh k3s-server11 hostname` returns `k3s-server11` and no disk I/O errors
+appear in `dmesg`.
+
+## Phase 3: Reprovision via Ansible
+
+Run from the local workstation in the ansible-homelab repo.
+
+```bash
+ansible-playbook playbooks/k3s-servers.yaml --limit k3s-server11
+```
+
+This runs the `common` and `k3s_server` roles against server11 only:
+
+- `common`: installs base packages, configures SSH, hostname, etc.
+- `k3s_server`: detects `/usr/local/bin/k3s` does not exist → runs the install script with
+  `--server https://192.168.20.47:6443` (loadbalancer) → joins as a secondary server.
+  The role supplies the cluster token from server10 (the primary), and k3s registers as a
+  new etcd member automatically; the equivalent manual join is sketched below.
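+
+For reference, the join the role automates is equivalent to running the upstream k3s
+install script by hand. A hedged sketch (the token path is the k3s default on the
+primary; the role's actual variable wiring may differ):
+
+```bash
+# Manual equivalent of the k3s_server role's join step.
+# /var/lib/rancher/k3s/server/token is the default token location on server10.
+TOKEN=$(ssh k3s-server10 'sudo cat /var/lib/rancher/k3s/server/token')
+curl -sfL https://get.k3s.io | K3S_TOKEN="$TOKEN" sh -s - server \
+  --server https://192.168.20.47:6443
+```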
+
+**Verify**:
+```bash
+kubectl get nodes                 # server11 shows Ready
+ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
+  --endpoints=https://127.0.0.1:2379 \
+  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
+  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
+  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
+  member list -w table'           # 3 members, all started
+ssh k3s-server11 'sudo dmesg | grep -i "i/o error"'   # no output
+```
+
+## Key Facts
+
+| Item | Value |
+|------|-------|
+| VM ID | 111 |
+| Proxmox host | inko01 |
+| VM disk path | `/opt/proxmox/images/111/vm-111-disk-0.raw` |
+| Base image | `/opt/proxmox/template/iso/debian-12-genericcloud-amd64.qcow2` |
+| Proxmox storage pool | `proxmox` |
+| server11 IP | 192.168.20.48 |
+| server11 etcd member ID | `e9f8fa983ff7f958` |
+| Loadbalancer IP | 192.168.20.47 |
+| k3s primary server | server10 (192.168.20.43) |
+
+## Risks
+
+- **During Phase 1–2**: cluster runs on 2 etcd members. Still has quorum but no
+  redundancy. Avoid other disruptive changes until server11 is back.
+- **etcd member ID**: `e9f8fa983ff7f958` was confirmed on 2026-04-21. Verify it still
+  matches before running the remove command if time has passed.
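+  A quick way to re-confirm the ID from a healthy server (member names in the
+  `member list` output are the node hostnames):
+
+```bash
+# Re-confirm server11's etcd member ID before running Phase 1.2.
+ssh k3s-server10 'sudo ETCDCTL_API=3 etcdctl \
+  --endpoints=https://127.0.0.1:2379 \
+  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
+  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
+  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
+  member list' | grep k3s-server11
+```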