docs: add k3s-server11 reprovision spec and cluster outage runbook
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
docs/runbooks/k3s-cluster-outage-2026-04-20.md (new file, +250 lines)
@@ -0,0 +1,250 @@
# Runbook: k3s Cluster Outage (2026-04-20 / 2026-04-21)

## Incident Summary

- **Start**: ~22:43 CEST on 2026-04-20 (k3s-server10 stuck in activating state)
- **Cluster down**: ~23:06 CEST on 2026-04-20 (API servers unreachable on all nodes)
- **Recovery**: ~07:25 CEST on 2026-04-21 (both server11 and server12 rebooted, etcd reformed)
- **Root cause**: Failing virtual disk on k3s-server11 combined with etcd overload from Longhorn orphan writes

---

## What Happened (Timeline)

1. **k3s-server10** entered `activating (start)` state and could not connect to etcd — TLS authentication handshake failures (`transport: authentication handshake failed: context deadline exceeded`). server10 was not present in the etcd member list.

2. **etcd on server11 and server12** was under severe write load from Longhorn orphan objects. Raft consensus was taking 480–780ms per request (expected <100ms). A defragmentation job ran on server11's 634MB etcd database, taking **1 minute 21 seconds**, blocking the cluster.

3. **server11** crashed with **SIGBUS** — etcd hit a bad disk sector in its mmap'd database file. The journal also showed `Input/output error` when opening journal files. Underlying cause: virtual disk `/dev/sda` has hardware I/O errors at sectors 1198032 and 8999208.

4. With server11's etcd gone, the 2-member cluster lost quorum. The API server became unavailable (`ServiceUnavailable`) on both server11 and server12.

5. Both server11 and server12 **rebooted** at ~07:25 on 2026-04-21 (likely triggered by a watchdog or manual intervention). After reboot, all 3 etcd members reformed and the cluster recovered.

---

## Symptoms

### Cluster-level
- `kubectl get nodes` returns `Error from server (ServiceUnavailable)`
- All workloads stop responding
- `k3s kubectl` on server nodes returns permission denied or ServiceUnavailable

### k3s service (control plane nodes)
- `systemctl status k3s` shows `activating (start)` for minutes with no progress
- Or: `inactive (dead)` with `Duration: Xm Ys` (short-lived — crash loop)
- k3s service exits with code 0/SUCCESS despite cluster being broken (graceful k3s shutdown due to etcd loss)
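
A quick way to confirm the clean-exit symptom from the shell, using standard systemd service properties (substitute the server in question):

```bash
# ExecMainStatus=0 with Result=success despite a dead cluster matches this failure mode
ssh <server> 'systemctl show k3s --property=ExecMainStatus,Result,NRestarts'
```
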
### etcd
- Repeated log lines: `Failed to test etcd connection: failed to get etcd status: rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: context deadline exceeded"`
- etcd logs showing `apply request took too long` for requests >100ms
- `waiting for ReadIndex response took too long, retrying`
- Raft voting messages in a loop (`cast MsgPreVote for ...`) — lost quorum

### Disk (server11)
- dmesg at boot: `sd 2:0:0:0: [sda] tag#N Sense Key : Aborted Command`
- dmesg: `I/O error, dev sda, sector XXXXXXX op 0x0:(READ)`
- journald: `error encountered while opening journal file: Input/output error`
- k3s crash: `Unknown SIGBUS page, aborting.`

### Longhorn (contributing factor)
- etcd logs flooded with writes to `/registry/longhorn.io/orphans/longhorn-system/orphan-*`
- etcd database size: 634MB (healthy clusters should be <100MB)
- Defrag operations taking >60s

---

## Diagnosis Commands

```bash
# Check k3s service status on all servers
for node in k3s-server10 k3s-server11 k3s-server12; do
  echo "=== $node ===" && ssh $node 'systemctl status k3s --no-pager | head -5'
done

# Check etcd member list (run from a server with working etcd)
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  member list -w table'

# Check etcd endpoint health across all 3 servers
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://192.168.20.43:2379,https://192.168.20.48:2379,https://192.168.20.56:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  endpoint health -w table'

# Check etcd endpoint status (DB size, leader)
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://192.168.20.43:2379,https://192.168.20.48:2379,https://192.168.20.56:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  endpoint status -w table'

# Check for disk I/O errors (VM disks)
ssh k3s-server11 'sudo dmesg | grep -iE "(i/o error|sda|aborted command)" | tail -20'

# Check recent k3s logs for errors
ssh k3s-server11 'sudo journalctl -u k3s -n 100 --no-pager | grep -iE "(error|fail|sigbus|panic)" | tail -30'

# Count Longhorn orphans in etcd
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  get /registry/longhorn.io/orphans/ --prefix --keys-only | wc -l'
```

---

## Root Causes

### 1. Failing virtual disk on k3s-server11

`/dev/sda` has persistent hardware I/O errors at sectors 1198032 and 8999208 that appear on every boot. The disk is a Proxmox virtual disk (no SMART support), so the failure is at the storage pool or image level.

**Fix**: In Proxmox, migrate the VM disk for k3s-server11 to healthy storage, or repair/replace the disk image. Check the Proxmox storage pool for errors.

```bash
# On Proxmox host: check storage health
pvesm status
# Find the VM disk and move it
qm move-disk <vmid> scsi0 <target-storage>
```
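
The design doc below attributes the corruption to the btrfs layer on the Proxmox host; if that applies, a scrub of the backing filesystem can confirm it before migrating (a sketch, assuming the btrfs pool is mounted at `/opt/proxmox` on inko01):

```bash
# On the Proxmox host: scrub the btrfs filesystem backing the VM images
btrfs scrub start -B /opt/proxmox   # -B: run in foreground and print a report
btrfs device stats /opt/proxmox     # non-zero corruption_errs confirms the diagnosis
```
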
### 2. Longhorn flooding etcd with orphan object writes

Longhorn was accumulating thousands of orphan objects and continuously writing/updating them in etcd. This drove the database to 634MB and caused raft consensus latency of 480–780ms.

**Fix**: Clean up Longhorn orphans and compact/defrag etcd.

```bash
# Delete all Longhorn orphans
kubectl delete orphan -n longhorn-system --all

# Manually defrag etcd after cleanup
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://192.168.20.43:2379,https://192.168.20.48:2379,https://192.168.20.56:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  defrag --cluster'

# Verify DB size dropped
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://192.168.20.43:2379,https://192.168.20.48:2379,https://192.168.20.56:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  endpoint status -w table'
```
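
To keep orphans from accumulating again, recent Longhorn releases expose an orphan auto-deletion setting; a hedged example (verify the setting name and value format for your Longhorn version before applying):

```bash
# Let Longhorn delete orphaned replica data automatically instead of accumulating CRs
kubectl -n longhorn-system patch settings.longhorn.io orphan-auto-deletion \
  --type=merge -p '{"value":"true"}'
```
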
---

## Recovery Steps (if cluster goes down again)

### Step 1: Identify which servers have working etcd

```bash
for node in k3s-server10 k3s-server11 k3s-server12; do
  echo "=== $node ===" && ssh $node 'systemctl status k3s --no-pager | head -4'
done
```

Look for: `active (running)` vs `activating (start)` vs `inactive (dead)`.

### Step 2: Check etcd quorum from a running server

```bash
ssh <running-server> 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  endpoint health'
```

If all endpoints are healthy but the API is down, restart k3s:

```bash
ssh <server> 'sudo systemctl restart k3s'
```

### Step 3: If etcd has lost quorum (fewer than 2 of 3 members healthy)

With 3-member etcd, you need at least 2 members to have quorum. If only 1 is healthy:

```bash
# Force a single-member etcd to become leader (DESTRUCTIVE - last resort)
# Stop k3s on all servers first
for node in k3s-server10 k3s-server11 k3s-server12; do
  ssh $node 'sudo systemctl stop k3s'
done

# On the node with the most recent etcd data, force a new cluster:
# edit /etc/systemd/system/k3s.service.env and add
#   K3S_ETCD_EXTRA_FLAGS=--force-new-cluster
# then start only that one server, verify the cluster is up, remove the flag, and rejoin the others.
```
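
k3s also ships a documented built-in reset that wraps etcd's `--force-new-cluster`; it is usually simpler than editing the env file by hand (a sketch; confirm the behavior for your k3s version before relying on it):

```bash
# On the surviving node only, with k3s stopped everywhere
ssh <surviving-server> 'sudo k3s server --cluster-reset'
# Wait for the log line reporting that etcd membership was reset, then:
ssh <surviving-server> 'sudo systemctl start k3s'
# Rejoin the other servers afterwards (wipe their etcd data dirs first if needed)
```
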
### Step 4: If a server has TLS auth failures connecting to etcd

This means the server is not in the etcd member list. Check:

```bash
# Is the node actually in etcd?
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  member list -w table'
```

If the failing server is missing: restart it — k3s will attempt to re-add it to the cluster.
If it still fails after restart: the etcd data directory may be corrupt. Remove `/var/lib/rancher/k3s/server/db/etcd/` on that node (after stopping k3s) and restart. k3s will resync from peers.
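
A sketch of that local etcd reset, destructive only to the failing node's local copy (server10 used here as the example node):

```bash
# Stop k3s, wipe the local etcd state, restart; k3s resyncs from the remaining members
ssh k3s-server10 'sudo systemctl stop k3s \
  && sudo rm -rf /var/lib/rancher/k3s/server/db/etcd \
  && sudo systemctl start k3s'
```
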
### Step 5: Restore API server access

Once etcd has quorum, verify the API server:

```bash
curl -sk https://192.168.20.47:6443/healthz   # via loadbalancer
```

If still down after etcd is healthy, restart k3s on the servers:

```bash
for node in k3s-server10 k3s-server11 k3s-server12; do
  ssh $node 'sudo systemctl restart k3s' && sleep 10
done
```

---

## Ongoing Risks (as of 2026-04-21)

| Risk | Severity | Status |
|------|----------|--------|
| server11 disk I/O errors | Critical | **Unresolved** — same sectors fail at every boot |
| server11 etcd latency (423ms vs 8ms on peers) | High | **Unresolved** — caused by disk |
| Longhorn orphan accumulation | High | **Unresolved** — may re-fill etcd |
| vaultwarden CrashLoopBackOff | Low | **Unresolved** — investigate separately |
| k3s agent version skew (v1.33.5–v1.34.4) | Low | In-progress rolling upgrade |

---

## Key IP / Node Reference

| Node | IP | Role | k3s version |
|------|----|------|-------------|
| k3s-server10 | 192.168.20.43 | control-plane, etcd | v1.34.6+k3s1 |
| k3s-server11 | 192.168.20.48 | control-plane, etcd, master | v1.34.6+k3s1 |
| k3s-server12 | 192.168.20.56 | control-plane, etcd, master | v1.34.6+k3s1 |
| k3s-loadbalancer | 192.168.20.47 | API load balancer | — |
| k3s-agent10–19 | 192.168.20.44–67 | workers | v1.33.5+k3s1 |
| k3s-agent20–21 | 192.168.20.69–70 | workers | v1.34.3+k3s1 |
| k3s-agent22–23 | 192.168.20.72–73 | workers | v1.34.4+k3s1 |
@@ -0,0 +1,146 @@
# Design: Reprovision k3s-server11

**Date**: 2026-04-21
**Status**: Approved

## Background

k3s-server11 (Proxmox VM 111 on inko01) has a corrupted btrfs VM disk image
(`/opt/proxmox/images/111/vm-111-disk-0.raw`). The corruption has been present since
~2026-02-15 (when backups started failing with I/O errors). The VM's guest OS sees this
as bad sectors on `/dev/sda`, causing etcd to crash with SIGBUS when it mmap-reads those
sectors. This triggered a full cluster outage on 2026-04-20.

The physical SSD on inko01 is healthy (SMART PASSED). The corruption is at the btrfs
filesystem layer (3279+ corrupt blocks, single-device — no redundancy to recover from).

Since etcd data is fully replicated on server10 and server12, no data recovery is needed.
The correct fix is to replace the disk with a fresh OS image and rejoin the node.

## Architecture

Three sequential phases. Each phase must complete successfully before the next begins.

```
Phase 1: k8s cleanup   →   Phase 2: Proxmox disk    →   Phase 3: Ansible reprovision
(drain, etcd remove,       (stop VM, delete disk,       (common + k3s_server roles,
 delete node)               import fresh image,          joins as secondary server,
                            resize, start)               etcd re-adds member)
```

## Phase 1: Remove server11 from the cluster

Run from a machine with `kubectl` access (e.g. local workstation).

**1.1 Drain the node** — evicts all non-daemonset pods:
```bash
kubectl drain k3s-server11 --ignore-daemonsets --delete-emptydir-data
```

**1.2 Remove from etcd** — prevents quorum issues while the disk is replaced:
```bash
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  member remove e9f8fa983ff7f958'
```
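
If time has passed since that ID was recorded, re-confirm it first (the Risk section flags this; member IDs change if a member is ever re-added):

```bash
# The first column of the table is the member ID; it must match the remove command above
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  member list -w table' | grep server11
```
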
**1.3 Delete the node object**:
```bash
kubectl delete node k3s-server11
```

**Verify**: `kubectl get nodes` shows only server10, server12, and the agents. Etcd member
list shows only 2 members (server10 + server12). Cluster remains healthy with quorum.

## Phase 2: Replace the VM disk on inko01

Run directly on inko01 via SSH.

**2.1 Stop the VM**:
```bash
qm stop 111
```

**2.2 Delete the corrupt disk** (detaches and removes the raw file):
```bash
qm set 111 --delete scsi0
```

**2.3 Import a fresh Debian 12 cloud-init image as a new disk**:
```bash
qm importdisk 111 /opt/proxmox/template/iso/debian-12-genericcloud-amd64.qcow2 proxmox
```
This creates `/opt/proxmox/images/111/vm-111-disk-0.raw` from the clean base image.

**2.4 Attach the disk and set boot order**:
```bash
qm set 111 --scsi0 proxmox:111/vm-111-disk-0.raw --boot order=scsi0
```

**2.5 Resize to 64G** (matching original disk size):
```bash
qm resize 111 scsi0 64G
```

**2.6 Start the VM**:
```bash
qm start 111
```

Cloud-init runs on first boot and configures: hostname (`k3s-server11`), user (`tudattr`),
SSH keys, and DHCP networking. Wait ~60s for SSH to become available before Phase 3.
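
A simple poll instead of a fixed sleep (hypothetical helper loop):

```bash
# Wait until the reprovisioned VM accepts SSH before starting Phase 3
until ssh -o ConnectTimeout=5 k3s-server11 true 2>/dev/null; do
  echo "waiting for k3s-server11 ..."
  sleep 10
done
```
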
**Verify**: `ssh k3s-server11 hostname` returns `k3s-server11` and no disk I/O errors
appear in `dmesg`.

## Phase 3: Reprovision via Ansible

Run from the local workstation in the ansible-homelab repo.

```bash
ansible-playbook playbooks/k3s-servers.yaml --limit k3s-server11
```

This runs the `common` and `k3s_server` roles against server11 only:

- `common`: installs base packages, configures SSH, hostname, etc.
- `k3s_server`: detects that `/usr/local/bin/k3s` does not exist → runs the install script with
  `--server https://192.168.20.47:6443` (loadbalancer) → joins as a secondary server.
  k3s fetches the cluster token from server10 (the primary) and registers as a new etcd
  member automatically (see the sketch below).
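
For reference, that install step is roughly equivalent to the documented k3s join invocation (a sketch; the role's exact flags may differ, and the token is read from the primary rather than hardcoded):

```bash
# Hypothetical approximation of what k3s_server runs on a fresh node
curl -sfL https://get.k3s.io | sh -s - server \
  --server https://192.168.20.47:6443 \
  --token "$(ssh k3s-server10 'sudo cat /var/lib/rancher/k3s/server/token')"
```
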
**Verify**:
```bash
kubectl get nodes   # server11 shows Ready
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  member list -w table'   # 3 members, all started
ssh k3s-server11 'sudo dmesg | grep -i "i/o error"'   # no output expected
```

## Key Facts

| Item | Value |
|------|-------|
| VM ID | 111 |
| Proxmox host | inko01 |
| VM disk path | `/opt/proxmox/images/111/vm-111-disk-0.raw` |
| Base image | `/opt/proxmox/template/iso/debian-12-genericcloud-amd64.qcow2` |
| Proxmox storage pool | `proxmox` |
| server11 IP | 192.168.20.48 |
| server11 etcd member ID | `e9f8fa983ff7f958` |
| Loadbalancer IP | 192.168.20.47 |
| k3s primary server | server10 (192.168.20.43) |

## Risk

- **During Phase 1–2**: cluster runs on 2 etcd members. Still has quorum but no
  redundancy. Avoid other disruptive changes until server11 is back.
- **etcd member ID**: `e9f8fa983ff7f958` was confirmed on 2026-04-21. Verify it matches
  before running the remove command if time has passed.