# Runbook: k3s Cluster Outage (2026-04-20 / 2026-04-21)

## Incident Summary

- **Start**: ~22:43 CEST on 2026-04-20 (k3s-server10 stuck in activating state)
- **Cluster down**: ~23:06 CEST on 2026-04-20 (API servers unreachable on all nodes)
- **Recovery**: ~07:25 CEST on 2026-04-21 (both server11 and server12 rebooted, etcd reformed)
- **Root cause**: Failing virtual disk on k3s-server11 combined with etcd overload from Longhorn orphan writes

---

## What Happened (Timeline)

1. **k3s-server10** entered `activating (start)` state and could not connect to etcd: TLS authentication handshakes failed (`transport: authentication handshake failed: context deadline exceeded`). server10 was not present in the etcd member list.
2. **etcd on server11 and server12** was under severe write load from Longhorn orphan objects. Raft consensus was taking 480–780ms per request (expected <100ms). A defragmentation job ran on server11's 634MB etcd database and took **1 minute 21 seconds**, blocking the cluster for that time.
3. **server11** crashed with **SIGBUS**: etcd memory-maps its database file, and a read of the mapped file hit a bad disk sector. The journal also showed `Input/output error` when opening journal files. Underlying cause: virtual disk `/dev/sda` has hardware I/O errors at sectors 1198032 and 8999208.
4. With server11's etcd gone, the 2-member cluster (server11 and server12; server10 was not a member) lost quorum. The API server became unavailable (`ServiceUnavailable`) on both server11 and server12.
5. Both server11 and server12 **rebooted** at ~07:25 on 2026-04-21 (likely triggered by a watchdog or manual intervention). After the reboot, all 3 etcd members reformed and the cluster recovered.
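
The quorum loss in step 4 follows from etcd's majority rule: a cluster of N members needs floor(N/2) + 1 healthy members. A minimal sketch of that arithmetic follows; the function names `quorum` and `has_quorum` are illustrative helpers, not part of any k3s or etcd tooling.

```bash
# quorum N: print how many healthy members an N-member etcd cluster needs.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# has_quorum HEALTHY TOTAL: succeed if HEALTHY members are a majority of TOTAL.
has_quorum() {
  [ "$1" -ge "$(quorum "$2")" ]
}

quorum 2   # prints 2: a 2-member cluster tolerates no failures
quorum 3   # prints 2: a 3-member cluster tolerates one failure
has_quorum 1 2 || echo "no quorum with 1 of 2 members"
```

This is why losing server11 alone was enough to take the API down: with server10 outside the member list, the cluster had no failure tolerance.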

---

## Symptoms

### Cluster-level

- `kubectl get nodes` returns `Error from server (ServiceUnavailable)`
- All workloads stop responding
- `k3s kubectl` on server nodes returns permission denied or ServiceUnavailable

### k3s service (control plane nodes)

- `systemctl status k3s` shows `activating (start)` for minutes with no progress
- Or: `inactive (dead)` with `Duration: Xm Ys` (short-lived, indicating a crash loop)
- k3s service exits with code 0/SUCCESS despite the cluster being broken (graceful k3s shutdown after losing etcd)

### etcd

- Repeated log lines: `Failed to test etcd connection: failed to get etcd status: rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: context deadline exceeded"`
- etcd logs showing `apply request took too long` for requests >100ms
- `waiting for ReadIndex response took too long, retrying`
- Raft voting messages in a loop (`cast MsgPreVote for ...`), which indicates lost quorum

### Disk (server11)

- dmesg at boot: `sd 2:0:0:0: [sda] tag#N Sense Key : Aborted Command`
- dmesg: `I/O error, dev sda, sector XXXXXXX op 0x0:(READ)`
- journald: `error encountered while opening journal file: Input/output error`
- k3s crash: `Unknown SIGBUS page, aborting.`

### Longhorn (contributing factor)

- etcd logs flooded with writes to `/registry/longhorn.io/orphans/longhorn-system/orphan-*`
- etcd database size: 634MB (healthy clusters should be <100MB)
- Defrag operations taking >60s

---

## Diagnosis Commands

```bash
# Note: etcdctl must be installed on the server nodes; k3s does not bundle it.

# Check k3s service status on all servers
for node in k3s-server10 k3s-server11 k3s-server12; do
  echo "=== $node ===" && ssh "$node" 'systemctl status k3s --no-pager | head -5'
done

# Check etcd member list (run from a server with working etcd)
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  member list -w table'

# Check etcd endpoint health across all 3 servers
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://192.168.20.43:2379,https://192.168.20.48:2379,https://192.168.20.56:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  endpoint health -w table'

# Check etcd endpoint status (DB size, leader)
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://192.168.20.43:2379,https://192.168.20.48:2379,https://192.168.20.56:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  endpoint status -w table'

# Check for disk I/O errors (VM disks)
ssh k3s-server11 'sudo dmesg | grep -iE "(i/o error|sda|aborted command)" | tail -20'

# Check recent k3s logs for errors
ssh k3s-server11 'sudo journalctl -u k3s -n 100 --no-pager | grep -iE "(error|fail|sigbus|panic)" | tail -30'

# Count Longhorn orphans in etcd
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  get /registry/longhorn.io/orphans/ --prefix --keys-only | wc -l'
```

---

## Root Causes

### 1. Failing virtual disk on k3s-server11

`/dev/sda` has persistent hardware I/O errors at sectors 1198032 and 8999208 that appear on every boot. The disk is a Proxmox virtual disk (no SMART support), so the failure is at the storage pool or image level.

**Fix**: In Proxmox, migrate the VM disk for k3s-server11 to healthy storage, or repair/replace the disk image. Check the Proxmox storage pool for errors.

```bash
# On Proxmox host: check storage health
pvesm status

# Find the VM disk and move it.
# <vmid> and <target-storage> are placeholders for the VM ID of k3s-server11
# and the healthy storage to move to; scsi0 is the disk being moved.
qm move-disk <vmid> scsi0 <target-storage>
```

### 2. Longhorn flooding etcd with orphan object writes

Longhorn was accumulating thousands of orphan objects and continuously writing and updating them in etcd. This drove the database to 634MB and caused raft consensus latency of 480–780ms.

**Fix**: Clean up Longhorn orphans, then defragment etcd.

```bash
# Delete all Longhorn orphans
kubectl delete orphans.longhorn.io -n longhorn-system --all

# Manually defrag etcd after cleanup
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://192.168.20.43:2379,https://192.168.20.48:2379,https://192.168.20.56:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  defrag --cluster'

# Verify DB size dropped
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://192.168.20.43:2379,https://192.168.20.48:2379,https://192.168.20.56:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  endpoint status -w table'
```

---

## Recovery Steps (if cluster goes down again)

### Step 1: Identify which servers have working etcd

```bash
for node in k3s-server10 k3s-server11 k3s-server12; do
  echo "=== $node ===" && ssh "$node" 'systemctl status k3s --no-pager | head -4'
done
```

Look for: `active (running)` vs `activating (start)` vs `inactive (dead)`.
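
The three states from Step 1 can be turned into a quick triage hint. The sketch below assumes the state string comes from `systemctl is-active k3s`, which prints values such as `active`, `activating`, `inactive`, or `failed`. The function name `k3s_state_hint` and the hint wording are illustrative, not taken from the incident.

```bash
# k3s_state_hint STATE: map a `systemctl is-active k3s` result to a next step.
k3s_state_hint() {
  case "$1" in
    active)          echo "running: check etcd health (Step 2)" ;;
    activating)      echo "stuck starting: check etcd connectivity and membership (Step 4)" ;;
    inactive|failed) echo "stopped: check the k3s journal and dmesg for disk errors" ;;
    *)               echo "unknown state: $1" ;;
  esac
}

# Possible usage against the three servers (requires SSH access):
# for node in k3s-server10 k3s-server11 k3s-server12; do
#   state=$(ssh "$node" 'systemctl is-active k3s')
#   echo "$node: $(k3s_state_hint "$state")"
# done
```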

### Step 2: Check etcd quorum from a running server

Replace `<server>` with a node whose k3s service is running.

```bash
# --cluster checks every member in the member list, not only the local endpoint
ssh <server> 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  endpoint health --cluster'
```

If all endpoints are healthy but the API is down, restart k3s:

```bash
ssh <server> 'sudo systemctl restart k3s'
```

### Step 3: If etcd has lost quorum (fewer than 2 of 3 members healthy)

A 3-member etcd cluster needs at least 2 healthy members for quorum. If only 1 is healthy:

```bash
# Reset etcd to a single-member cluster (DESTRUCTIVE - last resort)
# Stop k3s on all servers first
for node in k3s-server10 k3s-server11 k3s-server12; do
  ssh "$node" 'sudo systemctl stop k3s'
done

# On the node with the most recent etcd data, reset the cluster to one member
# using the cluster-reset flag documented by k3s:
#   sudo k3s server --cluster-reset
# When the logs report that the reset finished, start only that server
# (sudo systemctl start k3s) and verify the cluster is up.
# Then, on each other server, remove the old etcd data
# (/var/lib/rancher/k3s/server/db/) and start k3s so it rejoins.
```

### Step 4: If a server has TLS auth failures connecting to etcd

In this incident, the handshake failures on server10 coincided with the server being absent from the etcd member list. Check membership first:

```bash
# Is the node actually in etcd?
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  member list -w table'
```

If the failing server is missing, restart it: k3s will attempt to re-add it to the cluster.

If it still fails after the restart, the etcd data directory may be corrupt. Stop k3s on that node, remove `/var/lib/rancher/k3s/server/db/etcd/`, and start k3s again. k3s will resync from its peers.
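
The membership check in Step 4 can be scripted against the plain (non-table) output of `etcdctl member list`. This is a rough sketch that matches on the peer URL and assumes peers listen on etcd's default peer port 2380. The helper name `member_has_peer` and the sample member IDs and names are hypothetical.

```bash
# member_has_peer IP: read `etcdctl member list` output on stdin and succeed
# if any member advertises a peer URL at IP:2380 (assumed default peer port).
member_has_peer() {
  grep -qF "https://$1:2380"
}

# Hypothetical sample in etcdctl's default comma-separated format;
# in practice, pipe the real `member list` output instead.
sample='1a2b3c4d5e6f7081, started, k3s-server11-aaaa1111, https://192.168.20.48:2380, https://192.168.20.48:2379, false
9f8e7d6c5b4a3921, started, k3s-server12-bbbb2222, https://192.168.20.56:2380, https://192.168.20.56:2379, false'

for ip in 192.168.20.43 192.168.20.48 192.168.20.56; do
  if printf '%s\n' "$sample" | member_has_peer "$ip"; then
    echo "$ip: member present"
  else
    echo "$ip: NOT in member list"
  fi
done
```

With the sample above, 192.168.20.43 (server10) is reported as missing, which mirrors the state seen at the start of the incident.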

### Step 5: Restore API server access

Once etcd has quorum, verify the API server:

```bash
curl -sk https://192.168.20.47:6443/healthz  # via loadbalancer
```

If it is still down after etcd is healthy, restart k3s on the servers one at a time:

```bash
for node in k3s-server10 k3s-server11 k3s-server12; do
  ssh "$node" 'sudo systemctl restart k3s' && sleep 10
done
```

---

## Ongoing Risks (as of 2026-04-21)

| Risk | Severity | Status |
|------|----------|--------|
| server11 disk I/O errors | Critical | **Resolved** 2026-04-21: disk replaced, VM reprovisioned |
| server11 etcd latency (423ms vs 8ms on peers) | High | **Resolved** 2026-04-21: latency normal after disk replacement |
| Longhorn orphan accumulation | High | **Resolved** 2026-04-22: 138 orphans deleted, etcd defragged to ~57 MB across all 3 members |
| vaultwarden CrashLoopBackOff | Low | **Resolved** 2026-04-22: pod running 1/1 |
| k3s agent version skew (v1.33.5–v1.34.4) | Low | In-progress rolling upgrade |

---

## Key IP / Node Reference

| Node | IP | Role | k3s version |
|------|----|------|-------------|
| k3s-server10 | 192.168.20.43 | control-plane, etcd | v1.34.6+k3s1 |
| k3s-server11 | 192.168.20.48 | control-plane, etcd, master | v1.34.6+k3s1 |
| k3s-server12 | 192.168.20.56 | control-plane, etcd, master | v1.34.6+k3s1 |
| k3s-loadbalancer | 192.168.20.47 | API load balancer | n/a |
| k3s-agent10–19 | 192.168.20.44–67 | workers | v1.33.5+k3s1 |
| k3s-agent20–21 | 192.168.20.69–70 | workers | v1.34.3+k3s1 |
| k3s-agent22–23 | 192.168.20.72–73 | workers | v1.34.4+k3s1 |
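
The `--endpoints` value repeated throughout this runbook can be derived from the server IPs in the table above rather than typed by hand. The sketch below assumes the etcd client port 2379 used elsewhere in this document; `etcd_endpoints` is an illustrative helper name.

```bash
# etcd_endpoints IP...: join server IPs into an etcdctl --endpoints value.
etcd_endpoints() {
  out=""
  for ip in "$@"; do
    # Prefix a comma only when out already holds an earlier endpoint.
    out="${out:+$out,}https://$ip:2379"
  done
  echo "$out"
}

ENDPOINTS=$(etcd_endpoints 192.168.20.43 192.168.20.48 192.168.20.56)
echo "$ENDPOINTS"
```

The resulting string can be passed as `--endpoints="$ENDPOINTS"` in the etcdctl commands above, so a change to the server list only needs to be made in one place.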