Runbook: k3s Cluster Outage (2026-04-20 / 2026-04-21)
Incident Summary
- Start: ~22:43 CEST on 2026-04-20 (k3s-server10 stuck in activating state)
- Cluster down: ~23:06 CEST on 2026-04-20 (API servers unreachable on all nodes)
- Recovery: ~07:25 CEST on 2026-04-21 (both server11 and server12 rebooted, etcd reformed)
- Root cause: Failing virtual disk on k3s-server11 combined with etcd overload from Longhorn orphan writes
What Happened (Timeline)
- k3s-server10 entered activating (start) state and could not connect to etcd because of TLS authentication handshake failures (transport: authentication handshake failed: context deadline exceeded). server10 was not present in the etcd member list.
- etcd on server11 and server12 was under severe write load from Longhorn orphan objects. Raft consensus was taking 480–780ms per request (expected <100ms). A defragmentation job ran on server11's 634MB etcd database, taking 1 minute 21 seconds and blocking the cluster.
- server11 crashed with SIGBUS: etcd had mmap'd its database file and hit a bad disk sector. The journal also showed Input/output error when opening journal files. Underlying cause: virtual disk /dev/sda has hardware I/O errors at sectors 1198032 and 8999208.
- With server11's etcd gone, the 2-member cluster (server11 and server12, since server10 was not in the member list) lost quorum. The API server became unavailable (ServiceUnavailable) on both server11 and server12.
- Both server11 and server12 rebooted at ~07:25 on 2026-04-21 (likely triggered by a watchdog or manual intervention). After reboot, all 3 etcd members reformed and the cluster recovered.
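The quorum loss in the timeline follows from Raft majority arithmetic: a cluster of N members needs floor(N/2) + 1 of them to agree. The sketch below only illustrates that arithmetic; the function names quorum_size and fault_tolerance are made up for this runbook and are not part of etcd or k3s.

```shell
# quorum_size N: members needed for a Raft majority in an N-member cluster.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# fault_tolerance N: members that can fail while quorum is still held.
fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

# Compare the degraded 2-member cluster with the intended 3-member one.
for n in 2 3; do
  echo "members=$n quorum=$(quorum_size "$n") tolerated_failures=$(fault_tolerance "$n")"
done
```

With only server11 and server12 as members, no failure could be tolerated, which is why losing server11 alone took the API down.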
Symptoms
Cluster-level
- kubectl get nodes returns Error from server (ServiceUnavailable)
- All workloads stop responding
- k3s kubectl on server nodes returns permission denied or ServiceUnavailable
k3s service (control plane nodes)
- systemctl status k3s shows activating (start) for minutes with no progress
- Or: inactive (dead) with Duration: Xm Ys (short-lived, indicating a crash loop)
- k3s service exits with code 0/SUCCESS despite cluster being broken (graceful k3s shutdown due to etcd loss)
etcd
- Repeated log lines: Failed to test etcd connection: failed to get etcd status: rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: context deadline exceeded"
- etcd logs showing apply request took too long for requests >100ms
- waiting for ReadIndex response took too long, retrying
- Raft voting messages in a loop (cast MsgPreVote for ...), which indicates lost quorum
Disk (server11)
- dmesg at boot: sd 2:0:0:0: [sda] tag#N Sense Key : Aborted Command
- dmesg: I/O error, dev sda, sector XXXXXXX op 0x0:(READ)
- journald: error encountered while opening journal file: Input/output error
- k3s crash: Unknown SIGBUS page, aborting.
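To tell whether the same sectors fail on every boot, the dmesg lines above can be reduced to a list of distinct sector numbers. This is a minimal sketch that assumes the kernel message format quoted in the symptom list; failing_sectors is a hypothetical helper name, and it relies on grep -o, which GNU and BSD grep provide.

```shell
# failing_sectors [DISK]: read kernel log lines on stdin and print each
# distinct sector reported in "I/O error, dev DISK, sector N" messages.
# DISK defaults to sda, the device named in this incident.
failing_sectors() {
  disk="${1:-sda}"
  grep -oE "I/O error, dev ${disk}, sector [0-9]+" \
    | awk '{ print $NF }' \
    | sort -n -u
}

# Example (run manually):
#   ssh k3s-server11 'sudo dmesg' | failing_sectors sda
```

If the list is identical across reboots, the damage is likely persistent in the backing storage rather than a transient fault.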
Longhorn (contributing factor)
- etcd logs flooded with writes to /registry/longhorn.io/orphans/longhorn-system/orphan-*
- etcd database size: 634MB (healthy clusters should be <100MB)
- Defrag operations taking >60s
Diagnosis Commands
# Check k3s service status on all servers
for node in k3s-server10 k3s-server11 k3s-server12; do
echo "=== $node ===" && ssh $node 'systemctl status k3s --no-pager | head -5'
done
# Check etcd member list (run from a server with working etcd)
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
--cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
--key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
member list -w table'
# Check etcd endpoint health across all 3 servers
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
--endpoints=https://192.168.20.43:2379,https://192.168.20.48:2379,https://192.168.20.56:2379 \
--cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
--cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
--key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
endpoint health -w table'
# Check etcd endpoint status (DB size, leader)
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
--endpoints=https://192.168.20.43:2379,https://192.168.20.48:2379,https://192.168.20.56:2379 \
--cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
--cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
--key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
endpoint status -w table'
# Check for disk I/O errors (VM disks)
ssh k3s-server11 'sudo dmesg | grep -iE "(i/o error|sda|aborted command)" | tail -20'
# Check recent k3s logs for errors
ssh k3s-server11 'sudo journalctl -u k3s -n 100 --no-pager | grep -iE "(error|fail|sigbus|panic)" | tail -30'
# Count Longhorn orphans in etcd (--keys-only prints a blank line after each key, so count key lines only)
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
--cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
--key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
get /registry/longhorn.io/orphans/ --prefix --keys-only | grep -c "^/registry/"'
Root Causes
1. Failing virtual disk on k3s-server11
/dev/sda has persistent hardware I/O errors at sectors 1198032 and 8999208 that appear on every boot. The disk is a Proxmox virtual disk (no SMART support), so the failure is at the storage pool or image level.
Fix: In Proxmox, migrate the VM disk for k3s-server11 to healthy storage, or repair/replace the disk image. Check the Proxmox storage pool for errors.
# On Proxmox host: check storage health
pvesm status
# Find the VM disk and move it
qm move-disk <vmid> scsi0 <target-storage>
2. Longhorn flooding etcd with orphan object writes
Longhorn was accumulating orphan objects (138 were deleted during the cleanup on 2026-04-22) and continuously writing/updating them in etcd. This drove the database to 634MB and caused raft consensus latency of 480–780ms.
Fix: Clean up Longhorn orphans and compact/defrag etcd.
# Delete all Longhorn orphans
kubectl delete orphan -n longhorn-system --all
# Manually defrag etcd after cleanup
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
--endpoints=https://192.168.20.43:2379,https://192.168.20.48:2379,https://192.168.20.56:2379 \
--cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
--cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
--key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
defrag --cluster'
# Verify DB size dropped
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
--endpoints=https://192.168.20.43:2379,https://192.168.20.48:2379,https://192.168.20.56:2379 \
--cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
--cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
--key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
endpoint status -w table'
Recovery Steps (if cluster goes down again)
Step 1: Identify which servers have working etcd
for node in k3s-server10 k3s-server11 k3s-server12; do
echo "=== $node ===" && ssh $node 'systemctl status k3s --no-pager | head -4'
done
Look for: active (running) vs activating (start) vs inactive (dead).
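The three states can also be read in machine-friendly form with systemctl is-active, which prints a single word such as active, activating, inactive, or failed. The sketch below maps that word to a next step; the hints reflect how each state showed up in this incident and are an interpretation, not k3s guidance, and classify_k3s_state is a hypothetical helper name.

```shell
# classify_k3s_state STATE: map the output of `systemctl is-active k3s`
# to a hint about what to check next.
classify_k3s_state() {
  case "$1" in
    active)          echo "running: check etcd health next" ;;
    activating)      echo "stuck starting: likely cannot reach etcd" ;;
    inactive|failed) echo "stopped: check the journal for SIGBUS or I/O errors" ;;
    *)               echo "unknown state: $1" ;;
  esac
}

# Example (run manually). is-active exits non-zero unless the unit is
# active, so the ssh exit status is ignored here:
#   for node in k3s-server10 k3s-server11 k3s-server12; do
#     state=$(ssh "$node" 'systemctl is-active k3s' || true)
#     echo "$node: $(classify_k3s_state "$state")"
#   done
```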
Step 2: Check etcd quorum from a running server
ssh <running-server> 'sudo ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
--cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
--key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
endpoint health --cluster'
If all endpoints are healthy but API is down, restart k3s:
ssh <server> 'sudo systemctl restart k3s'
Step 3: If etcd has lost quorum (fewer than 2 of 3 members healthy)
With 3-member etcd, you need at least 2 members to have quorum. If only 1 is healthy:
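# Reset etcd to a single-member cluster (DESTRUCTIVE - last resort)
# Stop k3s on all servers first
for node in k3s-server10 k3s-server11 k3s-server12; do
ssh "$node" 'sudo systemctl stop k3s'
done
# On the node with the most recent etcd data, run the k3s cluster reset.
# If server options live in the systemd unit rather than /etc/rancher/k3s/config.yaml, pass them here too.
ssh <node-with-newest-data> 'sudo k3s server --cluster-reset'
# When the reset reports that it has finished, start k3s on that node and verify the cluster is up
ssh <node-with-newest-data> 'sudo systemctl start k3s'
# Then, on each other server: move /var/lib/rancher/k3s/server/db/ aside and start k3s so it rejoins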
Step 4: If a server has TLS auth failures connecting to etcd
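In this incident, these failures coincided with the server being absent from the etcd member list. The same error can also appear when etcd is too slow or unreachable to finish the handshake in time. Check: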
# Is the node actually in etcd?
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
--cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
--key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
member list -w table'
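If the failing server is missing: restart it, and k3s will attempt to re-add it to the cluster.
If it still fails after restart: the etcd data directory may be corrupt. Stop k3s on that node, remove /var/lib/rancher/k3s/server/db/etcd/, and start k3s again so it resyncs from its peers. If the node still appears in the member list under an old ID, remove that entry first from a healthy server (etcdctl member remove <member-id>).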
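Moving the data directory aside instead of deleting it keeps a copy in case the peers turn out to be in worse shape. A minimal sketch, assuming it runs as root on the affected node after k3s has been stopped; move_aside is a hypothetical helper, and the timestamp suffix is an arbitrary naming choice.

```shell
# move_aside DIR: rename DIR to DIR.bak-<timestamp> and print the new path.
# Returns non-zero if DIR does not exist or the rename fails.
move_aside() {
  dir="${1%/}"   # tolerate a trailing slash
  if [ ! -d "$dir" ]; then
    echo "not a directory: $dir" >&2
    return 1
  fi
  backup="${dir}.bak-$(date +%Y%m%d-%H%M%S)"
  mv -- "$dir" "$backup" && echo "$backup"
}

# Example (run manually on the node, with k3s stopped):
#   move_aside /var/lib/rancher/k3s/server/db/etcd
```

Once the node has rejoined and the cluster is healthy, the backup directory can be deleted to reclaim space.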
Step 5: Restore API server access
Once etcd has quorum, verify the API server:
curl -sk https://192.168.20.47:6443/healthz # via loadbalancer
If still down after etcd is healthy, restart k3s on the servers:
for node in k3s-server10 k3s-server11 k3s-server12; do
ssh $node 'sudo systemctl restart k3s' && sleep 10
done
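A fixed sleep between restarts does not confirm that a server actually came back. One option is to poll a health check until it passes before moving to the next node. The sketch below is a generic retry helper; wait_until is a hypothetical name, and the example assumes the load balancer serves /healthz to unauthenticated clients, as the curl check in this step implies.

```shell
# wait_until ATTEMPTS DELAY CMD...: run CMD until it succeeds, sleeping
# DELAY seconds between tries. Returns 1 if every attempt fails.
wait_until() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while :; do
    "$@" >/dev/null 2>&1 && return 0
    [ "$i" -ge "$attempts" ] && return 1
    i=$((i + 1))
    sleep "$delay"
  done
}

# Example (run manually): restart servers one at a time and wait up to
# about 60 seconds for the API to answer after each restart.
#   for node in k3s-server10 k3s-server11 k3s-server12; do
#     ssh "$node" 'sudo systemctl restart k3s'
#     wait_until 30 2 curl -sfk https://192.168.20.47:6443/healthz || echo "API not healthy after $node"
#   done
```

Because the check goes through the load balancer, it shows that some API server is answering, not necessarily the one just restarted.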
Ongoing Risks (identified 2026-04-21; statuses updated through 2026-04-22)
| Risk | Severity | Status |
|---|---|---|
| server11 disk I/O errors | Critical | Resolved 2026-04-21 — disk replaced, VM reprovisioned |
| server11 etcd latency (423ms vs 8ms on peers) | High | Resolved 2026-04-21 — latency normal after disk replacement |
| Longhorn orphan accumulation | High | Resolved 2026-04-22 — 138 orphans deleted, etcd defragged to ~57 MB across all 3 members |
| vaultwarden CrashLoopBackOff | Low | Resolved 2026-04-22 — pod running 1/1 |
| k3s agent version skew (v1.33.5–v1.34.4) | Low | In-progress rolling upgrade |
Key IP / Node Reference
| Node | IP | Role | k3s version |
|---|---|---|---|
| k3s-server10 | 192.168.20.43 | control-plane, etcd | v1.34.6+k3s1 |
| k3s-server11 | 192.168.20.48 | control-plane, etcd, master | v1.34.6+k3s1 |
| k3s-server12 | 192.168.20.56 | control-plane, etcd, master | v1.34.6+k3s1 |
| k3s-loadbalancer | 192.168.20.47 | API load balancer | — |
| k3s-agent10–19 | 192.168.20.44–67 | workers | v1.33.5+k3s1 |
| k3s-agent20–21 | 192.168.20.69–70 | workers | v1.34.3+k3s1 |
| k3s-agent22–23 | 192.168.20.72–73 | workers | v1.34.4+k3s1 |
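The version columns above can be used to quantify the agent skew listed under the risks. To my understanding, recent Kubernetes releases support a kubelet up to three minor versions older than the API server; confirm this against the official version skew policy before relying on it. The sketch below computes the minor-version gap; minor_version and minor_skew are hypothetical helper names.

```shell
# minor_version V: extract the minor number from a version string such as
# v1.34.6+k3s1.
minor_version() {
  v="${1#v}"        # drop the leading "v"
  v="${v#*.}"       # drop the major number and first dot
  echo "${v%%.*}"   # keep everything before the next dot
}

# minor_skew SERVER AGENT: how many minor versions the agent trails the server.
minor_skew() {
  echo $(( $(minor_version "$1") - $(minor_version "$2") ))
}

# Oldest agents in the table compared with the servers.
echo "skew: $(minor_skew v1.34.6+k3s1 v1.33.5+k3s1) minor version(s)"
```

The oldest agents trail the servers by one minor version, which is consistent with the Low severity assigned to the skew.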