Runbook: k3s Cluster Outage (2026-04-20 / 2026-04-21)

Incident Summary

  • Start: ~22:43 CEST on 2026-04-20 (k3s-server10 stuck in activating state)
  • Cluster down: ~23:06 CEST on 2026-04-20 (API servers unreachable on all nodes)
  • Recovery: ~07:25 CEST on 2026-04-21 (both server11 and server12 rebooted, etcd reformed)
  • Root cause: Failing virtual disk on k3s-server11 combined with etcd overload from Longhorn orphan writes

What Happened (Timeline)

  1. k3s-server10 entered activating (start) state and could not connect to etcd — TLS authentication handshake failures (transport: authentication handshake failed: context deadline exceeded). server10 was not present in the etcd member list.

  2. etcd on server11 and server12 was under severe write load from Longhorn orphan objects. Raft consensus was taking 480–780ms per request (expected <100ms). A defragmentation job ran on server11's 634MB etcd database, taking 1 minute 21 seconds and blocking the cluster while it ran.

  3. server11 crashed with SIGBUS: etcd had mmap'd its database file, and a read hit a bad disk sector. journald also reported Input/output error when opening journal files. Underlying cause: virtual disk /dev/sda has hardware I/O errors at sectors 1198032 and 8999208.

  4. With server10 absent from the member list, etcd was effectively a 2-member cluster, which needs both members for quorum. Losing server11's etcd therefore cost quorum, and the API server became unavailable (ServiceUnavailable) on both server11 and server12.

  5. Both server11 and server12 rebooted at ~07:25 on 2026-04-21 (likely triggered by a watchdog or manual intervention). After reboot, all 3 etcd members reformed and the cluster recovered.


Symptoms

Cluster-level

  • kubectl get nodes returns Error from server (ServiceUnavailable)
  • All workloads stop responding
  • k3s kubectl on server nodes returns permission denied or ServiceUnavailable

k3s service (control plane nodes)

  • systemctl status k3s shows activating (start) for minutes with no progress
  • Or: inactive (dead) with Duration: Xm Ys (short-lived — crash loop)
  • k3s service exits with code 0/SUCCESS despite cluster being broken (graceful k3s shutdown due to etcd loss)

etcd

  • Repeated log lines: Failed to test etcd connection: failed to get etcd status: rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: context deadline exceeded"
  • etcd logs showing apply request took too long for requests >100ms
  • waiting for ReadIndex response took too long, retrying
  • Raft voting messages in a loop (cast MsgPreVote for ...) — lost quorum
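
To put numbers on the slow-apply symptom, the durations can be pulled out of the k3s journal. A minimal sketch, assuming etcd's JSON log format in which each warning carries a "took":"<duration>" field; slow_apply_durations is a hypothetical helper name, not part of k3s or etcd.

```shell
# Print the "took" value of every slow-apply warning read from stdin.
# Assumption: etcd logs in JSON and includes a "took":"<duration>" field;
# adjust the patterns if the log format on your nodes differs.
slow_apply_durations() {
  grep 'apply request took too long' \
    | grep -o '"took":"[^"]*"' \
    | cut -d'"' -f4
}

# Usage (not run here):
#   ssh k3s-server11 'sudo journalctl -u k3s -n 5000 --no-pager' | slow_apply_durations | sort | uniq -c
```

Values consistently far above the expected 100ms point at disk or write-load problems rather than a one-off spike.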

Disk (server11)

  • dmesg at boot: sd 2:0:0:0: [sda] tag#N Sense Key : Aborted Command
  • dmesg: I/O error, dev sda, sector XXXXXXX op 0x0:(READ)
  • journald: error encountered while opening journal file: Input/output error
  • k3s crash: Unknown SIGBUS page, aborting.
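
To check whether the same sectors fail on every boot, the sector numbers can be extracted from dmesg and de-duplicated. A minimal sketch, assuming kernel lines of the form shown above ("I/O error, dev sda, sector N op ..."); bad_sectors is a hypothetical helper name.

```shell
# Print the unique failing sector numbers for one device, sorted numerically.
# Reads dmesg text on stdin; the device name defaults to sda.
bad_sectors() {
  sed -n "s|.*I/O error, dev ${1:-sda}, sector \([0-9][0-9]*\).*|\1|p" | sort -n -u
}

# Usage (not run here):
#   ssh k3s-server11 'sudo dmesg' | bad_sectors sda
```

A short, stable list of sectors across reboots suggests damage at fixed locations in the disk image rather than a transient storage hiccup.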

Longhorn (contributing factor)

  • etcd logs flooded with writes to /registry/longhorn.io/orphans/longhorn-system/orphan-*
  • etcd database size: 634MB (healthy clusters should be <100MB)
  • Defrag operations taking >60s

Diagnosis Commands

# Check k3s service status on all servers
for node in k3s-server10 k3s-server11 k3s-server12; do
  echo "=== $node ===" && ssh $node 'systemctl status k3s --no-pager | head -5'
done

# Check etcd member list (run from a server with working etcd)
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  member list -w table'

# Check etcd endpoint health across all 3 servers
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://192.168.20.43:2379,https://192.168.20.48:2379,https://192.168.20.56:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  endpoint health -w table'

# Check etcd endpoint status (DB size, leader)
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://192.168.20.43:2379,https://192.168.20.48:2379,https://192.168.20.56:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  endpoint status -w table'

# Check for disk I/O errors (VM disks)
ssh k3s-server11 'sudo dmesg | grep -iE "(i/o error|sda|aborted command)" | tail -20'

# Check recent k3s logs for errors
ssh k3s-server11 'sudo journalctl -u k3s -n 100 --no-pager | grep -iE "(error|fail|sigbus|panic)" | tail -30'

# Count Longhorn orphans in etcd
# (--keys-only prints a blank line after each key, so count key lines rather than all lines)
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  get /registry/longhorn.io/orphans/ --prefix --keys-only | grep -c "^/registry/"'
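
The commands above repeat the same TLS flags each time. A small wrapper, defined in a shell on a control-plane node, can cut the repetition. This is a sketch only: k3s_etcdctl, ETCD_ENDPOINTS, and DRY_RUN are hypothetical names introduced here, not k3s or etcd settings.

```shell
# Directory holding the k3s-managed etcd TLS material (path as used in this runbook).
K3S_ETCD_TLS=/var/lib/rancher/k3s/server/tls/etcd

# Run etcdctl with the k3s client certificate.
# ETCD_ENDPOINTS overrides the default local endpoint.
# With DRY_RUN set, the command is printed instead of executed.
k3s_etcdctl() {
  ${DRY_RUN:+echo} sudo ETCDCTL_API=3 etcdctl \
    --endpoints="${ETCD_ENDPOINTS:-https://127.0.0.1:2379}" \
    --cacert="$K3S_ETCD_TLS/server-ca.crt" \
    --cert="$K3S_ETCD_TLS/client.crt" \
    --key="$K3S_ETCD_TLS/client.key" \
    "$@"
}

# Usage (not run here):
#   k3s_etcdctl member list -w table
#   ETCD_ENDPOINTS=https://192.168.20.43:2379 k3s_etcdctl endpoint status -w table
```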

Root Causes

1. Failing virtual disk on k3s-server11

/dev/sda has persistent hardware I/O errors at sectors 1198032 and 8999208 that appear on every boot. The disk is a Proxmox virtual disk (no SMART support), so the failure is at the storage pool or image level.

Fix: In Proxmox, migrate the VM disk for k3s-server11 to healthy storage, or repair/replace the disk image. Check the Proxmox storage pool for errors.

# On Proxmox host: check storage health
pvesm status
# Find the VM disk and move it
qm move-disk <vmid> scsi0 <target-storage>

2. Longhorn flooding etcd with orphan object writes

Longhorn was accumulating thousands of orphan objects and continuously writing/updating them in etcd. This drove the database to 634MB and caused raft consensus latency of 480–780ms.

Fix (immediate): Clean up Longhorn orphans and compact/defrag etcd.

# Delete all Longhorn orphans
kubectl delete orphan -n longhorn-system --all

# Defrag each etcd member individually (--cluster flag can time out)
# Run from any control plane node with etcdctl installed
for endpoint in https://192.168.20.43:2379 https://192.168.20.48:2379 https://192.168.20.56:2379; do
  sudo ETCDCTL_API=3 etcdctl \
    --endpoints=$endpoint \
    --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
    --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
    --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
    --dial-timeout=300s --command-timeout=300s \
    defrag
done

# Verify DB size dropped
sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://192.168.20.43:2379,https://192.168.20.48:2379,https://192.168.20.56:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  endpoint status -w table

Fix (permanent — 2026-04-22): Enable Longhorn orphan auto-deletion so orphans are cleaned up automatically after a 5-minute grace period instead of accumulating indefinitely.

# Check current value (should be empty string if not yet set)
kubectl get settings.longhorn.io orphan-resource-auto-deletion -n longhorn-system

# Enable auto-deletion for replica data and instance orphans
kubectl patch settings.longhorn.io orphan-resource-auto-deletion \
  -n longhorn-system --type merge \
  -p '{"value": "replica-data;instance"}'

# Verify
kubectl get settings.longhorn.io orphan-resource-auto-deletion -n longhorn-system
# Expected: VALUE = replica-data;instance, APPLIED = true

Note: the grace period before deletion is controlled by orphan-resource-auto-deletion-grace-period (default: 300s). Orphans on nodes in down or unknown state are not auto-deleted.

Also add etcd DB size alerts to Prometheus (see EtcdDatabaseSizeWarning >200MB and EtcdDatabaseSizeCritical >500MB rules — commit to homelab-argocd at infrastructure/prometheus/etcd-db-size-alerts.yaml).
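
Until those alerts are in place, the same thresholds can be checked by hand. A minimal sketch, assuming 1 MB = 1,000,000 bytes (the actual Prometheus rules may be defined in MiB); etcd_db_size_level is a hypothetical helper name.

```shell
# Classify an etcd database size in bytes against the 200MB / 500MB thresholds.
# Assumption: decimal megabytes; adjust the constants if the alert rules use MiB.
etcd_db_size_level() {
  if [ "$1" -gt 500000000 ]; then
    echo critical
  elif [ "$1" -gt 200000000 ]; then
    echo warning
  else
    echo ok
  fi
}

# Usage (not run here; assumes jq is installed on the node):
#   size=$(sudo ETCDCTL_API=3 etcdctl <tls flags> endpoint status -w json | jq '.[0].Status.dbSize')
#   etcd_db_size_level "$size"
```

By this measure the 634MB database seen during the outage is critical, and the ~57 MB reached after cleanup is ok.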


Recovery Steps (if cluster goes down again)

Step 1: Identify which servers have working etcd

for node in k3s-server10 k3s-server11 k3s-server12; do
  echo "=== $node ===" && ssh $node 'systemctl status k3s --no-pager | head -4'
done

Look for: active (running) vs activating (start) vs inactive (dead).
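
The three states map to different next steps. A small sketch that turns the output of systemctl is-active into a triage hint; k3s_state_hint is a hypothetical helper, and the hints simply restate the steps in this runbook.

```shell
# Map a systemd unit state (as printed by `systemctl is-active`) to a next step.
k3s_state_hint() {
  case "$1" in
    active)          echo "running: check etcd health (Step 2)" ;;
    activating)      echo "stuck starting: check etcd connectivity and membership (Step 4)" ;;
    inactive|failed) echo "stopped: check journalctl -u k3s and dmesg for crashes" ;;
    *)               echo "unrecognized state: $1" ;;
  esac
}

# Usage (not run here):
#   for node in k3s-server10 k3s-server11 k3s-server12; do
#     echo "$node: $(k3s_state_hint "$(ssh "$node" 'systemctl is-active k3s')")"
#   done
```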

Step 2: Check etcd quorum from a running server

ssh <running-server> 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  endpoint health'

If all endpoints are healthy but API is down, restart k3s:

ssh <server> 'sudo systemctl restart k3s'
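
Whether a restart is enough or quorum is really gone comes down to counting healthy members against the majority rule: a cluster of N voting members needs floor(N/2) + 1 of them. A minimal sketch with hypothetical helper names; it assumes etcdctl prints one "<endpoint> is healthy: ..." or "<endpoint> is unhealthy: ..." line per endpoint.

```shell
# Majority needed for a cluster of N voting members: floor(N/2) + 1.
etcd_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Count endpoints reported healthy in `etcdctl endpoint health` text on stdin.
# The leading space keeps "is unhealthy" lines from matching.
count_healthy() {
  grep -c ' is healthy'
}

# Succeeds when the healthy count reaches quorum.
# usage: has_quorum <member-count> <healthy-count>
has_quorum() {
  [ "$2" -ge "$(etcd_quorum "$1")" ]
}

# Usage (not run here):
#   healthy=$(sudo ETCDCTL_API=3 etcdctl <tls flags> --endpoints=<all three> endpoint health 2>&1 | count_healthy)
#   has_quorum 3 "$healthy" && echo "quorum ok" || echo "quorum lost"
```

Note that a 2-member cluster has a quorum of 2, so it cannot tolerate any failure. That is the state this cluster was in while server10 was missing from the member list.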

Step 3: If etcd has lost quorum (fewer than 2 of 3 members healthy)

With 3-member etcd, you need at least 2 members to have quorum. If only 1 is healthy:

# Force a single-member etcd to become leader (DESTRUCTIVE - last resort)
# Stop k3s on all servers first
for node in k3s-server10 k3s-server11 k3s-server12; do
  ssh $node 'sudo systemctl stop k3s'
done

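# On the node with the most recent etcd data, reset etcd to a single member.
# k3s documents the --cluster-reset server flag for this:
#   sudo k3s server --cluster-reset
# Wait for the reset to complete, then start k3s normally on that node (without the flag)
# and verify the cluster is up. On the other servers, back up and remove
# /var/lib/rancher/k3s/server/db/ before starting k3s so they rejoin as new members.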

Step 4: If a server has TLS auth failures connecting to etcd

This usually means the server is not in the etcd member list, as happened to server10 during this incident. Check:

# Is the node actually in etcd?
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  member list -w table'

If the failing server is missing from the list, restart k3s on it; k3s will attempt to re-add it to the cluster. If it still fails after the restart, its etcd data directory may be corrupt: stop k3s on that node, move /var/lib/rancher/k3s/server/db/etcd/ aside (or remove it), and start k3s again so the node resyncs from its peers.

Step 5: Restore API server access

Once etcd has quorum, verify the API server:

curl -sk https://192.168.20.47:6443/healthz  # via loadbalancer

If still down after etcd is healthy, restart k3s on the servers:

for node in k3s-server10 k3s-server11 k3s-server12; do
  ssh $node 'sudo systemctl restart k3s' && sleep 10
done
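
When scripting the health check in this step, the HTTP status code says more than the response body. A minimal sketch, assuming curl's %{http_code} write-out variable (which reports 000 when no HTTP response was received); apiserver_state is a hypothetical helper name.

```shell
# Interpret the HTTP status code returned by the API server's /healthz endpoint.
apiserver_state() {
  case "$1" in
    200)     echo "up" ;;
    401|403) echo "up (answering, but the request was not authorized)" ;;
    000)     echo "unreachable (no HTTP response)" ;;
    *)       echo "unhealthy (HTTP $1)" ;;
  esac
}

# Usage (not run here):
#   code=$(curl -sk -o /dev/null -w '%{http_code}' --max-time 5 https://192.168.20.47:6443/healthz)
#   apiserver_state "$code"
```

A 401 or 403 still shows that the API server is answering; only a timeout, a connection error, or a 5xx response indicates it is down or unhealthy.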

Ongoing Risks (as of 2026-04-21)

| Risk | Severity | Status |
|---|---|---|
| server11 disk I/O errors | Critical | Resolved 2026-04-21: disk replaced, VM reprovisioned |
| server11 etcd latency (423ms vs 8ms on peers) | High | Resolved 2026-04-21: latency normal after disk replacement |
| Longhorn orphan accumulation | High | Resolved 2026-04-22: 138 orphans deleted, etcd defragged to ~57 MB across all 3 members |
| vaultwarden CrashLoopBackOff | Low | Resolved 2026-04-22: pod running 1/1 |
| k3s agent version skew (v1.33.5–v1.34.4) | Low | In-progress rolling upgrade |

Key IP / Node Reference

| Node | IP | Role | k3s version |
|---|---|---|---|
| k3s-server10 | 192.168.20.43 | control-plane, etcd | v1.34.6+k3s1 |
| k3s-server11 | 192.168.20.48 | control-plane, etcd, master | v1.34.6+k3s1 |
| k3s-server12 | 192.168.20.56 | control-plane, etcd, master | v1.34.6+k3s1 |
| k3s-loadbalancer | 192.168.20.47 | API load balancer | |
| k3s-agent10–19 | 192.168.20.44–67 | workers | v1.33.5+k3s1 |
| k3s-agent20–21 | 192.168.20.69–70 | workers | v1.34.3+k3s1 |
| k3s-agent22–23 | 192.168.20.72–73 | workers | v1.34.4+k3s1 |