tudattr/ansible

Tuan-Dat Tran afbc3e3c57 docs(runbook): add Longhorn orphan auto-deletion fix and etcd defrag procedure

2026-04-22 22:03:45 +02:00

Runbook: k3s Cluster Outage (2026-04-20 / 2026-04-21)

Incident Summary

Start: ~22:43 CEST on 2026-04-20 (k3s-server10 stuck in activating state)
Cluster down: ~23:06 CEST on 2026-04-20 (API servers unreachable on all nodes)
Recovery: ~07:25 CEST on 2026-04-21 (both server11 and server12 rebooted, etcd reformed)
Root cause: Failing virtual disk on k3s-server11 combined with etcd overload from Longhorn orphan writes

What Happened (Timeline)

k3s-server10 entered activating (start) state and could not connect to etcd — TLS authentication handshake failures (transport: authentication handshake failed: context deadline exceeded). server10 was not present in the etcd member list.
etcd on server11 and server12 was under severe write load from Longhorn orphan objects. Raft consensus was taking 480–780ms per request (expected <100ms). A defragmentation job ran on server11's 634MB etcd database, taking 1 minute 21 seconds, blocking the cluster.
server11 crashed with SIGBUS: etcd had mmap'd its database file and a read hit a bad disk sector. The journal also showed Input/output error when opening journal files. Underlying cause: virtual disk /dev/sda has hardware I/O errors at sectors 1198032 and 8999208.
With server11's etcd gone, the cluster lost quorum. Only server11 and server12 were etcd members at that point (server10 had not joined), and a 2-member cluster needs both members to form a quorum. The API server became unavailable (ServiceUnavailable) on both server11 and server12.
Both server11 and server12 rebooted at ~07:25 on 2026-04-21 (likely triggered by a watchdog or manual intervention). After reboot, all 3 etcd members reformed and the cluster recovered.
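
The quorum arithmetic behind the timeline can be sketched in a few lines of bash. The function names (quorum_size, fault_tolerance, has_quorum) are illustrative helpers, not part of k3s or etcdctl; the only fact used is that etcd needs floor(N/2) + 1 voting members.

```bash
# bash: etcd quorum arithmetic (helper names are hypothetical)

# Quorum for N voting members is floor(N/2) + 1.
quorum_size() {
  local members=$1
  echo $(( members / 2 + 1 ))
}

# Number of members that can fail before quorum is lost.
fault_tolerance() {
  local members=$1
  echo $(( members - (members / 2 + 1) ))
}

# has_quorum <members> <healthy>: exit 0 if the healthy members still form a quorum.
has_quorum() {
  local members=$1 healthy=$2
  (( healthy >= members / 2 + 1 ))
}
```

With 2 members, fault_tolerance prints 0, which is why losing server11 alone took the cluster down; with all 3 members joined it prints 1.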

Symptoms

Cluster-level

kubectl get nodes returns Error from server (ServiceUnavailable)
All workloads stop responding
k3s kubectl on server nodes returns permission denied or ServiceUnavailable

k3s service (control plane nodes)

systemctl status k3s shows activating (start) for minutes with no progress
Or: inactive (dead) with Duration: Xm Ys (short-lived — crash loop)
k3s service exits with code 0/SUCCESS despite cluster being broken (graceful k3s shutdown due to etcd loss)

etcd

Repeated log lines: Failed to test etcd connection: failed to get etcd status: rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: context deadline exceeded"
etcd logs showing apply request took too long for requests >100ms
waiting for ReadIndex response took too long, retrying
Raft voting messages in a loop (cast MsgPreVote for ...) — lost quorum
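
To get a quick feel for how often these messages occur, the log can be piped through a small counter. This is a minimal sketch: summarize_etcd_log is a hypothetical helper, and it only matches the three message fragments listed above.

```bash
# bash: count the etcd symptom lines in log text read from stdin
# (summarize_etcd_log is a hypothetical helper name)
summarize_etcd_log() {
  awk '
    /apply request took too long/                   { slow++ }
    /waiting for ReadIndex response took too long/  { readindex++ }
    /MsgPreVote/                                    { prevote++ }
    END { printf "slow_apply=%d read_index=%d prevote=%d\n", slow, readindex, prevote }
  '
}
```

Possible usage: ssh k3s-server11 'sudo journalctl -u k3s -n 1000 --no-pager' | summarize_etcd_log. A non-zero prevote count alongside many slow applies would match the pattern seen in this incident.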

Disk (server11)

dmesg at boot: sd 2:0:0:0: [sda] tag#N Sense Key : Aborted Command
dmesg: I/O error, dev sda, sector XXXXXXX op 0x0:(READ)
journald: error encountered while opening journal file: Input/output error
k3s crash: Unknown SIGBUS page, aborting.
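
To confirm that the same sectors fail on every boot, the failing sectors can be pulled out of dmesg. The sketch below assumes the kernel message format shown above ("I/O error, dev sda, sector N"); bad_sectors is a hypothetical helper name.

```bash
# bash: print unique "<device> <sector>" pairs from dmesg text on stdin
# (bad_sectors is a hypothetical helper; assumes the "I/O error, dev X, sector N" format)
bad_sectors() {
  sed -nE 's/.*I\/O error, dev ([a-z0-9]+), sector ([0-9]+).*/\1 \2/p' \
    | sort -u -k1,1 -k2,2n
}
```

Possible usage: ssh k3s-server11 'sudo dmesg' | bad_sectors. A stable list across reboots points at the disk image or storage pool rather than a transient error.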

Longhorn (contributing factor)

etcd logs flooded with writes to /registry/longhorn.io/orphans/longhorn-system/orphan-*
etcd database size: 634MB (healthy clusters should be <100MB)
Defrag operations taking >60s

Diagnosis Commands

# Check k3s service status on all servers
for node in k3s-server10 k3s-server11 k3s-server12; do
  echo "=== $node ===" && ssh $node 'systemctl status k3s --no-pager | head -5'
done

# Check etcd member list (run from a server with working etcd)
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  member list -w table'

# Check etcd endpoint health across all 3 servers
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://192.168.20.43:2379,https://192.168.20.48:2379,https://192.168.20.56:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  endpoint health -w table'

# Check etcd endpoint status (DB size, leader)
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://192.168.20.43:2379,https://192.168.20.48:2379,https://192.168.20.56:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  endpoint status -w table'

# Check for disk I/O errors (VM disks)
ssh k3s-server11 'sudo dmesg | grep -iE "(i/o error|sda|aborted command)" | tail -20'

# Check recent k3s logs for errors
ssh k3s-server11 'sudo journalctl -u k3s -n 100 --no-pager | grep -iE "(error|fail|sigbus|panic)" | tail -30'

# Count Longhorn orphans in etcd
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  get /registry/longhorn.io/orphans/ --prefix --keys-only | wc -l'
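
The etcdctl invocations above repeat the same TLS flags. One way to reduce copy-paste errors is a small helper that builds the command string; this is only a sketch, and both K3S_ETCD_TLS and etcdctl_cmd are hypothetical names introduced here.

```bash
# bash: build the etcdctl command line used in this runbook
# (K3S_ETCD_TLS and etcdctl_cmd are hypothetical names)
K3S_ETCD_TLS=/var/lib/rancher/k3s/server/tls/etcd

# etcdctl_cmd <endpoints> <subcommand...>: prints the full command as one string.
etcdctl_cmd() {
  local endpoints=$1; shift
  printf 'sudo ETCDCTL_API=3 etcdctl --endpoints=%s --cacert=%s/server-ca.crt --cert=%s/client.crt --key=%s/client.key %s' \
    "$endpoints" "$K3S_ETCD_TLS" "$K3S_ETCD_TLS" "$K3S_ETCD_TLS" "$*"
}
```

Possible usage: ssh k3s-server11 "$(etcdctl_cmd https://127.0.0.1:2379 member list -w table)". The helper only prints the command, so it can be reviewed before it is run.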

Root Causes

1. Failing virtual disk on k3s-server11

/dev/sda has persistent hardware I/O errors at sectors 1198032 and 8999208 that appear on every boot. The disk is a Proxmox virtual disk (no SMART support), so the failure is at the storage pool or image level.

Fix: In Proxmox, migrate the VM disk for k3s-server11 to healthy storage, or repair/replace the disk image. Check the Proxmox storage pool for errors.

# On Proxmox host: check storage health
pvesm status
# Find the VM disk and move it
qm move-disk <vmid> scsi0 <target-storage>

2. Longhorn flooding etcd with orphan object writes

Longhorn was accumulating thousands of orphan objects and continuously writing/updating them in etcd. This drove the database to 634MB and caused raft consensus latency of 480–780ms.

Fix (immediate): Clean up Longhorn orphans and compact/defrag etcd.

# Delete all Longhorn orphans
kubectl delete orphan -n longhorn-system --all

# Defrag each etcd member individually (--cluster flag can time out)
# Run from any control plane node with etcdctl installed
for endpoint in https://192.168.20.43:2379 https://192.168.20.48:2379 https://192.168.20.56:2379; do
  sudo ETCDCTL_API=3 etcdctl \
    --endpoints=$endpoint \
    --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
    --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
    --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
    --dial-timeout=300s --command-timeout=300s \
    defrag
done

# Verify DB size dropped
sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://192.168.20.43:2379,https://192.168.20.48:2379,https://192.168.20.56:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  endpoint status -w table

Fix (permanent — 2026-04-22): Enable Longhorn orphan auto-deletion so orphans are cleaned up automatically after a 5-minute grace period instead of accumulating indefinitely.

# Check current value (should be empty string if not yet set)
kubectl get settings.longhorn.io orphan-resource-auto-deletion -n longhorn-system

# Enable auto-deletion for replica data and instance orphans
kubectl patch settings.longhorn.io orphan-resource-auto-deletion \
  -n longhorn-system --type merge \
  -p '{"value": "replica-data;instance"}'

# Verify
kubectl get settings.longhorn.io orphan-resource-auto-deletion -n longhorn-system
# Expected: VALUE = replica-data;instance, APPLIED = true

Note: the grace period before deletion is controlled by orphan-resource-auto-deletion-grace-period (default: 300s). Orphans on nodes in down or unknown state are not auto-deleted.

Also add etcd DB size alerts to Prometheus: an EtcdDatabaseSizeWarning rule (>200MB) and an EtcdDatabaseSizeCritical rule (>500MB). Commit them to homelab-argocd at infrastructure/prometheus/etcd-db-size-alerts.yaml.
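
Until the alert rules are in place, the same thresholds can be checked by hand. The sketch below is an assumption-laden stand-in for the Prometheus rules, not their actual content: classify_db_size is a hypothetical helper, and it interprets 200MB and 500MB as MiB (multiples of 1024 * 1024 bytes).

```bash
# bash: classify an etcd DB size in bytes against the alert thresholds
# (classify_db_size is hypothetical; thresholds assumed to be 200 MiB / 500 MiB)
classify_db_size() {
  local bytes=$1
  local warn=$(( 200 * 1024 * 1024 ))
  local crit=$(( 500 * 1024 * 1024 ))
  if (( bytes > crit )); then
    echo critical
  elif (( bytes > warn )); then
    echo warning
  else
    echo ok
  fi
}
```

The byte count can be read from the DB SIZE column of endpoint status, or from the dbSize field of endpoint status -w json if jq is available. The 634MB database from the incident classifies as critical, and the ~57 MB size after cleanup as ok.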

Recovery Steps (if cluster goes down again)

Step 1: Identify which servers have working etcd

for node in k3s-server10 k3s-server11 k3s-server12; do
  echo "=== $node ===" && ssh $node 'systemctl status k3s --no-pager | head -4'
done

Look for: active (running) vs activating (start) vs inactive (dead).
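
The three states can be mapped to a next step with a small lookup. This is a rough sketch of the triage used in this runbook, not a general rule; classify_k3s_state is a hypothetical helper, and its input is the single word printed by systemctl is-active k3s.

```bash
# bash: map a systemd unit state to the next step in this runbook
# (classify_k3s_state is a hypothetical helper)
classify_k3s_state() {
  case "$1" in
    active)          echo "running: check etcd health (Step 2)" ;;
    activating)      echo "stuck starting: check etcd membership (Step 4)" ;;
    inactive|failed) echo "stopped: inspect journalctl -u k3s and dmesg for disk errors" ;;
    *)               echo "unknown state: $1" ;;
  esac
}
```

Possible usage: classify_k3s_state "$(ssh k3s-server10 systemctl is-active k3s)". Note that systemctl is-active exits non-zero for any state other than active, so the exit code of the ssh call should not be treated as a connection failure.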

Step 2: Check etcd quorum from a running server

ssh <running-server> 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  endpoint health'

If all endpoints are healthy but the API is down, restart k3s:

ssh <server> 'sudo systemctl restart k3s'

Step 3: If etcd has lost quorum (fewer than 2 of 3 members healthy)

With 3-member etcd, you need at least 2 members to have quorum. If only 1 is healthy:

# Force a single-member etcd to become leader (DESTRUCTIVE - last resort)
# Stop k3s on all servers first
for node in k3s-server10 k3s-server11 k3s-server12; do
  ssh $node 'sudo systemctl stop k3s'
done

# On the node with the most recent etcd data, reset etcd to a single member.
# k3s provides `k3s server --cluster-reset` for this: run it once on that node
# while k3s is stopped, then `sudo systemctl start k3s` and verify the cluster is up.
# On each other server: back up and remove /var/lib/rancher/k3s/server/db/,
# then start k3s so the node rejoins the cluster.

Step 4: If a server has TLS auth failures connecting to etcd

In this incident, these failures meant the server was not in the etcd member list. Check:

# Is the node actually in etcd?
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  member list -w table'

If the failing server is missing, restart k3s on it; k3s will attempt to re-add it to the cluster. If it still fails after the restart, the etcd data directory may be corrupt: stop k3s on that node, remove /var/lib/rancher/k3s/server/db/etcd/, and start k3s again so it resyncs from its peers.
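
The membership check can be scripted against the member list output. A minimal sketch, assuming etcd peers listen on port 2380 (the etcd default, which k3s appears to keep); member_present is a hypothetical helper name.

```bash
# bash: exit 0 if the given node IP appears as a peer URL in `etcdctl member list` output on stdin
# (member_present is hypothetical; assumes peer URLs of the form https://<ip>:2380)
member_present() {
  local ip=$1
  grep -qF "https://${ip}:2380"
}
```

For example, piping the member list output from the command above into member_present 192.168.20.43 would tell whether k3s-server10 is a member. The match includes the port, so 192.168.20.4 does not accidentally match 192.168.20.43.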

Step 5: Restore API server access

Once etcd has quorum, verify the API server:

curl -sk https://192.168.20.47:6443/healthz  # via loadbalancer

If still down after etcd is healthy, restart k3s on the servers:

for node in k3s-server10 k3s-server11 k3s-server12; do
  ssh $node 'sudo systemctl restart k3s' && sleep 10
done
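
Instead of a fixed sleep, the restart loop could wait until the API answers again. The retry helper below is a generic sketch with a hypothetical name; it reruns any command until it succeeds or the attempts run out.

```bash
# bash: retry <attempts> <delay-seconds> <command...>
# (retry is a hypothetical helper; returns 0 on the first success, 1 if every attempt fails)
retry() {
  local attempts=$1 delay=$2
  shift 2
  local i
  for (( i = 1; i <= attempts; i++ )); do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if (( i < attempts )); then
      sleep "$delay"
    fi
  done
  return 1
}
```

Possible usage: retry 30 10 curl -sk --max-time 5 -o /dev/null https://192.168.20.47:6443/healthz. In this form any HTTP response from the load balancer counts as success, so it only shows that the API server is reachable again, not that it reports healthy.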

Ongoing Risks (as of 2026-04-22)

Risk	Severity	Status
server11 disk I/O errors	Critical	Resolved 2026-04-21 — disk replaced, VM reprovisioned
server11 etcd latency (423ms vs 8ms on peers)	High	Resolved 2026-04-21 — latency normal after disk replacement
Longhorn orphan accumulation	High	Resolved 2026-04-22 — 138 orphans deleted, etcd defragged to ~57 MB across all 3 members
vaultwarden CrashLoopBackOff	Low	Resolved 2026-04-22 — pod running 1/1
k3s agent version skew (v1.33.5–v1.34.4)	Low	In-progress rolling upgrade

Key IP / Node Reference

Node	IP	Role	k3s version
k3s-server10	192.168.20.43	control-plane, etcd	v1.34.6+k3s1
k3s-server11	192.168.20.48	control-plane, etcd, master	v1.34.6+k3s1
k3s-server12	192.168.20.56	control-plane, etcd, master	v1.34.6+k3s1
k3s-loadbalancer	192.168.20.47	API load balancer	—
k3s-agent10–19	192.168.20.44–67	workers	v1.33.5+k3s1
k3s-agent20–21	192.168.20.69–70	workers	v1.34.3+k3s1
k3s-agent22–23	192.168.20.72–73	workers	v1.34.4+k3s1
