docs(runbook): add Longhorn orphan auto-deletion fix and etcd defrag procedure
@@ -122,29 +122,53 @@ qm move-disk <vmid> scsi0 <target-storage>
 
 Longhorn was accumulating thousands of orphan objects and continuously writing/updating them in etcd. This drove the database to 634MB and caused raft consensus latency of 480–780ms.
 
-**Fix**: Clean up Longhorn orphans and compact/defrag etcd.
+**Fix (immediate)**: Clean up Longhorn orphans and compact/defrag etcd.
 
 ```bash
 # Delete all Longhorn orphans
 kubectl delete orphan -n longhorn-system --all
 
-# Manually defrag etcd after cleanup
-ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
-  --endpoints=https://192.168.20.43:2379,https://192.168.20.48:2379,https://192.168.20.56:2379 \
-  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
-  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
-  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
-  defrag --cluster'
+# Defrag each etcd member individually (--cluster flag can time out)
+# Run from any control plane node with etcdctl installed
+for endpoint in https://192.168.20.43:2379 https://192.168.20.48:2379 https://192.168.20.56:2379; do
+  sudo ETCDCTL_API=3 etcdctl \
+    --endpoints=$endpoint \
+    --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
+    --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
+    --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
+    --dial-timeout=300s --command-timeout=300s \
+    defrag
+done
 
 # Verify DB size dropped
-ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
+sudo ETCDCTL_API=3 etcdctl \
   --endpoints=https://192.168.20.43:2379,https://192.168.20.48:2379,https://192.168.20.56:2379 \
   --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
   --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
   --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
-  endpoint status -w table'
+  endpoint status -w table
 ```
 
+**Fix (permanent — 2026-04-22)**: Enable Longhorn orphan auto-deletion so orphans are cleaned up automatically after a 5-minute grace period instead of accumulating indefinitely.
+
+```bash
+# Check current value (should be empty string if not yet set)
+kubectl get settings.longhorn.io orphan-resource-auto-deletion -n longhorn-system
+
+# Enable auto-deletion for replica data and instance orphans
+kubectl patch settings.longhorn.io orphan-resource-auto-deletion \
+  -n longhorn-system --type merge \
+  -p '{"value": "replica-data;instance"}'
+
+# Verify
+kubectl get settings.longhorn.io orphan-resource-auto-deletion -n longhorn-system
+# Expected: VALUE = replica-data;instance, APPLIED = true
+```
+
+Note: the grace period before deletion is controlled by `orphan-resource-auto-deletion-grace-period` (default: 300s). Orphans on nodes in `down` or `unknown` state are not auto-deleted.
+
+Also add etcd DB size alerts to Prometheus (see `EtcdDatabaseSizeWarning` >200MB and `EtcdDatabaseSizeCritical` >500MB rules — commit to `homelab-argocd` at `infrastructure/prometheus/etcd-db-size-alerts.yaml`).
+
 ---
 
 ## Recovery Steps (if cluster goes down again)
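The runbook's two alert thresholds (>200MB warning, >500MB critical) can also be checked by hand after a defrag. The following is a minimal sketch, not part of this commit: `classify_etcd_db_size` is a hypothetical helper name, and it assumes the thresholds are meant as MiB and that the caller already has the DB size in bytes (for instance from etcd's `etcd_mvcc_db_total_size_in_bytes` metric).

```bash
#!/bin/sh
# Hypothetical helper: classify an etcd DB size (bytes) against the
# runbook's thresholds. Assumption: 200MB/500MB are read as MiB.
classify_etcd_db_size() {
  db_bytes=$1
  warn_bytes=$((200 * 1024 * 1024))   # EtcdDatabaseSizeWarning threshold
  crit_bytes=$((500 * 1024 * 1024))   # EtcdDatabaseSizeCritical threshold

  if [ "$db_bytes" -gt "$crit_bytes" ]; then
    echo critical
  elif [ "$db_bytes" -gt "$warn_bytes" ]; then
    echo warning
  else
    echo ok
  fi
}

# Example: the 634MB figure from the incident, taken as 634 * 10^6 bytes.
classify_etcd_db_size 634000000
```

With the incident's 634MB database the helper reports `critical`. If the Prometheus rules use decimal megabytes instead, the two multipliers would need to change to match.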