diff --git a/docs/runbooks/k3s-cluster-outage-2026-04-20.md b/docs/runbooks/k3s-cluster-outage-2026-04-20.md
index e4c0cc6..bed5bd4 100644
--- a/docs/runbooks/k3s-cluster-outage-2026-04-20.md
+++ b/docs/runbooks/k3s-cluster-outage-2026-04-20.md
@@ -122,29 +122,53 @@ qm move-disk scsi0
 
 Longhorn was accumulating thousands of orphan objects and continuously writing/updating them in etcd. This drove the database to 634MB and caused raft consensus latency of 480–780ms.
 
-**Fix**: Clean up Longhorn orphans and compact/defrag etcd.
+**Fix (immediate)**: Clean up Longhorn orphans and compact/defrag etcd.
 
 ```bash
 # Delete all Longhorn orphans
 kubectl delete orphan -n longhorn-system --all
 
-# Manually defrag etcd after cleanup
-ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
-  --endpoints=https://192.168.20.43:2379,https://192.168.20.48:2379,https://192.168.20.56:2379 \
-  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
-  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
-  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
-  defrag --cluster'
+# Defrag each etcd member individually (--cluster flag can time out)
+# Run from any control plane node with etcdctl installed
+for endpoint in https://192.168.20.43:2379 https://192.168.20.48:2379 https://192.168.20.56:2379; do
+  sudo ETCDCTL_API=3 etcdctl \
+    --endpoints=$endpoint \
+    --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
+    --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
+    --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
+    --dial-timeout=300s --command-timeout=300s \
+    defrag
+done
 
 # Verify DB size dropped
-ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
+sudo ETCDCTL_API=3 etcdctl \
   --endpoints=https://192.168.20.43:2379,https://192.168.20.48:2379,https://192.168.20.56:2379 \
   --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
   --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
   --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
-  endpoint status -w table'
+  endpoint status -w table
 ```
 
+**Fix (permanent — 2026-04-22)**: Enable Longhorn orphan auto-deletion so orphans are cleaned up automatically after a 5-minute grace period instead of accumulating indefinitely.
+
+```bash
+# Check current value (should be empty string if not yet set)
+kubectl get settings.longhorn.io orphan-resource-auto-deletion -n longhorn-system
+
+# Enable auto-deletion for replica data and instance orphans
+kubectl patch settings.longhorn.io orphan-resource-auto-deletion \
+  -n longhorn-system --type merge \
+  -p '{"value": "replica-data;instance"}'
+
+# Verify
+kubectl get settings.longhorn.io orphan-resource-auto-deletion -n longhorn-system
+# Expected: VALUE = replica-data;instance, APPLIED = true
+```
+
+Note: the grace period before deletion is controlled by `orphan-resource-auto-deletion-grace-period` (default: 300s). Orphans on nodes in `down` or `unknown` state are not auto-deleted.
+
+Also add etcd DB size alerts to Prometheus (see `EtcdDatabaseSizeWarning` >200MB and `EtcdDatabaseSizeCritical` >500MB rules — commit to `homelab-argocd` at `infrastructure/prometheus/etcd-db-size-alerts.yaml`).
+
 ---
 
 ## Recovery Steps (if cluster goes down again)
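
For reference, a minimal sketch of what `infrastructure/prometheus/etcd-db-size-alerts.yaml` could contain. The rule names and thresholds (200MB warning, 500MB critical) come from the runbook text above; the metric name (`etcd_mvcc_db_total_size_in_bytes`, the standard etcd database size gauge), the `for:` durations, and the severity labels are assumptions, and the layout assumes a plain Prometheus rules file rather than the prometheus-operator PrometheusRule CRD:

```yaml
# Sketch only. Rule names and thresholds are from the runbook; metric name,
# durations, and labels are assumptions to be checked against the actual file.
groups:
  - name: etcd-db-size
    rules:
      - alert: EtcdDatabaseSizeWarning
        # etcd_mvcc_db_total_size_in_bytes is the physical backend DB size
        expr: etcd_mvcc_db_total_size_in_bytes > 200 * 1024 * 1024
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "etcd DB on {{ $labels.instance }} exceeds 200MB"
      - alert: EtcdDatabaseSizeCritical
        expr: etcd_mvcc_db_total_size_in_bytes > 500 * 1024 * 1024
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "etcd DB on {{ $labels.instance }} exceeds 500MB"
```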