diff --git a/docs/runbooks/k3s-cluster-outage-2026-04-20.md b/docs/runbooks/k3s-cluster-outage-2026-04-20.md
index e4c0cc6..bed5bd4 100644
--- a/docs/runbooks/k3s-cluster-outage-2026-04-20.md
+++ b/docs/runbooks/k3s-cluster-outage-2026-04-20.md
@@ -122,29 +122,53 @@ qm move-disk scsi0
 
 Longhorn was accumulating thousands of orphan objects and continuously writing/updating them in etcd. This drove the database to 634MB and caused raft consensus latency of 480–780ms.
 
-**Fix**: Clean up Longhorn orphans and compact/defrag etcd.
+**Fix (immediate)**: Clean up Longhorn orphans and compact/defrag etcd.
 
 ```bash
 # Delete all Longhorn orphans
 kubectl delete orphan -n longhorn-system --all
 
-# Manually defrag etcd after cleanup
-ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
-  --endpoints=https://192.168.20.43:2379,https://192.168.20.48:2379,https://192.168.20.56:2379 \
-  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
-  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
-  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
-  defrag --cluster'
+# Defrag each etcd member individually (--cluster flag can time out)
+# Run from any control plane node with etcdctl installed
+for endpoint in https://192.168.20.43:2379 https://192.168.20.48:2379 https://192.168.20.56:2379; do
+  sudo ETCDCTL_API=3 etcdctl \
+    --endpoints=$endpoint \
+    --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
+    --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
+    --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
+    --dial-timeout=300s --command-timeout=300s \
+    defrag
+done
 
 # Verify DB size dropped
-ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
+sudo ETCDCTL_API=3 etcdctl \
   --endpoints=https://192.168.20.43:2379,https://192.168.20.48:2379,https://192.168.20.56:2379 \
   --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
   --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
   --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
-  endpoint status -w table'
+  endpoint status -w table
 ```
 
+**Fix (permanent — 2026-04-22)**: Enable Longhorn orphan auto-deletion so orphans are cleaned up automatically after a 5-minute grace period instead of accumulating indefinitely.
+
+```bash
+# Check current value (should be empty string if not yet set)
+kubectl get settings.longhorn.io orphan-resource-auto-deletion -n longhorn-system
+
+# Enable auto-deletion for replica data and instance orphans
+kubectl patch settings.longhorn.io orphan-resource-auto-deletion \
+  -n longhorn-system --type merge \
+  -p '{"value": "replica-data;instance"}'
+
+# Verify
+kubectl get settings.longhorn.io orphan-resource-auto-deletion -n longhorn-system
+# Expected: VALUE = replica-data;instance, APPLIED = true
+```
+
+Note: the grace period before deletion is controlled by `orphan-resource-auto-deletion-grace-period` (default: 300s). Orphans on nodes in `down` or `unknown` state are not auto-deleted.
+
+Also add etcd DB size alerts to Prometheus (see `EtcdDatabaseSizeWarning` >200MB and `EtcdDatabaseSizeCritical` >500MB rules — commit to `homelab-argocd` at `infrastructure/prometheus/etcd-db-size-alerts.yaml`).
+
 ---
 
 ## Recovery Steps (if cluster goes down again)
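
For reference, a minimal sketch of what `infrastructure/prometheus/etcd-db-size-alerts.yaml` could contain. The rule names and thresholds (200MB warning, 500MB critical) come from the runbook text above; the metric name (`etcd_mvcc_db_total_size_in_bytes`, the standard etcd database size gauge), the `for:` durations, and the severity labels are assumptions, and the layout assumes a plain Prometheus rules file rather than the prometheus-operator PrometheusRule CRD:

```yaml
# Sketch only. Rule names and thresholds are from the runbook; metric name,
# durations, and labels are assumptions to be checked against the actual file.
groups:
  - name: etcd-db-size
    rules:
      - alert: EtcdDatabaseSizeWarning
        # etcd_mvcc_db_total_size_in_bytes is the physical backend DB size
        expr: etcd_mvcc_db_total_size_in_bytes > 200 * 1024 * 1024
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "etcd DB on {{ $labels.instance }} exceeds 200MB"
      - alert: EtcdDatabaseSizeCritical
        expr: etcd_mvcc_db_total_size_in_bytes > 500 * 1024 * 1024
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "etcd DB on {{ $labels.instance }} exceeds 500MB"
```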