Compare commits
9 Commits
d33117a752
...
057cd7a7f0
| Author | SHA1 | Date | |
|---|---|---|---|
| | 057cd7a7f0 | | |
| | db2d5dccd4 | | |
| | db7e130515 | | |
| | c16e7cf740 | | |
| | c084572521 | | |
| | da7bd42f07 | | |
| | f0a45e3fda | | |
| | b5f82e2978 | | |
| | 29561c44c8 | | |
250
docs/runbooks/k3s-cluster-outage-2026-04-20.md
Normal file
@@ -0,0 +1,250 @@
# Runbook: k3s Cluster Outage (2026-04-20 / 2026-04-21)

## Incident Summary

- **Start**: ~22:43 CEST on 2026-04-20 (k3s-server10 stuck in activating state)
- **Cluster down**: ~23:06 CEST on 2026-04-20 (API servers unreachable on all nodes)
- **Recovery**: ~07:25 CEST on 2026-04-21 (both server11 and server12 rebooted, etcd reformed)
- **Root cause**: Failing virtual disk on k3s-server11 combined with etcd overload from Longhorn orphan writes

---

## What Happened (Timeline)

1. **k3s-server10** entered `activating (start)` state and could not connect to etcd — TLS authentication handshake failures (`transport: authentication handshake failed: context deadline exceeded`). server10 was not present in the etcd member list.

2. **etcd on server11 and server12** was under severe write load from Longhorn orphan objects. Raft consensus was taking 480–780ms per request (expected <100ms). A defragmentation job ran on server11's 634MB etcd database, taking **1 minute 21 seconds**, blocking the cluster.

3. **server11** crashed with **SIGBUS**: etcd had mmap'd its database file, and a read hit a bad disk sector. The journal also showed `Input/output error` when opening journal files. Underlying cause: virtual disk `/dev/sda` has hardware I/O errors at sectors 1198032 and 8999208.

4. With server11's etcd gone, the 2-member cluster (server11 and server12 only, since server10 was not a member) lost quorum. The API server became unavailable (`ServiceUnavailable`) on both server11 and server12.

5. Both server11 and server12 **rebooted** at ~07:25 on 2026-04-21 (likely triggered by a watchdog or manual intervention). After reboot, all 3 etcd members reformed and the cluster recovered.

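Item 4 follows from etcd's majority rule: a cluster with n voting members needs floor(n/2) + 1 of them reachable to commit writes. A minimal sketch of that arithmetic follows; the function names `etcd_quorum` and `etcd_fault_tolerance` are illustrative and not part of etcd or k3s.

```bash
# Majority quorum for an etcd cluster with the given number of voting members.
etcd_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Number of members that can fail before quorum is lost.
etcd_fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

etcd_quorum 2            # prints 2
etcd_fault_tolerance 2   # prints 0
etcd_quorum 3            # prints 2
etcd_fault_tolerance 3   # prints 1
```

With server10 missing from the member list, the cluster had two voters and therefore no fault tolerance, so losing server11 alone was enough to take the API down.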
---

## Symptoms

### Cluster-level

- `kubectl get nodes` returns `Error from server (ServiceUnavailable)`
- All workloads stop responding
- `k3s kubectl` on server nodes returns permission denied or ServiceUnavailable

### k3s service (control plane nodes)

- `systemctl status k3s` shows `activating (start)` for minutes with no progress
- Or: `inactive (dead)` with `Duration: Xm Ys` (short-lived — crash loop)
- k3s service exits with code 0/SUCCESS despite cluster being broken (graceful k3s shutdown due to etcd loss)

### etcd

- Repeated log lines: `Failed to test etcd connection: failed to get etcd status: rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: context deadline exceeded"`
- etcd logs showing `apply request took too long` for requests >100ms
- `waiting for ReadIndex response took too long, retrying`
- Raft voting messages in a loop (`cast MsgPreVote for ...`) — lost quorum

### Disk (server11)

- dmesg at boot: `sd 2:0:0:0: [sda] tag#N Sense Key : Aborted Command`
- dmesg: `I/O error, dev sda, sector XXXXXXX op 0x0:(READ)`
- journald: `error encountered while opening journal file: Input/output error`
- k3s crash: `Unknown SIGBUS page, aborting.`

### Longhorn (contributing factor)

- etcd logs flooded with writes to `/registry/longhorn.io/orphans/longhorn-system/orphan-*`
- etcd database size: 634MB (healthy clusters should be <100MB)
- Defrag operations taking >60s

---

## Diagnosis Commands

```bash
# Check k3s service status on all servers
for node in k3s-server10 k3s-server11 k3s-server12; do
  echo "=== $node ===" && ssh $node 'systemctl status k3s --no-pager | head -5'
done

# Check etcd member list (run from a server with working etcd)
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  member list -w table'

# Check etcd endpoint health across all 3 servers
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://192.168.20.43:2379,https://192.168.20.48:2379,https://192.168.20.56:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  endpoint health -w table'

# Check etcd endpoint status (DB size, leader)
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://192.168.20.43:2379,https://192.168.20.48:2379,https://192.168.20.56:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  endpoint status -w table'

# Check for disk I/O errors (VM disks)
ssh k3s-server11 'sudo dmesg | grep -iE "(i/o error|sda|aborted command)" | tail -20'

# Check recent k3s logs for errors
ssh k3s-server11 'sudo journalctl -u k3s -n 100 --no-pager | grep -iE "(error|fail|sigbus|panic)" | tail -30'

# Count Longhorn orphans in etcd
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  get /registry/longhorn.io/orphans/ --prefix --keys-only | wc -l'
```

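Every etcdctl call above repeats the same three certificate flags. One way to cut the repetition is a small wrapper; this is only a sketch, and the names `K3S_ETCD_TLS`, `k3s_etcdctl_cmd` and `k3s_etcdctl` are hypothetical helpers, not something k3s ships.

```bash
# Directory holding the k3s-managed etcd client certificates (path taken from the commands above).
K3S_ETCD_TLS=/var/lib/rancher/k3s/server/tls/etcd

# Build the remote command line. Kept separate so it can be inspected without running anything.
# Arguments are joined with spaces, so values containing spaces would need extra quoting.
k3s_etcdctl_cmd() {
  printf 'sudo ETCDCTL_API=3 etcdctl --cacert=%s/server-ca.crt --cert=%s/client.crt --key=%s/client.key %s' \
    "$K3S_ETCD_TLS" "$K3S_ETCD_TLS" "$K3S_ETCD_TLS" "$*"
}

# Run etcdctl on a server over SSH.
# Usage: k3s_etcdctl <node> <etcdctl args...>
k3s_etcdctl() {
  local node=$1
  shift
  ssh "$node" "$(k3s_etcdctl_cmd "$@")"
}

# Example (not executed here):
#   k3s_etcdctl k3s-server11 --endpoints=https://127.0.0.1:2379 member list -w table
```

Keeping the full commands in the runbook is still useful during an outage, since they can be pasted without sourcing anything first.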
---

## Root Causes

### 1. Failing virtual disk on k3s-server11

`/dev/sda` has persistent hardware I/O errors at sectors 1198032 and 8999208 that appear on every boot. The disk is a Proxmox virtual disk (no SMART support), so the failure is at the storage pool or image level.

**Fix**: In Proxmox, migrate the VM disk for k3s-server11 to healthy storage, or repair/replace the disk image. Check the Proxmox storage pool for errors.

```bash
# On Proxmox host: check storage health
pvesm status
# Find the VM disk and move it
qm move-disk <vmid> scsi0 <target-storage>
```

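To confirm that the same sectors fail on every boot, the distinct sector numbers can be pulled out of the kernel log. A minimal sketch, assuming the `I/O error, dev sda, sector N ...` line format shown under Symptoms; `failing_sectors` is an illustrative name.

```bash
# Read kernel log lines on stdin and print each distinct failing sector once, in numeric order.
failing_sectors() {
  grep -oE 'sector [0-9]+' | awk '{ print $2 }' | sort -un
}

# Sample input using the sector numbers from this incident; the repeated line is deduplicated.
printf '%s\n' \
  'I/O error, dev sda, sector 1198032 op 0x0:(READ)' \
  'I/O error, dev sda, sector 8999208 op 0x0:(READ)' \
  'I/O error, dev sda, sector 1198032 op 0x0:(READ)' | failing_sectors

# Against the live node (not executed here):
#   ssh k3s-server11 'sudo dmesg' | failing_sectors
```

A stable, short list across reboots points at damage in the disk image rather than transient storage load.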
### 2. Longhorn flooding etcd with orphan object writes

Longhorn was accumulating thousands of orphan objects and continuously writing/updating them in etcd. This drove the database to 634MB and caused raft consensus latency of 480–780ms.

**Fix**: Clean up Longhorn orphans and compact/defrag etcd.

```bash
# Delete all Longhorn orphans
kubectl delete orphan -n longhorn-system --all

# Manually defrag etcd after cleanup
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://192.168.20.43:2379,https://192.168.20.48:2379,https://192.168.20.56:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  defrag --cluster'

# Verify DB size dropped
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://192.168.20.43:2379,https://192.168.20.48:2379,https://192.168.20.56:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  endpoint status -w table'
```

---

## Recovery Steps (if cluster goes down again)

### Step 1: Identify which servers have working etcd

```bash
for node in k3s-server10 k3s-server11 k3s-server12; do
  echo "=== $node ===" && ssh $node 'systemctl status k3s --no-pager | head -4'
done
```

Look for: `active (running)` vs `activating (start)` vs `inactive (dead)`.

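For a quicker scan, `systemctl is-active` prints just the unit state, which can be mapped to the interpretations used in this runbook. A sketch under that assumption; `classify_k3s_state` and `report_k3s_states` are illustrative names.

```bash
# Map a `systemctl is-active k3s` result to the interpretation used in this runbook.
classify_k3s_state() {
  case "$1" in
    active)          echo "running" ;;
    activating)      echo "stuck starting: check etcd connectivity" ;;
    inactive|failed) echo "stopped or crash-looping: check journal and disk" ;;
    *)               echo "unknown state: $1" ;;
  esac
}

# Print one line per control-plane node. Defined only; call it when SSH access is available.
report_k3s_states() {
  local node state
  for node in k3s-server10 k3s-server11 k3s-server12; do
    state=$(ssh "$node" 'systemctl is-active k3s')
    echo "$node: $state ($(classify_k3s_state "$state"))"
  done
}
```

Note that `is-active` exits non-zero for any state other than `active`, so the loop deliberately ignores the exit status and reads the printed state instead.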
### Step 2: Check etcd quorum from a running server

```bash
ssh <running-server> 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  endpoint health'
```

If all endpoints are healthy but API is down, restart k3s:

```bash
ssh <server> 'sudo systemctl restart k3s'
```

### Step 3: If etcd has lost quorum (fewer than 2 of 3 members healthy)

With 3-member etcd, you need at least 2 members to have quorum. If only 1 is healthy:

```bash
# Reset etcd to a single-member cluster (DESTRUCTIVE - last resort)
# Stop k3s on all servers first
for node in k3s-server10 k3s-server11 k3s-server12; do
  ssh $node 'sudo systemctl stop k3s'
done

# On the node with the most recent etcd data, reset etcd to a single member:
#   sudo k3s server --cluster-reset
# Once the reset completes, start k3s on that one server and verify the cluster is up.
# On each remaining server, remove /var/lib/rancher/k3s/server/db/ before starting k3s so it rejoins cleanly.
```

### Step 4: If a server has TLS auth failures connecting to etcd

This usually means the server is not in the etcd member list, as was the case for server10 in this incident. Check:

```bash
# Is the node actually in etcd?
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  member list -w table'
```

If the failing server is missing: restart it — k3s will attempt to re-add it to the cluster.
If it still fails after restart: the etcd data directory may be corrupt. Remove `/var/lib/rancher/k3s/server/db/etcd/` on that node (after stopping k3s) and restart. k3s will resync from peers.

### Step 5: Restore API server access

Once etcd has quorum, verify the API server:

```bash
curl -sk https://192.168.20.47:6443/healthz  # via loadbalancer
```

If still down after etcd is healthy, restart k3s on the servers:

```bash
for node in k3s-server10 k3s-server11 k3s-server12; do
  ssh $node 'sudo systemctl restart k3s' && sleep 10
done
```

---

## Ongoing Risks (as of 2026-04-21)

| Risk | Severity | Status |
|------|----------|--------|
| server11 disk I/O errors | Critical | **Resolved** 2026-04-21 — disk replaced, VM reprovisioned |
| server11 etcd latency (423ms vs 8ms on peers) | High | **Resolved** 2026-04-21 — latency normal after disk replacement |
| Longhorn orphan accumulation | High | **Resolved** 2026-04-22 — 138 orphans deleted, etcd defragged to ~57 MB across all 3 members |
| vaultwarden CrashLoopBackOff | Low | **Resolved** 2026-04-22 — pod running 1/1 |
| k3s agent version skew (v1.33.5–v1.34.4) | Low | In-progress rolling upgrade |

---

## Key IP / Node Reference

| Node | IP | Role | k3s version |
|------|----|------|-------------|
| k3s-server10 | 192.168.20.43 | control-plane, etcd | v1.34.6+k3s1 |
| k3s-server11 | 192.168.20.48 | control-plane, etcd, master | v1.34.6+k3s1 |
| k3s-server12 | 192.168.20.56 | control-plane, etcd, master | v1.34.6+k3s1 |
| k3s-loadbalancer | 192.168.20.47 | API load balancer | — |
| k3s-agent10–19 | 192.168.20.44–67 | workers | v1.33.5+k3s1 |
| k3s-agent20–21 | 192.168.20.69–70 | workers | v1.34.3+k3s1 |
| k3s-agent22–23 | 192.168.20.72–73 | workers | v1.34.4+k3s1 |
339
docs/superpowers/plans/2026-04-21-k3s-server11-reprovision.md
Normal file
@@ -0,0 +1,339 @@
# k3s-server11 Reprovision Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Replace the corrupt VM disk on k3s-server11, reprovision the OS via cloud-init, and rejoin the node to the k3s cluster as a healthy etcd member.

**Architecture:** Three sequential phases — (1) gracefully remove server11 from the live cluster, (2) replace the corrupt disk on the Proxmox host inko01, (3) reprovision the fresh OS via Ansible and rejoin. etcd data is safe on server10 and server12 throughout.

**Tech Stack:** kubectl, etcdctl (embedded in k3s), Proxmox `qm` CLI, Ansible

---

### Task 1: Verify cluster health before starting

**Access:** local workstation with kubectl, or `ssh k3s-server12`

- [ ] **Step 1.1: Confirm all 3 etcd members are present and healthy**

```bash
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://192.168.20.43:2379,https://192.168.20.48:2379,https://192.168.20.56:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  endpoint health -w table'
```

Expected output — all three endpoints show `true`:

```
+----------------------------+--------+-------+-------+
|          ENDPOINT          | HEALTH | TOOK  | ERROR |
+----------------------------+--------+-------+-------+
| https://192.168.20.43:2379 |   true |  ~8ms |       |
| https://192.168.20.56:2379 |   true | ~11ms |       |
| https://192.168.20.48:2379 |   true |  ~Xms |       |
+----------------------------+--------+-------+-------+
```

If server11's endpoint is unhealthy but the other two are healthy, proceed — that's expected given the disk issues.

- [ ] **Step 1.2: Confirm server11's current etcd member ID**

```bash
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  member list -w table'
```

Expected: server11's member ID is `e9f8fa983ff7f958`. If it differs, use the ID shown here in Task 2 Step 2.2.

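Rather than copying the ID by eye, it can be extracted from etcdctl's default comma-separated `member list` output (that is, without `-w table`). A sketch only: `member_id_for` is an illustrative name, and it assumes the etcd member name begins with the node's hostname, which should be checked against the real output.

```bash
# Print the etcd member ID whose name starts with the given node name.
# Expects etcdctl's default "simple" member list output on stdin:
#   ID, STATUS, NAME, PEER-URLS, CLIENT-URLS, IS-LEARNER
member_id_for() {
  awk -F', ' -v node="$1" 'index($3, node) == 1 { print $1 }'
}

# Sample line: the ID is the one recorded in this plan; the name suffix "-a1b2c3d4" is made up.
echo 'e9f8fa983ff7f958, started, k3s-server11-a1b2c3d4, https://192.168.20.48:2380, https://192.168.20.48:2379, false' \
  | member_id_for k3s-server11
```

Feeding the result into Step 2.2 avoids removing the wrong member if the ID has changed since 2026-04-21.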
- [ ] **Step 1.3: Confirm kubectl works**

```bash
kubectl get nodes
```

Expected: all nodes visible, cluster not reporting errors.

---

### Task 2: Drain and remove server11 from the cluster

**Access:** local workstation with kubectl

- [ ] **Step 2.1: Drain the node**

```bash
kubectl drain k3s-server11 --ignore-daemonsets --delete-emptydir-data
```

Expected: pods evicted, ends with `node/k3s-server11 drained`. DaemonSet pods are skipped (normal).

- [ ] **Step 2.2: Remove server11 from the etcd member list**

Run this from server11 itself while it's still up:

```bash
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  member remove e9f8fa983ff7f958'
```

Expected: `Member e9f8fa983ff7f958 removed from cluster ...`

If server11's etcd is not reachable, run from server12 instead:

```bash
ssh k3s-server12 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  member remove e9f8fa983ff7f958'
```

- [ ] **Step 2.3: Delete the node object from Kubernetes**

```bash
kubectl delete node k3s-server11
```

Expected: `node "k3s-server11" deleted`

- [ ] **Step 2.4: Verify cluster is healthy with 2 etcd members**

```bash
ssh k3s-server12 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://192.168.20.43:2379,https://192.168.20.56:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  member list -w table'
```

Expected: exactly 2 members (server10 + server12), both `started`.

```bash
kubectl get nodes
```

Expected: server11 is gone, all remaining nodes Ready.

---

### Task 3: Replace the corrupt disk on inko01

**Access:** `ssh inko01`

- [ ] **Step 3.1: Stop VM 111**

```bash
ssh inko01 'qm stop 111'
```

Expected: no output, or `stopping VM 111`. Verify:

```bash
ssh inko01 'qm status 111'
```

Expected: `status: stopped`

- [ ] **Step 3.2: Delete the corrupt disk**

```bash
ssh inko01 'qm set 111 --delete scsi0'
```

Expected: `update VM 111: -scsi0`

Verify the corrupt file is gone:

```bash
ssh inko01 'ls /opt/proxmox/images/111/'
```

Expected: only `vm-111-cloudinit.qcow2` remains (no `vm-111-disk-0.raw`).

- [ ] **Step 3.3: Import a fresh Debian 12 cloud-init image**

```bash
ssh inko01 'qm importdisk 111 /opt/proxmox/template/iso/debian-12-genericcloud-amd64.qcow2 proxmox'
```

Expected output (takes ~30s):

```
importing disk '/opt/proxmox/template/iso/debian-12-genericcloud-amd64.qcow2' to VM 111 ...
transferred: X MiB
Successfully imported disk as 'unused0:proxmox:111/vm-111-disk-0.raw'
```

- [ ] **Step 3.4: Attach the disk and set boot order**

```bash
ssh inko01 'qm set 111 --scsi0 proxmox:111/vm-111-disk-0.raw --boot order=scsi0'
```

Expected: `update VM 111: -boot order=scsi0 -scsi0 proxmox:111/vm-111-disk-0.raw`

- [ ] **Step 3.5: Resize disk to 64G**

```bash
ssh inko01 'qm resize 111 scsi0 64G'
```

Expected: `resizing disk scsi0 to 64G ...` or `size is already 64G` if the import was exact.

- [ ] **Step 3.6: Start the VM**

```bash
ssh inko01 'qm start 111'
```

Expected: no output. Verify:

```bash
ssh inko01 'qm status 111'
```

Expected: `status: running`

- [ ] **Step 3.7: Wait for cloud-init and SSH to be ready**

Cloud-init configures hostname, user, and SSH keys on first boot (~60s). Poll until SSH responds:

```bash
until ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no k3s-server11 'hostname' 2>/dev/null; do
  echo "waiting for SSH..."; sleep 10
done
```

Expected: prints `k3s-server11` when ready.

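The loop in Step 3.7 never gives up, which is awkward when an unattended worker runs the plan and the VM fails to boot. A bounded variant could look like the sketch below; `wait_for` is an illustrative helper, and the 300-second limit in the example is an arbitrary choice rather than a measured value.

```bash
# Retry a command until it succeeds or roughly <timeout> seconds of sleeping have passed.
# The time spent inside the command itself is not counted.
# Usage: wait_for <timeout> <interval> <command...>
wait_for() {
  local timeout=$1 interval=$2 elapsed=0
  shift 2
  until "$@"; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "timed out after ${timeout}s: $*" >&2
      return 1
    fi
    sleep "$interval"
    elapsed=$(( elapsed + interval ))
  done
}

# Example (not executed here, since it needs the VM):
#   wait_for 300 10 ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no k3s-server11 hostname
```

A non-zero return gives the caller a clear signal to stop before Step 3.8 instead of hanging.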
- [ ] **Step 3.8: Verify clean disk — no I/O errors**

```bash
ssh k3s-server11 'sudo dmesg | grep -i "i/o error"'
```

Expected: **no output**. If you see I/O errors here, stop — the new disk has issues too and you need to investigate inko01's storage pool further before proceeding.

---

### Task 4: Reprovision via Ansible

**Access:** local workstation in the `ansible-homelab` repo

- [ ] **Step 4.1: Run the k3s-servers playbook targeting only server11**

```bash
ansible-playbook playbooks/k3s-servers.yaml --limit k3s-server11
```

This runs `common` and `k3s_server` roles. Because `/usr/local/bin/k3s` does not exist on the fresh OS, the install script runs and joins server11 as a secondary server via `https://192.168.20.47:6443` (loadbalancer). k3s automatically registers as a new etcd member.

Expected: playbook completes with no failed tasks.

- [ ] **Step 4.2: Verify server11 joined Kubernetes**

```bash
kubectl get nodes -o wide
```

Expected: `k3s-server11` shows `Ready` with role `control-plane,etcd,master` within ~2 minutes.

- [ ] **Step 4.3: Verify server11 is back in the etcd member list**

```bash
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://192.168.20.43:2379,https://192.168.20.48:2379,https://192.168.20.56:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  endpoint health -w table'
```

Expected: all 3 endpoints healthy, server11 responding in <100ms (not 400ms like before).

- [ ] **Step 4.4: Verify etcd has 3 members**

```bash
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  member list -w table'
```

Expected: 3 members, all `started`.

- [ ] **Step 4.5: Uncordon the node**

The drain in Task 2 cordoned the node, but Step 2.3 then deleted that node object, so the rejoined node is normally schedulable already. Run uncordon anyway to make sure workloads can be scheduled; it is harmless if the node is not cordoned:

```bash
kubectl uncordon k3s-server11
```

Expected: `node/k3s-server11 uncordoned`

---

### Task 5: Final health check

- [ ] **Step 5.1: Confirm all nodes Ready**

```bash
kubectl get nodes -o wide
```

Expected: all 17 nodes (3 servers + 14 agents) show `Ready`.

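Counting 17 rows by eye is error-prone, so the check in Step 5.1 can be scripted. A minimal sketch, assuming the usual `kubectl get nodes --no-headers` column order with STATUS second; `count_ready_nodes` is an illustrative name and the sample rows are made up.

```bash
# Count nodes whose STATUS column is exactly "Ready".
# Cordoned nodes ("Ready,SchedulingDisabled") and NotReady nodes are not counted.
count_ready_nodes() {
  awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Sample rows for illustration only.
printf '%s\n' \
  'k3s-server10 Ready control-plane,etcd 10d v1.34.6+k3s1' \
  'k3s-agent10 NotReady <none> 10d v1.33.5+k3s1' \
  'k3s-agent11 Ready,SchedulingDisabled <none> 10d v1.33.5+k3s1' | count_ready_nodes

# Against the cluster (not executed here):
#   [ "$(kubectl get nodes --no-headers | count_ready_nodes)" -eq 17 ] && echo "all nodes Ready"
```

Matching the status exactly also catches a server11 that is still cordoned after Task 4.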
- [ ] **Step 5.2: Confirm no disk errors on server11**

```bash
ssh k3s-server11 'sudo dmesg | grep -iE "(i/o error|sda.*error|error.*sda)" | wc -l'
```

Expected: `0`

- [ ] **Step 5.3: Confirm backups will work — test a manual backup**

From inko01, trigger a backup of VM 111 to verify the new disk is readable end-to-end:

```bash
ssh inko01 'vzdump 111 --compress zstd --storage proxmox --mode snapshot'
```

Expected: completes without `err -5` or `Input/output error`. Backups of this VM had been failing since 2026-02-15, so a successful backup here confirms the new disk image is readable end-to-end.

- [ ] **Step 5.4: Update the runbook**

In `docs/runbooks/k3s-cluster-outage-2026-04-20.md`, update the risks table to mark the server11 disk issue as resolved:

Change:

```
| server11 disk I/O errors | Critical | **Unresolved** — same sectors fail at every boot |
| server11 etcd latency (423ms vs 8ms on peers) | High | **Unresolved** — caused by disk |
```

To:

```
| server11 disk I/O errors | Critical | **Resolved** 2026-04-21 — disk replaced, VM reprovisioned |
| server11 etcd latency (423ms vs 8ms on peers) | High | **Resolved** 2026-04-21 — latency normal after disk replacement |
```

- [ ] **Step 5.5: Commit**

```bash
git add docs/runbooks/k3s-cluster-outage-2026-04-20.md
git commit -m "docs: mark server11 disk issue resolved in runbook"
```
@@ -0,0 +1,146 @@
# Design: Reprovision k3s-server11

**Date**: 2026-04-21
**Status**: Approved

## Background

k3s-server11 (Proxmox VM 111 on inko01) has a corrupted btrfs VM disk image
(`/opt/proxmox/images/111/vm-111-disk-0.raw`). The corruption has been present since
~2026-02-15 (when backups started failing with I/O errors). The VM's guest OS sees this
as bad sectors on `/dev/sda`, causing etcd to crash with SIGBUS when it mmap-reads those
sectors. This triggered a full cluster outage on 2026-04-20.

The physical SSD on inko01 is healthy (SMART PASSED). The corruption is at the btrfs
filesystem layer (3279+ corrupt blocks, single-device — no redundancy to recover from).

Since etcd data is fully replicated on server10 and server12, no data recovery is needed.
The correct fix is to replace the disk with a fresh OS image and rejoin the node.

## Architecture

Three sequential phases. Each phase must complete successfully before the next begins.

```
Phase 1: k8s cleanup        →  Phase 2: Proxmox disk       →  Phase 3: Ansible reprovision
(drain, etcd remove,           (stop VM, delete disk,         (common + k3s_server roles,
 delete node)                   import fresh image,            joins as secondary server,
                                resize, start)                 etcd re-adds member)
```

## Phase 1: Remove server11 from the cluster

Run from a machine with `kubectl` access (e.g. local workstation).

**1.1 Drain the node** — evicts all non-daemonset pods:

```bash
kubectl drain k3s-server11 --ignore-daemonsets --delete-emptydir-data
```

**1.2 Remove from etcd** — prevents quorum issues while the disk is replaced:

```bash
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  member remove e9f8fa983ff7f958'
```

**1.3 Delete the node object**:

```bash
kubectl delete node k3s-server11
```

**Verify**: `kubectl get nodes` shows only server10, server12, and the agents. Etcd member
list shows only 2 members (server10 + server12). Cluster remains healthy with quorum.

## Phase 2: Replace the VM disk on inko01

Run directly on inko01 via SSH.

**2.1 Stop the VM**:

```bash
qm stop 111
```

**2.2 Delete the corrupt disk** (detaches and removes the raw file):

```bash
qm set 111 --delete scsi0
```

**2.3 Import a fresh Debian 12 cloud-init image as a new disk**:

```bash
qm importdisk 111 /opt/proxmox/template/iso/debian-12-genericcloud-amd64.qcow2 proxmox
```

This creates `/opt/proxmox/images/111/vm-111-disk-0.raw` from the clean base image.

**2.4 Attach the disk and set boot order**:

```bash
qm set 111 --scsi0 proxmox:111/vm-111-disk-0.raw --boot order=scsi0
```

**2.5 Resize to 64G** (matching original disk size):

```bash
qm resize 111 scsi0 64G
```

**2.6 Start the VM**:

```bash
qm start 111
```

Cloud-init runs on first boot and configures: hostname (`k3s-server11`), user (`tudattr`),
SSH keys, and DHCP networking. Wait ~60s for SSH to become available before Phase 3.

**Verify**: `ssh k3s-server11 hostname` returns `k3s-server11` and no disk I/O errors
appear in `dmesg`.

## Phase 3: Reprovision via Ansible

Run from local workstation in the ansible-homelab repo.

```bash
ansible-playbook playbooks/k3s-servers.yaml --limit k3s-server11
```

This runs the `common` and `k3s_server` roles against server11 only:

- `common`: installs base packages, configures SSH, hostname, etc.
- `k3s_server`: detects `/usr/local/bin/k3s` does not exist → runs install script with
  `--server https://192.168.20.47:6443` (loadbalancer) → joins as a secondary server.
  The role reads the cluster token from server10 (the primary), and k3s then registers as
  a new etcd member automatically.

**Verify**:
```bash
kubectl get nodes   # server11 shows Ready
ssh k3s-server11 'sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  member list -w table'   # 3 members, all started
# sudo: unprivileged dmesg is typically restricted on Debian
ssh k3s-server11 'sudo dmesg | grep -i "i/o error"'   # no output
```
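The "3 members, all started" check can also be scripted rather than read off the table. This is a sketch under assumptions: `count_started_members` is a hypothetical helper, and it expects the comma-separated output of `etcdctl member list -w simple` (ID, status, name, peer URLs, client URLs, learner flag) on stdin rather than the `-w table` format used above.

```bash
# Hypothetical helper: count etcd members whose status is "started".
# Reads `etcdctl member list -w simple` output on stdin; each line is assumed
# to look like: ID, STATUS, NAME, PEER-ADDRS, CLIENT-ADDRS, IS-LEARNER
count_started_members() {
  # "n + 0" forces a numeric 0 when no line matched.
  awk -F', ' '$2 == "started" { n++ } END { print n + 0 }'
}
```

Piping the member list through this function and comparing the result with `3` gives a pass/fail signal that fits into a post-recovery script.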
|
||||||
|
|
||||||
|
## Key Facts

| Item | Value |
|------|-------|
| VM ID | 111 |
| Proxmox host | inko01 |
| VM disk path | `/opt/proxmox/images/111/vm-111-disk-0.raw` |
| Base image | `/opt/proxmox/template/iso/debian-12-genericcloud-amd64.qcow2` |
| Proxmox storage pool | `proxmox` |
| server11 IP | 192.168.20.48 |
| server11 etcd member ID | `e9f8fa983ff7f958` |
| Loadbalancer IP | 192.168.20.47 |
| k3s primary server | server10 (192.168.20.43) |
|
||||||
|
|
||||||
|
## Risk

- **During Phase 1–2**: cluster runs on 2 etcd members. Still has quorum but no
  redundancy. Avoid other disruptive changes until server11 is back.
- **etcd member ID**: `e9f8fa983ff7f958` was confirmed on 2026-04-21. Verify it matches
  before running the remove command if time has passed.
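The "quorum but no redundancy" statement follows from etcd's majority rule: a cluster of N voting members needs floor(N/2) + 1 of them available. A small sketch of that arithmetic (the function names are illustrative, not from the runbook):

```bash
# Majority needed for an etcd cluster of N voting members: floor(N/2) + 1.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Number of members that can fail while the cluster still has quorum.
fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}
```

With 3 members the quorum is 2, so one member can fail; with 2 members the quorum is still 2, so any further loss stops the cluster. This is why other disruptive changes should wait until server11 has rejoined.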
|
||||||
77
roles/common/files/kitty/infocmp
Normal file
@@ -0,0 +1,77 @@
# Reconstructed via infocmp from file: /usr/lib/kitty/terminfo/./x/xterm-kitty
xterm-kitty|KovIdTTY,
    am, bw, ccc, hs, km, mc5i, mir, msgr, npc, xenl, Su, Tc, XF, fullkbd,
    colors#0x100, cols#80, it#8, lines#24, pairs#0x7fff,
    acsc=++\,\,--..00``aaffgghhiijjkkllmmnnooppqqrrssttuuvvwwxxyyzz{{||}}~~,
    bel=^G, blink=\E[5m, bold=\E[1m, cbt=\E[Z, civis=\E[?25l,
    clear=\E[H\E[2J, cnorm=\E[?12h\E[?25h, cr=\r,
    csr=\E[%i%p1%d;%p2%dr, cub=\E[%p1%dD, cub1=^H,
    cud=\E[%p1%dB, cud1=\n, cuf=\E[%p1%dC, cuf1=\E[C,
    cup=\E[%i%p1%d;%p2%dH, cuu=\E[%p1%dA, cuu1=\E[A,
    cvvis=\E[?12;25h, dch=\E[%p1%dP, dch1=\E[P, dim=\E[2m,
    dl=\E[%p1%dM, dl1=\E[M, dsl=\E]2;\E\\, ech=\E[%p1%dX,
    ed=\E[J, el=\E[K, el1=\E[1K, flash=\E[?5h$<100/>\E[?5l,
    fsl=^G, home=\E[H, hpa=\E[%i%p1%dG, ht=^I, hts=\EH,
    ich=\E[%p1%d@, il=\E[%p1%dL, il1=\E[L, ind=\n,
    indn=\E[%p1%dS,
    initc=\E]4;%p1%d;rgb:%p2%{255}%*%{1000}%/%2.2X/%p3%{255}%*%{1000}%/%2.2X/%p4%{255}%*%{1000}%/%2.2X\E\\,
    kBEG=\E[1;2E, kDC=\E[3;2~, kEND=\E[1;2F, kHOM=\E[1;2H,
    kIC=\E[2;2~, kLFT=\E[1;2D, kNXT=\E[6;2~, kPRV=\E[5;2~,
    kRIT=\E[1;2C, kbeg=\EOE, kbs=^?, kcbt=\E[Z, kcub1=\EOD,
    kcud1=\EOB, kcuf1=\EOC, kcuu1=\EOA, kdch1=\E[3~, kend=\EOF,
    kf1=\EOP, kf10=\E[21~, kf11=\E[23~, kf12=\E[24~,
    kf13=\E[1;2P, kf14=\E[1;2Q, kf15=\E[13;2~, kf16=\E[1;2S,
    kf17=\E[15;2~, kf18=\E[17;2~, kf19=\E[18;2~, kf2=\EOQ,
    kf20=\E[19;2~, kf21=\E[20;2~, kf22=\E[21;2~,
    kf23=\E[23;2~, kf24=\E[24;2~, kf25=\E[1;5P, kf26=\E[1;5Q,
    kf27=\E[13;5~, kf28=\E[1;5S, kf29=\E[15;5~, kf3=\EOR,
    kf30=\E[17;5~, kf31=\E[18;5~, kf32=\E[19;5~,
    kf33=\E[20;5~, kf34=\E[21;5~, kf35=\E[23;5~,
    kf36=\E[24;5~, kf37=\E[1;6P, kf38=\E[1;6Q, kf39=\E[13;6~,
    kf4=\EOS, kf40=\E[1;6S, kf41=\E[15;6~, kf42=\E[17;6~,
    kf43=\E[18;6~, kf44=\E[19;6~, kf45=\E[20;6~,
    kf46=\E[21;6~, kf47=\E[23;6~, kf48=\E[24;6~,
    kf49=\E[1;3P, kf5=\E[15~, kf50=\E[1;3Q, kf51=\E[13;3~,
    kf52=\E[1;3S, kf53=\E[15;3~, kf54=\E[17;3~,
    kf55=\E[18;3~, kf56=\E[19;3~, kf57=\E[20;3~,
    kf58=\E[21;3~, kf59=\E[23;3~, kf6=\E[17~, kf60=\E[24;3~,
    kf61=\E[1;4P, kf62=\E[1;4Q, kf63=\E[13;4~, kf7=\E[18~,
    kf8=\E[19~, kf9=\E[20~, khome=\EOH, kich1=\E[2~,
    kind=\E[1;2B, kmous=\E[M, knp=\E[6~, kpp=\E[5~,
    kri=\E[1;2A, oc=\E]104\007, op=\E[39;49m, rc=\E8,
    rep=%p1%c\E[%p2%{1}%-%db, rev=\E[7m, ri=\EM,
    rin=\E[%p1%dT, ritm=\E[23m, rmacs=\E(B, rmam=\E[?7l,
    rmcup=\E[?1049l, rmir=\E[4l, rmkx=\E[?1l, rmso=\E[27m,
    rmul=\E[24m, rs1=\E]\E\\\Ec, sc=\E7,
    setab=\E[%?%p1%{8}%<%t4%p1%d%e%p1%{16}%<%t10%p1%{8}%-%d%e48;5;%p1%d%;m,
    setaf=\E[%?%p1%{8}%<%t3%p1%d%e%p1%{16}%<%t9%p1%{8}%-%d%e38;5;%p1%d%;m,
    sgr=%?%p9%t\E(0%e\E(B%;\E[0%?%p6%t;1%;%?%p2%t;4%;%?%p1%p3%|%t;7%;%?%p4%t;5%;%?%p7%t;8%;%?%p5%t;2%;m,
    sgr0=\E(B\E[m, sitm=\E[3m, smacs=\E(0, smam=\E[?7h,
    smcup=\E[?1049h, smir=\E[4h, smkx=\E[?1h, smso=\E[7m,
    smul=\E[4m, tbc=\E[3g, tsl=\E]2;, u6=\E[%i%d;%dR, u7=\E[6n,
    u8=\E[?%[;0123456789]c, u9=\E[c, vpa=\E[%i%p1%dd,
    BD=\E[?2004l, BE=\E[?2004h, Cr=\E]112\007,
    Cs=\E]12;%p1%s\007, Ms=\E]52;%p1%s;%p2%s\E\\,
    PE=\E[201~, PS=\E[200~, RV=\E[>c, Se=\E[2 q,
    Setulc=\E[58:2:%p1%{65536}%/%d:%p1%{256}%/%{255}%&%d:%p1%{255}%&%d%;m,
    Smulx=\E[4:%p1%dm, Ss=\E[%p1%d q, Sync=\EP=%p1%ds\E\\,
    XR=\E[>0q, fd=\E[?1004l, fe=\E[?1004h, kBEG3=\E[1;3E,
    kBEG4=\E[1;4E, kBEG5=\E[1;5E, kBEG6=\E[1;6E,
    kBEG7=\E[1;7E, kDC3=\E[3;3~, kDC4=\E[3;4~, kDC5=\E[3;5~,
    kDC6=\E[3;6~, kDC7=\E[3;7~, kDN=\E[1;2B, kDN3=\E[1;3B,
    kDN4=\E[1;4B, kDN5=\E[1;5B, kDN6=\E[1;6B, kDN7=\E[1;7B,
    kEND3=\E[1;3F, kEND4=\E[1;4F, kEND5=\E[1;5F,
    kEND6=\E[1;6F, kEND7=\E[1;7F, kHOM3=\E[1;3H,
    kHOM4=\E[1;4H, kHOM5=\E[1;5H, kHOM6=\E[1;6H,
    kHOM7=\E[1;7H, kIC3=\E[2;3~, kIC4=\E[2;4~, kIC5=\E[2;5~,
    kIC6=\E[2;6~, kIC7=\E[2;7~, kLFT3=\E[1;3D, kLFT4=\E[1;4D,
    kLFT5=\E[1;5D, kLFT6=\E[1;6D, kLFT7=\E[1;7D,
    kNXT3=\E[6;3~, kNXT4=\E[6;4~, kNXT5=\E[6;5~,
    kNXT6=\E[6;6~, kNXT7=\E[6;7~, kPRV3=\E[5;3~,
    kPRV4=\E[5;4~, kPRV5=\E[5;5~, kPRV6=\E[5;6~,
    kPRV7=\E[5;7~, kRIT3=\E[1;3C, kRIT4=\E[1;4C,
    kRIT5=\E[1;5C, kRIT6=\E[1;6C, kRIT7=\E[1;7C, kUP=\E[1;2A,
    kUP3=\E[1;3A, kUP4=\E[1;4A, kUP5=\E[1;5A, kUP6=\E[1;6A,
    kUP7=\E[1;7A, kxIN=\E[I, kxOUT=\E[O, rmxx=\E[29m,
    setrgbb=\E[48:2:%p1%d:%p2%d:%p3%dm,
    setrgbf=\E[38:2:%p1%d:%p2%d:%p3%dm, smxx=\E[9m,
@@ -4,3 +4,9 @@
     name: sshd
     state: restarted
   become: true
+
+- name: Restart timesyncd
+  ansible.builtin.systemd:
+    name: systemd-timesyncd
+    state: restarted
+  become: true
@@ -22,3 +22,16 @@
 - name: Compile ghostty terminalinfo
   ansible.builtin.command: "tic -x {{ ansible_env.HOME }}/ghostty"
   when: ghostty_terminfo.changed
+
+- name: Copy kitty infocmp
+  ansible.builtin.copy:
+    src: files/kitty/infocmp
+    dest: "{{ ansible_env.HOME }}/kitty"
+    owner: "{{ ansible_user_id }}"
+    group: "{{ ansible_user_id }}"
+    mode: "0644"
+  register: kitty_terminfo
+
+- name: Compile kitty terminalinfo
+  ansible.builtin.command: "tic -x {{ ansible_env.HOME }}/kitty"
+  when: kitty_terminfo.changed
@@ -9,3 +9,26 @@
   community.general.timezone:
     name: "{{ timezone }}"
   when: ansible_user_id == "root"
+
+- name: Configure NTP servers for systemd-timesyncd
+  ansible.builtin.lineinfile:
+    path: /etc/systemd/timesyncd.conf
+    regexp: "^#?NTP="
+    line: "NTP=0.debian.pool.ntp.org 1.debian.pool.ntp.org 2.debian.pool.ntp.org 3.debian.pool.ntp.org"
+  become: true
+  notify: Restart timesyncd
+
+- name: Enable and start systemd-timesyncd
+  ansible.builtin.systemd:
+    name: systemd-timesyncd
+    enabled: true
+    state: started
+  become: true
+  when: ansible_user_id != "root"
+
+- name: Enable and start systemd-timesyncd
+  ansible.builtin.systemd:
+    name: systemd-timesyncd
+    enabled: true
+    state: started
+  when: ansible_user_id == "root"
@@ -15,15 +15,15 @@
 
 - name: Install primary k3s server
   include_tasks: primary_installation.yaml
-  when: ansible_default_ipv4.address == k3s_primary_server_ip
+  when: inventory_hostname == groups['k3s_server'] | first
 
 - name: Get token from primary k3s server
   include_tasks: pull_token.yaml
 
 - name: Install seconary k3s servers
   include_tasks: secondary_installation.yaml
-  when: ansible_default_ipv4.address != k3s_primary_server_ip
+  when: inventory_hostname != groups['k3s_server'] | first
 
 - name: Set kubeconfig on localhost
   include_tasks: create_kubeconfig.yaml
-  when: ansible_default_ipv4.address == k3s_primary_server_ip
+  when: inventory_hostname == groups['k3s_server'] | first
@@ -1,15 +1,15 @@
-- name: Get K3s token from the first server
-  when: ansible_default_ipv4.address == k3s_primary_server_ip
+- name: Get K3s token from the primary server
   ansible.builtin.slurp:
     src: /var/lib/rancher/k3s/server/node-token
-  register: k3s_token
+  register: k3s_token_raw
+  delegate_to: "{{ groups['k3s_server'] | first }}"
+  run_once: true
   become: true
 
-- name: Set fact on k3s_primary_server_ip
+- name: Set k3s_token fact
   ansible.builtin.set_fact:
-    k3s_token: "{{ k3s_token['content'] | b64decode | trim }}"
-  when:
-    - ansible_default_ipv4.address == k3s_primary_server_ip
+    k3s_token: "{{ k3s_token_raw['content'] | b64decode | trim }}"
+  run_once: true
 
 - name: Write K3s token to local file for encryption
   ansible.builtin.copy: