Compare commits: 4563ef83f1...main (4 commits)
Commits: 40fa132e0d, c48ced6207, 3ac7d91101, a187b648e7
97
README.md
@@ -1,6 +1,6 @@
# Homelab

A production-grade homelab running on bare-metal Proxmox, with a 17-node Kubernetes cluster managed entirely through GitOps.
17-node Kubernetes cluster on five bare-metal Proxmox hosts, provisioned with Terraform and Ansible, managed through ArgoCD GitOps. Runs my home automation, media stack, photo backup, documents, and a few side projects.

@@ -14,46 +14,45 @@ A production-grade homelab running on bare-metal Proxmox, with a 17-node Kuberne
```mermaid
graph TB
subgraph ext[" External"]
CF["Cloudflare\nCDN + DNS"]
CF["Cloudflare CDN"]
Admin["Remote Admin"]
end

subgraph vps["Edge VPS"]
WG["WireGuard\nVPN Gateway"]
TraefikVPS["Traefik\nReverse Proxy"]
Pangolin["Pangolin\nTunnel Server"]
WG["WireGuard VPN Gateway"]
TraefikVPS["Traefik"]
Pangolin["Pangolin Tunnel Server"]
end

subgraph proxmox["Proxmox Cluster — 5 physical nodes"]
subgraph cp["Control Plane ×3 (HA etcd)"]
subgraph cp["Control Plane x3 — HA etcd + kube-vip"]
S["k3s-server"]
end
LB["nginx\nLoad Balancer"]
subgraph workers["Worker Nodes ×14"]
subgraph workers["Worker Nodes x14"]
W["k3s-agent"]
end
DH["docker-host\nIntel QuickSync GPU"]
NFS["NFS Server\nDedicated storage node"]
DH["docker-host — Intel QuickSync GPU"]
NFS["NFS Server — dedicated storage node"]
end

subgraph k8s["Kubernetes"]
subgraph platform["Platform layer"]
subgraph platform["Platform"]
direction LR
MetalLB["MetalLB"]
Traefik["Traefik"]
Longhorn["Longhorn"]
ArgoCD["ArgoCD"]
Prometheus["Prometheus\n+ Grafana"]
ECK["Elastic Stack\n(ECK)"]
Istio["Istio\nAmbient"]
MetalLB
Traefik
Longhorn
ArgoCD
Prometheus
ECK["Elastic Stack"]
Istio["Istio Ambient"]
end
subgraph apps["Applications"]
direction LR
Immich["Immich"]
Immich
VW["Vaultwarden"]
HA["Home Assistant"]
Media["Arr Stack\n+ Jellyfin"]
Other["Paperless · N8n\nNtfy · Gitea · …"]
Media["Arr Stack + Jellyfin"]
Other["Paperless, N8n, Ntfy ..."]
end
end

@@ -62,11 +61,10 @@ graph TB
CF -->|Cloudflare tunnel| k8s
TraefikVPS --> Pangolin
Pangolin -->|Newt client| k8s
LB --> cp
cp --- workers
workers --- Longhorn
NFS -->|NFS mount| Media
DH -->|Jellyfin\nDocker| Media
DH -->|Docker| Media
```

---
@@ -78,16 +76,15 @@ graph TB
| Physical | `aya01` | Proxmox node + NFS server | Dedicated storage — no VMs |
| Physical | `lulu` | Proxmox node | k3s agents |
| Physical | `inko01` | Proxmox node | k3s server + agents + docker host |
| Physical | `naruto01` | Proxmox node | k3s server + agents + LB |
| Physical | `naruto01` | Proxmox node | k3s server + agents |
| Physical | `mii01` | Proxmox node | k3s server + agents |
| VM | `k3s-server-{10,11,12}` | K3s control plane (HA etcd) | 2 vCPU · 4 GB RAM · 64 GB |
| VM | `k3s-server-{10,11,12}` | K3s control plane (HA etcd + kube-vip VIP) | 2 vCPU · 4 GB RAM · 64 GB |
| VM | `k3s-agent-{10…23}` | K3s worker nodes ×14 | 2 vCPU · 4 GB RAM · 128 GB |
| VM | `docker-host11` | Docker host w/ GPU passthrough | 2 vCPU · 4 GB RAM · 192 GB · Intel QuickSync |
| VM | `k3s-loadbalancer` | nginx LB fronting control plane | 1 vCPU · 2 GB RAM |
| VM | `docker-lb` | Caddy reverse proxy (LAN only) | 1 vCPU · 2 GB RAM |
| VPS | `mii` | Edge node (Netcup) | WireGuard · Traefik · Pangolin |

All VMs run **Debian 12** on `virtio` network bridges, provisioned from cloud-init templates via **Terraform + Ansible**.
All VMs run Debian 12 on `virtio` network bridges, provisioned from cloud-init templates via Terraform + Ansible.

---

@@ -97,6 +94,7 @@ All VMs run **Debian 12** on `virtio` network bridges, provisioned from cloud-in
|-----------|-------------|---------|
| **ArgoCD** | Helm (App-of-Apps) | GitOps CD — all cluster state driven from Git |
| **ArgoCD Image Updater** | Helm | Watches registries, commits updated image tags back to Git |
| **kube-vip** | DaemonSet on control plane | HA VIP for the K8s API server |
| **Traefik** | k3s built-in | Ingress controller, fronted by MetalLB |
| **MetalLB** | Helm (ArgoCD) | Bare-metal load balancer, assigns IPs from reserved pool |
| **Cert-Manager** | Helm (ArgoCD) | Automated TLS via Let's Encrypt DNS-01 (Cloudflare API) |
@@ -130,44 +128,43 @@ All VMs run **Debian 12** on `virtio` network bridges, provisioned from cloud-in
| **Gitea Runner** | CI/CD runner | – |
| **Zeroclaw** | Per-user instances (×3) via Kustomize overlays | – |
| **Arr Stack** | Media automation suite | Prowlarr · Sonarr · Radarr · Unpackarr |
| **qBittorrent** | Torrent clients (×2) | Gluetun VPN sidecar · ProtonVPN |
| **Download clients** | VPN-isolated download clients (×2) | Gluetun sidecar |
| **Jellyfin** | Media server with hardware transcoding | Docker · Intel QuickSync |

---

## Key Design Decisions
## Design notes

**GitOps end-to-end.** Every cluster resource is declared in Git and applied by ArgoCD. Nothing is `kubectl apply`'d by hand. ArgoCD Image Updater closes the loop by writing image tag updates back to Git automatically.
Everything goes through Git. ArgoCD owns the cluster state; nothing gets `kubectl apply`'d directly. ArgoCD Image Updater handles the image update loop: when a new tag appears in the registry, it commits the change back to Git and ArgoCD picks it up from there.

**Secrets in Git, safely.** Sealed Secrets lets encrypted `SealedSecret` manifests live in the same repo as everything else. Only the in-cluster controller can decrypt them.
Secrets are committed to Git too, encrypted via Sealed Secrets. Only the in-cluster controller holds the decryption key.

**No cloud dependency for ingress.** MetalLB + Traefik handles all internal load balancing. External access goes through Cloudflare tunnels or a WireGuard VPN — no ports open on the home router.
No ports are open on the home router. Internal load balancing goes through MetalLB + Traefik. External access uses Cloudflare tunnels or a WireGuard VPN routed through the edge VPS.

**Distributed storage without a SAN.** Longhorn replicates volumes across all 14 agent nodes. NFS on a dedicated bare-metal host serves the media library to Jellyfin with low latency.
Longhorn handles block storage by replicating volumes across all 14 agent nodes. The media library lives on a dedicated NFS host instead — latency matters when Jellyfin is reading large video files, and NFS is simpler for that.

**Observability from day one.** Prometheus + Grafana for metrics, Elastic Stack (via ECK operator) for logs and fleet management. Elastic Agents run as a DaemonSet across the whole cluster.
Metrics go to Prometheus + Grafana. Logs and fleet management go to Elastic Stack via the ECK operator, with Elastic Agents running as a DaemonSet so every node is covered.

**Provisioning is reproducible.** Proxmox VMs are created via Terraform (Proxmox provider), then configured by Ansible roles — from base OS hardening to k3s installation and kubeconfig management.
All VMs are provisioned with Terraform and configured by Ansible. Rebuilding from scratch doesn't require remembering anything.

---

## Repository Layout
## Repo layout

```
ansible-homelab/ # Ansible roles + playbooks for all VM provisioning
ansible-homelab/
├── roles/
│ ├── common/ # Base OS config, SSH hardening, node-exporter
│ ├── k3s_server/ # HA control plane install + taint config
│ ├── k3s_agent/ # Worker node install
│ ├── k3s_loadbalancer/ # nginx LB config
│ ├── kube_vip/ # VIP setup
│ ├── docker_host/ # Docker + GPU passthrough
│ ├── proxmox/ # Proxmox node config
│ └── edge_vps/ # VPS services (WireGuard, Traefik, Pangolin)
└── playbooks/ # Top-level playbooks per host group
│ ├── common/ # base OS config, SSH hardening, node-exporter
│ ├── k3s_server/ # control plane install + NoSchedule taint
│ ├── k3s_agent/ # worker node install
│ ├── kube_vip/ # kube-vip DaemonSet + TLS SAN config
│ ├── docker_host/ # Docker + Intel QuickSync GPU passthrough
│ ├── proxmox/ # Proxmox node setup
│ └── edge_vps/ # VPS: WireGuard, Traefik, Pangolin, Elastic Agent
└── playbooks/

argocd-homelab/ # All Kubernetes manifests (ArgoCD App-of-Apps)
├── infrastructure/ # Platform: MetalLB, Longhorn, Cert-Manager, ECK, …
├── services/ # Applications: Immich, Vaultwarden, arr-stack, …
└── cluster-apps/ # ArgoCD ApplicationSets + root app
argocd-homelab/
├── infrastructure/ # MetalLB, Longhorn, Cert-Manager, ECK, Istio, ...
├── services/ # Immich, Vaultwarden, arr-stack, Home Assistant, ...
└── cluster-apps/ # ArgoCD App-of-Apps root + ApplicationSets
```

@@ -1,6 +1,6 @@
# Networking

## IP Layout
## IP layout

| Segment | Range | Purpose |
|---------|-------|---------|
@@ -12,7 +12,7 @@

---

## Traffic Flows
## Traffic flows

### Public services (Cloudflare tunnel)

@@ -20,7 +20,7 @@
User → Cloudflare (CDN + DDoS) → Cloudflared pod (×2, in-cluster) → Traefik → Service
```

Cloudflare acts as both CDN and the TLS termination point for public services. No ports are forwarded on the home router.
Cloudflare handles CDN and TLS termination. No ports are forwarded on the home router.

### VPS-proxied services (Pangolin tunnel)

@@ -38,7 +38,7 @@ Admin → WireGuard client → Edge VPS (WireGuard server)
→ K8s service CIDR (10.43.0.0/16)
```

The `mii-wireguard` pod acts as the WireGuard client inside the cluster. It masquerades the K8s service CIDR so all cluster services are reachable over the VPN — no split-DNS required.
The `mii-wireguard` pod is the WireGuard client inside the cluster. It masquerades the K8s service CIDR so all cluster services are reachable over the VPN without split-DNS.

### Gitea → ArgoCD webhook

@@ -46,7 +46,7 @@ The `mii-wireguard` pod acts as the WireGuard client inside the cluster. It masq
Gitea (docker-host11) → push webhook → ArgoCD (in-cluster) → reconcile manifests
```

ArgoCD polls on a schedule and also receives webhooks from the self-hosted Gitea instance on git push.
ArgoCD polls on a schedule and also receives webhooks on git push.

### ArgoCD Image Updater → Gitea

@@ -63,7 +63,7 @@ Keeps image versions in Git without a human in the loop.
```
Prowlarr (indexer aggregator)
→ Sonarr / Radarr (request management)
→ qBittorrent + Gluetun sidecar (download over ProtonVPN)
→ download client + Gluetun sidecar (VPN-isolated)
→ Unpackarr (extract archives)
→ NFS share on aya01
→ Jellyfin (on docker-host11, hardware transcoding via Intel QuickSync)
@@ -71,14 +71,14 @@ Prowlarr (indexer aggregator)

---

## Certificate Management
## Certificate management

Cert-Manager handles all TLS automatically via **Let's Encrypt DNS-01** using the Cloudflare API. No HTTP-01 challenges — DNS-01 works for internal-only domains and wildcard certs.
Cert-Manager handles all TLS automatically via Let's Encrypt DNS-01 using the Cloudflare API. DNS-01 works for internal-only domains and wildcard certs without exposing any HTTP endpoint.

The edge VPS (Traefik) uses Netcup DNS API for its own certs.
The edge VPS uses the Netcup DNS API for its own certs.

---

## Service Mesh
## Service mesh

Istio runs in **Ambient mode** (no sidecars). The `ztunnel` DaemonSet runs on every node and handles transparent L4 proxying for all pods in the mesh. Waypoint proxies (L7) are not yet deployed.
Istio runs in Ambient mode — no sidecars. The `ztunnel` DaemonSet runs on every node and handles transparent L4 proxying for all pods in the mesh. Waypoint proxies (L7) are not yet deployed.

@@ -1,24 +1,24 @@
# Observability

Two parallel stacks cover metrics and logs.
Two parallel stacks: Prometheus for metrics, Elastic for logs.

---

## Metrics — Prometheus + Grafana
## Metrics

Deployed via the **kube-prometheus-stack** Helm chart (ArgoCD-managed), running in the `prometheus` namespace.
kube-prometheus-stack runs in the `prometheus` namespace (ArgoCD-managed). Prometheus scrapes all nodes, pods, and control plane components. Grafana has dashboards for cluster overview, node resources, Longhorn, ArgoCD, and Traefik.

- **Prometheus** scrapes all nodes, pods, and K8s control plane components
- **Grafana** dashboards: cluster overview, node resource usage, Longhorn, ArgoCD, Traefik
- **Alertmanager** routes alerts to Ntfy (self-hosted push notifications) via a custom webhook bridge
- **Node Exporter** runs on all VMs including docker-host11 and the edge VPS (Ansible-deployed)
- **Goldilocks + VPA** analyse actual resource usage and recommend request/limit values
Node Exporter is deployed via Ansible on every VM including `docker-host11` and the edge VPS, so coverage isn't limited to what's inside Kubernetes.

Goldilocks and VPA run alongside and analyze actual resource usage to suggest better request/limit values.

Alertmanager routes alerts to Ntfy via a custom webhook bridge.

---

## Logs + Fleet — Elastic Stack (ECK)
## Logs and fleet management

Deployed via the **ECK operator** (Elastic Cloud on Kubernetes), running in the `elastic-system` namespace.
The ECK operator (Elastic Cloud on Kubernetes) manages the Elastic stack in the `elastic-system` namespace:

| Component | Purpose |
|-----------|---------|
@@ -28,13 +28,13 @@ Deployed via the **ECK operator** (Elastic Cloud on Kubernetes), running in the
| Elastic Agent (DaemonSet) | Ships logs and metrics from every cluster node |
| Elastic Agent (standalone) | Runs on docker-host11 and the edge VPS |

The Elastic Agent DaemonSet tolerates the control-plane `NoSchedule` taint so logs are collected from server nodes as well as agents.
The DaemonSet tolerates the control-plane `NoSchedule` taint so server nodes are covered too.

Alerts from Elasticsearch rules are bridged to Ntfy via a small CronJob (`elastic-ntfy-bridge`) that polls the Elasticsearch alerts API and forwards new alerts as push notifications.
Elastic alert rules are bridged to Ntfy via `elastic-ntfy-bridge`, a small CronJob that polls the Elasticsearch alerts API and forwards new alerts as push notifications.
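
The "forward only new alerts" step of a bridge like `elastic-ntfy-bridge` could look roughly like the sketch below. This is not the actual bridge code: the alert fields (`id`, `rule`, `message`), the `publish` callback, and the `seen_ids` set are all hypothetical, and how the real CronJob remembers already-forwarded alerts between runs is not described here.

```python
from __future__ import annotations

from typing import Callable, Iterable


def forward_new_alerts(
    alerts: Iterable[dict],
    seen_ids: set[str],
    publish: Callable[[str, str], None],
) -> list[str]:
    """Publish every alert whose ID is not in seen_ids; return the forwarded IDs.

    `alerts` stands in for whatever one poll of the alerts API returns
    (hypothetical shape: {"id": ..., "rule": ..., "message": ...}).
    `publish(title, message)` stands in for the push to Ntfy.
    """
    forwarded: list[str] = []
    for alert in alerts:
        alert_id = alert["id"]
        if alert_id in seen_ids:
            continue  # already pushed on an earlier poll
        publish(alert.get("rule", "Elastic alert"), alert.get("message", ""))
        seen_ids.add(alert_id)  # remember it so the next poll skips it
        forwarded.append(alert_id)
    return forwarded


if __name__ == "__main__":
    # Example run with an in-memory "publisher" instead of a real Ntfy call.
    sent: list[tuple[str, str]] = []
    seen = {"a1"}
    polled = [
        {"id": "a1", "rule": "cpu", "message": "high load"},
        {"id": "a2", "rule": "disk", "message": "volume almost full"},
    ]
    print(forward_new_alerts(polled, seen, lambda t, m: sent.append((t, m))))
    print(sent)
```

Because a CronJob starts fresh on every run, `seen_ids` would have to be loaded from and saved to somewhere persistent, or replaced by a time-window query, for the de-duplication to survive between polls.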

---

## Alerting Flow
## Alerting flow

```
Prometheus Alertmanager ──► Ntfy (push notification)
@@ -42,4 +42,4 @@ Prometheus Alertmanager ──► Ntfy (push notification)
Elasticsearch alert rule ──► elastic-ntfy-bridge CronJob ─┘
```

All alerts land in the same Ntfy topic, accessible on mobile and desktop.
Both sources land in the same Ntfy topic.

@@ -2,7 +2,7 @@

## Overview

Three storage tiers serve different workloads:
Three storage tiers, each doing a different job:

| Tier | System | Access | Used by |
|------|--------|--------|---------|
@@ -14,46 +14,30 @@ Three storage tiers serve different workloads:

## Longhorn

Longhorn provides distributed block storage across all 14 agent nodes. Each volume is replicated (default: 3 replicas) across different nodes.
Longhorn gives distributed block storage across all 14 agent nodes. Each volume is replicated (default: 3 replicas) across different nodes, using the local disk on each agent (128 GB each).

- **RWO** (ReadWriteOnce) — used for most services (Vaultwarden, Paperless, etc.)
- **RWX** (ReadWriteMany) — used where multiple pods need shared access
- Volumes are backed by the local disk on each agent node (128 GB each)
- Longhorn manager runs as a DaemonSet; the CSI plugin integrates with the K8s storage layer
- Snapshots and backups are supported via the Longhorn UI
RWO (ReadWriteOnce) covers most services. RWX (ReadWriteMany) is used where multiple pods need access to the same volume. Snapshots and backups are available through the Longhorn UI.

Control plane nodes (`k3s-server-*`) are tainted `NoSchedule` — Longhorn manager tolerates this taint and runs everywhere, but user workloads are pushed to agent nodes only.
Control plane nodes are tainted `NoSchedule` — Longhorn manager tolerates this and runs everywhere, but user workloads stay on agent nodes.
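
The Longhorn figures above (14 agents, 128 GB of local disk each, 3 replicas per volume) put a rough ceiling on usable volume capacity. The sketch below only does that arithmetic. It is an upper bound under the assumption that each agent's whole disk is available to Longhorn, which it will not be in practice, since the same disk also holds the OS and Longhorn's own reserved space. The function names are illustrative.

```python
def raw_capacity_gb(nodes: int, disk_gb_per_node: float) -> float:
    """Total disk across all storage nodes, before replication."""
    return nodes * disk_gb_per_node


def usable_capacity_gb(nodes: int, disk_gb_per_node: float, replicas: int) -> float:
    """Upper bound on volume capacity when every volume is stored `replicas` times."""
    if nodes <= 0 or replicas <= 0:
        raise ValueError("nodes and replicas must be positive")
    if replicas > nodes:
        # Each replica of a volume is placed on a different node.
        raise ValueError("cannot place more replicas than there are nodes")
    return raw_capacity_gb(nodes, disk_gb_per_node) / replicas


if __name__ == "__main__":
    print(raw_capacity_gb(14, 128))                   # raw disk across the agents
    print(round(usable_capacity_gb(14, 128, 3), 1))   # ceiling with 3 replicas
```

With the numbers from this section, that is 1792 GB of raw disk and a ceiling of roughly 597 GB of replicated volume space.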

---

## CloudNativePG

The CNPG operator manages HA PostgreSQL clusters as first-class Kubernetes resources. Currently used by:

- **Immich** — primary database (photos, albums, users, ML embeddings)

CNPG handles streaming replication, failover, and scheduled backups. Data is stored on Longhorn PVCs.
CloudNativePG manages HA PostgreSQL clusters as Kubernetes resources. Immich uses it for its primary database (photos, albums, users, ML embeddings). CNPG handles streaming replication, failover, and scheduled backups, with data stored on Longhorn PVCs.

---

## NFS

A dedicated physical node (`aya01`) runs a bare-metal NFS server. This serves the media library to Jellyfin.
`aya01` is a dedicated bare-metal NFS server. Jellyfin mounts the share from `docker-host11` to access movies, TV shows, and music. Keeping the media library on a separate host means the Jellyfin VM can be rebuilt without touching the data.

- Movies, TV shows, and music live on `aya01`
- `docker-host11` (where Jellyfin runs) mounts the NFS share
- Separating media storage from the compute host means the Jellyfin VM can be rebuilt without touching the library
- NFS is not used for K8s workloads — Longhorn handles all PVC-backed storage
NFS is not used for K8s workloads — Longhorn handles all PVC-backed storage.

---

## Secret Storage
## Secrets

Kubernetes secrets are managed with **Sealed Secrets** (Bitnami). The workflow:
Kubernetes secrets go through Sealed Secrets (Bitnami). The workflow: create a regular `Secret`, encrypt it with `kubeseal` using the cluster's public key into a `SealedSecret`, then commit that to Git. Only the in-cluster controller can decrypt it.

1. Create a regular K8s `Secret`
2. Encrypt it with `kubeseal` using the cluster's public key → produces a `SealedSecret`
3. Commit the `SealedSecret` to Git — it is safe to store publicly
4. The in-cluster Sealed Secrets controller decrypts it into a regular `Secret` at apply time

Ansible secrets (VM credentials, API tokens) are encrypted with **Ansible Vault** and stored in `vars/group_vars/*/secrets_*.yaml`.
Ansible secrets (VM credentials, API tokens) are encrypted with Ansible Vault and live in `vars/group_vars/*/secrets_*.yaml`.