diff --git a/docs/networking.md b/docs/networking.md new file mode 100644 index 0000000..94d3484 --- /dev/null +++ b/docs/networking.md @@ -0,0 +1,84 @@ +# Networking + +## IP Layout + +| Segment | Range | Purpose | +|---------|-------|---------| +| LAN | `192.168.20.0/24` | All VMs — flat layer 2 | +| MetalLB pool | Reserved /28 within LAN | LoadBalancer services in Kubernetes | +| K8s service CIDR | `10.43.0.0/16` | In-cluster service IPs | +| K8s pod CIDR | `10.42.0.0/16` | Pod networking (Flannel) | +| WireGuard | `10.133.7.0/24` | VPN tunnel: cluster ↔ edge VPS | + +--- + +## Traffic Flows + +### Public services (Cloudflare tunnel) + +``` +User → Cloudflare (CDN + DDoS) → Cloudflared pod (×2, in-cluster) → Traefik → Service +``` + +Cloudflare acts as both CDN and the TLS termination point for public services. No ports are forwarded on the home router. + +### VPS-proxied services (Pangolin tunnel) + +``` +User → Edge VPS → Traefik (VPS) → Pangolin server → Newt client (in-cluster) → Traefik → Service +``` + +Used for services that need HTTP(S) proxying without Cloudflare in front. + +### Remote admin (WireGuard VPN) + +``` +Admin → WireGuard client → Edge VPS (WireGuard server) + → wg-gateway pod (10.133.7.4) + → K8s service CIDR (10.43.0.0/16) +``` + +The `mii-wireguard` pod acts as the WireGuard client inside the cluster. It masquerades the K8s service CIDR so all cluster services are reachable over the VPN — no split-DNS required. + +### Gitea → ArgoCD webhook + +``` +Gitea (docker-host11) → push webhook → ArgoCD (in-cluster) → reconcile manifests +``` + +ArgoCD polls on a schedule and also receives webhooks from the self-hosted Gitea instance on git push. + +### ArgoCD Image Updater → Gitea + +``` +Image Updater detects new tag in registry + → commits updated annotation to Gitea repo + → ArgoCD detects commit → re-syncs Deployment +``` + +Keeps image versions in Git without a human in the loop. 
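+The write-back loop above is driven entirely by annotations on the ArgoCD `Application`. A minimal sketch (the application name, registry, and repo URL here are placeholders, not values from this cluster):
+
+```yaml
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: my-app                                   # placeholder
+  namespace: argocd
+  annotations:
+    # Track tags for this image in the registry
+    argocd-image-updater.argoproj.io/image-list: app=registry.example.com/my-app
+    # Follow semantic-version tags only
+    argocd-image-updater.argoproj.io/app.update-strategy: semver
+    # Commit the new tag to Git rather than patching live cluster state
+    argocd-image-updater.argoproj.io/write-back-method: git
+spec:
+  project: default
+  source:
+    repoURL: https://git.example.com/homelab/my-app.git  # placeholder Gitea URL
+    path: deploy
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: my-app
+```
+
+With `write-back-method: git`, the updated tag lands as an ordinary commit in Gitea, so the webhook/poll loop described above picks it up like any other change.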
+ +### Media stack + +``` +Prowlarr (indexer aggregator) + → Sonarr / Radarr (request management) + → qBittorrent + Gluetun sidecar (download over ProtonVPN) + → Unpackarr (extract archives) + → NFS share on aya01 + → Jellyfin (on docker-host11, hardware transcoding via Intel QuickSync) +``` + +--- + +## Certificate Management + +Cert-Manager handles all TLS automatically via **Let's Encrypt DNS-01** using the Cloudflare API. No HTTP-01 challenges — DNS-01 works for internal-only domains and wildcard certs. + +The edge VPS (Traefik) uses Netcup DNS API for its own certs. + +--- + +## Service Mesh + +Istio runs in **Ambient mode** (no sidecars). The `ztunnel` DaemonSet runs on every node and handles transparent L4 proxying for all pods in the mesh. Waypoint proxies (L7) are not yet deployed. diff --git a/docs/observability.md b/docs/observability.md new file mode 100644 index 0000000..c094553 --- /dev/null +++ b/docs/observability.md @@ -0,0 +1,45 @@ +# Observability + +Two parallel stacks cover metrics and logs. + +--- + +## Metrics — Prometheus + Grafana + +Deployed via the **kube-prometheus-stack** Helm chart (ArgoCD-managed), running in the `prometheus` namespace. + +- **Prometheus** scrapes all nodes, pods, and K8s control plane components +- **Grafana** dashboards: cluster overview, node resource usage, Longhorn, ArgoCD, Traefik +- **Alertmanager** routes alerts to Ntfy (self-hosted push notifications) via a custom webhook bridge +- **Node Exporter** runs on all VMs including docker-host11 and the edge VPS (Ansible-deployed) +- **Goldilocks + VPA** analyse actual resource usage and recommend request/limit values + +--- + +## Logs + Fleet — Elastic Stack (ECK) + +Deployed via the **ECK operator** (Elastic Cloud on Kubernetes), running in the `elastic-system` namespace. 
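+As a sketch, the single-node deployment described below can be declared with an `Elasticsearch` resource along these lines (the name, version, and memory figures are illustrative, sized for the 15 Gi heap):
+
+```yaml
+apiVersion: elasticsearch.k8s.elastic.co/v1
+kind: Elasticsearch
+metadata:
+  name: logging                        # placeholder name
+  namespace: elastic-system
+spec:
+  version: 8.14.0                      # illustrative version
+  nodeSets:
+    - name: default
+      count: 1                         # single-node cluster
+      podTemplate:
+        spec:
+          containers:
+            - name: elasticsearch
+              env:
+                - name: ES_JAVA_OPTS
+                  value: "-Xms15g -Xmx15g"   # 15 Gi heap; keep well under container memory
+```
+
+The ECK operator turns this declaration into the StatefulSet, Services, certificates, and credentials that make up the cluster.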
+ +| Component | Purpose | +|-----------|---------| +| Elasticsearch | Log storage and search (single-node, 15 Gi heap) | +| Kibana | Log exploration and dashboards | +| Fleet Server | Manages Elastic Agent enrollment and policies | +| Elastic Agent (DaemonSet) | Ships logs and metrics from every cluster node | +| Elastic Agent (standalone) | Runs on docker-host11 and the edge VPS | + +The Elastic Agent DaemonSet tolerates the control-plane `NoSchedule` taint so logs are collected from server nodes as well as agents. + +Alerts from Elasticsearch rules are bridged to Ntfy via a small CronJob (`elastic-ntfy-bridge`) that polls the Elasticsearch alerts API and forwards new alerts as push notifications. + +--- + +## Alerting Flow + +``` +Prometheus Alertmanager ──► Ntfy (push notification) + ▲ +Elasticsearch alert rule ──► elastic-ntfy-bridge CronJob ─┘ +``` + +All alerts land in the same Ntfy topic, accessible on mobile and desktop. diff --git a/docs/storage.md b/docs/storage.md new file mode 100644 index 0000000..c26ca55 --- /dev/null +++ b/docs/storage.md @@ -0,0 +1,59 @@ +# Storage + +## Overview + +Three storage tiers serve different workloads: + +| Tier | System | Access | Used by | +|------|--------|--------|---------| +| Distributed block | Longhorn | RWO + RWX | All stateful K8s workloads | +| Relational | CloudNativePG | In-cluster Postgres | Immich | +| Network file | NFS (bare-metal) | NFS mount | Jellyfin media library | + +--- + +## Longhorn + +Longhorn provides distributed block storage across all 14 agent nodes. Each volume is replicated (default: 3 replicas) across different nodes. + +- **RWO** (ReadWriteOnce) — used for most services (Vaultwarden, Paperless, etc.) 
+- **RWX** (ReadWriteMany) — used where multiple pods need shared access +- Volumes are backed by the local disk on each agent node (128 GB each) +- Longhorn manager runs as a DaemonSet; the CSI plugin integrates with the K8s storage layer +- Snapshots and backups are supported via the Longhorn UI + +Control plane nodes (`k3s-server-*`) are tainted `NoSchedule` — Longhorn manager tolerates this taint and runs everywhere, but user workloads are pushed to agent nodes only. + +--- + +## CloudNativePG + +The CNPG operator manages HA PostgreSQL clusters as first-class Kubernetes resources. Currently used by: + +- **Immich** — primary database (photos, albums, users, ML embeddings) + +CNPG handles streaming replication, failover, and scheduled backups. Data is stored on Longhorn PVCs. + +--- + +## NFS + +A dedicated physical node (`aya01`) runs a bare-metal NFS server. This serves the media library to Jellyfin. + +- Movies, TV shows, and music live on `aya01` +- `docker-host11` (where Jellyfin runs) mounts the NFS share +- Separating media storage from the compute host means the Jellyfin VM can be rebuilt without touching the library +- NFS is not used for K8s workloads — Longhorn handles all PVC-backed storage + +--- + +## Secret Storage + +Kubernetes secrets are managed with **Sealed Secrets** (Bitnami). The workflow: + +1. Create a regular K8s `Secret` +2. Encrypt it with `kubeseal` using the cluster's public key → produces a `SealedSecret` +3. Commit the `SealedSecret` to Git — it is safe to store publicly +4. The in-cluster Sealed Secrets controller decrypts it into a regular `Secret` at apply time + +Ansible secrets (VM credentials, API tokens) are encrypted with **Ansible Vault** and stored in `vars/group_vars/*/secrets_*.yaml`.