add networking, storage, and observability docs

Tuan-Dat Tran
2026-04-28 08:29:48 +02:00
parent 8b75546305
commit 4563ef83f1
3 changed files with 188 additions and 0 deletions

docs/networking.md Normal file

@@ -0,0 +1,84 @@
# Networking
## IP Layout
| Segment | Range | Purpose |
|---------|-------|---------|
| LAN | `192.168.20.0/24` | All VMs — flat layer 2 |
| MetalLB pool | Reserved /28 within LAN | LoadBalancer services in Kubernetes |
| K8s service CIDR | `10.43.0.0/16` | In-cluster service IPs |
| K8s pod CIDR | `10.42.0.0/16` | Pod networking (Flannel) |
| WireGuard | `10.133.7.0/24` | VPN tunnel: cluster ↔ edge VPS |
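As an illustration, MetalLB's pool and L2 announcement for that reserved range could be declared like this (the exact /28 shown is a hypothetical placeholder; the real range is not documented here):
```yaml
# Hypothetical MetalLB address pool; the real /28 is not listed in this doc.
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: lan-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.20.240/28   # example /28 carved out of the LAN
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: lan-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - lan-pool
```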
---
## Traffic Flows
### Public services (Cloudflare tunnel)
```
User → Cloudflare (CDN + DDoS) → Cloudflared pod (×2, in-cluster) → Traefik → Service
```
Cloudflare acts as both CDN and the TLS termination point for public services. No ports are forwarded on the home router.
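For reference, a cloudflared configuration of roughly this shape maps public hostnames onto the in-cluster Traefik service (tunnel ID, hostname, and service address are hypothetical):
```yaml
# Hypothetical cloudflared config; IDs and hostnames are placeholders.
tunnel: 00000000-0000-0000-0000-000000000000
credentials-file: /etc/cloudflared/creds/credentials.json
ingress:
  - hostname: "*.example.com"
    service: https://traefik.kube-system.svc.cluster.local:443
    originRequest:
      noTLSVerify: true       # Traefik presents an internal certificate
  - service: http_status:404  # mandatory catch-all rule
```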
### VPS-proxied services (Pangolin tunnel)
```
User → Edge VPS → Traefik (VPS) → Pangolin server → Newt client (in-cluster) → Traefik → Service
```
Used for services that need HTTP(S) proxying without Cloudflare in front.
### Remote admin (WireGuard VPN)
```
Admin → WireGuard client → Edge VPS (WireGuard server)
→ wg-gateway pod (10.133.7.4)
→ K8s service CIDR (10.43.0.0/16)
```
The `mii-wireguard` pod is the wg-gateway shown above: it runs the WireGuard client inside the cluster and masquerades traffic destined for the K8s service CIDR, so every cluster service is reachable over the VPN with no split-DNS required.
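A sketch of the client-side config such a gateway pod could mount, including the masquerade rule that makes the service CIDR reachable (keys, endpoint, and egress interface are assumptions):
```yaml
# Hypothetical wg0.conf for the in-cluster gateway, mounted from a Secret.
apiVersion: v1
kind: Secret
metadata:
  name: wg-gateway-conf
stringData:
  wg0.conf: |
    [Interface]
    Address = 10.133.7.4/24
    PrivateKey = <client-private-key>
    # SNAT VPN traffic bound for cluster services to the pod's own IP,
    # so replies route back through this pod without extra node routes.
    PostUp   = iptables -t nat -A POSTROUTING -d 10.43.0.0/16 -o eth0 -j MASQUERADE
    PostDown = iptables -t nat -D POSTROUTING -d 10.43.0.0/16 -o eth0 -j MASQUERADE

    [Peer]
    # Edge VPS (WireGuard server); admin traffic arrives from here.
    PublicKey = <vps-public-key>
    Endpoint = vps.example.com:51820
    AllowedIPs = 10.133.7.0/24
    PersistentKeepalive = 25
```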
### Gitea → ArgoCD webhook
```
Gitea (docker-host11) → push webhook → ArgoCD (in-cluster) → reconcile manifests
```
ArgoCD polls on a schedule and also receives webhooks from the self-hosted Gitea instance on git push.
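ArgoCD authenticates webhooks against shared secrets stored in `argocd-secret`; since Gitea emits Gogs-compatible payloads, a setup along these lines is plausible (the key choice and value are assumptions):
```yaml
# Hypothetical: Gitea speaks the Gogs webhook format, so the shared
# secret would go under the gogs key in argocd-secret.
apiVersion: v1
kind: Secret
metadata:
  name: argocd-secret
  namespace: argocd
stringData:
  webhook.gogs.secret: <shared-webhook-secret>
```
The webhook on the Gitea side would then target ArgoCD's `/api/webhook` endpoint.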
### ArgoCD Image Updater → Gitea
```
Image Updater detects new tag in registry
→ commits updated annotation to Gitea repo
→ ArgoCD detects commit → re-syncs Deployment
```
Keeps image versions in Git without a human in the loop.
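The mechanism is annotation-driven; an Application annotated roughly like this (image name and alias are placeholders) gets its tags tracked and written back to Git:
```yaml
# Hypothetical annotations; image name and registry are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
  annotations:
    argocd-image-updater.argoproj.io/image-list: app=registry.example.com/my-app
    argocd-image-updater.argoproj.io/app.update-strategy: semver
    argocd-image-updater.argoproj.io/write-back-method: git
# spec (source, destination, sync policy) omitted for brevity
```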
### Media stack
```
Prowlarr (indexer aggregator)
→ Sonarr / Radarr (TV / movie library automation)
→ qBittorrent + Gluetun sidecar (download over ProtonVPN)
→ Unpackerr (extract archives)
→ NFS share on aya01
→ Jellyfin (on docker-host11, hardware transcoding via Intel QuickSync)
```
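To illustrate the download leg: containers in a pod share one network namespace, so pairing qBittorrent with a Gluetun sidecar forces its traffic through ProtonVPN. A rough sketch (images, ports, and the credentials Secret are assumptions):
```yaml
# Hypothetical pod: qBittorrent egresses through the Gluetun VPN tunnel.
apiVersion: v1
kind: Pod
metadata:
  name: qbittorrent
spec:
  containers:
    - name: gluetun
      image: qmcgaw/gluetun
      securityContext:
        capabilities:
          add: ["NET_ADMIN"]   # required to create the tunnel interface
      env:
        - name: VPN_SERVICE_PROVIDER
          value: protonvpn
        - name: VPN_TYPE
          value: wireguard
        - name: WIREGUARD_PRIVATE_KEY
          valueFrom:
            secretKeyRef:
              name: protonvpn        # hypothetical Secret
              key: private-key
    - name: qbittorrent
      image: linuxserver/qbittorrent
      ports:
        - containerPort: 8080        # web UI
```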
---
## Certificate Management
Cert-Manager handles all TLS automatically via **Let's Encrypt DNS-01** using the Cloudflare API. No HTTP-01 challenges — DNS-01 works for internal-only domains and wildcard certs.
The edge VPS (Traefik) uses Netcup DNS API for its own certs.
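A ClusterIssuer for the Cloudflare DNS-01 flow has this well-known shape (email and Secret names are placeholders):
```yaml
# DNS-01 issuer via the Cloudflare API; names are placeholders.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-dns01
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com
    privateKeySecretRef:
      name: letsencrypt-account-key
    solvers:
      - dns01:
          cloudflare:
            apiTokenSecretRef:
              name: cloudflare-api-token
              key: api-token
```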
---
## Service Mesh
Istio runs in **Ambient mode** (no sidecars). The `ztunnel` DaemonSet runs on every node and handles transparent L4 proxying for all pods in the mesh. Waypoint proxies (L7) are not yet deployed.
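Workloads join the ambient mesh per namespace via a label rather than sidecar injection; for example (namespace name hypothetical):
```yaml
# Enrolls every pod in the namespace into the ambient data plane.
apiVersion: v1
kind: Namespace
metadata:
  name: media
  labels:
    istio.io/dataplane-mode: ambient
```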

docs/observability.md Normal file

@@ -0,0 +1,45 @@
# Observability
Two parallel stacks cover metrics and logs.
---
## Metrics — Prometheus + Grafana
Deployed via the **kube-prometheus-stack** Helm chart (ArgoCD-managed), running in the `prometheus` namespace.
- **Prometheus** scrapes all nodes, pods, and K8s control plane components
- **Grafana** dashboards: cluster overview, node resource usage, Longhorn, ArgoCD, Traefik
- **Alertmanager** routes alerts to Ntfy (self-hosted push notifications) via a custom webhook bridge (see the sketch after this list)
- **Node Exporter** runs on all VMs including docker-host11 and the edge VPS (Ansible-deployed)
- **Goldilocks + VPA** analyse actual resource usage and recommend request/limit values
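A minimal sketch of the Alertmanager side of that bridge, assuming it exposes an in-cluster HTTP endpoint (service name, port, and path are hypothetical):
```yaml
# Hypothetical Alertmanager config fragment; the bridge URL is a placeholder.
route:
  receiver: ntfy
receivers:
  - name: ntfy
    webhook_configs:
      - url: http://alertmanager-ntfy-bridge.prometheus.svc:8080/alerts
        send_resolved: true
```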
---
## Logs + Fleet — Elastic Stack (ECK)
Deployed via the **ECK operator** (Elastic Cloud on Kubernetes), running in the `elastic-system` namespace.
| Component | Purpose |
|-----------|---------|
| Elasticsearch | Log storage and search (single-node, 15 Gi heap) |
| Kibana | Log exploration and dashboards |
| Fleet Server | Manages Elastic Agent enrollment and policies |
| Elastic Agent (DaemonSet) | Ships logs and metrics from every cluster node |
| Elastic Agent (standalone) | Runs on docker-host11 and the edge VPS |
The Elastic Agent DaemonSet tolerates the control-plane `NoSchedule` taint so logs are collected from server nodes as well as agents.
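Assuming the usual k3s control-plane taint key, the toleration in the DaemonSet's pod spec looks like this:
```yaml
# Lets the Elastic Agent DaemonSet schedule onto the k3s server nodes.
tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
```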
Alerts from Elasticsearch rules are bridged to Ntfy via a small CronJob (`elastic-ntfy-bridge`) that polls the Elasticsearch alerts API and forwards new alerts as push notifications.
---
## Alerting Flow
```
Prometheus Alertmanager ──────────────────────────────────► Ntfy (push notification)
Elasticsearch alert rule ──► elastic-ntfy-bridge CronJob ──┘
```
All alerts land in the same Ntfy topic, accessible on mobile and desktop.

docs/storage.md Normal file

@@ -0,0 +1,59 @@
# Storage
## Overview
Three storage tiers serve different workloads:
| Tier | System | Access | Used by |
|------|--------|--------|---------|
| Distributed block | Longhorn | RWO + RWX | All stateful K8s workloads |
| Relational | CloudNativePG | In-cluster Postgres | Immich |
| Network file | NFS (bare-metal) | NFS mount | Jellyfin media library |
---
## Longhorn
Longhorn provides distributed block storage across all 14 agent nodes. Each volume is replicated (default: 3 replicas) across different nodes.
- **RWO** (ReadWriteOnce) — used for most services (Vaultwarden, Paperless, etc.)
- **RWX** (ReadWriteMany) — used where multiple pods need shared access
- Volumes are backed by the local disk on each agent node (128 GB each)
- Longhorn manager runs as a DaemonSet; the CSI plugin integrates with the K8s storage layer
- Snapshots and backups are supported via the Longhorn UI
Control plane nodes (`k3s-server-*`) are tainted `NoSchedule` — Longhorn manager tolerates this taint and runs everywhere, but user workloads are pushed to agent nodes only.
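For reference, a Longhorn StorageClass matching the description above looks like this (the parameters shown are common defaults, not confirmed from this cluster):
```yaml
# Longhorn-provisioned block storage with three replicas per volume.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "3"
  fsType: "ext4"
```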
---
## CloudNativePG
The CNPG operator manages HA PostgreSQL clusters as first-class Kubernetes resources. Currently used by:
- **Immich** — primary database (photos, albums, users, ML embeddings)
CNPG handles streaming replication, failover, and scheduled backups. Data is stored on Longhorn PVCs.
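A minimal sketch of such a Cluster resource (name, size, and replica count are assumptions):
```yaml
# Hypothetical CNPG cluster for Immich; sizes and names are placeholders.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: immich-db
spec:
  instances: 3              # one primary plus streaming replicas
  storage:
    size: 20Gi
    storageClass: longhorn  # PVCs land on Longhorn, as noted above
```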
---
## NFS
A dedicated physical node (`aya01`) runs a bare-metal NFS server. This serves the media library to Jellyfin.
- Movies, TV shows, and music live on `aya01`
- `docker-host11` (where Jellyfin runs) mounts the NFS share
- Separating media storage from the compute host means the Jellyfin VM can be rebuilt without touching the library
- NFS is not used for K8s workloads — Longhorn handles all PVC-backed storage
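Since the VMs are Ansible-managed, the mount on `docker-host11` could be expressed as a task like this (export path and mount point are assumptions):
```yaml
# Hypothetical Ansible task mounting the aya01 media export.
- name: Mount media library from aya01
  ansible.posix.mount:
    src: aya01:/export/media
    path: /mnt/media
    fstype: nfs
    opts: ro,noatime
    state: mounted
```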
---
## Secret Storage
Kubernetes secrets are managed with **Sealed Secrets** (Bitnami). The workflow:
1. Create a regular K8s `Secret`
2. Encrypt it with `kubeseal` using the cluster's public key → produces a `SealedSecret`
3. Commit the `SealedSecret` to Git — it is safe to store publicly
4. The in-cluster Sealed Secrets controller decrypts it into a regular `Secret` at apply time
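The artifact committed in step 3 looks roughly like this (name, namespace, and key are hypothetical; the ciphertext is truncated):
```yaml
# Only the in-cluster controller holds the private key to decrypt this.
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: vaultwarden-admin
  namespace: vaultwarden
spec:
  encryptedData:
    ADMIN_TOKEN: AgB3...   # asymmetric ciphertext, safe to publish
  template:
    metadata:
      name: vaultwarden-admin
```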
Ansible secrets (VM credentials, API tokens) are encrypted with **Ansible Vault** and stored in `vars/group_vars/*/secrets_*.yaml`.