diff --git a/docs/networking.md b/docs/networking.md index 94d3484..391dc00 100644 --- a/docs/networking.md +++ b/docs/networking.md @@ -1,6 +1,6 @@ # Networking -## IP Layout +## IP layout | Segment | Range | Purpose | |---------|-------|---------| @@ -12,7 +12,7 @@ --- -## Traffic Flows +## Traffic flows ### Public services (Cloudflare tunnel) @@ -20,7 +20,7 @@ User → Cloudflare (CDN + DDoS) → Cloudflared pod (×2, in-cluster) → Traefik → Service ``` -Cloudflare acts as both CDN and the TLS termination point for public services. No ports are forwarded on the home router. +Cloudflare handles CDN and TLS termination. No ports are forwarded on the home router. ### VPS-proxied services (Pangolin tunnel) @@ -38,7 +38,7 @@ Admin → WireGuard client → Edge VPS (WireGuard server) → K8s service CIDR (10.43.0.0/16) ``` -The `mii-wireguard` pod acts as the WireGuard client inside the cluster. It masquerades the K8s service CIDR so all cluster services are reachable over the VPN — no split-DNS required. +The `mii-wireguard` pod is the WireGuard client inside the cluster. It masquerades the K8s service CIDR so all cluster services are reachable over the VPN without split-DNS. ### Gitea → ArgoCD webhook @@ -46,7 +46,7 @@ The `mii-wireguard` pod acts as the WireGuard client inside the cluster. It masq Gitea (docker-host11) → push webhook → ArgoCD (in-cluster) → reconcile manifests ``` -ArgoCD polls on a schedule and also receives webhooks from the self-hosted Gitea instance on git push. +ArgoCD polls on a schedule and also receives webhooks on git push. ### ArgoCD Image Updater → Gitea @@ -63,7 +63,7 @@ Keeps image versions in Git without a human in the loop. ``` Prowlarr (indexer aggregator) → Sonarr / Radarr (request management) - → qBittorrent + Gluetun sidecar (download over ProtonVPN) + → qBittorrent + Gluetun sidecar (VPN-isolated download) → Unpackarr (extract archives) → NFS share on aya01 → Jellyfin (on docker-host11, hardware transcoding via Intel QuickSync) @@ -71,14 +71,14 @@ Prowlarr (indexer aggregator) --- -## Certificate Management +## Certificate management -Cert-Manager handles all TLS automatically via **Let's Encrypt DNS-01** using the Cloudflare API. No HTTP-01 challenges — DNS-01 works for internal-only domains and wildcard certs. +Cert-Manager handles all TLS automatically via Let's Encrypt DNS-01 using the Cloudflare API. DNS-01 works for internal-only domains and wildcard certs without exposing any HTTP endpoint. -The edge VPS (Traefik) uses Netcup DNS API for its own certs. +The edge VPS uses the Netcup DNS API for its own certs. --- -## Service Mesh +## Service mesh -Istio runs in **Ambient mode** (no sidecars). The `ztunnel` DaemonSet runs on every node and handles transparent L4 proxying for all pods in the mesh. Waypoint proxies (L7) are not yet deployed. +Istio runs in Ambient mode — no sidecars. The `ztunnel` DaemonSet runs on every node and handles transparent L4 proxying for all pods in the mesh. Waypoint proxies (L7) are not yet deployed. diff --git a/docs/observability.md b/docs/observability.md index c094553..542b65f 100644 --- a/docs/observability.md +++ b/docs/observability.md @@ -1,24 +1,24 @@ # Observability -Two parallel stacks cover metrics and logs. +Two parallel stacks: Prometheus for metrics, Elastic for logs. --- -## Metrics — Prometheus + Grafana +## Metrics -Deployed via the **kube-prometheus-stack** Helm chart (ArgoCD-managed), running in the `prometheus` namespace. +kube-prometheus-stack runs in the `prometheus` namespace (ArgoCD-managed). 
Prometheus scrapes all nodes, pods, and control plane components. Grafana has dashboards for cluster overview, node resources, Longhorn, ArgoCD, and Traefik. -- **Prometheus** scrapes all nodes, pods, and K8s control plane components -- **Grafana** dashboards: cluster overview, node resource usage, Longhorn, ArgoCD, Traefik -- **Alertmanager** routes alerts to Ntfy (self-hosted push notifications) via a custom webhook bridge -- **Node Exporter** runs on all VMs including docker-host11 and the edge VPS (Ansible-deployed) -- **Goldilocks + VPA** analyse actual resource usage and recommend request/limit values +Node Exporter is deployed via Ansible on every VM including `docker-host11` and the edge VPS, so coverage isn't limited to what's inside Kubernetes. + +Goldilocks and VPA run alongside and analyze actual resource usage to suggest better request/limit values. + +Alertmanager routes alerts to Ntfy via a custom webhook bridge. --- -## Logs + Fleet — Elastic Stack (ECK) +## Logs and fleet management -Deployed via the **ECK operator** (Elastic Cloud on Kubernetes), running in the `elastic-system` namespace. +The ECK operator (Elastic Cloud on Kubernetes) manages the Elastic stack in the `elastic-system` namespace: | Component | Purpose | |-----------|---------| @@ -28,13 +28,13 @@ Deployed via the **ECK operator** (Elastic Cloud on Kubernetes), running in the | Elastic Agent (DaemonSet) | Ships logs and metrics from every cluster node | | Elastic Agent (standalone) | Runs on docker-host11 and the edge VPS | -The Elastic Agent DaemonSet tolerates the control-plane `NoSchedule` taint so logs are collected from server nodes as well as agents. +The DaemonSet tolerates the control-plane `NoSchedule` taint so server nodes are covered too. -Alerts from Elasticsearch rules are bridged to Ntfy via a small CronJob (`elastic-ntfy-bridge`) that polls the Elasticsearch alerts API and forwards new alerts as push notifications. +Elastic alert rules are bridged to Ntfy via `elastic-ntfy-bridge`, a small CronJob that polls the Elasticsearch alerts API and forwards new alerts as push notifications. --- -## Alerting Flow +## Alerting flow ``` Prometheus Alertmanager ──► Ntfy (push notification) @@ -42,4 +42,4 @@ Prometheus Alertmanager ──► Ntfy (push notification) Elasticsearch alert rule ──► elastic-ntfy-bridge CronJob ─┘ ``` -All alerts land in the same Ntfy topic, accessible on mobile and desktop. +Both sources land in the same Ntfy topic. diff --git a/docs/storage.md b/docs/storage.md index c26ca55..60891df 100644 --- a/docs/storage.md +++ b/docs/storage.md @@ -2,7 +2,7 @@ ## Overview -Three storage tiers serve different workloads: +Three storage tiers, each doing a different job: | Tier | System | Access | Used by | |------|--------|--------|---------| @@ -14,46 +14,30 @@ Three storage tiers serve different workloads: ## Longhorn -Longhorn provides distributed block storage across all 14 agent nodes. Each volume is replicated (default: 3 replicas) across different nodes. +Longhorn gives distributed block storage across all 14 agent nodes. Each volume is replicated (default: 3 replicas) across different nodes, using the local disk on each agent (128 GB each). -- **RWO** (ReadWriteOnce) — used for most services (Vaultwarden, Paperless, etc.) 
-- **RWX** (ReadWriteMany) — used where multiple pods need shared access
-- Volumes are backed by the local disk on each agent node (128 GB each)
-- Longhorn manager runs as a DaemonSet; the CSI plugin integrates with the K8s storage layer
-- Snapshots and backups are supported via the Longhorn UI
+RWO (ReadWriteOnce) covers most services. RWX (ReadWriteMany) is used where multiple pods need access to the same volume. Snapshots and backups are available through the Longhorn UI.
 
-Control plane nodes (`k3s-server-*`) are tainted `NoSchedule` — Longhorn manager tolerates this taint and runs everywhere, but user workloads are pushed to agent nodes only.
+Control plane nodes are tainted `NoSchedule` — Longhorn manager tolerates this and runs everywhere, but user workloads stay on agent nodes.
 
 ---
 
 ## CloudNativePG
 
-The CNPG operator manages HA PostgreSQL clusters as first-class Kubernetes resources. Currently used by:
-
-- **Immich** — primary database (photos, albums, users, ML embeddings)
-
-CNPG handles streaming replication, failover, and scheduled backups. Data is stored on Longhorn PVCs.
+CloudNativePG manages HA PostgreSQL clusters as Kubernetes resources. Immich uses it for its primary database (photos, albums, users, ML embeddings). CNPG handles streaming replication, failover, and scheduled backups, with data stored on Longhorn PVCs.
 
 ---
 
 ## NFS
 
-A dedicated physical node (`aya01`) runs a bare-metal NFS server. This serves the media library to Jellyfin.
+`aya01` is a dedicated bare-metal NFS server. Jellyfin, which runs on `docker-host11`, mounts the share from `aya01` to access movies, TV shows, and music. Keeping the media library on a separate host means the Jellyfin VM can be rebuilt without touching the data.
 
-- Movies, TV shows, and music live on `aya01`
-- `docker-host11` (where Jellyfin runs) mounts the NFS share
-- Separating media storage from the compute host means the Jellyfin VM can be rebuilt without touching the library
-- NFS is not used for K8s workloads — Longhorn handles all PVC-backed storage
+NFS is not used for K8s workloads — Longhorn handles all PVC-backed storage.
 
 ---
 
-## Secret Storage
+## Secrets
 
-Kubernetes secrets are managed with **Sealed Secrets** (Bitnami). The workflow:
+Kubernetes secrets go through Sealed Secrets (Bitnami). The workflow: create a regular `Secret`, encrypt it with `kubeseal` using the cluster's public key into a `SealedSecret`, then commit that to Git. Only the in-cluster controller can decrypt it.
 
-1. Create a regular K8s `Secret`
-2. Encrypt it with `kubeseal` using the cluster's public key → produces a `SealedSecret`
-3. Commit the `SealedSecret` to Git — it is safe to store publicly
-4. The in-cluster Sealed Secrets controller decrypts it into a regular `Secret` at apply time
-
-Ansible secrets (VM credentials, API tokens) are encrypted with **Ansible Vault** and stored in `vars/group_vars/*/secrets_*.yaml`.
+Ansible secrets (VM credentials, API tokens) are encrypted with Ansible Vault and live in `vars/group_vars/*/secrets_*.yaml`.
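
A few illustrative sketches for the setups described in these docs follow. First, the admin access path from networking.md (WireGuard into the K8s service CIDR): a minimal check from an admin machine, assuming the client interface is named `wg0` and using `10.43.0.10`, the default k3s cluster DNS address, as an example target; neither value is taken from the configs above.

```sh
# Bring the tunnel up and confirm the handshake with the edge VPS,
# then verify that the 10.43.0.0/16 service CIDR routes through it.
sudo wg-quick up wg0
sudo wg show wg0
ip route get 10.43.0.10    # should show the route going via the wg0 interface
# 10.43.0.10 is kube-dns on a default k3s install; any reachable service IP works.
dig @10.43.0.10 kubernetes.default.svc.cluster.local +short
```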
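
For the NFS tier in storage.md, a sketch of how the share could be mounted on `docker-host11`; the export path `/srv/media` and mount point `/mnt/media` are placeholder assumptions, not paths from the actual setup.

```sh
# One-off mount for testing, then a persistent /etc/fstab entry.
sudo mkdir -p /mnt/media
sudo mount -t nfs aya01:/srv/media /mnt/media
echo 'aya01:/srv/media  /mnt/media  nfs  defaults,_netdev  0 0' | sudo tee -a /etc/fstab
```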
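
And for the Sealed Secrets workflow in storage.md, a minimal sketch of the seal-and-commit steps; the secret name, namespace, and key are placeholders, and `kubeseal` may need controller flags that match the actual deployment.

```sh
# 1. Render a plain Secret locally (never applied to the cluster, never committed).
kubectl create secret generic example-credentials \
  --namespace example-app \
  --from-literal=password='changeme' \
  --dry-run=client -o yaml > secret.yaml

# 2. Encrypt it with the cluster's public key into a SealedSecret.
kubeseal --format yaml < secret.yaml > sealed-secret.yaml

# 3. Commit only the SealedSecret; the in-cluster controller decrypts it at apply time.
git add sealed-secret.yaml && rm secret.yaml
```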