add networking, storage, and observability docs

Tuan-Dat Tran
2026-04-28 08:29:48 +02:00
parent 8b75546305
commit 4563ef83f1
3 changed files with 188 additions and 0 deletions

docs/networking.md Normal file

@@ -0,0 +1,84 @@
# Networking
## IP Layout
| Segment | Range | Purpose |
|---------|-------|---------|
| LAN | `192.168.20.0/24` | All VMs — flat layer 2 |
| MetalLB pool | Reserved /28 within LAN | LoadBalancer services in Kubernetes |
| K8s service CIDR | `10.43.0.0/16` | In-cluster service IPs |
| K8s pod CIDR | `10.42.0.0/16` | Pod networking (Flannel) |
| WireGuard | `10.133.7.0/24` | VPN tunnel: cluster ↔ edge VPS |
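As an illustration, MetalLB's pool and L2 announcement for that reserved range could be declared like this (the exact /28 shown is a hypothetical placeholder; the real range is not documented here):
```yaml
# Hypothetical MetalLB address pool; the real /28 is not listed in this doc.
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: lan-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.20.240/28   # example /28 carved out of the LAN
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: lan-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - lan-pool
```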
---
## Traffic Flows
### Public services (Cloudflare tunnel)
```
User → Cloudflare (CDN + DDoS) → Cloudflared pod (×2, in-cluster) → Traefik → Service
```
Cloudflare acts as both CDN and the TLS termination point for public services. No ports are forwarded on the home router.
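For reference, a cloudflared configuration of roughly this shape maps public hostnames onto the in-cluster Traefik service (tunnel ID, hostname, and service address are hypothetical):
```yaml
# Hypothetical cloudflared config; IDs and hostnames are placeholders.
tunnel: 00000000-0000-0000-0000-000000000000
credentials-file: /etc/cloudflared/creds/credentials.json
ingress:
  - hostname: "*.example.com"
    service: https://traefik.kube-system.svc.cluster.local:443
    originRequest:
      noTLSVerify: true       # Traefik presents an internal certificate
  - service: http_status:404  # mandatory catch-all rule
```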
### VPS-proxied services (Pangolin tunnel)
```
User → Edge VPS → Traefik (VPS) → Pangolin server → Newt client (in-cluster) → Traefik → Service
```
Used for services that need HTTP(S) proxying without Cloudflare in front.
### Remote admin (WireGuard VPN)
```
Admin → WireGuard client → Edge VPS (WireGuard server)
→ wg-gateway pod (10.133.7.4)
→ K8s service CIDR (10.43.0.0/16)
```
The `mii-wireguard` pod is the wg-gateway shown above: it runs the WireGuard client inside the cluster and masquerades traffic destined for the K8s service CIDR, so every cluster service is reachable over the VPN with no split-DNS required.
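A sketch of the client-side config such a gateway pod could mount, including the masquerade rule that makes the service CIDR reachable (keys, endpoint, and egress interface are assumptions):
```yaml
# Hypothetical wg0.conf for the in-cluster gateway, mounted from a Secret.
apiVersion: v1
kind: Secret
metadata:
  name: wg-gateway-conf
stringData:
  wg0.conf: |
    [Interface]
    Address = 10.133.7.4/24
    PrivateKey = <client-private-key>
    # SNAT VPN traffic bound for cluster services to the pod's own IP,
    # so replies route back through this pod without extra node routes.
    PostUp   = iptables -t nat -A POSTROUTING -d 10.43.0.0/16 -o eth0 -j MASQUERADE
    PostDown = iptables -t nat -D POSTROUTING -d 10.43.0.0/16 -o eth0 -j MASQUERADE

    [Peer]
    # Edge VPS (WireGuard server); admin traffic arrives from here.
    PublicKey = <vps-public-key>
    Endpoint = vps.example.com:51820
    AllowedIPs = 10.133.7.0/24
    PersistentKeepalive = 25
```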
### Gitea → ArgoCD webhook
```
Gitea (docker-host11) → push webhook → ArgoCD (in-cluster) → reconcile manifests
```
ArgoCD polls on a schedule and also receives webhooks from the self-hosted Gitea instance on git push.
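ArgoCD authenticates webhooks against shared secrets stored in `argocd-secret`; since Gitea emits Gogs-compatible payloads, a setup along these lines is plausible (the key choice and value are assumptions):
```yaml
# Hypothetical: Gitea speaks the Gogs webhook format, so the shared
# secret would go under the gogs key in argocd-secret.
apiVersion: v1
kind: Secret
metadata:
  name: argocd-secret
  namespace: argocd
stringData:
  webhook.gogs.secret: <shared-webhook-secret>
```
The webhook on the Gitea side would then target ArgoCD's `/api/webhook` endpoint.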
### ArgoCD Image Updater → Gitea
```
Image Updater detects new tag in registry
→ commits updated annotation to Gitea repo
→ ArgoCD detects commit → re-syncs Deployment
```
Keeps image versions in Git without a human in the loop.
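The mechanism is annotation-driven; an Application annotated roughly like this (image name and alias are placeholders) gets its tags tracked and written back to Git:
```yaml
# Hypothetical annotations; image name and registry are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
  annotations:
    argocd-image-updater.argoproj.io/image-list: app=registry.example.com/my-app
    argocd-image-updater.argoproj.io/app.update-strategy: semver
    argocd-image-updater.argoproj.io/write-back-method: git
# spec (source, destination, sync policy) omitted for brevity
```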
### Media stack
```
Prowlarr (indexer aggregator)
→ Sonarr / Radarr (TV / movie library automation)
→ qBittorrent + Gluetun sidecar (download over ProtonVPN)
→ Unpackerr (extract archives)
→ NFS share on aya01
→ Jellyfin (on docker-host11, hardware transcoding via Intel QuickSync)
```
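To illustrate the download leg: containers in a pod share one network namespace, so pairing qBittorrent with a Gluetun sidecar forces its traffic through ProtonVPN. A rough sketch (images, ports, and the credentials Secret are assumptions):
```yaml
# Hypothetical pod: qBittorrent egresses through the Gluetun VPN tunnel.
apiVersion: v1
kind: Pod
metadata:
  name: qbittorrent
spec:
  containers:
    - name: gluetun
      image: qmcgaw/gluetun
      securityContext:
        capabilities:
          add: ["NET_ADMIN"]   # required to create the tunnel interface
      env:
        - name: VPN_SERVICE_PROVIDER
          value: protonvpn
        - name: VPN_TYPE
          value: wireguard
        - name: WIREGUARD_PRIVATE_KEY
          valueFrom:
            secretKeyRef:
              name: protonvpn        # hypothetical Secret
              key: private-key
    - name: qbittorrent
      image: linuxserver/qbittorrent
      ports:
        - containerPort: 8080        # web UI
```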
---
## Certificate Management
Cert-Manager handles all TLS automatically via **Let's Encrypt DNS-01** using the Cloudflare API. No HTTP-01 challenges — DNS-01 works for internal-only domains and wildcard certs.
The edge VPS (Traefik) uses Netcup DNS API for its own certs.
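A ClusterIssuer for the Cloudflare DNS-01 flow has this well-known shape (email and Secret names are placeholders):
```yaml
# DNS-01 issuer via the Cloudflare API; names are placeholders.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-dns01
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com
    privateKeySecretRef:
      name: letsencrypt-account-key
    solvers:
      - dns01:
          cloudflare:
            apiTokenSecretRef:
              name: cloudflare-api-token
              key: api-token
```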
---
## Service Mesh
Istio runs in **Ambient mode** (no sidecars). The `ztunnel` DaemonSet runs on every node and handles transparent L4 proxying for all pods in the mesh. Waypoint proxies (L7) are not yet deployed.
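Workloads join the ambient mesh per namespace via a label rather than sidecar injection; for example (namespace name hypothetical):
```yaml
# Enrolls every pod in the namespace into the ambient data plane.
apiVersion: v1
kind: Namespace
metadata:
  name: media
  labels:
    istio.io/dataplane-mode: ambient
```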

docs/observability.md Normal file

@@ -0,0 +1,45 @@
# Observability
Two parallel stacks cover metrics and logs.
---
## Metrics — Prometheus + Grafana
Deployed via the **kube-prometheus-stack** Helm chart (ArgoCD-managed), running in the `prometheus` namespace.
- **Prometheus** scrapes all nodes, pods, and K8s control plane components
- **Grafana** dashboards: cluster overview, node resource usage, Longhorn, ArgoCD, Traefik
- **Alertmanager** routes alerts to Ntfy (self-hosted push notifications) via a custom webhook bridge (see the sketch after this list)
- **Node Exporter** runs on all VMs including docker-host11 and the edge VPS (Ansible-deployed)
- **Goldilocks + VPA** analyse actual resource usage and recommend request/limit values
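A minimal sketch of the Alertmanager side of that bridge, assuming it exposes an in-cluster HTTP endpoint (service name, port, and path are hypothetical):
```yaml
# Hypothetical Alertmanager config fragment; the bridge URL is a placeholder.
route:
  receiver: ntfy
receivers:
  - name: ntfy
    webhook_configs:
      - url: http://alertmanager-ntfy-bridge.prometheus.svc:8080/alerts
        send_resolved: true
```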
---
## Logs + Fleet — Elastic Stack (ECK)
Deployed via the **ECK operator** (Elastic Cloud on Kubernetes), running in the `elastic-system` namespace.
| Component | Purpose |
|-----------|---------|
| Elasticsearch | Log storage and search (single-node, 15 Gi heap) |
| Kibana | Log exploration and dashboards |
| Fleet Server | Manages Elastic Agent enrollment and policies |
| Elastic Agent (DaemonSet) | Ships logs and metrics from every cluster node |
| Elastic Agent (standalone) | Runs on docker-host11 and the edge VPS |
The Elastic Agent DaemonSet tolerates the control-plane `NoSchedule` taint so logs are collected from server nodes as well as agents.
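Assuming the usual k3s control-plane taint key, the toleration in the DaemonSet's pod spec looks like this:
```yaml
# Lets the Elastic Agent DaemonSet schedule onto the k3s server nodes.
tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
```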
Alerts from Elasticsearch rules are bridged to Ntfy via a small CronJob (`elastic-ntfy-bridge`) that polls the Elasticsearch alerts API and forwards new alerts as push notifications.
---
## Alerting Flow
```
Prometheus Alertmanager ──────────────────────────────────► Ntfy (push notification)
Elasticsearch alert rule ──► elastic-ntfy-bridge CronJob ──┘
```
All alerts land in the same Ntfy topic, accessible on mobile and desktop.

docs/storage.md Normal file

@@ -0,0 +1,59 @@
# Storage
## Overview
Three storage tiers serve different workloads:
| Tier | System | Access | Used by |
|------|--------|--------|---------|
| Distributed block | Longhorn | RWO + RWX | All stateful K8s workloads |
| Relational | CloudNativePG | In-cluster Postgres | Immich |
| Network file | NFS (bare-metal) | NFS mount | Jellyfin media library |
---
## Longhorn
Longhorn provides distributed block storage across all 14 agent nodes. Each volume is replicated (default: 3 replicas) across different nodes.
- **RWO** (ReadWriteOnce) — used for most services (Vaultwarden, Paperless, etc.)
- **RWX** (ReadWriteMany) — used where multiple pods need shared access
- Volumes are backed by the local disk on each agent node (128 GB each)
- Longhorn manager runs as a DaemonSet; the CSI plugin integrates with the K8s storage layer
- Snapshots and backups are supported via the Longhorn UI
Control plane nodes (`k3s-server-*`) are tainted `NoSchedule` — Longhorn manager tolerates this taint and runs everywhere, but user workloads are pushed to agent nodes only.
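For reference, a Longhorn StorageClass matching the description above looks like this (the parameters shown are common defaults, not confirmed from this cluster):
```yaml
# Longhorn-provisioned block storage with three replicas per volume.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "3"
  fsType: "ext4"
```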
---
## CloudNativePG
The CNPG operator manages HA PostgreSQL clusters as first-class Kubernetes resources. Currently used by:
- **Immich** — primary database (photos, albums, users, ML embeddings)
CNPG handles streaming replication, failover, and scheduled backups. Data is stored on Longhorn PVCs.
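A minimal sketch of such a Cluster resource (name, size, and replica count are assumptions):
```yaml
# Hypothetical CNPG cluster for Immich; sizes and names are placeholders.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: immich-db
spec:
  instances: 3              # one primary plus streaming replicas
  storage:
    size: 20Gi
    storageClass: longhorn  # PVCs land on Longhorn, as noted above
```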
---
## NFS
A dedicated physical node (`aya01`) runs a bare-metal NFS server. This serves the media library to Jellyfin.
- Movies, TV shows, and music live on `aya01`
- `docker-host11` (where Jellyfin runs) mounts the NFS share
- Separating media storage from the compute host means the Jellyfin VM can be rebuilt without touching the library
- NFS is not used for K8s workloads — Longhorn handles all PVC-backed storage
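Since the VMs are Ansible-managed, the mount on `docker-host11` could be expressed as a task like this (export path and mount point are assumptions):
```yaml
# Hypothetical Ansible task mounting the aya01 media export.
- name: Mount media library from aya01
  ansible.posix.mount:
    src: aya01:/export/media
    path: /mnt/media
    fstype: nfs
    opts: ro,noatime
    state: mounted
```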
---
## Secret Storage
Kubernetes secrets are managed with **Sealed Secrets** (Bitnami). The workflow:
1. Create a regular K8s `Secret`
2. Encrypt it with `kubeseal` using the cluster's public key → produces a `SealedSecret`
3. Commit the `SealedSecret` to Git — it is safe to store publicly
4. The in-cluster Sealed Secrets controller decrypts it into a regular `Secret` at apply time
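The artifact committed in step 3 looks roughly like this (name, namespace, and key are hypothetical; the ciphertext is truncated):
```yaml
# Only the in-cluster controller holds the private key to decrypt this.
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: vaultwarden-admin
  namespace: vaultwarden
spec:
  encryptedData:
    ADMIN_TOKEN: AgB3...   # asymmetric ciphertext, safe to publish
  template:
    metadata:
      name: vaultwarden-admin
```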
Ansible secrets (VM credentials, API tokens) are encrypted with **Ansible Vault** and stored in `vars/group_vars/*/secrets_*.yaml`.