add networking, storage, and observability docs

docs/networking.md (new file, 84 lines)

# Networking

## IP Layout

| Segment | Range | Purpose |
|---------|-------|---------|
| LAN | `192.168.20.0/24` | All VMs — flat layer 2 |
| MetalLB pool | Reserved /28 within LAN | LoadBalancer services in Kubernetes |
| K8s service CIDR | `10.43.0.0/16` | In-cluster service IPs |
| K8s pod CIDR | `10.42.0.0/16` | Pod networking (Flannel) |
| WireGuard | `10.133.7.0/24` | VPN tunnel: cluster ↔ edge VPS |
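
A MetalLB pool like the one in the table is declared with an `IPAddressPool` plus an `L2Advertisement`. A minimal sketch — the pool name and the concrete /28 below are illustrative, since the actual reserved range is not stated here:

```yaml
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: lan-pool              # hypothetical name
  namespace: metallb-system
spec:
  addresses:
    - 192.168.20.240/28       # illustrative /28 inside the LAN; real range not stated above
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: lan-l2                # hypothetical name
  namespace: metallb-system
spec:
  ipAddressPools:
    - lan-pool                # announce this pool via layer-2 ARP on the flat LAN
```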

---

## Traffic Flows

### Public services (Cloudflare tunnel)

```
User → Cloudflare (CDN + DDoS) → Cloudflared pod (×2, in-cluster) → Traefik → Service
```

Cloudflare acts as both CDN and the TLS termination point for public services. No ports are forwarded on the home router.
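
The hand-off from cloudflared to Traefik is configured via the tunnel's ingress rules. A sketch, with hypothetical hostnames and a k3s-style Traefik service address:

```yaml
# cloudflared tunnel config (hostnames and paths illustrative)
tunnel: <tunnel-id>
credentials-file: /etc/cloudflared/creds.json
ingress:
  - hostname: app.example.com                  # hypothetical public hostname
    service: http://traefik.kube-system.svc:80 # forward into in-cluster Traefik
  - service: http_status:404                   # required catch-all rule
```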

### VPS-proxied services (Pangolin tunnel)

```
User → Edge VPS → Traefik (VPS) → Pangolin server → Newt client (in-cluster) → Traefik → Service
```

Used for services that need HTTP(S) proxying without Cloudflare in front.

### Remote admin (WireGuard VPN)

```
Admin → WireGuard client → Edge VPS (WireGuard server)
  → wg-gateway pod (10.133.7.4)
  → K8s service CIDR (10.43.0.0/16)
```

The `mii-wireguard` pod acts as the WireGuard client inside the cluster. It masquerades the K8s service CIDR so all cluster services are reachable over the VPN — no split-DNS required.
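
The masquerading described above can be sketched as a WireGuard client config inside the gateway pod. Keys, endpoint, and port are placeholders; only `10.133.7.4`, `10.133.7.0/24`, and `10.43.0.0/16` come from the text:

```ini
# Sketch of the in-cluster WireGuard client config (keys/endpoint illustrative)
[Interface]
Address = 10.133.7.4/24
PrivateKey = <redacted>
# SNAT traffic headed into the K8s service CIDR so replies route back
# through this pod — this is what makes split-DNS unnecessary
PostUp = iptables -t nat -A POSTROUTING -d 10.43.0.0/16 -j MASQUERADE
PostDown = iptables -t nat -D POSTROUTING -d 10.43.0.0/16 -j MASQUERADE

[Peer]
PublicKey = <edge VPS public key>
Endpoint = <vps-ip>:51820
AllowedIPs = 10.133.7.0/24
```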

### Gitea → ArgoCD webhook

```
Gitea (docker-host11) → push webhook → ArgoCD (in-cluster) → reconcile manifests
```

ArgoCD polls on a schedule and also receives webhooks from the self-hosted Gitea instance on git push.
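
ArgoCD exposes a single `/api/webhook` endpoint for push events. A sketch of the Gitea webhook settings (all values hypothetical; the shared-secret key name in `argocd-secret` depends on the ArgoCD version):

```yaml
# Gitea → repository → Settings → Webhooks (illustrative values)
target_url: https://argocd.example.com/api/webhook   # hypothetical ArgoCD URL
http_method: POST
content_type: application/json
secret: <shared secret, mirrored in argocd-secret>
trigger: push events only
```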

### ArgoCD Image Updater → Gitea

```
Image Updater detects new tag in registry
  → commits updated annotation to Gitea repo
  → ArgoCD detects commit → re-syncs Deployment
```

Keeps image versions in Git without a human in the loop.
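
Image Updater is driven by annotations on the ArgoCD `Application`. A sketch with a hypothetical image and alias:

```yaml
# Annotations on an ArgoCD Application (image name and alias illustrative)
metadata:
  annotations:
    argocd-image-updater.argoproj.io/image-list: app=registry.example.com/myapp
    argocd-image-updater.argoproj.io/app.update-strategy: semver
    argocd-image-updater.argoproj.io/write-back-method: git   # commit back to the Gitea repo
```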

### Media stack

```
Prowlarr (indexer aggregator)
  → Sonarr / Radarr (request management)
  → qBittorrent + Gluetun sidecar (download over ProtonVPN)
  → Unpackarr (extract archives)
  → NFS share on aya01
  → Jellyfin (on docker-host11, hardware transcoding via Intel QuickSync)
```

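
The qBittorrent + Gluetun pairing works because containers in one pod share a network namespace, so all download traffic egresses through the tunnel Gluetun sets up. A sketch of the pod spec (names, images, and secret wiring are illustrative):

```yaml
# Hypothetical sketch of the download pod (container names/images illustrative)
spec:
  containers:
    - name: gluetun
      image: qmcgaw/gluetun
      securityContext:
        capabilities:
          add: ["NET_ADMIN"]      # needed to create the tunnel interface
      env:
        - name: VPN_SERVICE_PROVIDER
          value: protonvpn
        - name: VPN_TYPE
          value: wireguard
        # WIREGUARD_PRIVATE_KEY would come from a Secret
    - name: qbittorrent
      image: linuxserver/qbittorrent
      # shares the pod network namespace → all traffic rides the VPN tunnel
```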
---

## Certificate Management
Cert-Manager handles all TLS automatically via **Let's Encrypt DNS-01** using the Cloudflare API. No HTTP-01 challenges — DNS-01 works for internal-only domains and wildcard certs.
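
The DNS-01 setup amounts to one `ClusterIssuer` with a Cloudflare solver. A sketch — email, names, and the token Secret are hypothetical:

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-dns01          # hypothetical name
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com       # hypothetical contact
    privateKeySecretRef:
      name: letsencrypt-account-key
    solvers:
      - dns01:
          cloudflare:
            apiTokenSecretRef:     # Cloudflare API token stored as a Secret
              name: cloudflare-api-token
              key: api-token
```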
The edge VPS (Traefik) uses Netcup DNS API for its own certs.
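
On the VPS side this is a Traefik certificates resolver using the `netcup` DNS-challenge provider (credentials are supplied to Traefik via Netcup environment variables). A sketch with hypothetical email and storage path:

```yaml
# Traefik static configuration on the edge VPS (values illustrative)
certificatesResolvers:
  netcup:
    acme:
      email: admin@example.com        # hypothetical contact
      storage: /etc/traefik/acme.json # hypothetical path
      dnsChallenge:
        provider: netcup
```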

---

## Service Mesh

Istio runs in **Ambient mode** (no sidecars). The `ztunnel` DaemonSet runs on every node and handles transparent L4 proxying for all pods in the mesh. Waypoint proxies (L7) are not yet deployed.
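
In Ambient mode, workloads join the mesh per namespace via a label rather than sidecar injection. A sketch with a hypothetical namespace:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: my-app                          # hypothetical namespace
  labels:
    istio.io/dataplane-mode: ambient    # ztunnel now proxies this namespace at L4
```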

---

docs/observability.md (new file, 45 lines)

# Observability
Two parallel stacks cover metrics and logs.

---

## Metrics — Prometheus + Grafana

Deployed via the **kube-prometheus-stack** Helm chart (ArgoCD-managed), running in the `prometheus` namespace.

- **Prometheus** scrapes all nodes, pods, and K8s control plane components
- **Grafana** dashboards: cluster overview, node resource usage, Longhorn, ArgoCD, Traefik
- **Alertmanager** routes alerts to Ntfy (self-hosted push notifications) via a custom webhook bridge
- **Node Exporter** runs on all VMs including docker-host11 and the edge VPS (Ansible-deployed)
- **Goldilocks + VPA** analyse actual resource usage and recommend request/limit values

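
The Alertmanager → Ntfy hop is a standard webhook receiver pointed at the bridge. A sketch — the bridge service name and port are hypothetical:

```yaml
# Alertmanager config fragment (bridge address illustrative)
route:
  receiver: ntfy
receivers:
  - name: ntfy
    webhook_configs:
      - url: http://ntfy-bridge.prometheus.svc:8080/alerts  # hypothetical bridge service
```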
---

## Logs + Fleet — Elastic Stack (ECK)

Deployed via the **ECK operator** (Elastic Cloud on Kubernetes), running in the `elastic-system` namespace.

| Component | Purpose |
|-----------|---------|
| Elasticsearch | Log storage and search (single-node, 15 Gi heap) |
| Kibana | Log exploration and dashboards |
| Fleet Server | Manages Elastic Agent enrollment and policies |
| Elastic Agent (DaemonSet) | Ships logs and metrics from every cluster node |
| Elastic Agent (standalone) | Runs on docker-host11 and the edge VPS |

The Elastic Agent DaemonSet tolerates the control-plane `NoSchedule` taint so logs are collected from server nodes as well as agents.
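
That toleration looks like the following on the DaemonSet's pod template (the taint key is an assumption — it should match whatever taint the `k3s-server-*` nodes actually carry):

```yaml
# Pod template fragment on the Elastic Agent DaemonSet (taint key assumed)
tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
```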
Alerts from Elasticsearch rules are bridged to Ntfy via a small CronJob (`elastic-ntfy-bridge`) that polls the Elasticsearch alerts API and forwards new alerts as push notifications.
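
The bridge's shape is a plain polling CronJob. A heavily hypothetical sketch — schedule, image, and script are all assumptions, since only the name `elastic-ntfy-bridge` and the poll-and-forward behaviour are stated:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: elastic-ntfy-bridge
spec:
  schedule: "*/5 * * * *"            # polling interval is an assumption
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: bridge
              image: curlimages/curl           # hypothetical image
              command: ["/bin/sh", "/scripts/poll-and-forward.sh"]  # hypothetical script:
              # query the Elasticsearch alerts API, POST new alerts to the Ntfy topic
```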

---

## Alerting Flow

```
Prometheus Alertmanager ──────────────────────────────► Ntfy (push notification)
                                                          ▲
Elasticsearch alert rule ──► elastic-ntfy-bridge CronJob ─┘
```

All alerts land in the same Ntfy topic, accessible on mobile and desktop.

---

docs/storage.md (new file, 59 lines)

# Storage

## Overview

Three storage tiers serve different workloads:

| Tier | System | Access | Used by |
|------|--------|--------|---------|
| Distributed block | Longhorn | RWO + RWX | All stateful K8s workloads |
| Relational | CloudNativePG | In-cluster Postgres | Immich |
| Network file | NFS (bare-metal) | NFS mount | Jellyfin media library |

---

## Longhorn

Longhorn provides distributed block storage across all 14 agent nodes. Each volume is replicated (default: 3 replicas) across different nodes.

- **RWO** (ReadWriteOnce) — used for most services (Vaultwarden, Paperless, etc.)
- **RWX** (ReadWriteMany) — used where multiple pods need shared access
- Volumes are backed by the local disk on each agent node (128 GB each)
- Longhorn manager runs as a DaemonSet; the CSI plugin integrates with the K8s storage layer
- Snapshots and backups are supported via the Longhorn UI

Control plane nodes (`k3s-server-*`) are tainted `NoSchedule` — Longhorn manager tolerates this taint and runs everywhere, but user workloads are pushed to agent nodes only.
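
The 3-replica default is expressed in the Longhorn `StorageClass`. A sketch:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "3"        # replicas spread across different agent nodes
  staleReplicaTimeout: "30"    # minutes before a failed replica is rebuilt elsewhere
```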

---

## CloudNativePG

The CNPG operator manages HA PostgreSQL clusters as first-class Kubernetes resources. Currently used by:
- **Immich** — primary database (photos, albums, users, ML embeddings)
CNPG handles streaming replication, failover, and scheduled backups. Data is stored on Longhorn PVCs.
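
A CNPG cluster on Longhorn reduces to one custom resource. A sketch — name, instance count, and volume size are assumptions:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: immich-db          # hypothetical name
spec:
  instances: 3             # assumption: one primary + two streaming replicas
  storage:
    size: 10Gi             # assumption
    storageClass: longhorn # each instance gets its own Longhorn PVC
```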

---

## NFS

A dedicated physical node (`aya01`) runs a bare-metal NFS server. This serves the media library to Jellyfin.

- Movies, TV shows, and music live on `aya01`
- `docker-host11` (where Jellyfin runs) mounts the NFS share
- Separating media storage from the compute host means the Jellyfin VM can be rebuilt without touching the library
- NFS is not used for K8s workloads — Longhorn handles all PVC-backed storage

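
The export/mount pair can be sketched as follows; the paths and options are illustrative, only the hostnames and LAN range come from these docs:

```
# /etc/exports on aya01 (path and options illustrative)
/srv/media  192.168.20.0/24(ro,all_squash,no_subtree_check)

# /etc/fstab entry on docker-host11 (illustrative)
aya01:/srv/media  /mnt/media  nfs  ro,soft,noatime  0 0
```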
---

## Secret Storage

Kubernetes secrets are managed with **Sealed Secrets** (Bitnami). The workflow:

1. Create a regular K8s `Secret`
2. Encrypt it with `kubeseal` using the cluster's public key → produces a `SealedSecret`
3. Commit the `SealedSecret` to Git — it is safe to store publicly
4. The in-cluster Sealed Secrets controller decrypts it into a regular `Secret` at apply time

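
The committed artifact from step 3 looks like this — name, namespace, and key are hypothetical, and the ciphertext is truncated:

```yaml
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: my-secret          # hypothetical
  namespace: default       # hypothetical
spec:
  encryptedData:
    password: AgB3...      # ciphertext from kubeseal; only the controller can decrypt it
```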
Ansible secrets (VM credentials, API tokens) are encrypted with **Ansible Vault** and stored in `vars/group_vars/*/secrets_*.yaml`.