Observability

Two parallel stacks: Prometheus for metrics, Elastic for logs.


Metrics

kube-prometheus-stack runs in the prometheus namespace (ArgoCD-managed). Prometheus scrapes all nodes, pods, and control plane components. Grafana has dashboards for cluster overview, node resources, Longhorn, ArgoCD, and Traefik.

Node Exporter is deployed via Ansible on every VM, including docker-host11 and the edge VPS, so metrics coverage extends beyond what runs inside Kubernetes.

Goldilocks and the Vertical Pod Autoscaler (VPA) run alongside Prometheus and analyze observed resource usage to recommend better request/limit values.

Alertmanager routes alerts to Ntfy via a custom webhook bridge.
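As a rough illustration of what such a bridge does, the sketch below maps one alert from Alertmanager's standard webhook payload to the headers and body of an ntfy publish request. It is not the bridge's actual code: the severity-to-priority mapping, the tag choices, and the `topic_url` argument are assumptions made for the example.

```python
import urllib.request
from typing import Any

# Assumed mapping from the alert's severity label to an ntfy priority (1 = min, 5 = max).
SEVERITY_TO_PRIORITY = {"critical": "5", "warning": "3", "info": "2"}


def alert_to_ntfy(alert: dict[str, Any]) -> tuple[dict[str, str], str]:
    """Translate one Alertmanager alert into (ntfy headers, message body)."""
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})
    status = alert.get("status", "firing")
    name = labels.get("alertname", "unknown alert")

    if status == "resolved":
        priority, tag = "2", "white_check_mark"
    else:
        priority = SEVERITY_TO_PRIORITY.get(labels.get("severity", ""), "3")
        tag = "rotating_light"

    headers = {"Title": f"[{status.upper()}] {name}", "Priority": priority, "Tags": tag}
    # Prefer the summary annotation, fall back to description, then the alert name.
    body = annotations.get("summary") or annotations.get("description") or name
    return headers, body


def payload_to_ntfy(payload: dict[str, Any]) -> list[tuple[dict[str, str], str]]:
    """Convert every alert in a webhook payload."""
    return [alert_to_ntfy(alert) for alert in payload.get("alerts", [])]


def publish(topic_url: str, headers: dict[str, str], body: str) -> None:
    """POST one message to an ntfy topic URL (hypothetical URL supplied by the caller)."""
    request = urllib.request.Request(
        topic_url, data=body.encode("utf-8"), headers=headers, method="POST"
    )
    with urllib.request.urlopen(request, timeout=10):
        pass
```

Keeping the translation separate from the HTTP call makes the mapping easy to test without a running Ntfy server.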


Logs and fleet management

The ECK operator (Elastic Cloud on Kubernetes) manages the Elastic stack in the elastic-system namespace:

| Component | Purpose |
| --- | --- |
| Elasticsearch | Log storage and search (single-node, 15 Gi heap) |
| Kibana | Log exploration and dashboards |
| Fleet Server | Manages Elastic Agent enrollment and policies |
| Elastic Agent (DaemonSet) | Ships logs and metrics from every cluster node |
| Elastic Agent (standalone) | Runs on docker-host11 and the edge VPS |

The DaemonSet tolerates the control-plane NoSchedule taint so server nodes are covered too.

Elastic alert rules are bridged to Ntfy via elastic-ntfy-bridge, a small CronJob that polls the Elasticsearch alerts API and forwards new alerts as push notifications.
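The "forwards new alerts" step comes down to de-duplication between polls. The sketch below shows one way to do that, assuming each polled alert is a dict with a unique `id` field. That shape is hypothetical, and the real bridge may instead filter by a time window or keep its state elsewhere.

```python
from typing import Any, Iterable


def select_new_alerts(
    alerts: Iterable[dict[str, Any]], seen_ids: set[str]
) -> list[dict[str, Any]]:
    """Return alerts whose id has not been seen before, recording them in seen_ids."""
    fresh = []
    for alert in alerts:
        alert_id = alert["id"]  # hypothetical field name
        if alert_id in seen_ids:
            continue  # already forwarded in an earlier poll (or earlier in this batch)
        seen_ids.add(alert_id)
        fresh.append(alert)
    return fresh
```

Because a CronJob starts from scratch on every run, `seen_ids` would have to be persisted between runs (for example in a small state file or a ConfigMap) for this approach to work.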


Alerting flow

Prometheus Alertmanager ────────────────────────────────► Ntfy (push notification)
                                                          ▲
Elasticsearch alert rule ──► elastic-ntfy-bridge CronJob ─┘

Both sources land in the same Ntfy topic.