Observability

Two parallel stacks cover metrics and logs.

Metrics — Prometheus + Grafana

Deployed via the kube-prometheus-stack Helm chart (ArgoCD-managed), running in the prometheus namespace.

Prometheus scrapes all nodes, pods, and K8s control plane components
Grafana dashboards: cluster overview, node resource usage, Longhorn, ArgoCD, Traefik
Alertmanager routes alerts to Ntfy (self-hosted push notifications) via a custom webhook bridge
Node Exporter runs on all VMs including docker-host11 and the edge VPS (Ansible-deployed)
Goldilocks + VPA analyse actual resource usage and recommend request/limit values

Logs + Fleet — Elastic Stack (ECK)

Deployed via the ECK operator (Elastic Cloud on Kubernetes), running in the elastic-system namespace.

Component	Purpose
Elasticsearch	Log storage and search (single-node, 15 Gi heap)
Kibana	Log exploration and dashboards
Fleet Server	Manages Elastic Agent enrollment and policies
Elastic Agent (DaemonSet)	Ships logs and metrics from every cluster node
Elastic Agent (standalone)	Runs on docker-host11 and the edge VPS

The Elastic Agent DaemonSet tolerates the control-plane NoSchedule taint so logs are collected from server nodes as well as agents.

Alerts from Elasticsearch rules are bridged to Ntfy via a small CronJob (elastic-ntfy-bridge) that polls the Elasticsearch alerts API and forwards new alerts as push notifications.

Alerting Flow

Prometheus Alertmanager ──► Ntfy (push notification)
                                      ▲
Elasticsearch alert rule ──► elastic-ntfy-bridge CronJob ─┘

All alerts land in the same Ntfy topic, accessible on mobile and desktop.

1.8 KiB Raw Blame History

Observability

Metrics — Prometheus + Grafana

Logs + Fleet — Elastic Stack (ECK)

Alerting Flow

1.8 KiB

Raw Blame History