homelab-docs/docs/observability.md

# Observability

Two parallel stacks: Prometheus for metrics, Elastic for logs.

---

## Metrics

kube-prometheus-stack runs in the `prometheus` namespace (ArgoCD-managed). Prometheus scrapes all nodes, pods, and control plane components. Grafana has dashboards for cluster overview, node resources, Longhorn, ArgoCD, and Traefik.

Node Exporter is deployed via Ansible on every VM including `docker-host11` and the edge VPS, so coverage isn't limited to what's inside Kubernetes.

Goldilocks and VPA run alongside and analyze actual resource usage to suggest better request/limit values.

Alertmanager routes alerts to Ntfy via a custom webhook bridge.

---

## Logs and fleet management

The ECK operator (Elastic Cloud on Kubernetes) manages the Elastic stack in the `elastic-system` namespace:

| Component | Purpose |
|-----------|---------|
| Elasticsearch | Log storage and search (single-node, 15 Gi heap) |
| Kibana | Log exploration and dashboards |
| Fleet Server | Manages Elastic Agent enrollment and policies |
| Elastic Agent (DaemonSet) | Ships logs and metrics from every cluster node |
| Elastic Agent (standalone) | Runs on docker-host11 and the edge VPS |

The DaemonSet tolerates the control-plane `NoSchedule` taint so server nodes are covered too.

Elastic alert rules are bridged to Ntfy via `elastic-ntfy-bridge`, a small CronJob that polls the Elasticsearch alerts API and forwards new alerts as push notifications.

---

## Alerting flow

```
Prometheus Alertmanager ──► Ntfy (push notification)
                                      ▲
Elasticsearch alert rule ──► elastic-ntfy-bridge CronJob ─┘
```

Both sources land in the same Ntfy topic.