46 lines
1.6 KiB
Markdown
46 lines
1.6 KiB
Markdown
# Observability
|
|
|
|
Two parallel stacks: Prometheus for metrics, Elastic for logs.
|
|
|
|
---
|
|
|
|
## Metrics
|
|
|
|
kube-prometheus-stack runs in the `prometheus` namespace (ArgoCD-managed). Prometheus scrapes all nodes, pods, and control plane components. Grafana has dashboards for cluster overview, node resources, Longhorn, ArgoCD, and Traefik.
|
|
|
|
Node Exporter is deployed via Ansible on every VM including `docker-host11` and the edge VPS, so coverage isn't limited to what's inside Kubernetes.
|
|
|
|
Goldilocks and VPA run alongside and analyze actual resource usage to suggest better request/limit values.
|
|
|
|
Alertmanager routes alerts to Ntfy via a custom webhook bridge.
|
|
|
|
---
|
|
|
|
## Logs and fleet management
|
|
|
|
The ECK operator (Elastic Cloud on Kubernetes) manages the Elastic stack in the `elastic-system` namespace:
|
|
|
|
| Component | Purpose |
|
|
|-----------|---------|
|
|
| Elasticsearch | Log storage and search (single-node, 15 Gi heap) |
|
|
| Kibana | Log exploration and dashboards |
|
|
| Fleet Server | Manages Elastic Agent enrollment and policies |
|
|
| Elastic Agent (DaemonSet) | Ships logs and metrics from every cluster node |
|
|
| Elastic Agent (standalone) | Runs on docker-host11 and the edge VPS |
|
|
|
|
The DaemonSet tolerates the control-plane `NoSchedule` taint so server nodes are covered too.
|
|
|
|
Elastic alert rules are bridged to Ntfy via `elastic-ntfy-bridge`, a small CronJob that polls the Elasticsearch alerts API and forwards new alerts as push notifications.
|
|
|
|
---
|
|
|
|
## Alerting flow
|
|
|
|
```
|
|
Prometheus Alertmanager ──► Ntfy (push notification)
|
|
▲
|
|
Elasticsearch alert rule ──► elastic-ntfy-bridge CronJob ─┘
|
|
```
|
|
|
|
Both sources land in the same Ntfy topic.
|