46 lines
1.8 KiB
Markdown
46 lines
1.8 KiB
Markdown
# Observability
|
|
|
|
Two parallel stacks cover metrics and logs.
|
|
|
|
---
|
|
|
|
## Metrics — Prometheus + Grafana
|
|
|
|
Deployed via the **kube-prometheus-stack** Helm chart (ArgoCD-managed), running in the `prometheus` namespace.
|
|
|
|
- **Prometheus** scrapes all nodes, pods, and K8s control plane components
|
|
- **Grafana** dashboards: cluster overview, node resource usage, Longhorn, ArgoCD, Traefik
|
|
- **Alertmanager** routes alerts to Ntfy (self-hosted push notifications) via a custom webhook bridge
|
|
- **Node Exporter** runs on all VMs including docker-host11 and the edge VPS (Ansible-deployed)
|
|
- **Goldilocks + VPA** analyse actual resource usage and recommend request/limit values
|
|
|
|
---
|
|
|
|
## Logs + Fleet — Elastic Stack (ECK)
|
|
|
|
Deployed via the **ECK operator** (Elastic Cloud on Kubernetes), running in the `elastic-system` namespace.
|
|
|
|
| Component | Purpose |
|
|
|-----------|---------|
|
|
| Elasticsearch | Log storage and search (single-node, 15 Gi heap) |
|
|
| Kibana | Log exploration and dashboards |
|
|
| Fleet Server | Manages Elastic Agent enrollment and policies |
|
|
| Elastic Agent (DaemonSet) | Ships logs and metrics from every cluster node |
|
|
| Elastic Agent (standalone) | Runs on docker-host11 and the edge VPS |
|
|
|
|
The Elastic Agent DaemonSet tolerates the control-plane `NoSchedule` taint so logs are collected from server nodes as well as agents.
|
|
|
|
Alerts from Elasticsearch rules are bridged to Ntfy via a small CronJob (`elastic-ntfy-bridge`) that polls the Elasticsearch alerts API and forwards new alerts as push notifications.
|
|
|
|
---
|
|
|
|
## Alerting Flow
|
|
|
|
```
|
|
Prometheus Alertmanager ──► Ntfy (push notification)
|
|
▲
|
|
Elasticsearch alert rule ──► elastic-ntfy-bridge CronJob ─┘
|
|
```
|
|
|
|
All alerts land in the same Ntfy topic, accessible on mobile and desktop.
|