Common Prometheus Exporters and Hardware Metrics
Exporter
node_exporter
Link: https://github.com/prometheus/node_exporter
Purpose: collects system and hardware metrics exposed by UNIX-like kernels, such as CPU, memory, disk, and file descriptors (FD).
process_exporter
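Link: https://github.com/ncabatoff/process-exporter
Purpose: for applications that have no Prometheus integration of their own, it reads data under /proc to collect per-process metrics such as thread count, context switches, I/O, and file descriptors.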
blackbox_exporter
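Link: https://github.com/prometheus/blackbox_exporter
Purpose: a black-box prober for availability monitoring of web and API services. It can probe over TCP, HTTP, HTTPS, and similar protocols; common metrics include the response code, DNS lookup time, and probe duration.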
mysqld_exporter
Link: https://github.com/prometheus/mysqld_exporter
Purpose: exposes MySQL monitoring metrics such as connection count, QPS per command type, and buffer sizes.
redis_exporter
Link: https://github.com/oliver006/redis_exporter
Purpose: exposes Redis monitoring metrics such as QPS, hit rate, network I/O, and key count.
druid_exporter
Link: https://github.com/wikimedia/operations-software-druid_exporter
Purpose: exposes druid.io monitoring metrics. Coverage is incomplete, but it can be customized as needed.
jmx_exporter
Link: https://github.com/prometheus/jmx_exporter
Purpose: exposes metrics from Java applications, for example Kafka and the Hadoop ecosystem.
In addition, if your alerting pipeline includes DingTalk, dingtalk_webhook is recommended.
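As a rough sketch of how DingTalk could be wired in, the Alertmanager fragment below routes the more severe alerts to a webhook receiver. The receiver names, the grouping labels, and the URL (host, port, and path) are assumptions for illustration and depend on how the DingTalk webhook bridge is deployed; the severity values are the ones used by the rules in this post.

```yaml
# alertmanager.yml (fragment); receiver names and URL are hypothetical
route:
  receiver: default
  group_by: ['alertname', 'instance']
  routes:
    # send the more severe alerts to the DingTalk bridge
    - matchers:
        - severity =~ "critical|disaster|high"
      receiver: dingtalk

receivers:
  - name: default
  - name: dingtalk
    webhook_configs:
      # assumed address of the DingTalk webhook bridge
      - url: http://localhost:8060/dingtalk/webhook1/send
        send_resolved: true
```

Alerts that do not match the sub-route fall through to the default receiver, which in this sketch has no notification channel configured.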
Alert Rules
blackbox_exporter
The only commonly used alerts here are probe failure and abnormal HTTP status codes.
- alert: BlackboxProbeFailed
  expr: probe_success == 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Blackbox probe failed (instance {{ $labels.instance }})
    description: "Probe failed\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: BlackboxProbeHttpFailure
  expr: probe_http_status_code <= 199 OR probe_http_status_code >= 400
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Blackbox probe HTTP failure (instance {{ $labels.instance }})
    description: "HTTP status code is not 200-399\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
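The probe_success and probe_http_status_code series above only exist once Prometheus scrapes the exporter's /probe endpoint on behalf of each target. A minimal scrape configuration sketch follows; the job name, the probed URL, and the exporter address are assumptions, and http_2xx is assumed to be a module defined in your blackbox.yml.

```yaml
# prometheus.yml (fragment); job name, target, and exporter address are hypothetical
scrape_configs:
  - job_name: blackbox_http
    metrics_path: /probe
    params:
      module: [http_2xx]              # module assumed to exist in blackbox.yml
    static_configs:
      - targets:
          - https://example.com       # URL to probe
    relabel_configs:
      # pass the target URL to the exporter as the ?target= parameter
      - source_labels: [__address__]
        target_label: __param_target
      # keep the probed URL as the instance label used in the alert annotations
      - source_labels: [__param_target]
        target_label: instance
      # send the actual scrape to the blackbox_exporter itself
      - target_label: __address__
        replacement: 127.0.0.1:9115
```

With this relabeling, {{ $labels.instance }} in the alerts resolves to the probed URL rather than to the exporter's own address.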
mysqld_exporter
Typical alerts cover liveness, slow queries, and so on.
- alert: MysqlDown
  expr: mysql_up == 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: MySQL down (instance {{ $labels.instance }})
    description: "MySQL instance is down on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MysqlSlowQueries
  expr: increase(mysql_global_status_slow_queries[1m]) > 0
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: MySQL slow queries (instance {{ $labels.instance }})
    description: "MySQL server has new slow queries.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
process_exporter
Typically used for process liveness checks.
- alert: ProcessDown
  expr: namedprocess_namegroup_num_procs == 0
  for: 1m
  labels:
    severity: high
  annotations:
    summary: "Instance {{ $labels.instance }} process is down."
    description: "Instance {{ $labels.instance }}: {{ $labels.groupname }} is down"
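The groupname label in this rule comes from the process groups defined in the process-exporter configuration. A small sketch of such a configuration is shown below; the group names and match patterns are hypothetical and should be replaced with the processes you actually want to watch.

```yaml
# process-exporter config (fragment); group names and patterns are hypothetical
process_names:
  # group matched by a regular expression against the full command line
  - name: "kafka"
    cmdline:
      - 'kafka\.Kafka'
  # group matched by the process name
  - name: "nginx"
    comm:
      - nginx
```

Each entry produces one value of groupname, so the ProcessDown description would read, for example, "kafka is down".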
jmx_exporter
Here I only use it for Kafka monitoring.
- name: KafkaStatsAlert
  rules:
  - alert: KafkaOfflineAlert
    expr: kafka_controller_kafkacontroller_offlinepartitionscount > 0
    for: 5m
    labels:
      severity: average
    annotations:
      summary: "Instance {{ $labels.instance }} has offline partitions."
      description: "Offline partitions > 0 (current value: {{ $value }})"
  - alert: KafkaUnderreplicatedAlert
    expr: sum(kafka_cluster_partition_underreplicated) by (instance) > 0
    for: 5m
    labels:
      severity: average
    annotations:
      summary: "Instance {{ $labels.instance }} has under-replicated partitions."
      description: "Under-replicated partitions > 0 (current value: {{ $value }})"
  - alert: KafkaNocontrollerAlert
    expr: sum(kafka_controller_kafkacontroller_activecontrollercount) == 0
    for: 5m
    labels:
      severity: average
    annotations:
      summary: "No active controller."
      description: "Current active controller = 0"
  - alert: KafkaGCtimeAlert
    expr: sum without(gc)(rate(jvm_gc_collection_seconds_sum{job="kafka"}[5m])) > 0.8
    for: 5m
    labels:
      severity: high
    annotations:
      summary: "Kafka spent too much time in GC"
      description: "GC time > 80% (current value: {{ $value }})"
node_exporter
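Below is a curated set of commonly used alert expressions for hardware metrics collected by node_exporter, including capacity warnings for various hardware resources.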
groups:
- name: hostStatsAlert
  rules:
  - alert: hostCpuUsageAlert
    expr: sum by(instance) (avg without(cpu) (irate(node_cpu_seconds_total{mode!="idle",instance!~"dev.*"}[5m]))) > 0.9
    for: 5m
    labels:
      severity: high
    annotations:
      summary: "Instance {{ $labels.instance }} CPU usage high"
      description: "CPU usage above 90% (current value: {{ $value }})"
  - alert: hostMemUsageAlert
    expr: (node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / node_memory_MemTotal_bytes{cluster!="develop"} > 0.9
    for: 10m
    labels:
      severity: average
    annotations:
      summary: "Instance {{ $labels.instance }} MEM usage high"
      description: "MEM usage above 90% (current value: {{ $value }})"
  - alert: hostDiskUsageAlert
    expr: (node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes{fstype=~"ext[2-5]*"} > 0.8
    for: 5m
    labels:
      severity: high
    annotations:
      summary: "Instance {{ $labels.instance }} Disk usage high"
      description: "Disk usage above 80% (current value: {{ $value }})"
  - alert: hostDiskInodeUsageAlert
    expr: (node_filesystem_files - node_filesystem_files_free) / node_filesystem_files{fstype=~"ext[2-5]*"} > 0.8
    labels:
      severity: high
    annotations:
      summary: "Instance {{ $labels.instance }} Disk Inode usage high"
      description: "{{ $labels.instance }} Disk Inode usage above 80% (current value: {{ $value }})"
  - alert: hostCpuLoadAlert
    expr: node_load15 / (count without (cpu, mode) (node_cpu_seconds_total{mode="system",cluster!="develop"})) > 2
    for: 10m
    labels:
      severity: average
    annotations:
      summary: "Instance {{ $labels.instance }} CPU load (15m) high"
      description: "CPU load (15m) is high\n VALUE = {{ $value }}"
  # mute during 0:00-07:00 (note: hour() returns the hour in UTC)
  - alert: hostDiskIOAlert
    expr: sum by(instance) (avg without(cpu) (irate(node_cpu_seconds_total{mode=~"iowait",instance!~"dev.*"}[5m]))) > 0.2 and ON() hour() > 0 < 16
    for: 10m
    labels:
      severity: average
    annotations:
      summary: "Instance {{ $labels.instance }} Disk I/O is overloaded"
      description: "{{ $labels.instance }} Disk I/O is overloaded for more than 10 minutes (current value: {{ $value }})"
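The time window in hostDiskIOAlert relies on hour(), which Prometheus evaluates in UTC. As a sketch only, assuming the hosts run in UTC+8 and the alert should stay quiet from 00:00 to 07:00 local time, the window can be written against the local hour instead. The alert name and the +8 offset are assumptions; adjust both to your environment.

```yaml
- alert: hostDiskIOAlertLocalWindow   # hypothetical name
  expr: |
    sum by(instance) (avg without(cpu) (irate(node_cpu_seconds_total{mode="iowait",instance!~"dev.*"}[5m]))) > 0.2
    and on() ((hour() + 8) % 24 >= 7)
  for: 10m
  labels:
    severity: average
  annotations:
    summary: "Instance {{ $labels.instance }} Disk I/O is overloaded"
    description: "{{ $labels.instance }} iowait above 20% for more than 10 minutes (current value: {{ $value }})"
```

The right-hand side of `and on()` is a single label-less sample that exists only while the shifted hour is 7 or later, so every instance on the left-hand side is suppressed during the quiet period.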
redis_exporter
Metrics collected in combination with custom parameters.
- name: RedisAlert
  rules:
  - alert: BlockedClient
    expr: redis_blocked_clients > 0
    for: 1m
    labels:
      severity: high
    annotations:
      summary: "Redis has blocked clients."
      description: "Redis connections have blocked clients: {{ $value }}."
  - alert: RedisClusterSlotsFail
    expr: redis_cluster_slots_fail > 0
    for: 1m
    labels:
      severity: high
    annotations:
      summary: "Redis cluster has failed slots."
      description: "Redis cluster has failed slots: {{ $value }}."
  - alert: RedisClusterStorage
    expr: redis_cluster_stats{type="StorageUs"} > 85
    labels:
      severity: high
    annotations:
      summary: "Redis cluster storage used over 85%."
      description: "Redis cluster {{ $labels.ins}} storage used over 85%: {{ $value }}."
  - alert: RedisClusterStorageLack
    expr: redis_cluster_stats{type="StorageUs"} > 90
    labels:
      severity: disaster
    annotations:
      summary: "Redis cluster storage used over 90%."
      description: "Redis cluster {{ $labels.ins}} storage used over 90%: {{ $value }}."
  - alert: RedisClusterMaxConnection
    expr: redis_cluster_stats{type="Connections"} > 5000
    for: 2m
    labels:
      severity: disaster
    annotations:
      summary: "Redis cluster has too many connections."
      description: "Redis cluster {{ $labels.ins}} has too many connections: {{ $value }}."
  - alert: RedisInstanceDown
    expr: redis_up{instance!~'ops-.*'} == 0
    labels:
      severity: disaster
    annotations:
      summary: "Redis instance is down"
      description: "Redis instance {{$labels.alias}} is down"
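Hit rate is listed earlier as one of the typical Redis metrics, but none of the rules above covers it. A possible rule is sketched below using the keyspace hit and miss counters exported by redis_exporter; the alert name, the 80% threshold, the 10m duration, and the severity are all assumptions to tune for your workload.

```yaml
- alert: RedisLowHitRate   # hypothetical name and threshold
  expr: |
    rate(redis_keyspace_hits_total[5m])
      / (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m])) < 0.8
  for: 10m
  labels:
    severity: average
  annotations:
    summary: "Redis hit rate is low"
    description: "Redis instance {{ $labels.instance }} hit rate below 80% (current value: {{ $value }})"
```

When an instance receives no lookups at all, the ratio is NaN and the comparison does not match, so idle instances do not trigger this alert.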