Common Prometheus Exporters and Hardware Metrics

Exporter

node_exporter

Link: https://github.com/prometheus/node_exporter
Purpose: collects the system and hardware metrics exposed by UNIX-like kernels, such as CPU, memory, disk, and file descriptor (FD) usage.

process_exporter

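Link: https://github.com/ncabatoff/process-exporter
Purpose: for applications that do not expose Prometheus metrics themselves, it reads the data under /proc to produce per-process metrics such as thread count, context switches, I/O, and FD usage.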
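
process-exporter reads a config file to decide which processes to track and how to group them. The sketch below shows one way such a file might look; the group names and match patterns are illustrative assumptions, not the original setup.

```yaml
# process-exporter config sketch (hypothetical groups and patterns).
# The `name` of each entry becomes the `groupname` label on the metrics.
process_names:
  # Match on the full command line (regular expression).
  - name: "kafka"
    cmdline:
      - 'kafka\.Kafka'
  # Match on the process name (comm) as reported under /proc.
  - name: "nginx"
    comm:
      - nginx
```

With a config like this, `namedprocess_namegroup_num_procs{groupname="kafka"}` would report how many matching processes are running.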

blackbox_exporter

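Link: https://github.com/prometheus/blackbox_exporter
Purpose: a black-box prober for availability monitoring of web and API services over TCP, HTTP, HTTPS, and other protocols. Common metrics include the response status code, DNS lookup time, and probe duration.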
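
Unlike most exporters, blackbox_exporter is scraped once per probed target, so the scrape job has to rewrite the target address with relabeling. The following is a minimal sketch of such a job. The target URL and the exporter address are assumptions, and `http_2xx` is assumed to be a module defined in your blackbox.yml.

```yaml
# prometheus.yml fragment (sketch): probe HTTP endpoints through blackbox_exporter.
scrape_configs:
  - job_name: blackbox_http
    metrics_path: /probe
    params:
      module: [http_2xx]            # module name assumed to exist in blackbox.yml
    static_configs:
      - targets:
          - https://example.com     # hypothetical endpoint to probe
    relabel_configs:
      # Pass the listed target to the exporter as the ?target= parameter.
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label.
      - source_labels: [__param_target]
        target_label: instance
      # Send the actual scrape to the exporter itself (address assumed).
      - target_label: __address__
        replacement: 127.0.0.1:9115
```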

mysqld_exporter

Link: https://github.com/prometheus/mysqld_exporter
Purpose: exposes MySQL metrics such as connection count, per-command QPS, and buffer sizes.

redis_exporter

Link: https://github.com/oliver006/redis_exporter
Purpose: exposes Redis metrics such as QPS, hit rate, network I/O, and key count.

druid_exporter

Link: https://github.com/wikimedia/operations-software-druid_exporter
Purpose: exposes metrics for druid.io. Coverage is incomplete, so you may need to customize it.

jmx_exporter

Link: https://github.com/prometheus/jmx_exporter
Purpose: exposes metrics from Java applications, for example Kafka and the Hadoop ecosystem.
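
Each of the exporters above serves its metrics over HTTP, so Prometheus needs a scrape job for each of them. The sketch below assumes the exporters run on their usual default ports. The host names are hypothetical, and the jmx_exporter port is whatever you passed to the Java agent (7071 here is only an example).

```yaml
# prometheus.yml fragment (sketch): one scrape job per exporter.
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['host-a:9100']      # node_exporter default port
  - job_name: process
    static_configs:
      - targets: ['host-a:9256']      # process-exporter default port
  - job_name: mysql
    static_configs:
      - targets: ['db-a:9104']        # mysqld_exporter default port
  - job_name: redis
    static_configs:
      - targets: ['cache-a:9121']     # redis_exporter default port
  - job_name: kafka
    static_configs:
      - targets: ['kafka-a:7071']     # jmx_exporter agent port (user-chosen, assumed)
```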

In addition, if your alerting pipeline includes DingTalk, dingtalk_webhook is recommended.
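
A DingTalk webhook bridge is normally wired in as an Alertmanager webhook receiver. The following is a minimal sketch of that wiring, not a tested setup. The receiver names, the severity selection, and especially the webhook URL are assumptions; use the address and path your bridge actually listens on.

```yaml
# alertmanager.yml fragment (sketch): route selected severities to a DingTalk bridge.
route:
  receiver: default
  group_by: [alertname, instance]
  routes:
    # Forward the more urgent severity values used by the rules in this post.
    - matchers:
        - severity=~"critical|disaster|high"
      receiver: dingtalk

receivers:
  - name: default          # no notification configured; acts as a sink
  - name: dingtalk
    webhook_configs:
      - url: http://127.0.0.1:8060/dingtalk/webhook1/send   # assumed bridge URL
        send_resolved: true
```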

Alert Rules

blackbox_exporter

In practice, only the probe-failure and abnormal-status-code alerts are commonly used.

  - alert: BlackboxProbeFailed
    expr: probe_success == 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Blackbox probe failed (instance {{ $labels.instance }})
      description: "Probe failed\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: BlackboxProbeHttpFailure
    expr: probe_http_status_code <= 199 OR probe_http_status_code >= 400
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Blackbox probe HTTP failure (instance {{ $labels.instance }})
      description: "HTTP status code is not 200-399\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

mysqld_exporter

Typically covers liveness, slow queries, and so on.

  - alert: MysqlDown
    expr: mysql_up == 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: MySQL down (instance {{ $labels.instance }})
      description: "MySQL instance is down on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: MysqlSlowQueries
    expr: increase(mysql_global_status_slow_queries[1m]) > 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: MySQL slow queries (instance {{ $labels.instance }})
      description: "MySQL server has new slow queries.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

process_exporter

Typically used to check that a process is still running.

  - alert: ProcessDown
    expr: namedprocess_namegroup_num_procs == 0
    for: 1m
    labels:
      severity: high
    annotations:
      summary: "Instance {{ $labels.instance }} process is down."
      description: "Instance {{ $labels.instance }}: {{ $labels.groupname }} is down"

jmx_exporter

Here I only use it to monitor Kafka.

- name: KafkaStatsAlert
  rules:
  - alert: KafkaOfflineAlert
    expr: kafka_controller_kafkacontroller_offlinepartitionscount > 0
    for: 5m
    labels:
      severity: average
    annotations:
      summary: "Instance {{ $labels.instance }} has offline partitions."
      description: "Offline partitions > 0 (current value: {{ $value }})"
  - alert: KafkaUnderreplicatedAlert
    expr: sum(kafka_cluster_partition_underreplicated) by (instance) > 0
    for: 5m
    labels:
      severity: average
    annotations:
      summary: "Instance {{ $labels.instance }} has under-replicated partitions."
      description: "Underreplicated partitions > 0 (current value: {{ $value }})"
  - alert: KafkaNocontrollerAlert
    expr: sum (kafka_controller_kafkacontroller_activecontrollercount ) == 0
    for: 5m
    labels:
      severity: average
    annotations:
      summary: "No active controller."
      description: "Current active controller = 0"
  - alert: KafkaGCtimeAlert
    expr: sum without(gc)(rate(jvm_gc_collection_seconds_sum{job="kafka"}[5m])) > 0.8
    for: 5m
    labels:
      severity: high
    annotations:
      summary: "Kafka spent too much time in GC"
      description: "GC time > 80% (current value: {{ $value }})"
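
The Kafka metric names used in these rules are not fixed by Kafka itself; they depend on how jmx_exporter maps MBeans to metric names. The sketch below shows one mapping that would produce names of this shape. It is an assumption about the exporter config, not the config actually used here.

```yaml
# jmx_exporter config sketch for Kafka (assumed mapping; adapt to your brokers).
lowercaseOutputName: true   # emit metric names in lower case
rules:
  # Per-partition gauges, e.g. kafka.cluster type=Partition, name=UnderReplicated.
  # Listed first because the first matching rule wins.
  - pattern: 'kafka.(\w+)<type=(.+), name=(.+), topic=(.+), partition=(.+)><>Value'
    name: kafka_$1_$2_$3
    type: GAUGE
    labels:
      topic: "$4"
      partition: "$5"
  # Plain gauges, e.g. kafka.controller type=KafkaController, name=OfflinePartitionsCount.
  - pattern: 'kafka.(\w+)<type=(.+), name=(.+)><>Value'
    name: kafka_$1_$2_$3
    type: GAUGE
```

Under this mapping, the controller MBean would surface as `kafka_controller_kafkacontroller_offlinepartitionscount`, which is the name the first rule queries.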

node_exporter

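Below is a curated set of commonly used alert rules built on the hardware metrics collected by node_exporter, including capacity warnings for various hardware resources.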

groups:
- name: hostStatsAlert
  rules:
  - alert: hostCpuUsageAlert
    expr: sum by(instance) (avg without(cpu) (irate(node_cpu_seconds_total{mode!="idle",instance!~"dev.*"}[5m]))) > 0.9
    for: 5m
    labels:
      severity: high
    annotations:
      summary: "Instance {{ $labels.instance }} CPU usage high"
      description: "CPU usage above 90% (current value: {{ $value }})"

  - alert: hostMemUsageAlert
    expr: (node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / node_memory_MemTotal_bytes{cluster!="develop"} > 0.9
    for: 10m
    labels:
      severity: average
    annotations:
      summary: "Instance {{ $labels.instance }} MEM usage high"
      description: "MEM usage above 90% (current value: {{ $value }})"

  - alert: hostDiskUsageAlert
    expr: (node_filesystem_size_bytes - node_filesystem_free_bytes ) / node_filesystem_size_bytes {fstype=~"ext[2-5]*"} > 0.8
    for: 5m
    labels:
      severity: high
    annotations:
      summary: "Instance {{ $labels.instance }} Disk usage high"
      description: "Disk usage above 80% (current value: {{ $value }})"

  - alert: hostDiskInodeUsageAlert
    expr: (node_filesystem_files - node_filesystem_files_free)/node_filesystem_files {fstype=~"ext[2-5]*"} > 0.8
    labels:
      severity: high
    annotations:
      summary: "Instance {{ $labels.instance }} Disk Inode usage high"
      description: "{{ $labels.instance }} Disk Inode usage above 80% (current value: {{ $value }})"

  - alert: hostCpuLoadAlert
    expr: node_load15 / (count without (cpu, mode) (node_cpu_seconds_total{mode="system",cluster!="develop"})) > 2
    for: 10m
    labels:
      severity: average
    annotations:
      summary: "Instance {{ $labels.instance }} CPU load (15m) high"
      description: "CPU load (15m) is high\n  VALUE = {{ $value }}"

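  # Muted during 00:00-07:59 local time, assuming UTC+8.
  # hour() is evaluated in UTC, so the rule is only active while hour() < 16.
  - alert: hostDiskIOAlert
    expr: sum by(instance) (avg without(cpu) (irate(node_cpu_seconds_total{mode=~"iowait",instance!~"dev.*"}[5m]))) > 0.2 and ON() hour() < 16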
    for: 10m
    labels:
      severity: average
    annotations:
      summary: "Instance {{ $labels.instance }} disk I/O is overloaded"
      description: "{{ $labels.instance }} disk I/O is overloaded for more than 10 minutes (current value: {{ $value }})"
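
The disk rules above fire once a fixed usage threshold is crossed. For an earlier capacity warning, a trend-based rule can complement them. The following is a sketch using PromQL's `predict_linear`; the group name, alert name, look-back window, and forecast horizon are all assumptions to tune for your environment.

```yaml
groups:
- name: hostCapacityForecast            # hypothetical group name
  rules:
  - alert: hostDiskWillFillIn24h        # hypothetical alert name
    # Extrapolate the last 6h of available space linearly, 24h into the future.
    expr: predict_linear(node_filesystem_avail_bytes{fstype=~"ext[2-4]|xfs"}[6h], 24 * 3600) < 0
    for: 30m
    labels:
      severity: average
    annotations:
      summary: "Instance {{ $labels.instance }} disk is predicted to fill up"
      description: "{{ $labels.mountpoint }} is predicted to run out of space within 24h"
```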

redis_exporter

These rules combine the exporter's metrics with metrics collected through custom parameters.

- name: RedisAlert
  rules:
  - alert: BlockedClient
    expr: redis_blocked_clients > 0 
    for: 1m
    labels:
      severity: high
    annotations:
      summary: "Redis has blocked clients."
      description: "Redis has blocked clients: {{ $value }}."
  - alert: RedisClusterSlotsFail
    expr: redis_cluster_slots_fail > 0 
    for: 1m
    labels:
      severity: high
    annotations:
      summary: "Redis cluster has failed slots."
      description: "Redis cluster has failed slots: {{ $value }}."
  - alert: RedisClusterStorage
    expr: redis_cluster_stats{type="StorageUs"} > 85
    labels:
      severity: high
    annotations:
      summary: "redis cluster storage used over 85%."
      description: "redis cluster {{ $labels.ins}} storage used over 85%: {{ $value }}."
  - alert: RedisClusterStorageLack
    expr: redis_cluster_stats{type="StorageUs"} > 90
    labels:
      severity: disaster
    annotations:
      summary: "redis cluster storage used over 90%."
      description: "redis cluster {{ $labels.ins}} storage used over 90%: {{ $value }}."

  - alert: RedisClusterMaxConnection
    expr: redis_cluster_stats{type="Connections"} > 5000
    for: 2m
    labels:
      severity: disaster
    annotations:
      summary: "Redis cluster has too many connections."
      description: "Redis cluster {{ $labels.ins }} has too many connections: {{ $value }}."

  - alert: RedisInstanceDown
    expr: redis_up{instance!~'ops-.*'} == 0
    labels:
      severity: disaster
    annotations:
      summary: "redis instance is down"
      description: "redis instance {{$labels.alias}} is down"
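
The storage rules above rely on a custom `redis_cluster_stats` metric. Where that metric is unavailable, a similar check can be sketched with the memory metrics redis_exporter exposes on its own. The group name, alert name, and threshold below are assumptions.

```yaml
- name: RedisMemoryAlert                # hypothetical group name
  rules:
  - alert: RedisMemoryUsageHigh         # hypothetical alert name
    # The "> 0" filter skips instances with no maxmemory limit configured,
    # which would otherwise divide by zero and always fire.
    expr: redis_memory_used_bytes / (redis_memory_max_bytes > 0) * 100 > 85
    for: 2m
    labels:
      severity: high
    annotations:
      summary: "Redis memory usage over 85%."
      description: "Redis instance {{ $labels.instance }} memory usage: {{ $value }}%."
```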