杜龙少(sdvdxl)

prometheus机器组件安装配置

字数统计: 1.6k阅读时长: 8 min
2019/12/30 Share

Prometheus 安装

准备:

1
mkdir -p /data/logs /data/soft/monitor

安装核心服务 Prometheus

下载 Prometheus 到 /data/soft 目录,并解压到 /data/soft/monitor/prometheus

执行命令为:

1
2
3
mkdir -p /data/soft/monitor && mkdir -p /data/logs && cd /data/soft

wget https://hekr-files.oss-cn-shanghai.aliyuncs.com/soft/prometheus/linux/prometheus-2.14.0.linux-amd64.tar.gz && tar -xvf prometheus-2.14.0.linux-amd64.tar.gz && mv prometheus-2.14.0.linux-amd64 monitor/prometheus

启动

1
2
cd /data/soft/monitor/prometheus
setsid ./prometheus --config.file="prometheus.yml" --web.enable-lifecycle >> /data/logs/prometheus.log 2>&1

上面使用默认配置启动了Prometheus,至于配置文件下面会再进行修改。

检查是否启动成功,输入命令 curl -XGET -sI localhost:9090/metrics|head -n1 如果输出 HTTP/1.1 200 OK 则代表成功,否则查看日志 /data/logs/prometheus.log 定位问题。

安装Linux监控组件

该组件需要安装在被监控的主机上,如果主机有多台,那么应该分别进行安装并启动。

下载 node_export 到 目录 /data/soft/monitor 并解压,命令:

1
2
3
4
cd /data/soft
wget https://hekr-files.oss-cn-shanghai.aliyuncs.com/soft/prometheus/linux/node_exporter-0.18.1.linux-amd64.tar.gz

tar -xvf node_exporter-0.18.1.linux-amd64.tar.gz && mv node_exporter-0.18.1.linux-amd64 monitor/node_exporter

启动 node_exporter

1
2
3
cd /data/soft/monitor/node_exporter

setsid ./node_exporter >> /data/logs/node_exporter.log 2>&1

检查是否启动成功, curl -XGET -sI localhost:9100/metrics|head -n1 如果输出 HTTP/1.1 200 OK 则启动成功,否则查看日志 /data/logs/node_exporter.log 定位问题。

安装 redis 监控组件

下载 redis-exporter 到目录 /data/soft 并解压,命令:

1
2
3
4
cd /data/soft
wget https://hekr-files.oss-cn-shanghai.aliyuncs.com/soft/prometheus/linux/redis_exporter-v1.3.4.linux-amd64.tar.gz

tar -xvf redis_exporter-v1.3.4.linux-amd64.tar.gz && mv redis_exporter-v1.3.4.linux-amd64 monitor/redis_exporter

运行 redis_exporter

1
2
cd /data/soft/monitor/redis_exporter
setsid ./redis_exporter -redis.addr='redis://192.168.25.150:6379' --check-keys=db1=stream*,db0=stream* >> /data/logs/redis_exporter.log 2>&1

注意 --check-keys 参数是要监控的 redis 对应 db 的 key名字,比如要监控 db0 中的 stream 开头的key,则加入参数 --check-keys='db1=stream*',多个用英文逗号隔开,这里需要根据 iot 服务部署时候设置的redis db 设置,默认是1。

-redis.addr 是redis 地址。

检查是否启动成功, curl -XGET -sI localhost:9121/metrics|head -n1 如果输出 HTTP/1.1 200 OK 则启动成功,否则查看日志 /data/logs/redis_exporter.log 定位问题。

安装大盘显示组件 Grafana

下载 grafana 到目录 /data/soft 并解压,命令:

1
2
3
4
cd /data/soft
wget https://hekr-files.oss-cn-shanghai.aliyuncs.com/soft/prometheus/linux/grafana-6.4.4.linux-amd64.tar.gz

tar -xvf grafana-6.4.4.linux-amd64.tar.gz && mv grafana-6.4.4 monitor/grafana

启动:

1
2
cd /data/soft/monitor/grafana
setsid bin/grafana-server -config conf/defaults.ini >> /data/logs/grafana.log 2>&1

检查是否启动成功, curl -XGET -sI localhost:3000/login|head -n1 如果输出 HTTP/1.1 200 OK 则启动成功,否则查看日志 /data/logs/grafana.log 定位问题。

安装告警组件 AlertManager

下载 alertmanager 到目录 /data/soft 并解压,命令:

1
2
3
4
cd /data/soft
wget https://hekr-files.oss-cn-shanghai.aliyuncs.com/soft/prometheus/linux/alertmanager-0.19.0.linux-amd64.tar.gz

tar -xvf alertmanager-0.19.0.linux-amd64.tar.gz && mv alertmanager-0.19.0.linux-amd64 monitor/alertmanager

启动:

1
2
cd /data/soft/monitor/alertmanager
setsid ./alertmanager --config.file="alertmanager.yml" >> /data/logs/alertmanager.log 2>&1

检查是否启动成功, curl -XGET -sI http://localhost:9093|head -n1 如果输出 HTTP/1.1 200 OK 则启动成功,否则查看日志 /data/logs/grafana.log 定位问题。

安装 健康检查附加组件

该组件可以针对 http,tcp 等端口进行检查,从而获得服务运行状态并进行告警。

下载 black_exporter 到目录 /data/soft 并解压,命令:

1
2
3
4
cd /data/soft
wget https://hekr-files.oss-cn-shanghai.aliyuncs.com/soft/prometheus/linux/blackbox_exporter-0.16.0.linux-amd64.tar.gz

tar -xvf blackbox_exporter-0.16.0.linux-amd64.tar.gz && mv blackbox_exporter-0.16.0.linux-amd64 monitor/blackbox_exporter

启动:

1
2
cd /data/soft/monitor/blackbox_exporter
setsid ./blackbox_exporter --config.file="blackbox.yml" >> /data/logs/blackbox_exporter.log 2>&1

检查是否启动成功, curl -XGET -sI http://localhost:9115 |head -n1 如果输出 HTTP/1.1 200 OK 则启动成功,否则查看日志 /data/logs/blackbox_exporter.log 定位问题。

配置检查项

将如下配置文件写入(覆盖) balckbox.yml :

1
2
3
4
5
6
7
8
9
10
11
12
13
modules:
http_spring:
prober: http
timeout: 3s
http:
fail_if_body_not_matches_regexp:
- '"status":"UP"'
http_2xx:
prober: http
timeout: 3s
tcp_connect:
prober: tcp
timeout: 3s

使用命令: curl -XPOST http://localhost:9115/-/reload 使其重新加载生效。

检查服务是否都已经启动

检查已经启动服务: ps -ef|grep -v grep|egrep "prometheus|grafana|exporter|alarmmanager"

增加监控配置

修改 /data/soft/monitor/prometheus/prometheus.yml,替换为如下内容,注意 {{}} 内的内容需要动态替换为对应内容。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
global:
scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
# scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- localhost:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
- "rules/*.yml"
# - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
# The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
- job_name: 'prometheus'

# metrics_path defaults to '/metrics'
# scheme defaults to 'http'.

static_configs:
- targets: ['localhost:9090']

- job_name: redis_exporter
static_configs:
- targets: ['{{localhost:9121}}']

- job_name: iot_cloud_os_core
metrics_path: /actuator/prometheus
static_configs:
- targets: ['{{ core ip:port }}']

- job_name: linux_host
static_configs:
- targets: ['{{ localhost:9100 }}']

- job_name: emqx
static_configs:
- targets: ['{{localhost:9540}}']

# iot_cloud_os_core 健康检查
- job_name: health_iot_cloud_os_core
metrics_path: /probe
params:
module: [http_spring]
static_configs:
- targets:
- '{{localhost:8001}}/actuator/health'
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: "{{192.168.25.192:9115}}"

# redis 端口检查
- job_name: health_redis_port
metrics_path: /probe
params:
module: [tcp_connect]
static_configs:
- targets:
- '{{localhost:6379}}'
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: "{{192.168.25.192:9115}}"

# mongodb 端口检查
- job_name: health_mongodb_port
metrics_path: /probe
params:
module: [tcp_connect]
static_configs:
- targets:
- '{{localhost:27017}}'
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: "{{192.168.25.192:9115}}"

# rocketmq 端口检查
- job_name: health_mongodb_port
metrics_path: /probe
params:
module: [tcp_connect]
static_configs:
- targets:
- '{{localhost:9876}}'
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: "{{192.168.25.192:9115}}"

保存后输入 curl -XPOST localhost:9090/-/reload 使Prometheus重新加载配置,如果没有输出则正确。

添加告警配置

创建目录 mkdir -p /data/soft/monitor/prometheus/rules 并将以下文件内容保存 /data/soft/monitor/prometheus/rules 目录下:

health.yml

1
2
3
4
5
6
7
8
9
10
11
groups:
- name: 监控检查
rules:
- alert: "健康检查失败"
expr: probe_success!=1
for: 1m
labels:
severity: critical
annotations:
summary: "健康检查失败 {{$labels.job}} {{$labels.instance}}"
description: "健康检查失败 {{$labels.job}} {{$labels.instance}}"

原文作者:杜龙少(sdvdxl)

原文链接:https://todu.top/posts/d995/

发表日期:2019-12-30 11:54:55

更新日期:2020-04-11 18:33:01

版权声明:本文采用知识共享署名-非商业性使用 4.0 国际许可协议进行许可

CATALOG
  1. 1. Prometheus 安装
    1. 1.1. 安装核心服务 Prometheus
    2. 1.2. 安装Linux监控组件
    3. 1.3. 安装 redis 监控组件
    4. 1.4. 安装大盘显示组件 Grafana
    5. 1.5. 安装告警组件 AlertManager
    6. 1.6. 安装 健康检查附加组件
      1. 1.6.1. 配置检查项
    7. 1.7. 检查服务是否都已经启动
    8. 1.8. 增加监控配置
    9. 1.9. 添加告警配置