Cover: 《恋する乙女と守護の楯》

Introduction

I recently needed to learn a bit of Prometheus for some ops work, so these are my notes.

Prerequisites

My project uses the Express framework for Node.js, so this post uses Express to walk through basic Prometheus usage. First, create a new Express project.

Express project

# Create a new project
npm init -y

# Install dependencies
npm install typescript \
  ts-node \
  @types/node \
  express \
  @types/express \
  express-prom-bundle \
  prom-client 

# initialize typescript
npx tsc --init

Add a prometheus.ts file with the following content

/*prometheus.ts*/
import express_prom_bundle from "express-prom-bundle";


const prometheusMiddleware = express_prom_bundle({
    includeMethod: true,  // add the HTTP method as a label
    includePath: true,    // add the request path as a label
    promClient: {
        collectDefaultMetrics: {
        }
    },
})

export default prometheusMiddleware;

Then add an app.ts file with the following content

/*app.ts*/
import express, { Express } from 'express';
import prometheusMiddleware from './prometheus';

const PORT: number = parseInt(process.env.PORT || '8080');
const app: Express = express();

// prometheusMiddleware
app.use(prometheusMiddleware);

function getRandomNumber(min: number, max: number) {
  return Math.floor(Math.random() * (max - min + 1) + min);
}

app.get('/rolldice', (req, res) => {
  res.send(getRandomNumber(1, 6).toString());
});

app.listen(PORT, () => {
  console.log(`Listening for requests on http://localhost:${PORT}`);
});

Run the following command to start the project

npx ts-node app.ts

You can now visit http://localhost:8080/rolldice to see the service output, and http://localhost:8080/metrics to see the metrics that Prometheus will scrape.

Prometheus

Prometheus is deployed with Docker. First create a prometheus.yml file with the following content

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["localhost:9090"]
  
  - job_name: "nodejs"
    static_configs:
      - targets: ["192.168.124.22:8080"]

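The targets entry of the nodejs job must be replaced with the address of your own Node.js service. Because Prometheus runs inside a container, localhost there refers to the container itself, so for a service running on your machine use the host address reported by ipconfig or ifconfig.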

docker run -p 9090:9090 -v .\config\prometheus.yml:/etc/prometheus/prometheus.yml -d prom/prometheus

Open http://localhost:9090/targets to see the targets that Prometheus is scraping.

Grafana

Start Grafana

docker run -d -p 3000:3000 --name=grafana grafana/grafana-enterprise

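Open http://localhost:3000. The default username and password are both admin, and you are prompted to change the password after the first login. Under Connections, add a data source, choose Prometheus, and fill in the Prometheus address. The form suggests http://localhost:9090, but because Grafana runs in a container here, localhost would point at the Grafana container itself, so use an address the container can reach, such as the host IP used in prometheus.yml (for example http://192.168.124.22:9090). Then click Save & test to check that the connection works.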

Prometheus

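Prometheus provides a functional query language called PromQL (Prometheus Query Language) that lets users select and aggregate time series data in real time. See the official documentation for the full syntax; only a brief introduction is given here.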

For example, use the up metric to check whether a target is alive

up

up{job="nodejs"}
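
The same query can also be sent to the Prometheus HTTP API (GET /api/v1/query) instead of being typed into the web UI. The following is only a minimal sketch, assuming Prometheus is reachable at http://localhost:9090 as in the setup above. buildInstantQueryUrl and targetsUp are hypothetical helper names introduced for illustration, and the response object is written by hand rather than captured from a real server.

```typescript
// Shape of a successful instant-query response whose resultType is "vector".
interface InstantVectorResponse {
  status: string;
  data: {
    resultType: string;
    result: { metric: Record<string, string>; value: [number, string] }[];
  };
}

// Build the URL of an instant query; the PromQL expression must be URL-encoded.
// (Hypothetical helper, not part of any library.)
function buildInstantQueryUrl(baseUrl: string, promql: string): string {
  return `${baseUrl}/api/v1/query?query=${encodeURIComponent(promql)}`;
}

// Map each returned series to "instance -> is it up?".
// Sample values arrive as strings, and up is "1" for a healthy target.
function targetsUp(response: InstantVectorResponse): Record<string, boolean> {
  const result: Record<string, boolean> = {};
  for (const series of response.data.result) {
    const instance = series.metric["instance"] ?? "unknown";
    result[instance] = series.value[1] === "1";
  }
  return result;
}

// Hand-written response for illustration; a real one would come from fetch(queryUrl).
const sampleResponse: InstantVectorResponse = {
  status: "success",
  data: {
    resultType: "vector",
    result: [
      {
        metric: { __name__: "up", job: "nodejs", instance: "192.168.124.22:8080" },
        value: [1700000000, "1"],
      },
    ],
  },
};

const queryUrl = buildInstantQueryUrl("http://localhost:9090", 'up{job="nodejs"}');
const upByInstance = targetsUp(sampleResponse);
```

Keeping the URL building and the response handling as pure functions makes them easy to check without a running Prometheus.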

Metric types

Prometheus provides four core metric types. The metrics exposed by a target also carry a comment stating each metric's type. For example

# HELP process_cpu_user_seconds_total Total user CPU time spent in seconds.
# TYPE process_cpu_user_seconds_total counter
process_cpu_user_seconds_total 77.95299999999963
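
To make the structure of this text format concrete, here is a rough sketch of a parser for it. It is a simplification written for this post, not how Prometheus itself parses: parseExposition and MetricFamily are hypothetical names, label values containing "}" are not handled, and samples such as foo_bucket are not grouped under their parent metric.

```typescript
interface MetricFamily {
  help: string;
  type: string; // "counter", "gauge", "histogram", "summary" or "untyped"
  samples: { labels: string; value: number }[];
}

// Parse "# HELP", "# TYPE" and sample lines into a map keyed by metric name.
function parseExposition(text: string): Map<string, MetricFamily> {
  const families = new Map<string, MetricFamily>();

  // Fetch the entry for a metric name, creating it on first use.
  const family = (metricName: string): MetricFamily => {
    let entry = families.get(metricName);
    if (entry === undefined) {
      entry = { help: "", type: "untyped", samples: [] };
      families.set(metricName, entry);
    }
    return entry;
  };

  for (const rawLine of text.split("\n")) {
    const line = rawLine.trim();
    if (line === "") continue;

    // Metadata lines: "# HELP <name> <text>" or "# TYPE <name> <type>".
    const meta = /^# (HELP|TYPE) (\S+) (.*)$/.exec(line);
    if (meta !== null) {
      const [, kind = "", metricName = "", rest = ""] = meta;
      if (kind === "HELP") family(metricName).help = rest;
      else family(metricName).type = rest;
      continue;
    }
    if (line.startsWith("#")) continue; // any other comment

    // Sample lines: "<name>{<labels>} <value>", where the label part is optional.
    const sample = /^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)/.exec(line);
    if (sample !== null) {
      const [, metricName = "", labels = "", value = ""] = sample;
      family(metricName).samples.push({ labels, value: Number(value) });
    }
  }
  return families;
}

const exposition = [
  "# HELP process_cpu_user_seconds_total Total user CPU time spent in seconds.",
  "# TYPE process_cpu_user_seconds_total counter",
  "process_cpu_user_seconds_total 77.95299999999963",
].join("\n");

const parsed = parseExposition(exposition);
```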

Each metric type is described below.

Counter

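A Counter is a monotonically increasing value: it can only go up, or be reset to zero on restart. It does not have to be an integer; process_cpu_user_seconds_total above is a counter with a fractional value. Counters are typically used for things like the number of requests or the number of errors.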
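
Because a counter only ever goes up, any drop between two scrapes can be read as a restart. The sketch below shows that idea in isolation: counterIncrease is a hypothetical helper, and it is a simplified version of what PromQL's increase() does (the real function also extrapolates to the edges of the time window).

```typescript
// Total increase across consecutive counter samples.
// A sample lower than its predecessor is treated as a restart from zero,
// so the whole new value counts as increase.
function counterIncrease(samples: number[]): number {
  let total = 0;
  let previous: number | undefined;
  for (const current of samples) {
    if (previous !== undefined) {
      total += current >= previous ? current - previous : current;
    }
    previous = current;
  }
  return total;
}

// 10 -> 15 adds 5, 15 -> 3 is a restart and adds 3, 3 -> 8 adds 5.
const increase = counterIncrease([10, 15, 3, 8]);
```

This is why a raw counter value is rarely graphed directly: the increase or rate over a window is what carries the information.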

Gauge

A Gauge is a value that can go up or down. It reflects the current state of the system, for example the amount of available memory.

Histogram

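A Histogram measures the distribution of observed values. It sorts observations into configurable buckets and counts how many fall into each one; the exposed bucket counts are cumulative, and the total count and sum of all observations are exposed alongside them. Histograms are typically used for things like request latency.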
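
The bookkeeping behind a histogram is small enough to sketch by hand. The class below is an illustration only, not the prom-client implementation: SketchHistogram and its bucket bounds are made up for this example. It shows the cumulative buckets (each le bucket counts every observation less than or equal to its bound) together with the running sum and count.

```typescript
class SketchHistogram {
  private readonly upperBounds: number[];
  private readonly bucketCounts: number[];
  sum = 0;
  count = 0;

  constructor(upperBounds: number[]) {
    // Sort the finite bounds and append +Inf, which covers every observation.
    this.upperBounds = [...upperBounds].sort((a, b) => a - b).concat(Infinity);
    this.bucketCounts = this.upperBounds.map(() => 0);
  }

  observe(value: number): void {
    this.sum += value;
    this.count += 1;
    // Buckets are cumulative: bump every bucket whose upper bound covers the value.
    this.upperBounds.forEach((le, i) => {
      if (value <= le) this.bucketCounts[i] = (this.bucketCounts[i] ?? 0) + 1;
    });
  }

  // The per-bucket series a /metrics endpoint would expose for this histogram.
  snapshot(): { le: string; count: number }[] {
    return this.upperBounds.map((le, i) => ({
      le: le === Infinity ? "+Inf" : String(le),
      count: this.bucketCounts[i] ?? 0,
    }));
  }
}

// Four request durations in seconds, with buckets at 0.25 s, 0.5 s and 1 s.
const latency = new SketchHistogram([0.25, 0.5, 1]);
for (const seconds of [0.125, 0.5, 0.75, 2]) {
  latency.observe(seconds);
}
```

Observing a value costs only a few increments, which is why histograms are cheap on the client side.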

Summary

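A Summary is similar to a Histogram, but instead of exposing buckets it computes the configured φ-quantiles on the client side, typically over a sliding time window. It also tracks the total number of observations and the sum of the observed values. Summaries are likewise typically used for things like request latency.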
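
To show what computing quantiles on the client means, here is a deliberately naive sketch. It keeps every observation and sorts them on demand, which real client libraries avoid by using streaming estimators over a sliding window. NaiveSummary is a hypothetical class for illustration, and the nearest-rank rule used here is one common quantile definition among several.

```typescript
class NaiveSummary {
  private readonly observations: number[] = [];
  sum = 0;

  observe(value: number): void {
    this.observations.push(value);
    this.sum += value;
  }

  get count(): number {
    return this.observations.length;
  }

  // Nearest-rank phi-quantile (0 < phi <= 1) over everything observed so far.
  quantile(phi: number): number {
    if (this.observations.length === 0) return NaN;
    const sorted = [...this.observations].sort((a, b) => a - b);
    // Rank is 1-based: the smallest value whose position covers phi of the data.
    const rank = Math.max(1, Math.ceil(phi * sorted.length));
    return sorted[Math.min(rank, sorted.length) - 1] ?? NaN;
  }
}

const durations = new NaiveSummary();
for (const seconds of [7, 1, 10, 3, 5, 9, 2, 8, 4, 6]) {
  durations.observe(seconds);
}
const median = durations.quantile(0.5);
```

The sort on every call is exactly the cost that makes summaries more expensive on the client than histograms.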

Differences between Histogram and Summary

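The difference is that a Summary streams the φ-quantile computation on the client, while Histogram quantiles are computed on the server. Histogram quantiles are estimated from buckets, whereas Summary quantiles are computed from the observed samples. The differences in detail: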

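	Histogram	Summary
Query	`histogram_quantile(0.95,sum(rate(http_request_duration_seconds_bucket[5m])) by (le))`	`http_request_duration_seconds_summary{quantile="0.95"}`
Required configuration	Pick buckets suitable for the expected range of observed values	Pick the desired φ-quantiles and sliding window. Other φ-quantiles and sliding windows cannot be calculated later
Client performance	Observations are very cheap, as they only need to increment counters	Observations are expensive because of the streaming quantile calculation
Server performance	The server has to calculate quantiles at query time	Low server-side cost
Additional time series	One time series per configured bucket	One time series per configured quantile
Quantile error	Error is limited in the dimension of observed values by the width of the relevant bucket	Error is limited in the dimension of φ by a configurable value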
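
The server-side calculation in the table can be made concrete with a small sketch of bucket-based quantile estimation. This is a simplified illustration of the idea behind histogram_quantile (linear interpolation inside the bucket that contains the target rank), not Prometheus's actual code. histogramQuantile and the bucket numbers are hypothetical, and the sketch assumes non-negative observations and buckets sorted by upper bound with +Inf last.

```typescript
interface CumulativeBucket {
  le: number;    // upper bound of the bucket; the last one must be Infinity
  count: number; // cumulative number of observations <= le
}

// Estimate the phi-quantile (0 < phi <= 1) from cumulative buckets.
function histogramQuantile(phi: number, buckets: CumulativeBucket[]): number {
  const last = buckets[buckets.length - 1];
  if (last === undefined || buckets.length < 2 || last.count === 0) return NaN;
  if (!(phi > 0 && phi <= 1)) return NaN;

  const rank = phi * last.count; // position of the wanted observation
  let lowerBound = 0;            // assumption: the first bucket starts at 0
  let countBelow = 0;            // observations in all lower buckets
  for (const bucket of buckets) {
    if (bucket.count >= rank) {
      // The +Inf bucket has no finite width, so fall back to the highest finite bound.
      if (bucket.le === Infinity) return lowerBound;
      // Assume observations are spread evenly inside the bucket.
      const fraction = (rank - countBelow) / (bucket.count - countBelow);
      return lowerBound + (bucket.le - lowerBound) * fraction;
    }
    lowerBound = bucket.le;
    countBelow = bucket.count;
  }
  return NaN;
}

// 100 requests: 50 took <= 0.1 s, 90 took <= 0.5 s, all took <= 1 s.
const requestBuckets: CumulativeBucket[] = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 100 },
  { le: Infinity, count: 100 },
];
const p95 = histogramQuantile(0.95, requestBuckets);
```

In this example the 95th observation lies halfway through the (0.5, 1] bucket, so the estimate lands in the middle of that bucket. The true value could be anywhere inside it, which is the bucket-width error described in the last row of the table.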

Grafana

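We usually do not visualize data directly in Prometheus, but use a third-party visualization tool such as Grafana. Grafana is an open-source analytics and monitoring platform that supports many data sources, such as Prometheus, Jaeger, and Loki. It offers rich visualization features that help users understand and analyze their data, and it also lets you import dashboards that other people have published online.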

The earlier sections covered installing Grafana and adding the Prometheus data source. Next, we look at how to use Grafana for visualization.

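CPU usage

process_cpu_seconds_total is a counter of cumulative CPU time, so CPU usage is its per-second rate: a result of 0.25 means the process kept a quarter of one core busy. Dividing process_cpu_user_seconds_total by process_cpu_seconds_total would instead give the share of CPU time spent in user mode.

rate(process_cpu_seconds_total{job="nodejs"}[1m])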

QPM

rate() returns a per-second value, so multiply it by 60 to get queries per minute:

sum(rate(http_request_duration_seconds_count{job="nodejs"}[1m])) * 60
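
Both dashboard queries in this section are built on counters, and turning a counter into a per-second or per-minute figure comes down to the increase divided by the elapsed time. The sketch below works through that arithmetic with made-up sample values. perSecondRate is a hypothetical helper, and it assumes no counter reset between the two samples.

```typescript
interface CounterSample {
  timestampSeconds: number;
  value: number;
}

// Average per-second rate of a counter between two samples (no reset handling).
function perSecondRate(first: CounterSample, last: CounterSample): number {
  const elapsed = last.timestampSeconds - first.timestampSeconds;
  if (elapsed <= 0) return NaN;
  return (last.value - first.value) / elapsed;
}

// Request counter: 1200 -> 1290 requests within 60 seconds.
const requestsPerSecond = perSecondRate(
  { timestampSeconds: 0, value: 1200 },
  { timestampSeconds: 60, value: 1290 },
);
const requestsPerMinute = requestsPerSecond * 60;

// CPU seconds counter: 77 -> 80 seconds of CPU time within 60 seconds of wall time.
const coresUsed = perSecondRate(
  { timestampSeconds: 0, value: 77 },
  { timestampSeconds: 60, value: 80 },
);
```

Here the service handled 1.5 requests per second, or 90 per minute, and used about 5% of one core.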
