8. Prometheus Monitoring: Summary of Problems Encountered and Their Solutions

2021-04-28
monitor
kubernetes
OperationTools

[TOC]

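0x01 Prometheus Installation Issues

Problem 1. Requesting /metrics on node_exporter port 9100 fails with `context deadline exceeded`

Error message: Get http://192.168.90.177:9100/metrics: context deadline exceeded
Cause: The request timed out, most likely because the port is not open in the host firewall.
Solution: Start node_exporter on a different port, or change the firewall policy to allow access.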

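# Option 1: start node_exporter with an explicit listen address (change the port here if needed)
nohup ./node_exporter --web.listen-address=":9100" &

# Option 2: open the port in the firewall
# CentOS
firewall-cmd --zone=public --add-port=9100/tcp --permanent  # without --permanent the rule is lost after a restart
firewall-cmd --reload  # reload the configuration so the rule takes effect
# Ubuntu
sudo ufw allow 9100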
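
Before changing the firewall, it can help to confirm how the request fails. The sketch below is not part of the original troubleshooting steps: it maps curl's documented exit codes 7 (failed to connect) and 28 (operation timeout) to a likely cause. The helper names `explain_curl_exit` and `probe_metrics` are made up for illustration, and the interpretation of each code is a rule of thumb rather than a diagnosis.

```shell
# Map a curl exit code to a likely cause (7 and 28 are documented curl exit codes).
explain_curl_exit() {
  case "$1" in
    0)  echo "ok: endpoint reachable" ;;
    7)  echo "refused: nothing is listening on that port, or the firewall rejects the connection" ;;
    28) echo "timeout: packets are probably dropped by a firewall" ;;
    *)  echo "other: curl exit code $1" ;;
  esac
}

# Probe a node_exporter endpoint and explain the result.
# The host is a placeholder; the port defaults to 9100.
probe_metrics() {
  host="$1"
  port="${2:-9100}"
  rc=0
  curl --silent --output /dev/null --max-time 5 "http://${host}:${port}/metrics" || rc=$?
  explain_curl_exit "$rc"
}

# Example: probe_metrics 192.168.90.177 9100
```

A timeout matches the `context deadline exceeded` symptom, while a refused connection suggests that node_exporter is not running or listens on another address.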

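Problem 2. Installing the Prometheus server with Docker fails with `msg="Failed to create directory for logging active queries"`

Cause: Either prometheus.yml does not exist at the bind-mounted host path, so Docker creates a directory there instead of a file, or the Prometheus process lacks write permission on the persistent data directory.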

$ docker logs prometheus_server
  # prometheus_server | level=error ts=2021-04-30T07:50:11.241Z caller=query_logger.go:109 component=activeQueryTracker msg="Failed to create directory for logging active queries"
  # prometheus_server | level=error ts=2021-04-30T07:50:11.241Z caller=query_logger.go:87 component=activeQueryTracker msg="Error opening query log file" file=data/queries.active err="open data/queries.active: no such file or directory"
  # prometheus_server | panic: Unable to create mmap-ed active query log
  # component=activeQueryTracker msg="Error opening query log file" file=/prometheus/queries.active err="open /prometheus/queries.active: permission denied"
# panic: Unable to create mmap-ed active query log

Solution: Create prometheus.yml at the mounted host path and give the data directory the required permissions.

# Quick fix: make the data directory writable for the container user
chmod 777 /nfsdisk-31/monitor/prometheus
docker run -p 9090:9090 -v /tmp/prometheus.yml:/etc/prometheus/prometheus.yml \
  -v /nfsdisk-31/monitor/prometheus:/prometheus \
  prom/prometheus
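
A narrower alternative to a world-writable directory is to hand the data directory to the user the container runs as. The following sketch assumes the prom/prometheus image runs as `nobody` (UID/GID 65534), which you should verify for your image version. `prepare_prom_dirs` is a hypothetical helper, and the prometheus.yml it generates is only a minimal placeholder.

```shell
# prepare_prom_dirs DATA_DIR CONFIG_FILE  (hypothetical helper)
prepare_prom_dirs() {
  data_dir="$1"
  config_file="$2"

  # Docker creates a directory at a missing bind-mount source, so refuse that case.
  if [ -d "$config_file" ]; then
    echo "error: $config_file is a directory, remove it first" >&2
    return 1
  fi

  mkdir -p "$data_dir" "$(dirname "$config_file")"

  # Write a minimal placeholder config only when none exists yet.
  if [ ! -f "$config_file" ]; then
    cat > "$config_file" <<'EOF'
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']
EOF
  fi

  # Assumption: the container user is nobody (65534). chown needs root.
  if [ "$(id -u)" -eq 0 ]; then
    chown -R 65534:65534 "$data_dir"
  else
    echo "note: run as root to chown $data_dir to 65534:65534" >&2
  fi
}

# Example: prepare_prom_dirs /nfsdisk-31/monitor/prometheus /tmp/prometheus.yml
```

Run it once before `docker run`, so that both bind-mount sources exist with the expected type and owner.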

Tip: In the Prometheus 2.x image the data directory is /prometheus, not /prometheus-data.

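Problem 3. The Prometheus server logs the warning `This filesystem is not supported and may lead to data corruption and data loss` at startup.

Background: Prometheus includes a local on-disk time series database and can optionally integrate with remote storage systems. The warning means the data directory sits on a file system that the local database does not support, which may lead to data corruption and data loss.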

prometheus_server | level=warn ts=2021-05-10T05:46:49.843Z caller=main.go:813 fs_type=NFS_SUPER_MAGIC msg="This filesystem is not supported and may lead to data corruption and data loss. Please carefully read https://prometheus.io/docs/prometheus/latest/storage/ to learn more about supported filesystems."

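Cause: Prometheus local storage does not support file systems that are not POSIX-compliant, because unrecoverable corruption may occur. NFS (including AWS EFS) is not supported: NFS could be POSIX-compliant, but most implementations are not. A local file system is strongly recommended for reliability, so this kind of shared storage should be avoided.
Solution: If local storage has become corrupted, the best strategy is to shut down Prometheus and remove the entire storage directory; you can also try removing individual block directories or the WAL directory. If the data is intact and the only issue is the warning about NFS, move the data directory to directly attached storage such as FC SAN, or add local disk capacity.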

Problem 4. Monitoring an external Kubernetes cluster with Prometheus fails with `x509: certificate signed by unknown authority`

Error message:

prometheus_server | level=error ts=2021-05-10T05:43:12.126Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.20.5/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"https://k8s-dev.weiyigeek:6443/api/v1/pods?limit=500&resourceVersion=0\": x509: certificate signed by unknown authority"

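Solution: In the main prometheus.yml configuration, set insecure_skip_verify: true in the tls_config of the kubernetes_sd_configs entry to skip TLS verification. This disables certificate validation, so use it only for a cluster you trust.

tls_config:
  insecure_skip_verify: true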
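
Skipping verification works around the symptom; the underlying issue is that Prometheus does not know the CA that signed the API server certificate. As a hedged alternative, the sketch below writes a scrape job that trusts the cluster CA through `ca_file`. The job name, the token and CA paths, and the helper `write_k8s_scrape_job` are assumptions for illustration; the API server URL is taken from the log line above.

```shell
# Write an example scrape job for an external cluster to the given file.
write_k8s_scrape_job() {
  out_file="$1"
  cat > "$out_file" <<'EOF'
scrape_configs:
  - job_name: 'external-k8s-pods'
    kubernetes_sd_configs:
      - role: pod
        api_server: 'https://k8s-dev.weiyigeek:6443'
        # Assumed path: token of the ServiceAccount used for discovery
        bearer_token_file: /etc/prometheus/k8s-token
        tls_config:
          # Assumed path: CA certificate copied from the external cluster
          ca_file: /etc/prometheus/k8s-ca.crt
          # Fallback that disables verification entirely:
          # insecure_skip_verify: true
EOF
}

# Example: write_k8s_scrape_job ./k8s-scrape-job.yml
```

The `tls_config` block sits inside the `kubernetes_sd_configs` entry because the failing request in the log is a discovery call to the API server, not a scrape of a target.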

Problem 5. Monitoring an external Kubernetes cluster with Prometheus fails with `cannot list resource \"services\" in API group`

Error message:

prometheus_server | level=error ts=2021-05-10T09:09:30.960Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.20.5/tools/cache/reflector.go:167: Failed to watch *v1.Service: failed to list *v1.Service: services is forbidden: User \"system:serviceaccount:default:prometheus\" cannot list resource \"services\" in API group \"\" at the cluster scope"

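Cause: The prometheus ServiceAccount that owns the token has no permission to list services. Add the missing permission to its ClusterRole.

kubectl get clusterrole prometheus           # confirm that the ClusterRole exists
  # NAME         CREATED AT
  # prometheus   2021-05-14T08:29:08Z
kubectl get clusterrole prometheus -o yaml   # show its rules as YAML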
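
The missing permission can be granted with a ClusterRole rule on the core API group. The manifest below is a sketch of a commonly used read-only rule set for Prometheus service discovery, not the author's actual role. The ServiceAccount name and namespace come from the error message, and `write_prometheus_rbac` is a hypothetical helper that only writes the file.

```shell
# Write an example ClusterRole and ClusterRoleBinding manifest to the given file.
write_prometheus_rbac() {
  out_file="$1"
  cat > "$out_file" <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  # Read-only access to the objects Prometheus discovers
  - apiGroups: [""]
    resources: ["nodes", "nodes/proxy", "nodes/metrics", "services", "endpoints", "pods"]
    verbs: ["get", "list", "watch"]
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  # ServiceAccount named in the error: system:serviceaccount:default:prometheus
  - kind: ServiceAccount
    name: prometheus
    namespace: default
EOF
}

# Example:
#   write_prometheus_rbac ./prometheus-rbac.yaml
#   kubectl apply -f ./prometheus-rbac.yaml
#   kubectl auth can-i list services --as=system:serviceaccount:default:prometheus
```

The last command in the example should print `yes` once the binding is in place.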

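0x02 Prometheus Usage Issues

Problem 1. Warning: Error fetching server time: Detected 82.30200004577637 seconds time difference between your browser and the server.

Error message: The web UI detected a time difference of 82.30200004577637 seconds between the browser and the server. Prometheus relies on accurate time, and time drift can cause unexpected query results.
Cause: The server clock and the local clock disagree. Note: in the Prometheus web UI, no data can be queried when the offset exceeds 5 minutes; when the offset is below 5 minutes, data can be queried and is displayed normally in Grafana.
Solution: Synchronize the server clock with the local machine's clock.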
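
To measure the offset from the command line instead of relying on the browser warning, one option is to ask Prometheus for its own clock with the PromQL `time()` function and compare it with the local clock. This is a sketch: the helper names are hypothetical, the parsing assumes the scalar result format of the HTTP query API, and the base URL is a placeholder. It measures the drift of the machine that runs the script, so run it on the workstation where the browser shows the warning.

```shell
# Read a query API response on stdin and print the integer epoch seconds of a scalar result.
extract_epoch() {
  sed -n 's/.*"result": *\[\([0-9][0-9]*\).*/\1/p'
}

# Print the absolute difference between two integer timestamps.
abs_diff() {
  d=$(( $1 - $2 ))
  if [ "$d" -lt 0 ]; then
    d=$(( 0 - d ))
  fi
  echo "$d"
}

# Print the drift in whole seconds between this machine and a Prometheus server.
prometheus_drift() {
  base_url="$1"   # placeholder, for example http://127.0.0.1:9090
  server="$(curl -sG --max-time 5 "${base_url}/api/v1/query" \
    --data-urlencode 'query=time()' | extract_epoch)"
  if [ -z "$server" ]; then
    echo "error: no timestamp in response" >&2
    return 1
  fi
  abs_diff "$(date +%s)" "$server"
}

# Example: prometheus_drift http://127.0.0.1:9090
```

If the reported drift is large, synchronizing both machines with an NTP client is the usual fix.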

Figure: weiyigeek.top-Prometheus-time-sync

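Problem 2. postgres_exporter fails to collect metrics after startup

Solution: Set the DATA_SOURCE_NAME environment variable for the Linux user that starts postgres_exporter.

# Add the environment variable
tee -a ~/.bash_profile <<'EOF'
export DATA_SOURCE_NAME="postgresql://postgres:postgres@127.0.0.1:5432/postgres?sslmode=disable"
EOF

# Load the environment variable into the current shell
source ~/.bash_profile && echo "$DATA_SOURCE_NAME"

# Start the postgres exporter
./postgres_exporter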

0x03 AlertManager Usage Issues

Problem 1. Sending alerts through a corporate mailbox fails with `email.loginAuth failed: 530 Must issue a STARTTLS command first`

Cause: The mail server requires TLS and valid authentication.
Solution:

smtp_require_tls: true

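Problem 2. Sending alerts through a corporate mailbox fails with `starttls failed: x509: certificate signed by unknown authority`

Cause: The server certificate is signed by an unknown authority, so the client does not trust it.
Solution: Set insecure_skip_verify: true in the tls_config block under email_configs to skip TLS verification.

receivers:
- name: 'default-email'
  email_configs:
  - to: 'master@weiyigeek.top'
    send_resolved: true
    tls_config:
      insecure_skip_verify: true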

Problem 3. Errors recorded while configuring email alerts

# - With smtp.weiyigeek.top:25 in the global section
Error output:
level=error ts=2020-04-08T06:02:44.036Z caller=notify.go:372 component=dispatcher msg="Error on notify" err="send STARTTLS command: x509: certificate is valid for *.mxhichina.com, mxhichina.com, not smtp.weiyigeek.top" context_err="context deadline exceeded"
level=error ts=2020-04-08T06:02:44.036Z caller=dispatch.go:301 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="send STARTTLS command: x509: certificate is valid for *.mxhichina.com, mxhichina.com, not smtp.weiyigeek.top"

# - With the following in the global section
# smtp.weiyigeek.top
# smtp_require_tls: false
Error output:
level=warn ts=2020-10-12T10:34:11.780Z caller=notify.go:674 component=dispatcher receiver=mail-receiver integration=email[0] msg="Notify attempt failed, will retry later" attempts=1 err="*smtp.plainAuth auth: unencrypted connection"
level=error ts=2020-10-12T10:34:21.581Z caller=dispatch.go:309 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="mail-receiver/email[0]: notify retry canceled after 7 attempts: *smtp.plainAuth auth: unencrypted connection"

# - With smtp.qiye.aliyun.com:465
Error output:
level=warn ts=2020-10-12T11:36:41.779Z caller=notify.go:674 component=dispatcher receiver=mail-receiver integration=email[0] msg="Notify attempt failed, will retry later" attempts=1 err="'require_tls' is true (default) but \"smtp.qiye.aliyun.com:465\" does not advertise the STARTTLS extension"
level=error ts=2020-10-12T11:36:51.578Z caller=dispatch.go:309 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="mail-receiver/email[0]: notify retry canceled after 8 attempts: 'require_tls' is true (default) but \"smtp.qiye.aliyun.com:465\" does not advertise the STARTTLS extension"
tail: /opt/logs/alertmanager-9093.log: file truncated

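Solution: With the following two lines in the global section, mail is sent normally. Port 465 expects TLS from the start of the connection, so the server does not advertise STARTTLS and smtp_require_tls has to be false.

smtp_smarthost: 'smtp.qiye.aliyun.com:465'
smtp_require_tls: false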
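
The three failed attempts above suggest a simple rule: the value of `smtp_require_tls` has to match how the port handles TLS. The sketch below encodes that rule under the assumption that port 465 uses implicit TLS and every other port expects a STARTTLS upgrade. The helper names, the sender address, and the password placeholder are hypothetical.

```shell
# Print the smtp_require_tls value that fits the given SMTP port.
require_tls_for_port() {
  case "$1" in
    465) echo "false" ;;  # implicit TLS: no STARTTLS upgrade is offered
    *)   echo "true" ;;   # e.g. 25 or 587: upgrade the session with STARTTLS
  esac
}

# Write an example Alertmanager global SMTP section to a file.
write_smtp_global() {
  host="$1"
  port="$2"
  out_file="$3"
  cat > "$out_file" <<EOF
global:
  smtp_smarthost: '${host}:${port}'
  smtp_from: 'alert@example.com'           # placeholder sender
  smtp_auth_username: 'alert@example.com'  # placeholder account
  smtp_auth_password: 'CHANGE_ME'          # placeholder secret
  smtp_require_tls: $(require_tls_for_port "$port")
EOF
}

# Example: write_smtp_global smtp.qiye.aliyun.com 465 ./smtp-global.yml
```

Using the provider's own host name (here smtp.qiye.aliyun.com) also avoids the certificate name mismatch seen with smtp.weiyigeek.top in the first attempt.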

Problem 4. Email alerts fail with `notify retry canceled after 2 attempts: *email.loginAuth auth: 535 Error:`

Error message:

level=error ts=2021-05-20T14:34:43.637Z caller=dispatch.go:309 component=dispatcher msg="Notify for alerts failed" num_alerts=2 err="default-email/email[0]: notify retry canceled after 2 attempts: *email.loginAuth auth: 535 Error: \ufffd\ufffdʹ\ufffd\ufffd\ufffd\ufffdȨ\ufffd\ufffd\ufffd\ufffd¼\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd뿴: http://service.mail.qq.com/cgi-bin/help?subtype=1&&id=28&&no=1001256"

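Cause: The mailbox password is not the dedicated authorization password for third-party clients, or the configured tmpl template path is wrong. Apply for a dedicated password for the sending mailbox at the address shown in the error. (In testing, QQ Mail requires the dedicated third-party password, while the corporate mailbox also accepts the login password, which is not recommended.)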

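Like-minded readers are welcome to learn and exchange ideas together. If you find a mistake in this article, please share your experience in the comments below, or contact me at my personal email [master#weiyigeek.top] or through my WeChat official account [WeiyiGeek].

More articles are available at [WeiyiGeek Blog - to reach distant places, not a single step can be skipped], home page: https://weiyigeek.top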

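Last updated: 2023-06-06 17:20:32
Original file path: _posts/虚拟云容/云容器/Kubernetes/功能组件/Prometheus/8.Prometheus监控之所遇问题解决总结.md
Please credit the source when reposting. Original URL: https://blog.weiyigeek.top/2021/4-28-579.html
Content on this site is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International.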

WeiyiGeeker
