[TOC]
0x00 前言
描述:本文章主要针对于本人日常运维所遇到的一些性能问题并进行总结解决思路流程;
无论是 CPU 使用率,还是平均负载,都只是反映系统健康状态的度量指标,而不是问题的根因;
因此它们的价值主要体现在两个方面:
- 一是综合反映当前系统的健康程度,结合监控告警产品,实现快速响应;
- 二是初步定位问题方向,缩小排查范围,降低故障恢复时间。
比如当 CPU iowait 高时,应优先排查磁盘 I/O;当 CPU steal 高时,就优先排查宿主机状态。
#### 0x01 Linux信息收集
描述:当我们对异常系统进行处理,必须先进行主机基础信息的收集,以防出错后可以更快的恢复或者求助;
CentOS系列:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
| #!/bin/bash echo "系统版本:$(cat /etc/redhat-release)" echo "内核信息:$(uname -a)" echo "SeLinux values 设置情况:$(getenforce)"
echo -e "用户信息:\n$(getent passwd)" echo -e "密码信息:\n$(getent shadow)"
echo -e "网络信息:\n$(ip addr show)"
echo "CPU信息:$(cat /proc/cpuinfo | grep name | cut -f2 -d: | uniq -c)" echo "物理CPU数:$(cat /proc/cpuinfo |grep 'physical id'|sort |uniq|wc -l)" echo "逻辑CPU数:$(cat /proc/cpuinfo |grep "processor"|wc -l)" echo "CPU核心数:$(cat cat /proc/cpuinfo |grep "cores"|uniq)" echo "CPU综合信息:\n$(lscpu)"
echo -e "磁盘UUID信息:\n$(blkid)" echo -e "磁盘信息:\n$(fdisk -l | egrep '/dev|Disk')" echo -e "磁盘分区信息:\n$(lsblk)" echo -e "磁盘空间信息:\n$(df -h)"
echo -e "挂载信息:\n$(mount -l)" echo -e "挂载配置文件:\n$(cat /etc/fstab | egrep -v '#|^$')"
|
CPU:通过下面的脚本来打印出当前机器的socket,core和thread的数量
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
| #!/bin/bash
function get_nr_processor() { grep '^processor' /proc/cpuinfo | wc -l }
function get_nr_socket() { grep 'physical id' /proc/cpuinfo | awk -F: '{ print $2 | "sort -un"}' | wc -l }
function get_nr_siblings() { grep 'siblings' /proc/cpuinfo | awk -F: '{ print $2 | "sort -un"}' }
function get_nr_cores_of_socket() { grep 'cpu cores' /proc/cpuinfo | awk -F: '{ print $2 | "sort -un"}' }
echo '===== CPU Topology Table =====' echo
echo '+--------------+---------+-----------+' echo '| Processor ID | Core ID | Socket ID |' echo '+--------------+---------+-----------+'
while read line; do if [ -z "$line" ]; then printf '| %-12s | %-7s | %-9s |\n' $p_id $c_id $s_id echo '+--------------+---------+-----------+' continue fi
if echo "$line" | grep -q "^processor"; then p_id=`echo "$line" | awk -F: '{print $2}' | tr -d ' '` fi
if echo "$line" | grep -q "^core id"; then c_id=`echo "$line" | awk -F: '{print $2}' | tr -d ' '` fi
if echo "$line" | grep -q "^physical id"; then s_id=`echo "$line" | awk -F: '{print $2}' | tr -d ' '` fi done < /proc/cpuinfo
echo
awk -F: '{ if ($1 ~ /processor/) { gsub(/ /,"",$2); p_id=$2; } else if ($1 ~ /physical id/){ gsub(/ /,"",$2); s_id=$2; arr[s_id]=arr[s_id] " " p_id } }
END{ for (i in arr) printf "Socket %s:%s\n", i, arr[i]; }' /proc/cpuinfo
echo echo '===== CPU Info Summary =====' echo
nr_processor=`get_nr_processor` echo "Logical processors: $nr_processor"
nr_socket=`get_nr_socket` echo "Physical socket: $nr_socket"
nr_siblings=`get_nr_siblings` echo "Siblings in one socket: $nr_siblings"
nr_cores=`get_nr_cores_of_socket` echo "Cores in one socket: $nr_cores"
let nr_cores*=nr_socket echo "Cores in total: $nr_cores"
if [ "$nr_cores" = "$nr_processor" ]; then echo "Hyper-Threading: off" else echo "Hyper-Threading: on" fi
echo echo '===== END ====='
|
0x02 异常解决
如何排查用户态 CPU 使用率高?
问题1.业务服务器 CPU 占用负载高问题
用户态 CPU 使用率反映了应用程序的繁忙程度,通常与我们自己写的代码息息相关。因此,当你在做应用发布、配置变更或性能优化时,如果想定位消耗 CPU 最多的 Java 代码,可以遵循如下思路:
排查思路:
1 2 3 4 5 6 7 8 9 10
| $top -n 1
$top -Hp [PID]
printf "0x%x" [PID]
jstack 1040|vim +/0x431 -
pwdx [PID]
|
WeiyiGeek.CPU占用率高