# 11 | Opening the Home Page (Part 1): A Case Study of Performance Problems in the Basic Hardware Infrastructure

Hello, I'm Gao Lou.

In this lesson, I'll take you through the first part of a complete performance analysis case: we run a load scenario against the "open home page" interface and analyze the performance problems we find. Through this case, you will see all kinds of performance problems at the basic hardware infrastructure level, such as problems caused by over-provisioned virtual machines, problems caused by the CPU's run mode, high I/O, and hardware resources being exhausted while TPS stays low.

If you are building a complete project from scratch, these are very likely the first problems you will have to face, and solving them well is an essential skill for a performance analyst. You will also see that the analysis path differs depending on which counters the data was collected from; this analysis path is the evidence chain I keep emphasizing. If you're unsure about it, you can review [Lesson 3](https://time.geekbang.org/column/article/355982).

Through this lesson, I hope you'll see that some performance problems are not as singular as they look, and that no matter where a performance problem sits, we have to deal with it.

All right, enough preamble. Let's pick apart the performance bottlenecks of the home page interface.

## Looking at the Architecture Diagram

Before each round of bottleneck analysis, I draw a diagram like the one below to see which services and technical components the interface involves. It helps a great deal with the analysis that follows.

![](https://static001.geekbang.org/resource/image/69/8d/6913fb342aa32fae5b46c6f1ecddc58d.png)

If you have a tool that can display this directly, even better. If you don't, then I suggest you not be so confident that you can keep even a simple architecture in your head. Trust me, even a rough sketch on paper will do a lot for your later line of analysis.

Back to the diagram above: we can clearly see that the call path for opening the home page is User - Gateway (Redis) - Portal - (Redis, MySQL).

## A Quick Look at the Code Logic

Before running the baseline scenario for the home page, I suggest you take a look at how the interface is implemented. The code shows which actions the interface performs, and from those actions we can reason about the downstream call chain.

The logic is simple: collect the various pieces of information shown on the home page and return them as JSON.

```
public HomeContentResult contentnew() {
    HomeContentResult result = new HomeContentResult();
    if (redisService.get("HomeContent") == null) {
        // Home page advertisements
        result.setAdvertiseList(getHomeAdvertiseList());
        // Recommended brands
        result.setBrandList(homeDao.getRecommendBrandList(0, 6));
        // Flash sale information
        result.setHomeFlashPromotion(getHomeFlashPromotion());
        // New product recommendations
        result.setNewProductList(homeDao.getNewProductList(0, 4));
        // Popular product recommendations
        result.setHotProductList(homeDao.getHotProductList(0, 4));
        // Recommended subjects
        result.setSubjectList(homeDao.getRecommendSubjectList(0, 4));
        // Cache the assembled result in Redis; later requests read from the cache
        redisService.set("HomeContent", result);
    }
    Object homeContent = redisService.get("HomeContent");
    // result = JSON.parseObject(homeContent.toString(), HomeContentResult.class);
    result = JSONUtil.toBean(JSONUtil.toJsonPrettyStr(homeContent), HomeContentResult.class);

    return result;
}
```

We can see that it calls six methods in total, and all of them query the database directly. That's all there is to it.

## Determining the Load Parameters

With the code logic understood, let's do a trial run with 10 threads and watch how TPS trends as the threads ramp up one by one.

After the run, we get this result:

![](https://static001.geekbang.org/resource/image/87/0f/876yya208a923dd9c42fe9538063b10f.png)

From the result, a single thread produces around 40 TPS at the start. Now we need to think: **if we want to run a scenario that can push the home page interface to its maximum TPS, how should we set the thread count, the ramp-up strategy, and the duration in the load tool?**

To answer that, let's first look at the hardware usage of the machine hosting the Portal application node, to understand the relationship between the TPS trend and resource utilization. The machine looks like this (note that I skipped the node hosting the Gateway):

![](https://static001.geekbang.org/resource/image/d9/67/d991a5548f72d6f7bcf0257c40da6b67.png)

We can see that the Portal node runs on an 8C16G machine (a virtual machine), and that this machine is under almost no pressure.

For now, let's ignore other resources and consider only the 8C16G configuration. If TPS grew linearly, then when this machine's CPU usage reached 100%, TPS would be roughly 800. So the thread count in our load tool should be:

$$ \text{threads} = 800\ \text{TPS} \div 40\ \text{TPS} = 20\ \text{threads} $$

That said, as the load is sustained, TPS and resource utilization won't hold that proportional relationship. Under load, the consumption of various resources adds some response time, and that loss is a normal part of response time.

With the thread count settled, let's look at how to set the ramp-up strategy.

I want the ramp-up to be slow enough that we can watch the performance data react at each point in the chain. Based on the performance analysis decision tree from [Lesson 2](https://time.geekbang.org/column/article/355019), there are quite a few counters to examine in this kind of scenario, so I add one thread every 30 seconds, which makes the ramp-up period 600 seconds.

With the load parameters determined, we can configure the trial-run scenario in JMeter with the following values:

```
<stringProp name="ThreadGroup.num_threads">20</stringProp>
<stringProp name="ThreadGroup.ramp_time">600</stringProp>
<boolProp name="ThreadGroup.scheduler">true</boolProp>
<stringProp name="ThreadGroup.duration">700</stringProp>
```

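
If you drive this from the command line, a typical non-GUI invocation looks like the sketch below. The file names are hypothetical; the flags themselves (-n for non-GUI mode, -t for the test plan, -l for the results log) are standard JMeter options.

```
# Run the plan without the GUI and log every sample for later analysis
jmeter -n -t homepage.jmx -l homepage-result.jtl
```
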
With the trial-run parameters in place, we can then configure enough threads on top of this scenario to drive resource utilization to its maximum.

You may wonder: don't we need even more threads? If you want a normal scenario, then no, you don't. If you just want to see what happens when you pile on too many threads, feel free to try; when executing performance scenarios, I often play with the load in all sorts of ways too.

That said, there is one situation that genuinely calls for more load: when the response time has increased, but not by much, and TPS has stopped rising. At that point, splitting the response time is difficult, especially on fast systems where the differences may be only a few milliseconds. So in this situation we add more threads to make the slow spots stand out more clearly, which makes the time easier to split.

With the ramp-up configured this way (earlier we calculated that 20 threads are enough to reach the maximum; here I started the scenario with 100 threads, in order to see the TPS trend and the growth in response time as the load ramps higher, which makes the time easier to split), we can see that the interface's response time does climb steadily, and as the thread count grows it quickly rises to several hundred milliseconds. That is an obvious bottleneck, and clearly not one we can accept.

![](https://static001.geekbang.org/resource/image/b5/b9/b51a9979095ba9bd963e657b96fyy0b9.png)

Next, we need to analyze carefully where this response time is actually being spent.

## Splitting the Response Time

As mentioned earlier, the logic of opening the home page is User - Gateway (Redis) - Portal - (Redis, MySQL), so let's follow that path and split the response time with the distributed tracing tool SkyWalking.

![](https://static001.geekbang.org/resource/image/1d/99/1d3b42340dd5dfdda16bdf1332d34c99.png)

* **Time consumed between User and Gateway**

![](https://static001.geekbang.org/resource/image/bd/b9/bd8117bdc6124d95893b16c7653be7b9.png)

We can see that the time between User and Gateway climbs slowly to around 150 ms.

* **Gateway response time**

![](https://static001.geekbang.org/resource/image/73/37/733a7e7f13ea455826aee0bbb2393237.png)

About 150 ms is also consumed on the Gateway itself, which means the network between User and Gateway consumes very little time, on the order of milliseconds.

* **Time consumed between Gateway and Portal**

![](https://static001.geekbang.org/resource/image/f9/fe/f9ac563b64d7a1c6e923d8222fdfyyfe.png)

On Portal, only about 50 ms of response time is consumed. Let's confirm that on Portal itself.

* **Portal response time**

![](https://static001.geekbang.org/resource/image/72/49/721ae8d7ef027fdc56dd05860cafa849.png)

Portal's response time is around 50 ms, consistent with what we saw above.

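
To make the subtraction explicit: the total observed at the Gateway span is about 150 ms, of which Portal accounts for only about 50 ms, so the Gateway itself consumes roughly

$$ 150\ \text{ms} - 50\ \text{ms} \approx 100\ \text{ms} $$
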
Splitting the response time this way, we can determine that it is the Gateway that consumes the response time, and that this consumption reaches nearly 100 ms. So our next target is the Gateway.

## Locating the Response Time Consumed on the Gateway

#### Stage 1: Analyzing st CPU

Since the Gateway consumes so much response time, we naturally want to check where this host is spending its time.

Our analysis logic is still **global monitoring first, targeted monitoring second**. Global monitoring starts from the whole architecture and then narrows down to the resource consumption of a particular node. Note that when looking at global monitoring, we start from the most basic layer, and the most basic layer in the analysis is the operating system.

With the top command, we can see the resource situation on the Gateway node:

![](https://static001.geekbang.org/resource/image/c5/f5/c5af9b566db8bdeb7fc8d6ea448aa2f5.png)

Here, st CPU reaches about 15%. As we know, st (steal) CPU is the CPU taken away from a virtual machine by other applications or virtual machines on the same host, so a value this high is clearly abnormal. We need to dig into why st CPU is so high.

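
If you want to watch steal time yourself, here is a minimal sketch using standard Linux tooling: the st column in vmstat (and top) reports steal directly, and the eighth field after the "cpu" label in /proc/stat is the kernel's cumulative steal tick counter.

```
# Sample CPU usage once per second, five times; the last column (st) is steal
vmstat 1 5

# Read the cumulative steal ticks straight from the kernel counters
awk '/^cpu /{print "steal ticks:", $9}' /proc/stat
```
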
Let's use the mpstat command to look at the resource behavior on the host (the physical machine running the virtual machine that hosts the Gateway):

![](https://static001.geekbang.org/resource/image/f4/90/f40e7c2f2a790b289b6d2332dbc47390.png)

We can see about 20% of the CPU is still unused, so the host has headroom. Still, the host's CPU usage is not small, and the only things consuming it are the applications inside the virtual machines. So we should check whether some virtual machine's CPU consumption is especially high. The KVM guest list on the host is as follows:

```
[root@dell-server-3 ~]# virsh list --all
 Id    Name               State
----------------------------------------------------
 12    vm-jmeter          running
 13    vm-k8s-worker-8    running
 14    vm-k8s-worker-7    running
 15    vm-k8s-worker-9    running

[root@dell-server-3 ~]#
```

We can see four virtual machines running on this host, so let's look at the resource consumption of each of the four in turn (for a quick host-side view first, see the sketch below).

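
As a first impression, you can also read each guest's CPU consumption from the host side, assuming the standard libvirt client is available (the domain name is taken from the list above):

```
# Cumulative CPU time consumed by one guest, as accounted by the host
virsh cpu-stats vm-k8s-worker-9 --total
```
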
* **vm-jmeter**

```
top - 23:42:49 up 28 days, 8:14, 6 users, load average: 0.61, 0.48, 0.38
Tasks: 220 total, 1 running, 218 sleeping, 1 stopped, 0 zombie
%Cpu0 : 6.6 us, 3.5 sy, 0.0 ni, 88.5 id, 0.0 wa, 0.0 hi, 0.0 si, 1.4 st
%Cpu1 : 6.5 us, 1.8 sy, 0.0 ni, 88.2 id, 0.0 wa, 0.0 hi, 0.4 si, 3.2 st
KiB Mem : 3880180 total, 920804 free, 1506128 used, 1453248 buff/cache
KiB Swap: 2097148 total, 1256572 free, 840576 used. 2097412 avail Mem

 PID USER   PR NI    VIRT    RES   SHR S  %CPU %MEM    TIME+ COMMAND
7157 root   20  0 3699292 781204 17584 S  27.8 20.1  1:09.44 java
   9 root   20  0       0      0     0 S   0.3  0.0 30:25.77 rcu_sched
 376 root   20  0       0      0     0 S   0.3  0.0 16:40.44 xfsaild/dm-
```

* **vm-k8s-worker-8**

```
top - 23:43:47 up 5 days, 22:28, 3 users, load average: 9.21, 6.45, 5.74
Tasks: 326 total, 1 running, 325 sleeping, 0 stopped, 0 zombie
%Cpu0 : 20.2 us, 3.7 sy, 0.0 ni, 60.7 id, 0.0 wa, 0.0 hi, 2.9 si, 12.5 st
%Cpu1 : 27.3 us, 7.4 sy, 0.0 ni, 50.2 id, 0.0 wa, 0.0 hi, 3.7 si, 11.4 st
%Cpu2 : 29.9 us, 5.6 sy, 0.0 ni, 48.5 id, 0.0 wa, 0.0 hi, 4.9 si, 11.2 st
%Cpu3 : 31.2 us, 5.6 sy, 0.0 ni, 47.6 id, 0.0 wa, 0.0 hi, 4.5 si, 11.2 st
%Cpu4 : 25.6 us, 4.3 sy, 0.0 ni, 52.7 id, 0.0 wa, 0.0 hi, 3.6 si, 13.7 st
%Cpu5 : 26.0 us, 5.2 sy, 0.0 ni, 53.5 id, 0.0 wa, 0.0 hi, 4.1 si, 11.2 st
%Cpu6 : 19.9 us, 6.2 sy, 0.0 ni, 57.6 id, 0.0 wa, 0.0 hi, 3.6 si, 12.7 st
%Cpu7 : 27.3 us, 5.0 sy, 0.0 ni, 53.8 id, 0.0 wa, 0.0 hi, 2.3 si, 11.5 st
KiB Mem : 16265688 total, 6772084 free, 4437840 used, 5055764 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 11452900 avail Mem

  PID USER     PR NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
13049 root     20  0 9853712 593464  15752 S 288.4  3.6  67:24.22 java
 1116 root     20  0 2469728  57932  16188 S  12.6  0.4 818:40.25 containerd
 1113 root     20  0 3496336 118048  38048 S  12.3  0.7 692:30.79 kubelet
 4961 root     20  0 1780136  40700  17864 S  12.3  0.3 205:51.15 calico-node
 3830 root     20  0 2170204 114920  33304 S  11.6  0.7 508:00.00 scope
 1118 root     20  0 1548060 111768  29336 S  11.3  0.7 685:27.95 dockerd
 8216 techstar 20  0 2747240 907080 114836 S   5.0  5.6   1643:33 prometheus
21002 root     20  0 9898708 637616  17316 S   3.3  3.9 718:56.99 java
 1070 root     20  0 9806964 476716  15756 S   2.0  2.9 137:13.47 java
11492 root     20  0  441996  33204   4236 S   1.3  0.2  38:10.49 gvfs-udisks2-vo
```

* **vm-k8s-worker-7**

```
top - 23:44:22 up 5 days, 22:26, 3 users, load average: 2.50, 1.67, 1.13
Tasks: 308 total, 1 running, 307 sleeping, 0 stopped, 0 zombie
%Cpu0 : 4.2 us, 3.5 sy, 0.0 ni, 82.3 id, 0.0 wa, 0.0 hi, 1.7 si, 8.3 st
%Cpu1 : 6.2 us, 2.7 sy, 0.0 ni, 82.8 id, 0.0 wa, 0.0 hi, 1.4 si, 6.9 st
%Cpu2 : 5.2 us, 2.8 sy, 0.0 ni, 84.0 id, 0.0 wa, 0.0 hi, 1.0 si, 6.9 st
%Cpu3 : 4.5 us, 3.8 sy, 0.0 ni, 81.2 id, 0.0 wa, 0.0 hi, 1.4 si, 9.2 st
%Cpu4 : 4.4 us, 2.4 sy, 0.0 ni, 83.3 id, 0.0 wa, 0.0 hi, 1.4 si, 8.5 st
%Cpu5 : 5.5 us, 2.4 sy, 0.0 ni, 84.5 id, 0.0 wa, 0.0 hi, 1.0 si, 6.6 st
%Cpu6 : 3.7 us, 2.7 sy, 0.0 ni, 85.6 id, 0.0 wa, 0.0 hi, 0.7 si, 7.4 st
%Cpu7 : 3.1 us, 1.7 sy, 0.0 ni, 84.7 id, 0.0 wa, 0.0 hi, 1.4 si, 9.0 st
KiB Mem : 16265688 total, 8715820 free, 3848432 used, 3701436 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 12019164 avail Mem

  PID USER     PR NI    VIRT    RES   SHR S %CPU %MEM     TIME+ COMMAND
18592 27       20  0 4588208 271564 12196 S 66.9  1.7 154:58.93 mysqld
 1109 root     20  0 2381424 105512 37208 S  9.6  0.6 514:18.00 kubelet
 1113 root     20  0 1928952  55556 16024 S  8.9  0.3 567:43.53 containerd
 1114 root     20  0 1268692 105212 29644 S  8.6  0.6 516:43.38 dockerd
 3122 root     20  0 2169692 117212 33416 S  7.0  0.7 408:21.79 scope
 4132 root     20  0 1780136  43188 17952 S  6.0  0.3 193:27.58 calico-node
 3203 nfsnobo+ 20  0  116748  19720  5864 S  2.0  0.1  42:43.57 node_exporter
12089 techstar 20  0 5666480   1.3g 23084 S  1.3  8.5  78:04.61 java
 5727 root     20  0  449428  38616  4236 S  1.0  0.2  49:02.98 gvfs-udisks2-vo
```

* **vm-k8s-worker-9**

```
top - 23:45:23 up 5 days, 22:21, 4 users, load average: 12.51, 10.28, 9.19
Tasks: 333 total, 4 running, 329 sleeping, 0 stopped, 0 zombie
%Cpu0 : 20.1 us, 7.5 sy, 0.0 ni, 43.3 id, 0.0 wa, 0.0 hi, 13.4 si, 15.7 st
%Cpu1 : 20.1 us, 11.2 sy, 0.0 ni, 41.4 id, 0.0 wa, 0.0 hi, 11.9 si, 15.3 st
%Cpu2 : 23.8 us, 10.0 sy, 0.0 ni, 35.4 id, 0.0 wa, 0.0 hi, 14.2 si, 16.5 st
%Cpu3 : 15.1 us, 7.7 sy, 0.0 ni, 49.1 id, 0.0 wa, 0.0 hi, 12.2 si, 15.9 st
%Cpu4 : 22.8 us, 6.9 sy, 0.0 ni, 40.5 id, 0.0 wa, 0.0 hi, 14.7 si, 15.1 st
%Cpu5 : 17.5 us, 5.8 sy, 0.0 ni, 50.0 id, 0.0 wa, 0.0 hi, 10.6 si, 16.1 st
%Cpu6 : 22.0 us, 6.6 sy, 0.0 ni, 45.1 id, 0.0 wa, 0.0 hi, 11.0 si, 15.4 st
%Cpu7 : 19.2 us, 8.0 sy, 0.0 ni, 44.9 id, 0.0 wa, 0.0 hi, 9.8 si, 18.1 st
KiB Mem : 16265688 total, 2567932 free, 7138952 used, 6558804 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 8736000 avail Mem

  PID USER PR NI    VIRT    RES   SHR S  %CPU %MEM     TIME+ COMMAND
24122 root 20  0 9890064 612108 16880 S 201.0  3.8   1905:11 java
 2794 root 20  0 2307652 161224 33464 S  57.7  1.0   1065:54 scope
 1113 root 20  0 2607908  60552 15484 S  13.8  0.4   1008:04 containerd
 1109 root 20  0 2291748 110768 39140 S  12.8  0.7 722:41.17 kubelet
 1114 root 20  0 1285500 108664 30112 S  11.1  0.7 826:56.51 dockerd
   29 root 20  0       0      0     0 S   8.9  0.0  32:09.89 ksoftirqd/4
    6 root 20  0       0      0     0 S   8.2  0.0  41:28.14 ksoftirqd/0
   24 root 20  0       0      0     0 R   8.2  0.0  41:00.46 ksoftirqd/3
   39 root 20  0       0      0     0 R   8.2  0.0  41:08.18 ksoftirqd/6
   19 root 20  0       0      0     0 S   7.9  0.0  39:10.22 ksoftirqd/2
   14 root 20  0       0      0     0 S   6.2  0.0  40:58.25 ksoftirqd/1
```

Clearly, on worker-9 both si (CPU spent on soft interrupts) and st (stolen CPU) are not low. That is rather odd: the virtual machines themselves don't show high CPU usage, so why is st so high? Is this really as far as the CPU can be pushed?

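
To see how the soft-interrupt load spreads across cores, a minimal sketch with standard tools (mpstat is part of the sysstat package):

```
# Per-core breakdown once per second; watch the %soft and %steal columns
mpstat -P ALL 1 5

# Which softirq types are firing on which CPU
cat /proc/softirqs
```
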
Let's keep digging.

#### Stage 2: Checking the Physical Machines' CPU Run Mode

At this stage, we want to check whether anything in the service is blocking. As mentioned before, from the global-monitoring perspective we should consider whether the performance counters we examine are complete, to avoid biased judgments. I went and inspected the thread stacks in detail, and there was nothing Blocked in them, so the only place left to look is the physical machines' configuration.

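
That spot check can be sketched like this, assuming the JDK tools are on the box and $PID holds the gateway's Java process ID (a hypothetical variable):

```
# Summarize thread states from a stack dump; a pile of BLOCKED entries
# would point to lock contention
jstack $PID | grep 'java.lang.Thread.State' | sort | uniq -c | sort -rn
```
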
So what else is there to look at on a physical machine's CPU? Even if you pull the covers over your head and think for a long time, walking through all the logic from bottom to top, you won't find anywhere a blockage could hide. The only thing left is the hosts' CPU run mode.

```
-- Physical machine 1
[root@hp-server ~]# cpupower frequency-info
analyzing CPU 0:
  driver: pcc-cpufreq
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: Cannot determine or is not supported.
  hardware limits: 1.20 GHz - 2.10 GHz
  available cpufreq governors: conservative userspace powersave ondemand performance
  current policy: frequency should be within 1.20 GHz and 2.10 GHz.
                  The governor "conservative" may decide which speed to use
                  within this range.
  current CPU frequency: 1.55 GHz (asserted by call to hardware)
  boost state support:
    Supported: yes
    Active: yes

-- Physical machine 2
[root@dell-server-2 ~]# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
powersave
[root@dell-server-2 ~]# cpupower frequency-info
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: Cannot determine or is not supported.
  hardware limits: 1.20 GHz - 2.20 GHz
  available cpufreq governors: performance powersave
  current policy: frequency should be within 1.20 GHz and 2.20 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency: 2.20 GHz (asserted by call to hardware)
  boost state support:
    Supported: no
    Active: no
    2200 MHz max turbo 4 active cores
    2200 MHz max turbo 3 active cores
    2200 MHz max turbo 2 active cores
    2200 MHz max turbo 1 active cores

-- Physical machine 3
[root@dell-server-3 ~]# cpupower frequency-info
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: Cannot determine or is not supported.
  hardware limits: 1.20 GHz - 2.20 GHz
  available cpufreq governors: performance powersave
  current policy: frequency should be within 1.20 GHz and 2.20 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency: 2.20 GHz (asserted by call to hardware)
  boost state support:
    Supported: no
    Active: no
    2200 MHz max turbo 4 active cores
    2200 MHz max turbo 3 active cores
    2200 MHz max turbo 2 active cores
    2200 MHz max turbo 1 active cores

-- Physical machine 4
[root@lenvo-nfs-server ~]# cpupower frequency-info
analyzing CPU 0:
  driver: acpi-cpufreq
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: 10.0 us
  hardware limits: 2.00 GHz - 2.83 GHz
  available frequency steps: 2.83 GHz, 2.00 GHz
  available cpufreq governors: conservative userspace powersave ondemand performance
  current policy: frequency should be within 2.00 GHz and 2.83 GHz.
                  The governor "conservative" may decide which speed to use
                  within this range.
  current CPU frequency: 2.00 GHz (asserted by call to hardware)
  boost state support:
    Supported: no
    Active: no
```

We can see that not a single physical machine is running in performance mode.

Here we need a basic understanding of the CPU run modes (governors):

![](https://static001.geekbang.org/resource/image/8c/82/8cd3a3bee80eb77bf348b1a063a90682.jpg)

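
A quick way to confirm the current governor on every core at once is to read the same sysfs files shown above, a one-line sketch:

```
# Count cores per governor value across all CPUs
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c
```
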
Since we are performance analysts, we naturally want performance mode, so we change the CPU mode as follows:

```
-- Physical machine 1
[root@hp-server ~]# cpupower -c all frequency-set -g performance
Setting cpu: 0
Setting cpu: 1
Setting cpu: 2
Setting cpu: 3
Setting cpu: 4
Setting cpu: 5
Setting cpu: 6
Setting cpu: 7
Setting cpu: 8
Setting cpu: 9
Setting cpu: 10
Setting cpu: 11
Setting cpu: 12
Setting cpu: 13
Setting cpu: 14
Setting cpu: 15
Setting cpu: 16
Setting cpu: 17
Setting cpu: 18
Setting cpu: 19
Setting cpu: 20
Setting cpu: 21
Setting cpu: 22
Setting cpu: 23
Setting cpu: 24
Setting cpu: 25
Setting cpu: 26
Setting cpu: 27
Setting cpu: 28
Setting cpu: 29
Setting cpu: 30
Setting cpu: 31
[root@hp-server ~]# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
performance
[root@hp-server ~]#

-- Physical machine 2
[root@dell-server-2 ~]# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
powersave
[root@dell-server-2 ~]# cpupower -c all frequency-set -g performance
Setting cpu: 0
Setting cpu: 1
Setting cpu: 2
Setting cpu: 3
Setting cpu: 4
Setting cpu: 5
Setting cpu: 6
Setting cpu: 7
Setting cpu: 8
Setting cpu: 9
Setting cpu: 10
Setting cpu: 11
Setting cpu: 12
Setting cpu: 13
Setting cpu: 14
Setting cpu: 15
[root@dell-server-2 ~]# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
performance
[root@dell-server-2 ~]#

-- Physical machine 3
[root@dell-server-3 ~]# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
powersave
[root@dell-server-3 ~]# cpupower -c all frequency-set -g performance
Setting cpu: 0
Setting cpu: 1
Setting cpu: 2
Setting cpu: 3
Setting cpu: 4
Setting cpu: 5
Setting cpu: 6
Setting cpu: 7
Setting cpu: 8
Setting cpu: 9
Setting cpu: 10
Setting cpu: 11
Setting cpu: 12
Setting cpu: 13
Setting cpu: 14
Setting cpu: 15
[root@dell-server-3 ~]# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
performance
[root@dell-server-3 ~]#

-- Physical machine 4
[root@lenvo-nfs-server ~]# cpupower -c all frequency-set -g performance
Setting cpu: 0
Setting cpu: 1
Setting cpu: 2
Setting cpu: 3
[root@lenvo-nfs-server ~]# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
performance
[root@lenvo-nfs-server ~]#
```

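
One caveat worth knowing: a governor set with cpupower does not survive a reboot on most distributions. If you want the setting to persist, one common approach on CentOS/RHEL-style systems (an assumption about these hosts) is the tuned daemon, whose throughput-performance profile pins the performance governor:

```
# Check the active tuning profile, then switch to a performance-oriented one
tuned-adm active
tuned-adm profile throughput-performance
```
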
So, after this flurry of operations, how does the performance look?

As it turns out, it did not get better... I'll skip the screenshot here, because the graph looks exactly like the scenario-run graph at the beginning.

What we should recognize here is that the analysis above shows this isn't the only problem point: some other resource usage has a short plank we haven't found yet. There's nothing for it but to keep digging.

## Summary

In this lesson, we used the curves from the load tool to establish that a bottleneck exists, and then split the response time with SkyWalking.

After pinning down where the response time was consumed, we went through two stages of analysis. The evidence chain in the first stage ran downward from the symptom: since st CPU means this virtual machine's CPU is being consumed by other tenants on the host, we went to the host and examined the other virtual machines. Here we need to be clear about how far CPU resources should reasonably be used; having found unreasonable resource usage, we moved on to the second stage.

In the second stage, we examined the CPU run mode. On a physical machine, if we don't impose limits ourselves, CPU consumption is not limited by default, which is why we went to check the CPU's run mode.

However, even after analyzing and attempting to fix the problems above, TPS still didn't change. So within our counter-analysis logic, we performed optimization actions, yet the system still has problems. All we can say is that our optimizations so far removed only the shortest plank in the barrel; there are other short planks we haven't found yet.

Please note, this does not mean that this lesson's analysis and optimization were pointless. Remember: until these problems are solved, the next problem won't surface. So the work in this lesson is still very valuable.

In the next lesson, we'll continue hunting for the performance bottlenecks of the home page interface.

## Homework

Finally, think about two questions:

1. When we see high st CPU inside a virtual machine, why do we go check the other virtual machines on the host? And if we saw high st CPU on the host itself, what judgment should we make?
2. When the CPU run mode is powersave, what is the CPU's operating logic?

Remember to share and discuss your thoughts with me in the comments section; every round of thinking takes you a step further.

If you got something out of this lesson, feel free to share it with your friends and learn together. See you in the next lesson!