0%

针对FDTD并行化性能的分析报告

计算算例为:

1
2
3
4
5
auto domain{std::make_shared<xfdtd::Object>(
"domain",
std::make_unique<xfdtd::Cube>(xfdtd::Vector{-0.175, -0.175, -0.175},
xfdtd::Vector{0.35, 0.35, 0.35}),
xfdtd::Material::createAir())};

测试平台

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
                          ./+o+-       franzero@franzero-MS-7C94
yyyyy- -yyyyyy+ OS: Ubuntu 24.04 noble
://+//////-yyyyyyo Kernel: x86_64 Linux 6.8.0-45-generic
.++ .:/++++++/-.+sss/` Uptime: 6h 29m
.:++o: /++++++++/:--:/- Packages: 2282
o:+o+:++.`..```.-/oo+++++/ Shell: zsh 5.9
.:+o:+o/. `+sssoo+/ Disk: 616G / 6.3T (11%)
.++/+:+oo+o:` /sssooo. CPU: AMD Ryzen 5 5600X 6-Core @ 12x 3.7GHz
/+++//+:`oo+o /::--:. RAM: 3630MiB / 15909MiB
\+/+o+++`o++o ++////.
.++.o+++oo+:` /dddhhh.
.+.o+oo:. `oddhhhh+
\+.++o+o``-````.:ohdhhhhh+
`:o+++ `ohhhhhhhhyo++os:
.o:`.syhhhhhhh/.oo++o`
/osyyyyyyo++ooo+++/
````` +oo+++o\:
`oo++.

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 5 5600X 6-Core Processor
CPU family: 25
Model: 33
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
Stepping: 0
Frequency boost: enabled
CPU(s) scaling MHz: 57%
CPU max MHz: 4650.2920
CPU min MHz: 2200.0000

Stat报告:sudo perf stat -e task-clock,cycles,instructions,cache-references,cache-misses,branches,branch-misses,page-faults,minor-faults,major-faultscpu-migrations,context-switche

1. 无PML边界时

整个计算域大小为70x70x70。时间步长为1200

串行命令:./build/Release-x64/bin/benchmark -t 1200 -t_c 1 1 1

报告如下

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
Performance counter stats for './build/Release-x64/bin/benchmark -t 1200 -t_c 1 1 1':

4,676.22 msec task-clock # 0.942 CPUs utilized
20,922,169,997 cycles # 4.474 GHz (83.38%)
82,578,031,208 instructions # 3.95 insn per cycle (83.36%)
4,229,791,865 cache-references # 904.532 M/sec (83.26%)
102,750,093 cache-misses # 2.43% of all cache refs (83.22%)
2,996,821,632 branches # 640.864 M/sec (83.40%)
14,714,284 branch-misses # 0.49% of all branches (83.38%)
40,282 page-faults # 8.614 K/sec
40,279 minor-faults # 8.614 K/sec
3 major-faults # 0.642 /sec
51 cpu-migrations # 10.804 /sec
3,125 context-switches # 661.987 /sec

4.964122624 seconds time elapsed

4.520302000 seconds user
0.099006000 seconds sys

多线程性能分析

XFDTD已经将线程绑定到CPU核心,0号线程绑定到0号核心,1号线程绑定到1号核心。

2x1x1: ./build/Release-x64/bin/benchmark -t 1200 -t_c 2 1 1

报告如下

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
Performance counter stats for './build/Release-x64/bin/benchmark -t 1200 -t_c 2 1 1':

5,010.83 msec task-clock # 1.676 CPUs utilized
22,380,340,770 cycles # 4.466 GHz (83.36%)
82,709,456,232 instructions # 3.70 insn per cycle (83.19%)
4,270,380,181 cache-references # 852.231 M/sec (83.29%)
108,694,744 cache-misses # 2.55% of all cache refs (83.41%)
3,032,116,421 branches # 605.113 M/sec (83.29%)
19,110,567 branch-misses # 0.63% of all branches (83.46%)
40,303 page-faults # 8.043 K/sec
40,300 minor-faults # 8.043 K/sec
3 major-faults # 0.599 /sec
63 cpu-migrations # 12.451 /sec
9,973 context-switches # 1.971 K/sec

2.989421301 seconds time elapsed

4.741480000 seconds user
0.201530000 seconds sys

加速比为:4.96/2.99 = 1.66

多线程比单线程快了:(4.96-2.99)/4.96 = 39.72%

多线程的CPU利用率为:167.6%

在Stat报告中,2x1x1的分支预测错误率和缓存未命中率获得了与串行执行差不多的结果。加速比没有达到2,可能是CPU利用率不够高,也许是因为过多的上下文切换。

1x2x1: ./build/Release-x64/bin/benchmark -t 1200 -t_c 1 2 1

报告如下

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
Performance counter stats for './build/Release-x64/bin/benchmark -t 1200 -t_c 1 2 1':

5,252.76 msec task-clock # 1.687 CPUs utilized
23,275,116,991 cycles # 4.431 GHz (83.29%)
82,704,202,147 instructions # 3.55 insn per cycle (83.35%)
4,554,703,918 cache-references # 867.107 M/sec (83.47%)
258,114,565 cache-misses # 5.67% of all cache refs (83.11%)
3,034,010,364 branches # 577.603 M/sec (83.33%)
19,516,081 branch-misses # 0.64% of all branches (83.47%)
40,297 page-faults # 7.672 K/sec
40,294 minor-faults # 7.671 K/sec
3 major-faults # 0.571 /sec
64 cpu-migrations # 12.297 /sec
8,633 context-switches # 1.659 K/sec

3.113548404 seconds time elapsed

5.030267000 seconds user
0.153733000 seconds sys

加速比为:4.96/3.11 = 1.59

多线程比单线程快了:(4.96-3.11)/4.96 = 37.30%

多线程的CPU利用率为:168.7%

1x2x1相比于2x1x1,在Stat报告中,缓存的未命中率有所增加,几乎是2x1x1或者串行的两倍。导致在CPU利用率相同与2x1x1差不多的情况下,加速效果没有2x1x1好。

1x1x2: ./build/Release-x64/bin/benchmark -t 1200 -t_c 1 1 2

报告如下

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
Performance counter stats for './build/Release-x64/bin/benchmark -t 1200 -t_c 1 1 2':

7,140.03 msec task-clock # 1.774 CPUs utilized
32,256,025,684 cycles # 4.518 GHz (83.31%)
83,803,187,755 instructions # 2.60 insn per cycle (83.26%)
8,364,234,889 cache-references # 1.171 G/sec (83.36%)
311,977,954 cache-misses # 3.73% of all cache refs (83.37%)
3,089,285,740 branches # 432.671 M/sec (83.31%)
18,422,341 branch-misses # 0.60% of all branches (83.38%)
40,290 page-faults # 5.643 K/sec
40,287 minor-faults # 5.642 K/sec
3 major-faults # 0.420 /sec
56 cpu-migrations # 7.778 /sec
5,309 context-switches # 737.412 /sec

4.024689262 seconds time elapsed

6.935281000 seconds user
0.136749000 seconds sys

加速比为:4.96/4.02 = 1.23

多线程比单线程快了:(4.96-4.02)/4.96 = 18.95%

多线程的CPU利用率为:177.4%

1x1x2的性能几乎是最差的,最高的CPU利用率,却获得了最差的加速比。从报告中可以看到,缓存未命中率是虽然为3.73%,但是缓存引用次数是最高的,达到了80亿次,是其他的两倍。

原因: 3维数组的存储是行优先,索引按x,y,z顺序获取,即在z方向上是连续的。1x1x2的划分,切断了z方向上的连续性,缓存的引用次数增加。

在FDTD程序中占用大量时间的是Update函数,里面有一个三重for循环,按x,y,z的顺序遍历

1
2
3
4
5
6
7
for (int x = is; x < ie; ++x) {
for (int y = js; y < je; ++y) {
for (int z = ks; z < ke; ++z) {
// ...
}
}
}

某次循环中,加载到缓存的数据为[x_i, y_i, z_i], [x_i, y_i, z_i+1], [x_i, y_i, z_i+2],在1x1x2的划分下,访问到[x_i, y_i, z_i+1]时,z方向上的循化结束,下一次访问[x_i, y_i+1, z_i]时,z方向上的数据不在缓存中,需要重新加载。1x1x2划分会多次出现这样的情形,大约会是nx*ny次。

PS:在FDTD中多重for循环的分支预测错误却不是主要的性能瓶颈。

多进程性能分析

2x1x1: mpiexec --allow-run-as-root --bind-to core -n 2 ./build/Release-x64/bin/benchmark -t 1200 -t_c 1 1 1

报告如下

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
Performance counter stats for 'mpiexec --allow-run-as-root --bind-to core -n 2 ./build/Release-x64/bin/benchmark -t 1200 -t_c 1 1 1':

5,260.68 msec task-clock # 1.797 CPUs utilized
23,443,996,940 cycles # 4.456 GHz (83.07%)
86,076,787,999 instructions # 3.67 insn per cycle (83.40%)
4,404,421,901 cache-references # 837.235 M/sec (83.42%)
112,032,269 cache-misses # 2.54% of all cache refs (83.25%)
3,371,057,675 branches # 640.803 M/sec (83.43%)
21,297,282 branch-misses # 0.63% of all branches (83.44%)
44,539 page-faults # 8.466 K/sec
44,535 minor-faults # 8.466 K/sec
4 major-faults # 0.760 /sec
136 cpu-migrations # 25.852 /sec
6,136 context-switches # 1.166 K/sec

2.926716656 seconds time elapsed

5.015942000 seconds user
0.233263000 seconds sys

1x1x2: mpiexec --allow-run-as-root --bind-to core -n 2 ./build/Release-x64/bin/benchmark -t 1200 -t_c 1 1 1

报告如下

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
Performance counter stats for 'mpiexec --allow-run-as-root --bind-to core -n 2 ./build/Release-x64/bin/benchmark -t 1200 -t_c 1 1 1':

5,401.87 msec task-clock # 1.802 CPUs utilized
24,066,472,199 cycles # 4.455 GHz (83.29%)
88,250,212,639 instructions # 3.67 insn per cycle (83.34%)
4,487,750,246 cache-references # 830.778 M/sec (83.39%)
145,999,530 cache-misses # 3.25% of all cache refs (83.28%)
3,748,377,075 branches # 693.904 M/sec (83.40%)
20,100,212 branch-misses # 0.54% of all branches (83.30%)
44,628 page-faults # 8.262 K/sec
44,624 minor-faults # 8.261 K/sec
4 major-faults # 0.740 /sec
122 cpu-migrations # 22.585 /sec
6,166 context-switches # 1.141 K/sec

2.997049471 seconds time elapsed

5.182734000 seconds user
0.208839000 seconds sys

不受空间划分影响,多进程的性能表现与多线程的性能表现差不多。

总结

多线程在最佳情况下和多进程的性能表现差不多。

GPU

GPU性能

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
Device 0:"NVIDIA GeForce RTX 2060 SUPER"
CUDA Driver Version / Runtime Version 12.4 / 12.0
CUDA Capability Major/Minor version number: 7.5
Total amount of global memory: 7933.94 MBytes (8319336448 bytes)
GPU Clock rate: 1665 MHz (1.66 GHz)
Memory Bus width: 256-bits
L2 Cache Size: 4194304 bytes
Max Texture Dimension Size (x,y,z) 1D=(131072),2D=(131072,65536),3D=(16384,16384,16384)
Max Layered Texture Size (dim) x layers 1D=(32768) x 2048,2D=(32768,32768) x 2048
Total amount of constant memory 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block:65536
Wrap size: 32
Maximun number of thread per multiprocesser: 1024
Maximun number of thread per block: 1024
Maximun size of each dimension of a block: 1024 x 1024 x 64
Maximun size of each dimension of a grid: 2147483647 x 65535 x 65535
Maximu memory pitch 2147483647 bytes

命令./build/Release-x64/bin/benchmark -d 0.005 -t 0 -s 1200 -g 64 64 1 -b 1 1 64

结论为

1
2
3
4
5
6
Elapsed time: 358.795 ms
SimulationHD::run() - domain run
SimulationHD::run() - copyDeviceToHost
SimulationHD::run() End!
Total elapsed time: 974.123 ms
Total elapsed time: 0.974123 s

10倍多

2. 有PML边界时

整个计算域大小为86x86x86。时间步长为1200。PML边界层数为8。

串行命令:./build/Release-x64/bin/benchmark --with_pml -t 1200 -t_c 1 1 1

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
Performance counter stats for './build/Release-x64/bin/benchmark --with_pml -t 1200 -t_c 1 1 1':

12,992.83 msec task-clock # 0.977 CPUs utilized
58,734,301,508 cycles # 4.521 GHz (83.36%)
203,700,658,783 instructions # 3.47 insn per cycle (83.29%)
9,971,769,749 cache-references # 767.483 M/sec (83.34%)
525,622,296 cache-misses # 5.27% of all cache refs (83.35%)
7,340,039,091 branches # 564.930 M/sec (83.35%)
31,616,197 branch-misses # 0.43% of all branches (83.30%)
76,039 page-faults # 5.852 K/sec
76,036 minor-faults # 5.852 K/sec
3 major-faults # 0.231 /sec
58 cpu-migrations # 4.464 /sec
6,443 context-switches # 495.889 /sec

13.294737831 seconds time elapsed

12.758390000 seconds user
0.164669000 seconds sys

多线程性能分析

XFDTD已经将线程绑定到CPU核心,0号线程绑定到0号核心,1号线程绑定到1号核心。

2x1x1: ./build/Release-x64/bin/benchmark --with_pml -t 1200 -t_c 2 1 1

报告如下

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
Performance counter stats for './build/Release-x64/bin/benchmark --with_pml -t 1200 -t_c 2 1 1':

14,354.55 msec task-clock # 1.845 CPUs utilized
64,710,649,907 cycles # 4.508 GHz (83.34%)
204,088,834,355 instructions # 3.15 insn per cycle (83.40%)
9,999,684,154 cache-references # 696.621 M/sec (83.28%)
539,018,362 cache-misses # 5.39% of all cache refs (83.38%)
7,374,690,193 branches # 513.753 M/sec (83.31%)
36,375,992 branch-misses # 0.49% of all branches (83.28%)
76,027 page-faults # 5.296 K/sec
76,024 minor-faults # 5.296 K/sec
3 major-faults # 0.209 /sec
57 cpu-migrations # 3.971 /sec
13,435 context-switches # 935.940 /sec

7.778543962 seconds time elapsed

14.035339000 seconds user
0.240954000 seconds sys

1x1x2: ./build/Release-x64/bin/benchmark --with_pml -t 1200 -t_c 1 1 2

报告如下

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
Performance counter stats for './build/Release-x64/bin/benchmark --with_pml -t 1200 -t_c 1 1 2':

25,672.21 msec task-clock # 1.917 CPUs utilized
116,805,723,909 cycles # 4.550 GHz (83.33%)
205,888,912,957 instructions # 1.76 insn per cycle (83.34%)
19,328,422,246 cache-references # 752.893 M/sec (83.34%)
1,145,169,180 cache-misses # 5.92% of all cache refs (83.32%)
7,512,461,061 branches # 292.630 M/sec (83.32%)
61,675,797 branch-misses # 0.82% of all branches (83.35%)
76,035 page-faults # 2.962 K/sec
76,032 minor-faults # 2.962 K/sec
3 major-faults # 0.117 /sec
63 cpu-migrations # 2.454 /sec
12,918 context-switches # 503.190 /sec

13.393647514 seconds time elapsed

25.365368000 seconds user
0.215383000 seconds sys

1x1x2的加速效果极差,在最高的CPU利用率下,速度和串行执行差不多。缓存引用次数和分支次数都是最高的。

多进程性能分析

mpiexec --allow-run-as-root --bind-to core -n 2 ./build/Release-x64/bin/benchmark --with_pml -t 1200 -t_c 1 1 1

1x1x2报告如下

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
Performance counter stats for 'mpiexec --allow-run-as-root --bind-to core -n 2 ./build/Release-x64/bin/benchmark --with_pml -t 1200 -t_c 1 1 1':

15,365.46 msec task-clock # 1.925 CPUs utilized
69,081,113,395 cycles # 4.496 GHz (83.32%)
213,563,002,414 instructions # 3.09 insn per cycle (83.33%)
10,682,103,857 cache-references # 695.202 M/sec (83.34%)
645,525,024 cache-misses # 6.04% of all cache refs (83.36%)
8,511,543,167 branches # 553.940 M/sec (83.39%)
45,406,924 branch-misses # 0.53% of all branches (83.26%)
81,154 page-faults # 5.282 K/sec
81,151 minor-faults # 5.281 K/sec
3 major-faults # 0.195 /sec
308 cpu-migrations # 20.045 /sec
9,800 context-switches # 637.794 /sec

7.980754877 seconds time elapsed

15.049898000 seconds user
0.298889000 seconds sys

2x1x1

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
Performance counter stats for 'mpiexec --allow-run-as-root --bind-to core -n 2 ./build/Release-x64/bin/benchmark --with_pml -t 1200 -t_c 1 1 1':

14,956.17 msec task-clock # 1.924 CPUs utilized
67,179,684,611 cycles # 4.492 GHz (83.32%)
208,882,939,244 instructions # 3.11 insn per cycle (83.42%)
10,231,637,604 cache-references # 684.108 M/sec (83.35%)
549,804,692 cache-misses # 5.37% of all cache refs (83.31%)
7,747,500,546 branches # 518.014 M/sec (83.31%)
39,120,745 branch-misses # 0.50% of all branches (83.29%)
81,026 page-faults # 5.418 K/sec
81,023 minor-faults # 5.417 K/sec
3 major-faults # 0.201 /sec
232 cpu-migrations # 15.512 /sec
9,647 context-switches # 645.018 /sec

7.771637133 seconds time elapsed

14.583200000 seconds user
0.356060000 seconds sys

更大并行度

多线程4x1x1: ./build/Release-x64/bin/benchmark -t 1200 -t_c 4 1 1

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
Performance counter stats for './build/Release-x64/bin/benchmark -t 1200 -t_c 4 1 1':

6,700.63 msec task-clock # 2.954 CPUs utilized
29,755,851,505 cycles # 4.441 GHz (83.51%)
83,205,528,265 instructions # 2.80 insn per cycle (83.10%)
4,199,090,239 cache-references # 626.671 M/sec (83.31%)
93,311,424 cache-misses # 2.22% of all cache refs (83.27%)
3,088,270,961 branches # 460.893 M/sec (83.47%)
26,095,146 branch-misses # 0.84% of all branches (83.34%)
40,295 page-faults # 6.014 K/sec
40,292 minor-faults # 6.013 K/sec
3 major-faults # 0.448 /sec
64 cpu-migrations # 9.551 /sec
22,750 context-switches # 3.395 K/sec

2.268141292 seconds time elapsed

6.323666000 seconds user
0.291030000 seconds sys

多线程1x1x4: ./build/Release-x64/bin/benchmark -t 1200 -t_c 1 1 4

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
Performance counter stats for './build/Release-x64/bin/benchmark -t 1200 -t_c 1 1 4':

14,000.02 msec task-clock # 3.473 CPUs utilized
63,236,927,938 cycles # 4.517 GHz (83.28%)
86,178,188,140 instructions # 1.36 insn per cycle (83.34%)
14,446,451,610 cache-references # 1.032 G/sec (83.39%)
683,928,113 cache-misses # 4.73% of all cache refs (83.37%)
3,274,725,625 branches # 233.909 M/sec (83.33%)
27,821,046 branch-misses # 0.85% of all branches (83.29%)
40,298 page-faults # 2.878 K/sec
40,295 minor-faults # 2.878 K/sec
3 major-faults # 0.214 /sec
65 cpu-migrations # 4.643 /sec
11,451 context-switches # 817.928 /sec

4.030696660 seconds time elapsed

13.727729000 seconds user
0.188187000 seconds sys

多线程2x2x1: './build/Release-x64/bin/benchmark -t 1200 -t_c 2 2 1

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
Performance counter stats for './build/Release-x64/bin/benchmark -t 1200 -t_c 2 2 1':

7,203.20 msec task-clock # 3.054 CPUs utilized
32,083,910,773 cycles # 4.454 GHz (83.21%)
82,976,776,310 instructions # 2.59 insn per cycle (83.38%)
4,303,902,243 cache-references # 597.499 M/sec (83.26%)
157,530,568 cache-misses # 3.66% of all cache refs (83.28%)
3,099,766,338 branches # 430.332 M/sec (83.38%)
26,860,431 branch-misses # 0.87% of all branches (83.49%)
40,318 page-faults # 5.597 K/sec
40,315 minor-faults # 5.597 K/sec
3 major-faults # 0.416 /sec
63 cpu-migrations # 8.746 /sec
22,403 context-switches # 3.110 K/sec

2.358286286 seconds time elapsed

6.834764000 seconds user
0.283820000 seconds sys

多线程1x2x2: ./build/Release-x64/bin/benchmark -t 1200 -t_c 1 2 2

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
Performance counter stats for './build/Release-x64/bin/benchmark -t 1200 -t_c 1 2 2':

14,890.42 msec task-clock # 3.480 CPUs utilized
67,470,192,018 cycles # 4.531 GHz (83.39%)
84,160,170,749 instructions # 1.25 insn per cycle (83.28%)
7,447,985,604 cache-references # 500.186 M/sec (83.34%)
1,035,847,953 cache-misses # 13.91% of all cache refs (83.29%)
3,165,261,031 branches # 212.570 M/sec (83.37%)
29,836,493 branch-misses # 0.94% of all branches (83.34%)
40,295 page-faults # 2.706 K/sec
40,292 minor-faults # 2.706 K/sec
3 major-faults # 0.201 /sec
57 cpu-migrations # 3.828 /sec
20,126 context-switches # 1.352 K/sec

4.278950130 seconds time elapsed

14.502632000 seconds user
0.288415000 seconds sys