Introduction
As you probably know, a single physical CPU core with hyper-threading appears as two logical CPUs to an operating system.
While the operating system sees two CPUs for each core, the actual CPU hardware only has a single set of execution resources for each core.
Hyper-threading can help speed your system up, but it’s nowhere near as good as having additional cores.
Out of curiosity, I checked the impact of hyper-threading on the number of logical IO per second (LIOPS) during SLOB runs. The main idea is to compare:
- 2 FULL CACHED SLOB sessions running on the same core
- 2 FULL CACHED SLOB sessions running on two distincts cores
Into this blog post, I just want to share how I did the measure and my results (of course your results could differ depending of the cpus, the slob configuration and so on..).
Tests environment and setup
- Oracle database 11.2.0.4.
- SLOB with the following slob.conf in place (the rest of the file contains the default slob.conf values):
UPDATE_PCT=0 RUN_TIME=60 WORK_LOOP=0 SCALE=80M WORK_UNIT=64
- The cpu_topology is the following (the cpu_topology script comes from Karl Arao’s cputoolkit here):
$ sh ./cpu_topology model name : Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz processors (OS CPU count) 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 physical id (processor socket) 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 siblings (logical CPUs/socket) 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 core id (# assigned to a core) 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 cpu cores (physical cores/socket) 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 $ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 32 On-line CPU(s) list: 0-31 Thread(s) per core: 2 Core(s) per socket: 8 Socket(s): 2 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 45 Stepping: 7 CPU MHz: 2893.187 BogoMIPS: 5785.23 Virtualization: VT-x L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 20480K
As you can see the box is made of 2 sockets, 16 cores and 32 threads (so that the database sees 32 cpus).
- Turbo boost does not come into play (you’ll see the proof during the tests thanks to the turbostat command).
- Obviously the same SLOB configuration has been used for the tests.
Launch the tests
Test 1: launch 2 SLOB sessions on 2 distincts sockets (so on 2 distincts cores) that way:
$ cd SLOB $ numactl --physcpubind=4,9 /bin/bash $ echo "startup" | sqlplus / as sysdba $ ./runit.sh 2
Cpus 4 and 9 are on 2 distincts sockets (see cpu_topology output).
The turbsostat command shows that they are 100% busy during the run:
Package Core CPU Avg_MHz %Busy Bzy_MHz TSC_MHz SMI CPU%c1 CPU%c3 CPU%c6 CPU%c7 CoreTmp PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt RAMWatt PKG_% RAM_% - - - 241 8.34 2893 2893 0 91.66 0.00 0.00 0.00 76 76 0.00 0.00 0.00 0.00 116.66 88.59 0.00 0.00 0.00 0 0 0 86 2.98 2893 2893 0 97.02 0.00 0.00 0.00 46 48 0.00 0.00 0.00 0.00 50.44 36.66 0.00 0.00 0.00 0 0 16 70 2.41 2893 2893 0 97.59 0 1 1 71 2.45 2893 2893 0 97.55 0.00 0.00 0.00 43 0 1 17 131 4.54 2893 2893 0 95.46 0 2 2 81 2.81 2893 2893 0 97.19 0.00 0.00 0.00 44 0 2 18 73 2.53 2893 2893 0 97.47 0 3 3 29 1.00 2893 2893 0 99.00 0.00 0.00 0.00 44 0 3 19 21 0.72 2893 2893 0 99.28 0 4 4 2893 100.00 2893 2893 0 0.00 0.00 0.00 0.00 46 0 4 20 40 1.39 2893 2893 0 98.61 0 5 5 58 2.02 2893 2893 0 97.98 0.00 0.00 0.00 46 0 5 21 82 2.82 2893 2893 0 97.18 0 6 6 106 3.65 2893 2893 0 96.35 0.00 0.00 0.00 44 0 6 22 42 1.46 2893 2893 0 98.54 0 7 7 28 0.97 2893 2893 0 99.03 0.00 0.00 0.00 48 0 7 23 15 0.52 2893 2893 0 99.48 1 0 8 222 7.67 2893 2893 0 92.33 0.00 0.00 0.00 72 76 0.00 0.00 0.00 0.00 66.22 51.92 0.00 0.00 0.00 1 0 24 63 2.17 2893 2893 0 97.83 1 1 9 2893 100.00 2893 2893 0 0.00 0.00 0.00 0.00 76 1 1 25 108 3.73 2893 2893 0 96.27 1 2 10 86 2.98 2893 2893 0 97.02 0.00 0.00 0.00 71 1 2 26 78 2.71 2893 2893 0 97.29 1 3 11 57 1.96 2893 2893 0 98.04 0.00 0.00 0.00 69 1 3 27 41 1.40 2893 2893 0 98.60 1 4 12 56 1.93 2893 2893 0 98.07 0.00 0.00 0.00 71 1 4 28 30 1.03 2893 2893 0 98.97 1 5 13 32 1.12 2893 2893 0 98.88 0.00 0.00 0.00 71 1 5 29 23 0.79 2893 2893 0 99.21 1 6 14 49 1.68 2893 2893 0 98.32 0.00 0.00 0.00 71 1 6 30 93 3.20 2893 2893 0 96.80 1 7 15 41 1.41 2893 2893 0 98.59 0.00 0.00 0.00 72 1 7 31 21 0.72 2893 2893 0 99.28
Turbo boost did not play (Bzy_MHz = TSC_MHz).
Those 2 sessions produced about 955 000 LIOPS:
Load Profile Per Second Per Transaction Per Exec Per Call ~~~~~~~~~~~~~~~ --------------- --------------- --------- --------- DB Time(s): 2.0 5.8 0.00 2.95 DB CPU(s): 1.9 5.6 0.00 2.89 Redo size (bytes): 16,541.5 48,682.3 Logical read (blocks): 955,750.1 2,812,818.1
Test 2: launch 2 SLOB sessions on 2 distincts cores on the same socket that way:
$ cd SLOB $ numactl --physcpubind=4,5 /bin/bash $ echo "startup" | sqlplus / as sysdba $ ./runit.sh 2
Cpus 4 and 5 are on 2 distincts cores on the same socket (see cpu_topology output).
The turbsostat command shows that they are 100% busy during the run:
Package Core CPU Avg_MHz %Busy Bzy_MHz TSC_MHz SMI CPU%c1 CPU%c3 CPU%c6 CPU%c7 CoreTmp PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt RAMWatt PKG_% RAM_% - - - 270 9.34 2893 2893 0 90.66 0.00 0.00 0.00 76 76 0.00 0.00 0.00 0.00 119.34 91.17 0.00 0.00 0.00 0 0 0 152 5.26 2893 2893 0 94.74 0.00 0.00 0.00 47 49 0.00 0.00 0.00 0.00 55.29 41.51 0.00 0.00 0.00 0 0 16 59 2.04 2893 2893 0 97.96 0 1 1 62 2.13 2893 2893 0 97.87 0.00 0.00 0.00 42 0 1 17 49 1.70 2893 2893 0 98.30 0 2 2 59 2.03 2893 2893 0 97.97 0.00 0.00 0.00 43 0 2 18 55 1.90 2893 2893 0 98.10 0 3 3 10 0.33 2893 2893 0 99.67 0.00 0.00 0.00 44 0 3 19 16 0.56 2893 2893 0 99.44 0 4 4 2893 100.00 2893 2893 0 0.00 0.00 0.00 0.00 46 0 4 20 40 1.40 2893 2893 0 98.60 0 5 5 2893 100.00 2893 2893 0 0.00 0.00 0.00 0.00 49 0 5 21 133 4.59 2893 2893 0 95.41 0 6 6 71 2.47 2893 2893 0 97.53 0.00 0.00 0.00 45 0 6 22 57 1.97 2893 2893 0 98.03 0 7 7 20 0.70 2893 2893 0 99.30 0.00 0.00 0.00 47 0 7 23 14 0.47 2893 2893 0 99.53 1 0 8 364 12.58 2893 2893 0 87.42 0.00 0.00 0.00 76 76 0.00 0.00 0.00 0.00 64.05 49.66 0.00 0.00 0.00 1 0 24 93 3.20 2893 2893 0 96.80 1 1 9 270 9.34 2893 2893 0 90.66 0.00 0.00 0.00 76 1 1 25 105 3.62 2893 2893 0 96.38 1 2 10 267 9.23 2893 2893 0 90.77 0.00 0.00 0.00 75 1 2 26 283 9.77 2893 2893 0 90.23 1 3 11 73 2.52 2893 2893 0 97.48 0.00 0.00 0.00 73 1 3 27 92 3.19 2893 2893 0 96.81 1 4 12 43 1.50 2893 2893 0 98.50 0.00 0.00 0.00 75 1 4 28 44 1.53 2893 2893 0 98.47 1 5 13 46 1.60 2893 2893 0 98.40 0.00 0.00 0.00 75 1 5 29 48 1.67 2893 2893 0 98.33 1 6 14 76 2.63 2893 2893 0 97.37 0.00 0.00 0.00 75 1 6 30 105 3.64 2893 2893 0 96.36 1 7 15 106 3.67 2893 2893 0 96.33 0.00 0.00 0.00 75
Turbo boost did not play (Bzy_MHz = TSC_MHz).
Those 2 sessions produced about 920 000 LIOPS:
Load Profile Per Second Per Transaction Per Exec Per Call ~~~~~~~~~~~~~~~ --------------- --------------- --------- --------- DB Time(s): 2.0 5.8 0.00 3.02 DB CPU(s): 1.9 5.7 0.00 2.97 Redo size (bytes): 16,014.6 47,163.1 Logical read (blocks): 921,446.7 2,713,660.6
So, let’s say this is more or less the same as with 2 sessions on 2 distincts sockets.
Test 3: launch 2 SLOB sessions on the same core that way:
$ cd SLOB $ numactl --physcpubind=4,20 /bin/bash $ echo "startup" | sqlplus / as sysdba $ ./runit.sh 2
Cpus 4 and 20 are sharing the same core (see cpu_topology output).
The turbsostat command shows that they are 100% busy during the run:
Package Core CPU Avg_MHz %Busy Bzy_MHz TSC_MHz SMI CPU%c1 CPU%c3 CPU%c6 CPU%c7 CoreTmp PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt RAMWatt PKG_% RAM_% - - - 250 8.65 2893 2893 0 91.35 0.00 0.00 0.00 76 76 0.00 0.00 0.00 0.00 114.06 85.95 0.00 0.00 0.00 0 0 0 100 3.45 2893 2893 0 96.55 0.00 0.00 0.00 48 48 0.00 0.00 0.00 0.00 51.21 37.45 0.00 0.00 0.00 0 0 16 41 1.41 2893 2893 0 98.59 0 1 1 75 2.61 2893 2893 0 97.39 0.00 0.00 0.00 42 0 1 17 58 2.01 2893 2893 0 97.99 0 2 2 51 1.75 2893 2893 0 98.25 0.00 0.00 0.00 44 0 2 18 40 1.38 2893 2893 0 98.62 0 3 3 25 0.86 2893 2893 0 99.14 0.00 0.00 0.00 45 0 3 19 9 0.30 2893 2893 0 99.70 0 4 4 2893 100.00 2893 2893 0 0.00 0.00 0.00 0.00 47 0 4 20 2893 100.00 2893 2893 0 0.00 0 5 5 120 4.14 2893 2893 0 95.86 0.00 0.00 0.00 44 0 5 21 73 2.54 2893 2893 0 97.46 0 6 6 65 2.25 2893 2893 0 97.75 0.00 0.00 0.00 45 0 6 22 51 1.76 2893 2893 0 98.24 0 7 7 18 0.64 2893 2893 0 99.36 0.00 0.00 0.00 48 0 7 23 7 0.25 2893 2893 0 99.75 1 0 8 491 16.97 2893 2893 0 83.03 0.00 0.00 0.00 76 76 0.00 0.00 0.00 0.00 62.85 48.50 0.00 0.00 0.00 1 0 24 88 3.04 2893 2893 0 96.96 1 1 9 91 3.16 2893 2893 0 96.84 0.00 0.00 0.00 75 1 1 25 80 2.75 2893 2893 0 97.25 1 2 10 87 3.00 2893 2893 0 97.00 0.00 0.00 0.00 74 1 2 26 104 3.61 2893 2893 0 96.39 1 3 11 62 2.16 2893 2893 0 97.84 0.00 0.00 0.00 72 1 3 27 80 2.77 2893 2893 0 97.23 1 4 12 36 1.24 2893 2893 0 98.76 0.00 0.00 0.00 74 1 4 28 36 1.25 2893 2893 0 98.75 1 5 13 47 1.62 2893 2893 0 98.38 0.00 0.00 0.00 75 1 5 29 75 2.61 2893 2893 0 97.39 1 6 14 52 1.79 2893 2893 0 98.21 0.00 0.00 0.00 75 1 6 30 92 3.17 2893 2893 0 96.83 1 7 15 39 1.35 2893 2893 0 98.65 0.00 0.00 0.00 76 1 7 31 32 1.11 2893 2893 0 98.89
Turbo boost did not play (Bzy_MHz = TSC_MHz).
Those 2 sessions produced about 600 000 LIOPS:
Load Profile Per Second Per Transaction Per Exec Per Call ~~~~~~~~~~~~~~~ --------------- --------------- --------- --------- DB Time(s): 2.0 5.8 0.00 3.02 DB CPU(s): 1.9 5.6 0.00 2.96 Redo size (bytes): 16,204.0 47,101.1 Logical read (blocks): 603,430.3 1,754,028.2
So, as you can see, the SLOB runs (with the same SLOB configuration) produced about:
- 955 000 LIOPS with the 2 sessions running on 2 distincts sockets (so on 2 distincts cores).
- 920 000 LIOPS with the 2 sessions running on 2 distincts cores on the same socket.
- 600 000 LIOPS with the 2 sessions running on the same core.
Out of curiosity, what if we disable the hyper-threading on one core and launch the 2 sessions on it?
Let’s put the cpu 20 offline (to be sure there is no noise on it during the test)
echo 0 > /sys/devices/system/cpu/cpu20/online
launch the 2 sessions on the same cpu (the remaining one online for this core as the second one has just been put offline) that way:
$ cd SLOB $ numactl --physcpubind=4 /bin/bash $ echo "startup" | sqlplus / as sysdba $ ./runit.sh 2
The turbsostat command shows that cpu 4 is 100% busy during the run:
Package Core CPU Avg_MHz %Busy Bzy_MHz TSC_MHz SMI CPU%c1 CPU%c3 CPU%c6 CPU%c7 CoreTmp PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt RAMWatt PKG_% RAM_% - - - 153 5.30 2893 2893 0 94.70 0.00 0.00 0.00 77 77 0.00 0.00 0.00 0.00 112.76 84.64 0.00 0.00 0.00 0 0 0 56 1.94 2893 2893 0 98.06 0.00 0.00 0.00 45 46 0.00 0.00 0.00 0.00 49.86 36.07 0.00 0.00 0.00 0 0 16 42 1.45 2893 2893 0 98.55 0 1 1 97 3.35 2893 2893 0 96.65 0.00 0.00 0.00 42 0 1 17 61 2.11 2893 2893 0 97.89 0 2 2 47 1.62 2893 2893 0 98.38 0.00 0.00 0.00 44 0 2 18 45 1.54 2893 2893 0 98.46 0 3 3 19 0.64 2893 2893 0 99.36 0.00 0.00 0.00 44 0 3 19 25 0.85 2893 2893 0 99.15 0 4 4 2893 100.00 2893 2893 0 0.00 0.00 0.00 0.00 46 0 5 5 9 0.31 2893 2893 0 99.69 0.00 0.00 0.00 44 0 5 21 14 0.49 2893 2893 0 99.51 0 6 6 14 0.48 2893 2893 0 99.52 0.00 0.00 0.00 44 0 6 22 12 0.41 2893 2893 0 99.59 0 7 7 25 0.88 2893 2893 0 99.12 0.00 0.00 0.00 46 0 7 23 9 0.33 2893 2893 0 99.67 1 0 8 133 4.59 2893 2893 0 95.41 0.00 0.00 0.00 76 77 0.00 0.00 0.00 0.00 62.90 48.56 0.00 0.00 0.00 1 0 24 69 2.37 2893 2893 0 97.63 1 1 9 89 3.07 2893 2893 0 96.93 0.00 0.00 0.00 76 1 1 25 73 2.52 2893 2893 0 97.48 1 2 10 98 3.38 2893 2893 0 96.62 0.00 0.00 0.00 76 1 2 26 113 3.92 2893 2893 0 96.08 1 3 11 76 2.62 2893 2893 0 97.38 0.00 0.00 0.00 74 1 3 27 78 2.71 2893 2893 0 97.29 1 4 12 194 6.71 2893 2893 0 93.29 0.00 0.00 0.00 75 1 4 28 27 0.94 2893 2893 0 99.06 1 5 13 62 2.14 2893 2893 0 97.86 0.00 0.00 0.00 76 1 5 29 95 3.29 2893 2893 0 96.71 1 6 14 101 3.48 2893 2893 0 96.52 0.00 0.00 0.00 75 1 6 30 92 3.19 2893 2893 0 96.81 1 7 15 64 2.20 2893 2893 0 97.80 0.00 0.00 0.00 77 1 7 31 20 0.68 2893 2893 0 99.32
and as you can see it is the only cpu available on that core.
Turbo boost did not play (Bzy_MHz = TSC_MHz).
Those 2 sessions produced about 460 000 LIOPS:
Load Profile Per Second Per Transaction Per Exec Per Call ~~~~~~~~~~~~~~~ --------------- --------------- --------- --------- DB Time(s): 2.0 5.8 0.00 2.69 DB CPU(s): 1.0 2.8 0.00 1.32 Redo size (bytes): 18,978.1 55,154.9 Logical read (blocks): 458,451.7 1,332,369.7
which makes fully sense as 2 sessions on 2 distincts cores on the same sockets produced about 920 000 LIOPS (Test 2).
So what?
Imagine that you launched multiple full cached SLOB runs on your new box to measure its behaviour/scalability (from one session up to the cpu_count value).
Example: I launched 11 full cached SLOB runs with 2,4,6,8,12,16,18,20,24,28 and 32 sessions running. The LIOPS graph is the following:
Just check the curve of the graph (not the values). As you can see, once the number of sessions reached the number of cores (16) the LIOPS is still increasing but its increase rate is slow down. One of the reason is the one we observed during our previous tests: 2 sessions running on the same core don’t perform as fast as 2 sessions running on 2 distincts cores.
Remarks
Worth watching presentation around oracle and cpu stuff:
- Karl Arao: Where did my cpu go
- Kevin Closson: Modern platform topics for modern DBAs
- Kevin Closson: SLOB – For More Than I/O!
The absolute LIOPS numbers are relatively low. It would have been possible to increase those numbers during the tests by increasing the WORK_UNIT slob parameter (see this blog post).
To get the turbostat utility, copy/paste the turbostat.c and msr-index.h contents into a directory and compile turbostat.c that way:
$ gcc -Wall -DMSRHEADER='"./msr-index.h"' turbostat.c -o turbostat
Conclusion
During the test I observed the following (of course your results could differ depending of the cpus, the slob configuration and so on..):
- 955 000 LIOPS with the 2 sessions running on 2 distincts sockets (so on 2 distincts cores)
- 920 000 LIOPS with the 2 sessions running on 2 distincts cores on the same socket
- 600 000 LIOPS with the 2 sessions running on the same core
- 460 000 LIOPS with the 2 sessions running on the same core with hyper-threading disabled
Hyper-threading helps to increase the overall LIOPS throughput of the system, but once the numbers of sessions is > number of cores then the average LIOPS per session decrease.
can you try re-running test 1 after starting the instance like this:
$ numactl –physcpubind=4,9 –membind=interleave sqlplus / as sysdba
then run SLOB runit.sh like this:
$ numacl –physcpubind=4,9 –membind=local sh ./runit.sh
I think I should have said:
$ numactl –cpunodebind=0,1 –membind=interleave sqlplus / as sysdba
and then numactl –physcpubind to the first core on each socket but do also force –membind=local for runit.sh (and thus the SLOB foregrounds)
numa is off on this machine Kevin:
$ cat /proc/cmdline
root=LABEL=DBSYS bootarea=dbsys bootfrom=BOOT ro loglevel=7 panic=60 debug pci=noaer log_buf_len=1m nmi_watchdog=0 nomce transparent_hugepage=never rd_NO_PLYMOUTH audit=1 console=tty1 console=ttyS0,115200n8 crashkernel=
380M@128M numa=off processor.max_cstate=1 intel_idle.max_cstate=0
Hi,
it’s strange that turboboost is not working properly. The expected behavior is to boost the clock on the fly. I just tried it on E5-2670. My test case is simple while loop with null:
begin
while true loop
null;
end loop;
end;
/
from 2 sessions and turbostat is showing the following:
CPU Avg_MHz %Busy Bzy_MHz TSC_MHz
– 223 7.52 2962 2594
0 63 2.01 3153 2594
16 93 2.94 3161 2594
1 3164 99.22 3189 2594
17 31 0.99 3113 2594
2 34 1.10 3104 2594
18 52 1.65 3142 2594
3 3164 99.22 3189 2594
19 23 0.74 3089 2594
.
.
.
Try to install cpupowerutils, and check the boost state:
[root@******** ~]# cpupower -c all frequency-info
analyzing CPU 0:
no or unknown cpufreq driver is active on this CPU
boost state support:
Supported: yes
Active: yes
3200 MHz max turbo 4 active cores
3200 MHz max turbo 3 active cores
3300 MHz max turbo 2 active cores
3300 MHz max turbo 1 active cores
.
.
.
cpupowerutils includes turbostat as well.
Regards
Hi Nikolay,
Turboboost is not active and that’s what I wanted.
I wanted to focus on Hyper-threading only without any noise/fluctuations that turboboost may have generated.
I’ll do another tests to focus on turboboost only.
Thx
Bertrand
Got it in the opposite way, thanks
Great post.
Would by very interesting if you could to plot % CPU utilization at each number of sessions in the graph (with another y-axis) like reported by SAR for example. I am very curious about CPU utilization at 16 sessions and how many sessions would be necessary to arrive about 100%.
Thanks