Graphing Oracle performance metrics with Telegraf, InfluxDB and Grafana

Introduction

As a picture is worth a thousand words, we may want to visualise the oracle performance metrics. There is already several tools to do so, but what if you could do it the way you want? build your own graph? Let’s try to achieve this with 3 layers:

  • telegraf: to collect the oracle metrics
  • InfluxDB: to store the time-series oracle metrics
  • grafana: to visualise the oracle time-series metrics

Installation

Let’s install InfluxDB and grafana that way (for example on the same host, namely influxgraf):

  • Install InfluxDB (detail here):
[root@influxgraf ~]# wget https://s3.amazonaws.com/influxdb/influxdb-0.10.2-1.x86_64.rpm

[root@influxgraf ~]# yum localinstall influxdb-0.10.2-1.x86_64.rpm
  • Install grafana (detail here):
[root@influxgraf ~]# yum install https://grafanarel.s3.amazonaws.com/builds/grafana-2.6.0-1.x86_64.rpm

Let’s install telegraf on the oracle host (namely Dprima) that way (detail here):

[root@Dprima ~]# wget http://get.influxdb.org/telegraf/telegraf-0.10.4.1-1.x86_64.rpm

[root@Dprima ~]# yum localinstall telegraf-0.10.4.1-1.x86_64.rpm

Setup

  • Create a telegraf database into InfluxDB (using the web interface):

Screen Shot 2016-03-05 at 09.06.10

  • Create a root user into InfluxDB (using the web interface):

Screen Shot 2016-03-05 at 09.08.38

  • Write a script to collect the oracle metrics. I am using python but this is not mandatory at all. Only the output of the script does matter, it has to be InfluxDB line-protocol. The script query the v$sysmetric and v$eventmetric views to get the wait class and the wait event metrics during the last minute. I am not reinventing the wheel, I am using Kyle Hailey‘s queries. The python code is:
[oracle@Dprima scripts]$ cat oracle_metrics.py
import os
import sys
import cx_Oracle
import argparse
import re

class OraMetrics():
    def __init__(self, user, passwd, sid):
        import cx_Oracle
        self.user = user
        self.passwd = passwd
        self.sid = sid
        self.connection = cx_Oracle.connect( self.user , self.passwd , self.sid )
        cursor = self.connection.cursor()
        cursor.execute("select HOST_NAME from v$instance")
        for hostname in cursor:
            self.hostname = hostname[0]

    def waitclassstats(self, user, passwd, sid):
        cursor = self.connection.cursor()
        cursor.execute("""
        select n.wait_class, round(m.time_waited/m.INTSIZE_CSEC,3) AAS
        from   v$waitclassmetric  m, v$system_wait_class n
        where m.wait_class_id=n.wait_class_id and n.wait_class != 'Idle'
        union
        select  'CPU', round(value/100,3) AAS
        from v$sysmetric where metric_name='CPU Usage Per Sec' and group_id=2
        union select 'CPU_OS', round((prcnt.busy*parameter.cpu_count)/100,3) - aas.cpu
        from
            ( select value busy
                from v$sysmetric
                where metric_name='Host CPU Utilization (%)'
                and group_id=2 ) prcnt,
                ( select value cpu_count from v$parameter where name='cpu_count' )  parameter,
                ( select  'CPU', round(value/100,3) cpu from v$sysmetric where metric_name='CPU Usage Per Sec' and group_id=2) aas
        """)
        for wait in cursor:
            wait_name = wait[0]
            wait_value = wait[1]
            print "oracle_wait_class,host=%s,db=%s,wait_class=%s wait_value=%s" % (self.hostname, sid,re.sub(' ', '_', wait_name), wait_value )

    def waitstats(self, user, passwd, sid):
        cursor = self.connection.cursor()
        cursor.execute("""
	select
	n.wait_class wait_class,
       	n.name wait_name,
       	m.wait_count cnt,
       	round(10*m.time_waited/nullif(m.wait_count,0),3) avgms
	from v$eventmetric m,
     	v$event_name n
	where m.event_id=n.event_id
  	and n.wait_class <> 'Idle' and m.wait_count > 0 order by 1 """)
        for wait in cursor:
         wait_class = wait[0]
         wait_name = wait[1]
         wait_cnt = wait[2]
         wait_avgms = wait[3]
         print "oracle_wait_event,host=%s,db=%s,wait_class=%s,wait_event=%s count=%s,latency=%s" % (self.hostname, sid,re.sub(' ', '_', wait_class),re.sub(' ', '_', wait_name)
, wait_cnt,wait_avgms)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('-u', '--user', help="Username", required=True)
    parser.add_argument('-p', '--passwd', required=True)
    parser.add_argument('-s', '--sid', help="SID", required=True)

    args = parser.parse_args()

    stats = OraMetrics(args.user, args.passwd, args.sid)
    stats.waitclassstats(args.user, args.passwd, args.sid)
    stats.waitstats(args.user, args.passwd, args.sid)

The output looks like (the output format is the InfluxDB line-protocol):

[oracle@Dprima scripts]$ python "/home/oracle/scripts/oracle_metrics.py" "-u" "system" "-p" "bdtbdt" "-s" PBDT
oracle_wait_class,host=Dprima,db=PBDT,wait_class=Administrative wait_value=0
oracle_wait_class,host=Dprima,db=PBDT,wait_class=CPU wait_value=0
oracle_wait_class,host=Dprima,db=PBDT,wait_class=CPU_OS wait_value=0.035
oracle_wait_class,host=Dprima,db=PBDT,wait_class=Commit wait_value=0
oracle_wait_class,host=Dprima,db=PBDT,wait_class=Concurrency wait_value=0
oracle_wait_class,host=Dprima,db=PBDT,wait_class=Configuration wait_value=0
oracle_wait_class,host=Dprima,db=PBDT,wait_class=Network wait_value=0
oracle_wait_class,host=Dprima,db=PBDT,wait_class=Other wait_value=0
oracle_wait_class,host=Dprima,db=PBDT,wait_class=Scheduler wait_value=0
oracle_wait_class,host=Dprima,db=PBDT,wait_class=System_I/O wait_value=0.005
oracle_wait_class,host=Dprima,db=PBDT,wait_class=User_I/O wait_value=0
oracle_wait_event,host=Dprima,db=PBDT,wait_class=System_I/O,wait_event=control_file_sequential_read count=163,latency=0.009
oracle_wait_event,host=Dprima,db=PBDT,wait_class=System_I/O,wait_event=control_file_parallel_write count=60,latency=3.933
oracle_wait_event,host=Dprima,db=PBDT,wait_class=System_I/O,wait_event=log_file_parallel_write count=60,latency=1.35
oracle_wait_event,host=Dprima,db=PBDT,wait_class=User_I/O,wait_event=Disk_file_operations_I/O count=16,latency=0.037
oracle_wait_event,host=Dprima,db=PBDT,wait_class=User_I/O,wait_event=Parameter_File_I/O count=16,latency=0.004
  • On the oracle host, configure telegraf to execute the python script with 60 seconds interval and send the output to InfluxDB. Edit the /etc/telegraf/telegraf.conf file so that it contains:
###############################################################################
#                                  OUTPUTS                                    #
###############################################################################

# Configuration for influxdb server to send metrics to
[[outputs.influxdb]]
  urls = ["http://influxgraf:8086"] # required
  database = "telegraf" # required
  precision = "s"
  timeout = "5s"


###############################################################################
#                                  INPUTS                                     #
###############################################################################

# Oracle metrics
[[inputs.exec]]
  # Shell/commands array
  commands = ["/home/oracle/scripts/oracle_metrics.sh"]
  # Data format to consume. This can be "json", "influx" or "graphite" (line-protocol)
  # NOTE json only reads numerical measurements, strings and booleans are ignored.
  data_format = "influx"
  interval = "60s"

The oracle_metrics.sh script contains the call to the python script:

#!/bin/env bash

export LD_LIBRARY_PATH=/u01/app/oracle/product/11.2.0/dbhome_1//lib
export ORACLE_HOME=/u01/app/oracle/product/11.2.0/dbhome_1/

python "/home/oracle/scripts/oracle_metrics.py" "-u" "system" "-p" "bdtbdt" "-s" PBDT
  • Launch telegraf:
[root@Dprima ~]# telegraf -config /etc/telegraf/telegraf.conf

Graph the metrics

  • First check that the metrics are stored the way we want into InfluxDB:

Screen Shot 2016-03-05 at 09.49.09

 

  • Configure InfluxDB as datasource in grafana:

Screen Shot 2016-03-05 at 10.03.55

  • In grafana, create a dashboard and create some variables (hosts, db and wait_class):

Screen Shot 2016-03-05 at 10.59.00

Screen Shot 2016-03-05 at 10.59.57

  • Now let’s create a graph:

Screen Shot 2016-03-05 at 11.04.09

  • Get the metrics:

Screen Shot 2016-03-05 at 11.06.18

  • So that the graph looks like:

Screen Shot 2016-03-05 at 11.20.18

Remarks

  • Thanks to the variables defined in grafana, we can filter on the host, database and wait class (instead of visualising all of them).
  • We can also graph the wait events we collected, or whatever you want to collect.

Conclusion

Thanks to:

  • telegraf and InfluxDB, we are able to collect and store the metrics we want to.
  • grafana, we are able to visualise the metrics the way we want to.

Update 04/21/2016:

As another example, with more informations collected:

dash1

dash2

dash3

dash4

dash5

 

Advertisements

Hyper-threading: measure its impact during SLOB runs thanks to numactl and turbostat

Introduction

As you probably know, a single physical CPU core with hyper-threading appears as two logical CPUs to an operating system. 

While the operating system sees two CPUs for each core, the actual CPU hardware only has a single set of execution resources for each core.

Hyper-threading can help speed your system up, but it’s nowhere near as good as having additional cores.

Out of curiosity, I checked the impact of hyper-threading on the number of logical IO per second (LIOPS) during SLOB runs. The main idea is to compare:

  • 2 FULL CACHED SLOB sessions running on the same core
  • 2 FULL CACHED SLOB sessions running on two distincts cores

Into this blog post, I just want to share how I did the measure and my results (of course your results could differ depending of the cpus, the slob configuration and so on..).

Tests environment and setup

  • Oracle database 11.2.0.4.
  • SLOB with the following slob.conf in place (the rest of the file contains the default slob.conf values):
UPDATE_PCT=0
RUN_TIME=60
WORK_LOOP=0
SCALE=80M
WORK_UNIT=64
  • The cpu_topology is the following (the cpu_topology script comes from Karl Arao’s cputoolkit here):
$ sh ./cpu_topology
	
model name	: Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz
processors  (OS CPU count)          0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
physical id (processor socket)      0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
siblings    (logical CPUs/socket)   16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16
core id     (# assigned to a core)  0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
cpu cores   (physical cores/socket) 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8

$ lscpu

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 45
Stepping: 7
CPU MHz: 2893.187
BogoMIPS: 5785.23
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K

As you can see the box is made of 2 sockets, 16 cores and 32 threads (so that the database sees 32 cpus).

  • Turbo boost does not come into play (you’ll see the proof during the tests thanks to the turbostat command).
  • Obviously the same SLOB configuration has been used for the tests.

Launch the tests

Test 1: launch 2 SLOB sessions on 2 distincts sockets (so on 2 distincts cores) that way:

$ cd SLOB
$ numactl --physcpubind=4,9 /bin/bash 
$ echo "startup" | sqlplus / as sysdba
$ ./runit.sh 2

Cpus 4 and 9 are on 2 distincts sockets (see cpu_topology output).

The turbsostat command shows that they are 100% busy during the run:

Package    Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt RAMWatt   PKG_%   RAM_%
       -       -       -     241    8.34    2893    2893       0   91.66    0.00    0.00    0.00      76      76    0.00    0.00    0.00    0.00  116.66   88.59    0.00    0.00    0.00
       0       0       0      86    2.98    2893    2893       0   97.02    0.00    0.00    0.00      46      48    0.00    0.00    0.00    0.00   50.44   36.66    0.00    0.00    0.00
       0       0      16      70    2.41    2893    2893       0   97.59
       0       1       1      71    2.45    2893    2893       0   97.55    0.00    0.00    0.00      43
       0       1      17     131    4.54    2893    2893       0   95.46
       0       2       2      81    2.81    2893    2893       0   97.19    0.00    0.00    0.00      44
       0       2      18      73    2.53    2893    2893       0   97.47
       0       3       3      29    1.00    2893    2893       0   99.00    0.00    0.00    0.00      44
       0       3      19      21    0.72    2893    2893       0   99.28
       0       4       4    2893  100.00    2893    2893       0    0.00    0.00    0.00    0.00      46
       0       4      20      40    1.39    2893    2893       0   98.61
       0       5       5      58    2.02    2893    2893       0   97.98    0.00    0.00    0.00      46
       0       5      21      82    2.82    2893    2893       0   97.18
       0       6       6     106    3.65    2893    2893       0   96.35    0.00    0.00    0.00      44
       0       6      22      42    1.46    2893    2893       0   98.54
       0       7       7      28    0.97    2893    2893       0   99.03    0.00    0.00    0.00      48
       0       7      23      15    0.52    2893    2893       0   99.48
       1       0       8     222    7.67    2893    2893       0   92.33    0.00    0.00    0.00      72      76    0.00    0.00    0.00    0.00   66.22   51.92    0.00    0.00    0.00
       1       0      24      63    2.17    2893    2893       0   97.83
       1       1       9    2893  100.00    2893    2893       0    0.00    0.00    0.00    0.00      76
       1       1      25     108    3.73    2893    2893       0   96.27
       1       2      10      86    2.98    2893    2893       0   97.02    0.00    0.00    0.00      71
       1       2      26      78    2.71    2893    2893       0   97.29
       1       3      11      57    1.96    2893    2893       0   98.04    0.00    0.00    0.00      69
       1       3      27      41    1.40    2893    2893       0   98.60
       1       4      12      56    1.93    2893    2893       0   98.07    0.00    0.00    0.00      71
       1       4      28      30    1.03    2893    2893       0   98.97
       1       5      13      32    1.12    2893    2893       0   98.88    0.00    0.00    0.00      71
       1       5      29      23    0.79    2893    2893       0   99.21
       1       6      14      49    1.68    2893    2893       0   98.32    0.00    0.00    0.00      71
       1       6      30      93    3.20    2893    2893       0   96.80
       1       7      15      41    1.41    2893    2893       0   98.59    0.00    0.00    0.00      72
       1       7      31      21    0.72    2893    2893       0   99.28

Turbo boost did not play  (Bzy_MHz = TSC_MHz).

Those 2 sessions produced about 955 000 LIOPS:

Load Profile                    Per Second   Per Transaction  Per Exec  Per Call
~~~~~~~~~~~~~~~            ---------------   --------------- --------- ---------
             DB Time(s):               2.0               5.8      0.00      2.95
              DB CPU(s):               1.9               5.6      0.00      2.89
      Redo size (bytes):          16,541.5          48,682.3
  Logical read (blocks):         955,750.1       2,812,818.1

Test 2: launch 2 SLOB sessions on 2 distincts cores on the same socket that way:

$ cd SLOB
$ numactl --physcpubind=4,5 /bin/bash 
$ echo "startup" | sqlplus / as sysdba
$ ./runit.sh 2

Cpus 4 and 5 are on 2 distincts cores on the same socket (see cpu_topology output).

The turbsostat command shows that they are 100% busy during the run:

Package    Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt RAMWatt   PKG_%   RAM_%
       -       -       -     270    9.34    2893    2893       0   90.66    0.00    0.00    0.00      76      76    0.00    0.00    0.00    0.00  119.34   91.17    0.00    0.00    0.00
       0       0       0     152    5.26    2893    2893       0   94.74    0.00    0.00    0.00      47      49    0.00    0.00    0.00    0.00   55.29   41.51    0.00    0.00    0.00
       0       0      16      59    2.04    2893    2893       0   97.96
       0       1       1      62    2.13    2893    2893       0   97.87    0.00    0.00    0.00      42
       0       1      17      49    1.70    2893    2893       0   98.30
       0       2       2      59    2.03    2893    2893       0   97.97    0.00    0.00    0.00      43
       0       2      18      55    1.90    2893    2893       0   98.10
       0       3       3      10    0.33    2893    2893       0   99.67    0.00    0.00    0.00      44
       0       3      19      16    0.56    2893    2893       0   99.44
       0       4       4    2893  100.00    2893    2893       0    0.00    0.00    0.00    0.00      46
       0       4      20      40    1.40    2893    2893       0   98.60
       0       5       5    2893  100.00    2893    2893       0    0.00    0.00    0.00    0.00      49
       0       5      21     133    4.59    2893    2893       0   95.41
       0       6       6      71    2.47    2893    2893       0   97.53    0.00    0.00    0.00      45
       0       6      22      57    1.97    2893    2893       0   98.03
       0       7       7      20    0.70    2893    2893       0   99.30    0.00    0.00    0.00      47
       0       7      23      14    0.47    2893    2893       0   99.53
       1       0       8     364   12.58    2893    2893       0   87.42    0.00    0.00    0.00      76      76    0.00    0.00    0.00    0.00   64.05   49.66    0.00    0.00    0.00
       1       0      24      93    3.20    2893    2893       0   96.80
       1       1       9     270    9.34    2893    2893       0   90.66    0.00    0.00    0.00      76
       1       1      25     105    3.62    2893    2893       0   96.38
       1       2      10     267    9.23    2893    2893       0   90.77    0.00    0.00    0.00      75
       1       2      26     283    9.77    2893    2893       0   90.23
       1       3      11      73    2.52    2893    2893       0   97.48    0.00    0.00    0.00      73
       1       3      27      92    3.19    2893    2893       0   96.81
       1       4      12      43    1.50    2893    2893       0   98.50    0.00    0.00    0.00      75
       1       4      28      44    1.53    2893    2893       0   98.47
       1       5      13      46    1.60    2893    2893       0   98.40    0.00    0.00    0.00      75
       1       5      29      48    1.67    2893    2893       0   98.33
       1       6      14      76    2.63    2893    2893       0   97.37    0.00    0.00    0.00      75
       1       6      30     105    3.64    2893    2893       0   96.36
       1       7      15     106    3.67    2893    2893       0   96.33    0.00    0.00    0.00      75

Turbo boost did not play  (Bzy_MHz = TSC_MHz).

Those 2 sessions produced about 920 000 LIOPS:

Load Profile                    Per Second   Per Transaction  Per Exec  Per Call
~~~~~~~~~~~~~~~            ---------------   --------------- --------- ---------
             DB Time(s):               2.0               5.8      0.00      3.02
              DB CPU(s):               1.9               5.7      0.00      2.97
      Redo size (bytes):          16,014.6          47,163.1
  Logical read (blocks):         921,446.7       2,713,660.6

So, let’s say this is more or less the same as with 2 sessions on 2 distincts sockets.

Test 3: launch 2 SLOB sessions on the same core that way:

$ cd SLOB
$ numactl --physcpubind=4,20 /bin/bash 
$ echo "startup" | sqlplus / as sysdba
$ ./runit.sh 2

Cpus 4 and 20 are sharing the same core (see cpu_topology output).

The turbsostat command shows that they are 100% busy during the run:

Package    Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt RAMWatt   PKG_%   RAM_%
       -       -       -     250    8.65    2893    2893       0   91.35    0.00    0.00    0.00      76      76    0.00    0.00    0.00    0.00  114.06   85.95    0.00    0.00    0.00
       0       0       0     100    3.45    2893    2893       0   96.55    0.00    0.00    0.00      48      48    0.00    0.00    0.00    0.00   51.21   37.45    0.00    0.00    0.00
       0       0      16      41    1.41    2893    2893       0   98.59
       0       1       1      75    2.61    2893    2893       0   97.39    0.00    0.00    0.00      42
       0       1      17      58    2.01    2893    2893       0   97.99
       0       2       2      51    1.75    2893    2893       0   98.25    0.00    0.00    0.00      44
       0       2      18      40    1.38    2893    2893       0   98.62
       0       3       3      25    0.86    2893    2893       0   99.14    0.00    0.00    0.00      45
       0       3      19       9    0.30    2893    2893       0   99.70
       0       4       4    2893  100.00    2893    2893       0    0.00    0.00    0.00    0.00      47
       0       4      20    2893  100.00    2893    2893       0    0.00
       0       5       5     120    4.14    2893    2893       0   95.86    0.00    0.00    0.00      44
       0       5      21      73    2.54    2893    2893       0   97.46
       0       6       6      65    2.25    2893    2893       0   97.75    0.00    0.00    0.00      45
       0       6      22      51    1.76    2893    2893       0   98.24
       0       7       7      18    0.64    2893    2893       0   99.36    0.00    0.00    0.00      48
       0       7      23       7    0.25    2893    2893       0   99.75
       1       0       8     491   16.97    2893    2893       0   83.03    0.00    0.00    0.00      76      76    0.00    0.00    0.00    0.00   62.85   48.50    0.00    0.00    0.00
       1       0      24      88    3.04    2893    2893       0   96.96
       1       1       9      91    3.16    2893    2893       0   96.84    0.00    0.00    0.00      75
       1       1      25      80    2.75    2893    2893       0   97.25
       1       2      10      87    3.00    2893    2893       0   97.00    0.00    0.00    0.00      74
       1       2      26     104    3.61    2893    2893       0   96.39
       1       3      11      62    2.16    2893    2893       0   97.84    0.00    0.00    0.00      72
       1       3      27      80    2.77    2893    2893       0   97.23
       1       4      12      36    1.24    2893    2893       0   98.76    0.00    0.00    0.00      74
       1       4      28      36    1.25    2893    2893       0   98.75
       1       5      13      47    1.62    2893    2893       0   98.38    0.00    0.00    0.00      75
       1       5      29      75    2.61    2893    2893       0   97.39
       1       6      14      52    1.79    2893    2893       0   98.21    0.00    0.00    0.00      75
       1       6      30      92    3.17    2893    2893       0   96.83
       1       7      15      39    1.35    2893    2893       0   98.65    0.00    0.00    0.00      76
       1       7      31      32    1.11    2893    2893       0   98.89

Turbo boost did not play  (Bzy_MHz = TSC_MHz).

Those 2 sessions produced about 600 000 LIOPS:

Load Profile                    Per Second   Per Transaction  Per Exec  Per Call
~~~~~~~~~~~~~~~            ---------------   --------------- --------- ---------
             DB Time(s):               2.0               5.8      0.00      3.02
              DB CPU(s):               1.9               5.6      0.00      2.96
      Redo size (bytes):          16,204.0          47,101.1
  Logical read (blocks):         603,430.3       1,754,028.2

So, as you can see, the SLOB runs (with the same SLOB configuration) produced about:

  • 955 000 LIOPS with the 2 sessions running on 2 distincts sockets (so on 2 distincts cores).
  • 920 000 LIOPS with the 2 sessions running on 2 distincts cores on the same socket.
  • 600 000 LIOPS with the 2 sessions running on the same core.

Out of curiosity, what if we disable the hyper-threading on one core and launch the 2 sessions on it?

Let’s put the cpu 20 offline (to be sure there is no noise on it during the test)

echo 0 > /sys/devices/system/cpu/cpu20/online

launch the 2 sessions on the same cpu (the remaining one online for this core as the second one has just been put offline) that way:

$ cd SLOB
$ numactl --physcpubind=4 /bin/bash 
$ echo "startup" | sqlplus / as sysdba
$ ./runit.sh 2

The turbsostat command shows that cpu 4 is 100% busy during the run:

Package    Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt RAMWatt   PKG_%   RAM_%
       -       -       -     153    5.30    2893    2893       0   94.70    0.00    0.00    0.00      77      77    0.00    0.00    0.00    0.00  112.76   84.64    0.00    0.00    0.00
       0       0       0      56    1.94    2893    2893       0   98.06    0.00    0.00    0.00      45      46    0.00    0.00    0.00    0.00   49.86   36.07    0.00    0.00    0.00
       0       0      16      42    1.45    2893    2893       0   98.55
       0       1       1      97    3.35    2893    2893       0   96.65    0.00    0.00    0.00      42
       0       1      17      61    2.11    2893    2893       0   97.89
       0       2       2      47    1.62    2893    2893       0   98.38    0.00    0.00    0.00      44
       0       2      18      45    1.54    2893    2893       0   98.46
       0       3       3      19    0.64    2893    2893       0   99.36    0.00    0.00    0.00      44
       0       3      19      25    0.85    2893    2893       0   99.15
       0       4       4    2893  100.00    2893    2893       0    0.00    0.00    0.00    0.00      46
       0       5       5       9    0.31    2893    2893       0   99.69    0.00    0.00    0.00      44
       0       5      21      14    0.49    2893    2893       0   99.51
       0       6       6      14    0.48    2893    2893       0   99.52    0.00    0.00    0.00      44
       0       6      22      12    0.41    2893    2893       0   99.59
       0       7       7      25    0.88    2893    2893       0   99.12    0.00    0.00    0.00      46
       0       7      23       9    0.33    2893    2893       0   99.67
       1       0       8     133    4.59    2893    2893       0   95.41    0.00    0.00    0.00      76      77    0.00    0.00    0.00    0.00   62.90   48.56    0.00    0.00    0.00
       1       0      24      69    2.37    2893    2893       0   97.63
       1       1       9      89    3.07    2893    2893       0   96.93    0.00    0.00    0.00      76
       1       1      25      73    2.52    2893    2893       0   97.48
       1       2      10      98    3.38    2893    2893       0   96.62    0.00    0.00    0.00      76
       1       2      26     113    3.92    2893    2893       0   96.08
       1       3      11      76    2.62    2893    2893       0   97.38    0.00    0.00    0.00      74
       1       3      27      78    2.71    2893    2893       0   97.29
       1       4      12     194    6.71    2893    2893       0   93.29    0.00    0.00    0.00      75
       1       4      28      27    0.94    2893    2893       0   99.06
       1       5      13      62    2.14    2893    2893       0   97.86    0.00    0.00    0.00      76
       1       5      29      95    3.29    2893    2893       0   96.71
       1       6      14     101    3.48    2893    2893       0   96.52    0.00    0.00    0.00      75
       1       6      30      92    3.19    2893    2893       0   96.81
       1       7      15      64    2.20    2893    2893       0   97.80    0.00    0.00    0.00      77
       1       7      31      20    0.68    2893    2893       0   99.32

and as you can see it is the only cpu available on that core.

Turbo boost did not play  (Bzy_MHz = TSC_MHz).

Those 2 sessions produced about 460 000 LIOPS:

Load Profile                    Per Second   Per Transaction  Per Exec  Per Call
~~~~~~~~~~~~~~~            ---------------   --------------- --------- ---------
             DB Time(s):               2.0               5.8      0.00      2.69
              DB CPU(s):               1.0               2.8      0.00      1.32
      Redo size (bytes):          18,978.1          55,154.9
  Logical read (blocks):         458,451.7       1,332,369.7

which makes fully sense as 2 sessions on 2 distincts cores on the same sockets produced about 920 000 LIOPS (Test 2).

So what?

Imagine that you launched multiple full cached SLOB runs on your new box to measure its behaviour/scalability (from one session up to the cpu_count value).

Example: I launched 11 full cached SLOB runs with 2,4,6,8,12,16,18,20,24,28 and 32 sessions running. The LIOPS graph is the following:

Screen Shot 2015-09-22 at 09.29.08

Just check the curve of the graph (not the values). As you can see, once the number of sessions reached the number of cores (16) the LIOPS is still increasing but its increase rate is slow down. One of the reason is the one we observed during our previous tests: 2 sessions running on the same core don’t perform as fast as 2 sessions running on 2 distincts cores.

Remarks

Worth watching presentation around oracle and cpu stuff:

The absolute LIOPS numbers are relatively low. It would have been possible to increase those numbers during the tests by increasing the WORK_UNIT slob parameter (see this blog post).

To get the turbostat utility, copy/paste the turbostat.c and msr-index.h contents into a directory and compile turbostat.c that way:

$ gcc -Wall -DMSRHEADER='"./msr-index.h"' turbostat.c -o turbostat

Conclusion

During the test I observed the following (of course your results could differ depending of the cpus, the slob configuration and so on..):

  • 955 000 LIOPS with the 2 sessions running on 2 distincts sockets (so on 2 distincts cores)
  • 920 000 LIOPS with the 2 sessions running on 2 distincts cores on the same socket
  • 600 000 LIOPS with the 2 sessions running on the same core
  • 460 000 LIOPS with the 2 sessions running on the same core with hyper-threading disabled

Hyper-threading helps to increase the overall LIOPS throughput of the system, but once the numbers of sessions is > number of cores then the average LIOPS per session decrease.

Direct Path Read and “enq: KO – fast object checkpoint”

Introduction

When using direct path read, oracle reads the blocks from the disk. The buffer cache is not being used.

Then, when oracle begins to read a segment using direct path read, it flush its dirty blocks to disk (A dirty block is a block that does not have the same image in memory and on disk). During that time, the session that triggered the direct path read is waiting for the “enq: KO – fast object checkpoint” wait event.

In this post, I just want to see with my own eyes that the dirty blocks being flushed to disk belong only to the segment being read.

Tests

For the test, I used two tables on a 11.2.0.4 database. One table is made of 2 partitions and one is not partitioned (then 3 segments):

BDT:BDT1> select SEGMENT_NAME,PARTITION_NAME from dba_segments where segment_name in ('BDT1','BDT2');

SEGMENT_NAME PARTITION_NAME
------------ --------------
BDT2
BDT1         P2
BDT1         P1

The rows distribution across segments, blocks and file relative numbers is the following:

BDT:BDT1> l
  1  select u.object_name,u.subobject_name,dbms_rowid.rowid_block_number(BDT1.rowid) BLOCK_NUMBER, dbms_rowid.rowid_relative_fno(BDT1.rowid) FNO
  2	 from BDT1, user_objects u
  3	 where dbms_rowid.rowid_object(BDT1.rowid) = u.object_id
  4	 union all
  5	 select u.object_name,u.subobject_name,dbms_rowid.rowid_block_number(BDT2.rowid) BLOCK_NUMBER, dbms_rowid.rowid_relative_fno(BDT2.rowid) FNO
  6	 from BDT2, user_objects u
  7	 where dbms_rowid.rowid_object(BDT2.rowid) = u.object_id
  8	 order by 1,2
  9*
BDT:BDT1> /

OBJECT_NAME  SUBOBJECT_NAME  BLOCK_NUMBER FNO
------------ --------------- ------------ -----------
BDT1	     P1              2509842	  5
BDT1	     P1              2509842      5
BDT1	     P1              2509842      5
BDT1	     P1              2509842      5
BDT1	     P2              2510866      5
BDT1	     P2              2510866      5
BDT1	     P2              2510866      5
BDT1	     P2              2510866      5
BDT1	     P2              2510866      5
BDT2                         1359091      5
BDT2                         1359091      5
BDT2                         1359091      5
BDT2                         1359091      5
BDT2                         1359091      5
BDT2                         1359091      5
BDT2                         1359091      5
BDT2                         1359091      5
BDT2                         1359091      5

18 rows selected.

so that all the rows of the segment:

  • BDT1:P1 are in datafile 5, block 2509842
  • BDT1:P2 are in datafile 5, block 2510866
  • BDT2 are in datafile 5, block 1359091

Remark:

BDT1:P<N> stands for table BDT1 partition <N>

Now, let’s update both tables (without committing):

BDT:BDT1> update BDT1 set ID2=100;

9 rows updated.

BDT:BDT1> update BDT2 set ID2=100;

9 rows updated.

As I do not commit (nor rollback), then those 3 blocks in memory contain an active transaction.

In another session, let’s force serial direct path read and read partition P1 from BDT1 only:

BDT:BDT1> alter session set "_serial_direct_read"=always;

Session altered.

BDT:BDT1> select count(*) from BDT1 partition (P1);

COUNT(*)
-----------
4

Now, check its wait events and related statistics with snapper. A cool thing with snapper is that I can get wait events and statistics from one tool.

SYS:BDT1> @snapper all,end 5 1 20
Sampling SID 20 with interval 5 seconds, taking 1 snapshots...

-- Session Snapper v4.22 - by Tanel Poder ( http://blog.tanelpoder.com/snapper ) - Enjoy the Most Advanced Oracle Troubleshooting Script on the Planet! 🙂

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 SID @INST, USERNAME  , TYPE, STATISTIC                                                 ,         DELTA, HDELTA/SEC,    %TIME, GRAPH       , NUM_WAITS,  WAITS/SEC,   AVERAGES
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    20  @1, BDT       , STAT, physical reads direct                                     ,             1,        .02,         ,             ,          ,           ,         .5 per execution
    20  @1, BDT       , WAIT, enq: KO - fast object checkpoint                          ,          1117,    18.12us,      .0%, [          ],         1,        .02,     1.12ms average wait

--  End of Stats snap 1, end=2015-04-29 08:25:20, seconds=61.7

--  End of ASH snap 1, end=, seconds=, samples_taken=0, AAS=(No ASH sampling in begin/end snapshot mode)

PL/SQL procedure successfully completed.

For brevity and clarity, I keep only the wait events and statistics of interest.

As we can see, the direct path read has been used and the “fast object checkpoint” occurred (as the session waited for it).

Now let’s check which block(s) has/have been flushed to disk: to do so, I’ll dump those 3 blocks from disk and check if they contain an active transaction (as the ones in memory).

BDT:BDT1> !cat dump_blocks.sql
-- BDT2
alter system dump datafile 5 block 1359091;
-- BDT1:P2
alter system dump datafile 5 block 2510866;
-- BDT1:P1
alter system dump datafile 5 block 2509842;
select value from v$diag_info where name like 'Default%';

BDT:BDT1> @dump_blocks.sql

System altered.

System altered.

System altered.

VALUE
------------------------------
/u01/app/oracle/diag/rdbms/bdt/BDT1/trace/BDT1_ora_91815.trc

BDT:BDT1> !view /u01/app/oracle/diag/rdbms/bdt/BDT1/trace/BDT1_ora_91815.trc

The Interested Transaction List (ITL) for the first block that has been dumped from disk (BDT2 Table) looks like:

 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0xffff.000.00000000  0x00000000.0000.00  C---    0  scn 0x0000.0016f9c7
0x02   0x0000.000.00000000  0x00000000.0000.00  ----    0  fsc 0x0000.00000000
0x03   0x0000.000.00000000  0x00000000.0000.00  ----    0  fsc 0x0000.00000000

We can see that the block on disk does not contain an active transaction in it (There is no rows locked as Lck=0). We can conclude that the block that belongs to the BDT2 table has not been flushed to disk (The block is still dirty).

The Interested Transaction List (ITL) for the second block that has been dumped from disk (BDT1:P2) looks like:

 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0xffff.000.00000000  0x00000000.0000.00  C---    0  scn 0x0000.0016f99f
0x02   0x0000.000.00000000  0x00000000.0000.00  ----    0  fsc 0x0000.00000000
0x03   0x0000.000.00000000  0x00000000.0000.00  ----    0  fsc 0x0000.00000000

We can see that the block on disk does not contain an active transaction in it (There is no rows locked as Lck=0). We can conclude that the block that belongs to the partition P2 of the BDT1 table has not been flushed to disk (The block is still dirty).

The Interested Transaction List (ITL) for the third block that has been dumped from disk (BDT1:P1) looks like:

 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0xffff.000.00000000  0x00000000.0000.00  C---    0  scn 0x0000.0016f998
0x02   0x000a.005.00001025  0x00c00211.038e.23  ----    4  fsc 0x0000.00000000
0x03   0x0000.000.00000000  0x00000000.0000.00  ----    0  fsc 0x0000.00000000

We can see that the block on disk does contain an active transaction in it (4 rows are locked as Lck=4). This is the block that has been read in direct path read during the select on BDT1:P1.

We can conclude that the block that belongs to BDT1:P1 has been flushed to disk (The block is not dirty anymore).

Conclusion

When oracle begins to read a segment using direct path read:

  • It does not flush to disk the dirty blocks from other segments.
  • Only the dirty blocks of the segment being read are flushed to disk.

processor_group_name and SGA distribution across NUMA nodes

Introduction

On a NUMA enabled system, a 12c database Instance allocates the SGA evenly across all the NUMA nodes by default (unless you change the “_enable_NUMA_interleave” hidden parameter to FALSE). You can find more details about this behaviour in Yves’s post.

Fine, but what if I am setting the processor_group_name parameter (to instruct the database instance to run itself within a specified operating system processor group) to a cgroup that contains more than one NUMA node?

I may need to link the processor_group_name parameter to more than one NUMA node because:

  • the database needs more memory that one NUMA node could offer.
  • the database needs more cpus that one NUMA node could offer.

In that case, is the SGA still evenly allocated across the NUMA nodes defined in the cgroup?

Time to test

My NUMA configuration is the following:

> lscpu | grep NUMA
NUMA node(s):          8
NUMA node0 CPU(s):     1-10,41-50
NUMA node1 CPU(s):     11-20,51-60
NUMA node2 CPU(s):     21-30,61-70
NUMA node3 CPU(s):     31-40,71-80
NUMA node4 CPU(s):     0,81-89,120-129
NUMA node5 CPU(s):     90-99,130-139
NUMA node6 CPU(s):     100-109,140-149
NUMA node7 CPU(s):     110-119,150-159

with this memory allocation (no Instances up):

 > sh ./numa_memory.sh
                MemTotal         MemFree         MemUsed           Shmem      HugePages_Total       HugePages_Free       HugePages_Surp
Node_0      136052736_kB     49000636_kB     87052100_kB        77728_kB                39300                39044                    0
Node_1      136052736_kB     50503228_kB     85549508_kB        77764_kB                39300                39044                    0
Node_2      136052736_kB     18163112_kB    117889624_kB        78236_kB                39300                39045                    0
Node_3      136052736_kB     47859560_kB     88193176_kB        77744_kB                39300                39045                    0
Node_4      136024568_kB     43286800_kB     92737768_kB        77780_kB                39300                39045                    0
Node_5      136052736_kB     50348004_kB     85704732_kB        77792_kB                39300                39045                    0
Node_6      136052736_kB     31591648_kB    104461088_kB        77976_kB                39299                39044                    0
Node_7      136052736_kB     48524064_kB     87528672_kB        77780_kB                39299                39044                    0

A cgroup (named oracle) has been created with those properties:

   group oracle {
   perm {
     task {
       uid = oracle;
       gid = dc_dba;
     }
     admin {
       uid = oracle;
       gid = dc_dba;
     }
   }
   cpuset {
     cpuset.mems=2,3;
     cpuset.cpus="21-30,61-70,31-40,71-80";
   }

The 12.1.0.2 database:

  • has been started using this cgroup thanks to the processor_group_name parameter (so that the database is linked to 2 NUMA nodes:  Nodes 2 and 3).
  • uses use_large_pages set to ONLY.
  • uses “_enable_NUMA_interleave” set to its default value: TRUE.
  • uses “_enable_NUMA_support” set to its default value: FALSE.

First test: The Instance has been started with a SGA that could fully fit into one NUMA node.

Let’s check the memory distribution:

> sh ./numa_memory.sh
                MemTotal         MemFree         MemUsed           Shmem      HugePages_Total       HugePages_Free       HugePages_Surp
Node_0      136052736_kB     48999132_kB     87053604_kB        77720_kB                39300                39044                    0
Node_1      136052736_kB     50504752_kB     85547984_kB        77748_kB                39300                39044                    0
Node_2      136052736_kB     17903752_kB    118148984_kB        78224_kB                39300                23427                    0
Node_3      136052736_kB     47511596_kB     88541140_kB        77736_kB                39300                39045                    0
Node_4      136024568_kB     43282316_kB     92742252_kB        77796_kB                39300                39045                    0
Node_5      136052736_kB     50345492_kB     85707244_kB        77804_kB                39300                39045                    0
Node_6      136052736_kB     31581004_kB    104471732_kB        77988_kB                39299                39044                    0
Node_7      136052736_kB     48516964_kB     87535772_kB        77784_kB                39299                39044                    0

As you can see, 39300-23427 large pages have been allocated from node 2 only. This is the amount of large pages needed for the SGA. So we can conclude that the memory is not evenly distributed across NUMA nodes 2 et 3.

Second test: The Instance has been started with a SGA that can not fully fit into one NUMA node.

Let’s check the memory distribution:

> sh ./numa_memory.sh
                MemTotal         MemFree         MemUsed           Shmem      HugePages_Total       HugePages_Free       HugePages_Surp
Node_0      136052736_kB     51332440_kB     84720296_kB        77832_kB                39300                39044                    0
Node_1      136052736_kB     52532564_kB     83520172_kB        77788_kB                39300                39044                    0
Node_2      136052736_kB     51669192_kB     84383544_kB        77892_kB                39300                    0                    0
Node_3      136052736_kB     52089448_kB     83963288_kB        77860_kB                39300                25864                    0
Node_4      136024568_kB     51992248_kB     84032320_kB        77876_kB                39300                39045                    0
Node_5      136052736_kB     52571468_kB     83481268_kB        77856_kB                39300                39045                    0
Node_6      136052736_kB     52131912_kB     83920824_kB        77844_kB                39299                39044                    0
Node_7      136052736_kB     52268200_kB     83784536_kB        77832_kB                39299                39044                    0

As you can see, the large pages have been allocated from nodes 2 and 3 but not evenly (as there is no more free large pages on node 2 while there is still about 25000 free on node 3).

Remarks

  • I observed the same with or without ASMM.
  • With no large pages (use_large_pages=FALSE) and no ASMM, I can observe this memory distribution:
> sh ./numa_memory.sh
                MemTotal         MemFree         MemUsed           Shmem      HugePages_Total       HugePages_Free       HugePages_Surp
Node_0      136052736_kB     48994024_kB     87058712_kB        77712_kB                39300                39044                    0
Node_1      136052736_kB     50497216_kB     85555520_kB        77752_kB                39300                39044                    0
Node_2      136052736_kB      9225964_kB    126826772_kB      8727192_kB                39300                39045                    0
Node_3      136052736_kB     38986380_kB     97066356_kB      8710796_kB                39300                39045                    0
Node_4      136024568_kB     43279124_kB     92745444_kB        77792_kB                39300                39045                    0
Node_5      136052736_kB     50341284_kB     85711452_kB        77796_kB                39300                39045                    0
Node_6      136052736_kB     31570200_kB    104482536_kB        78000_kB                39299                39044                    0
Node_7      136052736_kB     48505716_kB     87547020_kB        77776_kB                39299                39044                    0

As you can see the SGA has been evenly distributed across NUMA nodes 2 et 3 (see the Shmem column). I did not do more tests (means with ASMM or with AMM or…) with non large pages as not using large pages is not an option for me.

  • With large page and no processor_group_name set, the memory allocation looks like:
> sh ./numa_memory.sh
                MemTotal         MemFree         MemUsed           Shmem      HugePages_Total       HugePages_Free       HugePages_Surp
Node_0      136052736_kB     51234984_kB     84817752_kB        77824_kB                39300                37094                    0
Node_1      136052736_kB     52417252_kB     83635484_kB        77772_kB                39300                37095                    0
Node_2      136052736_kB     51178872_kB     84873864_kB        77904_kB                39300                37096                    0
Node_3      136052736_kB     52263300_kB     83789436_kB        77872_kB                39300                37096                    0
Node_4      136024568_kB     51870848_kB     84153720_kB        77864_kB                39300                37097                    0
Node_5      136052736_kB     52304932_kB     83747804_kB        77852_kB                39300                37097                    0
Node_6      136052736_kB     51837040_kB     84215696_kB        77840_kB                39299                37096                    0
Node_7      136052736_kB     52213888_kB     83838848_kB        77860_kB                39299                37096                    0

As you can see the large pages have been evenly allocated from all the NUMA nodes (this is what has been told in the introduction).

  • If you like it, you can get the simple num_memory.sh script from here.

Summary

  1. With large pages in place and processor_group_name linked to more than one NUMA node, then the SGA is not evenly distributed across the NUMA nodes defined in the cgroup.
  2. With large pages in place and processor_group_name not set,  then the SGA is evenly distributed across the NUMA nodes.

Modify on the fly the cgroup properties linked to the processor_group_name parameter

Introduction

I am preparing a blog post in which I’ll try to describe pros and cons of cpu binding (using the processor_group_name parameter) versus Instance caging.

Today’s comparison:

As you know you can change the cpu_count and resource_manager_plan parameters while the database Instance is up and running. Then, while the database Instance is up and running you can:

  • enable / disable Instance caging.
  • change the cpu_count value and as a consequence the cpu limitation (with the Instance caging already in place).

Fine, can we do the same with cpu binding? What is the effect of changing the cgroup attributes that are linked to the processor_group_name parameter while the database Instance is up and running?

Let’s test it.

Initial setup

The database version is 12.1.0.2 and the  _enable_NUMA_support parameter has been kept to its default value: FALSE.

With this NUMA setup on a single machine:

NUMA node0 CPU(s):     1-10,41-50
NUMA node1 CPU(s):     11-20,51-60
NUMA node2 CPU(s):     21-30,61-70
NUMA node3 CPU(s):     31-40,71-80
NUMA node4 CPU(s):     0,81-89,120-129
NUMA node5 CPU(s):     90-99,130-139
NUMA node6 CPU(s):     100-109,140-149
NUMA node7 CPU(s):     110-119,150-159

A cgroup (named oracle) has been created with those properties:

group oracle {
   perm {
     task {
       uid = oracle;
       gid = dc_dba;
     }
     admin {
       uid = oracle;
       gid = dc_dba;
     }
   }
   cpuset {
     cpuset.mems="0";
     cpuset.cpus="1-6";
   }
}

Means that I want the SGA to be allocated on NUMA node 0 and to use the cpus 1 to 6.

The database has been started using this cgroup thanks to the processor_group_name parameter (The database has to be restarted):

SQL> alter system set processor_group_name='oracle' scope=spfile sid='*';

System altered.

During the database startup you can see the following into the alert.log file:

Instance started in processor group oracle (NUMA Nodes: 0 CPUs: 1-6)

You can also check that the SGA has been allocated from NUMA node 0:

> grep -i HugePages /sys/devices/system/node/node*/meminfo
/sys/devices/system/node/node0/meminfo:Node 0 HugePages_Total: 39303
/sys/devices/system/node/node0/meminfo:Node 0 HugePages_Free:  23589
/sys/devices/system/node/node0/meminfo:Node 0 HugePages_Surp:      0
/sys/devices/system/node/node1/meminfo:Node 1 HugePages_Total: 39302
/sys/devices/system/node/node1/meminfo:Node 1 HugePages_Free:  39046
/sys/devices/system/node/node1/meminfo:Node 1 HugePages_Surp:      0
/sys/devices/system/node/node2/meminfo:Node 2 HugePages_Total: 39302
/sys/devices/system/node/node2/meminfo:Node 2 HugePages_Free:  39047
/sys/devices/system/node/node2/meminfo:Node 2 HugePages_Surp:      0
/sys/devices/system/node/node3/meminfo:Node 3 HugePages_Total: 39302
/sys/devices/system/node/node3/meminfo:Node 3 HugePages_Free:  39047
/sys/devices/system/node/node3/meminfo:Node 3 HugePages_Surp:      0
/sys/devices/system/node/node4/meminfo:Node 4 HugePages_Total: 39302
/sys/devices/system/node/node4/meminfo:Node 4 HugePages_Free:  39047
/sys/devices/system/node/node4/meminfo:Node 4 HugePages_Surp:      0
/sys/devices/system/node/node5/meminfo:Node 5 HugePages_Total: 39302
/sys/devices/system/node/node5/meminfo:Node 5 HugePages_Free:  39047
/sys/devices/system/node/node5/meminfo:Node 5 HugePages_Surp:      0
/sys/devices/system/node/node6/meminfo:Node 6 HugePages_Total: 39302
/sys/devices/system/node/node6/meminfo:Node 6 HugePages_Free:  39047
/sys/devices/system/node/node6/meminfo:Node 6 HugePages_Surp:      0
/sys/devices/system/node/node7/meminfo:Node 7 HugePages_Total: 39302
/sys/devices/system/node/node7/meminfo:Node 7 HugePages_Free:  39047
/sys/devices/system/node/node7/meminfo:Node 7 HugePages_Surp:      0

As you can see, 39303-23589 large pages have been allocated from node 0. This is the amount of large pages needed for the SGA.

We can check that the cpu_count parameter has been automatically set to 6:

SQL> select value from v$parameter where name='cpu_count';

VALUE
--------------------------------------------------------------------------------
6

We can also check the cpu affinity of a background process (smon for example):

> taskset -c -p `ps -ef | grep -i smon | grep BDT12CG | awk '{print $2}'`
pid 1019968's current affinity list: 1-6

Note that the same affinity would have been observed with a foreground process.

Now let’s change the cpuset.cpus attribute while the database instance is up and running:

I want the cpus 11 to 20 to be used (instead of 1 to 6):

> /bin/echo 11-20 > /cgroup/cpuset/oracle/cpuset.cpus

After a few seconds, those lines appear into the alert.log file:

Detected change in CPU count to 10

Great, it looks like the database is aware of the new cpuset.cpus value. Let’s check the smon process cpu affinity :

> taskset -c -p `ps -ef | grep -i smon | grep BDT12CG | awk '{print $2}'`
pid 1019968's current affinity list: 11-20

BINGO: The database is now using the new value (means the new cpus) and we don’t need to bounce it!

Remarks:

  • Depending of the new cpuset.cpus value the SGA allocation may not be “optimal” (If the cpus are not on the same NUMA node where the SGA is currently allocated).
  • Don’t forget to update the /etc/cgconfig.conf file accordingly (If not, the restart of the cgconfig service will overwrite your changes).
  • I did the same change during a full cached SLOB run of 9 readers, and the mpstat -P ALL 3 output moved from:
02:21:54 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
02:21:57 PM  all    3.79    0.01    0.21    0.01    0.00    0.00    0.00    0.00   95.97
02:21:57 PM    0    0.34    0.00    0.00    0.00    0.00    0.00    0.00    0.00   99.66
02:21:57 PM    1   98.66    0.00    1.34    0.00    0.00    0.00    0.00    0.00    0.00
02:21:57 PM    2   98.67    0.00    1.33    0.00    0.00    0.00    0.00    0.00    0.00
02:21:57 PM    3   92.31    0.00    7.69    0.00    0.00    0.00    0.00    0.00    0.00
02:21:57 PM    4   97.00    0.00    3.00    0.00    0.00    0.00    0.00    0.00    0.00
02:21:57 PM    5   98.34    0.00    1.66    0.00    0.00    0.00    0.00    0.00    0.00
02:21:57 PM    6   99.33    0.00    0.67    0.00    0.00    0.00    0.00    0.00    0.00
.
.
.
02:21:57 PM   11    1.34    1.68    0.34    0.00    0.00    0.00    0.00    0.00   96.64
02:21:57 PM   12    0.67    0.00    0.00    0.34    0.00    0.00    0.00    0.00   98.99
02:21:57 PM   13    0.33    0.00    0.00    0.00    0.00    0.00    0.00    0.00   99.67
02:21:57 PM   14    0.33    0.00    0.33    0.00    0.00    0.00    0.00    0.00   99.33
02:21:57 PM   15    0.34    0.00    0.34    0.00    0.00    0.00    0.00    0.00   99.33
02:21:57 PM   16    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
02:21:57 PM   17    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
02:21:57 PM   18    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
02:21:57 PM   19    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
02:21:57 PM   20    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

to:

02:22:00 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
02:22:03 PM  all    5.71    0.00    0.18    0.02    0.00    0.00    0.00    0.00   94.09
02:22:03 PM    0    0.33    0.00    0.00    0.00    0.00    0.33    0.00    0.00   99.33
02:22:03 PM    1    0.34    0.00    0.00    0.00    0.00    0.00    0.00    0.00   99.66
02:22:03 PM    2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
02:22:03 PM    3    2.70    0.00    6.08    0.00    0.00    0.00    0.00    0.00   91.22
02:22:03 PM    4    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
02:22:03 PM    5    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
02:22:03 PM    6    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
.
.
.
02:22:03 PM   11    2.05    0.00    1.71    0.34    0.00    0.00    0.00    0.00   95.89
02:22:03 PM   12   99.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00
02:22:03 PM   13   99.33    0.00    0.67    0.00    0.00    0.00    0.00    0.00    0.00
02:22:03 PM   14   99.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00
02:22:03 PM   15   97.32    0.00    2.68    0.00    0.00    0.00    0.00    0.00    0.00
02:22:03 PM   16   98.67    0.00    1.33    0.00    0.00    0.00    0.00    0.00    0.00
02:22:03 PM   17   99.33    0.00    0.67    0.00    0.00    0.00    0.00    0.00    0.00
02:22:03 PM   18   99.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00
02:22:03 PM   19   99.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00
02:22:03 PM   20   99.67    0.00    0.33    0.00    0.00    0.00    0.00    0.00    0.00

As you can see the process activity moved from cpus 1-6 to cpus 11-20: This is exactly what we want.

Now, out of curiosity: what about the memory? let’s change the cpuset.mems property while the database is up and running:

Let’s say that we want the SGA to migrate from NUMA node 0 to node 1.

To achieve this we also need to set the cpuset memory_migrate attribute to 1 (default is 0: means memory allocation is not following cpuset.mems update).

So, let’s change the memory_migrate attribute to 1 and cpuset.mems from node 0 to node 1:

/bin/echo 1 > /cgroup/cpuset/oracle/cpuset.memory_migrate
/bin/echo 1 > /cgroup/cpuset/oracle/cpuset.mems

What is the effect on the SGA allocation?

grep -i HugePages /sys/devices/system/node/node*/meminfo               
/sys/devices/system/node/node0/meminfo:Node 0 HugePages_Total: 39303
/sys/devices/system/node/node0/meminfo:Node 0 HugePages_Free:  23589
/sys/devices/system/node/node0/meminfo:Node 0 HugePages_Surp:      0
/sys/devices/system/node/node1/meminfo:Node 1 HugePages_Total: 39302
/sys/devices/system/node/node1/meminfo:Node 1 HugePages_Free:  39046
/sys/devices/system/node/node1/meminfo:Node 1 HugePages_Surp:      0
/sys/devices/system/node/node2/meminfo:Node 2 HugePages_Total: 39302
/sys/devices/system/node/node2/meminfo:Node 2 HugePages_Free:  39047
/sys/devices/system/node/node2/meminfo:Node 2 HugePages_Surp:      0
/sys/devices/system/node/node3/meminfo:Node 3 HugePages_Total: 39302
/sys/devices/system/node/node3/meminfo:Node 3 HugePages_Free:  39047
/sys/devices/system/node/node3/meminfo:Node 3 HugePages_Surp:      0
/sys/devices/system/node/node4/meminfo:Node 4 HugePages_Total: 39302
/sys/devices/system/node/node4/meminfo:Node 4 HugePages_Free:  39047
/sys/devices/system/node/node4/meminfo:Node 4 HugePages_Surp:      0
/sys/devices/system/node/node5/meminfo:Node 5 HugePages_Total: 39302
/sys/devices/system/node/node5/meminfo:Node 5 HugePages_Free:  39047
/sys/devices/system/node/node5/meminfo:Node 5 HugePages_Surp:      0
/sys/devices/system/node/node6/meminfo:Node 6 HugePages_Total: 39302
/sys/devices/system/node/node6/meminfo:Node 6 HugePages_Free:  39047
/sys/devices/system/node/node6/meminfo:Node 6 HugePages_Surp:      0
/sys/devices/system/node/node7/meminfo:Node 7 HugePages_Total: 39302
/sys/devices/system/node/node7/meminfo:Node 7 HugePages_Free:  39047
/sys/devices/system/node/node7/meminfo:Node 7 HugePages_Surp:      0

None: as you can see the SGA is still allocated on node 0 (I observed the same during the full cached SLOB run).

Then, let’s try to stop the database, add memory_migrate attribute into the /etc/cgconfig.conf file that way:

   cpuset {
     cpuset.mems="0";
     cpuset.cpus="1-6";
     cpuset.memory_migrate=1;
   }

restart the cgconfig service:

> service cgconfig restart

and check the value:

> cat /cgroup/cpuset/oracle/cpuset.memory_migrate
1

Well it is set to 1, let’s start the database and check again:

> cat /cgroup/cpuset/oracle/cpuset.memory_migrate
0

Hey! It has been set back to 0 after the database startup. So, let’s conclude that, by default, the memory can not be dynamically migrated from one NUMA node to another by changing on the fly the cpuset.mems value.

Remark:

Obviously, changing the cpuset.mems attribute from node 0 to node 1:

> /bin/echo 1 > /cgroup/cpuset/oracle/cpuset.mems

and bounce the Instance will allocate the memory on the new node. Again, don’t forget to update the /etc/cgconfig.conf file accordingly (If not, the restart of the cgconfig service will overwrite your changes).

Conclusion:

By changing the cpuset.cpus value:

  • We have been able to change the cpus affinity dynamically (without database Instance restart).

By changing the cpuset.mems value:

  • We have not been able to change the memory node dynamically (an Instance restart is needed).

The main goal is reached: With cpu binding already in place (using processor_group_name) we are able to change on the fly the number of cpus a database is limited to use.

cpu binding (processor_group_name) vs Instance caging comparison during LIO pressure

Introduction

Suppose that I am trying to consolidate multiple databases on to a single server and that I would like to limit the cpu usage of one of the database. Then I decide to test the cpu binding approach and the Instance caging approach.

You can find a great description and how to implement:

During those tests I would like to compare the cpu binding and the Instance caging performance during Logical IO pressure (By pressure I mean that the database would need more cpu resources than the ones it is limited to use).

Tests setup

The tests will be performed on a single machine and this NUMA configuration:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                160
On-line CPU(s) list:   0-159
Thread(s) per core:    2
Core(s) per socket:    10
Socket(s):             8
NUMA node(s):          8
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 47
Stepping:              2
CPU MHz:               1995.926
BogoMIPS:              3991.60
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              24576K
NUMA node0 CPU(s):     1-10,41-50
NUMA node1 CPU(s):     11-20,51-60
NUMA node2 CPU(s):     21-30,61-70
NUMA node3 CPU(s):     31-40,71-80
NUMA node4 CPU(s):     0,81-89,120-129
NUMA node5 CPU(s):     90-99,130-139
NUMA node6 CPU(s):     100-109,140-149
NUMA node7 CPU(s):     110-119,150-159

and this cpu topology (the script comes from Karl Arao’s cputoolkit here):

> sh ./cpu_topology
model name      : Intel(R) Xeon(R) CPU E7- 4850  @ 2.00GHz
processors  (OS CPU count)          0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159
physical id (processor socket)      0 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 7 7 7 7 7 7 7 7 7 7 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 7 7 7 7 7 7 7 7 7 7 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
siblings    (logical CPUs/socket)   20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
core id     (# assigned to a core)  0 0 1 2 8 9 16 17 18 24 25 0 1 2 8 9 16 17 18 24 25 0 1 2 8 9 16 17 18 24 25 0 1 2 8 9 16 17 18 24 25 0 1 2 8 9 16 17 18 24 25 0 1 2 8 9 16 17 18 24 25 0 1 2 8 9 16 17 18 24 25 0 1 2 8 9 16 17 18 24 25 1 2 8 9 16 17 18 24 25 0 1 2 8 9 16 17 18 24 25 0 1 2 8 9 16 17 18 24 25 0 1 2 8 9 16 17 18 24 25 0 1 2 8 9 16 17 18 24 25 0 1 2 8 9 16 17 18 24 25 0 1 2 8 9 16 17 18 24 25 0 1 2 8 9 16 17 18 24 25
cpu cores   (physical cores/socket) 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10

For the tests I’ll use SLOB and apply some parameters at the instance level for Logical IO testing:

> cat lio_init.sql
alter system reset "_db_block_prefetch_limit" scope=spfile sid='*';
alter system reset "_db_block_prefetch_quota"  scope=spfile sid='*';
alter system reset "_db_file_noncontig_mblock_read_count" scope=spfile sid='*';
alter system reset "cpu_count" scope=spfile sid='*';
alter system set "db_cache_size"=20g scope=spfile sid='*';
alter system set "shared_pool_size"=10g scope=spfile sid='*';
alter system set "sga_target"=0 scope=spfile sid='*';

I also have to say that the _enable_NUMA_support parameter has been kept to its default value: FALSE.

Then, for accurate comparison, each test will be based on the same SLOB configuration and workload: 9 readers and fix run time of 5 minutes. I’ll run SLOB 3 times per test and then I’ll compare how many logical reads have been done on average per test for all the 3 runs.

TEST 1: Just out of curiosity, launch the SLOB runs without cpu binding and without Instance caging (without LIO pressure)

With 9 readers, the average result is the following:

Load Profile                    Per Second   Per Transaction  Per Exec  Per Call
~~~~~~~~~~~~~~~            ---------------   --------------- --------- ---------
             DB Time(s):               8.9              58.8      0.00      3.66
              DB CPU(s):               8.9              58.7      0.00      3.65
  Logical read (blocks):       6,323,758.9      41,668,897.2



Top 10 Foreground Events by Total Wait Time
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                           Total Wait       Wait   % DB Wait
Event                                Waits Time (sec)    Avg(ms)   time Class
------------------------------ ----------- ---------- ---------- ------ --------
DB CPU                                         2700.6              99.8

As you can see during those 5 minutes run time, SLOB generated about 6 300 000 logical reads per second. Those are also Full cached SLOB runs as the top event is “DB CPU” for about 100% of the DB time.

TEST 2: Instance caging in place and LIO pressure

As I am using 9 SLOB readers, I’ll set the cpu_count parameter to 6 (to produce LIO pressure).

SQL> alter system set cpu_count=6;

System altered.

SQL> alter system set resource_manager_plan='default_plan';

System altered.

The average result is the following:

Load Profile                    Per Second   Per Transaction  Per Exec  Per Call
~~~~~~~~~~~~~~~            ---------------   --------------- --------- ---------
             DB Time(s):               8.9              64.4      0.00      3.86
              DB CPU(s):               5.4              39.0      0.00      2.34
  Logical read (blocks):       3,495,618.2      25,222,299.9



Top 10 Foreground Events by Total Wait Time
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                           Total Wait       Wait   % DB Wait
Event                                Waits Time (sec)    Avg(ms)   time Class
------------------------------ ----------- ---------- ---------- ------ --------
DB CPU                                         1638.6              60.6
resmgr:cpu quantum                  16,005     1065.4      66.57   39.4 Schedule

As you can see during those 5 minutes run time, SLOB generated about 3 500 000 logical reads per second. Those are also Full cached SLOB runs as the top events are “DB CPU” and “resmgr:cpu quantum” (that is linked to the Instance caging) for about 100% of the DB time.

TEST 3: cpu binding in place and LIO pressure

As I am using 9 SLOB readers, I’ll set 6 cpus in the “oracle” cgroup (to compare with the previous test and to produce LIO pressure) that way:

   cpuset {
     cpuset.mems="0";
     cpuset.cpus="1-6";
   }

remove caging and set processor_group_name to “oracle”:

SQL> alter system reset cpu_count scope=spfile sid='*';

System altered.

SQL> alter system reset resource_manager_plan scope=spfile sid='*';

System altered.

SQL> alter system set processor_group_name='oracle' scope=spfile sid='*';

System altered.

The average result is the following:

Load Profile                    Per Second   Per Transaction  Per Exec  Per Call
~~~~~~~~~~~~~~~            ---------------   --------------- --------- ---------
             DB Time(s):               8.9              64.4      0.00      3.70
              DB CPU(s):               5.9              42.3      0.00      2.43
  Logical read (blocks):       5,786,023.5      41,714,611.9



Top 10 Foreground Events by Total Wait Time
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                           Total Wait       Wait   % DB Wait
Event                                Waits Time (sec)    Avg(ms)   time Class
------------------------------ ----------- ---------- ---------- ------ --------
DB CPU                                         1775.1              65.6
db file sequential read                578        1.5       2.64     .1 User I/O
latch free                             213          1       4.80     .0 Other
.
.
.

As you can see during those 5 minutes run time:

  • SLOB generated about 5 800 000 logical reads per second.
  • The sum of the “% DB WAIT time” are less than 100% (“DB CPU” represents 65.6% of it).

This last observation is due to the fact that “CPU WAIT” is not included into “DB CPU“. OEM gives a clear picture of it:

new_cg6_for_blog

Remarks:

  1. You can find more details about “DB CPU” and “CPU WAIT” into this Doug Burns’s post.
  2. Kyle Hailey provided a sql to measure the “DB CPU” and “CPU WAIT” into this blog post.

That said, I can conclude that about 100% of the DB Time has been spent into the CPU or the run queue: Then those are also Full cached SLOB runs.

BUT WAIT: The previous test (cpu binding) has an advantage compare to the Instance caging one: The SGA and the cpus have been allocated from the same NUMA node (See the beginning of this post), while the SGA is spread across NUMA nodes in the Instance caging test. And it matters: See this post for more details.

Then I need to setup a new test: Create a cgroup with 9 cpus on the same NUMA node as the SGA, set the caging to 6 and start the instance on this cgroup.

That way, I’ll be able to compare Apple and Apple, means compare:

  • TEST 3: cpu binding in place and LIO pressure
  • TEST 4: cpu binding and Instance caging in place with LIO pressure due to the caging

Let’s do it:

TEST 4: cpu binding and Instance caging in place with LIO pressure due to the caging

As I am using 9 SLOB readers, I’ll set 9 cpus in the “oracle” cgroup, put the caging to 6 (to compare with the previous test and to produce LIO pressure) that way:

SQL> alter system set cpu_count=6 ;

System altered.

SQL> alter system set resource_manager_plan='default_plan';

System altered.

SQL> alter system set processor_group_name='oracle' scope=spfile sid='*';

System altered.

with the “oracle” cgroup setup that way (cpus and SGA allocated on the same NUMA node):

   cpuset {
     cpuset.mems="0";
     cpuset.cpus="1-9";
   }

The average result is the following:

Load Profile                    Per Second   Per Transaction  Per Exec  Per Call
~~~~~~~~~~~~~~~            ---------------   --------------- --------- ---------
             DB Time(s):               8.9              58.8      0.00      3.81
              DB CPU(s):               5.4              35.4      0.00      2.29
  Logical read (blocks):       5,182,622.4      34,126,892.6



Top 10 Foreground Events by Total Wait Time
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                           Total Wait       Wait   % DB Wait
Event                                Waits Time (sec)    Avg(ms)   time Class
------------------------------ ----------- ---------- ---------- ------ --------
DB CPU                                         1627.8              60.2
resmgr:cpu quantum                  16,190     1064.9      65.78   39.4 Schedule

As you can see during those 5 minutes run time, SLOB generated about 5 100 000 logical reads per second. Those are also Full cached SLOB runs as the top events are “DB CPU” and “resmgr:cpu quantum” (that is linked to the Instance caging) for about 100% of the DB time.

Last test, just out of curiosity to compare with TEST 1:

TEST 5: Just out of curiosity, launch the SLOB runs with cpu binding and without Instance caging (without LIO pressure)

As I am using 9 SLOB readers, I’ll set 9 cpus in the “oracle” cgroup without caging  (to compare with the test 1 and not to produce LIO pressure).

The average result is the following:

Load Profile                    Per Second   Per Transaction  Per Exec  Per Call
~~~~~~~~~~~~~~~            ---------------   --------------- --------- ---------
             DB Time(s):               9.0              64.5      0.00      2.79
              DB CPU(s):               8.8              63.8      0.00      2.75
  Logical read (blocks):       8,623,444.6      62,178,526.0



Top 10 Foreground Events by Total Wait Time
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                           Total Wait       Wait   % DB Wait
Event                                Waits Time (sec)    Avg(ms)   time Class
------------------------------ ----------- ---------- ---------- ------ --------
DB CPU                                         2677.3              98.8

As you can see during those 5 minutes run time, SLOB generated about 8 600 000 logical reads per second. Those are also Full cached SLOB runs as the top event is “DB CPU” with about 100% of the DB time.

Conclusion:

The ranking is the following (Logical IO per seconds in descending order):

  1. cpu binding (without LIO pressure):  8 600 000 logical reads per second.
  2. without anything (without LIO pressure): 6 300 000 logical reads per second.
  3. cpu binding only and LIO pressure: 5 800 000 logical reads per second.
  4. cpu binding and Instance caging (LIO pressure due to the caging): 5 100 000 logical reads per second.
  5. Instance caging only and LIO pressure: 3 500 000 logical reads per second.

Only ranks 3, 4 and 5 are of interest regarding the initial goal of this post. And only 3 and 4 are valuable comparison.

So, comparing 3 and 4 ranks, I can conclude that the Instance caging has been less performant (compare to the cpu binding) by about 10% during LIO pressure (by pressure I mean that the database needs more cpu resources than the ones it is limited to use). It makes sense to me as the caging is managed by the oracle database software and then needs some overhead.

Now that I have compared cpu binding and Instance caging from a performance point of view, I’ll discuss pros and cons of both approach into another blog post.

Useful link related to the oracle database and CPU interaction:

Measure the impact of remote versus local NUMA node access thanks to processor_group_name

Introduction

Starting with oracle database 12c you can use the processor_group_name parameter to instruct the database instance to run itself within the specified operating system processor group. You can find more detail about this feature into Nikolay Manchev’s blog post.

So basically, once the libcgroup package has been installed:

> yum install libcgroup

you can create an “oracle” (for example) group by editing the /etc/cgconfig.conf file so that it looks like:

> cat /etc/cgconfig.conf 
mount {
        cpuset  = /cgroup/cpuset;
        cpu     = /cgroup/cpu;
        cpuacct = /cgroup/cpuacct;
        memory  = /cgroup/memory;
        devices = /cgroup/devices;
        freezer = /cgroup/freezer;
        net_cls = /cgroup/net_cls;
        blkio   = /cgroup/blkio;
}

group oracle {
   perm {
     task {
       uid = oracle;
       gid = dba;
     }
     admin {
       uid = oracle;
       gid = dba;
     }
   }
   cpuset {
     cpuset.mems="0";
     cpuset.cpus="1-9";
   }
}

start the cgconfig service:

> service cgconfig start

And then set the processor_group_name instance parameter to the group (“oracle” in our example).

The interesting part is that to setup the group we have to use 2 mandatory parameters into the cpuset:

  • cpuset.mems: specifies the memory nodes that tasks in this cgroup are permitted to access.
  • cpuset.cpus: specifies the CPUs that tasks in this cgroup are permitted to access.

So, it means that with processor_group_name set:

  • The SGA will be allocated according to the cpuset.mems parameter.
  • The processes will be bind to the CPUs according to the cpuset.cpus parameter.

Then, we can measure the impact of remote versus local NUMA node access thanks to the processor_group_name parameter: Let’s do it.

Tests setup

My NUMA configuration is the following:

> numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 1 2 3 4 5 6 7 8 9 10 41 42 43 44 45 46 47 48 49 50
node 0 size: 132864 MB
node 0 free: 45326 MB
node 1 cpus: 11 12 13 14 15 16 17 18 19 20 51 52 53 54 55 56 57 58 59 60
node 1 size: 132864 MB
node 1 free: 46136 MB
node 2 cpus: 21 22 23 24 25 26 27 28 29 30 61 62 63 64 65 66 67 68 69 70
node 2 size: 132864 MB
node 2 free: 40572 MB
node 3 cpus: 31 32 33 34 35 36 37 38 39 40 71 72 73 74 75 76 77 78 79 80
node 3 size: 132864 MB
node 3 free: 47145 MB
node 4 cpus: 0 81 82 83 84 85 86 87 88 89 120 121 122 123 124 125 126 127 128 129
node 4 size: 132836 MB
node 4 free: 40573 MB
node 5 cpus: 90 91 92 93 94 95 96 97 98 99 130 131 132 133 134 135 136 137 138 139
node 5 size: 132864 MB
node 5 free: 45564 MB
node 6 cpus: 100 101 102 103 104 105 106 107 108 109 140 141 142 143 144 145 146 147 148 149
node 6 size: 132864 MB
node 6 free: 20592 MB
node 7 cpus: 110 111 112 113 114 115 116 117 118 119 150 151 152 153 154 155 156 157 158 159
node 7 size: 132864 MB
node 7 free: 44170 MB
node distances:
node   0   1   2   3   4   5   6   7 
  0:  10  13  12  12  22  22  22  22 
  1:  13  10  12  12  22  22  22  22 
  2:  12  12  10  13  22  22  22  22 
  3:  12  12  13  10  22  22  22  22 
  4:  22  22  22  22  10  13  12  12 
  5:  22  22  22  22  13  10  12  12 
  6:  22  22  22  22  12  12  10  13 
  7:  22  22  22  22  12  12  13  10

As you can see:

  • CPUs 1 to 9 are located into node 0.
  • node 0 is far from node 7 (distance equals 22).

Then I am able to test local NUMA node access by creating the group with:

  • cpuset.cpus=”1-9″
  • cpuset.mems=”0″

As CPUs 1-9 are located into node 0 and the SGA will be created into node 0.

and I am able to test remote NUMA node access by creating the group with:

  • cpuset.mems=”7″
  • cpuset.cpus=”1-9″

As CPUs 1-9 are located into node 0 while the SGA will be created into node 7.

To test the effect of remote versus local NUMA node access, I’ll use SLOB. From my point of view this is the perfect/mandatory tool to test the effect of NUMA from the oracle side.

For the tests I created 8 SLOB schemas (with SCALE=10000, so 80 MB), applied some parameters at the instance level for Logical IO testing (as we need FULL cached SLOB to test NUMA efficiently):

> cat lio_init.sql
alter system reset "_db_block_prefetch_limit" scope=spfile sid='*';
alter system reset "_db_block_prefetch_quota"  scope=spfile sid='*';
alter system reset "_db_file_noncontig_mblock_read_count" scope=spfile sid='*';
alter system reset "cpu_count" scope=spfile sid='*';
alter system set "db_cache_size"=20g scope=spfile sid='*';
alter system set "shared_pool_size"=10g scope=spfile sid='*';
alter system set "sga_target"=0 scope=spfile sid='*';

and set processor_group_name to oracle:

SQL> alter system set processor_group_name='oracle' scope=spfile sid='*';

For accurate comparison, each test will be based on the same SLOB configuration and workload: 8 readers and fix run time of 3 minutes. Then I’ll compare how many logical reads have been done in both cases.

Test 1: Measure local NUMA node access:

As explained previously, I do it by updating /etc/cgconfig.conf file that way:

  • cpuset.cpus=”1-9″
  • cpuset.mems=”0″

re-start the service:

> service cgconfig restart

and the database instance.

Check that the SGA is allocated into node 0:

> grep -i HugePages /sys/devices/system/node/node*/meminfo

/sys/devices/system/node/node0/meminfo:Node 0 HugePages_Total: 39303
/sys/devices/system/node/node0/meminfo:Node 0 HugePages_Free:  23589
/sys/devices/system/node/node0/meminfo:Node 0 HugePages_Surp:      0
/sys/devices/system/node/node1/meminfo:Node 1 HugePages_Total: 39302
/sys/devices/system/node/node1/meminfo:Node 1 HugePages_Free:  39046
/sys/devices/system/node/node1/meminfo:Node 1 HugePages_Surp:      0
/sys/devices/system/node/node2/meminfo:Node 2 HugePages_Total: 39302
/sys/devices/system/node/node2/meminfo:Node 2 HugePages_Free:  39047
/sys/devices/system/node/node2/meminfo:Node 2 HugePages_Surp:      0
/sys/devices/system/node/node3/meminfo:Node 3 HugePages_Total: 39302
/sys/devices/system/node/node3/meminfo:Node 3 HugePages_Free:  39047
/sys/devices/system/node/node3/meminfo:Node 3 HugePages_Surp:      0
/sys/devices/system/node/node4/meminfo:Node 4 HugePages_Total: 39302
/sys/devices/system/node/node4/meminfo:Node 4 HugePages_Free:  39047
/sys/devices/system/node/node4/meminfo:Node 4 HugePages_Surp:      0
/sys/devices/system/node/node5/meminfo:Node 5 HugePages_Total: 39302
/sys/devices/system/node/node5/meminfo:Node 5 HugePages_Free:  39047
/sys/devices/system/node/node5/meminfo:Node 5 HugePages_Surp:      0
/sys/devices/system/node/node6/meminfo:Node 6 HugePages_Total: 39302
/sys/devices/system/node/node6/meminfo:Node 6 HugePages_Free:  39047
/sys/devices/system/node/node6/meminfo:Node 6 HugePages_Surp:      0
/sys/devices/system/node/node7/meminfo:Node 7 HugePages_Total: 39302
/sys/devices/system/node/node7/meminfo:Node 7 HugePages_Free:  39047
/sys/devices/system/node/node7/meminfo:Node 7 HugePages_Surp:      0

As you can see, 39303-23589 large pages have been allocated from node 0. This is the amount of large pages needed for my SGA.

Then I launched SLOB 2 times with the following results:

Load Profile                    	Per Second   Per Transaction  Per Exec  Per Call
~~~~~~~~~~~~~~~            		---------------   --------------- --------- ---------
Run 1:  Logical read (blocks):       	4,551,737.9      52,878,393.1
Run 2: 	Logical read (blocks):   	4,685,712.3      18,984,632.1


                                           	Total Wait       Wait   % DB Wait
Event                                	Waits 	Time (sec)    Avg(ms)   time Class
------------------------------    ----------- 	---------- ---------- ------ --------
Run 1: DB CPU                                       1423.2              97.8
Run 2: DB CPU                                       1431.4              99.2

As you can see, during the 3 minutes run time, SLOB generated about 4 500 000 logical reads per second with local NUMA node access in place. Those are also FULL cached SLOB runs as the top event is “DB CPU” with about 100% of the DB time.

Test 2: Measure remote NUMA node access:

As explained previously, I do it by updating /etc/cgconfig.conf file that way:

  • cpuset.cpus=”1-9″
  • cpuset.mems=”7″

re-start the service:

> service cgconfig restart

and the database instance.

Check that the SGA is allocated into node 7:

> grep -i HugePages /sys/devices/system/node/node*/meminfo
/sys/devices/system/node/node0/meminfo:Node 0 HugePages_Total: 39303
/sys/devices/system/node/node0/meminfo:Node 0 HugePages_Free:  39047
/sys/devices/system/node/node0/meminfo:Node 0 HugePages_Surp:      0
/sys/devices/system/node/node1/meminfo:Node 1 HugePages_Total: 39302
/sys/devices/system/node/node1/meminfo:Node 1 HugePages_Free:  39046
/sys/devices/system/node/node1/meminfo:Node 1 HugePages_Surp:      0
/sys/devices/system/node/node2/meminfo:Node 2 HugePages_Total: 39302
/sys/devices/system/node/node2/meminfo:Node 2 HugePages_Free:  39047
/sys/devices/system/node/node2/meminfo:Node 2 HugePages_Surp:      0
/sys/devices/system/node/node3/meminfo:Node 3 HugePages_Total: 39302
/sys/devices/system/node/node3/meminfo:Node 3 HugePages_Free:  39047
/sys/devices/system/node/node3/meminfo:Node 3 HugePages_Surp:      0
/sys/devices/system/node/node4/meminfo:Node 4 HugePages_Total: 39302
/sys/devices/system/node/node4/meminfo:Node 4 HugePages_Free:  39047
/sys/devices/system/node/node4/meminfo:Node 4 HugePages_Surp:      0
/sys/devices/system/node/node5/meminfo:Node 5 HugePages_Total: 39302
/sys/devices/system/node/node5/meminfo:Node 5 HugePages_Free:  39047
/sys/devices/system/node/node5/meminfo:Node 5 HugePages_Surp:      0
/sys/devices/system/node/node6/meminfo:Node 6 HugePages_Total: 39302
/sys/devices/system/node/node6/meminfo:Node 6 HugePages_Free:  39047
/sys/devices/system/node/node6/meminfo:Node 6 HugePages_Surp:      0
/sys/devices/system/node/node7/meminfo:Node 7 HugePages_Total: 39302
/sys/devices/system/node/node7/meminfo:Node 7 HugePages_Free:  23557
/sys/devices/system/node/node7/meminfo:Node 7 HugePages_Surp:      0

As you can see, 39302-23557 large pages have been allocated from node 7. This is the amount of large pages needed for my SGA.

Then I launched SLOB 2 times with the following results:

Load Profile                    	Per Second   Per Transaction  Per Exec  Per Call
~~~~~~~~~~~~~~~            		---------------   --------------- --------- ---------
Run 1:  Logical read (blocks):       	2,133,879.6      24,991,464.4
Run 2: 	Logical read (blocks):   	2,203,307.5      10,879,098.7


                                           	Total Wait       Wait   % DB Wait
Event                                	Waits 	Time (sec)    Avg(ms)   time Class
------------------------------    ----------- 	---------- ---------- ------ --------
Run 1: DB CPU                                       1420.7              97.7
Run 2: DB CPU                                       1432.6              99.0

As you can see, during the 3 minutes run time, SLOB generated about 2 100 000 logical reads per second with remote NUMA node access in place. Those are also FULL cached SLOB runs as the top event is “DB CPU” with about 100% of the DB time.

This is about 2.15 times less than with the local NUMA node access. This is about the ratio of the distance between node 0 to node 7 and node 0 to node 0 (22/10).

Remarks:

  • The SGA is fully made of large pages as I set the use_large_pages parameter to “only”.
  • To check that only the CPUs 1 to 9 worked during the SLOB run you can read the mpstat.out file provided by SLOB at the end of the test.
  • To check that the oracle session running the SLOB run are bind to CPUs 1 to 9 you can use taskset:
> ps -ef | grep 641387
oracle    641387  641353 87 10:15 ?        00:00:55 oracleBDT12CG_1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
> taskset -c -p 641387
pid 641387's current affinity list: 1-9
  • To check that your SLOB Logical I/O test delivers the maximum, you can read this post.
  • With SLOB schema of 8 MB, I was not able to see this huge difference between remote and local NUMA node access. I guess this is due to the L3 cache of 24576K on the cpu used during my tests.

Conclusion:

  • Thanks to the 12c database parameter processor_group_name, we have been able to measure the impact of remote versus local NUMA node access from the database point of view.
  • If you plan the use the processor_group_name database parameter, then pay attention to configure the group correctly with cpuset.cpus and cpuset.mems to ensure local NUMA node access (My tests show 2.15 times less logical reads per second with my farest remote access).

Useful links related to NUMA: