processor_group_name and SGA distribution across NUMA nodes

Introduction

On a NUMA-enabled system, a 12c database instance allocates the SGA evenly across all the NUMA nodes by default (unless you set the hidden parameter “_enable_NUMA_interleave” to FALSE). You can find more details about this behaviour in Yves’s post.

Fine, but what if I set the processor_group_name parameter (which instructs the database instance to run within a specified operating system processor group) to a cgroup that contains more than one NUMA node?

I may need to link the processor_group_name parameter to more than one NUMA node because:

  • the database needs more memory than one NUMA node can offer.
  • the database needs more CPUs than one NUMA node can offer.

In that case, is the SGA still evenly allocated across the NUMA nodes defined in the cgroup?

Time to test

My NUMA configuration is the following:

> lscpu | grep NUMA
NUMA node(s):          8
NUMA node0 CPU(s):     1-10,41-50
NUMA node1 CPU(s):     11-20,51-60
NUMA node2 CPU(s):     21-30,61-70
NUMA node3 CPU(s):     31-40,71-80
NUMA node4 CPU(s):     0,81-89,120-129
NUMA node5 CPU(s):     90-99,130-139
NUMA node6 CPU(s):     100-109,140-149
NUMA node7 CPU(s):     110-119,150-159

with this memory allocation (no Instances up):

 > sh ./numa_memory.sh
                MemTotal         MemFree         MemUsed           Shmem      HugePages_Total       HugePages_Free       HugePages_Surp
Node_0      136052736_kB     49000636_kB     87052100_kB        77728_kB                39300                39044                    0
Node_1      136052736_kB     50503228_kB     85549508_kB        77764_kB                39300                39044                    0
Node_2      136052736_kB     18163112_kB    117889624_kB        78236_kB                39300                39045                    0
Node_3      136052736_kB     47859560_kB     88193176_kB        77744_kB                39300                39045                    0
Node_4      136024568_kB     43286800_kB     92737768_kB        77780_kB                39300                39045                    0
Node_5      136052736_kB     50348004_kB     85704732_kB        77792_kB                39300                39045                    0
Node_6      136052736_kB     31591648_kB    104461088_kB        77976_kB                39299                39044                    0
Node_7      136052736_kB     48524064_kB     87528672_kB        77780_kB                39299                39044                    0

A cgroup (named oracle) has been created with those properties:

   group oracle {
     perm {
       task {
         uid = oracle;
         gid = dc_dba;
       }
       admin {
         uid = oracle;
         gid = dc_dba;
       }
     }
     cpuset {
       cpuset.mems=2,3;
       cpuset.cpus="21-30,61-70,31-40,71-80";
     }
   }
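
As a quick sanity check, the CPU list given to the cgroup should be exactly the CPUs that lscpu reports for NUMA nodes 2 and 3. The Python sketch below is not part of the original setup; the helper name expand_cpu_list is hypothetical, and the values are simply copied from the lscpu output and the cgroup definition above.

```python
def expand_cpu_list(spec):
    """Expand a Linux CPU list such as "21-30,61-70" into a set of CPU ids."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


# CPU lists per NUMA node, copied from the lscpu output above.
node_cpus = {
    2: "21-30,61-70",
    3: "31-40,71-80",
}

# Values copied from the cgroup definition above.
cpuset_mems = [2, 3]
cpuset_cpus = "21-30,61-70,31-40,71-80"

# Union of the CPUs that belong to the memory nodes of the cgroup.
expected = set()
for node in cpuset_mems:
    expected |= expand_cpu_list(node_cpus[node])

cgroup_cpus = expand_cpu_list(cpuset_cpus)
print(len(cgroup_cpus), cgroup_cpus == expected)
```

If the comparison were False, the cgroup would mix CPUs from one node with memory from another, which would blur the results of the tests that follow.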

The 12.1.0.2 database:

  • has been started using this cgroup thanks to the processor_group_name parameter (so that the database is linked to 2 NUMA nodes: nodes 2 and 3).
  • uses use_large_pages set to ONLY.
  • uses “_enable_NUMA_interleave” set to its default value: TRUE.
  • uses “_enable_NUMA_support” set to its default value: FALSE.

First test: The Instance has been started with an SGA that could fully fit into one NUMA node.

Let’s check the memory distribution:

> sh ./numa_memory.sh
                MemTotal         MemFree         MemUsed           Shmem      HugePages_Total       HugePages_Free       HugePages_Surp
Node_0      136052736_kB     48999132_kB     87053604_kB        77720_kB                39300                39044                    0
Node_1      136052736_kB     50504752_kB     85547984_kB        77748_kB                39300                39044                    0
Node_2      136052736_kB     17903752_kB    118148984_kB        78224_kB                39300                23427                    0
Node_3      136052736_kB     47511596_kB     88541140_kB        77736_kB                39300                39045                    0
Node_4      136024568_kB     43282316_kB     92742252_kB        77796_kB                39300                39045                    0
Node_5      136052736_kB     50345492_kB     85707244_kB        77804_kB                39300                39045                    0
Node_6      136052736_kB     31581004_kB    104471732_kB        77988_kB                39299                39044                    0
Node_7      136052736_kB     48516964_kB     87535772_kB        77784_kB                39299                39044                    0

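As you can see, 39300-23427 = 15873 large pages are now in use on node 2, while node 3 is unchanged. Node 2 already had 255 large pages in use before the Instance started (39045 free out of 39300), so the SGA itself took 39045-23427 = 15618 large pages, all of them from node 2. So we can conclude that the memory is not evenly distributed across NUMA nodes 2 and 3.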
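
The arithmetic can be checked against the "no Instances up" snapshot. The sketch below is only an illustration: the HugePages_Free values are copied from the two snapshots above, and the 2 MiB huge page size is an assumption (it is the usual x86-64 default, and 39300 pages of a larger size would not fit in a 136 GB node).

```python
HUGE_PAGE_KIB = 2048  # assumption: 2 MiB huge pages

# HugePages_Free for the two nodes of the cgroup, copied from the snapshots.
free_before = {2: 39045, 3: 39045}  # no instance up
free_after = {2: 23427, 3: 39045}   # first test, instance started


def pages_taken(before, after):
    """Return the number of huge pages consumed on each node."""
    return {node: before[node] - after[node] for node in before}


taken = pages_taken(free_before, free_after)
for node, pages in sorted(taken.items()):
    gib = pages * HUGE_PAGE_KIB / (1024 * 1024)
    print(f"node {node}: {pages} pages (~{gib:.1f} GiB)")
```

Under that page-size assumption, the whole SGA of the first test (about 30.5 GiB) sits on node 2 and nothing on node 3.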

Second test: The Instance has been started with an SGA that cannot fully fit into one NUMA node.

Let’s check the memory distribution:

> sh ./numa_memory.sh
                MemTotal         MemFree         MemUsed           Shmem      HugePages_Total       HugePages_Free       HugePages_Surp
Node_0      136052736_kB     51332440_kB     84720296_kB        77832_kB                39300                39044                    0
Node_1      136052736_kB     52532564_kB     83520172_kB        77788_kB                39300                39044                    0
Node_2      136052736_kB     51669192_kB     84383544_kB        77892_kB                39300                    0                    0
Node_3      136052736_kB     52089448_kB     83963288_kB        77860_kB                39300                25864                    0
Node_4      136024568_kB     51992248_kB     84032320_kB        77876_kB                39300                39045                    0
Node_5      136052736_kB     52571468_kB     83481268_kB        77856_kB                39300                39045                    0
Node_6      136052736_kB     52131912_kB     83920824_kB        77844_kB                39299                39044                    0
Node_7      136052736_kB     52268200_kB     83784536_kB        77832_kB                39299                39044                    0

As you can see, the large pages have been allocated from nodes 2 and 3 but not evenly: there are no free large pages left on node 2, while 25864 are still free on node 3.
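
To put a number on "not evenly", the sketch below compares what each node actually gave with a hypothetical even split. It assumes that the baseline of 39045 free large pages per node (taken from the "no Instances up" snapshot) also applies to this test, which seems reasonable since the untouched nodes still show 39044 or 39045 free pages here.

```python
# HugePages_Free for nodes 2 and 3.
free_before = {2: 39045, 3: 39045}  # assumed baseline, no instance up
free_after = {2: 0, 3: 25864}       # second test, instance started

# Huge pages consumed on each node by the instance.
taken = {node: free_before[node] - free_after[node] for node in free_before}
total = sum(taken.values())

# What each node would have given if the SGA were interleaved evenly.
even_share = total / len(taken)

for node in sorted(taken):
    share = 100 * taken[node] / total
    print(f"node {node}: {taken[node]} pages ({share:.1f}% of {total}), "
          f"even split would be {even_share:.0f}")
```

Node 2 is filled completely first and node 3 only receives what is left over, so node 2 holds roughly three quarters of the SGA.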

Remarks

  • I observed the same with or without ASMM.
  • With no large pages (use_large_pages=FALSE) and no ASMM, I can observe this memory distribution:
> sh ./numa_memory.sh
                MemTotal         MemFree         MemUsed           Shmem      HugePages_Total       HugePages_Free       HugePages_Surp
Node_0      136052736_kB     48994024_kB     87058712_kB        77712_kB                39300                39044                    0
Node_1      136052736_kB     50497216_kB     85555520_kB        77752_kB                39300                39044                    0
Node_2      136052736_kB      9225964_kB    126826772_kB      8727192_kB                39300                39045                    0
Node_3      136052736_kB     38986380_kB     97066356_kB      8710796_kB                39300                39045                    0
Node_4      136024568_kB     43279124_kB     92745444_kB        77792_kB                39300                39045                    0
Node_5      136052736_kB     50341284_kB     85711452_kB        77796_kB                39300                39045                    0
Node_6      136052736_kB     31570200_kB    104482536_kB        78000_kB                39299                39044                    0
Node_7      136052736_kB     48505716_kB     87547020_kB        77776_kB                39299                39044                    0

As you can see, the SGA has been evenly distributed across NUMA nodes 2 and 3 (see the Shmem column). I did not run more tests without large pages (that is, with ASMM, AMM and so on) because not using large pages is not an option for me.

  • With large pages and no processor_group_name set, the memory allocation looks like:
> sh ./numa_memory.sh
                MemTotal         MemFree         MemUsed           Shmem      HugePages_Total       HugePages_Free       HugePages_Surp
Node_0      136052736_kB     51234984_kB     84817752_kB        77824_kB                39300                37094                    0
Node_1      136052736_kB     52417252_kB     83635484_kB        77772_kB                39300                37095                    0
Node_2      136052736_kB     51178872_kB     84873864_kB        77904_kB                39300                37096                    0
Node_3      136052736_kB     52263300_kB     83789436_kB        77872_kB                39300                37096                    0
Node_4      136024568_kB     51870848_kB     84153720_kB        77864_kB                39300                37097                    0
Node_5      136052736_kB     52304932_kB     83747804_kB        77852_kB                39300                37097                    0
Node_6      136052736_kB     51837040_kB     84215696_kB        77840_kB                39299                37096                    0
Node_7      136052736_kB     52213888_kB     83838848_kB        77860_kB                39299                37096                    0

As you can see, the large pages have been evenly allocated from all the NUMA nodes (this is what was stated in the introduction).

  • If you like, you can get the simple numa_memory.sh script from here.
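
For readers who cannot fetch the script, a similar per-node table can be built from the files under /sys/devices/system/node. The Python sketch below is not the numa_memory.sh script used in this post; it is a rough equivalent written under the assumption that each nodeN/meminfo file contains lines of the form "Node N Field: value kB". The function and constant names are hypothetical.

```python
import glob
import re

# Columns shown in the tables of this post.
FIELDS = ["MemTotal", "MemFree", "MemUsed", "Shmem",
          "HugePages_Total", "HugePages_Free", "HugePages_Surp"]

# Matches e.g. "Node 2 MemFree:   17903752 kB" or "Node 2 HugePages_Free: 23427".
LINE_RE = re.compile(r"^Node\s+(\d+)\s+(\w+):\s+(\d+)(?:\s+kB)?\s*$")
PATH_RE = re.compile(r"node(\d+)/meminfo$")


def parse_node_meminfo(text):
    """Return {field: integer value} for one per-node meminfo file."""
    values = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            values[match.group(2)] = int(match.group(3))
    return values


def main():
    paths = glob.glob("/sys/devices/system/node/node*/meminfo")
    paths.sort(key=lambda p: int(PATH_RE.search(p).group(1)))
    print("Node".ljust(8) + "".join(f.rjust(18) for f in FIELDS))
    for path in paths:
        node = PATH_RE.search(path).group(1)
        with open(path) as handle:
            values = parse_node_meminfo(handle.read())
        row = "".join(str(values.get(f, "-")).rjust(18) for f in FIELDS)
        print(f"Node_{node}".ljust(8) + row)


if __name__ == "__main__":
    main()
```

Running it before and after starting the instance gives the same kind of before/after comparison as the snapshots above.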

Summary

  1. With large pages in place and processor_group_name linked to more than one NUMA node, the SGA is not evenly distributed across the NUMA nodes defined in the cgroup.
  2. With large pages in place and processor_group_name not set, the SGA is evenly distributed across the NUMA nodes.

5 thoughts on “processor_group_name and SGA distribution across NUMA nodes”

  1. It would be so interesting to see a test of cached SLOB comparing:

    1. Test 2 (the SGA mostly on node2 with some on node3) with all Oracle processes pinned to those CPUs.
    2. Boot the database with a bequeath connection and then run all SLOB processes under numactl --cpunodebind=2,3 --interleave=2,3

    In #2 leave all NUMA related params in Oracle to the default.

    What I specify here is a comparison between Oracle internal numa awareness via CGROUP and numactl imposed interleaved memory.

    1. Hey Kevin,

      It looks like “Oracle internal numa awareness via CGROUP” performs a little bit better than “numactl imposed interleaved memory”.

      See more details in this link.

      Thx
      Bertrand

    1. Kevin,

      The “under numactl --cpunodebind=2,3 --interleave=2,3” case was not good: The SGA was not interleaved due to “_enable_NUMA_interleave=TRUE (default)”, see:

      12.1.0.2 numactl interleave and _enable_NUMA_interleave

      As you can see on 12.1.0.2, “under numactl --cpunodebind=2,3 --interleave=2,3”:

      1) The SGA is not interleaved on nodes 2 and 3 with _enable_NUMA_interleave=TRUE (default)
      2) The SGA is interleaved on nodes 2 and 3 with _enable_NUMA_interleave=FALSE

      So, with _enable_NUMA_interleave=FALSE and under numactl --cpunodebind=2,3 --interleave=2,3, I got interleaving (on nodes 2 and 3), and the SLOB run performs better (10000000 LIOPS) than “the SGA mostly on node2 with some on node3 with all Oracle processes pinned to those CPUs” (9700000 LIOPS).

      Remark: On 11.2.0.4 and “under numactl --cpunodebind=2,3 --interleave=2,3” the SGA is interleaved on nodes 2 and 3 whatever the value of _enable_NUMA_interleave is.

      Bertrand
