In-Memory Instance distribution with RAC databases: Impact of the parallel_instance_group parameter

INTRODUCTION

As you may know, with a RAC database, by default all objects populated into
In-Memory are distributed across all of the IM column stores in the cluster. It is also possible to have all the data appear in the IM column store on every node (Engineered Systems only). See this white paper for more details.

In this blog post we are told that:

  • RAC services and the parallel_instance_group parameter can be used to control where data is populated, and the subsequent access of that data.
  • In a RAC environment services enable the use of a subset of nodes and still ensure that all In-Memory queries will be able to access all IM column stores.

My customer uses a RAC service defined as preferred/available on a two-node RAC. In this configuration, I would like to see how the data is populated into the In-Memory column store (IMCS) on both Instances when the parallel_instance_group parameter is used.

TESTS

I’ll do several tests and stop/start the 12.1.0.2 database between each test (to avoid confusion).

First let’s create a service BDT_PREF as preferred/available:

srvctl add service -db BDT12c1 -s BDT_PREF -preferred BDT12c1_1 -available BDT12c1_2

so that Instance 1 (BDT12c1_1) is the preferred one.
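If needed, the resulting definition (preferred and available instances) can be double-checked with srvctl, for example:

srvctl config service -db BDT12c1 -s BDT_PREF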

Test 1:

  • parallel_instance_group not set.

Connect to the database through the BDT_PREF service and check the data distribution.
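Before looking at the distribution, a quick sanity check can be used to confirm which instance and service the session actually connected through (nothing In-Memory specific here):

SQL> select sys_context('userenv','instance_name') as instance_name,
sys_context('userenv','service_name') as service_name
from dual;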

SQL> alter table bdt inmemory priority none;

Table altered.

SQL> select count(*) from bdt;

  COUNT(*)
----------
  38220000

SQL> SELECT inst_id, populate_status, inmemory_size, bytes, bytes_not_populated
FROM gv$im_user_segments
WHERE segment_name = 'BDT';

   INST_ID POPULATE_ INMEMORY_SIZE      BYTES BYTES_NOT_POPULATED
---------- --------- ------------- ---------- -------------------
         1 COMPLETED     308477952 1610612736           732684288
         2 COMPLETED     274530304 1610612736           826236928

As you can see, no instance contains all the data, as the BYTES_NOT_POPULATED column is greater than 0. The data is distributed across all of the IM column stores in the cluster.
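The per-instance usage of the column store itself can also be checked, for example through gv$inmemory_area (which shows, for each instance, how much of each In-Memory pool is allocated and used):

SQL> select inst_id, pool, alloc_bytes, used_bytes, populate_status
from gv$inmemory_area
order by inst_id, pool;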

Test 2:

  • parallel_instance_group set on Instance 1.
  • The service is running on Instance 1.

Set the parallel_instance_group parameter on the Instance BDT12c1_1 (INST_ID=1) to the service:

SQL> alter system set parallel_instance_group=BDT_PREF scope=both sid='BDT12c1_1';

System altered.

Check that the BDT_PREF service is running on Instance 1 (where parallel_instance_group has been set):

SQL> host srvctl status service -d BDT12c1 -s BDT_PREF
Service BDT_PREF is running on instance(s) BDT12c1_1

Connect to the database through the BDT_PREF service and check the data distribution:

SQL> alter table bdt inmemory priority none;

Table altered.

SQL> select count(*) from bdt;

  COUNT(*)
----------
  38220000

SQL> SELECT inst_id, populate_status, inmemory_size, bytes, bytes_not_populated
FROM gv$im_user_segments
WHERE segment_name = 'BDT';

   INST_ID POPULATE_ INMEMORY_SIZE      BYTES BYTES_NOT_POPULATED
---------- --------- ------------- ---------- -------------------
         1 COMPLETED     575602688 1610612736                   0

As you can see, Instance 1 contains all the data in its IMCS (as BYTES_NOT_POPULATED=0).

Test 3:

  • parallel_instance_group set on Instance 1.
  • The service is running on Instance 2.

The service may end up not running on its preferred Instance after:

  • A failover.
  • The database startup: When you use automatic services in an administrator-managed database, during planned database startup, services may start on the first instances that become available rather than their preferred instances (see the Oracle documentation).
  • A manual relocation.

Set the parallel_instance_group parameter on the Instance BDT12c1_1 (INST_ID=1) to the service:

SQL> alter system set parallel_instance_group=BDT_PREF scope=both sid='BDT12c1_1';

System altered.

Relocate the BDT_PREF service to Instance 2 (if needed):

SQL> host srvctl status service -d BDT12c1 -s BDT_PREF
Service BDT_PREF is running on instance(s) BDT12c1_1

SQL> host srvctl relocate service -d BDT12c1 -s BDT_PREF -newinst BDT12c1_2 -oldinst BDT12c1_1

SQL> host srvctl status service -d BDT12c1 -s BDT_PREF
Service BDT_PREF is running on instance(s) BDT12c1_2

Connect to the database through the BDT_PREF service and check the data distribution:

SQL> alter table bdt inmemory priority none;

Table altered.

SQL> select count(*) from bdt;

  COUNT(*)
----------
  38220000

SQL> SELECT inst_id, populate_status, inmemory_size, bytes, bytes_not_populated
FROM gv$im_user_segments
WHERE segment_name = 'BDT';

   INST_ID POPULATE_ INMEMORY_SIZE      BYTES BYTES_NOT_POPULATED
---------- --------- ------------- ---------- -------------------
         2 COMPLETED     277676032 1610612736           826236928
         1 COMPLETED     312672256 1610612736           732684288

As you can see, no instance contains all the data, as the BYTES_NOT_POPULATED column is greater than 0. The data is distributed across all of the IM column stores in the cluster.

Remarks:

  1. This distribution could lead to performance issues.
  2. If a failover occurred and Instance 1 is still down, then the data is fully populated on Instance 2.

Test 4:

  • parallel_instance_group set on Instance 1 and on Instance 2.
  • The service is running on Instance 1.

Set the parallel_instance_group parameter on both Instances to the service:

SQL> alter system set parallel_instance_group=BDT_PREF scope=both sid='BDT12c1_1';

System altered.

SQL> alter system set parallel_instance_group=BDT_PREF scope=both sid='BDT12c1_2';

System altered.

Check that the BDT_PREF service is running on the Instance 1:

SQL> host srvctl status service -d BDT12c1 -s BDT_PREF
Service BDT_PREF is running on instance(s) BDT12c1_1

Connect to the database through the BDT_PREF service and check the data distribution:

SQL> alter table bdt inmemory priority none;

Table altered.

SQL> select count(*) from bdt;

  COUNT(*)
----------
  38220000

SQL> SELECT inst_id, populate_status, inmemory_size, bytes, bytes_not_populated
FROM gv$im_user_segments
WHERE segment_name = 'BDT';

   INST_ID POPULATE_ INMEMORY_SIZE      BYTES BYTES_NOT_POPULATED
---------- --------- ------------- ---------- -------------------
         1 COMPLETED     575602688 1610612736                   0

As you can see, Instance 1 contains all the data in its IMCS (as BYTES_NOT_POPULATED=0).

Test 5:

  • parallel_instance_group set on Instance 1 and on Instance 2.
  • The service is running on Instance 2.

Set the parallel_instance_group parameter on both Instances to the service:

SQL> alter system set parallel_instance_group=BDT_PREF scope=both sid='BDT12c1_1';

System altered.

SQL> alter system set parallel_instance_group=BDT_PREF scope=both sid='BDT12c1_2';

System altered.

Relocate the BDT_PREF service to Instance 2 (if needed):

SQL> host srvctl status service -d BDT12c1 -s BDT_PREF
Service BDT_PREF is running on instance(s) BDT12c1_1

SQL> host srvctl relocate service -d BDT12c1 -s BDT_PREF -newinst BDT12c1_2 -oldinst BDT12c1_1

SQL> host srvctl status service -d BDT12c1 -s BDT_PREF
Service BDT_PREF is running on instance(s) BDT12c1_2

Connect to the database through the BDT_PREF service and check the data distribution:

SQL> alter table bdt inmemory priority none;

Table altered.

SQL> select count(*) from bdt;

  COUNT(*)
----------
  38220000

SQL> SELECT inst_id, populate_status, inmemory_size, bytes, bytes_not_populated
FROM gv$im_user_segments
WHERE segment_name = 'BDT';

   INST_ID POPULATE_ INMEMORY_SIZE      BYTES BYTES_NOT_POPULATED
---------- --------- ------------- ---------- -------------------
         2 COMPLETED     575602688 1610612736                   0

As you can see, Instance 2 contains all the data in its IMCS (as BYTES_NOT_POPULATED=0).

Out of curiosity, what about a service defined as preferred on both Instances and the parallel_instance_group set on both Instances?

Test 6:

  • parallel_instance_group set on Instance 1 and on Instance 2.
  • The service is running on Instance 1 and 2.

This time, let’s create the BDT_PREF service as preferred on both Instances:

srvctl add service -db BDT12c1 -s BDT_PREF -preferred BDT12c1_1,BDT12c1_2

Set the parallel_instance_group parameter on both Instances to the service:

SQL> alter system set parallel_instance_group=BDT_PREF scope=both sid='BDT12c1_1';

System altered.

SQL> alter system set parallel_instance_group=BDT_PREF scope=both sid='BDT12c1_2';

System altered.

Check that the BDT_PREF service is running on both Instances:

SQL> host srvctl status service -d BDT12c1 -s BDT_PREF
Service BDT_PREF is running on instance(s) BDT12c1_1,BDT12c1_2

Connect to the database through the BDT_PREF service and check the data distribution:

SQL> alter table bdt inmemory priority none;

Table altered.

SQL> select count(*) from bdt;

  COUNT(*)
----------
  38220000

SQL> SELECT inst_id, populate_status, inmemory_size, bytes, bytes_not_populated
FROM gv$im_user_segments
WHERE segment_name = 'BDT';

   INST_ID POPULATE_ INMEMORY_SIZE      BYTES BYTES_NOT_POPULATED
---------- --------- ------------- ---------- -------------------
         2 COMPLETED     277676032 1610612736           826236928
         1 COMPLETED     312672256 1610612736           732684288

As you can see, no instance contains all the data, as the BYTES_NOT_POPULATED column is greater than 0. The data is distributed across all of the IM column stores in the cluster.

REMARKS

  • I am not discussing the impact of the distribution on performance (with or without Auto DOP).
  • I just focus on how the data has been distributed.
  • In this previous blog post, the “workaround” proposed to get rid of the default behaviour on non-Engineered Systems (“Every row of the test table is stored in the IMCS of either one instance or the other”) works for any type of service.
  • In the current post I focus only on a service defined as preferred/available on a two-node RAC (the last test is out of curiosity).

CONCLUSION

With a 12.1.0.2 database and a service defined as preferred/available on a two-node RAC:

  • With the parallel_instance_group parameter set to the service on both Instances:

The data is fully populated on the Instance the service is running on.

  • With the parallel_instance_group parameter set to the service on the preferred Instance only:

If the service is running on the preferred Instance, the data is fully populated on that Instance.

If the service is running on the “available” Instance, the data is distributed across all of the IM column stores in the cluster.

With a 12.1.0.2 database, a service defined as preferred on both Instances of a two-node RAC and the parallel_instance_group parameter set to the service on both Instances:

The data is distributed across all of the IM column stores in the cluster.

Binding (with processor_group_name) versus caging: Facts, Observations and Customer cases.

INTRODUCTION

This blog post came to my mind following an interesting conversation I had on Twitter with Philippe Fierens and Karl Arao.

Let’s go for it:

Nowadays it is very common to consolidate multiple databases on the same server.

One could want to limit the CPU usage of some databases and/or guarantee CPU resources for others. I would like to compare two methods to achieve this:

  • Instance Caging (available since 11.2).
  • processor_group_name (available since 12.1).

This blog post is composed of 4 main parts:

  1. Facts: Describing the features and a quick overview of how to implement them.
  2. Observations: What I observed on my lab server. You should observe the same in your environment for most of them.
  3. OEM and AWR views for Instance Caging and processor_group_name under CPU pressure.
  4. Customer cases: I cover all the cases I have faced so far.

FACTS

Instance caging: Its purpose is to limit or “cage” the amount of CPU that a database instance can use at any time. It can be enabled online (no need to restart the database) in 2 steps:

  •  Set the “cpu_count” parameter to the maximum number of CPUs the database should be able to use (Oracle advises setting cpu_count to at least 2):
SQL> alter system set cpu_count = 2;
  • Set the “resource_manager_plan” parameter to enable CPU Resource Manager:
SQL> alter system set resource_manager_plan = 'default_plan';
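To verify that the caging is actually in effect, one can for example check that CPU Resource Manager is enabled at the top plan level and look at the current cpu_count value:

SQL> select name, cpu_managed, instance_caging
from v$rsrc_plan
where is_top_plan = 'TRUE';

SQL> show parameter cpu_count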

Graphical view with multiple databases limited as an example:

instance_caging_graphical_view

processor_group_name: Bind a database instance to specific CPUs / NUMA nodes. It is enabled in 4 steps on my RedHat 6.5 machine:

  •  Specify the CPUs or NUMA nodes by creating a “processor group”:

Example: Snippet of /etc/cgconfig.conf:

group oracle {
    perm {
      task {
        uid = oracle;
        gid = dc_dba;
      }
      admin {
        uid = oracle;
        gid = dc_dba;
      }
    }
    cpuset {
      cpuset.mems=0;
      cpuset.cpus="1-9";
    }
 }
  •  Start the cgconfig service:
service cgconfig start
  •  Set the Oracle parameter “processor_group_name” to the name of this processor group:
SQL> alter system set processor_group_name='oracle' scope=spfile;
  •  Restart the Database Instance.
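Once the instance is restarted, a simple way to double-check that it really runs inside the cgroup is to look at the cgroup membership of one of its background processes (pmon here). In the sketch below, <instance_name> and <pmon_pid> are placeholders to replace with your own values; the cpuset line of the output should show the “oracle” group:

pgrep -f ora_pmon_<instance_name>
cat /proc/<pmon_pid>/cgroup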

Graphical view with multiple databases bound as an example:

cpu_binding_graphical_view

This graphical view represents the ideal configuration, where a database is bound to specific CPUs and NUMA local memory (if you need more details, see observation number 1 later in this post).

Remark regarding the Database Availability:

  • The Instance caging can be enabled online.
  • For cgroups/processor_group_name the database needs to be stopped and started.

OBSERVATIONS

For brevity, “Caging” is used for “Instance Caging” and “Binding” for “cgroups/processor_group_name”.

Also “CPU” is used when CPU threads are referred to, which is the actual granularity for caging and binding.

The observations have been made on a 12.1.0.2 database using large pages (use_large_pages=only). I will not cover the “non large pages” case as not using large pages is not an option for me.

1. When you define the cgroup for the binding, you should pay attention to NUMA memory and CPU locality, as it has an impact on LIO performance (see this blog post). As you can see (on my specific configuration), remote NUMA node access has been about 2 times slower compared to local access.

2. Without binding, the SGA is interleaved across all the NUMA nodes (unless you set the Oracle hidden parameter _enable_NUMA_interleave to FALSE):

numa_mem_alloc_db_no_bind

3. With binding (on more than one NUMA node), the SGA is not interleaved across all the NUMA nodes (SGA bound to nodes 0 and 1):

numa_mem_alloc_db_bind_nodes_0_1
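As the SGA is backed by large pages here (use_large_pages=only), a quick way to see on which NUMA nodes it has been allocated is to compare the per-node huge page counters (Linux specific) before and after the instance startup, for example with:

grep HugePages /sys/devices/system/node/node*/meminfo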

4. Instance caging has been less performant (compared to CPU binding) during LIO pressure (by pressure I mean that the database needs more CPU resources than it is allowed to use) (see this blog post for more details). Please note that the comparison has been done on the same NUMA node (to avoid giving binding an advantage over caging due to observations 1 and 2).

5. With CPU binding already in place (using processor_group_name), we are able to change the number of CPUs a database is allowed to use on the fly (see this blog post).
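As an illustration of observation 5 (assuming the “oracle” cgroup created above and the libcgroup tools installed), the CPU list of the cgroup can be modified while the instance is up, for example:

cgset -r cpuset.cpus=1-5 oracle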

Remarks:

  • You can find more details regarding observations 2 and 3 in this blog post.
  • From 2 and 3 we can conclude that with the caging, the SGA is interleaved across all the NUMA nodes (unless the caging has been set up on top of the binding (mixed configuration)).
  • All the above observations are related to performance. You may want to run the tests described in observations 1 and 4 in your own environment to measure the actual impact.

Caging and Binding with CPU pressure: OEM and AWR views

What do the OEM and AWR views show for a database under CPU pressure (pressure meaning the database needs more CPU resources than it is configured to use), with caging and with binding? Let’s have a look.

CPU pressure with Caging:

The database has the caging enabled with the cpu_count parameter set to 6 and is running 9 SLOB users in parallel.

The OEM view:

new_caging_6_nocg_for_blog

As you can see: on average about 6 sessions are running on CPU and about 3 are waiting in the “Scheduler” wait class.

The AWR view:

awr_caging_6_nocg

As you can see: approximately 60% of the DB time is spent on CPU and 40% is spent waiting because of the caging (Scheduler wait class).
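Outside of OEM and AWR, the throttling due to the caging can also be quantified with the Resource Manager metric views, for example with something like the query below (v$rsrcmgrmetric covers the last completed minute and reports times in milliseconds):

SQL> select consumer_group_name,
round(cpu_consumed_time/1000) as cpu_consumed_s,
round(cpu_wait_time/1000) as cpu_throttled_s
from v$rsrcmgrmetric
order by consumer_group_name;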

CPU pressure with Binding:

The processor_group_name is set to a cgroup of 6 CPUs. The Instance is running 9 SLOB users in parallel.

The OEM view:

new_cg6_for_blog

As you can see: on average 6 sessions are running on CPU and 3 are waiting for the CPU (runqueue).
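With the binding there is no Scheduler wait class involved: the waiting sessions simply sit in the OS run queue, which can also be observed at the OS level while the load is running (the “r” column of vmstat shows the number of runnable processes), for example with:

vmstat 5 3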

The AWR view:

awr_cg_6_top10

New “Top Events” section as of 12c:

awr_cg_6_topevents

To summarize:

awr_12c_top_event_cg6

Then, we can conclude that:

  • 65.5% of the DB time has been spent on the CPU.
  • 99.62 – 65.5 = 34.12% of the DB time has been spent in the runqueue.

CUSTOMER CASES

The cases below are based on studies for customers.

Case number 1: One database uses too much CPU and affects other instances’ performance.

Then we want to limit its CPU usage:

  • Caging:   We are able to cage the database Instance without any restart.
  • Binding:  We are able to bind the database Instance, but it needs a restart.

Example of such a case by implementing Caging:

caging_on_the_fly

What to choose?

Caging is the best choice in this case (as it is the easiest and we don’t need to restart the database).

For the following cases we need to define 2 categories of databases:

  1. The paying ones: the customer paid for guaranteed CPU resources.
  2. The non-paying ones: the customer did not pay for guaranteed CPU resources.

Case number 2: Guarantee CPU resources for some databases. The CPU resources are defined by the customer and need to be guaranteed at the database level.

Caging:

  • All the databases need to be caged (if not, there is no guaranteed resource availability, as one non-caged database could take all the CPU).
  • As a consequence, you can’t mix paying and non-paying customers without caging the non-paying ones.
  • You have to know how to cage each database individually to be sure that sum(cpu_count) <= Total number of threads (if not, CPU resources can’t be guaranteed; see the quick check after this list).
  • Be sure that sum(cpu_count) of the non-paying ones <= “CPU Total – number of paid CPUs” (if not, CPU resources can’t be guaranteed).
  • There is a maximum number of databases that you can put on a machine: as cpu_count >= 2 (see facts), the maximum number of databases = Total number of threads / 2.
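As a quick check of the sum(cpu_count) rule, the total number of CPU threads seen by Oracle on the server and the current cage of a given instance can be retrieved on each instance, for example with:

SQL> select value from v$osstat where stat_name = 'NUM_CPUS';

SQL> show parameter cpu_count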

Graphical view as an example:

all_caged_databases_level

If you need to add a new database (a paying one or a non-paying one) and you are already in a situation where “sum(cpu_count) = Total number of threads”, then you have to review the caging of all non-paying databases (so as not to over-allocate the machine).

Binding:

Each database is linked to a set of CPUs (there is no overlap, meaning that a CPU can be part of only one cgroup). Then, create:

  • One cgroup per paying database.
  • One cgroup for all the non-paying databases.

That way we can easily mix paying and non-paying customers on the same machine.
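As an illustration, the corresponding /etc/cgconfig.conf could look like the snippet below (the group names, NUMA nodes and CPU ranges are made up for the example, and the perm sections shown earlier are omitted for brevity):

group paying_db1 {
    cpuset {
      cpuset.mems=0;
      cpuset.cpus="1-8";
    }
 }

group nonpaying {
    cpuset {
      cpuset.mems=1;
      cpuset.cpus="10-17";
    }
 }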

Graphical View as an example:

bind_all_databases_level

If you need to add a new database, then if this is:

  • A non-paying one: Put it into the non-paying group.
  • A paying one: Create a new cgroup for it, assigning CPUs taken away from the non-paying ones if needed.

Binding + caging mix:

  • Create a cgroup for all the paying databases and then put caging on each paying database.
  • Put the non-paying ones into another cgroup (no overlap with the paying one) without caging.

Graphical View as an example:

bind_caging_mix

If you need to add a new database, then if this is:

  • A non-paying one: Put it into the non-paying group.
  • A paying one: Extend the paying group if needed (taking CPUs away from the non-paying ones) and cage it accordingly.

What to choose?

I would say there is no “best choice”: it all depends on the number of databases we are talking about (for a large number of databases I would use the mix approach). And don’t forget that with the caging option you can’t put more than “Total number of threads / 2” databases on the machine.

Case number 3: Guarantee CPU resources for exactly one group of databases. The CPU resources need to be guaranteed at the group level (imagine that the customer paid for 24 CPUs, whatever the number of databases).

Caging:

We have 2 options:

1. Option 1:

  • Cage all the non-paying databases.
  • Be sure that sum(cpu_count) of the non-paying ones <= “CPU Total – number of paid CPUs” (if not, CPU resources can’t be guaranteed).
  • Don’t cage the paying databases.

That way the paying databases are guaranteed at least the resources they paid for.

Graphical View as an example:

caging_1group_option1

If you need to add a new database, then if this is:

  • A non-paying one: Cage it, but you may need to re-cage all the non-paying ones to guarantee that the paying ones still get their “minimum” resources.
  • A paying one: Nothing to do.

2. Option 2:

  • Cage all the databases (paying and non-paying).
  • You have to know how to cage each database individually to be sure that sum(cpu_count) <= Total number of threads (if not, CPU resources can’t be guaranteed).
  • Be sure that sum(cpu_count) of the non-paying ones <= “CPU Total – number of paid CPUs” (if not, CPU resources can’t be guaranteed).

That way the paying databases are guaranteed exactly the resources they paid for.

Graphical View as an example:

caging_1group_option2

If you need to add a new database (a paying one or a non-paying one) and you are already in a situation where “sum(cpu_count) = Total number of threads”, then you have to review the caging of all non-paying databases (so as not to over-allocate the machine).

Binding:

Again 2 options:

1. Option 1:

  • Put all the non-paying databases into a “CPU Total – number of paid CPUs” cgroup.
  • Allow all the CPUs to be used by the paying databases (don’t create a cgroup for the paying databases).
  • That way the paying group is guaranteed at least the resources it paid for.

Graphical View as an example:

bind_1group_option1

If you need to add a new database, then if this is:

  • A non-paying one: Put it in the non-paying cgroup.
  • A paying one: Create it outside the non-paying cgroup.

2. Option 2:

  • Put all the paying databases into a cgroup.
  • Put all the non-paying databases into another cgroup (no overlap with the paying cgroup).

That way the paying databases are guaranteed exactly the resources they paid for.

Graphical View:

bind_1group_option2

If you need to add a new database, then if this is:

  • A non-paying one: Put it in the non-paying cgroup.
  • A paying one: Put it in the paying cgroup.

Binding + caging mix:

I don’t see any benefits here.

What to choose?

I like option 1 in both cases, as it allows customers to get more than what they paid for (with the guarantee of what they paid for). Then the choice between caging and binding really depends on the number of databases we are talking about (binding is preferable and more easily manageable for a large number of databases).

Case number 4: Guarantee CPU resources for more than one group of databases (imagine that one group = one customer). The CPU resources need to be guaranteed at each group level.

Caging:

  • All the databases need to be caged (if not, there is no guaranteed resource availability, as one non-caged database could take all the CPU).
  • As a consequence, you can’t mix paying and non-paying customers without caging the non-paying ones.
  • You have to know how to cage each database individually to be sure that sum(cpu_count) <= Total number of threads (if not, CPU resources can’t be guaranteed).
  • Be sure that sum(cpu_count) of the non-paying ones <= “CPU Total – number of paid CPUs” (if not, CPU resources can’t be guaranteed).
  • There is a maximum number of databases that you can put on a machine: as cpu_count >= 2 (see facts), the maximum number of databases = Total number of threads / 2.

Graphical View as an example:

caging_2groups

If you need to add a new database (a paying one or a non-paying one) and you are already in a situation where “sum(cpu_count) = Total number of threads”, then you have to review the caging of all non-paying databases (so as not to over-allocate the machine).

Overlapping is not possible in that case (meaning you can’t guarantee resources with overlapping).

Binding:

  • Create a cgroup for each paying database group and also for the non-paying ones (no overlap between all the groups).
  • Overlapping is not possible in that case (meaning you can’t guarantee resources with overlapping).

Graphical view as an example:

bind_2groups

If you need to add a database, then simply put it in the right group.

Binding + caging mix:

I don’t see any benefits here.

What to choose? 

It really depends on the number of databases we are talking about (binding is preferable and more easily manageable for a large number of databases).

REMARKS

  • If you have seen (or faced) more cases, then feel free to share and comment on this blog post.
  • I did not pay attention to potential impacts on LIO performance linked to any possible choice (caging vs binding, with or without NUMA). I just mentioned them in the observations, but did not include them in the use cases (I only discussed the feasibility of the implementation).
  • I really like option 1 in case number 3. This option was proposed by Karl during our conversation.

CONCLUSION

The choice between caging and binding depends on:

  • Your needs.
  • The number of databases you are consolidating on the server.
  • Your performance expectations.

I would like to thank Karl Arao, Frits Hoogland, Yves Colin and David Coelho for their efficient reviews of this blog post.