FAN callouts for In-Memory Instance distribution with RAC, PDB and preferred/available service

The context:

In my previous blog post, In-Memory Instance distribution with RAC and Multitenant Environment, I checked whether we can influence (that is, get rid of the default behaviour of) how In-Memory data is distributed across the Instances per PDB. I checked because:

  • By default all objects populated into memory will be distributed across all of the IM column stores in the cluster.
  • Christian Antognini showed us in this blog post that having the data not fully populated on an Instance (the default behaviour on non Engineered Systems) could lead to bad performance.

Let’s summarize the findings:

  • If the PDB is open on one Instance, then this Instance contains all the data.
  • If the PDB is open on all the Instances then we have to set “_inmemory_auto_distribute” to false to have at least one Instance that contains all the data.

The objective:

That said, let’s try to reach an objective: With a service defined as preferred/available on a PDB, I want the PDB to be open on only one Instance at any time (That way we don’t need to set the hidden parameter).

In this blog post, Martin Bach already explained the PDB open/close behaviour with a service defined as preferred/available on it. I don't want to repeat Martin's work; I just want to draw your attention to the following (with our objective in mind):

The default PDB status is:

SYS@CDB$ROOT> select inst_id,open_mode from gv$pdbs where name='BDTPDB';

   INST_ID OPEN_MODE
---------- ----------
         1 MOUNTED
         2 MOUNTED

As you can see the PDB is mounted on both instances.

Now, let’s create and start a preferred/available service linked to our PDB that way:

srvctl add service -db SRCCONT -pdb BDTPDB -service BDTPDB_TAF -e select -m BASIC -P BASIC -r SRCCONT1 -a SRCCONT2 -w 5 -z 12
srvctl start service -d SRCCONT

Let’s check the PDB status and the service status:

SYS@CDB$ROOT> select inst_id,open_mode from gv$pdbs where name='BDTPDB';

   INST_ID OPEN_MODE
---------- ----------
         2 MOUNTED
         1 READ WRITE
srvctl status service -d SRCCONT
Service BDTPDB_TAF is running on instance(s) SRCCONT1

So, the PDB is open on the Instance that is hosting the service and not open on the other Instance:  The Objective is reached.

Now, let's kill pmon for the SRCCONT1 Instance (the one that hosts the service), wait for SRCCONT1 to come back, and check the PDB and the service status:

kill -9 `ps -ef | grep -i pmon | grep -i SRCCONT1 | awk '{print $2}'`

--PDB Status
SYS@CDB$ROOT> select inst_id,open_mode from gv$pdbs where name='BDTPDB';

   INST_ID OPEN_MODE
---------- ----------
         2 READ WRITE
         1 MOUNTED

-- Service Status
srvctl status service -d SRCCONT
Service BDTPDB_TAF is running on instance(s) SRCCONT2

So, the service failed over, the PDB has been closed on the failing Instance and is open on the one that is now hosting the service: The Objective is reached.

Now, let’s (fail back) relocate the service manually and check the PDB and the service status:

srvctl relocate service -d SRCCONT -s BDTPDB_TAF -oldinst SRCCONT2 -newinst SRCCONT1 -force

--PDB Status
SYS@CDB$ROOT>  select inst_id,open_mode from gv$pdbs where name='BDTPDB';

   INST_ID OPEN_MODE
---------- ----------
         1 READ WRITE
         2 READ WRITE

-- Service Status
srvctl status service -d SRCCONT                                                                         
Service BDTPDB_TAF is running on instance(s) SRCCONT1

As you can see the service has been relocated and the PDB is now open on both Instances: The Objective is not reached.

To summarize:

  1. The objective has been reached when we created and started the service.
  2. The objective has been reached when the service failed over (Due to Instance crash).
  3. The objective has not been reached when the service has been relocated manually.

FAN callouts to the rescue:

FAN callouts are server-side scripts or executables that run whenever a FAN event is generated. I’ll use this feature to fix the third point where the objective is not reached.
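
If you have never played with FAN callouts before, a minimal sketch (the log file location is just an example I picked) is an executable script dropped under $GRID_HOME/racg/usrco/ on every node that simply logs whatever it receives; the clusterware passes the event type as the first argument followed by a list of name=value pairs:

#!/bin/ksh
# Minimal FAN callout sketch: log every FAN event received (log path is just an example)
# Must be executable and located under $GRID_HOME/racg/usrco/ on every node
echo "$(date '+%Y/%m/%d %H:%M:%S') $*" >> /tmp/fan_events.log

Looking at the name=value pairs logged this way is a convenient way to see which STATUS/REASON combinations you want a real callout to react to.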

To do so, I created the following script:

#!/bin/ksh
#
# Close the PDB on the old instance during manual service relocation
#
# Description:
# -  This script closes the PDB on the old instance during manual service relocation
#
# Requirement:
#  - Location of the script must be $ORA_CRS_HOME/racg/usrco/.
#
# Version:
#  - version 1.0
#  - Bertrand Drouvot
#
# Date: 2014/12/04
#
#
# Env settings

ME=`who | cut -d" " -f1`
PASSWDFILE="/etc/passwd"
HOMEDIR=`egrep "^${ME}" ${PASSWDFILE} | cut -d: -f 6`
DATE_LOG=$(date +"%y%m%d_%H%M")
LOGFILE=fan_close_pdb_${DATE_LOG}.log

VAR_EVENTTYPE=$1

for ARGS in $*;
 do
        PROPERTY=`echo $ARGS | cut -f1 -d"=" | tr '[:lower:]' '[:upper:]'`
        VALUE=`echo $ARGS | cut -f2 -d"=" | tr '[:lower:]' '[:upper:]'`
        case $PROPERTY in
                VERSION)      VAR_VERSION=$VALUE ;;
                SERVICE)      VAR_SERVICE=$VALUE ;;
                DATABASE)     VAR_DATABASE=$VALUE ;;
                INSTANCE)     VAR_INSTANCE=$VALUE ;;
                HOST)         VAR_HOST=$VALUE ;;
                STATUS)       VAR_STATUS=$VALUE ;;
                REASON)       VAR_REASON=$VALUE ;;
                CARD)         VAR_CARDINALITY=$VALUE ;;
                TIMESTAMP)    VAR_LOGDATE=$VALUE ;;
                ??:??:??)     VAR_LOGTIME=$VALUE ;;
        esac
 done 

# Close the PDB (PDB name extracted from the service name) if service relocated manually

if [[ "$VAR_EVENTTYPE" = "SERVICEMEMBER" && "$VAR_STATUS" = "DOWN" && "$VAR_REASON" = "USER" ]]
then

PATH=$PATH:/usr/local/bin
export PATH
ORAENV_ASK=NO
export ORAENV_ASK
ORACLE_SID=${VAR_INSTANCE}
export ORACLE_SID
. oraenv > /dev/null

# Extract PDB name based on our naming convention (You may need to change this)
PDB=`echo "${VAR_SERVICE}" | sed 's/_TAF.*//'`

# Close the PDB on the old instance
sqlplus /nolog <<EOF
connect / as sysdba;
alter pluggable database ${PDB} close instances=('${VAR_INSTANCE}');
EOF

# Log this in a logfile
echo ${*} >> ${HOMEDIR}/log/${LOGFILE}

fi

I placed this script under $GRID_HOME/racg/usrco/ (on both nodes). It will be executed for every FAN event, but it will take action only for a manual service relocation (thanks to the "if" condition in it).

The goal of the script is to close the PDB on the old Instance during manual service relocation.
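
For completeness, deploying the callout could look like the following sketch (the node names and the fan_close_pdb.ksh file name are examples I made up; the script must be executable by the Grid Infrastructure owner):

# copy the callout to the FAN callout directory of each node and make it executable
for NODE in node1 node2
do
  scp fan_close_pdb.ksh ${NODE}:${GRID_HOME}/racg/usrco/
  ssh ${NODE} "chmod 755 ${GRID_HOME}/racg/usrco/fan_close_pdb.ksh"
done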

Now, let’s check if the objective is reached. Let’s relocate the service manually from SRCCONT1 to SRCCONT2:

-- Current PDB Status
SYS@CDB$ROOT> select inst_id,open_mode from gv$pdbs where name='BDTPDB';

   INST_ID OPEN_MODE
---------- ----------
         1 READ WRITE
         2 READ WRITE

-- Current service status
SYS@CDB$ROOT> !srvctl status service -d SRCCONT
Service BDTPDB_TAF is running on instance(s) SRCCONT1

-- Relocate the service
SYS@CDB$ROOT> !srvctl relocate service -d SRCCONT -s BDTPDB_TAF -oldinst SRCCONT1 -newinst SRCCONT2 -force

-- Check PDB status
SYS@CDB$ROOT> select inst_id,open_mode from gv$pdbs where name='BDTPDB';

   INST_ID OPEN_MODE
---------- ----------
         2 READ WRITE
         1 MOUNTED

-- Check service status
SYS@CDB$ROOT> !srvctl status service -d SRCCONT
Service BDTPDB_TAF is running on instance(s) SRCCONT2

So, as you can see, the service has been manually relocated and the PDB has been closed on the old Instance. That way, the PDB is open on only one Instance: The Objective is reached.

Remarks:

  • I used the “force” option during the manual relocation, which disconnects all the sessions during the relocate operation. If you are not using the “force” option, then I guess you don't want to close the PDB on the old Instance ;-).
  • The PDB open/close behaviour with a preferred/available service linked to it is the one described in this post, unless you have saved the PDB's state (the default is to discard it); a quick reminder of the syntax follows this list.
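
As a reminder, here is a minimal sketch of the save/discard state syntax (run from the CDB root on our BDTPDB PDB); with a saved OPEN state the PDB would be re-opened automatically on that instance, which would defeat our objective:

SYS@CDB$ROOT> -- remember the current open mode of the PDB (not what we want here)
SYS@CDB$ROOT> alter pluggable database BDTPDB save state;
SYS@CDB$ROOT> -- back to the default behaviour (the one described in this post)
SYS@CDB$ROOT> alter pluggable database BDTPDB discard state;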

Conclusion:

The objective is reached in all cases: the PDB is open on only one Instance at any time (with a preferred/available service in place). That way we don't need to set the hidden parameter “_inmemory_auto_distribute” to false to get the In-Memory data fully populated on one Instance.

Update 2015/03/27: At the PDB level, if you set the parallel_instance_group parameter to the service name on both Instances, then the In-Memory data is fully populated on the Instance the service is running on, even if the PDB is open on both Instances (after the manual service relocation). You can find more details here.
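
As a quick sketch of that update (the full details are in the linked post, and this assumes parallel_instance_group is modifiable at the PDB level in your version), it boils down to something like:

-- connected to the BDTPDB PDB: tie parallel query (and so the IMCS population) to the service
alter system set parallel_instance_group = 'BDTPDB_TAF' sid = 'SRCCONT1';
alter system set parallel_instance_group = 'BDTPDB_TAF' sid = 'SRCCONT2';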


In-Memory Instance distribution with RAC and Multitenant Environment

In a previous post (see In-Memory Instance distribution with RAC databases: I want at least one instance that contains all the data) I provided a way to get rid of the default In-Memory Instance distribution on a RAC (non CDB) database.

To summarize:

  • Thanks to the hidden parameter “_inmemory_auto_distribute”, we are able to change the way the In-Memory Column Stores are populated by default on an Oracle RAC database.
  • By default all objects populated into memory will be distributed across all of the IM column stores in the cluster.
  • With “_inmemory_auto_distribute” set to false, we are able to populate one instance (or all the instances) with all the data.

Now, let's have a look at the In-Memory distribution on a Multitenant RAC database and check whether we can influence (that is, get rid of the default behaviour of) how In-Memory data is distributed across the Instances per PDB.

By default the IMCS distribution for a PDB is the following:

First case: The PDB is open on one Instance only:

SYS@CDB$ROOT> alter pluggable database BDTPDB open instances=('BDT12c3_1');

Pluggable database altered.

-- Now connect to the BDTPDB PDB as bdt/bdt
SYS@CDB$ROOT> @connect_user_pdb.sql
Enter value for user: bdt
Enter value for pwd: bdt
Enter value for pdb: BDTPDB
Connected.

-- Let's enable the inmemory attribute on the BDT table
BDT@BDTPDB> alter table bdt inmemory ;

Table altered.

-- Trigger the IMCS population
BDT@BDTPDB> select count(*) from bdt;

  COUNT(*)
----------
  44591000

-- check the distribution
BDT@BDTPDB> SELECT s.inst_id, c.name,s.populate_status, s.inmemory_size, s.bytes, s.bytes_not_populated
FROM gv$im_user_segments s, v$pdbs c
WHERE s.segment_name = 'BDT'
and c.con_id=s.con_id; 

   INST_ID NAME                           POPULATE_ INMEMORY_SIZE      BYTES BYTES_NOT_POPULATED
---------- ------------------------------ --------- ------------- ---------- -------------------
         1 BDTPDB                         COMPLETED     695074816 1879048192                   0

As you can see, Instance 1 contains all the data for this PDB (as BYTES_NOT_POPULATED=0).

Second case: The PDB is open on all the Instances:

-- Open the BDTPDB PDB on all instances
SYS@CDB$ROOT> alter pluggable database BDTPDB open instances=all;

Pluggable database altered.

-- Now connect to the BDTPDB PDB as bdt/bdt
SYS@CDB$ROOT> @connect_user_pdb.sql
Enter value for user: bdt
Enter value for pwd: bdt
Enter value for pdb: BDTPDB
Connected.  

-- Let's enable the inmemory attribute on the BDT table
BDT@BDTPDB> alter table bdt inmemory ;

Table altered.

-- Trigger the IMCS population
BDT@BDTPDB> select count(*) from bdt;

  COUNT(*)  
----------  
  44591000

-- check the distribution
BDT@BDTPDB> SELECT s.inst_id, c.name,s.populate_status, s.inmemory_size, s.bytes, s.bytes_not_populated
FROM gv$im_user_segments s, v$pdbs c
WHERE s.segment_name = 'BDT'
and c.con_id=s.con_id;


   INST_ID NAME                           POPULATE_ INMEMORY_SIZE      BYTES BYTES_NOT_POPULATED
---------- ------------------------------ --------- ------------- ---------- -------------------
         1 BDTPDB                         COMPLETED     376307712 1879048192           802889728
         2 BDTPDB                         COMPLETED     298909696 1879048192          1006559232

We can see that (for this PDB), by default, no instance contains all the data, as the BYTES_NOT_POPULATED column is greater than 0.

Now let's set the hidden parameter “_inmemory_auto_distribute” to false (at the PDB level):

BDT@BDTPDB> alter system set "_inmemory_auto_distribute"=false;
alter system set "_inmemory_auto_distribute"=false
*
ERROR at line 1:
ORA-65040: operation not allowed from within a pluggable database

So, we can’t set this hidden parameter at the PDB level.

Ok, let’s set it at the CDB level:

SYS@CDB$ROOT> alter system set "_inmemory_auto_distribute"=false;

System altered.

then re-trigger the population on instance 1 (for this PDB) and check the distribution:

SYS@CDB$ROOT> @connect_user_pdb.sql
Enter value for user: bdt
Enter value for pwd: bdt
Enter value for pdb: BDTPDB
Connected.  
BDT@BDTPDB> select count(*) from bdt;

  COUNT(*)  
----------  
  44591000

BDT@BDTPDB> SELECT s.inst_id, c.name,s.populate_status, s.inmemory_size, s.bytes, s.bytes_not_populated
FROM gv$im_user_segments s, v$pdbs c
WHERE s.segment_name = 'BDT'
and c.con_id=s.con_id;

   INST_ID NAME                           POPULATE_ INMEMORY_SIZE      BYTES BYTES_NOT_POPULATED
---------- ------------------------------ --------- ------------- ---------- -------------------
         2 BDTPDB                         COMPLETED     297861120 1879048192          1006559232
         1 BDTPDB                         COMPLETED     675151872 1879048192                   0

BINGO! Look at Instance 1: it now contains all the data in its IMCS (as BYTES_NOT_POPULATED=0) for this PDB.

If I connect to Instance 2 (for this PDB) and launch the query to trigger the population, I'll get:

   INST_ID NAME                           POPULATE_ INMEMORY_SIZE      BYTES BYTES_NOT_POPULATED
---------- ------------------------------ --------- ------------- ---------- -------------------
         1 BDTPDB                         COMPLETED     675151872 1879048192                   0
         2 BDTPDB                         COMPLETED     672006144 1879048192                   0

So, both instances now contain all the data for this PDB.

Remarks:

  • As I said in the previous post: I used a hidden parameter, so you should get Oracle Support approval before using it.
  • I guess (because I can't test it) that in a Multitenant and Engineered Systems context it is also possible to have all the data appear in the IM column store on every node for a PDB (thanks to the duplicate attribute), as Christian Antognini showed us in this blog post for non CDB databases.

Conclusion:

The conclusion is the same as in my previous post (with a non CDB):

  • Thanks to the hidden parameter “_inmemory_auto_distribute”, we are able to change the way the In-Memory Column Stores are populated by default on an Oracle RAC database.
  • By default all objects populated into memory will be distributed across all of the IM column stores in the cluster.
  • With “_inmemory_auto_distribute” set to false, we are able to populate one instance (or all the instances) with all the data.

And we can add that:

  • If the PDB is open on one instance only, then this instance contains all the data.
  • If the PDB is open on all the instances, then we have to set “_inmemory_auto_distribute” to false to have at least one instance that contains all the data.
  • Each PDB population will behave the same way (distributed or not) and inherit the CDB setting (as we cannot set the hidden parameter at the PDB level).

 

In-Memory Instance distribution with RAC databases: I want at least one instance that contains all the data

As you may know, with a RAC database, by default all objects populated into
In-memory will be distributed across all of the IM column stores in the cluster. It is also possible to have all the data appear in the IM column store on every node (Engineered Systems only). See this white paper for more details.
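
For reference, on Engineered Systems that per-node copy is requested with the DUPLICATE attribute; a minimal sketch on the BDT table used below would be:

SQL> -- Engineered Systems only: keep a full copy of the segment in the IM column store of every node
SQL> alter table bdt inmemory duplicate all;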

There are two very interesting blog posts around this subject:

  1. Kerry Osborne shows us how we can distribute the data across the nodes (using the DISTRIBUTE INMEMORY attribute) in this blog post.
  2. Christian Antognini shows us in this blog post that having the data not fully populated on each instance could lead to bad performance.

But wait: if my RAC service is an active/passive one (I mean the service is started on only one instance), then I would like to have all the data fully populated into the In-Memory column store of the “active” instance (and not distributed across the instances), right?

Let’s try to achieve this (all data fully populated into the In-Memory column store of the active instance):

So, by default, the IMCS distribution is the following:

SQL> alter table bdt inmemory ;

Table altered.

SQL> select count(*) from bdt;

  COUNT(*)
----------
  38220000

SQL> SELECT inst_id, populate_status, inmemory_size, bytes, bytes_not_populated
FROM gv$im_user_segments
WHERE segment_name = 'BDT';

   INST_ID POPULATE_ INMEMORY_SIZE      BYTES BYTES_NOT_POPULATED
---------- --------- ------------- ---------- -------------------
         1 COMPLETED     313720832 1610612736           732684288
         2 COMPLETED     274530304 1610612736           826236928

As you can see, no instance contains all the data, as the BYTES_NOT_POPULATED column is greater than 0.

Now let's set the hidden parameter “_inmemory_auto_distribute” to false and re-trigger the population on instance 1:

SQL> alter system set "_inmemory_auto_distribute"=false;

System altered.

SQL> select count(*) from bdt;

  COUNT(*)
----------
  38220000

SQL> SELECT inst_id, populate_status, inmemory_size, bytes, bytes_not_populated
FROM gv$im_user_segments
WHERE segment_name = 'BDT';

   INST_ID POPULATE_ INMEMORY_SIZE      BYTES BYTES_NOT_POPULATED
---------- --------- ------------- ---------- -------------------
         1 COMPLETED     587137024 1610612736                   0
         2 COMPLETED     274530304 1610612736           826236928

BINGO! Look at Instance 1: it now contains all the data in its IMCS (as BYTES_NOT_POPULATED=0).

If I connect to Instance 2 and launch the query to trigger the population, I'll get:

SQL> SELECT inst_id, populate_status, inmemory_size, bytes, bytes_not_populated
FROM gv$im_user_segments
WHERE segment_name = 'BDT';
   INST_ID POPULATE_ INMEMORY_SIZE      BYTES BYTES_NOT_POPULATED
---------- --------- ------------- ---------- -------------------
         1 COMPLETED     587137024 1610612736                   0
         2 COMPLETED     589234176 1610612736                   0

So, both instances now contain all the data (like the duplicate attribute would have done).

Remarks:

  1. In case of an active/passive service (meaning a preferred/available service), after a service failover the second instance will also contain all the data (once populated).
  2. In case of an active/active service, each database instance will contain all the data, but not necessarily at the same time (so the behaviour is a little bit different from the duplicate attribute).
  3. I used a hidden parameter, so you should get Oracle Support approval before using it.
  4. I'll check the behaviour with PDBs in another post.

Conclusion:

  • Thanks to the hidden parameter “_inmemory_auto_distribute”, we are able to change the way the In-Memory Column Stores are populated by default on an Oracle RAC database.
  • By default all objects populated into memory will be distributed across all of the IM column stores in the cluster.
  • With “_inmemory_auto_distribute” set to false, we are able to populate one instance (or all the instances) with all the data.

Update 2015/03/27:

  • This “workaround”, proposed to get rid of the default behaviour on non Engineered Systems (“Every row of the test table is stored in the IMCS of either one instance or the other”), works for any type of service.
  • In case of a preferred/available service, there is no need to set the hidden parameter to get the IM column store fully populated on one node. The secret sauce is linked to the parallel_instance_group parameter. See this blog post for more details.

Watch out for optimizer_adaptive_features as it may have a huge negative impact

Let me describe how I discovered that optimizer_adaptive_features may have a huge negative impact (especially on RAC databases).

I upgraded an 11gR2 database to 12.1.0.2 and then converted this database to a PDB. To do so I launched "@$ORACLE_HOME/rdbms/admin/noncdb_to_pdb.sql" (see MOS Doc ID 1564657.1 for more details).

This SQL script took about 75 minutes to complete on my two-node RAC database. This is quite long, so I decided to try to reduce this duration.

To diagnose this, I dropped the PDB, plugged it back in, re-launched noncdb_to_pdb.sql and enabled a 10046 trace on the session.
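
For reference, enabling the 10046 trace on that session can be done with something like this sketch (level 12 to get waits and binds; the tracefile identifier is just a convenience to find the file):

SQL> alter session set tracefile_identifier = 'noncdb_to_pdb';
SQL> alter session set events '10046 trace name context forever, level 12';
SQL> -- ... run noncdb_to_pdb.sql ...
SQL> alter session set events '10046 trace name context off';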

The top SQL (sort by elapsed time) for this session is the following:

SQL ID: frjd8zfy2jfdq Plan Hash: 510421217

SELECT executions, end_of_fetch_count,              elapsed_time/px_servers
  elapsed_time,        cpu_time/px_servers     cpu_time,
  buffer_gets/executions  buffer_gets
FROM
 (SELECT sum(executions)   as executions,                            sum(case
  when px_servers_executions > 0                              then
  px_servers_executions                                  else executions end)
  as px_servers,                sum(end_of_fetch_count) as end_of_fetch_count,
                sum(elapsed_time) as elapsed_time,
  sum(cpu_time)     as cpu_time,                     sum(buffer_gets)  as
  buffer_gets            FROM   gv$sql
  WHERE executions > 0                                 AND sql_id = :1
                              AND parsing_schema_name = :2)


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse    70502      4.02       4.25          0          0          0           0
Execute  70502    264.90     759.33          0          0          0           0
Fetch    70502    215.77    1848.46          0          0          0       70502
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total   211506    484.70    2612.05          0          0          0       70502

Misses in library cache during parse: 17
Misses in library cache during execute: 17
Optimizer mode: CHOOSE
Parsing user id: SYS   (recursive depth: 2)
Number of plan statistics captured: 70502

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
         1          1          1  VIEW  (cr=0 pr=0 pw=0 time=76 us)
         1          1          1   SORT AGGREGATE (cr=0 pr=0 pw=0 time=76 us)
         0          0          1    PX COORDINATOR  (cr=0 pr=0 pw=0 time=76 us)
         0          0          0     PX SEND QC (RANDOM) :TQ10000 (cr=0 pr=0 pw=0 time=0 us)
         0          0          0      VIEW  GV$SQL (cr=0 pr=0 pw=0 time=0 us)
         0          0          0       FIXED TABLE FIXED INDEX X$KGLCURSOR_CHILD (ind:2) (cr=0 pr=0 pw=0 time=0 us)

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  PX Deq: Join ACK                           281741        0.03        368.85
  PX Deq: reap credit                       1649736        0.00         25.22
  IPC send completion sync                   141000        0.02        135.07
  PX Deq: Parse Reply                        141004        0.04        170.64
  PX Deq: Execute Reply                      141004        0.08       1366.42
  reliable message                            70502        0.00        125.51
  PX Deq: Signal ACK EXT                     140996        0.03         13.99
  PX Deq: Slave Session Stats                140996        0.02         25.54
  enq: PS - contention                        70605        0.11        145.80
  KJC: Wait for msg sends to complete          2584        0.00          0.04
  latch free                                     16        0.00          0.00
  PX qref latch                                  14        0.00          0.00
  latch: shared pool                              4        0.00          0.00
  latch: active service list                      1        0.00          0.00
  row cache lock                                127        0.00          0.08
  library cache lock                             36        0.00          0.04
  library cache pin                              36        0.00          0.04
  gc cr grant 2-way                               1        0.00          0.00
  db file sequential read                         1        0.00          0.00
  oracle thread bootstrap                         1        0.03          0.03

As you can see, this SQL's elapsed time is about 2600 seconds in total (for about 70 000 executions), most of the wait time comes from "PX%" events, and this SQL queries the GV$SQL view.

But wait:

Why the hell is this SQL (which looks like a SQL statement that collects metrics for a particular sql_id) somehow part of the session that launched the noncdb_to_pdb.sql script? Does it make sense?

I really don't think so; this SQL (sql_id "frjd8zfy2jfdq") must have been triggered by something else (parsing or whatever) than noncdb_to_pdb.sql itself.

Let’s prove it with a simple test case (I am alone on the database):

alter system flush shared_pool;

select count(*) from v$sql where sql_id='frjd8zfy2jfdq';

connect / as sysdba

begin
execute immediate 'select object_name from dba_objects';
end;
/

select count(*) from v$sql where sql_id='frjd8zfy2jfdq';

With the following result:

SQL> @test_case

System altered.


  COUNT(*)
----------
         0

Connected.

PL/SQL procedure successfully completed.


  COUNT(*)
----------
         2

Did you see that the sql_id "frjd8zfy2jfdq" has been produced by this simple test case? (If you traced the session, you would see this query in the trace file.)

I reproduced this behavior on:

  • 12.1.0.2 database with CDB/PDB.
  • 12.1.0.2 database (Non CDB).

and was not able to reproduce it on an 11gR2 database.

So, it looks like this SQL is introduced by a 12.1.0.2 (12cR1) new feature. As it is somehow linked to "SQL metrics collection", I tried to disable some 12cR1 features (linked to this area) until I found the "culprit".

Nothing changed until I set the optimizer_adaptive_features parameter to false (true is the default).

Here is the result:

SQL> !cat test_case.sql
alter system flush shared_pool;

select count(*) from v$sql where sql_id='frjd8zfy2jfdq';

connect / as sysdba
alter session set optimizer_adaptive_features=false;

begin
execute immediate 'select object_name from dba_objects';
end;
/

select count(*) from v$sql where sql_id='frjd8zfy2jfdq';

SQL> @test_case

System altered.

  COUNT(*)
----------
         0

Connected.

Session altered.

PL/SQL procedure successfully completed.

  COUNT(*)
----------
         0

BINGO!!! The sql_id “frjd8zfy2jfdq” has not been executed!

Well, now what is the impact on noncdb_to_pdb.sql on my two-node RAC database?

I edited the script and added:

alter session set optimizer_adaptive_features=false;

just above this line:

exec dbms_pdb.noncdb_to_pdb(1);

Guess what?

  1. noncdb_to_pdb.sql took about 20 minutes to execute (compared to about 75 minutes without setting optimizer_adaptive_features to false).
  2. The sql_id “frjd8zfy2jfdq” is not part of the trace file anymore.

Remarks:

  • The impact of this SQL is much more "visible" with a RAC database, as it queries a global V$ view (GV$SQL) and then needs parallel query to run (if more than one instance is up). With only one instance up, the 10046 trace file produced:
SQL ID: frjd8zfy2jfdq Plan Hash: 510421217

SELECT executions, end_of_fetch_count,              elapsed_time/px_servers
  elapsed_time,        cpu_time/px_servers     cpu_time,
  buffer_gets/executions  buffer_gets
FROM
 (SELECT sum(executions)   as executions,                            sum(case
  when px_servers_executions > 0                              then
  px_servers_executions                                  else executions end)
  as px_servers,                sum(end_of_fetch_count) as end_of_fetch_count,
                sum(elapsed_time) as elapsed_time,
  sum(cpu_time)     as cpu_time,                     sum(buffer_gets)  as
  buffer_gets            FROM   gv$sql
  WHERE executions > 0                                 AND sql_id = :1
                              AND parsing_schema_name = :2)


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse    69204      2.58       2.78          0          0          0           0
Execute  69204     23.93      25.45          0          0          0           0
Fetch    69204      2.68       2.64          0          0          0       69204
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total   207612     29.20      30.88          0          0          0       69204

Misses in library cache during parse: 18
Misses in library cache during execute: 18
Optimizer mode: CHOOSE
Parsing user id: SYS   (recursive depth: 2)
Number of plan statistics captured: 62

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
         1          1          1  VIEW  (cr=0 pr=0 pw=0 time=120 us)
         1          1          1   SORT AGGREGATE (cr=0 pr=0 pw=0 time=113 us)
         0          0          0    PX COORDINATOR  (cr=0 pr=0 pw=0 time=100 us)
         0          0          0     PX SEND QC (RANDOM) :TQ10000 (cr=0 pr=0 pw=0 time=50 us)
         0          0          0      VIEW  GV$SQL (cr=0 pr=0 pw=0 time=43 us)
         0          0          0       FIXED TABLE FIXED INDEX X$KGLCURSOR_CHILD (ind:2) (cr=0 pr=0 pw=0 time=39 us)


Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  row cache lock                                  1        0.00          0.00

As you can see: elapsed time of about 30 seconds (for about 70 000 executions) and no “PX%” wait events.

  • I checked which parameters (hidden or not) change when setting optimizer_adaptive_features to false. Then I tried those parameters one by one with my test case and discovered that setting "_optimizer_dsdir_usage_control" to 0 is enough to get rid of the sql_id "frjd8zfy2jfdq".
  • During the run of noncdb_to_pdb.sql, it is the large number of executions (about 70 000) of the sql_id "frjd8zfy2jfdq" that produced this long "overall" duration and highlighted the fact that this query had to be removed somehow.
  • You could meet this SQL on 12cR1 databases (CDB/PDB or non CDB).

Conclusion:

  • optimizer_adaptive_features (set to true) may produce a huge negative impact on performance (especially with a RAC database). I provided an example of such an impact when running the noncdb_to_pdb.sql script.
  • Should you meet the sql_id "frjd8zfy2jfdq" in your database and want to get rid of it (because you observe a negative impact), simply set optimizer_adaptive_features to false (or "_optimizer_dsdir_usage_control" to 0) at the session or system level (see the sketch below).
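
Here is a minimal sketch of the session-level workaround (remember that the underscore parameter is hidden, so get Oracle Support approval before using it):

SQL> -- disable the whole adaptive features umbrella for the session
SQL> alter session set optimizer_adaptive_features = false;
SQL> -- or, more surgically, only the SQL plan directives usage (hidden parameter)
SQL> alter session set "_optimizer_dsdir_usage_control" = 0;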

Issue installing the 11gR2 database software after the 12.1.0.2 clusterware setup

If you don’t plan to install the 11gR2 database software after a 12.1.0.2 clusterware installation, I guess there is no need for you to read this post.

I just want to share the issue I hit and the way you could work around it. The purpose of this post is just to save you time, just in case.

So, after a 12.1.0.2 clusterware installation (on a two-node RAC cluster), I decided to install the 11.2.0.4 database software. I launched the runInstaller and followed the install process until Step 4:

[screenshot: install11g_db]

As you can see, the nodes that are part of the cluster have not been proposed, and clicking Next would produce "CRS is not installed on any of the nodes".

The first question you may ask is: Is the 11gR2 database compatible with CRS 12.1? Yes it is, as you can see here:

[screenshot: crs_supported_version]

As I am curious, I canceled the 11gR2 database software installation and gave the 12.1.0.2 database software a try. Step 4 produced:

[screenshot: install12c_db]

As you can see the nodes that are part of the cluster have been proposed.

I canceled the 12.1.0.2 database software installation and had a look at the inventory. Then I discovered that the CRS="true" flag was not set for the 12.1.0.2 GI home:

$ cat /etc/oraInst.loc | grep inventory_loc
inventory_loc=/ec/poc/server/oracle/olrpoc1/u000/oraInventory

$ cat /ec/poc/server/oracle/olrpoc1/u000/oraInventory/ContentsXML/inventory.xml
..
HOME NAME="OraGI12Home1" LOC="/ec/poc/server/oracle/olrpoc1/u000/product/GRID.12.1.0.2" TYPE="O" IDX="1"
..

Then I added the CRS="true" flag this way:

$GI_HOME/oui/bin/runInstaller -updateNodeList ORACLE_HOME="/ec/poc/server/oracle/olrpoc1/u000/product/GRID.12.1.0.2" CRS=true

So that:

$ cat /ec/poc/server/oracle/olrpoc1/u000/oraInventory/ContentsXML/inventory.xml
..
HOME NAME="OraGI12Home1" LOC="/ec/poc/server/oracle/olrpoc1/u000/product/GRID.12.1.0.2" TYPE="O" IDX="1" CRS="true"
..

Then I relaunched the 11.2.0.4 database software installation, and Step 4 produced:

[screenshot: install11g_db_fixed]

As you can see, the nodes have been proposed, so I was able to complete the installation successfully.

Remarks:

  1. MOS 1053393.1 provides more details about the CRS=”true” flag.
  2. I guess the "issue" will be the same for database software >= 10.1 and < 12.1 (but I did not test it).
  3. On another RAC cluster, I upgraded the GI from 12.1.0.1 to 12.1.0.2 and the CRS="true" flag was set automatically for the 12.1.0.2 GI.

Conclusion: After a 12.1.0.2 CRS installation,

  1. You may need to set the CRS="true" flag to avoid this issue during an 11gR2 database software installation (a quick check is sketched after this list).
  2. You don't need to set the CRS="true" flag for a 12.1.0.2 database software installation, as the issue does not appear.
  3. It looks like you won't hit the issue if the GI has been upgraded to 12.1.0.2 (see the third remark).
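
A quick way to check the flag before launching an older installer could be the following sketch (based on the inventory location shown above; adapt the paths to your environment):

# locate the central inventory and check that the GI home entry carries CRS="true"
INV_LOC=$(grep inventory_loc /etc/oraInst.loc | cut -d= -f2)
grep 'CRS="true"' ${INV_LOC}/ContentsXML/inventory.xml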

Modify the Private Network Information in Oracle Clusterware with HAIP in place

The MOS note "How to Modify Private Network Information in Oracle Clusterware (Doc ID 283684.1)" explains how to change the private network information when the interconnect is made of a single interface (which produces downtime).

So what about changing the interconnect configuration with Highly Available IP (HAIP) in place?

Let’s remember what HAIP is (from Oracle® Database High Availability Best Practices):

[screenshot: haip]

As HAIP provides a redundant interconnect, we should be able to change the interconnect configuration of one private interface without any downtime, right?

First let’s check the current interconnect configuration:

oifcfg getif       
eth4  172.16.0.128  global  cluster_interconnect
eth6  172.16.1.0  global  cluster_interconnect

and the associated Virtual IP:

oifcfg iflist -p -n
eth4  172.16.0.128  PRIVATE  255.255.255.128
eth4  169.254.0.0  UNKNOWN  255.255.128.0
eth6  172.16.1.0  PRIVATE  255.255.255.128
eth6  169.254.128.0  UNKNOWN  255.255.128.0

I removed the public network from the output. As you can see, each private interface is hosting a Virtual IP (169.xxx.x.x).

Now your sysadmin does the change (for example, changing the subnet and the VLAN) on one of the private interfaces (let's say eth4), so that the ohasd.log file reports something like:

2014-01-27 11:16:17.154: [GIPCHGEN][1837012736]gipchaInterfaceFail: marking interface failing 0x7f4c0c1b5f00 { host '', haName 'CLSFRAME_olrdev1', local (nil), ip '172.16.0.129:28029', subnet '172.16.0.128', mask '255.255.255.128', mac 'e8-39-35-12-77-7e', ifname 'eth4', numRef 0, numFail 0, idxBoot 0, flags 0x184d }
2014-01-27 11:16:17.334: [GIPCHGEN][1856595712]gipchaInterfaceDisable: disabling interface 0x7f4c0c1b5f00 { host '', haName 'CLSFRAME_olrdev1', local (nil), ip '172.16.0.129:28029', subnet '172.16.0.128', mask '255.255.255.128', mac 'e8-39-35-12-77-7e', ifname 'eth4', numRef 0, numFail 0, idxBoot 0, flags 0x19cd }
2014-01-27 11:16:17.339: [GIPCHDEM][1856595712]gipchaWorkerCleanInterface: performing cleanup of disabled interface 0x7f4c0c1b5f00 { host '', haName 'CLSFRAME_olrdev1', local (nil), ip '172.16.0.129:28029', subnet '172.16.0.128', mask '255.255.255.128', mac 'e8-39-35-12-77-7e', ifname 'eth4', numRef 0, numFail 0, idxBoot 0, flags 0x19ed }

So now let’s check the virtual IP and the available interfaces and subnet again:

oifcfg iflist -p -n
eth4  172.17.3.0  PRIVATE  255.255.255.128
eth6  172.16.1.0  PRIVATE  255.255.255.128
eth6  169.254.128.0  UNKNOWN  255.255.128.0
eth6  169.254.0.0  UNKNOWN  255.255.128.0
bond0  158.168.4.0  UNKNOWN  255.255.252.0

Well, we can see 2 things:

  • The first one is that eth6 is now "hosting" both Virtual IPs (169.xxx.x.x).
  • The second one is the new "available" subnet for eth4 (172.17.3.0).

So we just have to remove the previous eth4 configuration:

oifcfg delif -global eth4/172.16.0.128

and add the new one this way:

oifcfg setif -global eth4/172.17.3.0:cluster_interconnect

Now, check again the Virtual IPs:

oifcfg iflist -p -n
eth4  172.17.3.0  PRIVATE  255.255.255.128
eth4  169.254.128.0  UNKNOWN  255.255.128.0
eth6  172.16.1.0  PRIVATE  255.255.255.128
eth6  169.254.0.0  UNKNOWN  255.255.128.0

Perfect: now each private interface hosts a Virtual IP again (we are back to a "normal" configuration).
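
If you want to double-check from the database side as well, a quick sketch is to query the interconnects actually used by each instance (with HAIP in place you should see the 169.254.x.x addresses):

SQL> select inst_id, name, ip_address from gv$cluster_interconnects order by inst_id;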

Remarks:

No downtime will occur as long as:

  • You don't change both interfaces' configuration at the same time 😉
  • You don’t remove by mistake the configuration of the interface that hosts both Virtual IPs.
  • The interconnect hosting both VIPs doesn’t fail before you put back the updated interface.

There is nothing new in this post. I just had to do the exercise, so I am sharing it 😉

Conclusion:

Thanks to HAIP, we have been able to change the Interconnect Network configuration of one interface without any downtime.

RAC Attack at Oracle OpenWorld 2013: Don't be afraid!

You’ll attend Oracle OpenWorld 2013 and:

  • You want to install a 12c RAC using Oracle VirtualBox on your laptop.
  • Or you just want to have a talk or share your experience with some people around this technology.

So, you should attend RAC Attack at Oracle OpenWorld 2013:

[image: RAC-attack]

Don't be afraid! We won't configure an extended RAC, so there is no need to break your laptop into two separate pieces to simulate a stretched cluster 🙂

You can find more information on the event here.

Join Rac Attack on the social media:

  • Rac Attack is on Google + here.
  • Rac Attack is on twitter: Use the #RacAttack hashtag.
  • Rac Attack is on facebook here.

Hope to see you there!

TAF is not working on Rac One Node in case of node failure? Well, it depends

Martin Bach did a good study of the Rac One Node capability, and especially of what happens with session failover during a database relocation or during a node failure, in this blog post.

He showed us that:

  1. during a database relocation (initiated manually with srvctl relocate database), TAF works as expected.
  2. during a node failure, TAF does not work and you will be disconnected (for a running select or a new execution).

This is important: I said "during a node failure", meaning no instance is available during the select.

But what about a session that was already connected to a TAF service before the node failure and that:

  • will be inactive during the node failure
  • will launch a select after the node failure (meaning an instance is back)?

For this one, a node failure is transparent. Let's see that in action:

First, let's check that my database is a Rac One Node:

srvctl config database -d BDTO | grep -i type
Type: RACOneNode

Now let's check that the bdto_taf service I'll connect to is a TAF one:

srvctl config service -s BDTO_TAF -d BDTO | egrep -i "taf|failover"
Service name: BDTO_TAF
Failover type: SELECT
Failover method: BASIC
TAF failover retries: 0
TAF failover delay: 0
TAF policy specification: BASIC

Well everything is in place to test.

So connect to my Rac One Node database using the TAF service:

sqlplus bdt/"test"@bdto_taf

SQL*Plus: Release 11.2.0.3.0 Production on Thu May 16 14:39:36 2013

Copyright (c) 1982, 2011, Oracle.  All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options

SQL> alter session set nls_date_format='YYYY/MM/DD HH24:MI:SS';

Session altered.

SQL> select sysdate||' '||host_name from v$instance;

SYSDATE||''||HOST_NAME
--------------------------------------------------------------------------------
2013/05/16 14:39:47 BDTSRV1

Now let’s kill smon:

ps -ef | grep -i smon | grep -i BDTO_1 
oracle    8035     1  0 14:30 ?        00:00:00 ora_smon_BDTO_1

kill -9 `ps -ef | grep -i smon | grep -i BDTO_1 | awk '{print $2}'`

Let's wait until an instance is back on a node (important step).

In this case the instance has been relocated to the other node; let's check it is up:

ps -ef | grep -i smon | grep -i BDTO_1
oracle   29266     1  0 14:54 ?        00:00:00 ora_smon_BDTO_1

and come back to our “idle” sqlplus session to re-launch the sql:

SQL> /

SYSDATE||''||HOST_NAME
--------------------------------------------------------------------------------
16-MAY-13 BDTSRV2

Oh! As you can see, the "idle" sqlplus session has not been disconnected and we have been able to re-launch the query.

As you can see, we lost our nls_date_format setting. This is expected, as explained in Martin's book "Pro Oracle Database 11g RAC on Linux":

Also, note that while TAF can reauthorize a session, it does not restore the session to its previous
state. Therefore, following an instance failure, you will need to restore any session variables, PL/SQL
package variables, and instances of user-defined types. TAF will not restore these values automatically,
so you will need to extend your applications to restore them manually.
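
A quick sketch to confirm from the session itself that TAF kicked in (and that the session state has to be restored manually) is to look at the failover columns of v$session:

SQL> select failover_type, failover_method, failed_over
     from v$session
     where sid = sys_context('userenv', 'sid');

FAILED_OVER shows YES once the session has been failed over; as quoted above, the NLS settings, package variables and so on still have to be re-applied by the application.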

Conclusion:

If you are using a TAF service to connect to a Rac One Node database (of course through the OCI), then a node failure is transparent if the session is “idle” during the time there is no instance available.

When you think about it, it is quite normal, as TAF is more or less nothing more than an OCI / SQL*Net feature: once the TAF service is back, TAF can work as expected.

Remarks:

  • If the "idle" connection has an open transaction, you'll receive the "ORA-25402: transaction must roll back" Oracle error (this is not so transparent anymore 🙂).
  • Do not test TAF connected as the Oracle SYS user, as SYSDBA sessions do not fail over with srvctl-configured TAF; see MOS [ID 1342992.1].
  • If you are using srvctl to test TAF's reaction to a node failure, then you should read MOS note [ID 1324574.1] first (as TAF may not work depending on the version and the srvctl options being used).
  • For a good explanation of Rac One Node, you should have a look at this blog post from Aman Sharma.

Why did I write this post?

Because in Martin's post, I left a comment on "December 13, 2012 at 19:08" trying to explain this behaviour. I came back to this comment yesterday and realized that it was not so clear! I hope this post is clearer 😉

Rac One Node : Create preferred node

In my previous Rac One Node post I gave a way to "stick" a Rac One Node database to a particular node of your cluster so that:

  1. It should re-start automatically on the “stick” node in case of database or node crash.
  2. It does not relocate automatically on other nodes.

In this post I’ll give a way to create a “preferred” node for a Rac One Node database so that:

  1. It should re-start and relocate automatically to the "preferred" node as soon as it is available.
  2. It relocates automatically to other nodes in case the "preferred" node crashes.

Let's go. First, modify the SERVER_NAMES attribute of the associated server pool to keep only the "preferred" node:

Which pool does the database belong to?

crsctl status resource ora.bdto.db -p | grep -i pool
SERVER_POOLS=ora.BDTO

Which servers are part of this pool?

crsctl status serverpool ora.BDTO -p | grep -i server
SERVER_NAMES=bdtnode1 bdtnode2

Modify the server pool to define the “preferred” node for the database (bdtnode2 for example):

crsctl modify serverpool ora.BDTO -attr SERVER_NAMES="bdtnode2"
crsctl status serverpool ora.BDTO -p | grep -i server
SERVER_NAMES=bdtnode2

The second step is to change the PLACEMENT attribute of the database.

Change the PLACEMENT attribute from restricted to favored:

crsctl status resource ora.bdto.db -p | grep -i place
ACTIVE_PLACEMENT=1
PLACEMENT=restricted

crsctl modify resource ora.bdto.db -attr PLACEMENT="favored"
crsctl status resource ora.bdto.db -p | grep -i place
ACTIVE_PLACEMENT=1
PLACEMENT=favored

Now the BDTO RAC One Node database:

  • will be re-started automatically on node bdtnode1 in case the "preferred" node bdtnode2 crashes.
  • will be re-started and relocated automatically to the "preferred" node bdtnode2 as soon as it is available again.

Remarks:

Be careful: the database will be relocated automatically to the "preferred" node as soon as that node is up (even if the database is currently running on the other node).

If you try to manually relocate the database to the "non-preferred" node:

srvctl relocate database -d BDTO -n bdtnode1

It will be done, and node bdtnode1 will again be a member of the server pool (as a consequence, node bdtnode2 is no longer the "preferred" one).

Conclusion: You have the choice,

  • Stick the database to one node so that it does not relocate automatically to surviving nodes (this post).
  • Define a "preferred" node and allow the database to relocate automatically to surviving nodes, and back to the "preferred" node as soon as it is available (the current post).

Rac One Node : Avoid automatic database relocation

Imagine you need to ensure that one of your many RAC One Node databases:

  1. Should re-start automatically on the current node in case of database or node crash.
  2. Does not relocate automatically on other nodes.

So, you need somehow to “stick” this particular RAC One Node database to one node of your cluster.

For a good understanding of what RAC One Node is, you can have a look at Martin's post.

Disabling the database will not help as it will not satisfy the second point mentioned above.

To answer this need, I’ll modify one property of the database’s server pool (database is administrator managed):

Which pool does the database belong to?

crsctl status resource ora.BDTO.db -p | grep -i pool 
SERVER_POOLS=ora.BDTO

Which servers are part of this pool?

crsctl status serverpool ora.BDTO -p | grep -i server
SERVER_NAMES=bdtnode1 bdtnode2

Modify the server pool to “stick” the database to one node (bdtnode2 for example):

crsctl modify serverpool ora.BDTO -attr SERVER_NAMES="bdtnode2"
crsctl status serverpool ora.BDTO -p | grep -i server
SERVER_NAMES=bdtnode2

Now the BDTO RAC One Node database:

  • will not be re-started automatically on node bdtnode1 in case bdtnode2 crashes.
  • will be re-started automatically on node bdtnode2 as soon as it is available again.

Remarks:

  1. If the server pool "hosts" many databases, all of them will be affected by the change.
  2. If you try to manually relocate the database to the "excluded" node:
srvctl relocate database -d BDTO -n bdtnode1

It will be done, and the "excluded" node bdtnode1 will again be a member of the server pool (as a consequence, the database no longer "sticks" to bdtnode2).

If you need another behaviour (create a preferred node), then you have to change the PLACEMENT attribute of the database (see this post).

Conclusion: The purpose of this post is not to explain why you would choose to stick a RAC One Node database to a particular node of your cluster; it just provides a way to do so 🙂