Exadata cell metrics: Join the metrics and their descriptions on the fly

Introduction

Cell metrics are very useful, but their names are not so friendly. A metric name is a concatenation of abbreviations for the type of component, delimited by underscore characters, so you have to understand the naming convention to work out what a metric means.

For example, knowing that:

  • CD stands for “Cell Disks metrics”
  • IO_BY stands for “Number of megabytes”
  • R stands for Read
  • LG stands for Large

You can conclude that the CD_IO_BY_R_LG metric is linked to the “Number of megabytes read in large blocks from a cell disk”.

Fortunately the metrics are explained in the Oracle documentation, and you can also retrieve their descriptions from cellcli:

$ cellcli -e "list metricdefinition attributes name,description where name='CD_IO_BY_R_LG'"

CD_IO_BY_R_LG	 "Number of megabytes read in large blocks from a cell disk"

Lack of description

That said, as an example, let’s query the current metric values for a particular database that way:

$ cellcli -e "list metriccurrent attributes metricObjectName,name,metricValue where name like 'DB.*' and metricObjectName='BDT'"

	 BDT	 DB_FC_IO_BY_SEC   	 0 MB/sec
	 BDT	 DB_FC_IO_RQ       	 47,638 IO requests
	 BDT	 DB_FC_IO_RQ_SEC   	 2.1 IO/sec
	 BDT	 DB_FD_IO_BY_SEC   	 0 MB/sec
	 BDT	 DB_FD_IO_LOAD     	 19,885
	 BDT	 DB_FD_IO_RQ_LG    	 36 IO requests
	 BDT	 DB_FD_IO_RQ_LG_SEC	 0.0 IO/sec
	 BDT	 DB_FD_IO_RQ_SM    	 47,602 IO requests
	 BDT	 DB_FD_IO_RQ_SM_SEC	 2.1 IO/sec
	 BDT	 DB_FL_IO_BY       	 0.000 MB
	 BDT	 DB_FL_IO_BY_SEC   	 0.000 MB/sec
	 BDT	 DB_FL_IO_RQ       	 0 IO requests
	 BDT	 DB_FL_IO_RQ_SEC   	 0.0 IO/sec
	 BDT	 DB_IO_BY_SEC      	 0 MB/sec
	 BDT	 DB_IO_LOAD        	 0.0
	 BDT	 DB_IO_RQ_LG       	 0 IO requests
	 BDT	 DB_IO_RQ_LG_SEC   	 0.0 IO/sec
	 BDT	 DB_IO_RQ_SM       	 19 IO requests
	 BDT	 DB_IO_RQ_SM_SEC   	 0.0 IO/sec
	 BDT	 DB_IO_UTIL_LG     	 0 %
	 BDT	 DB_IO_UTIL_SM     	 0 %
	 BDT	 DB_IO_WT_LG       	 0 ms
	 BDT	 DB_IO_WT_LG_RQ    	 0.0 ms/request
	 BDT	 DB_IO_WT_SM       	 0 ms
	 BDT	 DB_IO_WT_SM_RQ    	 0.0 ms/request

As you can see, the metric description is not there, and there is no way to retrieve it from metriccurrent (or metrichistory) because the description is not an attribute of those objects:

$ cellcli -e "describe metriccurrent"
	name
	alertState
	collectionTime
	metricObjectName
	metricType
	metricValue
	objectType

$ cellcli -e "describe metrichistory"
	name
	alertState
	collectionTime
	metricObjectName
	metricType
	metricValue
	metricValueAvg
	metricValueMax
	metricValueMin
	objectType

But if you send the output of our example to someone who doesn’t know (or doesn’t remember) the naming convention (or if you are not 100% sure of the definition of a particular metric), then they (or you) will have to:

  • go back to the Oracle documentation
  • query the metricdefinition object with cellcli

New script: exadata_metrics_desc.pl

Thanks to the exadata_metrics_desc.pl script, you can add the description of each metric to the cellcli output on the fly.

Let’s launch the same query as the one used in the previous example and pipe it to exadata_metrics_desc.pl that way:

$ cellcli -e "list metriccurrent attributes metricObjectName,name,metricValue where name like 'DB.*' and metricObjectName='BDT'" | ./exadata_metrics_desc.pl

  BDT   DB_FC_IO_BY_SEC (Number of megabytes of I/O per second for this database to flash cache)          0 MB/sec
  BDT   DB_FC_IO_RQ (Number of IO requests issued by a database to flash cache)                           48,123 IO requests
  BDT   DB_FC_IO_RQ_SEC (Number of IO requests issued by a database to flash cache per second)            2.1 IO/sec
  BDT   DB_FD_IO_BY_SEC (Number of megabytes of I/O per second for this database to flash disks)          0 MB/sec
  BDT   DB_FD_IO_LOAD (Average I/O load from this database for flash disks)                               4,419
  BDT   DB_FD_IO_RQ_LG (Number of large IO requests issued by a database to flash disks)                  36 IO requests
  BDT   DB_FD_IO_RQ_LG_SEC (Number of large IO requests issued by a database to flash disks per second)   0.0 IO/sec
  BDT   DB_FD_IO_RQ_SM (Number of small IO requests issued by a database to flash disks)                  48,087 IO requests
  BDT   DB_FD_IO_RQ_SM_SEC (Number of small IO requests issued by a database to flash disks per second)   2.1 IO/sec
  BDT   DB_FL_IO_BY (The number of MB written to the Flash Log)                                           0.000 MB
  BDT   DB_FL_IO_BY_SEC (The number of MB written per second to the Flash Log)                            0.000 MB/sec
  BDT   DB_FL_IO_RQ (The number of I/O requests issued to the Flash Log)                                  0 IO requests
  BDT   DB_FL_IO_RQ_SEC (The number of I/O requests per second issued to the Flash Log)                   0.0 IO/sec
  BDT   DB_IO_BY_SEC (Number of megabytes of I/O per second for this database to hard disks)              0 MB/sec
  BDT   DB_IO_LOAD (Average I/O load from this database for hard disks)                                   0.0
  BDT   DB_IO_RQ_LG (Number of large IO requests issued by a database to hard disks)                      0 IO requests
  BDT   DB_IO_RQ_LG_SEC (Number of large IO requests issued by a database to hard disks per second)       0.0 IO/sec
  BDT   DB_IO_RQ_SM (Number of small IO requests issued by a database to hard disks)                      19 IO requests
  BDT   DB_IO_RQ_SM_SEC (Number of small IO requests issued by a database to hard disks per second)       0.0 IO/sec
  BDT   DB_IO_UTIL_LG (Percentage of disk resources utilized by large requests from this database)        0 %
  BDT   DB_IO_UTIL_SM (Percentage of disk resources utilized by small requests from this database)        0 %
  BDT   DB_IO_WT_LG (IORM wait time for large IO requests issued by a database)                           0 ms
  BDT   DB_IO_WT_LG_RQ (Average IORM wait time per request for large IO requests issued by a database)    0.0 ms/request
  BDT   DB_IO_WT_SM (IORM wait time for small IO requests issued by a database)                           0 ms
  BDT   DB_IO_WT_SM_RQ (Average IORM wait time per request for small IO requests issued by a database)    0.0 ms/request

As you can see, the description of each metric that is part of the initial output has been added.
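
For reference, the core of the idea can be sketched in a few lines of Perl (a simplified sketch, not the actual exadata_metrics_desc.pl): build a name to description lookup table from metricdefinition once, then decorate every metric name found on standard input.

#!/usr/bin/env perl
# Simplified sketch (not the actual exadata_metrics_desc.pl): join the cellcli
# output read on STDIN with the descriptions coming from "list metricdefinition".
use strict;
use warnings;

# Build a name => description lookup table (one single call to cellcli).
my %desc;
for my $line (qx{cellcli -e "list metricdefinition attributes name,description"}) {
    $desc{$1} = $2 if $line =~ /^\s*(\S+)\s+"(.*)"\s*$/;
}

# Append the description right after the metric name on each input line.
while (my $line = <STDIN>) {
    $line =~ s/\b([A-Z]+(?:_[A-Z]+)+)\b/exists $desc{$1} ? "$1 ($desc{$1})" : $1/e;
    print $line;
}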

Remarks

  • You can download the script from this repository or from GitHub.
  • Feel free to build the query you want on the metrics. You just need to add a call to exadata_metrics_desc.pl to see the metric descriptions added on the fly (as long as the metric name appears in the output of your initial query).
  • All the credit for the idea of this script goes to Martin Bach.

Conclusion

The exadata_metrics_desc.pl script can be used to join, on the fly, the metric name and its value (and whatever other attributes you would love to see) with its associated description.

 


Extract Exadata cells historical metrics in CSV format

Exadata provides a lot of useful metrics to monitor the cells, and you may want to retrieve historical values for some of them. To do so, you can use the “LIST METRICHISTORY” command through CellCLI on the cell.
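
For example, something like this retrieves the values collected since a given point in time for the DB_IO metrics (the timestamp is only an illustration, adapt it to your needs):

$ cellcli -e "list metrichistory attributes name,metricObjectName,metricValue,collectionTime where name like 'DB_IO.*' and collectionTime > '2015-07-01T08:00:00-05:00'"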

But as usual, visualising the metrics is even better. For this purpose, you can use a Perl script (see the download link in the remarks section) that extracts the historical metrics in CSV format, so that you can graph them with the visualisation tool of your choice.

Let’s see the help:

Usage: ./csv_exadata_metrics_history.pl [cell=|groupfile=] [serial] [type=] [name=] [objectname=] [name!=] [objectname!=] [ago_unit=] [ago_value=]

 Parameter                 Comment                                                      Default
 ---------                 -------                                                      -------
 CELL=                     comma-separated list of cells
 GROUPFILE=                file containing list of cells
 SERIAL                    serialize execution over the cells (default is no)
 TYPE=                     Metrics type to extract: Cumulative|Rate|Instantaneous       ALL
 NAME=                     Metrics to extract (wildcard allowed)                        ALL
 OBJECTNAME=               Objects to extract (wildcard allowed)                        ALL
 NAME!=                    Exclude metrics (wildcard allowed)                           EMPTY
 OBJECTNAME!=              Exclude objects (wildcard allowed)                           EMPTY
 AGO_UNIT=                 Unit to retrieve historical metrics back: day|hour|minute    HOUR
 AGO_VALUE=                Value associated to Unit to retrieve historical metrics back 1

utility assumes passwordless SSH from this cell node to the other cell nodes
utility assumes ORACLE_HOME has been set (with celladmin user for example)

Example : ./csv_exadata_metrics_history.pl cell=cell
Example : ./csv_exadata_metrics_history.pl groupfile=./cell_group
Example : ./csv_exadata_metrics_history.pl cell=cell objectname='CD_disk03_cell'
Example : ./csv_exadata_metrics_history.pl cell=cell name='.*BY.*' objectname='.*disk.*'
Example : ./csv_exadata_metrics_history.pl cell=enkcel02 name='.*DB_IO.*' objectname!='ASM' name!='.*RQ.*' ago_unit=minute ago_value=4
Example : ./csv_exadata_metrics_history.pl cell=enkcel02 type='Instantaneous' name='.*DB_IO.*' objectname!='ASM' name!='.*RQ.*' ago_unit=hour ago_value=4
Example : ./csv_exadata_metrics_history.pl cell=enkcel01,enkcel02 type='Instantaneous' name='.*DB_IO.*' objectname!='ASM' name!='.*RQ.*' ago_unit=minute ago_value=4 serial

You have to set up passwordless SSH from one cell to the other cells (then you can launch the script from this cell and retrieve data from the other cells).
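
A typical way to do that (a sketch, assuming the celladmin user and a cell_group file that lists the other cells; adapt it to your environment and security policies) is:

# On the cell you will launch the script from, as celladmin:
$ ssh-keygen -t rsa
$ dcli -g ./cell_group -l celladmin -k
$ dcli -g ./cell_group -l celladmin "hostname"

The last dcli call is just a check: it should not prompt for a password anymore.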

The main options/features are:

  1. You can specify the cells on which you want to collect the metrics thanks to the cell or groupfile parameter.
  2. You can choose to serialize the execution over the cells thanks to the serial parameter.
  3. You can choose the type of metrics you want to retrieve (Cumulative, rate or instantaneous) thanks to the type parameter.
  4. You can focus on some metrics thanks to the name parameter (wildcard allowed).
  5. You can exclude some metrics thanks to the name! parameter (wildcard allowed).
  6. You can focus on some metricobjectname thanks to the objectname parameter (wildcard allowed).
  7. You can exclude some metricobjectname thanks to the objectname! parameter (wildcard allowed).
  8. You can choose the unit to retrieve metrics back (day, hour, minute) thanks to the ago_unit parameter.
  9. You can choose the value associated to the unit to retrieve metrics back thanks to the ago_value parameter.

Let’s see an example:

I want to retrieve, in CSV format, the database-related metrics from 2 cells for the last 20 minutes:

$> ./csv_exadata_metrics_history.pl cell=enkx3cel01,enkx3cel02 name='DB_.*' ago_unit=minute ago_value=20
Cell;metricType;DateTime;name;objectname;value;unit
enkx3cel01;Instantaneous;2015-07-01T08:57:59-05:00;DB_FC_BY_ALLOCATED;ACSTBY;0.000;MB
enkx3cel01;Instantaneous;2015-07-01T08:57:59-05:00;DB_FC_BY_ALLOCATED;ASM;15,779;MB
enkx3cel01;Instantaneous;2015-07-01T08:57:59-05:00;DB_FC_BY_ALLOCATED;BDT;0.000;MB
enkx3cel01;Instantaneous;2015-07-01T08:57:59-05:00;DB_FC_BY_ALLOCATED;BIGDATA;0.000;MB
enkx3cel01;Instantaneous;2015-07-01T08:57:59-05:00;DB_FC_BY_ALLOCATED;DBFS;0.000;MB
enkx3cel01;Instantaneous;2015-07-01T08:57:59-05:00;DB_FC_BY_ALLOCATED;DBM;15,779;MB
enkx3cel01;Instantaneous;2015-07-01T08:57:59-05:00;DB_FC_BY_ALLOCATED;DEMO;794,329;MB
enkx3cel01;Instantaneous;2015-07-01T08:57:59-05:00;DB_FC_BY_ALLOCATED;DEMOX3;0.000;MB
enkx3cel01;Instantaneous;2015-07-01T08:57:59-05:00;DB_FC_BY_ALLOCATED;EXDB;0.000;MB
enkx3cel01;Instantaneous;2015-07-01T08:57:59-05:00;DB_FC_BY_ALLOCATED;WZSDB;0.000;MB
enkx3cel01;Instantaneous;2015-07-01T08:57:59-05:00;DB_FC_BY_ALLOCATED;_OTHER_DATABASE_;48,764;MB
enkx3cel01;Instantaneous;2015-07-01T08:57:59-05:00;DB_FC_IO_BY_SEC;ACSTBY;0;MB/sec
enkx3cel01;Instantaneous;2015-07-01T08:57:59-05:00;DB_FC_IO_BY_SEC;ASM;0;MB/sec
enkx3cel01;Instantaneous;2015-07-01T08:57:59-05:00;DB_FC_IO_BY_SEC;BDT;0;MB/sec
enkx3cel01;Instantaneous;2015-07-01T08:57:59-05:00;DB_FC_IO_BY_SEC;BIGDATA;0;MB/sec
enkx3cel01;Instantaneous;2015-07-01T08:57:59-05:00;DB_FC_IO_BY_SEC;DBFS;0;MB/sec
enkx3cel01;Instantaneous;2015-07-01T08:57:59-05:00;DB_FC_IO_BY_SEC;DBM;15;MB/sec
enkx3cel01;Instantaneous;2015-07-01T08:57:59-05:00;DB_FC_IO_BY_SEC;DEMO;0;MB/sec
enkx3cel01;Instantaneous;2015-07-01T08:57:59-05:00;DB_FC_IO_BY_SEC;DEMOX3;0;MB/sec
enkx3cel01;Instantaneous;2015-07-01T08:57:59-05:00;DB_FC_IO_BY_SEC;EXDB;0;MB/sec
enkx3cel01;Instantaneous;2015-07-01T08:57:59-05:00;DB_FC_IO_BY_SEC;WZSDB;0;MB/sec
enkx3cel01;Instantaneous;2015-07-01T08:57:59-05:00;DB_FC_IO_BY_SEC;_OTHER_DATABASE_;0;MB/sec
enkx3cel01;Cumulative;2015-07-01T08:57:59-05:00;DB_FC_IO_RQ;ACSTBY;2,318;IO requests
enkx3cel01;Cumulative;2015-07-01T08:57:59-05:00;DB_FC_IO_RQ;ASM;0;IO requests
enkx3cel01;Cumulative;2015-07-01T08:57:59-05:00;DB_FC_IO_RQ;BDT;2,966;IO requests
enkx3cel01;Cumulative;2015-07-01T08:57:59-05:00;DB_FC_IO_RQ;BIGDATA;25,415;IO requests
enkx3cel01;Cumulative;2015-07-01T08:57:59-05:00;DB_FC_IO_RQ;DBFS;3,489;IO requests
enkx3cel01;Cumulative;2015-07-01T08:57:59-05:00;DB_FC_IO_RQ;DBM;1,627,066;IO requests
enkx3cel01;Cumulative;2015-07-01T08:57:59-05:00;DB_FC_IO_RQ;DEMO;4,506;IO requests
enkx3cel01;Cumulative;2015-07-01T08:57:59-05:00;DB_FC_IO_RQ;DEMOX3;4,172;IO requests
enkx3cel01;Cumulative;2015-07-01T08:57:59-05:00;DB_FC_IO_RQ;EXDB;0;IO requests
enkx3cel01;Cumulative;2015-07-01T08:57:59-05:00;DB_FC_IO_RQ;WZSDB;4,378;IO requests
enkx3cel01;Cumulative;2015-07-01T08:57:59-05:00;DB_FC_IO_RQ;_OTHER_DATABASE_;6,227;IO requests
enkx3cel01;Cumulative;2015-07-01T08:57:59-05:00;DB_FC_IO_RQ_LG;ACSTBY;0;IO requests
.
.
.

That way, you can visualize your data however you feel comfortable. For example, I used Tableau to create this “Database metrics dashboard” on the DB_* rate metrics:

cells_metrics

Remarks:

  • If you retrieve too much data, you could receive something like:
Error: enkx3cel02 is returning over 100000 lines; output is truncated !!!
 Command could be retried with the serialize option: --serial
 Killing child pid 15720 to enkx3cel02...

Then, you can launch the script with the serial option (see the help).

Conclusion:

You probably already have a way to build your own graph of the historical metrics. But if you don’t, feel free to use this script and the visualisation tool of your choice.


12c Application Continuity with Data Guard in action

Introduction

Application Continuity enables replay, in a non-disruptive and rapid manner, of a database request when a recoverable error makes the database session unavailable. Application Continuity uses Transaction Guard. You can find more details on those features here.

With RAC, Application Continuity can be used to replay the request on another instance. It can also be used with Data Guard under some circumstances and conditions (extract from here):

Screen Shot 2015-05-07 at 08.52.52

Well, I just want to see this feature in action during a Data Guard switchover. Let’s give it a try.

Test preparation

The test environment is made of:

  • NBDT database (12.1.0.2 and initially primary).
  • SNBDT database (12.1.0.2 and initially standby).

On the primary database, connected as user bdt, let’s create the following tables:

SQL> create table BDT_CONT as select rownum num,rpad('U',20) val from dual connect by level <= 500000;

Table created.

SQL> create table BDT_SPY (d date, instance_name varchar2(16), num number);

Table created.

And the following trigger:

create or replace trigger BDT_CONT_SPY_T
before update
on BDT_CONT
for each row
begin
insert into BDT_SPY values(sysdate,sys_context('USERENV', 'INSTANCE_NAME'), :old.num);
end;
/

Trigger created.

The test application will update the 500,000 rows of the BDT_CONT table in one transaction. The aim of the trigger is to track on which instance the update has been successfully executed.

Now let’s create a service on the primary and standby databases that way:

# Primary
$ srvctl add service -d NBDT -s appco -preferred NBDT1,NBDT2 -l primary -policy automatic -j SHORT -B SERVICE_TIME -z 30 -w 10 -commit_outcome TRUE -e TRANSACTION -replay_init_time 1800 -retention 86400 -notification TRUE

# Standby
$ srvctl add service -d SNBDT -s appco -preferred SNBDT1,SNBDT2  -l primary -policy automatic -j SHORT -B SERVICE_TIME -z 30 -w 10 -commit_outcome TRUE -e TRANSACTION -replay_init_time 1800 -retention 86400 -notification TRUE

So that the appco service:

  • Is automatically started on the standby if it becomes primary.
  • Is defined for Application Continuity (see more details here on how to configure a service for Application Continuity); a way to double-check the service attributes is sketched after this list.
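
For example, you can double-check the Application Continuity related attributes of the service (commit_outcome, retention, replay initiation time and so on) with something like:

$ srvctl config service -d NBDT -s appco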

Now let’s create a Java application that will be used to test Application Continuity:

import java.sql.*;
import oracle.jdbc.pool.*;
import oracle.jdbc.*;
import oracle.jdbc.replay.*;

public class AppCont {
    public String getInstanceName(Connection conn) throws SQLException {
        PreparedStatement preStatement = conn.prepareStatement("select instance_name from v$instance");
        String r=new String();
        ResultSet result = preStatement.executeQuery();

        while (result.next()){
            r=result.getString("instance_name");
        }
        return r;
    }

    public static void main(String args[]) throws SQLException {
         Connection conn = null;
         Statement stmt = null;
         try {
              // OracleDataSourceImpl instead of OracleDataSource.
              OracleDataSourceImpl ocpds = new OracleDataSourceImpl();
              // Create the database URL
              String dbURL =
              "jdbc:oracle:thin:@(DESCRIPTION_LIST="+
              "(LOAD_BALANCE=off)"+
              "(FAILOVER=on)"+
              "(DESCRIPTION=(CONNECT_TIMEOUT=90) (RETRY_COUNT=10)(RETRY_DELAY=3)"+
              "(ADDRESS_LIST="+
              "(LOAD_BALANCE=on)"+
              "(ADDRESS=(PROTOCOL=TCP) (HOST=rac-cluster-scan)(PORT=1521)))"+
              "(CONNECT_DATA=(SERVICE_NAME=appco)))"+
              "(DESCRIPTION=(CONNECT_TIMEOUT=90) (RETRY_COUNT=10)(RETRY_DELAY=10)"+
              "(ADDRESS_LIST="+
              "(LOAD_BALANCE=on)"+
	      "(ADDRESS=(PROTOCOL=TCP)(HOST=srac-cluster-scan)(PORT=1521)))"+
              "(CONNECT_DATA=(SERVICE_NAME=appco))))";

              ocpds.setURL(dbURL);
              ocpds.setUser("bdt");
              ocpds.setPassword("bdt");

              // Get a connection
              conn = ocpds.getConnection();

              // Instantiation
              AppCont ex = new AppCont();

              // On which Instance are we connected?
              System.out.println("Instance Name = "+ex.getInstanceName(conn));

              System.out.println("Performing transaction");
              /* Don't forget to disable autocommit
	       * And launch the transaction
	       */
              ((oracle.jdbc.replay.ReplayableConnection)conn).beginRequest();
              conn.setAutoCommit(false);
              stmt = conn.createStatement();
              String updsql = "UPDATE BDT_CONT " +
                              "SET val=UPPER(val)";
              stmt.executeUpdate(updsql);
              conn.commit();
              // End
              ((oracle.jdbc.replay.ReplayableConnection)conn).endRequest();

	      stmt.close();

              System.out.println("Instance Name = "+ex.getInstanceName(conn));

              // On which Instance are we connected?
              conn.close();
         }
         catch (Exception e) {
              e.printStackTrace();
        }
    }
}

The important parts are:

  • The use of OracleDataSourceImpl instead of OracleDataSource.
  • The database URL definition that includes the access to the appco service for both the primary and standby databases (NBDT is hosted on “rac-cluster” and SNBDT is hosted on “srac-cluster”).
  • Autocommit is disabled (Read this if you need more details).

We are ready to launch the application and test the Application Continuity in a Data Guard context.

Run it

First let’s check the status of the Data Guard configuration and that it is ready for the switchover:

DGMGRL for Linux: Version 12.1.0.2.0 - 64bit Production

Copyright (c) 2000, 2013, Oracle. All rights reserved.

Welcome to DGMGRL, type "help" for information.
DGMGRL> connect sys/toto@NBDT
Connected as SYSDBA.
DGMGRL> show configuration;

Configuration - nbdt_dr

  Protection Mode: MaxPerformance
  Members:
  nbdt  - Primary database
    snbdt - Physical standby database

Fast-Start Failover: DISABLED

Configuration Status:
SUCCESS   (status updated 24 seconds ago)

DGMGRL> validate database SNBDT;

  Database Role:     Physical standby database
  Primary Database:  nbdt

  Ready for Switchover:  Yes
  Ready for Failover:    Yes (Primary Running)

  Flashback Database Status:
    nbdt:   Off
    snbdt:  Off

Launch the application

$ cat launch.ksh
export CLASSPATH=/u01/app/oracle/product/12.1.0.2/jdbc/lib/ojdbc7.jar:.
time java -cp $CLASSPATH:. AppCont
$ ksh ./launch.ksh
Instance Name = NBDT1
Performing transaction

At this stage, launch the switchover (from another terminal):

DGMGRL> switchover to snbdt;
Performing switchover NOW, please wait...
Operation requires a connection to instance "SNBDT1" on database "snbdt"
Connecting to instance "SNBDT1"...
Connected as SYSDBA.
New primary database "snbdt" is opening...
Oracle Clusterware is restarting database "nbdt" ...
Switchover succeeded, new primary is "snbdt"

And then wait until the application ends:

$ ksh ./launch.ksh
Instance Name = NBDT1
Performing transaction
.
.   <= Switchover occurred here
.
Instance Name = SNBDT1

real	1m46.19s
user	0m1.33s
sys	0m0.13s

As we can see:

  • No errors have been reported.
  • The application terminated on the SNBDT1 instance (that is, on the new primary database).

Let’s check (thanks to the spy table) on which instance the transaction committed:

$ sqlplus bdt/bdt@SNBDT

SQL*Plus: Release 12.1.0.2.0 Production on Thu May 7 10:04:54 2015

Copyright (c) 1982, 2014, Oracle.  All rights reserved.

Last Successful login time: Thu May 07 2015 09:59:05 +02:00

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Advanced Analytics and Real Application Testing options

SQL> select distinct(instance_name) from bdt_spy;

INSTANCE_NAME
----------------
SNBDT1

As we can see, the transaction has been replayed and committed on the SNBDT database (the new primary database).

Remarks

  1. Application Continuity is supported for Oracle Data Guard switchovers to physical standby databases. It is also supported for fast-start failover to physical standbys in maximum availability data protection mode.
  2. Don’t forget that the primary and standby databases must be licensed for Oracle RAC or Oracle Active Data Guard in order to use Application Continuity.
  3. The bdt oracle user that has been used during the test has been granted execute on DBMS_APP_CONT (see the example after this list).
  4. This Java Application has been created and shared by Laurent Leturgez during the Paris Oracle Meetup. User groups are always good, as you can share your work with others. Thanks again Laurent!
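
For reference, such a grant (run as a privileged user) simply looks like:

SQL> grant execute on DBMS_APP_CONT to bdt;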

Conclusion

We have seen the Application Continuity feature in action in a Data Guard context.


Direct Path Read and “enq: KO – fast object checkpoint”

Introduction

When using direct path read, Oracle reads the blocks from disk; the buffer cache is not used.

So, when Oracle begins to read a segment using direct path read, it first flushes the segment's dirty blocks to disk (a dirty block is a block that does not have the same image in memory and on disk). During that time, the session that triggered the direct path read waits on the “enq: KO – fast object checkpoint” wait event.

In this post, I just want to see with my own eyes that the dirty blocks being flushed to disk belong only to the segment being read.

Tests

For the test, I used two tables on an 11.2.0.4 database. One table is made of 2 partitions and the other is not partitioned (so 3 segments in total):

BDT:BDT1> select SEGMENT_NAME,PARTITION_NAME from dba_segments where segment_name in ('BDT1','BDT2');

SEGMENT_NAME PARTITION_NAME
------------ --------------
BDT2
BDT1         P2
BDT1         P1

The rows distribution across segments, blocks and file relative numbers is the following:

BDT:BDT1> l
  1  select u.object_name,u.subobject_name,dbms_rowid.rowid_block_number(BDT1.rowid) BLOCK_NUMBER, dbms_rowid.rowid_relative_fno(BDT1.rowid) FNO
  2	 from BDT1, user_objects u
  3	 where dbms_rowid.rowid_object(BDT1.rowid) = u.object_id
  4	 union all
  5	 select u.object_name,u.subobject_name,dbms_rowid.rowid_block_number(BDT2.rowid) BLOCK_NUMBER, dbms_rowid.rowid_relative_fno(BDT2.rowid) FNO
  6	 from BDT2, user_objects u
  7	 where dbms_rowid.rowid_object(BDT2.rowid) = u.object_id
  8	 order by 1,2
  9*
BDT:BDT1> /

OBJECT_NAME  SUBOBJECT_NAME  BLOCK_NUMBER FNO
------------ --------------- ------------ -----------
BDT1	     P1              2509842	  5
BDT1	     P1              2509842      5
BDT1	     P1              2509842      5
BDT1	     P1              2509842      5
BDT1	     P2              2510866      5
BDT1	     P2              2510866      5
BDT1	     P2              2510866      5
BDT1	     P2              2510866      5
BDT1	     P2              2510866      5
BDT2                         1359091      5
BDT2                         1359091      5
BDT2                         1359091      5
BDT2                         1359091      5
BDT2                         1359091      5
BDT2                         1359091      5
BDT2                         1359091      5
BDT2                         1359091      5
BDT2                         1359091      5

18 rows selected.

so that all the rows of the segment:

  • BDT1:P1 are in datafile 5, block 2509842
  • BDT1:P2 are in datafile 5, block 2510866
  • BDT2 are in datafile 5, block 1359091

Remark:

BDT1:P<N> stands for table BDT1 partition <N>

Now, let’s update both tables (without committing):

BDT:BDT1> update BDT1 set ID2=100;

9 rows updated.

BDT:BDT1> update BDT2 set ID2=100;

9 rows updated.

As I do not commit (nor roll back), those 3 blocks in memory contain an active transaction.

In another session, let’s force serial direct path read and read partition P1 from BDT1 only:

BDT:BDT1> alter session set "_serial_direct_read"=always;

Session altered.

BDT:BDT1> select count(*) from BDT1 partition (P1);

COUNT(*)
-----------
4

Now, check its wait events and related statistics with snapper. A cool thing with snapper is that I can get wait events and statistics from one tool.

SYS:BDT1> @snapper all,end 5 1 20
Sampling SID 20 with interval 5 seconds, taking 1 snapshots...

-- Session Snapper v4.22 - by Tanel Poder ( http://blog.tanelpoder.com/snapper ) - Enjoy the Most Advanced Oracle Troubleshooting Script on the Planet! :)

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 SID @INST, USERNAME  , TYPE, STATISTIC                                                 ,         DELTA, HDELTA/SEC,    %TIME, GRAPH       , NUM_WAITS,  WAITS/SEC,   AVERAGES
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    20  @1, BDT       , STAT, physical reads direct                                     ,             1,        .02,         ,             ,          ,           ,         .5 per execution
    20  @1, BDT       , WAIT, enq: KO - fast object checkpoint                          ,          1117,    18.12us,      .0%, [          ],         1,        .02,     1.12ms average wait

--  End of Stats snap 1, end=2015-04-29 08:25:20, seconds=61.7

--  End of ASH snap 1, end=, seconds=, samples_taken=0, AAS=(No ASH sampling in begin/end snapshot mode)

PL/SQL procedure successfully completed.

For brevity and clarity, I keep only the wait events and statistics of interest.

As we can see, the direct path read has been used and the “fast object checkpoint” occurred (as the session waited for it).

Now let’s check which block(s) has/have been flushed to disk: to do so, I’ll dump those 3 blocks from disk and check if they contain an active transaction (as the ones in memory).

BDT:BDT1> !cat dump_blocks.sql
-- BDT2
alter system dump datafile 5 block 1359091;
-- BDT1:P2
alter system dump datafile 5 block 2510866;
-- BDT1:P1
alter system dump datafile 5 block 2509842;
select value from v$diag_info where name like 'Default%';

BDT:BDT1> @dump_blocks.sql

System altered.

System altered.

System altered.

VALUE
------------------------------
/u01/app/oracle/diag/rdbms/bdt/BDT1/trace/BDT1_ora_91815.trc

BDT:BDT1> !view /u01/app/oracle/diag/rdbms/bdt/BDT1/trace/BDT1_ora_91815.trc

The Interested Transaction List (ITL) for the first block that has been dumped from disk (BDT2 Table) looks like:

 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0xffff.000.00000000  0x00000000.0000.00  C---    0  scn 0x0000.0016f9c7
0x02   0x0000.000.00000000  0x00000000.0000.00  ----    0  fsc 0x0000.00000000
0x03   0x0000.000.00000000  0x00000000.0000.00  ----    0  fsc 0x0000.00000000

We can see that the block on disk does not contain an active transaction (no rows are locked, as Lck=0). We can conclude that the block that belongs to the BDT2 table has not been flushed to disk (the block is still dirty).

The Interested Transaction List (ITL) for the second block that has been dumped from disk (BDT1:P2) looks like:

 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0xffff.000.00000000  0x00000000.0000.00  C---    0  scn 0x0000.0016f99f
0x02   0x0000.000.00000000  0x00000000.0000.00  ----    0  fsc 0x0000.00000000
0x03   0x0000.000.00000000  0x00000000.0000.00  ----    0  fsc 0x0000.00000000

We can see that the block on disk does not contain an active transaction (no rows are locked, as Lck=0). We can conclude that the block that belongs to the partition P2 of the BDT1 table has not been flushed to disk (the block is still dirty).

The Interested Transaction List (ITL) for the third block that has been dumped from disk (BDT1:P1) looks like:

 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0xffff.000.00000000  0x00000000.0000.00  C---    0  scn 0x0000.0016f998
0x02   0x000a.005.00001025  0x00c00211.038e.23  ----    4  fsc 0x0000.00000000
0x03   0x0000.000.00000000  0x00000000.0000.00  ----    0  fsc 0x0000.00000000

We can see that the block on disk does contain an active transaction (4 rows are locked, as Lck=4). This is the block that has been read via direct path read during the select on BDT1:P1.

We can conclude that the block that belongs to BDT1:P1 has been flushed to disk (the block is not dirty anymore).

Conclusion

When oracle begins to read a segment using direct path read:

  • It does not flush to disk the dirty blocks from other segments.
  • Only the dirty blocks of the segment being read are flushed to disk.

ASM Bytes Read and Smart Scan

Introduction

I like to query the v$asm_disk_iostat cumulative view (at the ASM instance level) as this is a centralized location where you can find metrics for all the databases the ASM instance is servicing. One of its metrics is:

BYTES_READ: Total number of bytes read from the disk (see the oracle documentation).

I would like to see which value is recorded in this metric in case of smart scan. To do so, I’ll launch some tests and check the output of the asm_metrics utility: it basically takes a snapshot each second (default interval) of the gv$asm_disk_iostat cumulative view and computes the delta with the previous snapshot.

Environment

ASM and database versions are 11.2.0.4. A segment of about 19.5 GB has been created without any indexes, so that a full table scan is triggered during the tests.

Tests

During the tests:

  • This simple query is launched:
select id from bdt where id=1;
  • Then, the asm metrics will be recorded that way:
./asm_metrics.pl -show=dbinst,fg,inst -dbinst=BDT2 -inst=+ASM2 -display=avg
  • We’ll look at the output field “Kby Read/s” which is based on the BYTES_READ column coming from the v$asm_disk_iostat cumulative view.
  • The SQL elapsed time and the percentage of IO saved by the offload (if any) will be checked with a query that looks like fsx.sql (a simplified version is sketched after this list). By “percentage of IO saved” I mean the ratio of data received from the storage cells to the actual amount of data that would have had to be received on non-Exadata storage.
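
For reference, a simplified sketch of such a query (not the actual fsx.sql) based on the offload related columns of v$sql could look like:

SQL> select sql_id,
            decode(io_cell_offload_eligible_bytes, 0, 'No', 'Yes') offload,
            io_cell_offload_eligible_bytes/1024/1024 eligible_mb,
            io_interconnect_bytes/1024/1024 interco_mb,
            io_cell_offload_returned_bytes/1024/1024 off_retu_mb,
            decode(io_cell_offload_eligible_bytes, 0, 0,
                   100*(io_cell_offload_eligible_bytes-io_interconnect_bytes)
                       /io_cell_offload_eligible_bytes) io_saved_pct
     from   v$sql
     where  sql_id = '&sql_id';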

Test 1: Launch the query without offload

BDT:BDT2> alter session set cell_offload_processing=false;

Session altered.

BDT:BDT2> select /* NO_SMART_SCAN_NO_STORAGEIDX */ id from bdt where id=1;

The elapsed time and the percentage of IO saved are:

SQL_ID	       CHILD   PLAN_HASH  EXECS  AVG_ETIME AVG_PX OFFLOAD ELIGIBLE_MB  INTERCO_MB OFF_RETU_MB IO_SAVED_% SQL_TEXT
------------- ------ ----------- ------ ---------- ------ ------- ----------- ----------- ----------- ---------- ----------------------------------------------------------------------
7y7xa34jjab2q	   0   627556429      1      23.11	0 No		    0 19413.42969	    0	     .00 select /* NO_SMART_SCAN_NO_STORAGEIDX */ id from bdt where id=1

The elapsed time of the SQL is 23.11 seconds and obviously no IO has been saved by offload. As you can see, about 19.5 GB has been exchanged between the Oracle Database and the storage system (INTERCO_MB is based on the IO_INTERCONNECT_BYTES column from v$sql).

The asm_metrics output is the following:

06:39:47                                                                            Kby       Avg       AvgBy/               Kby       Avg        AvgBy/
06:39:47   INST     DBINST        DG            FG           DSK          Reads/s   Read/s    ms/Read   Read      Writes/s   Write/s   ms/Write   Write
06:39:47   ------   -----------   -----------   ----------   ----------   -------   -------   -------   ------    ------     -------   --------   ------
06:39:47   +ASM2                                                          869       885423    3.7       1042916   2          24        0.9        15477
06:39:47   +ASM2    BDT2                                                  869       885423    3.7       1042916   2          24        0.9        15477
06:39:47   +ASM2    BDT2                        ENKCEL01                  300       305865    3.7       1044812   1          8         1.2        15061
06:39:47   +ASM2    BDT2                        ENKCEL02                  330       335901    3.7       1041451   1          8         0.8        15061
06:39:47   +ASM2    BDT2                        ENKCEL03                  239       243657    3.6       1042564   0          8         0.6        16384

So the average Kby Read per second is 885423.

Then we can conclude that the BYTES_READ column records that about 885423 KB/s * 23.11 s ≈ 19.5 GB has been read from disk.

Does it make sense? Yes.

Test 2: Offload and no storage indexes

BDT:BDT2> alter session set cell_offload_processing=true;

Session altered.

BDT:BDT2> alter session set "_kcfis_storageidx_disabled"=true;

Session altered.

BDT:BDT2> select /* WITH_SMART_SCAN_NO_STORAGEIDX */ id from bdt where id=1;

The elapsed time and the percentage of IO saved are:

SQL_ID	       CHILD   PLAN_HASH  EXECS  AVG_ETIME AVG_PX OFFLOAD ELIGIBLE_MB  INTERCO_MB OFF_RETU_MB IO_SAVED_% SQL_TEXT
------------- ------ ----------- ------ ---------- ------ ------- ----------- ----------- ----------- ---------- ----------------------------------------------------------------------
5zzvpyn94b05g	   0   627556429      1       4.11	0 Yes	  19413.42969 2.860450745 2.860450745	   99.99 select /* WITH_SMART_SCAN_NO_STORAGEIDX */ id from bdt where id=1

The elapsed time of the SQL is 4.11 seconds and 99.99% of the IO has been saved by offload. About 2.8 MB have been exchanged between the Oracle Database and the storage system.

The asm_metrics output is the following:

06:41:54                                                                            Kby       Avg       AvgBy/               Kby       Avg        AvgBy/
06:41:54   INST     DBINST        DG            FG           DSK          Reads/s   Read/s    ms/Read   Read      Writes/s   Write/s   ms/Write   Write
06:41:54   ------   -----------   -----------   ----------   ----------   -------   -------   -------   ------    ------     -------   --------   ------
06:41:54   +ASM2                                                          4862      4969898   0.0       1046671   1          12        6.4        16384
06:41:54   +ASM2    BDT2                                                  4862      4969898   0.0       1046671   1          12        6.4        16384
06:41:54   +ASM2    BDT2                        ENKCEL01                  1678      1715380   0.0       1047123   0          4         17.2       16384
06:41:54   +ASM2    BDT2                        ENKCEL02                  1844      1883738   0.0       1046067   0          4         1.0        16384
06:41:54   +ASM2    BDT2                        ENKCEL03                  1341      1370780   0.0       1046935   0          4         1.0        16384

So the average Kby Read per second is 4969898.

Then we can conclude that the BYTES_READ column records that about 4969898 KB/s * 4.11 s ≈ 19.5 GB has been read from disk.

Does it make sense? I would say yes, because the storage indexes haven’t been used. So, during the smart scan, all the data blocks have been opened in the cells in order to extract and send back to the database layer the requested column (column projection) on the selected rows (row filtering).

Test 3: Offload and storage indexes

BDT:BDT2> alter session set "_kcfis_storageidx_disabled"=false;

Session altered.

BDT:BDT2> select /* WITH_SMART_SCAN_AND_STORAGEIDX */ id from bdt where id=1;

The elapsed time and the percentage of IO saved are:

SQL_ID	       CHILD   PLAN_HASH  EXECS  AVG_ETIME AVG_PX OFFLOAD ELIGIBLE_MB  INTERCO_MB OFF_RETU_MB IO_SAVED_% SQL_TEXT
------------- ------ ----------- ------ ---------- ------ ------- ----------- ----------- ----------- ---------- ----------------------------------------------------------------------
3jdpqa2s4bb0v	   0   627556429      1        .09	0 Yes	  19413.42969  .062171936  .062171936	  100.00 select /* WITH_SMART_SCAN_AND_STORAGEIDX */ id from bdt where id=1

The elapsed time of the SQL is 0.09 seconds and about 100% of the IO has been saved by offload. About 0.06 MB have been exchanged between the Oracle Database and the storage system.

The storage indexes saved a lot of reads (almost the whole table):

BDT:BDT2> l
  1* select n.name, s.value from v$statname n, v$mystat s where n.name='cell physical IO bytes saved by storage index' and n.STATISTIC#=s.STATISTIC#
BDT:BDT2> /
NAME							     VALUE
-------------------------------------------------- ---------------
cell physical IO bytes saved by storage index	       20215947264

The asm_metrics output is the following:

06:43:58                                                                            Kby        Avg       AvgBy/               Kby       Avg        AvgBy/
06:43:58   INST     DBINST        DG            FG           DSK          Reads/s   Read/s     ms/Read   Read      Writes/s   Write/s   ms/Write   Write
06:43:58   ------   -----------   -----------   ----------   ----------   -------   -------    -------   ------    ------     -------   --------   ------
06:43:58   +ASM2                                                          19436     19879384   0.0       1047360   0          0         0.0        0
06:43:58   +ASM2    BDT2                                                  19436     19879384   0.0       1047360   0          0         0.0        0
06:43:58   +ASM2    BDT2                        ENKCEL01                  6708      6861488    0.0       1047430   0          0         0.0        0
06:43:58   +ASM2    BDT2                        ENKCEL02                  7369      7534840    0.0       1047045   0          0         0.0        0
06:43:58   +ASM2    BDT2                        ENKCEL03                  5359      5483056    0.0       1047705   0          0         0.0        0

So the average Kby Read per second is 19879384.

As the asm_metrics utility’s granularity to collect the data is one second, and as the elapsed time is < 1s, we can conclude that the BYTES_READ column records that about 19.5 GB has been read from disk.

Does it make sense? I would say no, because during the smart scan, thanks to the storage indexes, not all the data blocks have been opened in the cells in order to extract the requested column (column projection) on the selected rows (row filtering).

Remark

You could also query the v$asm_disk_iostat view and measure the delta for the BYTES_READ column on your own (that is, without the asm_metrics utility), as sketched below. The results would be the same.
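
For example (a sketch, assuming the BDT2 database instance used during the tests), you could run the following on the ASM instance just before and just after the query and subtract the two values:

SQL> select sum(bytes_read)/1024/1024 mb_read
     from   v$asm_disk_iostat
     where  instname = 'BDT2';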

Conclusion

The BYTES_READ column displays:

  • The Total number of bytes read from the disk (and also transferred to the database) without smart scan.
  • The Total number of bytes read from the disk (but not the bytes transferred to the database) with smart scan and no storage indexes being used.
  • Neither the Total number of bytes read from the disk nor the bytes transferred to the database with smart scan and storage indexes being used.

 


In-Memory Instance distribution with RAC databases: Impact of the parallel_instance_group parameter

INTRODUCTION

As you may know, with a RAC database, by default all objects populated into In-Memory will be distributed across all of the IM column stores in the cluster. It is also possible to have all the data appear in the IM column store on every node (Engineered Systems only). See this white paper for more details.
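
For illustration (a sketch using the bdt test table of this post; the actual choice depends on your platform and needs), the corresponding INMEMORY subclauses look like this:

-- Default behaviour on RAC: the data is distributed across the IM column stores
SQL> alter table bdt inmemory distribute auto;

-- Engineered Systems only: a full copy of the data in every IM column store
SQL> alter table bdt inmemory duplicate all;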

In this blog post we have been told that:

  • RAC services and the parallel_instance_group parameter can be used to control where data is populated, and the subsequent access of that data.
  • In a RAC environment, services enable the use of a subset of nodes and still ensure that all In-Memory queries will be able to access all IM column stores.

My customer uses a RAC service defined as preferred/available on a 2-node RAC. In this configuration, I would like to see how the data is populated into the In-Memory column store (IMCS) on both instances when the parallel_instance_group parameter is used.

TESTS

I’ll do several tests and stop/start the 12.1.0.2 database between each test (to avoid confusion).

First let’s create a service BDT_PREF as preferred/available:

srvctl add service -db BDT12c1 -s BDT_PREF -preferred BDT12c1_1 -available BDT12c1_2

so that instance 1 (BDT12c1_1) is the preferred one.

Test 1:

  • parallel_instance_group not set.

Connect to the database through the BDT_PREF service and check the data distribution.

SQL> alter table bdt inmemory priority none;

Table altered.

SQL> select count(*) from bdt;

  COUNT(*)
----------
  38220000

SQL> SELECT inst_id, populate_status, inmemory_size, bytes, bytes_not_populated
FROM gv$im_user_segments
WHERE segment_name = 'BDT';

   INST_ID POPULATE_ INMEMORY_SIZE      BYTES BYTES_NOT_POPULATED
---------- --------- ------------- ---------- -------------------
         1 COMPLETED     308477952 1610612736           732684288
         2 COMPLETED     274530304 1610612736           826236928

As you can see, no instance contains all the data, as the BYTES_NOT_POPULATED column is greater than 0. The data is distributed across all of the IM column stores in the cluster.

Test 2:

  • parallel_instance_group set on Instance 1.
  • The service is running on Instance 1.

Set the parallel_instance_group parameter on the Instance BDT12c1_1 (INST_ID=1) to the service:

SQL> alter system set parallel_instance_group=BDT_PREF scope=both sid='BDT12c1_1';

System altered.

Check that the BDT_PREF service is running on the Instance 1 (where parallel_instance_group has been set) :

SQL> host srvctl status service -d BDT12c1 -s BDT_PREF
Service BDT_PREF is running on instance(s) BDT12c1_1

Connect to the database through the BDT_PREF service and check the data distribution:

SQL> alter table bdt inmemory priority none;

Table altered.

SQL> select count(*) from bdt;

  COUNT(*)
----------
  38220000

SQL> SELECT inst_id, populate_status, inmemory_size, bytes, bytes_not_populated
FROM gv$im_user_segments
WHERE segment_name = 'BDT';

   INST_ID POPULATE_ INMEMORY_SIZE      BYTES BYTES_NOT_POPULATED
---------- --------- ------------- ---------- -------------------
         1 COMPLETED     575602688 1610612736                   0

As you can see, instance 1 contains all the data in its IMCS (as BYTES_NOT_POPULATED=0).

Test 3:

  • parallel_instance_group set on Instance 1.
  • The service is running on Instance 2.

The fact that the service is not running on its preferred Instance could happen after:

  • A failover.
  • The database startup: When you use automatic services in an administrator-managed database, during planned database startup, services may start on the first instances that become available rather than their preferred instances (see the oracle documentation).
  • A manual relocation.

Set the parallel_instance_group parameter on the Instance BDT12c1_1 (INST_ID=1) to the service:

SQL> alter system set parallel_instance_group=BDT_PREF scope=both sid='BDT12c1_1';

System altered.

Relocate the BDT_PREF service on the Instance 2 (if needed):

SQL> host srvctl status service -d BDT12c1 -s BDT_PREF
Service BDT_PREF is running on instance(s) BDT12c1_1

SQL> host srvctl relocate service -d BDT12c1 -s BDT_PREF -newinst BDT12c1_2 -oldinst BDT12c1_1

SQL> host srvctl status service -d BDT12c1 -s BDT_PREF
Service BDT_PREF is running on instance(s) BDT12c1_2

Connect to the database through the BDT_PREF service and check the data distribution:

SQL> alter table bdt inmemory priority none;

Table altered.

SQL> select count(*) from bdt;

  COUNT(*)
----------
  38220000

SQL> SELECT inst_id, populate_status, inmemory_size, bytes, bytes_not_populated
FROM gv$im_user_segments
WHERE segment_name = 'BDT';

   INST_ID POPULATE_ INMEMORY_SIZE      BYTES BYTES_NOT_POPULATED
---------- --------- ------------- ---------- -------------------
         2 COMPLETED     277676032 1610612736           826236928
         1 COMPLETED     312672256 1610612736           732684288

As you can see, no instance contains all the data, as the BYTES_NOT_POPULATED column is greater than 0. The data is distributed across all of the IM column stores in the cluster.

Remarks:

  1. This distribution could lead to performance issues.
  2. If a failover occurred and instance 1 is still down, then the data is fully populated on instance 2.

Test 4:

  • parallel_instance_group set on Instance 1 and on Instance 2.
  • The service is running on Instance 1.

Set the parallel_instance_group parameter on both Instances to the service:

SQL> alter system set parallel_instance_group=BDT_PREF scope=both sid='BDT12c1_1';

System altered.

SQL> alter system set parallel_instance_group=BDT_PREF scope=both sid='BDT12c1_2';

System altered.

Check that the BDT_PREF service is running on the Instance 1:

SQL> host srvctl status service -d BDT12c1 -s BDT_PREF
Service BDT_PREF is running on instance(s) BDT12c1_1

Connect to the database through the BDT_PREF service and check the data distribution:

SQL> alter table bdt inmemory priority none;

Table altered.

SQL> select count(*) from bdt;

  COUNT(*)
----------
  38220000

SQL> SELECT inst_id, populate_status, inmemory_size, bytes, bytes_not_populated
FROM gv$im_user_segments
WHERE segment_name = 'BDT';

   INST_ID POPULATE_ INMEMORY_SIZE      BYTES BYTES_NOT_POPULATED
---------- --------- ------------- ---------- -------------------
         1 COMPLETED     575602688 1610612736                   0

As you can see, instance 1 contains all the data in its IMCS (as BYTES_NOT_POPULATED=0).

Test 5:

  • parallel_instance_group set on Instance 1 and on Instance 2.
  • The service is running on Instance 2.

Set the parallel_instance_group parameter on both Instances to the service:

SQL> alter system set parallel_instance_group=BDT_PREF scope=both sid='BDT12c1_1';

System altered.

SQL> alter system set parallel_instance_group=BDT_PREF scope=both sid='BDT12c1_2';

System altered.

Relocate the BDT_PREF service on the Instance 2 (if needed):

SQL> host srvctl status service -d BDT12c1 -s BDT_PREF
Service BDT_PREF is running on instance(s) BDT12c1_1

SQL> host srvctl relocate service -d BDT12c1 -s BDT_PREF -newinst BDT12c1_2 -oldinst BDT12c1_1

SQL> host srvctl status service -d BDT12c1 -s BDT_PREF
Service BDT_PREF is running on instance(s) BDT12c1_2

Connect to the database through the BDT_PREF service and check the data distribution:

SQL> alter table bdt inmemory priority none;

Table altered.

SQL> select count(*) from bdt;

  COUNT(*)
----------
  38220000

SQL> SELECT inst_id, populate_status, inmemory_size, bytes, bytes_not_populated
FROM gv$im_user_segments
WHERE segment_name = 'BDT';

   INST_ID POPULATE_ INMEMORY_SIZE      BYTES BYTES_NOT_POPULATED
---------- --------- ------------- ---------- -------------------
         2 COMPLETED     575602688 1610612736                   0

As you can see, instance 2 contains all the data in its IMCS (as BYTES_NOT_POPULATED=0).

Out of curiosity, what about a service defined as preferred on both Instances and the parallel_instance_group set on both Instances?

Test 6:

  • parallel_instance_group set on Instance 1 and on Instance 2.
  • The service is running on Instance 1 and 2.

First let’s create a service BDT_PREF as preferred on both Instances:

srvctl add service -db BDT12c1 -s BDT_PREF -preferred BDT12c1_1,BDT12c1_2

Set the parallel_instance_group parameter on both Instances to the service:

SQL> alter system set parallel_instance_group=BDT_PREF scope=both sid='BDT12c1_1';

System altered.

SQL> alter system set parallel_instance_group=BDT_PREF scope=both sid='BDT12c1_2';

System altered.

Check that the BDT_PREF service is running on both Instances:

SQL> host srvctl status service -d BDT12c1 -s BDT_PREF
Service BDT_PREF is running on instance(s) BDT12c1_1,BDT12c1_2

Connect to the database through the BDT_PREF service and check the data distribution:

SQL> alter table bdt inmemory priority none;

Table altered.

SQL> select count(*) from bdt;

  COUNT(*)
----------
  38220000

SQL> SELECT inst_id, populate_status, inmemory_size, bytes, bytes_not_populated
FROM gv$im_user_segments
WHERE segment_name = 'BDT';

   INST_ID POPULATE_ INMEMORY_SIZE      BYTES BYTES_NOT_POPULATED
---------- --------- ------------- ---------- -------------------
         2 COMPLETED     277676032 1610612736           826236928
         1 COMPLETED     312672256 1610612736           732684288

As you can see, no instance contains all the data, as the BYTES_NOT_POPULATED column is greater than 0. The data is distributed across all of the IM column stores in the cluster.

REMARKS

  • I am not discussing the impact of the distribution on the performance (with or without Auto DOP).
  • I just focus on how the data has been distributed.
  • In this previous blog post the “workaround” proposed to get rid of the default behaviour on non Engineered Systems (“Every row of the test table is stored in the IMCS of either one instance or the other”) works for any type of service.
  • In the current post I focus only on a service defined as preferred/available on a 2-node RAC (the last test is out of curiosity).

CONCLUSION

With a 12.1.0.2 database and a service defined as preferred/available on a 2-node RAC:

  • With the parallel_instance_group parameter set to the service on both Instances:

The data is fully populated on the Instance the service is running on.

  • With the parallel_instance_group parameter set to the service on the preferred Instance only:

If the service is running on the preferred instance, the data is fully populated on that instance.

If the service is running on the “available” instance, the data is distributed across all of the IM column stores in the cluster.

With a 12.1.0.2 database, a service defined as preferred on both instances of a 2-node RAC, and the parallel_instance_group parameter set to the service on both instances:

The data is distributed across all of the IM column stores in the cluster.


binding (with processor_group_name) versus caging: Facts, Observations and Customer cases.

INTRODUCTION

This blog post came to my mind following an interesting conversation I had on Twitter with Philippe Fierens and Karl Arao.

Let’s go for it:

Nowadays it is very common to consolidate multiple databases on the same server.

One could want to limit the CPU usage of some databases and/or guarantee CPU resources for others. I would like to compare two methods to achieve this:

  • Instance Caging (available since 11.2).
  • processor_group_name (available since 12.1).

This blog post is composed of 4 mains parts:

  1. Facts: Describing the features and a quick overview on how to implement them.
  2. Observations: What I observed on my lab server. You should observe the same on your environment for most of them.
  3. OEM and AWR views for Instance Caging and processor_group_name with CPU pressure.
  4. Customer cases: I cover all the cases I faced so far.

FACTS

Instance caging: Its purpose is to limit or “cage” the amount of CPU that a database instance can use at any time. It can be enabled online (no need to restart the database) in 2 steps:

  •  Set “cpu_count” parameter to the maximum number of CPUs the database should be able to use (Oracle advises to set cpu_count to at least 2)
SQL> alter system set cpu_count = 2;
  • Set “resource_manager_plan” parameter to enable CPU Resource Manager
SQL> alter system set resource_manager_plan = 'default_plan';

Graphical view with multiple databases limited as an example:

instance_caging_graphical_view

processor_group_name: Bind a database instance to specific CPUs / NUMA nodes. It is enabled in 4 steps on my RedHat 6.5 machine:

  •  Specify the CPUs or NUMA nodes by creating a “processor group”:

Example: Snippet of /etc/cgconfig.conf:

group oracle {
    perm {
      task {
        uid = oracle;
        gid = dc_dba;
      }
      admin {
        uid = oracle;
        gid = dc_dba;
      }
    }
    cpuset {
      cpuset.mems=0;
      cpuset.cpus="1-9";
    }
 }
  •  Start the cgconfig service:
service cgconfig start
  •  Set the Oracle parameter “processor_group_name” to the name of this processor group:
SQL> alter system set processor_group_name='oracle' scope=spfile;
  •  Restart the Database Instance (a quick way to verify the binding is sketched below).
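
As a quick sanity check (a sketch, assuming a Linux system; replace <SID> with your instance name and <pid> with the process id returned by pgrep), you can verify on which CPUs the instance processes are now allowed to run:

$ pgrep -f ora_pmon_<SID>                    # pick one instance process
$ grep Cpus_allowed_list /proc/<pid>/status
$ cat /proc/<pid>/cgroup                     # shows the cgroup the process belongs to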

Graphical view with multiple databases bound as an example:

cpu_binding_graphical_view

This graphical view represents the ideal configuration, where a database is bound to specific CPUs and NUMA-local memory (if you need more details, see observation number 1 later in this post).

Remark regarding the Database Availability:

  • The Instance caging can be enabled online.
  • For cgroups/processor_group_name the database needs to be stopped and started.

OBSERVATIONS

For brevity, “Caging” is used for “Instance Caging” and “Binding” for “cgroups/processor_group_name” usages.

Also “CPU” is used when CPU threads are referred to, which is the actual granularity for caging and binding.

The observations have been made on a 12.1.0.2 database using large pages (use_large_pages=only). I will not cover the “non large pages” case as not using large pages is not an option for me.

1. When you define the cgroup for the binding, you should pay attention to the NUMA memory and CPU locality, as it has an impact on LIO performance (see this blog post). As you can see (on my specific configuration), remote NUMA node access has been about 2 times slower compared to local access. (A quick way to display the NUMA layout is sketched after this list.)

   2. Without binding, the SGA is interleaved across all the NUMA nodes (unless you set the Oracle hidden parameter _enable_NUMA_interleave to FALSE):

numa_mem_alloc_db_no_bind

   3. With binding (on more than one NUMA node), the SGA is not interleaved across all the NUMA nodes (SGA bound to nodes 0 and 1):

numa_mem_alloc_db_bind_nodes_0_1

   4. Instance caging has been less performant (compared to CPU binding) during LIO pressure (by pressure I mean that the database needs more CPU resources than the ones it is limited to use) (see this blog post for more details). Please note that the comparison has been done on the same NUMA node (to avoid a binding advantage over caging due to observations number 1 and 2).

5. With cpu binding already in place (using processor_group_name), we are able to change the number of cpus a database is allowed to use on the fly (see this blog post).

Remarks:

  • You can find more details regarding observations 2 and 3 in this blog post.
  • From 2 and 3 we can conclude that with the caging, the SGA is interleaved across all the NUMA nodes (unless the caging has been set up on top of the binding (mix configuration)).
  • All the above observations are related to performance. You may want to run the tests described in observations 1 and 4 in your own environment to measure the actual impact.

Caging and Binding with CPU pressure: OEM and AWR views

What do the OEM and AWR views show for a database under CPU pressure (pressure meaning the database needs more CPU resources than it is configured to use), with caging on one hand and binding on the other? Let's have a look.

CPU pressure with Caging:

The database has the caging enabled with the cpu_count parameter set to 6 and is running 9 SLOB users in parallel.

The OEM view:

[Figure: new_caging_6_nocg_for_blog]

As you can see: on average about 6 sessions are running on CPU and about 3 are waiting in the “Scheduler” wait class.

The AWR view:

[Figure: awr_caging_6_nocg]

As you can see: approximately 60% of the DB time is spent on CPU and 40% is spent waiting because of the caging (Scheduler wait class).
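
For the record, outside of OEM/AWR, a rough way to observe this throttling live from SQL*Plus is to look at the Scheduler wait class ('resmgr:cpu quantum' being the event posted by Resource Manager when it takes a session off the CPU); a minimal sketch:

SQL> select event, count(*) from v$session where wait_class = 'Scheduler' group by event;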

CPU pressure with Binding:

The processor_group_name is set to a cgroup of 6 CPUs. The Instance is running 9 SLOB users in parallel.

The OEM view:

[Figure: new_cg6_for_blog]

As you can see: on average 6 sessions are running on CPU and 3 are waiting for the CPU (runqueue).

The AWR view:

[Figure: awr_cg_6_top10]

New “Top Events” section as of 12c:

[Figure: awr_cg_6_topevents]
To summarize:

[Figure: awr_12c_top_event_cg6]

Then, we can conclude that:

  • 65.5% of the DB time has been spent on the CPU.
  • 99.62 - 65.5 = 34.12% of the DB time has been spent in the runqueue.

CUSTOMER CASES

The cases below are based on studies for customers.

Case number 1: One database uses too much CPU and affects other instances’ performance.

Then we want to limit its CPU usage:

  • Caging:   We are able to cage the database Instance without any restart.
  • Binding:  We are able to bind the database Instance, but it needs a restart.

Example of such a case by implementing Caging:

[Figure: caging_on_the_fly]

What to choose?

Caging is the best choice in this case (it is the easiest and we don't need to restart the database).

For the following cases we need to define 2 categories of databases:

  1. The paying ones: The customer paid for guaranteed CPU resources.
  2. The non-paying ones: The customer did not pay for guaranteed CPU resources.

Case number 2: Guarantee CPU resource for some databases. The CPU resource is defined by the customer and needs to be guaranteed at the database level.

Caging:

  • All the databases need to be caged (if not, there is no guaranteed resource availability, as a single non-caged database could take all the CPU).
  • As a consequence, you can't mix paying and non-paying customers without caging the non-paying ones.
  • You have to cage each database individually so that sum(cpu_count) <= total number of threads (if not, CPU resources can't be guaranteed). A sanity-check sketch is shown after this list.
  • Be sure that sum(cpu_count) of the non-paying ones <= “CPU Total – number of paid CPUs” (if not, CPU resources can't be guaranteed).
  • There is a maximum number of databases that you can put on a machine: as cpu_count >= 2 (see Facts), the maximum number of databases = total number of threads / 2.
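
As a rough way to sanity-check that constraint on a given server, one could loop over the instances and sum their cpu_count. This is a hypothetical sketch: it assumes every instance is listed in /etc/oratab, is up and running, and accepts "/ as sysdba" connections.

#!/bin/bash
# Hypothetical sketch: sum cpu_count over all the instances listed in /etc/oratab
# and compare it to the number of CPU threads of the server.
TOTAL_THREADS=$(nproc)
SUM=0
while IFS=: read -r SID OHOME _; do
  export ORACLE_SID=$SID ORACLE_HOME=$OHOME PATH=$OHOME/bin:$PATH
  # retrieve cpu_count for this instance (empty if the instance is down)
  CC=$(sqlplus -s / as sysdba <<'EOF' | tr -dc '0-9'
set heading off feedback off pages 0
select value from v$parameter where name='cpu_count';
EOF
)
  SUM=$((SUM + ${CC:-0}))
done < <(grep -Ev '^\s*(#|$)' /etc/oratab)
echo "sum(cpu_count) = $SUM ; CPU threads = $TOTAL_THREADS"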

Graphical view as an example:

[Figure: all_caged_databases_level]

If you need to add a new database (a paying one or a non-paying one) and you are already in a situation where “sum(cpu_count) = total number of threads”, then you have to review the caging of all the non-paying databases (so as not to over-allocate the machine).

Binding:

Each database is linked to a set of CPUs (there is no overlap, meaning a CPU can be part of only one cgroup). Then, create:

  • One cgroup per paying database.
  • One cgroup for all the non-paying databases.

That way we can easily mix paying and non-paying customers on the same machine. A hypothetical cgconfig.conf layout for this setup is sketched below.
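
Such an /etc/cgconfig.conf could look like the following (group names, gid and CPU ranges are made up for illustration): one cgroup per paying database, one shared cgroup for all the non-paying ones, with no CPU overlap.

group paying_db1 {
    perm {
      task {
        uid = oracle;
        gid = dba;
      }
      admin {
        uid = oracle;
        gid = dba;
      }
    }
    cpuset {
      cpuset.mems="0";
      cpuset.cpus="1-8";
    }
}
group nonpaying {
    perm {
      task {
        uid = oracle;
        gid = dba;
      }
      admin {
        uid = oracle;
        gid = dba;
      }
    }
    cpuset {
      cpuset.mems="1";
      cpuset.cpus="11-20,51-60";
    }
}

Each additional paying database would get its own group (with its own dedicated CPU range), while any new non-paying database simply joins the nonpaying group.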

Graphical View as an example:

[Figure: bind_all_databases_level]

If you need to add a new database, then if this is:

  • A non-paying one: Put it into the non-paying group.
  • A paying one: Create a new cgroup for it, assigning CPU taken away from non-paying ones if needed.

Binding + caging mix:

  • Create a cgroup for all the paying databases and then put the caging on each paying database.
  • Put the non-paying databases into another cgroup (no overlap with the paying one) without caging.

Graphical View as an example:

[Figure: bind_caging_mix]

If you need to add a new database, then if this is:

  • A non-paying one: Put it into the non-paying group.
  • A paying one: Extend the paying group if needed (taking CPU away from non-paying ones) and cage it accordingly.

What to choose?

I would say there is no “best choice”: it all depends on the number of databases we are talking about (for a large number of databases I would use the mix approach). And don't forget that with the caging option you can't put more than (number of threads)/2 databases on the machine.

Case number 3: Guarantee CPU resources for exactly one group of databases. The CPU resource needs to be guaranteed at the group level (imagine that the customer paid for 24 CPUs, whatever the number of databases is).

Caging:

We have 2 options:

1. Option 1:

  • Cage all the non-paying databases.
  • Be sure that sum(cpu_count) of the non-paying ones <= “CPU Total – number of paid CPUs” (if not, CPU resources can't be guaranteed).
  • Don’t cage the paying databases.

That way the paying databases are guaranteed at least the resources they paid for.
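
As a made-up numeric example: on a server with 160 CPU threads where the customer paid for 24 CPUs, the non-paying databases must be caged so that sum(cpu_count) <= 160 - 24 = 136. The paying databases, being uncaged, are then guaranteed at least 24 CPUs and can opportunistically use more whenever the non-paying ones are idle.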

Graphical View as an example:

[Figure: caging_1group_option1]

If you need to add a new database, then if this is:

  • A non-paying one: Cage it, but you may need to re-cage all the non-paying ones to guarantee that the paying ones still get their “minimum” resources.
  • A paying one: Nothing to do.

2. Option 2:

  • Cage all the databases (paying and non-paying).
  • You have to cage each database individually so that sum(cpu_count) <= total number of threads (if not, CPU resources can't be guaranteed).
  • Be sure that sum(cpu_count) of the non-paying ones <= “CPU Total – number of paid CPUs” (if not, CPU resources can't be guaranteed).

That way the paying databases are guaranteed exactly the resources they paid for.

Graphical View as an example:

[Figure: caging_1group_option2]

If you need to add a new database (a paying one or a non-paying one) and you are already in a situation where “sum(cpu_count) = total number of threads”, then you have to review the caging of all the non-paying databases (so as not to over-allocate the machine).

Binding:

Again 2 options:

1. Option 1:

  • Put all the non-paying databases into a “CPU Total – number of paid CPUs” cgroup.
  • Allow all the CPUs to be used by the paying databases (don't create a cgroup for the paying databases).
  • That way the paying group is guaranteed at least the resources it paid for.

Graphical View as an example:

[Figure: bind_1group_option1]

If you need to add a new database, then if this is:

  • A non-paying one: Put it in the non-paying cgroup.
  • A paying one: Create it outside the non-paying cgroup.

2. Option 2:

  • Put all the paying databases into a cgroup.
  • Put all the non-paying databases into another cgroup (no overlap with the paying cgroup).

That way the paying databases are guaranteed exactly the resources they paid for.

Graphical View:

[Figure: bind_1group_option2]

If you need to add a new database, then if this is:

  • A non-paying one: Put it in the non-paying cgroup.
  • A paying one: Put it in the paying cgroup.

Binding + caging mix:

I don’t see any benefits here.

What to choose?

I like option 1 in both cases as it allows customers to get more than what they paid for (with the guarantee of what they paid for). Then the choice between caging and binding really depends on the number of databases we are talking about (binding is preferable and more easily manageable for a large number of databases).

Case number 4: Guarantee CPU resources for more than one group of databases (imagine that one group = one customer). The CPU resource needs to be guaranteed at each group level.

Caging:

  • All the databases need to be caged (if not, there is no guaranteed resource availability, as a single non-caged database could take all the CPU).
  • As a consequence, you can't mix paying and non-paying customers without caging the non-paying ones.
  • You have to cage each database individually so that sum(cpu_count) <= total number of threads (if not, CPU resources can't be guaranteed).
  • Be sure that sum(cpu_count) of the non-paying ones <= “CPU Total – number of paid CPUs” (if not, CPU resources can't be guaranteed).
  • There is a maximum number of databases that you can put on a machine: as cpu_count >= 2 (see Facts), the maximum number of databases = total number of threads / 2.

Graphical View as an example:

[Figure: caging_2groups]

If you need to add a new database (a paying one or a non-paying one) and you are already in a situation where “sum(cpu_count) = total number of threads”, then you have to review the caging of all the non-paying databases (so as not to over-allocate the machine).

Overlapping is not possible in that case (meaning you can't guarantee resources with overlapping).

Binding:

  • Create a cgroup for each paying database group and also for the non-paying ones (no overlap between all the groups).
  • Overlapping is not possible in that case (meaning you can't guarantee resources with overlapping).

Graphical view as an example:

[Figure: bind_2groups]

If you need to add a database, then simply put it in the right group.

Binding + caging mix:

I don’t see any benefits here.

What to choose? 

It really depends on the number of databases we are talking about (binding is preferable and more easily manageable for a large number of databases).

REMARKS

  • If you have seen (or faced) more cases, feel free to share them and comment on this blog post.
  • I did not pay attention to the potential impacts on LIO performance linked to each possible choice (caging vs binding, with or without NUMA). I mentioned them in the observations, but did not include them in the use cases (I only discussed the feasibility of the implementation).
  • I really like option 1 in case number 3. This option was proposed by Karl during our conversation.

CONCLUSION

The choice between caging and binding depends on:

  • Your needs.
  • The number of databases you are consolidating on the server.
  • Your performance expectations.

I would like to thank Karl Arao, Frits Hoogland, Yves Colin and David Coelho for their efficient reviews of this blog post.


Joining the Accenture Enkitec Group

On April 1, 2015, I'll start working for the Accenture Enkitec Group. No, this is not an April Fools' joke, and I still can't believe it. Why? Because:

  • A lot of people from whom I learned (and still learn) and for whom I have admiration are working for this group.
  • I had the chance to meet and/or communicate through social media with Andy Colvin, Kerry Osborne, Karl Arao, Frits Hoogland, Carlos Sierra, Mauro Pagano, Tanel Poder, Martin Bach and Veronica Stigers: they are all good, humble people and you can feel the passion when they talk about their jobs.
  • I see this group as an Oracle community: always happy to share their knowledge, to help, and to share free tools (recent examples are edb360 and SQLd360).
  • If you know me and you are reading this, then you should understand why I feel so happy.

I spent the last 3 years working at the European Commission (Data Center). I really enjoyed this job. But as you understood, working for the Accenture Enkitec Group is something that I can’t refuse!

This post is not about self-promotion (I am not a big fan of it) for me or the company; it is just a dump of my mind and feelings.

Looking forward to start this new challenge!


processor_group_name and SGA distribution across NUMA nodes

Introduction

On a NUMA enabled system, a 12c database Instance allocates the SGA evenly across all the NUMA nodes by default (unless you change the “_enable_NUMA_interleave” hidden parameter to FALSE). You can find more details about this behaviour in Yves’s post.

Fine, but what if I am setting the processor_group_name parameter (to instruct the database instance to run itself within a specified operating system processor group) to a cgroup that contains more than one NUMA node?

I may need to link the processor_group_name parameter to more than one NUMA node because:

  • the database needs more memory than one NUMA node can offer.
  • the database needs more cpus than one NUMA node can offer.

In that case, is the SGA still evenly allocated across the NUMA nodes defined in the cgroup?

Time to test

My NUMA configuration is the following:

> lscpu | grep NUMA
NUMA node(s):          8
NUMA node0 CPU(s):     1-10,41-50
NUMA node1 CPU(s):     11-20,51-60
NUMA node2 CPU(s):     21-30,61-70
NUMA node3 CPU(s):     31-40,71-80
NUMA node4 CPU(s):     0,81-89,120-129
NUMA node5 CPU(s):     90-99,130-139
NUMA node6 CPU(s):     100-109,140-149
NUMA node7 CPU(s):     110-119,150-159

with this memory allocation (no Instances up):

 > sh ./numa_memory.sh
                MemTotal         MemFree         MemUsed           Shmem      HugePages_Total       HugePages_Free       HugePages_Surp
Node_0      136052736_kB     49000636_kB     87052100_kB        77728_kB                39300                39044                    0
Node_1      136052736_kB     50503228_kB     85549508_kB        77764_kB                39300                39044                    0
Node_2      136052736_kB     18163112_kB    117889624_kB        78236_kB                39300                39045                    0
Node_3      136052736_kB     47859560_kB     88193176_kB        77744_kB                39300                39045                    0
Node_4      136024568_kB     43286800_kB     92737768_kB        77780_kB                39300                39045                    0
Node_5      136052736_kB     50348004_kB     85704732_kB        77792_kB                39300                39045                    0
Node_6      136052736_kB     31591648_kB    104461088_kB        77976_kB                39299                39044                    0
Node_7      136052736_kB     48524064_kB     87528672_kB        77780_kB                39299                39044                    0

A cgroup (named oracle) has been created with those properties:

   group oracle {
   perm {
     task {
       uid = oracle;
       gid = dc_dba;
     }
     admin {
       uid = oracle;
       gid = dc_dba;
     }
   }
   cpuset {
     cpuset.mems=2,3;
     cpuset.cpus="21-30,61-70,31-40,71-80";
   }
}

The 12.1.0.2 database:

  • has been started using this cgroup thanks to the processor_group_name parameter (so that the database is linked to 2 NUMA nodes: nodes 2 and 3).
  • has use_large_pages set to ONLY.
  • has "_enable_NUMA_interleave" left at its default value: TRUE.
  • has "_enable_NUMA_support" left at its default value: FALSE.
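
For reference, the usual way to check the value of such hidden parameters is a query (as SYS) against the x$ tables, something like this sketch:

select p.ksppinm name, v.ksppstvl value
  from x$ksppi p, x$ksppcv v
 where p.indx = v.indx
   and p.ksppinm in ('_enable_NUMA_interleave', '_enable_NUMA_support');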

First test: The Instance has been started with an SGA that can fully fit into one NUMA node.

Let’s check the memory distribution:

> sh ./numa_memory.sh
                MemTotal         MemFree         MemUsed           Shmem      HugePages_Total       HugePages_Free       HugePages_Surp
Node_0      136052736_kB     48999132_kB     87053604_kB        77720_kB                39300                39044                    0
Node_1      136052736_kB     50504752_kB     85547984_kB        77748_kB                39300                39044                    0
Node_2      136052736_kB     17903752_kB    118148984_kB        78224_kB                39300                23427                    0
Node_3      136052736_kB     47511596_kB     88541140_kB        77736_kB                39300                39045                    0
Node_4      136024568_kB     43282316_kB     92742252_kB        77796_kB                39300                39045                    0
Node_5      136052736_kB     50345492_kB     85707244_kB        77804_kB                39300                39045                    0
Node_6      136052736_kB     31581004_kB    104471732_kB        77988_kB                39299                39044                    0
Node_7      136052736_kB     48516964_kB     87535772_kB        77784_kB                39299                39044                    0

As you can see, 39300 - 23427 = 15873 large pages have been allocated from node 2 only. This is the amount of large pages needed for the SGA. So we can conclude that the memory is not evenly distributed across NUMA nodes 2 and 3.

Second test: The Instance has been started with an SGA that cannot fully fit into one NUMA node.

Let’s check the memory distribution:

> sh ./numa_memory.sh
                MemTotal         MemFree         MemUsed           Shmem      HugePages_Total       HugePages_Free       HugePages_Surp
Node_0      136052736_kB     51332440_kB     84720296_kB        77832_kB                39300                39044                    0
Node_1      136052736_kB     52532564_kB     83520172_kB        77788_kB                39300                39044                    0
Node_2      136052736_kB     51669192_kB     84383544_kB        77892_kB                39300                    0                    0
Node_3      136052736_kB     52089448_kB     83963288_kB        77860_kB                39300                25864                    0
Node_4      136024568_kB     51992248_kB     84032320_kB        77876_kB                39300                39045                    0
Node_5      136052736_kB     52571468_kB     83481268_kB        77856_kB                39300                39045                    0
Node_6      136052736_kB     52131912_kB     83920824_kB        77844_kB                39299                39044                    0
Node_7      136052736_kB     52268200_kB     83784536_kB        77832_kB                39299                39044                    0

As you can see, the large pages have been allocated from nodes 2 and 3, but not evenly (there are no more free large pages on node 2, while there are still about 25,000 free on node 3).

Remarks

  • I observed the same with or without ASMM.
  • With no large pages (use_large_pages=FALSE) and no ASMM, I can observe this memory distribution:
> sh ./numa_memory.sh
                MemTotal         MemFree         MemUsed           Shmem      HugePages_Total       HugePages_Free       HugePages_Surp
Node_0      136052736_kB     48994024_kB     87058712_kB        77712_kB                39300                39044                    0
Node_1      136052736_kB     50497216_kB     85555520_kB        77752_kB                39300                39044                    0
Node_2      136052736_kB      9225964_kB    126826772_kB      8727192_kB                39300                39045                    0
Node_3      136052736_kB     38986380_kB     97066356_kB      8710796_kB                39300                39045                    0
Node_4      136024568_kB     43279124_kB     92745444_kB        77792_kB                39300                39045                    0
Node_5      136052736_kB     50341284_kB     85711452_kB        77796_kB                39300                39045                    0
Node_6      136052736_kB     31570200_kB    104482536_kB        78000_kB                39299                39044                    0
Node_7      136052736_kB     48505716_kB     87547020_kB        77776_kB                39299                39044                    0

As you can see, the SGA has been evenly distributed across NUMA nodes 2 and 3 (see the Shmem column). I did not do more tests (meaning with ASMM, or with AMM, or…) without large pages, as not using large pages is not an option for me.

  • With large pages and no processor_group_name set, the memory allocation looks like:
> sh ./numa_memory.sh
                MemTotal         MemFree         MemUsed           Shmem      HugePages_Total       HugePages_Free       HugePages_Surp
Node_0      136052736_kB     51234984_kB     84817752_kB        77824_kB                39300                37094                    0
Node_1      136052736_kB     52417252_kB     83635484_kB        77772_kB                39300                37095                    0
Node_2      136052736_kB     51178872_kB     84873864_kB        77904_kB                39300                37096                    0
Node_3      136052736_kB     52263300_kB     83789436_kB        77872_kB                39300                37096                    0
Node_4      136024568_kB     51870848_kB     84153720_kB        77864_kB                39300                37097                    0
Node_5      136052736_kB     52304932_kB     83747804_kB        77852_kB                39300                37097                    0
Node_6      136052736_kB     51837040_kB     84215696_kB        77840_kB                39299                37096                    0
Node_7      136052736_kB     52213888_kB     83838848_kB        77860_kB                39299                37096                    0

As you can see, the large pages have been evenly allocated from all the NUMA nodes (this is the behaviour described in the introduction).

  • If you like it, you can get the simple numa_memory.sh script from here (a minimal sketch of what it does is shown below).
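
For the curious, here is a minimal sketch of what such a script could look like (an assumption on my part: the real numa_memory.sh may differ). It simply reformats the per-node values exposed in /sys/devices/system/node/node*/meminfo:

#!/bin/bash
# Sketch of a numa_memory.sh-like script: print per-NUMA-node memory figures.
printf "%-8s %15s %15s %15s %12s %18s %16s %16s\n" \
  "" MemTotal MemFree MemUsed Shmem HugePages_Total HugePages_Free HugePages_Surp
for node in /sys/devices/system/node/node[0-9]*; do
  n=${node##*node}
  m=$node/meminfo
  total=$(awk '$3=="MemTotal:"        {print $4}' "$m")
  free=$(awk  '$3=="MemFree:"         {print $4}' "$m")
  shmem=$(awk '$3=="Shmem:"           {print $4}' "$m")
  hpt=$(awk   '$3=="HugePages_Total:" {print $4}' "$m")
  hpf=$(awk   '$3=="HugePages_Free:"  {print $4}' "$m")
  hps=$(awk   '$3=="HugePages_Surp:"  {print $4}' "$m")
  printf "%-8s %12s_kB %12s_kB %12s_kB %9s_kB %18s %16s %16s\n" \
    "Node_$n" "$total" "$free" "$((total - free))" "$shmem" "$hpt" "$hpf" "$hps"
done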

Summary

  1. With large pages in place and processor_group_name linked to more than one NUMA node, the SGA is not evenly distributed across the NUMA nodes defined in the cgroup.
  2. With large pages in place and processor_group_name not set, the SGA is evenly distributed across all the NUMA nodes.

Modify on the fly the cgroup properties linked to the processor_group_name parameter

Introduction

I am preparing a blog post in which I’ll try to describe pros and cons of cpu binding (using the processor_group_name parameter) versus Instance caging.

Today’s comparison:

As you know, you can change the cpu_count and resource_manager_plan parameters while the database Instance is up and running. This means that, without stopping the Instance, you can:

  • enable / disable Instance caging.
  • change the cpu_count value and as a consequence the cpu limitation (with the Instance caging already in place).
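
For example (illustrative values):

SQL> -- enable the caging online:
SQL> alter system set resource_manager_plan = 'default_plan';
SQL> alter system set cpu_count = 8;
SQL> -- later on, shrink or extend the cage, still online:
SQL> alter system set cpu_count = 4;
SQL> -- or disable the caging altogether:
SQL> alter system set resource_manager_plan = '';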

Fine, can we do the same with cpu binding? What is the effect of changing the cgroup attributes that are linked to the processor_group_name parameter while the database Instance is up and running?

Let’s test it.

Initial setup

The database version is 12.1.0.2 and the _enable_NUMA_support parameter has been kept at its default value: FALSE.

With this NUMA setup on a single machine:

NUMA node0 CPU(s):     1-10,41-50
NUMA node1 CPU(s):     11-20,51-60
NUMA node2 CPU(s):     21-30,61-70
NUMA node3 CPU(s):     31-40,71-80
NUMA node4 CPU(s):     0,81-89,120-129
NUMA node5 CPU(s):     90-99,130-139
NUMA node6 CPU(s):     100-109,140-149
NUMA node7 CPU(s):     110-119,150-159

A cgroup (named oracle) has been created with those properties:

group oracle {
   perm {
     task {
       uid = oracle;
       gid = dc_dba;
     }
     admin {
       uid = oracle;
       gid = dc_dba;
     }
   }
   cpuset {
     cpuset.mems="0";
     cpuset.cpus="1-6";
   }
}

This means that I want the SGA to be allocated on NUMA node 0 and the instance to use cpus 1 to 6.

The database has been started using this cgroup thanks to the processor_group_name parameter (the database has to be restarted for this parameter to take effect):

SQL> alter system set processor_group_name='oracle' scope=spfile sid='*';

System altered.

During the database startup you can see the following into the alert.log file:

Instance started in processor group oracle (NUMA Nodes: 0 CPUs: 1-6)

You can also check that the SGA has been allocated from NUMA node 0:

> grep -i HugePages /sys/devices/system/node/node*/meminfo
/sys/devices/system/node/node0/meminfo:Node 0 HugePages_Total: 39303
/sys/devices/system/node/node0/meminfo:Node 0 HugePages_Free:  23589
/sys/devices/system/node/node0/meminfo:Node 0 HugePages_Surp:      0
/sys/devices/system/node/node1/meminfo:Node 1 HugePages_Total: 39302
/sys/devices/system/node/node1/meminfo:Node 1 HugePages_Free:  39046
/sys/devices/system/node/node1/meminfo:Node 1 HugePages_Surp:      0
/sys/devices/system/node/node2/meminfo:Node 2 HugePages_Total: 39302
/sys/devices/system/node/node2/meminfo:Node 2 HugePages_Free:  39047
/sys/devices/system/node/node2/meminfo:Node 2 HugePages_Surp:      0
/sys/devices/system/node/node3/meminfo:Node 3 HugePages_Total: 39302
/sys/devices/system/node/node3/meminfo:Node 3 HugePages_Free:  39047
/sys/devices/system/node/node3/meminfo:Node 3 HugePages_Surp:      0
/sys/devices/system/node/node4/meminfo:Node 4 HugePages_Total: 39302
/sys/devices/system/node/node4/meminfo:Node 4 HugePages_Free:  39047
/sys/devices/system/node/node4/meminfo:Node 4 HugePages_Surp:      0
/sys/devices/system/node/node5/meminfo:Node 5 HugePages_Total: 39302
/sys/devices/system/node/node5/meminfo:Node 5 HugePages_Free:  39047
/sys/devices/system/node/node5/meminfo:Node 5 HugePages_Surp:      0
/sys/devices/system/node/node6/meminfo:Node 6 HugePages_Total: 39302
/sys/devices/system/node/node6/meminfo:Node 6 HugePages_Free:  39047
/sys/devices/system/node/node6/meminfo:Node 6 HugePages_Surp:      0
/sys/devices/system/node/node7/meminfo:Node 7 HugePages_Total: 39302
/sys/devices/system/node/node7/meminfo:Node 7 HugePages_Free:  39047
/sys/devices/system/node/node7/meminfo:Node 7 HugePages_Surp:      0

As you can see, 39303 - 23589 = 15714 large pages have been allocated from node 0. This is the amount of large pages needed for the SGA.

We can check that the cpu_count parameter has been automatically set to 6:

SQL> select value from v$parameter where name='cpu_count';

VALUE
--------------------------------------------------------------------------------
6

We can also check the cpu affinity of a background process (smon for example):

> taskset -c -p `ps -ef | grep -i smon | grep BDT12CG | awk '{print $2}'`
pid 1019968's current affinity list: 1-6

Note that the same affinity would have been observed with a foreground process.

Now let’s change the cpuset.cpus attribute while the database instance is up and running:

I want the cpus 11 to 20 to be used (instead of 1 to 6):

> /bin/echo 11-20 > /cgroup/cpuset/oracle/cpuset.cpus

After a few seconds, this line appears in the alert.log file:

Detected change in CPU count to 10

Great, it looks like the database is aware of the new cpuset.cpus value. Let's check the smon process cpu affinity:

> taskset -c -p `ps -ef | grep -i smon | grep BDT12CG | awk '{print $2}'`
pid 1019968's current affinity list: 11-20

BINGO: The database is now using the new value (meaning the new cpus) and we don't need to bounce it!

Remarks:

  • Depending on the new cpuset.cpus value, the SGA allocation may not be “optimal” anymore (if the new cpus are not on the NUMA node where the SGA is currently allocated).
  • Don’t forget to update the /etc/cgconfig.conf file accordingly (If not, the restart of the cgconfig service will overwrite your changes).
  • I did the same change during a full cached SLOB run of 9 readers, and the mpstat -P ALL 3 output moved from:
02:21:54 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
02:21:57 PM  all    3.79    0.01    0.21    0.01    0.00    0.00    0.00    0.00   95.97
02:21:57 PM    0    0.34    0.00    0.00    0.00    0.00    0.00    0.00    0.00   99.66
02:21:57 PM    1   98.66    0.00    1.34    0.00    0.00    0.00    0.00    0.00    0.00
02:21:57 PM    2   98.67    0.00    1.33    0.00    0.00    0.00    0.00    0.00    0.00
02:21:57 PM    3   92.31    0.00    7.69    0.00    0.00    0.00    0.00    0.00    0.00
02:21:57 PM    4   97.00    0.00    3.00    0.00    0.00    0.00    0.00    0.00    0.00
02:21:57 PM    5   98.34    0.00    1.66    0.00    0.00    0.00    0.00    0.00    0.00
02:21:57 PM    6   99.33    0.00    0.67    0.00    0.00    0.00    0.00    0.00    0.00
.
.
.
02:21:57 PM   11    1.34    1.68    0.34    0.00    0.00    0.00    0.00    0.00   96.64
02:21:57 PM   12    0.67    0.00    0.00    0.34    0.00    0.00    0.00    0.00   98.99
02:21:57 PM   13    0.33    0.00    0.00    0.00    0.00    0.00    0.00    0.00   99.67
02:21:57 PM   14    0.33    0.00    0.33    0.00    0.00    0.00    0.00    0.00   99.33
02:21:57 PM   15    0.34    0.00    0.34    0.00    0.00    0.00    0.00    0.00   99.33
02:21:57 PM   16    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
02:21:57 PM   17    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
02:21:57 PM   18    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
02:21:57 PM   19    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
02:21:57 PM   20    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

to:

02:22:00 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
02:22:03 PM  all    5.71    0.00    0.18    0.02    0.00    0.00    0.00    0.00   94.09
02:22:03 PM    0    0.33    0.00    0.00    0.00    0.00    0.33    0.00    0.00   99.33
02:22:03 PM    1    0.34    0.00    0.00    0.00    0.00    0.00    0.00    0.00   99.66
02:22:03 PM    2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
02:22:03 PM    3    2.70    0.00    6.08    0.00    0.00    0.00    0.00    0.00   91.22
02:22:03 PM    4    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
02:22:03 PM    5    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
02:22:03 PM    6    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
.
.
.
02:22:03 PM   11    2.05    0.00    1.71    0.34    0.00    0.00    0.00    0.00   95.89
02:22:03 PM   12   99.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00
02:22:03 PM   13   99.33    0.00    0.67    0.00    0.00    0.00    0.00    0.00    0.00
02:22:03 PM   14   99.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00
02:22:03 PM   15   97.32    0.00    2.68    0.00    0.00    0.00    0.00    0.00    0.00
02:22:03 PM   16   98.67    0.00    1.33    0.00    0.00    0.00    0.00    0.00    0.00
02:22:03 PM   17   99.33    0.00    0.67    0.00    0.00    0.00    0.00    0.00    0.00
02:22:03 PM   18   99.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00
02:22:03 PM   19   99.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00
02:22:03 PM   20   99.67    0.00    0.33    0.00    0.00    0.00    0.00    0.00    0.00

As you can see, the process activity moved from cpus 1-6 to cpus 11-20: this is exactly what we want.
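
As mentioned in the remarks above, the same change has to be mirrored in /etc/cgconfig.conf so that a restart of the cgconfig service does not undo it. The cpuset section of the oracle group would simply become (note that cpuset.mems is left on node 0 here while cpus 11-20 belong to node 1, which is exactly the non-optimal NUMA situation described in the first remark):

   cpuset {
     cpuset.mems="0";
     cpuset.cpus="11-20";
   }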

Now, out of curiosity: what about the memory? Let's change the cpuset.mems property while the database is up and running.

Let’s say that we want the SGA to migrate from NUMA node 0 to node 1.

To achieve this we also need to set the cpuset memory_migrate attribute to 1 (the default is 0, meaning that memory allocation does not follow cpuset.mems updates).

So, let’s change the memory_migrate attribute to 1 and cpuset.mems from node 0 to node 1:

/bin/echo 1 > /cgroup/cpuset/oracle/cpuset.memory_migrate
/bin/echo 1 > /cgroup/cpuset/oracle/cpuset.mems

What is the effect on the SGA allocation?

> grep -i HugePages /sys/devices/system/node/node*/meminfo
/sys/devices/system/node/node0/meminfo:Node 0 HugePages_Total: 39303
/sys/devices/system/node/node0/meminfo:Node 0 HugePages_Free:  23589
/sys/devices/system/node/node0/meminfo:Node 0 HugePages_Surp:      0
/sys/devices/system/node/node1/meminfo:Node 1 HugePages_Total: 39302
/sys/devices/system/node/node1/meminfo:Node 1 HugePages_Free:  39046
/sys/devices/system/node/node1/meminfo:Node 1 HugePages_Surp:      0
/sys/devices/system/node/node2/meminfo:Node 2 HugePages_Total: 39302
/sys/devices/system/node/node2/meminfo:Node 2 HugePages_Free:  39047
/sys/devices/system/node/node2/meminfo:Node 2 HugePages_Surp:      0
/sys/devices/system/node/node3/meminfo:Node 3 HugePages_Total: 39302
/sys/devices/system/node/node3/meminfo:Node 3 HugePages_Free:  39047
/sys/devices/system/node/node3/meminfo:Node 3 HugePages_Surp:      0
/sys/devices/system/node/node4/meminfo:Node 4 HugePages_Total: 39302
/sys/devices/system/node/node4/meminfo:Node 4 HugePages_Free:  39047
/sys/devices/system/node/node4/meminfo:Node 4 HugePages_Surp:      0
/sys/devices/system/node/node5/meminfo:Node 5 HugePages_Total: 39302
/sys/devices/system/node/node5/meminfo:Node 5 HugePages_Free:  39047
/sys/devices/system/node/node5/meminfo:Node 5 HugePages_Surp:      0
/sys/devices/system/node/node6/meminfo:Node 6 HugePages_Total: 39302
/sys/devices/system/node/node6/meminfo:Node 6 HugePages_Free:  39047
/sys/devices/system/node/node6/meminfo:Node 6 HugePages_Surp:      0
/sys/devices/system/node/node7/meminfo:Node 7 HugePages_Total: 39302
/sys/devices/system/node/node7/meminfo:Node 7 HugePages_Free:  39047
/sys/devices/system/node/node7/meminfo:Node 7 HugePages_Surp:      0

None: as you can see, the SGA is still allocated on node 0 (I observed the same during the full cached SLOB run).

Then, let's try to stop the database, add the memory_migrate attribute to the /etc/cgconfig.conf file this way:

   cpuset {
     cpuset.mems="0";
     cpuset.cpus="1-6";
     cpuset.memory_migrate=1;
   }

restart the cgconfig service:

> service cgconfig restart

and check the value:

> cat /cgroup/cpuset/oracle/cpuset.memory_migrate
1

Well it is set to 1, let’s start the database and check again:

> cat /cgroup/cpuset/oracle/cpuset.memory_migrate
0

Hey! It has been set back to 0 after the database startup. So, let's conclude that, by default, the memory cannot be dynamically migrated from one NUMA node to another by changing the cpuset.mems value on the fly.

Remark:

Obviously, changing the cpuset.mems attribute from node 0 to node 1:

> /bin/echo 1 > /cgroup/cpuset/oracle/cpuset.mems

and bouncing the Instance will allocate the memory on the new node. Again, don't forget to update the /etc/cgconfig.conf file accordingly (if not, the restart of the cgconfig service will overwrite your changes).

Conclusion:

By changing the cpuset.cpus value:

  • We have been able to change the cpus affinity dynamically (without database Instance restart).

By changing the cpuset.mems value:

  • We have not been able to change the memory node dynamically (an Instance restart is needed).

The main goal is reached: with cpu binding already in place (using processor_group_name), we are able to change, on the fly, the number of cpus a database is limited to use.
