Exadata: Direct path read and smart scan

To write this post I took my inspiration mainly from 2 sources:

  1. The Expert Oracle Exadata Book
  2. Frits Hoogland’s blog post

The book (as well as the Oracle University course) states that the Smart Scan prerequisites are:

  • There must be a full scan of an object.
  • The scan must use Oracle’s Direct Path Read mechanism.
  • The object must be stored on Exadata storage.

and the explanation is:

There is a simple explanation as to why these requirements exist. Oracle is a C program. The function that performs Smart Scans (kcfis_read) is called by the direct path read function (kcbldrget), which is called by one of the full scan functions. It’s that simple. You can’t get to the kcfis_read function without traversing the code path from full scan to direct read. And of course, the storage will have to be running Oracle’s software in order to process Smart Scans.

Here we are: as Oracle is a C program, I can try to figure out on my own what’s going on during a Smart Scan, thanks to the gdb debugger (of course I have no doubt about the accuracy of the explanation mentioned above; the exercise is just for fun 🙂 )

To do this, I will use:

  •  One session with the 10046 trace enabled at level 8, so that wait events are written to the trace file. This session is running a query that will produce the “cell smart table scan” wait event (see the sketch below).
  •  One gdb session attached to the foreground process mentioned above (gdb -p <pid>).
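For reference, a classic way to enable the 10046 trace with wait events in the session under test is the following (DBMS_MONITOR.SESSION_TRACE_ENABLE, shown in the update at the end of this post, is an alternative):

alter session set events '10046 trace name context forever, level 8';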

First, let’s create a breakpoint on the write() function to trap the stack when the session writes the “cell smart table scan” wait event into the 10046 trace file.

(gdb) break write

Let’s run the query. The relevant part of the backtrace looks like this when the “cell smart table scan” wait event has been written for the first time into the trace file:

#13 0x0ddb94a1 in kcfis_reap_ioctl ()
#14 0x0ddb8204 in kcfis_push ()
#15 0x0ddcc01a in kcfis_read ()
#16 0x0c800afc in kcbl_predpush_get ()
#17 0x0898c40c in kcbldrget ()
#18 0x0fadc5d2 in kcbgtcr ()

So, it looks like we are heading in the right direction, as we can see the kcbldrget and kcfis_read functions in the stack.

We now just have to check which function is responsible for the “cell smart table scan” wait event.

To do so, let’s put breakpoints on the functions we discovered above, that is to say:

(gdb) break kcbldrget
Breakpoint 1 at 0x898b103
(gdb) break kcbl_predpush_get
Breakpoint 2 at 0xc800a78
(gdb) break kcfis_read
Breakpoint 3 at 0xddcb371
(gdb) break kcfis_push
Breakpoint 4 at 0xddb78c0

and let’s continue:

(gdb) c
Continuing.

Now, re-launch the SQL.

First it breaks on:

Breakpoint 1, 0x0898b103 in kcbldrget ()

As no “cell smart table scan” wait event has been seen so far in the trace file, we can continue.

(gdb) c
Continuing.

Now it breaks on:

Breakpoint 2, 0x0c800a78 in kcbl_predpush_get ()

As no “cell smart table scan” wait event has been seen so far in the trace file, we can continue.

(gdb) c
Continuing.

Now it breaks on:

Breakpoint 3, 0x0ddcb371 in kcfis_read ()

As no “cell smart table scan” wait event has been seen so far in the trace file, we can continue.

(gdb) c
Continuing.

Now it breaks on :

Breakpoint 4, 0x0ddb78c0 in kcfis_push ()

Here we are: the “cell smart table scan” wait event appears in the trace file.

So the event comes from the function that was called just before kcfis_push. Let’s walk up the call stack to figure out which one it is (frame by frame with up; a single bt would print it all at once):

Breakpoint 4, 0x0ddb78c0 in kcfis_push ()
(gdb) up
#1  0x0ddcc01a in kcfis_read ()
(gdb) up
#2  0x0c800afc in kcbl_predpush_get ()
(gdb) up
#3  0x0898c40c in kcbldrget ()

So it’s the kcfis_read function that produced the “cell smart table scan” wait event. Furthermore, as we can see in the call stack, the kcfis_read function has been called by the kcbldrget one.

Conclusion:

The Smart Scan has been launched from the direct path read code path.

But a question still remains: how can I check by myself (again, just for fun) that the kcfis_read function (which is responsible for the Smart Scan) cannot be called outside the direct path read code path? For the moment I have no clue 🙂

Update 2015/09/08:

As of 12.1.0.1, a more convenient way to display the stack linked to a wait event is:

 exec DBMS_MONITOR.SESSION_TRACE_ENABLE(waits => true, binds => false, plan_stat => 'NEVER');

alter session set events 'wait_event["cell smart table scan"] trace("from %s\n",shortstack())'; 

Enabling the session trace is not mandatory but is useful as it displays the wait event prior to the stack.
The wait event “cell smart table scan” has been used as an example (as it is the one of interest for this post), but you could use any event of your choice.
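Once the investigation is done, both can be switched off again. Something like this should do it (the off syntax for the wait_event event is from memory, so double-check it on your version):

exec DBMS_MONITOR.SESSION_TRACE_DISABLE;

alter session set events 'wait_event["cell smart table scan"] off';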

Real-Time Wait Events and Statistics related to Exadata : Part II

In my first blog entry on this topic, I used 3 scripts to get real-time statistics at the database and session/sql_id levels, and wait events at the database level:

system_event.pl
sysstat.pl
sqlidstat.pl

This new post adds a new one:

sqlidevent.pl (click on the link and then on the view source button to copy/paste the source code), which is useful to diagnose in more depth the real-time relationships between sql_ids, sessions and wait events.

It basically takes a snapshot each second (default interval) of the v$session_event cumulative view (whose values are accumulated since the session started) and computes the differences with the previous snapshot.
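A single snapshot boils down to a query like this one (just a sketch: the real script also retrieves the sql_id through v$session, is RAC aware through the gv$ views and applies the filters passed on the command line):

select s.sql_id, e.sid, e.event, e.total_waits, e.time_waited_micro
from v$session_event e, v$session s
where s.sid = e.sid
and e.event like '%cell%';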

For example, let’s check the sql_ids in relation with smart table scans:

./sqlidevent.pl event='%cell%'
02:00:02 INST_NAME SID	SQL_ID	        EVENT	                NB_WAITS	TIME_WAITED	ms/Wait
02:00:02 BDT1	   ALL	100dajbkzu295	cell smart table scan	9	        408816	        45.424
--------------------------------------> NEW
02:00:03 INST_NAME SID	SQL_ID	        EVENT	                NB_WAITS	TIME_WAITED	ms/Wait
02:00:03 BDT1	   ALL	100dajbkzu295	cell smart table scan	4	        273434          68.359

As you can see, during the last second 4 “cell smart table scan” waits occurred for sql_id “100dajbkzu295” (TIME_WAITED is in microseconds, so 273434 / 4 ≈ 68.359 ms per wait).

By default all the SIDs are aggregated, but you could also filter on a particular sid and/or a particular sql_id (see the help).

Remarks:

  • Those 4 scripts are not exclusively related to Exadata: you can use them on any wait event or statistic. I simply used filters to focus on ‘%cell%’.
  • You can choose the number of snapshots to display and the time to wait between snapshots.
  • The scripts are Oracle RAC aware: you can work on all the instances, a subset or the local one.
  • You have to set your environment with oraenv for one instance of the database you want to diagnose first.

Conclusion:

We now have 4 scripts at our disposal to diagnose real-time wait events and statistics at the database level and at the session/sql_id level.

Real-Time segments statistics

The v$segment_statistics and v$segstat views are a goldmine to extract statistics that are associated with the Oracle segments.

You can see how useful they can be in these posts:

Kevin Closson’s post

Jonathan Lewis’s post or this one

Arup Nanda’s post

But those views are cumulative, so they are not that helpful for reporting real-time information on the segments (imagine your database is generating a lot of I/O right now and you would like to know which segments are generating those I/Os).

To report real-time statistics on the segments, I wrote the segments_stats.pl script (click on the link and then on the view source button to copy/paste the source code), which basically takes a snapshot of the v$segstat view each second (default interval) and computes the differences with the previous snapshot.
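Each snapshot is essentially the content of the v$segstat view. For readability, here is the equivalent query against v$segment_statistics, which already carries the owner and object names (a sketch that ignores the script’s filters; the script itself is based on v$segstat, as mentioned above):

select owner, object_name, statistic_name, value
from v$segment_statistics
where statistic_name like 'physical%'
and value > 0;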

Let’s see an example:

./segments_stats.pl

Connecting to the Instance...

07:10:45   INST_NAME	OWNER	OBJECT_NAME	STAT_NAME                VALUE     
07:10:45   BDT1		BDT	BDTTAB          physical read requests   6         
07:10:45   BDT1		BDT	BDTTAB          segment scans            6         
07:10:45   BDT1         BDT     BDTTAB          logical reads            48        
07:10:45   BDT1         BDT     BDTTAB          physical reads           85        
07:10:45   BDT1         BDT     BDTTAB          physical reads direct    85   
--------------------------------------> NEW
07:10:46   INST_NAME	OWNER	OBJECT_NAME	STAT_NAME                VALUE     
07:10:46   BDT1		BDT	BDTTAB          segment scans            19
07:10:46   BDT1		BDT	BDTTAB          physical read requests   28         
07:10:46   BDT1         BDT     BDTTAB          logical reads            48        
07:10:46   BDT1         BDT     BDTTAB          physical reads           285        
07:10:46   BDT1         BDT     BDTTAB          physical reads direct    285

So, as you can see, the BDTTAB segment generated 285 physical reads during the last second.

Let’s see the help:

./segments_stats.pl help

Usage: ./segments_stats.pl [Interval [Count]] [inst] [top=] [owner=] [statname=] [segment=] [includesys=]

        Default Interval : 1 second.
        Default Count    : Unlimited

  Parameter		Comment						     Default    
  ---------		-------					             -------    
  INST=          	ALL - Show all Instance                              ALL        
                 	CURRENT - Show Current Instance                                         
                 	INSTANCE_NAME,... - choose Instance(s) to display                       

                       << Instances are only displayed in a RAC DB >>                       

  TOP=			Number of rows to display                            10         
  OWNER=		ALL - Show all OWNER                                 ALL        
  STATNAME=		ALL - Show all Stats                                 ALL        
  SEGMENT=		ALL - Show all SEGMENTS                              ALL        
  INCLUDESYS=           Show SYS OBJECTS                                     N          

Example : ./segments_stats.pl segment='AQ%' statname='physical%'
Example : ./segments_stats.pl segment='AQ%' statname='physical writes direct'

As usual:

  • You can choose the number of snapshots to display and the time to wait between snapshots.
  • You can choose to filter on statname, segment and owner (by default no filter is applied).
  • This script is Oracle RAC aware: you can work on all the instances, a subset or the local one.
  • You have to set your environment with oraenv for one instance of the database you want to diagnose first.
  • The script has been tested on Linux, Unix and Windows.

Remark for Exadata: 

As you can filter on the statname, you could choose to filter on the particular ‘optimized physical reads’ statistic this way:

./segments_stats.pl statname='%optimized%'

Real-Time Wait Events and Statistics related to Exadata

In my first post related to Exadata, I provided a perl script to extract real-time metrics from the cells based on their cumulative metrics (see here).

In this post I focus on the information available from the database: cumulative “cell” Statistics and cumulative “cell” Wait Events related to Exadata (Uwe described the most important ones in this post).

As usual, the philosophy behind the 3 scripts used in this post is: extract real-time information from the cumulative views by taking a snapshot each second (default interval) and computing the differences with the previous snapshot.
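To make this philosophy concrete, here is a minimal sketch of such a snapshot/diff loop, written against v$system_event (it assumes DBD::Oracle and a local “/ as sysdba” style connection; the real scripts add RAC awareness, filtering, sorting and formatting):

use strict;
use warnings;
use DBI;

# Local connection (ora_session_mode 2 = SYSDBA); assumes ORACLE_HOME
# and ORACLE_SID are set and DBD::Oracle is installed.
my $dbh = DBI->connect('dbi:Oracle:', '/', '',
                       { ora_session_mode => 2, RaiseError => 1 });

# The cumulative view we sample (filtered on the "cell" wait events).
my $sql = q{select event, total_waits, time_waited_micro
              from v$system_event
             where event like '%cell%'};

my %prev;
while (1) {
    # Take a snapshot of the cumulative counters ...
    my %curr;
    for my $row (@{ $dbh->selectall_arrayref($sql) }) {
        $curr{ $row->[0] } = [ $row->[1], $row->[2] ];
    }
    # ... and report the differences with the previous snapshot.
    for my $event (sort keys %curr) {
        next unless $prev{$event};
        my $nb_waits    = $curr{$event}[0] - $prev{$event}[0];
        my $time_waited = $curr{$event}[1] - $prev{$event}[1];
        printf "%-45s %8d waits %12d us\n", $event, $nb_waits, $time_waited
            if $nb_waits > 0;
    }
    %prev = %curr;
    sleep 1;
}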

So, to collect real-time Wait Events related to Exadata, I use the perl script “system_event.pl” (click on the link and then on the view source button to copy/paste the source code), already used in one of my previous posts, “Measure oracle real-time I/O performance“.

But this time I focus only on the “cell” Wait Events related to Exadata, this way:

./system_event.pl event_like='%cell%'

02:30:41 INST_NAME	EVENT				  NB_WAITS	TIME_WAITED		ms/Wait
02:30:41 BDT1		cell smart file creation 	  7 		950197 			135.742
--------------------------------------> NEW
02:30:42 INST_NAME	EVENT 				  NB_WAITS 	TIME_WAITED 		ms/Wait
02:30:42 BDT1 		cell smart file creation 	  3 		543520 			181.173
02:30:42 BDT1 		cell single block physical read   3 		4420 			1.473

As you can see, during the last second the database waited 3 times on “cell smart file creation”.

Now, to collect real-time Statistics related to Exadata, I use the “sysstat.pl” script (click on the link and then on the view source button to copy/paste the source code) this way:

./sysstat.pl statname_like='%cell%'

03:15:29   INST_NAME	NAME                                                               VALUE
03:15:29   BDT1		cell smart IO session cache lookups                                11
03:15:29   BDT1		cell scans                                                         17
03:15:29   BDT1		cell blocks processed by cache layer                               3551
03:15:29   BDT1		cell blocks processed by txn layer                                 3551
03:15:29   BDT1		cell blocks processed by data layer                                3551
03:15:29   BDT1		cell blocks helped by minscn optimization                          3551
03:15:29   BDT1		cell physical IO interconnect bytes returned by smart scan         1421352
03:15:29   BDT1		cell physical IO interconnect bytes                                2756648
03:15:29   BDT1		cell IO uncompressed bytes                                         29089792
03:15:29   BDT1		cell physical IO bytes eligible for predicate offload              29089792
--------------------------------------> NEW
03:15:30   INST_NAME	NAME                                                               VALUE
03:15:30   BDT1		cell smart IO session cache lookups                                33
03:15:30   BDT1		cell scans                                                         33
03:15:30   BDT1		cell blocks processed by cache layer                               9896
03:15:30   BDT1		cell blocks processed by txn layer                                 9896
03:15:30   BDT1		cell blocks processed by data layer                                9896
03:15:30   BDT1		cell blocks helped by minscn optimization                          9896
03:15:30   BDT1		cell physical IO interconnect bytes returned by smart scan         6602936
03:15:30   BDT1		cell physical IO interconnect bytes                                6701240
03:15:30   BDT1		cell IO uncompressed bytes                                         81068032
03:15:30   BDT1		cell physical IO bytes eligible for predicate offload              81068032

As you can see, during the last second the “cell physical IO bytes eligible for predicate offload” statistic increased by 81068032.
If we need to diagnose more in depth and link those “cell” statistics in real time to one or more sql_ids, we can use the “sqlidstat.pl” script (click on the link and then on the view source button to copy/paste the source code) this way:

./sqlidstat.pl statname_like='%cell%'

--------------------------------------> NEW
03:15:29   SID   SQL_ID         NAME	        		   			    VALUE
03:15:29   ALL   0ab8xuf6kuud5  cell smart IO session cache hits                            4
03:15:29   ALL   0ab8xuf6kuud5  cell scans                                                  10
03:15:29   ALL   0ab8xuf6kuud5  cell blocks processed by cache layer                        1193
03:15:29   ALL   0ab8xuf6kuud5  cell blocks processed by data layer                         1193
03:15:29   ALL   0ab8xuf6kuud5  cell blocks processed by txn layer                          1193
03:15:29   ALL   0ab8xuf6kuud5  cell blocks helped by minscn optimization                   1193
03:15:29   ALL   0ab8xuf6kuud5  cell physical IO interconnect bytes returned by smart scan  356096
03:15:29   ALL   0ab8xuf6kuud5  cell physical IO interconnect bytes                         2232064
03:15:29   ALL   0ab8xuf6kuud5  cell IO uncompressed bytes                                  9773056
03:15:29   ALL   0ab8xuf6kuud5  cell physical IO bytes eligible for predicate offload       9773056
--------------------------------------> NEW
03:15:30   SID   SQL_ID         NAME                                                        VALUE
03:15:30   ALL   0ab8xuf6kuud5  cell smart IO session cache hits                            34
03:15:30   ALL   0ab8xuf6kuud5  cell scans                                                  34
03:15:30   ALL   0ab8xuf6kuud5  cell blocks processed by cache layer                        10522
03:15:30   ALL   0ab8xuf6kuud5  cell blocks processed by data layer                         10522
03:15:30   ALL   0ab8xuf6kuud5  cell blocks processed by txn layer                          10522
03:15:30   ALL   0ab8xuf6kuud5  cell blocks helped by minscn optimization                   10522
03:15:30   ALL   0ab8xuf6kuud5  cell physical IO interconnect bytes returned by smart scan  6039016
03:15:30   ALL   0ab8xuf6kuud5  cell physical IO interconnect bytes                         6039016
03:15:30   ALL   0ab8xuf6kuud5  cell IO uncompressed bytes                                  86196224
03:15:30   ALL   0ab8xuf6kuud5  cell physical IO bytes eligible for predicate offload       86196224

I removed the “INST_NAME” column for readability.

By default all the SIDs are aggregated, but you could also filter on a particular sid (see the help).
So we can see that during the last second, most of the offload processing that we observed at the database level is related to sql_id 0ab8xuf6kuud5.
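For the curious, one snapshot of what sqlidstat.pl works on can be approximated with a join like this one (a sketch; the real script aggregates per sql_id, is RAC aware and applies the filters):

select s.sid, s.sql_id, n.name, st.value
from v$sesstat st, v$statname n, v$session s
where st.statistic# = n.statistic#
and st.sid = s.sid
and n.name like '%cell%'
and st.value > 0;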

Remarks:

  • Those 3 scripts are not exclusively related to Exadata: you can use them on any wait event or statistic. I simply used the event_like or statname_like arguments to focus on ‘%cell%’.
  • You can choose the number of snapshots to display and the time to wait between snapshots.
  • The scripts are Oracle RAC aware: you can work on all the instances, a subset or the local one.
  • You have to set your environment with oraenv for one instance of the database you want to diagnose first.

Part II of this subject available here.

Exadata real-time metrics extracted from cumulative metrics

Exadata provides a lot of useful metrics to monitor the Cells.

The Metrics can be of various types:

  • Cumulative: Cumulative statistics since the metric was created.
  • Instantaneous: Value at the time that the metric is collected.
  • Rate: Rates computed by averaging statistics over observation periods.
  • Transition: Collected when the value of the metric changes; typically captures important transitions in hardware status.

You can find some information on how to exploit those metrics in these posts:

UWE HESSE’s post 

TANEL PODER’s post

But I think those types of metrics are not enough to answer all the basic questions.

Let me explain why with 2 examples:

Let’s have a look at the GD_IO_RQ_W_SM and GD_IO_RQ_W_SM_SEC metrics (restricted to one Grid Disk for readability):

dcli -c cell1 cellcli -e "list metriccurrent attributes name,metricType,metricObjectName,metricValue where name like \'.*GD_IO_RQ_W_SM.*\' and metricObjectName ='data_CD_disk01_cell'"

cell1: GD_IO_RQ_W_SM 		Cumulative 	data_CD_disk01_cell 	2,930 IO requests
cell1: GD_IO_RQ_W_SM_SEC 	Rate 		data_CD_disk01_cell 	0.3 IO/sec

So we can observe that this “cumulative” metric shows the number of small write I/O requests, while its associated “rate” metric shows the number of small write I/O requests per second.

or

Let’s have a look at the CD_IO_TM_W_SM and CD_IO_TM_W_SM_RQ metrics (restricted to one Cell Disk for readability):

dcli -c cell1 cellcli -e "list metriccurrent attributes name,metricType,metricObjectName,metricValue where name like \'.*CD_IO_TM_W.*SM.*\' and metricobjectname='CD_disk07_cell'"

cell1: CD_IO_TM_W_SM	 	Cumulative 	CD_disk07_cell 		1,512,939 us
cell1: CD_IO_TM_W_SM_RQ 	Rate 		CD_disk07_cell 		168 us/request

So we can observe that this “cumulative” metric shows the cumulative small write I/O time in microseconds, while its associated “rate” metric shows the average small write I/O latency in microseconds per request.

But how can I answer these questions:

  1. How many small write I/O requests have been done during the last 80 seconds? (Unfortunately 0.3 * 80 will not necessarily provide the right answer, as it depends on the “observation period” of the rate metric.)
  2. What was the small write I/O latency during the last 80 seconds?

You could ask the same kind of question about any cumulative metric.

To answer all those questions I created a perl script, exadata_metrics.pl (click on the link and then on the view source button to copy/paste the source code), that extracts real-time Exadata metrics from the cumulative ones.

That is to say, the script works with all the cumulative metrics (the following command lists all of them):

cellcli -e "list metriccurrent attributes name,metricType where metricType='Cumulative'"

To extract real-time information, the script takes a snapshot of the cumulative metrics each second (default interval) and computes the differences with the previous snapshot.
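The same idea can be reproduced by hand with two snapshots and a subtraction (an equality predicate should behave like the like predicates used earlier):

dcli -c cell1 cellcli -e "list metriccurrent attributes name,metricValue where name='GD_IO_RQ_W_SM' and metricObjectName='data_CD_disk01_cell'"
sleep 80
dcli -c cell1 cellcli -e "list metriccurrent attributes name,metricValue where name='GD_IO_RQ_W_SM' and metricObjectName='data_CD_disk01_cell'"

The script simply automates this sampling and prints the deltas.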

So, to get the answer to our first question:

./exadata_metrics.pl 80 cell=cell1 name='GD_IO_RQ_W_SM' metricobjectname='data_CD_disk01_cell'

04:30:38 CELL 	NAME 			OBJECTNAME 		VALUE
04:30:38 cell1 	GD_IO_RQ_W_SM 		data_CD_disk01_cell 	0.00 IO requests
--------------------------------------> NEW
04:31:58 CELL 	NAME 			OBJECTNAME 		VALUE
04:31:58 cell1 	GD_IO_RQ_W_SM 		data_CD_disk01_cell 	20.00 IO requests

As you can see, 20 small write I/O requests have been generated during the last 80 seconds (which is different from 0.3 * 80 = 24).

To get the answer to our second question:

./exadata_metrics.pl 80 cell=cell1 name_like='.*CD_IO_TM_W.*SM.*' metricobjectname='CD_disk07_cell'

06:48:33 CELL 	NAME 			OBJECTNAME 		VALUE
06:48:33 cell1 	CD_IO_TM_W_SM 		CD_disk07_cell 		0.00 us
--------------------------------------> NEW
06:49:53 CELL 	NAME 			OBJECTNAME 		VALUE
06:49:53 cell1 	CD_IO_TM_W_SM 		CD_disk07_cell 		3613.00 us

As you can see, the small write I/Os spent 3613 us in total during the last 80 seconds.

Let’s see the help of the script:

./exadata_metrics.pl help

Usage: ./exadata_metrics.pl [Interval [Count]] [cell=] [top=] [name=] [metricobjectname=] [name_like=] [metricobjectname_like=]

Default Interval : 1 second.
Default Count : Unlimited

Parameter 		Comment 					Default
--------- 		------- 					-------
CELL= 			comma separated list of cell to display
TOP= 			Number of rows to display 			10
NAME= 			ALL - Show all cumulative metrics 		ALL
NAME_LIKE= 		ALL - Show all cumulative metrics 		ALL
METRICOBJECTNAME= 	ALL - Show all objects 				ALL
METRICOBJECTNAME_LIKE= 	ALL - Show all objects 				ALL

Example: ./exadata_metrics.pl cell=cell1,cell2 name_like='.*FC.*'
Example: ./exadata_metrics.pl cell=cell1,cell2 name='CD_IO_BY_W_LG'
Example: ./exadata_metrics.pl cell=cell1,cell2 name='CD_IO_BY_W_LG' metricobjectname_like='.*disk.*'

The script is based on the dcli and cellcli commands and their regular expressions (which are described in Kerry Osborne’s post).

  • You can choose the number of snapshots to display and the time to wait between snapshots.
  • You can choose to filter on name and metricobjectname based on like or equal predicates.
  • You can work on all the cells or a subset thanks to the mandatory CELL parameter.
  • The script must be launched by a cell OS user allowed to run dcli without a password (celladmin, for example); ORACLE_HOME must be set.

Please don’t hesitate to tell me if this is useful for you and if you find any issues with this script.

Updates: