Exadata real-time metrics extracted from cumulative metrics

Exadata provides a lot of useful metrics to monitor the Cells.

The Metrics can be of various types:

  • Cumulative: Cumulative statistics since the metric was created.
  • Instantaneous: Value at the time that the metric is collected.
  • Rate: Rates computed by averaging statistics over observation periods.
  • Transition: Are collected at the time when the value of the metrics has changed, and typically captures important transitions in hardware status.

You can found some information on how to exploit those metrics in those posts:

UWE HESSE’s post 

TANEL PODER’s post

But I think those types of metrics are  not enough to answer all the basic questions.

Let me explain why with 2 examples:

Let’s have a look to the metrics GD_IO_RQ_W_SM and GD_IO_RQ_W_SM_SEC (restricted to one Grid Disk for lisibility):

dcli -c cell1 cellcli -e "list metriccurrent attributes name,metricType,metricObjectName,metricValue where name like \'.*GD_IO_RQ_W_SM.*\' and metricObjectName ='data_CD_disk01_cell'"

cell1: GD_IO_RQ_W_SM 		Cumulative 	data_CD_disk01_cell 	2,930 IO requests
cell1: GD_IO_RQ_W_SM_SEC 	Rate 		data_CD_disk01_cell 	0.3 IO/sec

So we can observe that this “cumulative” metric shows the number of small write I/O requests while its associated “rate” metric shows the number of small write I/O requests per seconds.

or

Let’s have a look to the metrics CD_IO_TM_W_SM and CD_IO_TM_W_SM_RQ (restricted to one Cell Disk for lisibility):

dcli -c cell1 cellcli -e "list metriccurrent attributes name,metricType,metricObjectName,metricValue where name like \'.*CD_IO_TM_W.*SM.*\' and metricobjectname='CD_disk07_cell'"

cell1: CD_IO_TM_W_SM	 	Cumulative 	CD_disk07_cell 		1,512,939 us
cell1: CD_IO_TM_W_SM_RQ 	Rate 		CD_disk07_cell 		168 us/request

So we can observe that this “cumulative” metric shows the small write I/O latency in us while its associated “rate” metric shows the small write I/O latency in us per request.

But how can I answer those questions:

  1. How many small write I/O requests have been done during the last 80 seconds? (Unfortunately 0.3 * 80 will not necessary provide the right answer as it depends of the “observation period” of the rate metrics)
  2. What was the small write I/O latency during the last 80 second ?

You could ask for the same kind of questions on all cumulative metrics.

To answer all those questions I created a perl script exadata_metrics.pl (click on the link and then on the view source button  to copy/paste the source code) that extracts exadata real-time information metrics based on cumulative metrics.

That is to say the script works with all the cumulative metrics (the following command list all of them) :

cellcli -e "list metriccurrent attributes name,metricType where metricType='Cumulative'"

To extract real-time information the script takes a snapshot of cumulative metrics each second (default interval) and computes the differences with the previous snapshot.

So, to get the answer to our first question :

./exadata_metrics.pl 80 cell=cell1 name='GD_IO_RQ_W_SM' metricobjectname='data_CD_disk01_cell'

04:30:38 CELL 	NAME 			OBJECTNAME 		VALUE
04:30:38 cell1 	GD_IO_RQ_W_SM 		data_CD_disk01_cell 	0.00 IO requests
--------------------------------------> NEW
04:31:58 CELL 	NAME 			OBJECTNAME 		VALUE
04:31:58 cell1 	GD_IO_RQ_W_SM 		data_CD_disk01_cell 	20.00 IO requests

As you can see 20 small write I/O requests have been generated during the last 80 seconds (which is different from 0.3*80).

To get the answer to our second question :

./exadata_metrics.pl 80 cell=cell1 name_like='.*CD_IO_TM_W.*SM.*' metricobjectname='CD_disk07_cell'

06:48:33 CELL 	NAME 			OBJECTNAME 		VALUE
06:48:33 cell1 	CD_IO_TM_W_SM 		CD_disk07_cell 		0.00 us
--------------------------------------> NEW
06:49:53 CELL 	NAME 			OBJECTNAME 		VALUE
06:49:53 cell1 	CD_IO_TM_W_SM 		CD_disk07_cell 		3613.00 us

As you can see we the small write I/O latency has been  3613 us during the last 80 seconds.

Let’s see the help of the script:

./exadata_metrics.pl help

Usage: ./exadata_metrics.pl [Interval [Count]] [cell=] [top=] [name=] [metricobjectname=] [name_like=] [metricobjectname_like=]

Default Interval : 1 second.
Default Count : Unlimited

Parameter 		Comment 					Default
--------- 		------- 					-------
CELL= 			comma separated list of cell to display
TOP= 			Number of rows to display 			10
NAME= 			ALL - Show all cumulative metrics 		ALL
NAME_LIKE= 		ALL - Show all cumulative metrics 		ALL
METRICOBJECTNAME= 	ALL - Show all objects 				ALL
METRICOBJECTNAME_LIKE= 	ALL - Show all objects 				ALL

Example: ./exadata_metrics.pl cell=cell1,cell2 name_like='.*FC.*'
Example: ./exadata_metrics.pl cell=cell1,cell2 name='CD_IO_BY_W_LG'
Example: ./exadata_metrics.pl cell=cell1,cell2 name='CD_IO_BY_W_LG' metricobjectname_like='.*disk.*'

The script is based on the dcli and the cellcli commands and their regular expressions (wich are described into Kerry Osborne’s  post).

  • You can choose the number of snapshots to display and the time to wait between snapshots.
  • You can choose to filter on name and metricobjectname based on like or equal predicates.
  • You can work on all the cells or a subset thanks to the mandatory CELL parameter.
  • A cell os user allowed to run dcli without password (celladmin for example) can launch the script (ORACLE_HOME must be set).

Please don’t hesitate to tell me if this is useful for you and if you find any issues with this script.

Updates:

Advertisements

Perl Scripts Library

A small post to let you know that this page has been added to the blog.

This page reference all the perl scripts that will be used into this Blog and each of them will be linked to  a practical case.

As the main purpose of the blog is to share experience, you can get the source code of the scripts (explanation into the page).

I hope this will be useful for you.

Measure oracle real-time I/O performance

Did you already need to measure real-time I/O performance of an oracle database ?

AWR could provide an answer for historical period of time but what’s about:  what is going on on my database right now ?

To measure real-time I/O performance I use one of my perl script (system_event.pl) based on the gv$system_event cumulative view.

As this view is a cumulative one (values are gathered since the instance started up), to extract real-time information the script takes a snapshot each second (default interval) and computes the differences with the previous snapshot.

Let’s see an example: as this post deals with IO performance, we can focus on this particular wait class that way:

./system_event.pl waitclass="User I/O"

 18:27:36 INST_NAME 	EVENT				NB_WAITS	TIME_WAITED 	ms/Wait
 18:27:36 BDTO_2 	Disk file operations I/O	1 		77 		0.077
 18:27:36 BDTO_2 	db file scattered read 		241 		1423749 	5.908
 18:27:36 BDTO_2 	db file sequential read 	260 		1168489 	4.494
 18:27:36 BDTO_2 	db file parallel read 		626 		10447664 	16.690
 --------------------------------------> NEW
 18:27:37 INST_NAME 	EVENT 				NB_WAITS 	TIME_WAITED 	ms/Wait
 18:27:37 BDTO_2 	Disk file operations I/O 	1 		84 		0.084
 18:27:37 BDTO_2 	db file scattered read 		205 		1128732		5.506
 18:27:37 BDTO_2 	db file sequential read 	261 		1196085 	4.583
 18:27:37 BDTO_2 	db file parallel read 		592 		9224011 	15.581

So, during the last second we got 261 “db file sequential read” events for an average of 4.583 ms per wait.

You could get more detailed information thanks to the gv$event_histogram view (benefit of this view is described in this post).

As this view is a cumulative one, to extract real-time information, I wrote the event_histogram.pl perl script.

./event_histogram.pl event='db file sequential read'

18:32:52 INST_NAME 	EVENT 				WAIT_TIME_MS            COUNT
18:32:52 BDTO_1 	db file sequential read 	1 			0
18:32:52 BDTO_2 	db file sequential read 	2 			0
18:32:52 BDTO_2 	db file sequential read 	256 			0
18:32:52 BDTO_1 	db file sequential read 	4 			0
18:32:52 BDTO_2 	db file sequential read 	8 			0
18:32:52 BDTO_1 	db file sequential read 	8 			0
18:32:52 BDTO_1 	db file sequential read 	512 			0
18:32:52 BDTO_1 	db file sequential read 	16 			0
18:32:52 BDTO_2 	db file sequential read 	4 			2
18:32:52 BDTO_2 	db file sequential read 	1 			5

In this example we can see that 5 “df file sequential read” took less than 1ms to complete during the last second.

Of course those two perl scripts deal with all oracle wait events and not only the ones related to the User I/O wait class:  then you can extend their usage area following your needs on system events.

See the help for more details:

./system_event.pl help

Usage: ./system_event.pl [Interval [Count]] [inst] [top=] [waitclass=Concurrency|User I/O|System I/O|Administrative|Configuration|Other|Cluster|Application|Idle|Commit|Network] [event=]
Default Interval : 1 second.

Default Count : Unlimited
Parameter 				Comment 					Default

--------- 				------- 					-------

INST= 					ALL - Show all Instance(s)                      ALL
	
 CURRENT	 			- Show Current Instance
	
 INSTANCE_NAME,... 	                - choose Instance(s) to display
<< Instances are only displayed in a RAC DB >>
TOP= 					Number of rows to display                       10

WAITCLASS=Concurrency|User I/O|System I/O|Administrative|Configuration|Other|Cluster|Application|Idle|Commit|Network ALL

EVENT= 					ALL - Show all EVENTS 		                ALL

As usual :

  • You can choose the number of snapshots to display and the time to wait between snapshots.
  • You can choose to filter on wait_class and on a particular event (by default no filter is applied).
  • This script is oracle RAC aware : you can work on all the instances, a subset or the local one.
  • You have to set oraenv on one instance of the database you want to diagnose first.
  • The script has been tested on Linux, Unix and Windows.

Link sql_id and oracle stats

How many times working on a performance issue, finding the Top SQL and Top Waits event has not been enough to understand what’s going on ?

Sometimes you have to diagnose more in depth thanks to the Oracle stats reported into the V$SESSTAT view.

But how to quickly answer :

– Which sql is linked to this stat ?

– Which stats is linked to this sql ?

To answer those questions I wrote a perl script (sqlidstat.pl) :

  • This perl script takes snapshots from the GV$SESSION and  GV$SESSTAT views.
  • As the GV$SESSTAT view is a cumulative one (values are gathered since the session started up), to extract real-time information the script takes a snapshot each second (default interval) and computes the differences with the previous snapshot.
  • You can choose the number of snapshots to display and the time to wait between snapshots.
  • You can choose to filter on sql_id, sid and on a particular stat (by default no filter is applied).
  • This script is oracle RAC aware : you can work on all the instances, a subset or the local one.
  • You have to set oraenv on one instance of the database you want to diagnose first.
  • The script has been tested on Linux, Solaris and Windows.

Usage Examples:

The help :

./sqlidstat.pl help
Usage:./sqlidstat.pl [Interval [Count]] [inst] [top=] [sql_id=] [sid=] [name="statname"]
Default Interval : 1 second.
Default Count : Unlimited

Parameter                  Comment                      Default
---------                  -------                      -------
INST=                      ALL - Show all Instance(s)   ALL
CURRENT                    Show Current Instance
INSTANCE_NAME,...          choose Instance(s) to display
              << Instances are only displayed in a RAC DB >>
TOP=                        Number of rows to display   10
SQL_ID=                     ALL - Show all sql_id       ALL
SID=                        ALL - Show all SID          ALL
NAME=                       ALL - Show all STATNAME     ALL

Launch it on all instances :

./sqlidstat.pl
12:17:11 INST_NAME SQL_ID        NAME                                VALUE
12:17:11 BDTO_1    ff9853fmqsbdu table scan rows gotten              52562
12:17:11 BDTO_2    756yagxj5mmzn session uga memory max              65512
12:17:11 BDTO_2    756yagxj5mmzn session pga memory max              65536
12:17:11 BDTO_1    ff9853fmqsbdu session pga memory                  524288
12:17:11 BDTO_1    ff9853fmqsbdu file io wait time                   634055
12:17:11 BDTO_1    ff9853fmqsbdu session uga memory                  982264
12:17:11 BDTO_1    ff9853fmqsbdu physical read bytes                 8404992
12:17:11 BDTO_1    ff9853fmqsbdu cell physical IO interconnect bytes 8404992
12:17:11 BDTO_1    ff9853fmqsbdu physical read total bytes           8404992
12:17:11 BDTO_1    ff9853fmqsbdu logical read bytes from cache       23945216

Launch it on one instance and a particular stat :

./sqlidstat.pl inst=BDTO_1 name='logical read bytes from cache'
12:27:16 INST_NAME SQL_ID        NAME                          VALUE
12:27:16 BDTO_1    ff9853fmqsbdu logical read bytes from cache 227860480

You could also use the well known snapper script as Gwen Shapira who used it on a real practical case in this post.

I will follow that post with others posts and perl scripts I use to collect real time information based on wait events, latchs, sga stats…