Retrieve and visualize system statistics metrics from AWR with R

In my last post I gave a way to Retrieve and visualize wait events metrics from AWR with R; now it’s time for the system statistics.

So, for a particular system statistic, I’ll retrieve from the dba_hist_sysstat view:

  • Its VALUE between 2 snaps
  • Its VALUE per second between 2 snaps

As the VALUE is cumulative, I need to compute the difference between 2 snaps that way:

SQL> !cat check_awr_stats.sql
set linesi 200
col BEGIN_INTERVAL_TIME format a28
col stat_name format a40

alter session set nls_timestamp_format='YYYY/MM/DD HH24:MI:SS';
alter session set nls_date_format='YYYY/MM/DD HH24:MI:SS';

select s.begin_interval_time,sta.stat_name,sta.VALUE,
--round(((sta.VALUE)/(to_date(s.end_interval_time)-to_date(s.begin_interval_time)))/86400,2) VALUE_PER_SEC_NOT_ACCURATE,
round(((sta.VALUE)/
(
(extract(day from s.END_INTERVAL_TIME)-extract(day from s.BEGIN_INTERVAL_TIME))*86400 +
(extract(hour from s.END_INTERVAL_TIME)-extract(hour from s.BEGIN_INTERVAL_TIME))*3600 +
(extract(minute from s.END_INTERVAL_TIME)-extract(minute from s.BEGIN_INTERVAL_TIME))*60 +
(extract(second from s.END_INTERVAL_TIME)-extract(second from s.BEGIN_INTERVAL_TIME))
)
),2) VALUE_PER_SEC
from
(
select instance_number,snap_id,stat_name,
value - first_value(value) over (partition by stat_name order by snap_id rows 1 preceding) "VALUE"
from
dba_hist_sysstat
where stat_name like nvl('&stat_name',stat_name)
and instance_number = (select instance_number from v$instance)
) sta, dba_hist_snapshot s
where sta.instance_number=s.instance_number
and sta.snap_id=s.snap_id
and s.BEGIN_INTERVAL_TIME >= trunc(sysdate-&sysdate_nb_day_begin_interval+1)
and s.BEGIN_INTERVAL_TIME <= trunc(sysdate-&sysdate_nb_day_end_interval+1)
order by s.begin_interval_time asc;

I use the “partition by stat_name order by snap_id rows 1 preceding” clause to compute the difference between snaps per stat_name.
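
To make the windowing clearer, here is a minimal, self-contained sketch (with made-up snapshot values, not real AWR data): the “rows 1 preceding” frame makes first_value() return the previous snapshot’s value (or the current one for the first row of each stat_name), so the subtraction yields the per-snap delta:

-- Toy data: 3 fake snapshots of a cumulative statistic
with snaps as
(
select 100 snap_id, 'physical reads' stat_name, 1000 value from dual union all
select 101 snap_id, 'physical reads' stat_name, 2500 value from dual union all
select 102 snap_id, 'physical reads' stat_name, 2700 value from dual
)
select snap_id, stat_name, value,
-- previous row's value (or the current one on the first row) subtracted from the current one
value - first_value(value) over (partition by stat_name order by snap_id rows 1 preceding) "DELTA"
from snaps;

   SNAP_ID STAT_NAME           VALUE      DELTA
---------- -------------- ---------- ----------
       100 physical reads       1000          0
       101 physical reads       2500       1500
       102 physical reads       2700        200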

I also use EXTRACT to get an accurate value per second; you should read this blog post to understand why.
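
As a quick illustration of what is at stake (a hypothetical interval, not taken from real snapshots): converting the timestamps to dates drops their fractional seconds, while extracting the fields of the TIMESTAMP difference keeps the full precision:

-- The difference of two timestamps is an INTERVAL DAY TO SECOND:
-- extract its fields to get an exact number of seconds
select
extract(day from diff)*86400 +
extract(hour from diff)*3600 +
extract(minute from diff)*60 +
extract(second from diff) interval_seconds
from
(
select timestamp '2013-03-21 00:20:17.50' - timestamp '2013-03-21 00:00:12.25' diff from dual
);

INTERVAL_SECONDS
----------------
         1205.25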

The output is like:

SQL> @check_awr_stats.sql

Session altered.

Session altered.

Enter value for stat_name: physical reads
old  17: where stat_name like nvl('&stat_name',stat_name)
new  17: where stat_name like nvl('physical reads',stat_name)
Enter value for sysdate_nb_day_begin_interval: 7
old  22: and s.BEGIN_INTERVAL_TIME >= trunc(sysdate-&sysdate_nb_day_begin_interval+1)
new  22: and s.BEGIN_INTERVAL_TIME >= trunc(sysdate-7+1)
Enter value for sysdate_nb_day_end_interval: 0
old  23: and s.BEGIN_INTERVAL_TIME <= trunc(sysdate-&sysdate_nb_day_end_interval+1)
new  23: and s.BEGIN_INTERVAL_TIME <= trunc(sysdate-0+1)

BEGIN_INTERVAL_TIME          STAT_NAME                                     VALUE VALUE_PER_SEC
---------------------------- ---------------------------------------- ---------- -------------
2013/03/21 00:00:12          physical reads                              1363483       1132.11
2013/03/21 00:20:17          physical reads                               260228        216.04
2013/03/21 00:40:21          physical reads                                29573         24.56
2013/03/21 01:00:25          physical reads                               231492        192.18
2013/03/21 01:20:30          physical reads                               494749         410.7
2013/03/21 01:40:35          physical reads                               232803        193.02
2013/03/21 02:00:41          physical reads                               318803        264.66
2013/03/21 02:20:45          physical reads                              1253398       1039.57
2013/03/21 02:40:51          physical reads                              2064294       1711.98
2013/03/21 03:00:57          physical reads                               503404        439.13
2013/03/21 03:20:03          physical reads                               138052        114.59

So if you use this sql, you’ll be able to see the historical behaviour of a particular system statistic. That’s fine, and I have used it many times.

But I also like to have a graphical view of what’s going on, and that is exactly where R comes into play.

I created an R script named graph_awr_sysstat.r (you can download it from this repository) that provides:

  1. A graph for the VALUE metric over the period of time
  2. A graph for the VALUE per second metric over the period of time
  3. A graph for the Histogram of VALUE over the period of time
  4. A graph for the Histogram of VALUE per second over the period of time
  5. A pdf file that contains those graphs
  6. A text file that contains the metrics used to build the graphs

The graphs are sent to both outputs (X11 and the pdf file). In case the X11 environment does not work, the pdf file is generated anyway.

As a graphical view is easier to understand, let’s have a look at how it works and what it displays:

For example, let’s focus on the “physical reads” system statistic over the last 7 days:

./graph_awr_sysstat.r
Building the thin jdbc connection string....

host ?: bdt_host
port ?: 1521
service_name ?: bdt
system password ?: XXXXXXXX
Display which sysstat (no quotation marks) ?: physical reads
Please enter nb_day_begin_interval: 7
Please enter nb_day_end_interval: 0
Loading required package: methods
Loading required package: DBI
Loading required package: rJava
[1] 11
Please enter any key to exit:

The output will be like:

[graph: physical_reads]

and the physical_reads pdf file will be generated as well.

As you can see, you are prompted for:

  • jdbc thin-style connection details to connect to the database (you can launch the R script from outside the host hosting the database)
  • The oracle system user password
  • The statistic we want to focus on
  • The number of days to go back as the time frame starting point
  • The number of days to go back as the time frame ending point

Remarks:

– You need to purchase the Diagnostic Pack in order to be allowed to query the AWR repository.

– If the script has been launched with X11 not working properly, you’ll get:

Loading required package: methods
Loading required package: DBI
Loading required package: rJava
[1] "Not able to display, so only pdf generation..."
Warning message:
In x11(width = 15, height = 10) :
  unable to open connection to X11 display ''
[1] 11
Please enter any key to exit:

But the script takes care of it and the pdf file will be generated anyway.

Conclusion:

We are able to display graphically AWR historical metrics for a particular statistic over a period of time with the help of a single script named graph_awr_sysstat.r (you can download it from this repository).

If you don’t have R installed:

  • you can use the sql provided at the beginning of this post to get at least a “text” view of the historical metrics.
  • Install it 🙂 (see “Getting Started” from this link)

Update: If you want to see the same metrics in real time, you could have a look at this post.


Retrieve and visualize wait events metrics from AWR with R

In this post I will provide a way to retrieve wait events metrics from AWR and to display graphically those metrics thanks to R over a period of time.

Why R? Because R is a powerful tool for statistical analysis with graphing and plotting packages built in. Furthermore, R can connect to Oracle via a JDBC package, which makes importing data very easy.

So, for a particular wait event, I’ll retrieve from the dba_hist_system_event view:

  • TIME_WAITED_MS: Time waited in ms between 2 snaps
  • TOTAL_WAITS: Number of waits between 2 snaps
  • MS_PER_WAIT: Avg wait time in ms between 2 snaps

As those metrics are cumulative ones, I need to compute the difference between 2 snaps that way:

SQL> !cat check_awr_event.sql
set linesi 220;

alter session set nls_date_format='DD-MM-YYYY HH24:MI:SS';

col BEGIN_INTERVAL_TIME format a28
col event_name format a40
col WAIT_CLASS format a20
set pagesi 999

select distinct(WAIT_CLASS) from v$system_event;

select e.WAIT_CLASS,e.event_name,s.begin_interval_time,e.TOTAL_WAITS,e.TIME_WAITED_MS,e.TIME_WAITED_MS / TOTAL_WAITS "MS_PER_WAIT"
from
(
select instance_number,snap_id,WAIT_CLASS,event_name,
total_waits - first_value(total_waits) over (partition by event_name order by snap_id rows 1 preceding) "TOTAL_WAITS",
(time_waited_micro - first_value(time_waited_micro) over (partition by event_name order by snap_id rows 1 preceding))/1000 "TIME_WAITED_MS"
from
dba_hist_system_event
where
WAIT_CLASS like nvl('&WAIT_CLASS',WAIT_CLASS)
and event_name like nvl('&event_name',event_name)
and instance_number = (select instance_number from v$instance)
) e, dba_hist_snapshot s
where e.TIME_WAITED_MS > 0
and e.instance_number=s.instance_number
and e.snap_id=s.snap_id
and s.BEGIN_INTERVAL_TIME >= trunc(sysdate-&sysdate_nb_day_begin_interval+1)
and s.BEGIN_INTERVAL_TIME <= trunc(sysdate-&sysdate_nb_day_end_interval+1) order by 1,2,3;

I use the “partition by event_name order by snap_id rows 1 preceding” to compute the difference between snaps per event.
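
Here is a minimal sketch with made-up cumulative values (chosen so that the deltas match the 12.20 AM row of the output below), showing how the window frame and the microseconds-to-milliseconds conversion combine:

-- Toy data: 2 fake snapshots of a cumulative wait event
with snaps as
(
select 100 snap_id, 'db file sequential read' event_name, 50000 total_waits, 100000000 time_waited_micro from dual union all
select 101 snap_id, 'db file sequential read' event_name, 82759 total_waits, 225296732 time_waited_micro from dual
)
select snap_id,
total_waits - first_value(total_waits) over (partition by event_name order by snap_id rows 1 preceding) "TOTAL_WAITS",
-- time_waited_micro is in microseconds: divide the delta by 1000 to get ms
(time_waited_micro - first_value(time_waited_micro) over (partition by event_name order by snap_id rows 1 preceding))/1000 "TIME_WAITED_MS"
from snaps;

   SNAP_ID TOTAL_WAITS TIME_WAITED_MS
---------- ----------- --------------
       100           0              0
       101       32759     125296.732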

The output is like:

SQL> @check_awr_event.sql

Session altered.

WAIT_CLASS
--------------------
Administrative
Application
Commit
Concurrency
Configuration
Idle
Network
Other
System I/O
User I/O

10 rows selected.

Enter value for wait_class:
old  10: WAIT_CLASS like nvl('&WAIT_CLASS',WAIT_CLASS)
new  10: WAIT_CLASS like nvl('',WAIT_CLASS)
Enter value for event_name: db file sequential read
old  11: and event_name like nvl('&event_name',event_name)
new  11: and event_name like nvl('db file sequential read',event_name)
Enter value for sysdate_nb_day_begin_interval: 7
old  17: and s.BEGIN_INTERVAL_TIME >= trunc(sysdate-&sysdate_nb_day_begin_interval+1)
new  17: and s.BEGIN_INTERVAL_TIME >= trunc(sysdate-7+1)
Enter value for sysdate_nb_day_end_interval: 0
old  18: and s.BEGIN_INTERVAL_TIME <= trunc(sysdate-&sysdate_nb_day_end_interval+1) order by 1,2,3
new  18: and s.BEGIN_INTERVAL_TIME <= trunc(sysdate-0+1) order by 1,2,3

WAIT_CLASS           EVENT_NAME                               BEGIN_INTERVAL_TIME          TOTAL_WAITS TIME_WAITED_MS MS_PER_WAIT
-------------------- ---------------------------------------- ---------------------------- ----------- -------------- -----------
User I/O             db file sequential read                  20-MAR-13 12.00.45.270 AM         286608     271639.345  .947773073
User I/O             db file sequential read                  20-MAR-13 12.20.49.821 AM          32759     125296.732  3.82480332
User I/O             db file sequential read                  20-MAR-13 12.40.54.540 AM           4404       8946.577  2.03146617
User I/O             db file sequential read                  20-MAR-13 01.00.58.981 AM           3617       4737.182   1.3096992
User I/O             db file sequential read                  20-MAR-13 01.20.03.039 AM          22624      94671.254  4.18454977
User I/O             db file sequential read                  20-MAR-13 01.40.07.163 AM          94323     181118.282  1.92019213
User I/O             db file sequential read                  20-MAR-13 02.00.11.636 AM         119458     317204.205  2.65536176
User I/O             db file sequential read                  20-MAR-13 02.20.16.040 AM          81720     212678.865  2.60253139
User I/O             db file sequential read                  20-MAR-13 02.40.20.827 AM          61664     120446.947   1.9532782
User I/O             db file sequential read                  20-MAR-13 03.00.25.531 AM          92493     110715.902  1.19701926
User I/O             db file sequential read                  20-MAR-13 03.20.29.923 AM           2692       6102.149  2.26677155

So if you use this sql, you’ll be able to see the historical behaviour of a particular event. That’s fine, and I have used it many times.

But I also like to have a graphical view of what’s going on, and that is exactly where R comes into play.

I created an R script named graph_awr_event.r (you can download it from this repository) that provides:

        1. A graph for the TIME_WAITED_MS metric over the period of time
        2. A graph for the NB_WAITS metric over the period of time
        3. A graph for the MS_PER_WAIT metric over the period of time
        4. A graph for the Histogram of MS_PER_WAIT over the period of time
        5. A pdf file that contains those graphs
        6. A text file that contains the metrics used to build the graphs

The graphs are sent to both outputs (X11 and the pdf file). In case the X11 environment does not work, the pdf file is generated anyway.

As a graphical view is easier to understand, let’s have a look at how it works and what it displays:

For example, let’s focus on the “db file sequential read” wait event over the last 7 days:

./graph_awr_event.r   
Building the thin jdbc connection string....

host ?: bdt_host
port ?: 1521
service_name ?: bdt
system password ?: XXXXXXXX
Display which event (no quotation marks) ?: db file sequential read
Please enter nb_day_begin_interval: 7
Please enter nb_day_end_interval: 0
Loading required package: methods
Loading required package: DBI
Loading required package: rJava
[1] 11
Please enter any key to exit:

The output will be like:

[graph: db_file_sequential_read]

and the db_file_sequential_read pdf file will be generated as well.

As you can see, you are prompted for:

  • jdbc thin-style connection details to connect to the database (you can launch the R script from outside the host hosting the database)
  • The oracle system user password
  • The wait event we want to focus on
  • The number of days to go back as the time frame starting point
  • The number of days to go back as the time frame ending point

Remarks:

– You need to purchase the Diagnostic Pack in order to be allowed to query the AWR repository.

– If the script has been launched with X11 not working properly, you’ll get:

Loading required package: methods
Loading required package: DBI
Loading required package: rJava
[1] "Not able to display, so only pdf generation..."
Warning message:
In x11(width = 15, height = 10) :
  unable to open connection to X11 display ''
[1] 11
Please enter any key to exit:

But the script takes care of it and the pdf file will be generated anyway.

Conclusion:

We are able to display graphically AWR historical metrics for a particular wait event over a period of time with the help of a single script named graph_awr_event.r (you can download it from this repository).

If you don’t have R installed:

  • you can use the sql provided at the beginning of this post to get at least a “text” view of the historical metrics.
  • Install it 😉 (see “Getting Started” from this link)


Link huge PGA or TEMP consumption to sql_id over a period of time

Imagine you discovered that during a particular period of time a huge amount of PGA or TEMP space has been consumed by your database.

Then you want to know if you can link this behavior to one or more sql_id.

OK, I am a little bit late 🙂 but you should have noticed that since 11.2.0.1, two useful columns have been added to the v$active_session_history and dba_hist_active_sess_history views:

  • PGA_ALLOCATED:  Amount of PGA memory (in bytes) consumed by this session at the time this sample was taken
  • TEMP_SPACE_ALLOCATED: Amount of TEMP space (in bytes) consumed by this session at the time this sample was taken

Interesting, but is it helpful to answer:

  • What are the top sql_id linked to PGA consumption during a particular period of time?
  • What are the top sql_id linked to TEMP space consumption during a particular period of time?

Coskan Gundogar gave a nice example related to a temp space issue for a particular session in this blog post.

With this blog post, I just want to provide a way to generalize the computation to the instance level instead of the session level.

So, to find the top sql_id(s) responsible for PGA or TEMP space consumption during a particular period of time, I use:

For the PGA consumption:

SQL> !cat ash_sql_id_pga.sql
col percent head '%' for 99990.99
col star for A10 head ''

accept seconds prompt "Last Seconds [60] : " default 60;
accept top prompt "Top  Rows    [10] : " default 10;

select SQL_ID,round(PGA_MB,1) PGA_MB,percent,rpad('*',percent*10/100,'*') star
from
(
select SQL_ID,sum(DELTA_PGA_MB) PGA_MB ,(ratio_to_report(sum(DELTA_PGA_MB)) over ())*100 percent,rank() over(order by sum(DELTA_PGA_MB) desc) rank
from
(
select SESSION_ID,SESSION_SERIAL#,sample_id,SQL_ID,SAMPLE_TIME,IS_SQLID_CURRENT,SQL_CHILD_NUMBER,PGA_ALLOCATED,
greatest(PGA_ALLOCATED - first_value(PGA_ALLOCATED) over (partition by SESSION_ID,SESSION_SERIAL# order by sample_time rows 1 preceding),0)/power(1024,2) "DELTA_PGA_MB"
from
v$active_session_history
where
IS_SQLID_CURRENT='Y'
and sample_time > sysdate-&seconds/86400
order by 1,2,3,4
)
group by sql_id
having sum(DELTA_PGA_MB) > 0
)
where rank < (&top+1)
order by rank
/

The output is like:

SQL> @ash_sql_id_pga.sql
Last Seconds [60] : 3600
Top  Rows    [10] : 10
old  13: and sample_time > sysdate-&seconds/86400
new  13: and sample_time > sysdate-3600/86400
old  20: where rank < (&top+1)
new  20: where rank < (10+1)

SQL_ID            PGA_MB         %
------------- ---------- --------- ----------
4nyd6q26dzvb2     2211.6     55.75 *****
3s5d6gj84kban      309.8      7.81
8pfzqzrsvjj38        171      4.31
2bjfxk6vqc0ft      134.1      3.38
fxr8wdgq9bmsv      115.4      2.91
4r23u15d4c9rh      108.4      2.73
                    94.3      2.38
g02u5ztkuv2sz       43.5      1.10
ddr8uck5s5kp3       23.8      0.60
10pty85f37hrb       20.5      0.52

10 rows selected.

For the TEMP consumption:

SQL> !cat ash_sql_id_temp.sql
col percent head '%' for 99990.99
col star for A10 head ''

accept seconds prompt "Last Seconds [60] : " default 60;
accept top prompt "Top  Rows    [10] : " default 10;

select SQL_ID,TEMP_MB,percent,rpad('*',percent*10/100,'*') star
from
(
select SQL_ID,sum(DELTA_TEMP_MB) TEMP_MB ,(ratio_to_report(sum(DELTA_TEMP_MB)) over ())*100 percent,rank() over(order by sum(DELTA_TEMP_MB) desc) rank
from
(
select SESSION_ID,SESSION_SERIAL#,sample_id,SQL_ID,SAMPLE_TIME,IS_SQLID_CURRENT,SQL_CHILD_NUMBER,temp_space_allocated,
greatest(temp_space_allocated - first_value(temp_space_allocated) over (partition by SESSION_ID,SESSION_SERIAL# order by sample_time rows 1 preceding),0)/power(1024,2) "DELTA_TEMP_MB"
from
v$active_session_history
where
IS_SQLID_CURRENT='Y'
and sample_time > sysdate-&seconds/86400
order by 1,2,3,4
)
group by sql_id
having sum(DELTA_TEMP_MB) > 0
)
where rank < (&top+1)
order by rank
/

The output is like:

SQL> @ash_sql_id_temp.sql
Last Seconds [60] : 3600
Top  Rows    [10] : 
old  13: and sample_time > sysdate-&seconds/86400
new  13: and sample_time > sysdate-3600/86400
old  19: where rank < (&top+1)
new  19: where rank < (10+1)

SQL_ID           TEMP_MB         %
------------- ---------- --------- ----------
by714720ajxwk          2     50.00 *****
c2jdkwzndq685          1     25.00 **
5h7w8ykwtb2xt          1     25.00 **

Basically:

  • The SQL computes, for each session, the PGA or TEMP space allocated between two active session history samples thanks to “over (partition by SESSION_ID,SESSION_SERIAL# order by sample_time rows 1 preceding)”.
  • Those computed values are linked to the “active” sql_id observed during the sampling.
  • Then, it sums those computed values per sql_id and displays the top sql_id(s).

So, we are now able to know which sql_id are responsible for huge PGA or TEMP consumption during a certain period of time.
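
To see why the greatest(…,0) clamp matters, here is a minimal sketch with made-up samples for a single session: the delta is computed between consecutive samples, and a negative delta (PGA released between two samples) is clamped to zero, so only growth is attributed to the sql_id active at that sample:

-- Toy data: 3 fake ASH samples of one session
with samples as
(
select 10 session_id, 1 session_serial#, timestamp '2013-03-21 10:00:01' sample_time, 1048576 pga_allocated from dual union all
select 10 session_id, 1 session_serial#, timestamp '2013-03-21 10:00:02' sample_time, 5242880 pga_allocated from dual union all
select 10 session_id, 1 session_serial#, timestamp '2013-03-21 10:00:03' sample_time, 2097152 pga_allocated from dual
)
select sample_time, pga_allocated,
-- negative deltas (memory released) are clamped to 0 by greatest()
greatest(pga_allocated - first_value(pga_allocated) over (partition by session_id,session_serial# order by sample_time rows 1 preceding),0)/power(1024,2) "DELTA_PGA_MB"
from samples;

SAMPLE_TIME          PGA_ALLOCATED DELTA_PGA_MB
-------------------- ------------- ------------
2013/03/21 10:00:01        1048576            0
2013/03/21 10:00:02        5242880            4
2013/03/21 10:00:03        2097152            0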

Important remarks:

  1. You can also query the dba_hist_active_sess_history view, but bear in mind that it is not as accurate, since only a subset of the rows coming from v$active_session_history is flushed into the dba_hist_active_sess_history view.
  2. Those SQL work as of 11.2.0.1.
  3. You need to purchase the Diagnostic Pack in order to be allowed to query the “v$active_session_history” view.
  4. Those queries are useful to diagnose “huge” PGA or TEMP consumption; they are not so helpful to find out exactly how much PGA or TEMP a given sql_id used (as it may have used already pre-allocated PGA or TEMP space and not needed any over-allocation: see the column definitions at the beginning of the post).

UPDATE: You can drill down into the details per sql_id execution in this blog post.

ASM I/O Statistics Utility: Update for Exadata

In this previous post (You should read it to understand what will follow) I explained how my asmiostat utility could be useful for the Exadata community. For this, I made one assumption:

  • Each storage cell constitutes a separate failure group (in the most common Exadata configuration) (see the Expert Oracle Exadata book for more details)

And I concluded with:

  • In case your Exadata configuration does not follow this rule (one failure group per storage cell), just be aware that I will update my asmiostat utility so that it will be able to group by storage cell in any case (thanks to the IP located in the disk paths). I’ll keep you posted once ready.

Here we are: I updated my asmiostat utility so that you can choose to focus on IP (Exadata Cells) instead of Failgroup.
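
Under the hood, grouping by cell comes down to extracting the IP embedded in each ASM disk path. Here is a minimal SQL sketch of the idea (hypothetical, assuming the usual Exadata path format ‘o/<cell IP>/<griddisk name>’; the utility itself does the equivalent in Perl):

-- Hypothetical sketch: group the ASM disks by the cell IP embedded in their path
select regexp_substr(path, '\d{1,3}(\.\d{1,3}){3}') cell_ip, count(*) nb_disks
from v$asm_disk
group by regexp_substr(path, '\d{1,3}(\.\d{1,3}){3}');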

So now you can measure the performance and the IO load across the DB servers and the cells that way:

./real_time.pl -type=asmiostat -show=ip,inst

with the following output:

Collecting 1 sec....
............................
03:32:18                                                               Kby      Avg       AvgBy/    Read                Kby       Avg        AvgBy/    Write
03:32:18   INST     DG          IP (Cells)        DSK        Reads/s   Read/s   ms/Read   Read      Errors   Writes/s   Write/s   ms/Write   Write     Errors
03:32:18   ------   ---------   ---------------   --------   -------   ------   -------   ------    ------   --------   -------   --------   ------    ------
03:32:18   +ASM                                              48        1600     10.6      34133     0        7          144       18.0       21065     0
03:32:18   +ASM                 192.168.56.111               12        424      13.4      36181     0        3          48        32.4       16384     0
03:32:18   +ASM                 192.168.56.101               36        1176     9.7       33451     0        4          96        7.2        24576     0

You can also choose to filter on some IP addresses (see the help):

./real_time.pl -type=asmiostat -help

Usage: ./real_time.pl -type=asmiostat [-interval] [-count] [-inst] [-dg] [-fg] [-ip] [-show] [-help]
 Default Interval : 1 second.
 Default Count    : Unlimited

  Parameter    Comment                                                      Default
  ---------    -------                                                      -------
  -INST=       ALL - Show all Instance(s)                                   ALL
               CURRENT - Show Current Instance
               INSTANCE_NAME,... - choose Instance(s) to display

  -DG=         Diskgroup to collect (comma separated list)                  ALL
  -FG=         Failgroup to collect (comma separated list)                  ALL
  -IP=         IP (Exadata Cells) to collect (Wildcard allowed)             ALL
  -SHOW=       What to show: inst,fg|ip,dg,dsk (comma separated list)       DG

Example: ./real_time.pl -type=asmiostat
Example: ./real_time.pl -type=asmiostat -inst=+ASM1
Example: ./real_time.pl -type=asmiostat -dg=DATA -show=dg
Example: ./real_time.pl -type=asmiostat -dg=data -show=inst,dg,fg
Example: ./real_time.pl -type=asmiostat -show=dg,dsk
Example: ./real_time.pl -type=asmiostat -show=inst,dg,fg,dsk
Example: ./real_time.pl -type=asmiostat -show=ip -ip='%10%'

Remarks:

  • To get the asmiostat utility included in the real_time.pl script: click on the link, then on the view source button, and copy/paste the source code. You can also download the script from this repository to avoid the copy/paste (click on the link).
  • For a full description of my asmiostat utility see this post.

UPDATE: The asmiostat utility is not part of the real_time.pl script anymore. A new utility called asm_metrics.pl has been created. See “ASM metrics are a gold mine. Welcome to asm_metrics.pl, a new utility to extract and to manipulate them in real time” for more information.

Exadata real-time metrics extracted from cumulative metrics: Part II

In this post I introduced my exadata_metrics.pl script, which I use to collect real-time cell metrics from the cumulative ones.

I added new features to it that you may find useful:

  1. First, I added the possibility to aggregate the results based on the cell, on the metricobjectname, or both: that is to say, you can customize the way the metrics are computed.
  2. Second, I added the possibility to use a “groupfile” (a file containing a list of cells, same format as the dcli utility) as input (which could be useful if your Exadata has a lot of cells 😉).

Let’s see some examples to make this clear. For this I will focus on the CD_IO_TM_R_SM metric and on 2 cells only (for output simplicity) during the last 100 seconds.

To collect the metrics without aggregation:

./exadata_metrics.pl 100 groupfile=./cell_group name='CD_IO_TM_R_SM'

The output will be like:

07:59:46   CELL                    NAME                         OBJECTNAME                                                  VALUE
07:59:46   ----                    ----                         ----------                                                  -----
07:59:46   cell2                   CD_IO_TM_R_SM                CD_disk05_cell                                              0.00 us
07:59:46   cell2                   CD_IO_TM_R_SM                FD_03_cell                                                  0.00 us
07:59:46   cell2                   CD_IO_TM_R_SM                CD_disk03_cell                                              36479.00 us
07:59:46   cell                    CD_IO_TM_R_SM                CD_disk06_cell                                              41572.00 us
07:59:46   cell2                   CD_IO_TM_R_SM                CD_disk02_cell                                              167822.00 us
07:59:46   cell                    CD_IO_TM_R_SM                CD_disk02_cell                                              522659.00 us
07:59:46   cell                    CD_IO_TM_R_SM                CD_disk04_cell                                              523456.00 us
07:59:46   cell                    CD_IO_TM_R_SM                CD_disk01_cell                                              553921.00 us
07:59:46   cell                    CD_IO_TM_R_SM                FD_02_cell                                                  580027.00 us
07:59:46   cell                    CD_IO_TM_R_SM                FD_03_cell                                                  801521.00 us
07:59:46   cell                    CD_IO_TM_R_SM                FD_01_cell                                                  845028.00 us
07:59:46   cell                    CD_IO_TM_R_SM                FD_00_cell                                                  1228914.00 us
07:59:46   cell2                   CD_IO_TM_R_SM                CD_disk01_cell                                              1229929.00 us
07:59:46   cell                    CD_IO_TM_R_SM                CD_disk05_cell                                              1314321.00 us
07:59:46   cell                    CD_IO_TM_R_SM                CD_disk03_cell                                              4055302.00 us

With an aggregation on the objectname (I use the “show” option as I want to display cell and name):

./exadata_metrics.pl 100 groupfile=./cell_group name='CD_IO_TM_R_SM' show=cell,name

The output will be like:

07:59:48   CELL                    NAME                         OBJECTNAME                                                  VALUE
07:59:48   ----                    ----                         ----------                                                  -----
07:59:48   cell2                   CD_IO_TM_R_SM                                                                            1434230.00 us
07:59:48   cell                    CD_IO_TM_R_SM                                                                            10466721.00 us

As you can see, the objectname has disappeared as the metrics have been aggregated on it.

Let’s collect one more time with no aggregation:

./exadata_metrics.pl 100 groupfile=./cell_group name='CD_IO_TM_R_SM'

With no aggregation the output will be like:

09:37:01   CELL                    NAME                         OBJECTNAME                                                  VALUE
09:37:01   ----                    ----                         ----------                                                  -----
09:37:01   cell2                   CD_IO_TM_R_SM                CD_disk09_cell                                              0.00 us
09:37:01   cell2                   CD_IO_TM_R_SM                CD_disk05_cell                                              0.00 us
09:37:01   cell2                   CD_IO_TM_R_SM                FD_03_cell                                                  0.00 us
09:37:01   cell                    CD_IO_TM_R_SM                CD_disk12_cell                                              0.00 us
09:37:01   cell                    CD_IO_TM_R_SM                CD_disk01_cell                                              879.00 us
09:37:01   cell                    CD_IO_TM_R_SM                CD_disk06_cell                                              41629.00 us
09:37:01   cell                    CD_IO_TM_R_SM                FD_03_cell                                                  111676.00 us
09:37:01   cell                    CD_IO_TM_R_SM                FD_02_cell                                                  233388.00 us
09:37:01   cell                    CD_IO_TM_R_SM                FD_01_cell                                                  253784.00 us
09:37:01   cell                    CD_IO_TM_R_SM                CD_disk04_cell                                              519624.00 us
09:37:01   cell                    CD_IO_TM_R_SM                CD_disk05_cell                                              587848.00 us
09:37:01   cell2                   CD_IO_TM_R_SM                CD_disk03_cell                                              1949331.00 us
09:37:01   cell2                   CD_IO_TM_R_SM                CD_disk02_cell                                              2016198.00 us
09:37:01   cell                    CD_IO_TM_R_SM                CD_disk03_cell                                              3220328.00 us
09:37:01   cell2                   CD_IO_TM_R_SM                CD_disk01_cell                                              3631941.00 us

With an aggregation on the cell and the objectname (I use the “show” option as I want to display name):

./exadata_metrics.pl 100 groupfile=./cell_group name='CD_IO_TM_R_SM' show=name

The output will be like:

09:36:59   CELL                    NAME                         OBJECTNAME                                                  VALUE
09:36:59   ----                    ----                         ----------                                                  -----
09:36:59                           CD_IO_TM_R_SM                                                                            12566626.00 us

As you can see, the objectname and the cell have disappeared as the metrics have been aggregated.

Let’s see the help:

 ./exadata_metrics.pl help

Usage: ./exadata_metrics.pl [Interval [Count]] [cell=|groupfile=] [show=] [top=] [name=] [name!=] [objectname=] [objectname!=]

 Default Interval : 1 second.
 Default Count : Unlimited

 Parameter                 Comment                                                      Default
 ---------                 -------                                                      -------
 CELL=                     comma-separated list of cells
 GROUPFILE=                file containing list of cells
 SHOW=                     What to show (name included): cell,objectname                ALL
 TOP=                      Number of rows to display                                    10
 NAME=                     ALL - Show all cumulative metrics (wildcard allowed)         ALL
 NAME!=                    Exclude cumulative metrics (wildcard allowed)                EMPTY
 OBJECTNAME=               ALL - Show all objects (wildcard allowed)                    ALL
 OBJECTNAME!=              Exclude objects (wildcard allowed)                           EMPTY

utility assumes passwordless SSH from this cell node to the other cell nodes
utility assumes ORACLE_HOME has been set (with celladmin user for example)

Example : ./exadata_metrics.pl cell=cell
Example : ./exadata_metrics.pl groupfile=./cell_group
Example : ./exadata_metrics.pl groupfile=./cell_group show=name
Example : ./exadata_metrics.pl cell=cell objectname='CD_disk03_cell' name!='.*RQ_W.*'
Example : ./exadata_metrics.pl cell=cell name='.*BY.*' objectname='.*disk.*' name!='GD.*' objectname!='.*disk1.*'

Conclusion:

  • You are able to collect real-time metrics based on cumulative metrics.
  • You can choose the number of snapshots to display and the time to wait between snapshots.
  • You can choose to filter on name and objectname based on predicates (see the help).
  • You can work on all the cells or a subset thanks to the CELL or the GROUPFILE parameter.
  • You can decide the way to compute the metrics with no aggregation, aggregation on cell, objectname or both.

To get the exadata_metrics.pl script: click on the link, then on the view source button, and copy/paste the source code. You can also download the script from this repository to avoid the copy/paste (click on the link).

Any remarks, suggestions, questions are welcome.

Update:

You should read this post for a better interpretation of the utility: Exadata Cell metrics: collectionTime attribute, something that matters