Graphing Exadata cells metrics with Telegraf, InfluxDB and Grafana

Introduction

As a picture is worth a thousand words, we may want to visualise the Exadata cell metrics. What if you could do it the way you want and build your own graphs? Let’s try to achieve this with 3 layers:

  • telegraf: to collect the Exadata metrics
  • InfluxDB: to store the time-series Exadata metrics
  • grafana: to visualise the Exadata metrics

You should first read this post: Graphing Oracle performance metrics with Telegraf, InfluxDB and Grafana. The current post relies on it and gives less detail about the installation, setup and so on.

Installation

  • The installation of those 3 layers is the same as described in that blog post.
  • Telegraf has been installed on one database node.
  • The dcli utility has also been copied to this database node.

Setup

The setup is the same as described in that blog post, except for the script being used to collect the metrics:

  • The script collects the cell metrics from one database node and prints the output in InfluxDB line protocol.
  • The script assumes a passwordless SSH setup to connect from the database node to the cells.
  • It collects all metriccurrent metrics that are not cumulative; the dcli command being used is:
$> dcli -g ./cells_group cellcli -e "list metriccurrent attributes name,metricObjectName,metricValue,metricType where metricType !='cumulative'"

The perl script code is:

$> cat influxdb_exadata_metrics.pl
#!/usr/bin/env perl
#
# Author: Bertrand Drouvot
# influxdb_exadata_metrics.pl : V1.0 (2016/05)
#
# Utility used to extract exadata metrics in InfluxDB line protocol
#
#----------------------------------------------------------------#

BEGIN {
die "ORACLE_HOME not set\n" unless $ENV{ORACLE_HOME};
unless ($ENV{OrAcLePeRl}) {
$ENV{OrAcLePeRl} = "$ENV{ORACLE_HOME}/perl";
$ENV{PERL5LIB} = "$ENV{PERL5LIB}:$ENV{OrAcLePeRl}/lib:$ENV{OrAcLePeRl}/lib/site_perl";
$ENV{LD_LIBRARY_PATH} = "$ENV{LD_LIBRARY_PATH}:$ENV{ORACLE_HOME}/lib32:$ENV{ORACLE_HOME}/lib";
exec "$ENV{OrAcLePeRl}/bin/perl", $0, @ARGV;
}
}

use strict;
use Time::Local;

#
# Variables
#
my $nbmatch=-1;
my $help=0;
my $goodparam=0;

my $dclicomm='';
my $groupcell_pattern='';
my $metrictype_pattern='ALL';

# Parameter parsing

foreach my $para (@ARGV) {

if ( $para =~ m/^help.*/i ) {
$nbmatch++;
$help=1;
}

if ( $para =~ m/^groupfile=(.*)$/i ) {
$nbmatch++;
$groupcell_pattern=$1;
$goodparam++;
}
}

# Check that the GROUPFILE parameter has been supplied exactly once

if (!$goodparam || $goodparam > 1) {
print "\n Error while processing parameters : GROUPFILE parameter is mandatory! \n\n" unless ($help);
$help=1;
}

# Print usage if an unknown parameter has been supplied or help requested
#
if ($nbmatch != $#ARGV || $help) {
print "\n Error while processing parameters \n\n" unless ($help);
print " \nUsage: $0 [groupfile=] \n\n";

printf (" %-25s %-60s %-10s \n",'Parameter','Comment','Default');
printf (" %-25s %-60s %-10s \n",'---------','-------','-------');
printf (" %-25s %-60s %-10s \n",'GROUPFILE=','file containing list of cells','');
print ("\n");
print ("utility assumes passwordless SSH from this node to the cell nodes\n");
print ("utility assumes ORACLE_HOME has been set \n");
print ("\n");
print ("Example : $0 groupfile=./cell_group\n");
print "\n\n";
exit 0;
}

# dcli command

$dclicomm="dcli -g ".$groupcell_pattern. " cellcli -e \"list metriccurrent attributes name,metricObjectName,metricValue,metricType where metricType !='cumulative'\"";

# Launch the dcli command
my $result=`$dclicomm` ;

if ( $? != 0 )
{
print "\n";
print "\n";
die "Something went wrong executing [$dclicomm]\n";
}

# Split the string into array
my @array_result = split(/\n/, $result);

foreach my $line ( @array_result ) {
# drop tab
$line =~ s/\t+/ /g;
# drop blanks
$line =~ s/\s+/ /g;

# Split each line into 6 fields on blanks
my @tab1 = split (/ +/,$line,6);

# Strip the : from the cell name
$tab1[0] =~ s/://;

# Strip the thousands separator (,) from the value
$tab1[3] =~ s/,//g;

# If there is no unit, the metric type lands in field 4:
# move it into place and set the unit to "N/A"
if (!defined($tab1[5]) or $tab1[5] eq "") { $tab1[5] = $tab1[4]; $tab1[4] = "N/A"; }

# Print
print "exadata_cell_metrics,cell=$tab1[0],metric_name=$tab1[1],metricObjectName=$tab1[2],metric_type=$tab1[5],metric_unit=$tab1[4] metric_value=$tab1[3]\n";
}

The output looks like this (the format is the InfluxDB line protocol):

$> perl ./influxdb_exadata_metrics.pl groupfile=./cells_group

exadata_cell_metrics,cell=cell1,metric_name=CD_BY_FC_DIRTY,metricObjectName=CD_disk01_cell12,metric_type=Instantaneous,metric_unit=MB metric_value=0.000
exadata_cell_metrics,cell=cell1,metric_name=CD_BY_FC_DIRTY,metricObjectName=CD_disk02_cell12,metric_type=Instantaneous,metric_unit=MB metric_value=0.000
exadata_cell_metrics,cell=cell1,metric_name=CD_BY_FC_DIRTY,metricObjectName=CD_disk03_cell12,metric_type=Instantaneous,metric_unit=MB metric_value=0.000
exadata_cell_metrics,cell=cell1,metric_name=CD_BY_FC_DIRTY,metricObjectName=CD_disk04_cell12,metric_type=Instantaneous,metric_unit=MB metric_value=0.000
exadata_cell_metrics,cell=cell1,metric_name=CD_BY_FC_DIRTY,metricObjectName=CD_disk05_cell12,metric_type=Instantaneous,metric_unit=MB metric_value=0.000
exadata_cell_metrics,cell=cell1,metric_name=CD_BY_FC_DIRTY,metricObjectName=CD_disk06_cell12,metric_type=Instantaneous,metric_unit=MB metric_value=0.000
exadata_cell_metrics,cell=cell1,metric_name=CD_BY_FC_DIRTY,metricObjectName=CD_disk07_cell12,metric_type=Instantaneous,metric_unit=MB metric_value=0.000
exadata_cell_metrics,cell=cell1,metric_name=CD_BY_FC_DIRTY,metricObjectName=CD_disk08_cell12,metric_type=Instantaneous,metric_unit=MB metric_value=0.000
exadata_cell_metrics,cell=cell1,metric_name=CD_BY_FC_DIRTY,metricObjectName=CD_disk09_cell12,metric_type=Instantaneous,metric_unit=MB metric_value=0.000
exadata_cell_metrics,cell=cell1,metric_name=CD_BY_FC_DIRTY,metricObjectName=CD_disk10_cell12,metric_type=Instantaneous,metric_unit=MB metric_value=0.000
exadata_cell_metrics,cell=cell1,metric_name=CD_BY_FC_DIRTY,metricObjectName=FD_00_cell12,metric_type=Instantaneous,metric_unit=MB metric_value=0.000
exadata_cell_metrics,cell=cell1,metric_name=CD_BY_FC_DIRTY,metricObjectName=FD_01_cell12,metric_type=Instantaneous,metric_unit=MB metric_value=0.000
exadata_cell_metrics,cell=cell1,metric_name=CD_IO_BY_R_LG_SEC,metricObjectName=CD_disk01_cell12,metric_type=Rate,metric_unit=MB/sec metric_value=0.000
exadata_cell_metrics,cell=cell1,metric_name=CD_IO_BY_R_LG_SEC,metricObjectName=CD_disk02_cell12,metric_type=Rate,metric_unit=MB/sec metric_value=0.000
exadata_cell_metrics,cell=cell1,metric_name=CD_IO_BY_R_LG_SEC,metricObjectName=CD_disk03_cell12,metric_type=Rate,metric_unit=MB/sec metric_value=0.000
exadata_cell_metrics,cell=cell1,metric_name=CD_IO_BY_R_LG_SEC,metricObjectName=CD_disk04_cell12,metric_type=Rate,metric_unit=MB/sec metric_value=0.000
exadata_cell_metrics,cell=cell1,metric_name=CD_IO_BY_R_LG_SEC,metricObjectName=CD_disk05_cell12,metric_type=Rate,metric_unit=MB/sec metric_value=0.000
exadata_cell_metrics,cell=cell1,metric_name=CD_IO_BY_R_LG_SEC,metricObjectName=CD_disk06_cell12,metric_type=Rate,metric_unit=MB/sec metric_value=0.000
exadata_cell_metrics,cell=cell1,metric_name=CD_IO_BY_R_LG_SEC,metricObjectName=CD_disk07_cell12,metric_type=Rate,metric_unit=MB/sec metric_value=0.000
exadata_cell_metrics,cell=cell1,metric_name=CD_IO_BY_R_LG_SEC,metricObjectName=CD_disk08_cell12,metric_type=Rate,metric_unit=MB/sec metric_value=0.000
exadata_cell_metrics,cell=cell1,metric_name=CD_IO_BY_R_LG_SEC,metricObjectName=CD_disk09_cell12,metric_type=Rate,metric_unit=MB/sec metric_value=0.000
exadata_cell_metrics,cell=cell1,metric_name=CD_IO_BY_R_LG_SEC,metricObjectName=CD_disk10_cell12,metric_type=Rate,metric_unit=MB/sec metric_value=0.000
exadata_cell_metrics,cell=cell1,metric_name=CD_IO_BY_R_LG_SEC,metricObjectName=FD_00_cell12,metric_type=Rate,metric_unit=MB/sec metric_value=0.000
exadata_cell_metrics,cell=cell1,metric_name=CD_IO_BY_R_LG_SEC,metricObjectName=FD_01_cell12,metric_type=Rate,metric_unit=MB/sec metric_value=0.000
.
.
.
  • The output has been cut for readability.
  • The GROUPFILE parameter is mandatory: this file contains the list of the cells you want the metrics to be collected from.
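As a sanity check, one output line can be decomposed as follows: in the InfluxDB line protocol, the measurement name and the comma-separated tags come first, then a space, then the field(s). This is a simplified breakdown for this script's output only (it does not handle escaped commas or spaces), not a general line-protocol parser:

```python
# Decompose one line emitted by influxdb_exadata_metrics.pl:
# <measurement>,<tag=value,...> <field=value>
line = ("exadata_cell_metrics,cell=cell1,metric_name=CD_BY_FC_DIRTY,"
        "metricObjectName=CD_disk01_cell12,metric_type=Instantaneous,"
        "metric_unit=MB metric_value=0.000")

tag_section, field_section = line.split(" ")
measurement, *tag_pairs = tag_section.split(",")
tags = dict(pair.split("=") for pair in tag_pairs)
fields = dict(pair.split("=") for pair in field_section.split(","))

print(measurement)             # exadata_cell_metrics
print(tags["cell"])            # cell1
print(fields["metric_value"])  # 0.000
```

This shows why the cell, metric name, object name, type and unit end up as InfluxDB tags (usable for grouping/filtering in grafana) while the value is the only field.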

On the database node, configure telegraf to execute the perl script at a 60-second interval and send the output to InfluxDB. Edit the /etc/telegraf/telegraf.conf file so that it contains:

###############################################################################
#                            OUTPUT PLUGINS                                   #
###############################################################################

# Configuration for influxdb server to send metrics to
[[outputs.influxdb]]
  urls = ["http://influxgraf:8086"] # required
  database = "telegraf" # required
  precision = "s"
  timeout = "5s"

###############################################################################
#                                  INPUTS                                     #
###############################################################################

# Exadata metrics
[[inputs.exec]]
  # Shell/commands array
  commands = ["/home/oracle/scripts/exadata_metrics.sh"]
  # Data format to consume. This can be "json", "influx" or "graphite" (line-protocol)
  # NOTE json only reads numerical measurements, strings and booleans are ignored.
  data_format = "influx"
  interval = "60s"

The exadata_metrics.sh script contains the call to the perl script:

$> cat exadata_metrics.sh
#!/usr/bin/env bash

export ORACLE_HOME=/u01/app/oracle/product/12.1.0/dbhome_1
export PATH=$PATH:/home/oracle/scripts

perl /home/oracle/scripts/influxdb_exadata_metrics.pl groupfile=/home/oracle/scripts/cells_group

Now you can connect to grafana and create a new dashboard with the Exadata cell metrics the way you want to.

Example

[Screenshot: example grafana dashboard graphing Exadata cell metrics]

Remark

Nothing has been installed on the cells to collect those metrics with telegraf.

Conclusion

Thanks to:

  • telegraf and InfluxDB, we are able to collect and store the Exadata cell metrics we want to.
  • grafana, we are able to visualise the Exadata cell metrics the way we want to.

Exadata cell metrics: Join the metrics and their descriptions on the fly

Introduction

Cell metrics are very useful but their names are not so friendly. Each name is a concatenation of abbreviations for the type of component, delimited by the underscore character, so you have to understand the naming convention to understand the meaning of a metric name.

For example, knowing that:

  • CD stands for “Cell Disks metrics”
  • IO_BY stands for “Number of megabytes”
  • R stands for Read
  • LG stands for Large

You can conclude that the CD_IO_BY_R_LG metric is linked to the “Number of megabytes read in large blocks from a cell disk”.
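To illustrate the convention, here is a hypothetical little decoder that expands a metric name using the abbreviations listed above (the map below contains only the four abbreviations from this post; the full convention is in the Oracle documentation):

```python
# Illustrative decoder for the cell metric naming convention.
# ABBREV covers only the abbreviations mentioned in this post.
ABBREV = {
    "CD": "Cell Disks metrics",
    "IO_BY": "Number of megabytes",
    "R": "Read",
    "LG": "Large",
}

def explain(metric_name):
    """Greedily match the longest known abbreviation at each position."""
    parts = metric_name.split("_")
    out, i = [], 0
    while i < len(parts):
        two = "_".join(parts[i:i + 2])  # try two-token abbrevs (IO_BY) first
        if two in ABBREV:
            out.append(ABBREV[two]); i += 2
        elif parts[i] in ABBREV:
            out.append(ABBREV[parts[i]]); i += 1
        else:
            out.append(parts[i]); i += 1  # unknown token kept as-is
    return " / ".join(out)

print(explain("CD_IO_BY_R_LG"))
# Cell Disks metrics / Number of megabytes / Read / Large
```

In practice you would not maintain such a map by hand; the next sections show how to pull the official descriptions from cellcli instead.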

Fortunately, the metrics are explained in the Oracle documentation and you can also retrieve their description from cellcli:

$ cellcli -e "list metricdefinition attributes name,description where name='CD_IO_BY_R_LG'"

CD_IO_BY_R_LG	 "Number of megabytes read in large blocks from a cell disk"

Lack of description

That said, as an example, let’s query the current metric values for a particular database that way:

$ cellcli -e "list metriccurrent attributes metricObjectName,name,metricValue where name like 'DB.*' and metricObjectName='BDT'"

	 BDT	 DB_FC_IO_BY_SEC   	 0 MB/sec
	 BDT	 DB_FC_IO_RQ       	 47,638 IO requests
	 BDT	 DB_FC_IO_RQ_SEC   	 2.1 IO/sec
	 BDT	 DB_FD_IO_BY_SEC   	 0 MB/sec
	 BDT	 DB_FD_IO_LOAD     	 19,885
	 BDT	 DB_FD_IO_RQ_LG    	 36 IO requests
	 BDT	 DB_FD_IO_RQ_LG_SEC	 0.0 IO/sec
	 BDT	 DB_FD_IO_RQ_SM    	 47,602 IO requests
	 BDT	 DB_FD_IO_RQ_SM_SEC	 2.1 IO/sec
	 BDT	 DB_FL_IO_BY       	 0.000 MB
	 BDT	 DB_FL_IO_BY_SEC   	 0.000 MB/sec
	 BDT	 DB_FL_IO_RQ       	 0 IO requests
	 BDT	 DB_FL_IO_RQ_SEC   	 0.0 IO/sec
	 BDT	 DB_IO_BY_SEC      	 0 MB/sec
	 BDT	 DB_IO_LOAD        	 0.0
	 BDT	 DB_IO_RQ_LG       	 0 IO requests
	 BDT	 DB_IO_RQ_LG_SEC   	 0.0 IO/sec
	 BDT	 DB_IO_RQ_SM       	 19 IO requests
	 BDT	 DB_IO_RQ_SM_SEC   	 0.0 IO/sec
	 BDT	 DB_IO_UTIL_LG     	 0 %
	 BDT	 DB_IO_UTIL_SM     	 0 %
	 BDT	 DB_IO_WT_LG       	 0 ms
	 BDT	 DB_IO_WT_LG_RQ    	 0.0 ms/request
	 BDT	 DB_IO_WT_SM       	 0 ms
	 BDT	 DB_IO_WT_SM_RQ    	 0.0 ms/request

As you can see, the metric description is not there, and there is no way to retrieve it from metriccurrent (or metrichistory) because it is not an attribute:

$ cellcli -e "describe metriccurrent"
	name
	alertState
	collectionTime
	metricObjectName
	metricType
	metricValue
	objectType

$ cellcli -e "describe metrichistory"
	name
	alertState
	collectionTime
	metricObjectName
	metricType
	metricValue
	metricValueAvg
	metricValueMax
	metricValueMin
	objectType

But if you send the result of our example to someone who doesn't know (or doesn't remember) the naming convention (or if you are not 100% sure of the definition of a particular metric), then they/you will have to:

  • go back to the Oracle documentation
  • query the metricdefinition with cellcli

New script: exadata_metrics_desc.pl

Thanks to the exadata_metrics_desc.pl script, you can add the description of each metric to the cellcli output on the fly.

Let’s launch the same query (as the one used in the previous example) and add a call to exadata_metrics_desc.pl that way:

$ cellcli -e "list metriccurrent attributes metricObjectName,name,metricValue where name like 'DB.*' and metricObjectName='BDT'" | ./exadata_metrics_desc.pl

  BDT   DB_FC_IO_BY_SEC (Number of megabytes of I/O per second for this database to flash cache)          0 MB/sec
  BDT   DB_FC_IO_RQ (Number of IO requests issued by a database to flash cache)                           48,123 IO requests
  BDT   DB_FC_IO_RQ_SEC (Number of IO requests issued by a database to flash cache per second)            2.1 IO/sec
  BDT   DB_FD_IO_BY_SEC (Number of megabytes of I/O per second for this database to flash disks)          0 MB/sec
  BDT   DB_FD_IO_LOAD (Average I/O load from this database for flash disks)                               4,419
  BDT   DB_FD_IO_RQ_LG (Number of large IO requests issued by a database to flash disks)                  36 IO requests
  BDT   DB_FD_IO_RQ_LG_SEC (Number of large IO requests issued by a database to flash disks per second)   0.0 IO/sec
  BDT   DB_FD_IO_RQ_SM (Number of small IO requests issued by a database to flash disks)                  48,087 IO requests
  BDT   DB_FD_IO_RQ_SM_SEC (Number of small IO requests issued by a database to flash disks per second)   2.1 IO/sec
  BDT   DB_FL_IO_BY (The number of MB written to the Flash Log)                                           0.000 MB
  BDT   DB_FL_IO_BY_SEC (The number of MB written per second to the Flash Log)                            0.000 MB/sec
  BDT   DB_FL_IO_RQ (The number of I/O requests issued to the Flash Log)                                  0 IO requests
  BDT   DB_FL_IO_RQ_SEC (The number of I/O requests per second issued to the Flash Log)                   0.0 IO/sec
  BDT   DB_IO_BY_SEC (Number of megabytes of I/O per second for this database to hard disks)              0 MB/sec
  BDT   DB_IO_LOAD (Average I/O load from this database for hard disks)                                   0.0
  BDT   DB_IO_RQ_LG (Number of large IO requests issued by a database to hard disks)                      0 IO requests
  BDT   DB_IO_RQ_LG_SEC (Number of large IO requests issued by a database to hard disks per second)       0.0 IO/sec
  BDT   DB_IO_RQ_SM (Number of small IO requests issued by a database to hard disks)                      19 IO requests
  BDT   DB_IO_RQ_SM_SEC (Number of small IO requests issued by a database to hard disks per second)       0.0 IO/sec
  BDT   DB_IO_UTIL_LG (Percentage of disk resources utilized by large requests from this database)        0 %
  BDT   DB_IO_UTIL_SM (Percentage of disk resources utilized by small requests from this database)        0 %
  BDT   DB_IO_WT_LG (IORM wait time for large IO requests issued by a database)                           0 ms
  BDT   DB_IO_WT_LG_RQ (Average IORM wait time per request for large IO requests issued by a database)    0.0 ms/request
  BDT   DB_IO_WT_SM (IORM wait time for small IO requests issued by a database)                           0 ms
  BDT   DB_IO_WT_SM_RQ (Average IORM wait time per request for small IO requests issued by a database)    0.0 ms/request

As you can see, the description of each metric in the initial output has been added.
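For illustration only, the join logic can be sketched like this (a hypothetical, simplified version: one description is hardcoded here, whereas the real perl script loads all of them from cellcli's metricdefinition object):

```python
# Simplified sketch of what exadata_metrics_desc.pl does: annotate any
# known metric name found in the input with its description. The real
# script would load the name -> description pairs with:
#   cellcli -e "list metricdefinition attributes name,description"
import re

DESCRIPTIONS = {
    "DB_FC_IO_RQ": "Number of IO requests issued by a database to flash cache",
}

def annotate(line):
    for name, desc in DESCRIPTIONS.items():
        # \b prevents DB_FC_IO_RQ from also matching inside DB_FC_IO_RQ_SEC
        line = re.sub(rf"\b{re.escape(name)}\b", f"{name} ({desc})", line)
    return line

print(annotate("BDT  DB_FC_IO_RQ  47,638 IO requests"))
```

Because the join happens on the text stream, any input containing metric names can be annotated, which is why the script also works on plain text files (see the remarks below).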

Remarks

  • You can download the script from this repository or from GitHub.
  • Feel free to build the query you want on the metrics. You just need to add a call to exadata_metrics_desc.pl to see the metric description being added on the fly (as long as the metric name appears in the output of your initial query).
  • The idea for this script is entirely to Martin Bach's credit.
  • This script works with any input (could be a text file):

[Screenshot: exadata_metrics_desc.pl annotating the content of a text file]

Conclusion

The exadata_metrics_desc.pl script can be used to join, on the fly, the metric name and its value (and whatever other attributes you would love to see) with the associated description.

Extract Exadata cells historical metrics in CSV format

Exadata provides a lot of useful metrics to monitor the cells, and you may want to retrieve historical values for some of them. To do so, you can use the "LIST METRICHISTORY" command through CellCLI on the cell.

But as usual, visualising the metrics is even better. For this purpose, you can use a perl script (see the download link in the remarks section) that extracts the historical metrics in CSV format so that you can graph them with the visualisation tool of your choice.
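As a sketch of the output format only (the real perl script also handles the SSH fan-out to the cells and the parsing of the cellcli output), the CSV rendering amounts to something like this, with hypothetical sample values:

```python
# Render already-parsed metric history samples in the same
# semicolon-separated layout as csv_exadata_metrics_history.pl.
import csv, io

HEADER = ["Cell", "metricType", "DateTime", "name", "objectname", "value", "unit"]

samples = [  # hypothetical parsed values
    ("enkx3cel01", "Instantaneous", "2015-07-01T08:57:59-05:00",
     "DB_FC_BY_ALLOCATED", "ASM", "15,779", "MB"),
]

out = io.StringIO()
writer = csv.writer(out, delimiter=";", lineterminator="\n")
writer.writerow(HEADER)
writer.writerows(samples)
print(out.getvalue(), end="")
```

The semicolon delimiter is what allows the metric values to keep their thousands separators (commas) without breaking the CSV.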

Let’s see the help:

Usage: ./csv_exadata_metrics_history.pl [cell=|groupfile=] [serial] [type=] [name=] [objectname=] [name!=] [objectname!=] [ago_unit=] [ago_value=]

 Parameter                 Comment                                                      Default
 ---------                 -------                                                      -------
 CELL=                     comma-separated list of cells
 GROUPFILE=                file containing list of cells
 SERIAL                    serialize execution over the cells (default is no)
 TYPE=                     Metrics type to extract: Cumulative|Rate|Instantaneous       ALL
 NAME=                     Metrics to extract (wildcard allowed)                        ALL
 OBJECTNAME=               Objects to extract (wildcard allowed)                        ALL
 NAME!=                    Exclude metrics (wildcard allowed)                           EMPTY
 OBJECTNAME!=              Exclude objects (wildcard allowed)                           EMPTY
 AGO_UNIT=                 Unit to retrieve historical metrics back: day|hour|minute    HOUR
 AGO_VALUE=                Value associated to Unit to retrieve historical metrics back 1

utility assumes passwordless SSH from this cell node to the other cell nodes
utility assumes ORACLE_HOME has been set (with celladmin user for example)

Example : ./csv_exadata_metrics_history.pl cell=cell
Example : ./csv_exadata_metrics_history.pl groupfile=./cell_group
Example : ./csv_exadata_metrics_history.pl cell=cell objectname='CD_disk03_cell'
Example : ./csv_exadata_metrics_history.pl cell=cell name='.*BY.*' objectname='.*disk.*'
Example : ./csv_exadata_metrics_history.pl cell=enkcel02 name='.*DB_IO.*' objectname!='ASM' name!='.*RQ.*' ago_unit=minute ago_value=4
Example : ./csv_exadata_metrics_history.pl cell=enkcel02 type='Instantaneous' name='.*DB_IO.*' objectname!='ASM' name!='.*RQ.*' ago_unit=hour ago_value=4
Example : ./csv_exadata_metrics_history.pl cell=enkcel01,enkcel02 type='Instantaneous' name='.*DB_IO.*' objectname!='ASM' name!='.*RQ.*' ago_unit=minute ago_value=4 serial

You have to set up passwordless SSH from one cell to the other cells (then you can launch the script from this cell and retrieve data from the other cells).

The main options/features are:

  1. You can specify the cells on which you want to collect the metrics thanks to the cell or groupfile parameter.
  2. You can choose to serialize the execution over the cells thanks to the serial parameter.
  3. You can choose the type of metrics you want to retrieve (Cumulative, rate or instantaneous) thanks to the type parameter.
  4. You can focus on some metrics thanks to the name parameter (wildcard allowed).
  5. You can exclude some metrics thanks to the name! parameter (wildcard allowed).
  6. You can focus on some metricobjectname thanks to the objectname parameter (wildcard allowed).
  7. You can exclude some metricobjectname thanks to the objectname! parameter (wildcard allowed).
  8. You can choose the unit to retrieve metrics back (day, hour, minute) thanks to the ago_unit parameter.
  9. You can choose the value associated to the unit to retrieve metrics back thanks to the ago_value parameter.

Let’s see an example:

I want to retrieve, in CSV format, the database-related metrics from 2 cells for the last 20 minutes:

$> ./csv_exadata_metrics_history.pl cell=enkx3cel01,enkx3cel02 name='DB_.*' ago_unit=minute ago_value=20
Cell;metricType;DateTime;name;objectname;value;unit
enkx3cel01;Instantaneous;2015-07-01T08:57:59-05:00;DB_FC_BY_ALLOCATED;ACSTBY;0.000;MB
enkx3cel01;Instantaneous;2015-07-01T08:57:59-05:00;DB_FC_BY_ALLOCATED;ASM;15,779;MB
enkx3cel01;Instantaneous;2015-07-01T08:57:59-05:00;DB_FC_BY_ALLOCATED;BDT;0.000;MB
enkx3cel01;Instantaneous;2015-07-01T08:57:59-05:00;DB_FC_BY_ALLOCATED;BIGDATA;0.000;MB
enkx3cel01;Instantaneous;2015-07-01T08:57:59-05:00;DB_FC_BY_ALLOCATED;DBFS;0.000;MB
enkx3cel01;Instantaneous;2015-07-01T08:57:59-05:00;DB_FC_BY_ALLOCATED;DBM;15,779;MB
enkx3cel01;Instantaneous;2015-07-01T08:57:59-05:00;DB_FC_BY_ALLOCATED;DEMO;794,329;MB
enkx3cel01;Instantaneous;2015-07-01T08:57:59-05:00;DB_FC_BY_ALLOCATED;DEMOX3;0.000;MB
enkx3cel01;Instantaneous;2015-07-01T08:57:59-05:00;DB_FC_BY_ALLOCATED;EXDB;0.000;MB
enkx3cel01;Instantaneous;2015-07-01T08:57:59-05:00;DB_FC_BY_ALLOCATED;WZSDB;0.000;MB
enkx3cel01;Instantaneous;2015-07-01T08:57:59-05:00;DB_FC_BY_ALLOCATED;_OTHER_DATABASE_;48,764;MB
enkx3cel01;Instantaneous;2015-07-01T08:57:59-05:00;DB_FC_IO_BY_SEC;ACSTBY;0;MB/sec
enkx3cel01;Instantaneous;2015-07-01T08:57:59-05:00;DB_FC_IO_BY_SEC;ASM;0;MB/sec
enkx3cel01;Instantaneous;2015-07-01T08:57:59-05:00;DB_FC_IO_BY_SEC;BDT;0;MB/sec
enkx3cel01;Instantaneous;2015-07-01T08:57:59-05:00;DB_FC_IO_BY_SEC;BIGDATA;0;MB/sec
enkx3cel01;Instantaneous;2015-07-01T08:57:59-05:00;DB_FC_IO_BY_SEC;DBFS;0;MB/sec
enkx3cel01;Instantaneous;2015-07-01T08:57:59-05:00;DB_FC_IO_BY_SEC;DBM;15;MB/sec
enkx3cel01;Instantaneous;2015-07-01T08:57:59-05:00;DB_FC_IO_BY_SEC;DEMO;0;MB/sec
enkx3cel01;Instantaneous;2015-07-01T08:57:59-05:00;DB_FC_IO_BY_SEC;DEMOX3;0;MB/sec
enkx3cel01;Instantaneous;2015-07-01T08:57:59-05:00;DB_FC_IO_BY_SEC;EXDB;0;MB/sec
enkx3cel01;Instantaneous;2015-07-01T08:57:59-05:00;DB_FC_IO_BY_SEC;WZSDB;0;MB/sec
enkx3cel01;Instantaneous;2015-07-01T08:57:59-05:00;DB_FC_IO_BY_SEC;_OTHER_DATABASE_;0;MB/sec
enkx3cel01;Cumulative;2015-07-01T08:57:59-05:00;DB_FC_IO_RQ;ACSTBY;2,318;IO requests
enkx3cel01;Cumulative;2015-07-01T08:57:59-05:00;DB_FC_IO_RQ;ASM;0;IO requests
enkx3cel01;Cumulative;2015-07-01T08:57:59-05:00;DB_FC_IO_RQ;BDT;2,966;IO requests
enkx3cel01;Cumulative;2015-07-01T08:57:59-05:00;DB_FC_IO_RQ;BIGDATA;25,415;IO requests
enkx3cel01;Cumulative;2015-07-01T08:57:59-05:00;DB_FC_IO_RQ;DBFS;3,489;IO requests
enkx3cel01;Cumulative;2015-07-01T08:57:59-05:00;DB_FC_IO_RQ;DBM;1,627,066;IO requests
enkx3cel01;Cumulative;2015-07-01T08:57:59-05:00;DB_FC_IO_RQ;DEMO;4,506;IO requests
enkx3cel01;Cumulative;2015-07-01T08:57:59-05:00;DB_FC_IO_RQ;DEMOX3;4,172;IO requests
enkx3cel01;Cumulative;2015-07-01T08:57:59-05:00;DB_FC_IO_RQ;EXDB;0;IO requests
enkx3cel01;Cumulative;2015-07-01T08:57:59-05:00;DB_FC_IO_RQ;WZSDB;4,378;IO requests
enkx3cel01;Cumulative;2015-07-01T08:57:59-05:00;DB_FC_IO_RQ;_OTHER_DATABASE_;6,227;IO requests
enkx3cel01;Cumulative;2015-07-01T08:57:59-05:00;DB_FC_IO_RQ_LG;ACSTBY;0;IO requests
.
.
.

That way, you can visualise your data the way you feel comfortable with. For example, I used Tableau to create this "Database metrics dashboard" on the DB_* rate metrics:

[Screenshot: Tableau "Database metrics dashboard" built on the DB_* rate metrics]

Remarks:

  • If you retrieve too much data, you could receive something like:
Error: enkx3cel02 is returning over 100000 lines; output is truncated !!!
 Command could be retried with the serialize option: --serial
 Killing child pid 15720 to enkx3cel02...

Then, you can launch the script with the serial option (see the help).

Conclusion:

You probably already have a way to build your own graph of the historical metrics. But if you don’t, feel free to use this script and the visualisation tool of your choice.

ASM Bytes Read and Smart Scan

Introduction

I like to query the v$asm_disk_iostat cumulative view (at the ASM instance level) as it is a centralized location where you can find metrics for all the databases the ASM instance is servicing. One of its metrics is:

BYTES_READ: Total number of bytes read from the disk (see the oracle documentation).

I would like to see which value is recorded in this metric in case of smart scan. To do so, I’ll launch some tests and check the output of the asm_metrics utility: it basically takes a snapshot each second (the default interval) of the gv$asm_disk_iostat cumulative view and computes the delta with the previous snapshot.
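The delta logic that asm_metrics applies to the cumulative view can be sketched as follows (the byte counts below are made up for the illustration):

```python
# BYTES_READ is cumulative, so a rate is obtained by differencing two
# snapshots taken interval_s seconds apart (asm_metrics defaults to 1s).
def kb_read_per_sec(prev_bytes_read, curr_bytes_read, interval_s=1):
    return (curr_bytes_read - prev_bytes_read) / 1024 / interval_s

snap_t0 = 1_000_000_000          # cumulative BYTES_READ at t0 (made up)
snap_t1 = snap_t0 + 906_673_152  # one second later (made up)

print(kb_read_per_sec(snap_t0, snap_t1))  # 885423.0
```

This is why the "Kby Read/s" figures below can be multiplied by the elapsed time to recover the total volume recorded in BYTES_READ.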

Environment

ASM and database versions are 11.2.0.4. A segment of about 19.5 GB has been created without any indexes, so that a full table scan is triggered during the tests.

Tests

During the tests:

  • This simple query is launched:
select id from bdt where id=1;
  • Then, the asm metrics will be recorded that way:
./asm_metrics.pl -show=dbinst,fg,inst -dbinst=BDT2 -inst=+ASM2 -display=avg
  • We’ll look at the output field “Kby Read/s” which is based on the BYTES_READ column coming from the v$asm_disk_iostat cumulative view.
  • The SQL elapsed time and the percentage of IO saved by the offload (if any) will be checked with a query that looks like fsx.sql. By "percentage of IO saved" I mean the percentage of data that did not have to be sent back to the database by the storage cells, relative to the amount that would have had to be received on non-Exadata storage.

Test 1: Launch the query without offload

BDT:BDT2> alter session set cell_offload_processing=false;

Session altered.

BDT:BDT2> select /* NO_SMART_SCAN_NO_STORAGEIDX */ id from bdt where id=1;

The elapsed time and the percentage of IO saved are:

SQL_ID	       CHILD   PLAN_HASH  EXECS  AVG_ETIME AVG_PX OFFLOAD ELIGIBLE_MB  INTERCO_MB OFF_RETU_MB IO_SAVED_% SQL_TEXT
------------- ------ ----------- ------ ---------- ------ ------- ----------- ----------- ----------- ---------- ----------------------------------------------------------------------
7y7xa34jjab2q	   0   627556429      1      23.11	0 No		    0 19413.42969	    0	     .00 select /* NO_SMART_SCAN_NO_STORAGEIDX */ id from bdt where id=1

The elapsed time of the sql is 23.11 seconds and obviously no IO has been saved by offload. As you can see, about 19.5 GB has been exchanged between the Oracle database and the storage system (INTERCO_MB is based on the IO_INTERCONNECT_BYTES column from v$sql).

The asm_metrics output is the following:

06:39:47                                                                            Kby       Avg       AvgBy/               Kby       Avg        AvgBy/
06:39:47   INST     DBINST        DG            FG           DSK          Reads/s   Read/s    ms/Read   Read      Writes/s   Write/s   ms/Write   Write
06:39:47   ------   -----------   -----------   ----------   ----------   -------   -------   -------   ------    ------     -------   --------   ------
06:39:47   +ASM2                                                          869       885423    3.7       1042916   2          24        0.9        15477
06:39:47   +ASM2    BDT2                                                  869       885423    3.7       1042916   2          24        0.9        15477
06:39:47   +ASM2    BDT2                        ENKCEL01                  300       305865    3.7       1044812   1          8         1.2        15061
06:39:47   +ASM2    BDT2                        ENKCEL02                  330       335901    3.7       1041451   1          8         0.8        15061
06:39:47   +ASM2    BDT2                        ENKCEL03                  239       243657    3.6       1042564   0          8         0.6        16384

So the average Kby Read per second is 885,423.

Then we can conclude that the BYTES_READ column records that about 885,423 KB/s * 23.11 s ≈ 20.5 million KB, that is about 19.5 GB, has been read from disk.

Does it make sense? Yes.
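The arithmetic can be double-checked quickly (KB/s times elapsed seconds, converted to GB):

```python
# Cross-check: average read rate times elapsed time should match
# the ~19.5 GB segment size.
kb_per_sec = 885_423  # "Kby Read/s" reported by asm_metrics
elapsed_s = 23.11     # AVG_ETIME reported by the fsx.sql-like query
gb_read = kb_per_sec * elapsed_s / 1024 / 1024
print(f"{gb_read:.1f} GB")  # 19.5 GB
```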

Test 2: Offload and no storage indexes

BDT:BDT2> alter session set cell_offload_processing=true;

Session altered.

BDT:BDT2> alter session set "_kcfis_storageidx_disabled"=true;

Session altered.

BDT:BDT2> select /* WITH_SMART_SCAN_NO_STORAGEIDX */ id from bdt where id=1;

The elapsed time and the percentage of IO saved are:

SQL_ID	       CHILD   PLAN_HASH  EXECS  AVG_ETIME AVG_PX OFFLOAD ELIGIBLE_MB  INTERCO_MB OFF_RETU_MB IO_SAVED_% SQL_TEXT
------------- ------ ----------- ------ ---------- ------ ------- ----------- ----------- ----------- ---------- ----------------------------------------------------------------------
5zzvpyn94b05g	   0   627556429      1       4.11	0 Yes	  19413.42969 2.860450745 2.860450745	   99.99 select /* WITH_SMART_SCAN_NO_STORAGEIDX */ id from bdt where id=1

The elapsed time of the sql is 4.11 seconds and 99.99% of the IO has been saved by offload. About 2.8 MB have been exchanged between the Oracle database and the storage system.
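Assuming IO_SAVED_% is derived from the eligible and interconnect volumes as (eligible - interconnect) / eligible (which is how fsx.sql-style queries typically compute it), the 99.99% figure can be reproduced:

```python
# Reproduce IO_SAVED_% from the ELIGIBLE_MB and INTERCO_MB columns above,
# assuming it is computed as (eligible - interconnect) / eligible.
eligible_mb = 19413.42969      # ELIGIBLE_MB
interconnect_mb = 2.860450745  # INTERCO_MB
io_saved_pct = (eligible_mb - interconnect_mb) / eligible_mb * 100
print(f"{io_saved_pct:.2f}")  # 99.99
```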

The asm_metrics output is the following:

06:41:54                                                                            Kby       Avg       AvgBy/               Kby       Avg        AvgBy/
06:41:54   INST     DBINST        DG            FG           DSK          Reads/s   Read/s    ms/Read   Read      Writes/s   Write/s   ms/Write   Write
06:41:54   ------   -----------   -----------   ----------   ----------   -------   -------   -------   ------    ------     -------   --------   ------
06:41:54   +ASM2                                                          4862      4969898   0.0       1046671   1          12        6.4        16384
06:41:54   +ASM2    BDT2                                                  4862      4969898   0.0       1046671   1          12        6.4        16384
06:41:54   +ASM2    BDT2                        ENKCEL01                  1678      1715380   0.0       1047123   0          4         17.2       16384
06:41:54   +ASM2    BDT2                        ENKCEL02                  1844      1883738   0.0       1046067   0          4         1.0        16384
06:41:54   +ASM2    BDT2                        ENKCEL03                  1341      1370780   0.0       1046935   0          4         1.0        16384

So the average Kby Read per second is 4,969,898.

Then we can conclude that the BYTES_READ column records that about 4,969,898 KB/s * 4.11 s ≈ 20.4 million KB, that is about 19.5 GB, has been read from disk.

Does it make sense? I would say yes, because the storage indexes haven't been used. So, during the smart scan, all the data blocks have been read in the cells in order to extract and send back to the database layer the requested columns (column projection) for the selected rows (row filtering).

Test 3: Offload and storage indexes

BDT:BDT2> alter session set "_kcfis_storageidx_disabled"=false;

Session altered.

BDT:BDT2> select /* WITH_SMART_SCAN_AND_STORAGEIDX */ id from bdt where id=1;

The elapsed time and the percentage of IO saved are:

SQL_ID	       CHILD   PLAN_HASH  EXECS  AVG_ETIME AVG_PX OFFLOAD ELIGIBLE_MB  INTERCO_MB OFF_RETU_MB IO_SAVED_% SQL_TEXT
------------- ------ ----------- ------ ---------- ------ ------- ----------- ----------- ----------- ---------- ----------------------------------------------------------------------
3jdpqa2s4bb0v	   0   627556429      1        .09	0 Yes	  19413.42969  .062171936  .062171936	  100.00 select /* WITH_SMART_SCAN_AND_STORAGEIDX */ id from bdt where id=1

The elapsed time of the sql is 0.09 seconds and about 100 % of IO has been saved by offload. About 0.06 mb have been exchanged between the Oracle Database and the storage system.

The storage indexes saved a lot of reads (almost the whole table):

BDT:BDT2> l
  1* select n.name, s.value from v$statname n, v$mystat s where n.name='cell physical IO bytes saved by storage index' and n.STATISTIC#=s.STATISTIC#
BDT:BDT2> /
NAME							     VALUE
-------------------------------------------------- ---------------
cell physical IO bytes saved by storage index	       20215947264

The asm_metrics output is the following:

06:43:58                                                                            Kby        Avg       AvgBy/               Kby       Avg        AvgBy/
06:43:58   INST     DBINST        DG            FG           DSK          Reads/s   Read/s     ms/Read   Read      Writes/s   Write/s   ms/Write   Write
06:43:58   ------   -----------   -----------   ----------   ----------   -------   -------    -------   ------    ------     -------   --------   ------
06:43:58   +ASM2                                                          19436     19879384   0.0       1047360   0          0         0.0        0
06:43:58   +ASM2    BDT2                                                  19436     19879384   0.0       1047360   0          0         0.0        0
06:43:58   +ASM2    BDT2                        ENKCEL01                  6708      6861488    0.0       1047430   0          0         0.0        0
06:43:58   +ASM2    BDT2                        ENKCEL02                  7369      7534840    0.0       1047045   0          0         0.0        0
06:43:58   +ASM2    BDT2                        ENKCEL03                  5359      5483056    0.0       1047705   0          0         0.0        0

So the average Kby Read per second is 19879384.

As the asm_metrics utility collects the data with a one-second granularity and the elapsed time is < 1s, we can conclude that the BYTES_READ column records that about 19.5 GB have been read from disk.

Does it make sense? I would say no, because during the smart scan, thanks to the storage indexes, not all the data blocks have been opened in the cells in order to extract the requested column (column projection) on the selected rows (row filtering).

Remark

You could also query the v$asm_disk_iostat view and measure the delta for the BYTES_READ column on your own (that is, without the asm_metrics utility). The results would be the same.

Conclusion

The BYTES_READ column displays:

  • the total number of bytes read from disk (and also transferred to the database) without smart scan.
  • the total number of bytes read from disk (but not the bytes transferred to the database) with smart scan and no storage indexes being used.
  • neither the total number of bytes read from disk nor the bytes transferred to the database with smart scan and storage indexes being used.

 

exadata_metrics.pl: “Month ’12’ out of range 0..11” dealing with the collectionTime attribute

Yesterday was the first of December and I detected an issue with the exadata_metrics.pl script when dealing with the collectionTime attribute.

It produced:

Month '12' out of range 0..11 at ./exadata_metrics.pl line 255

This is due to the fact that when using perl localtime/timegm the valid range for a month is 0-11, with 0 indicating January and 11 indicating December (while the collectionTime attribute uses 1-12).

The exadata_metrics.pl script has been updated to take care of this rule (It can be downloaded from this repository).
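The delta computation that tripped on this can be sketched as follows (a Python sketch of the equivalent logic; the original script is Perl, where the month parsed from collectionTime must be decremented to the 0-11 range before calling timegm):

```python
from datetime import datetime

def collection_time_to_epoch(ts):
    """Convert a collectionTime value (ISO-8601 with UTC offset) to epoch seconds.

    Python's datetime uses 1-12 months natively, so no adjustment is needed here;
    in Perl, the month must be decremented (1-12 -> 0-11) before timegm().
    """
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S%z").timestamp()

# Delta in seconds between two snapshots' collectionTime (December works too):
delta = collection_time_to_epoch("2013-12-01T10:01:00+01:00") \
      - collection_time_to_epoch("2013-12-01T10:00:00+01:00")
print(delta)  # 60.0
```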

Remarks:

  1. Without this update the script did not produce wrong values for the DELTA(s) field: it was simply not possible to launch the script during December.
  2. Guess when I discovered the issue? While I was speaking about my exadata_metrics.pl script during the UKOUG TECH13 conference. Live demos never work as expected :-), but with the help of some attendees (big thanks to Martin Nash) we managed to work around the issue during the live demo.

exadata_metrics.pl: New feature to collect real-time metrics extracted from cumulative metrics

I am a big fan of real-time metrics extraction based on the cumulative metrics: that is to say, take a snapshot of the cumulative metrics each second (default interval) and compute the delta with the previous snapshot.

I built such tools for ASM metrics and for Exadata cell metrics.

Recently, I added a new feature to the asm_metrics.pl script: the ability to display the average delta values since the collection began (that is, since the script was launched).

Well, I just added this new feature to the exadata_metrics.pl script.

Of course the old features remain:

  • You can choose the number of snapshots to display and the time to wait between the snapshots.
  • You can choose to filter on name and objectname based on predicates (see the help).
  • You can work on all the cells or a subset thanks to the CELL or the GROUPFILE parameter.
  • You can decide the way to compute the metrics with no aggregation, aggregation on cell, objectname or both.

Let’s see the help:

 ./exadata_metrics.pl help

Usage: ./exadata_metrics.pl [Interval [Count]] [cell=|groupfile=] [display=] [show=] [top=] [name=] [name!=] [objectname=] [objectname!=]

 Default Interval : 1 second.
 Default Count : Unlimited

 Parameter                 Comment                                                      Default
 ---------                 -------                                                      -------
 CELL=                     comma-separated list of cells
 GROUPFILE=                file containing list of cells
 SHOW=                     What to show (name included): cell,objectname                ALL
 DISPLAY=                  What to display: snap,avg (comma separated list)             SNAP
 TOP=                      Number of rows to display                                    10
 NAME=                     ALL - Show all cumulative metrics (wildcard allowed)         ALL
 NAME!=                    Exclude cumulative metrics (wildcard allowed)                EMPTY
 OBJECTNAME=               ALL - Show all objects (wildcard allowed)                    ALL
 OBJECTNAME!=              Exclude objects (wildcard allowed)                           EMPTY

utility assumes passwordless SSH from this cell node to the other cell nodes
utility assumes ORACLE_HOME has been set (with celladmin user for example)

Example : ./exadata_metrics.pl cell=cell
Example : ./exadata_metrics.pl groupfile=./cell_group
Example : ./exadata_metrics.pl groupfile=./cell_group show=name
Example : ./exadata_metrics.pl cell=cell objectname='CD_disk03_cell' name!='.*RQ_W.*'
Example : ./exadata_metrics.pl cell=cell name='.*BY.*' objectname='.*disk.*' name!='GD.*' objectname!='.*disk1.*'

The “display” option is the new one: It allows you to display the delta snap values (as previously), the average delta values since the collection began (that is to say since the script has been launched) or both.

Let’s see one example:

I want to extract real-time metrics from the %IO_TM_R% cumulative ones, for 2 cells, and aggregate the results over all the cells and all the objectnames (that is, I just want to show the metric names). I want to see each snapshot and the average since the script was launched.

So, I launch the script that way:

./exadata_metrics.pl cell=exacell1,exacell2 display=avg,snap name='.*IO_TM_R.*' show=name

It will produce this kind of output:

--------------------------------------
----------COLLECTING DATA-------------
--------------------------------------

......... SNAP FOR LAST COLLECTION TIME ...................

DELTA(s)   CELL                    NAME                         OBJECTNAME                                                  VALUE
--------   ----                    ----                         ----------                                                  -----
60                                 CD_IO_TM_R_LG                                                                            0.00 us
60                                 CD_IO_TM_R_SM                                                                            807269.00 us

......... AVG SINCE FIRST COLLECTION TIME...................

DELTA(s)   CELL                    NAME                         OBJECTNAME                                                  VALUE
--------   ----                    ----                         ----------                                                  -----
60                                 CD_IO_TM_R_LG                                                                            0.00 us
60                                 CD_IO_TM_R_SM                                                                            807269.00 us

So, no difference between the delta snap and the average after the first snap (which is obvious 🙂 ).

In the “SNAP” section, the DELTA(s) field is the delta in seconds between the collectionTime values recorded in the snapshots (see this post for more details). In the “AVG” section, the DELTA(s) field is the sum of the DELTA(s) fields from all the previous “SNAP” sections.

Let’s have a look at the output after 2 minutes of metrics extraction to make things clear:

--------------------------------------
----------COLLECTING DATA-------------
--------------------------------------

......... SNAP FOR LAST COLLECTION TIME ...................

DELTA(s)   CELL                    NAME                         OBJECTNAME                                                  VALUE
--------   ----                    ----                         ----------                                                  -----
61                                 CD_IO_TM_R_LG                                                                            0.00 us
61                                 CD_IO_TM_R_SM                                                                            348479.00 us

......... AVG SINCE FIRST COLLECTION TIME...................

DELTA(s)   CELL                    NAME                         OBJECTNAME                                                  VALUE
--------   ----                    ----                         ----------                                                  -----
121                                CD_IO_TM_R_LG                                                                            0.00 us
121                                CD_IO_TM_R_SM                                                                            577874.00 us

So, during the last 61 seconds of collection the CD_IO_TM_R_SM metric delta value is 348479.00 us.

This leads to an average of 577874.00 us per snapshot since the collection began (that is to say over 121 seconds of metrics collection).
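The AVG figures can be reproduced from the two SNAP sections above (a minimal sketch using the numbers shown: the value is averaged per snapshot, while the DELTA(s) values are summed):

```python
# Per-snapshot deltas observed for CD_IO_TM_R_SM (us) and their DELTA(s):
snap_values = [807269.00, 348479.00]   # first and second SNAP sections
snap_deltas = [60, 61]                 # DELTA(s) of each SNAP section

avg_value = sum(snap_values) / len(snap_values)  # average per snapshot
total_delta = sum(snap_deltas)                   # DELTA(s) of the AVG section

print(avg_value, total_delta)  # 577874.0 121
```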

Remarks:

  • The exadata_metrics.pl script can be downloaded from this repository.
  • The DELTA(s) field is an important one to interpret the results correctly; I strongly recommend reading this post for a better understanding of it.

Conclusion:

The exadata_metrics.pl script now offers the ability to display the average delta values since the collection began (that is to say since the script has been launched).

Exadata Cell metrics: collectionTime attribute, something that matters

Exadata provides a lot of useful metrics to monitor the Cells.

The Metrics can be of various types:

  • Cumulative: Cumulative statistics since the metric was created.
  • Instantaneous: Value at the time that the metric is collected.
  • Rate: Rates computed by averaging statistics over observation periods.
  • Transition: Collected at the time when the value of the metric has changed; typically captures important transitions in hardware status.

One attribute of the cumulative metric is the collectionTime.

For example, let’s have a look at one of them:

CellCLI> list METRICCURRENT DB_IO_WT_SM detail
         name:                   DB_IO_WT_SM
         alertState:             normal
         collectionTime:         2013-09-12T23:46:14+02:00
         metricObjectName:       EXABDT
         metricType:             Cumulative
         metricValue:            120 ms
         objectType:             IORM_DATABASE

The collectionTime attribute is the time at which the metric was collected.

Why does it matter?

Based on it, we can compute the delta in seconds between 2 collections.

Let’s see two use cases.

First use case: Suppose you decided to extract real-time metrics from the cumulative ones. To do so, you created a script that takes a snapshot of the cumulative metrics each second (default interval) and computes the delta with the previous snapshot (yes, I am describing my exadata_metrics.pl script introduced in this post 🙂 ).

Then, if the delta value of the metric is 0, you need to know why (two explanations are possible as we’ll see).

Let’s see an example: I’ll take snapshots of 2 IORM cumulative metrics with a 40-second interval:

./exadata_metrics.pl 40 cell=exacell1  name='DB_IO_WT_.*' objectname='EXABDT'
--------------------------------------
----------COLLECTING DATA-------------
--------------------------------------

00:19:21   CELL                    NAME                         OBJECTNAME                                                  VALUE
00:19:21   ----                    ----                         ----------                                                  -----
00:19:21   exacell1                DB_IO_WT_LG                  EXABDT                                                      0.00 ms
00:19:21   exacell1                DB_IO_WT_SM                  EXABDT                                                      0.00 ms

Well, as you can see the computed (delta) value is 0.00 ms but:

  • does it mean that no IO has been queued by IORM?
  • or does it mean that the 2 snaps are based on the same collectionTime? (this could be the case if the collection interval is greater than the interval you are using with my script).

To answer those questions, I modified the script so that it takes care of the collectionTime: it computes the delta in seconds between the collectionTime values recorded in the snapshots.

Let’s see it in action:

Enable the IORM plan:

CellCLI> alter iormplan objective=auto;
IORMPLAN successfully altered

and launch the script with a 40-second interval:

./exadata_metrics.pl 40 cell=exacell1  name='DB_IO_WT_.*' objectname='EXABDT'

--------------------------------------
----------COLLECTING DATA-------------
--------------------------------------

DELTA(s)   CELL                    NAME                         OBJECTNAME                                                  VALUE
--------   ----                    ----                         ----------                                                  -----
61         exacell1                DB_IO_WT_SM                  EXABDT                                                      0.00 ms
61         exacell1                DB_IO_WT_LG                  EXABDT                                                      1444922.00 ms

--------------------------------------
----------COLLECTING DATA-------------
--------------------------------------

DELTA(s)   CELL                    NAME                         OBJECTNAME                                                  VALUE
--------   ----                    ----                         ----------                                                  -----
60         exacell1                DB_IO_WT_SM                  EXABDT                                                      1.00 ms
60         exacell1                DB_IO_WT_LG                  EXABDT                                                      2573515.00 ms

--------------------------------------
----------COLLECTING DATA-------------
--------------------------------------

DELTA(s)   CELL                    NAME                         OBJECTNAME                                                  VALUE
--------   ----                    ----                         ----------                                                  -----
0          exacell1                DB_IO_WT_LG                  EXABDT                                                      0.00 ms
0          exacell1                DB_IO_WT_SM                  EXABDT                                                      0.00 ms

Look at the DELTA(s) column: It indicates the delta in seconds for the collectionTime attribute.

So that:

  • DELTA(s) > 0: Means you can check the metric value, as the snaps are from 2 distinct collectionTimes.
  • DELTA(s) = 0: Means the snaps come from the same collectionTime, so a metric value of 0 is expected.
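This interpretation rule can be condensed into a tiny helper (a hypothetical sketch for illustration, not part of the script):

```python
def interpret_snapshot(delta_seconds, value):
    """Decide whether a computed delta value is meaningful.

    A delta value is only significant when the two snapshots really come
    from two distinct collectionTime values (delta_seconds > 0).
    """
    if delta_seconds == 0:
        return "same collectionTime: a value of 0 is expected, ignore it"
    return f"distinct collectionTimes: value {value} is meaningful"

print(interpret_snapshot(0, 0.0))
print(interpret_snapshot(61, 1444922.0))
```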

Second use case:

As we now have the DELTA(s) value, we can compute on our own the associated rate (_SEC) metrics.

For example, from:

./exadata_metrics_orig_new.pl 10 cell=exacell1 name='DB_IO_.*' objectname='EXABDT'
--------------------------------------
----------COLLECTING DATA-------------
--------------------------------------

DELTA(s)   CELL                    NAME                         OBJECTNAME                                                  VALUE                
--------   ----                    ----                         ----------                                                  -----                
60         exacell1                DB_IO_WT_SM                  EXABDT                                                      0.00 ms        
60         exacell1                DB_IO_RQ_SM                  EXABDT                                                      153.00 IO requests
60         exacell1                DB_IO_RQ_LG                  EXABDT                                                      292.00 IO requests
60         exacell1                DB_IO_WT_LG                  EXABDT                                                      830399.00 ms

We can conclude, that:

  • the number of large IO requests per second is 292/60 = 4.87.
  • the number of small IO requests per second is 153/60 = 2.55.
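The rate is just the cumulative delta divided by DELTA(s); as a quick check using the numbers from the output above:

```python
# Derive per-second rates from the cumulative deltas shown above:
delta_seconds = 60
rq_lg_delta = 292.0   # DB_IO_RQ_LG delta (IO requests)
rq_sm_delta = 153.0   # DB_IO_RQ_SM delta (IO requests)

rq_lg_per_sec = rq_lg_delta / delta_seconds
rq_sm_per_sec = rq_sm_delta / delta_seconds

# Matches the ~4.9 and ~2.6 IO/sec reported by the rate metrics:
print(round(rq_lg_per_sec, 2), round(rq_sm_per_sec, 2))  # 4.87 2.55
```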

Let’s verify those numbers with their associated rate metrics (DB_IO_RQ_LG_SEC and DB_IO_RQ_SM_SEC):

cellcli -e "list metriccurrent attributes name,metrictype,metricobjectname,metricvalue,collectionTime where name like 'DB_IO_.*' and metricobjectname='EXABDT' and metrictype='Rate'"
         DB_IO_RQ_LG_SEC         Rate    EXABDT  4.9 IO/sec              2013-09-13T16:13:40+02:00
         DB_IO_RQ_SM_SEC         Rate    EXABDT  2.6 IO/sec              2013-09-13T16:13:40+02:00
         DB_IO_WT_LG_RQ          Rate    EXABDT  2,844 ms/request        2013-09-13T16:13:40+02:00
         DB_IO_WT_SM_RQ          Rate    EXABDT  0.0 ms/request          2013-09-13T16:13:40+02:00

Great, those are the same numbers.

Conclusion:

The collectionTime metric attribute can be very useful when you extract real-time metrics from the cumulative ones as:

  • It provides a way to interpret the results.
  • It provides a way to derive the rate (_SEC) metrics from their cumulative ones.

Regarding the script:

  • You are able to collect real-time metrics based on cumulative metrics.
  • You can choose the number of snapshots to display and the time to wait between snapshots.
  • You can choose to filter on name and objectname based on predicates (see the help).
  • You can work on all the cells or a subset thanks to the CELL or the GROUPFILE parameter.
  • You can decide the way to compute the metrics with no aggregation, aggregation on cell, objectname or both.

You can download the exadata_metrics.pl script from this repository.

ASM I/O Statistics Utility: Update for Exadata

In this previous post (you should read it to understand what follows) I explained how my asmiostat utility could be useful for the Exadata community. For this, I made one assumption:

  • Each storage cell constitutes a separate failure group (in the most common Exadata configuration) (see the Expert Oracle Exadata book for more details)

And I concluded with:

  • In case your Exadata configuration does not follow this rule (one failure group per storage cell), just be aware that I will update my asmiostat utility so that it will be able to group by storage cells in any case (thanks to the IP located in the disk path). I’ll keep you posted once ready.

Here we are: I updated my asmiostat utility so that you can choose to focus on IP (Exadata Cells) instead of Failgroup.
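Grouping by cell relies on the IP embedded in the ASM disk path; a minimal sketch of the extraction (the path below is a hypothetical example of the `o/<ip>/<diskname>` format):

```python
import re

def cell_ip(disk_path):
    """Extract the cell IP from an Exadata ASM disk path like 'o/<ip>/<diskname>'.

    Returns None when the path does not embed an IP.
    """
    m = re.match(r"o/(\d{1,3}(?:\.\d{1,3}){3})/", disk_path)
    return m.group(1) if m else None

# Hypothetical disk path, for illustration only:
print(cell_ip("o/192.168.56.111/DATA_CD_03_cell01"))  # 192.168.56.111
```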

So now you can measure the performance and the IO load across the DB servers and the cells this way:

./real_time.pl -type=asmiostat -show=ip,inst

with the following output:

Collecting 1 sec....
............................
03:32:18                                                               Kby      Avg       AvgBy/    Read                Kby       Avg        AvgBy/    Write
03:32:18   INST     DG          IP (Cells)        DSK        Reads/s   Read/s   ms/Read   Read      Errors   Writes/s   Write/s   ms/Write   Write     Errors
03:32:18   ------   ---------   ---------------   --------   -------   ------   -------   ------    ------   --------   -------   --------   ------    ------
03:32:18   +ASM                                              48        1600     10.6      34133     0        7          144       18.0       21065     0
03:32:18   +ASM                 192.168.56.111               12        424      13.4      36181     0        3          48        32.4       16384     0
03:32:18   +ASM                 192.168.56.101               36        1176     9.7       33451     0        4          96        7.2        24576     0

You can also choose to filter on some IP addresses (see the help):

./real_time.pl -type=asmiostat -help

Usage: ./real_time.pl -type=asmiostat [-interval] [-count] [-inst] [-dg] [-fg] [-ip] [-show] [-help]
 Default Interval : 1 second.
 Default Count    : Unlimited

  Parameter    Comment                                                      Default
  ---------    -------                                                      -------
  -INST=       ALL - Show all Instance(s)                                   ALL
               CURRENT - Show Current Instance
               INSTANCE_NAME,... - choose Instance(s) to display

  -DG=         Diskgroup to collect (comma separated list)                  ALL
  -FG=         Failgroup to collect (comma separated list)                  ALL
  -IP=         IP (Exadata Cells) to collect (Wildcard allowed)             ALL
  -SHOW=       What to show: inst,fg|ip,dg,dsk (comma separated list)       DG

Example: ./real_time.pl -type=asmiostat
Example: ./real_time.pl -type=asmiostat -inst=+ASM1
Example: ./real_time.pl -type=asmiostat -dg=DATA -show=dg
Example: ./real_time.pl -type=asmiostat -dg=data -show=inst,dg,fg
Example: ./real_time.pl -type=asmiostat -show=dg,dsk
Example: ./real_time.pl -type=asmiostat -show=inst,dg,fg,dsk
Example: ./real_time.pl -type=asmiostat -show=ip -ip='%10%'

Remarks:

  • To get the asmiostat utility included in the real_time.pl script: click on the link, then on the view source button, and copy/paste the source code. You can also download the script from this repository to avoid copy/paste (click on the link).
  • For a full description of my asmiostat utility see this post.

UPDATE: The asmiostat utility is not part of the real_time.pl script anymore. A new utility called asm_metrics.pl has been created. See “ASM metrics are a gold mine. Welcome to asm_metrics.pl, a new utility to extract and to manipulate them in real time” for more information.

Exadata real-time metrics extracted from cumulative metrics: Part II

In this post I introduced my exadata_metrics.pl script, which I use to collect real-time cell metrics from the cumulative ones.

I added new features to it that you may find useful:

  1. First, I added the possibility to aggregate the results based on the cell, on the metricobjectname, or both: that is to say, you can customize the way the metrics are computed.
  2. Second, I added the possibility to use a “groupfile” (a file containing a list of cells, same format as the dcli utility) as input (could be useful if your exadata has a lot of cells 😉 )

Let’s see examples to make it clear: for this I will focus on the CD_IO_TM_R_SM metric and on 2 cells only (for output simplicity) during the last 100 seconds.

To collect the metrics without aggregation:

./exadata_metrics.pl 100 groupfile=./cell_group name='CD_IO_TM_R_SM'

The output will be like:

07:59:46   CELL                    NAME                         OBJECTNAME                                                  VALUE
07:59:46   ----                    ----                         ----------                                                  -----
07:59:46   cell2                   CD_IO_TM_R_SM                CD_disk05_cell                                              0.00 us
07:59:46   cell2                   CD_IO_TM_R_SM                FD_03_cell                                                  0.00 us
07:59:46   cell2                   CD_IO_TM_R_SM                CD_disk03_cell                                              36479.00 us
07:59:46   cell                    CD_IO_TM_R_SM                CD_disk06_cell                                              41572.00 us
07:59:46   cell2                   CD_IO_TM_R_SM                CD_disk02_cell                                              167822.00 us
07:59:46   cell                    CD_IO_TM_R_SM                CD_disk02_cell                                              522659.00 us
07:59:46   cell                    CD_IO_TM_R_SM                CD_disk04_cell                                              523456.00 us
07:59:46   cell                    CD_IO_TM_R_SM                CD_disk01_cell                                              553921.00 us
07:59:46   cell                    CD_IO_TM_R_SM                FD_02_cell                                                  580027.00 us
07:59:46   cell                    CD_IO_TM_R_SM                FD_03_cell                                                  801521.00 us
07:59:46   cell                    CD_IO_TM_R_SM                FD_01_cell                                                  845028.00 us
07:59:46   cell                    CD_IO_TM_R_SM                FD_00_cell                                                  1228914.00 us
07:59:46   cell2                   CD_IO_TM_R_SM                CD_disk01_cell                                              1229929.00 us
07:59:46   cell                    CD_IO_TM_R_SM                CD_disk05_cell                                              1314321.00 us
07:59:46   cell                    CD_IO_TM_R_SM                CD_disk03_cell                                              4055302.00 us

With an aggregation on the objectname (I use the “show” option as I want to display cell and name):

./exadata_metrics.pl 100 groupfile=./cell_group name='CD_IO_TM_R_SM' show=cell,name

The output would have been like:

07:59:48   CELL                    NAME                         OBJECTNAME                                                  VALUE
07:59:48   ----                    ----                         ----------                                                  -----
07:59:48   cell2                   CD_IO_TM_R_SM                                                                            1434230.00 us
07:59:48   cell                    CD_IO_TM_R_SM                                                                            10466721.00 us

As you can see, the objectname has disappeared as the metrics have been aggregated on it.
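The aggregation performed by the show= option boils down to summing the metric values over the hidden columns; a sketch using the values from the non-aggregated output above:

```python
from collections import defaultdict

# (cell, objectname, value in us), taken from the non-aggregated output above
rows = [
    ("cell2", "CD_disk05_cell", 0.00), ("cell2", "FD_03_cell", 0.00),
    ("cell2", "CD_disk03_cell", 36479.00), ("cell", "CD_disk06_cell", 41572.00),
    ("cell2", "CD_disk02_cell", 167822.00), ("cell", "CD_disk02_cell", 522659.00),
    ("cell", "CD_disk04_cell", 523456.00), ("cell", "CD_disk01_cell", 553921.00),
    ("cell", "FD_02_cell", 580027.00), ("cell", "FD_03_cell", 801521.00),
    ("cell", "FD_01_cell", 845028.00), ("cell", "FD_00_cell", 1228914.00),
    ("cell2", "CD_disk01_cell", 1229929.00), ("cell", "CD_disk05_cell", 1314321.00),
    ("cell", "CD_disk03_cell", 4055302.00),
]

# show=cell,name: aggregate away the objectname by summing per cell
per_cell = defaultdict(float)
for cell, _objectname, value in rows:
    per_cell[cell] += value

print(per_cell["cell2"], per_cell["cell"])  # 1434230.0 10466721.0
```

Those totals match the show=cell,name output above.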

Let’s collect one more time with no aggregation:

./exadata_metrics.pl 100 groupfile=./cell_group name='CD_IO_TM_R_SM'

With no aggregation the output will be like:

09:37:01   CELL                    NAME                         OBJECTNAME                                                  VALUE
09:37:01   ----                    ----                         ----------                                                  -----
09:37:01   cell2                   CD_IO_TM_R_SM                CD_disk09_cell                                              0.00 us
09:37:01   cell2                   CD_IO_TM_R_SM                CD_disk05_cell                                              0.00 us
09:37:01   cell2                   CD_IO_TM_R_SM                FD_03_cell                                                  0.00 us
09:37:01   cell                    CD_IO_TM_R_SM                CD_disk12_cell                                              0.00 us
09:37:01   cell                    CD_IO_TM_R_SM                CD_disk01_cell                                              879.00 us
09:37:01   cell                    CD_IO_TM_R_SM                CD_disk06_cell                                              41629.00 us
09:37:01   cell                    CD_IO_TM_R_SM                FD_03_cell                                                  111676.00 us
09:37:01   cell                    CD_IO_TM_R_SM                FD_02_cell                                                  233388.00 us
09:37:01   cell                    CD_IO_TM_R_SM                FD_01_cell                                                  253784.00 us
09:37:01   cell                    CD_IO_TM_R_SM                CD_disk04_cell                                              519624.00 us
09:37:01   cell                    CD_IO_TM_R_SM                CD_disk05_cell                                              587848.00 us
09:37:01   cell2                   CD_IO_TM_R_SM                CD_disk03_cell                                              1949331.00 us
09:37:01   cell2                   CD_IO_TM_R_SM                CD_disk02_cell                                              2016198.00 us
09:37:01   cell                    CD_IO_TM_R_SM                CD_disk03_cell                                              3220328.00 us
09:37:01   cell2                   CD_IO_TM_R_SM                CD_disk01_cell                                              3631941.00 us

With an aggregation on the cell and the objectname (I use the “show” option as I want to display name):

./exadata_metrics.pl 100 groupfile=./cell_group name='CD_IO_TM_R_SM' show=name

The output would have been like:

09:36:59   CELL                    NAME                         OBJECTNAME                                                  VALUE
09:36:59   ----                    ----                         ----------                                                  -----
09:36:59                           CD_IO_TM_R_SM                                                                            12566626.00 us

As you can see, the objectname and the cell have disappeared as the metrics have been aggregated.

Let’s see the help:

 ./exadata_metrics.pl help

Usage: ./exadata_metrics.pl [Interval [Count]] [cell=|groupfile=] [show=] [top=] [name=] [name!=] [objectname=] [objectname!=]

 Default Interval : 1 second.
 Default Count : Unlimited

 Parameter                 Comment                                                      Default
 ---------                 -------                                                      -------
 CELL=                     comma-separated list of cells
 GROUPFILE=                file containing list of cells
 SHOW=                     What to show (name included): cell,objectname                ALL
 TOP=                      Number of rows to display                                    10
 NAME=                     ALL - Show all cumulative metrics (wildcard allowed)         ALL
 NAME!=                    Exclude cumulative metrics (wildcard allowed)                EMPTY
 OBJECTNAME=               ALL - Show all objects (wildcard allowed)                    ALL
 OBJECTNAME!=              Exclude objects (wildcard allowed)                           EMPTY

utility assumes passwordless SSH from this cell node to the other cell nodes
utility assumes ORACLE_HOME has been set (with celladmin user for example)

Example : ./exadata_metrics.pl cell=cell
Example : ./exadata_metrics.pl groupfile=./cell_group
Example : ./exadata_metrics.pl groupfile=./cell_group show=name
Example : ./exadata_metrics.pl cell=cell objectname='CD_disk03_cell' name!='.*RQ_W.*'
Example : ./exadata_metrics.pl cell=cell name='.*BY.*' objectname='.*disk.*' name!='GD.*' objectname!='.*disk1.*'
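The NAME/NAME!= and OBJECTNAME/OBJECTNAME!= pairs behave like include/exclude regular expressions applied to each metric row. A hedged sketch of that filtering logic (illustrative only, not the script's actual implementation):

```python
import re

def keep(value, include="ALL", exclude=None):
    """True if value matches the include pattern and not the exclude pattern."""
    if include != "ALL" and not re.match(include, value):
        return False
    if exclude and re.match(exclude, value):
        return False
    return True

# Mirrors: name='.*BY.*' name!='GD.*'
print(keep("CD_IO_BY_R_SM", include=".*BY.*", exclude="GD.*"))  # True
print(keep("GD_IO_BY_R_SM", include=".*BY.*", exclude="GD.*"))  # False
```

A row is displayed only when both its name and its objectname pass their respective keep/exclude pair.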

Conclusion:

  • You are able to collect real-time metrics based on cumulative metrics.
  • You can choose the number of snapshots to display and the time to wait between snapshots.
  • You can choose to filter on name and objectname based on predicates (see the help).
  • You can work on all the cells or a subset thanks to the CELL or the GROUPFILE parameter.
  • You can decide the way to compute the metrics with no aggregation, aggregation on cell, objectname or both.
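Since the underlying metrics are cumulative, the real-time figures come from differencing two snapshots taken `Interval` seconds apart. A minimal sketch of that delta computation (the snapshot keying is an assumption for illustration):

```python
def deltas(prev, curr):
    """Per-metric delta between two snapshots keyed by (cell, name, objectname)."""
    return {k: curr[k] - prev.get(k, 0.0) for k in curr}

prev = {("cell", "CD_IO_TM_R_SM", "CD_disk03_cell"): 3220328.0}
curr = {("cell", "CD_IO_TM_R_SM", "CD_disk03_cell"): 3231000.0}
print(deltas(prev, curr))  # {('cell', 'CD_IO_TM_R_SM', 'CD_disk03_cell'): 10672.0}
```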

To get the exadata_metrics.pl script: click on the link, then on the view source button, and copy/paste the source code. You can also download the script from this repository (click on the link) to avoid the copy/paste.

Any remarks, suggestions, questions are welcome.

Update:

You should read this post for a better interpretation of the utility: Exadata Cell metrics: collectionTime attribute, something that matters

Exadata: Storage Cells IO performance metrics and IO distribution with DB servers

Yesterday I was reading Uwe Hesse’s blog post “Appliance? How #Exadata will impact your IT Organization” (you should read it too, by the way 🙂) and then (I don’t know why) my mind switched back to my asmiostat utility and how it could be useful for the Exadata community.

Let me explain:

If you read my previous blog post “ASM Preferred Read: Collect performance metrics”, you see that, thanks to my asmiostat utility, we are able to measure the IO distribution between ASM instances and failure groups. (ASM is not doing any IOs, by the way; that’s just a simple way to refer to the IOs generated by the databases linked to the ASM instance.)

You see that we are also able to measure the failure groups' IO performance metrics (and those of their associated disks if needed); see the first post related to my asmiostat utility.

That said, now think about Exadata for which:

  • One ASM instance is running per DB server.
  • Each storage cell constitutes a separate failure group (in the most common Exadata configurations; see the Expert Oracle Exadata book for more details).

So, now come back to the first sentence of the explanation and simply change a few words for Exadata:

You see that thanks to my asmiostat utility we are able to measure the IO distribution between DB servers (ASM instances) and Storage cells (failure groups).

You see that we are also able to measure the Storage cells (failure groups) IO performance metrics (and those of their associated Grid Disks if needed).

Remarks:  

  • In case your Exadata configuration does not follow the one-failure-group-per-storage-cell rule, be aware that I will update my asmiostat utility so that it can group by storage cells in any case (thanks to the IP located in the disks path). I’ll keep you posted once it’s ready.
  • To get the asmiostat utility included in the real_time.pl script: click on the link, then on the view source button, and copy/paste the source code. You can also download the script from this repository (click on the link) to avoid the copy/paste.
  • For a full description of my asmiostat utility see this post.
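Grouping by storage cell from the database side is possible because an Exadata ASM disk path embeds the cell's IP address (paths of the form `o/<cell IP>/<grid disk name>`). A hedged sketch of extracting that IP (the path value and function name are illustrative):

```python
import re

def cell_ip(disk_path):
    """Extract the storage cell IP(s) from an Exadata ASM disk path
    such as 'o/192.168.10.5/DATA_CD_03_cell01'."""
    m = re.match(r"o/([\d.;]+)/", disk_path)  # ';' allows dual-IP cells
    return m.group(1) if m else None

print(cell_ip("o/192.168.10.5/DATA_CD_03_cell01"))  # 192.168.10.5
```

Disks whose paths yield the same IP can then be summed together, giving per-cell figures even when the failure groups do not map one-to-one to cells.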

Update: The asmiostat utility is now able to deal with Exadata Cell’s IPs (see this post)