Graphing Oracle performance metrics with Telegraf, InfluxDB and Grafana

Introduction

As a picture is worth a thousand words, we may want to visualise the oracle performance metrics. There is already several tools to do so, but what if you could do it the way you want? build your own graph? Let’s try to achieve this with 3 layers:

  • telegraf: to collect the oracle metrics
  • InfluxDB: to store the time-series oracle metrics
  • grafana: to visualise the oracle time-series metrics

Installation

Let’s install InfluxDB and grafana that way (for example on the same host, namely influxgraf):

  • Install InfluxDB (detail here):
[root@influxgraf ~]# wget https://s3.amazonaws.com/influxdb/influxdb-0.10.2-1.x86_64.rpm

[root@influxgraf ~]# yum localinstall influxdb-0.10.2-1.x86_64.rpm
  • Install grafana (detail here):
[root@influxgraf ~]# yum install https://grafanarel.s3.amazonaws.com/builds/grafana-2.6.0-1.x86_64.rpm

Let’s install telegraf on the oracle host (namely Dprima) that way (detail here):

[root@Dprima ~]# wget http://get.influxdb.org/telegraf/telegraf-0.10.4.1-1.x86_64.rpm

[root@Dprima ~]# yum localinstall telegraf-0.10.4.1-1.x86_64.rpm

Setup

  • Create a telegraf database into InfluxDB (using the web interface):

Screen Shot 2016-03-05 at 09.06.10

  • Create a root user into InfluxDB (using the web interface):

Screen Shot 2016-03-05 at 09.08.38

  • Write a script to collect the oracle metrics. I am using python but this is not mandatory at all. Only the output of the script does matter, it has to be InfluxDB line-protocol. The script query the v$sysmetric and v$eventmetric views to get the wait class and the wait event metrics during the last minute. I am not reinventing the wheel, I am using Kyle Hailey‘s queries. The python code is:
[oracle@Dprima scripts]$ cat oracle_metrics.py
import os
import sys
import cx_Oracle
import argparse
import re

class OraMetrics():
    def __init__(self, user, passwd, sid):
        import cx_Oracle
        self.user = user
        self.passwd = passwd
        self.sid = sid
        self.connection = cx_Oracle.connect( self.user , self.passwd , self.sid )
        cursor = self.connection.cursor()
        cursor.execute("select HOST_NAME from v$instance")
        for hostname in cursor:
            self.hostname = hostname[0]

    def waitclassstats(self, user, passwd, sid):
        cursor = self.connection.cursor()
        cursor.execute("""
        select n.wait_class, round(m.time_waited/m.INTSIZE_CSEC,3) AAS
        from   v$waitclassmetric  m, v$system_wait_class n
        where m.wait_class_id=n.wait_class_id and n.wait_class != 'Idle'
        union
        select  'CPU', round(value/100,3) AAS
        from v$sysmetric where metric_name='CPU Usage Per Sec' and group_id=2
        union select 'CPU_OS', round((prcnt.busy*parameter.cpu_count)/100,3) - aas.cpu
        from
            ( select value busy
                from v$sysmetric
                where metric_name='Host CPU Utilization (%)'
                and group_id=2 ) prcnt,
                ( select value cpu_count from v$parameter where name='cpu_count' )  parameter,
                ( select  'CPU', round(value/100,3) cpu from v$sysmetric where metric_name='CPU Usage Per Sec' and group_id=2) aas
        """)
        for wait in cursor:
            wait_name = wait[0]
            wait_value = wait[1]
            print "oracle_wait_class,host=%s,db=%s,wait_class=%s wait_value=%s" % (self.hostname, sid,re.sub(' ', '_', wait_name), wait_value )

    def waitstats(self, user, passwd, sid):
        cursor = self.connection.cursor()
        cursor.execute("""
	select
	n.wait_class wait_class,
       	n.name wait_name,
       	m.wait_count cnt,
       	round(10*m.time_waited/nullif(m.wait_count,0),3) avgms
	from v$eventmetric m,
     	v$event_name n
	where m.event_id=n.event_id
  	and n.wait_class <> 'Idle' and m.wait_count > 0 order by 1 """)
        for wait in cursor:
         wait_class = wait[0]
         wait_name = wait[1]
         wait_cnt = wait[2]
         wait_avgms = wait[3]
         print "oracle_wait_event,host=%s,db=%s,wait_class=%s,wait_event=%s count=%s,latency=%s" % (self.hostname, sid,re.sub(' ', '_', wait_class),re.sub(' ', '_', wait_name)
, wait_cnt,wait_avgms)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('-u', '--user', help="Username", required=True)
    parser.add_argument('-p', '--passwd', required=True)
    parser.add_argument('-s', '--sid', help="SID", required=True)

    args = parser.parse_args()

    stats = OraMetrics(args.user, args.passwd, args.sid)
    stats.waitclassstats(args.user, args.passwd, args.sid)
    stats.waitstats(args.user, args.passwd, args.sid)

The output looks like (the output format is the InfluxDB line-protocol):

[oracle@Dprima scripts]$ python "/home/oracle/scripts/oracle_metrics.py" "-u" "system" "-p" "bdtbdt" "-s" PBDT
oracle_wait_class,host=Dprima,db=PBDT,wait_class=Administrative wait_value=0
oracle_wait_class,host=Dprima,db=PBDT,wait_class=CPU wait_value=0
oracle_wait_class,host=Dprima,db=PBDT,wait_class=CPU_OS wait_value=0.035
oracle_wait_class,host=Dprima,db=PBDT,wait_class=Commit wait_value=0
oracle_wait_class,host=Dprima,db=PBDT,wait_class=Concurrency wait_value=0
oracle_wait_class,host=Dprima,db=PBDT,wait_class=Configuration wait_value=0
oracle_wait_class,host=Dprima,db=PBDT,wait_class=Network wait_value=0
oracle_wait_class,host=Dprima,db=PBDT,wait_class=Other wait_value=0
oracle_wait_class,host=Dprima,db=PBDT,wait_class=Scheduler wait_value=0
oracle_wait_class,host=Dprima,db=PBDT,wait_class=System_I/O wait_value=0.005
oracle_wait_class,host=Dprima,db=PBDT,wait_class=User_I/O wait_value=0
oracle_wait_event,host=Dprima,db=PBDT,wait_class=System_I/O,wait_event=control_file_sequential_read count=163,latency=0.009
oracle_wait_event,host=Dprima,db=PBDT,wait_class=System_I/O,wait_event=control_file_parallel_write count=60,latency=3.933
oracle_wait_event,host=Dprima,db=PBDT,wait_class=System_I/O,wait_event=log_file_parallel_write count=60,latency=1.35
oracle_wait_event,host=Dprima,db=PBDT,wait_class=User_I/O,wait_event=Disk_file_operations_I/O count=16,latency=0.037
oracle_wait_event,host=Dprima,db=PBDT,wait_class=User_I/O,wait_event=Parameter_File_I/O count=16,latency=0.004
  • On the oracle host, configure telegraf to execute the python script with 60 seconds interval and send the output to InfluxDB. Edit the /etc/telegraf/telegraf.conf file so that it contains:
###############################################################################
#                                  OUTPUTS                                    #
###############################################################################

# Configuration for influxdb server to send metrics to
[[outputs.influxdb]]
  urls = ["http://influxgraf:8086"] # required
  database = "telegraf" # required
  precision = "s"
  timeout = "5s"


###############################################################################
#                                  INPUTS                                     #
###############################################################################

# Oracle metrics
[[inputs.exec]]
  # Shell/commands array
  commands = ["/home/oracle/scripts/oracle_metrics.sh"]
  # Data format to consume. This can be "json", "influx" or "graphite" (line-protocol)
  # NOTE json only reads numerical measurements, strings and booleans are ignored.
  data_format = "influx"
  interval = "60s"

The oracle_metrics.sh script contains the call to the python script:

#!/bin/env bash

export LD_LIBRARY_PATH=/u01/app/oracle/product/11.2.0/dbhome_1//lib
export ORACLE_HOME=/u01/app/oracle/product/11.2.0/dbhome_1/

python "/home/oracle/scripts/oracle_metrics.py" "-u" "system" "-p" "bdtbdt" "-s" PBDT
  • Launch telegraf:
[root@Dprima ~]# telegraf -config /etc/telegraf/telegraf.conf

Graph the metrics

  • First check that the metrics are stored the way we want into InfluxDB:

Screen Shot 2016-03-05 at 09.49.09

 

  • Configure InfluxDB as datasource in grafana:

Screen Shot 2016-03-05 at 10.03.55

  • In grafana, create a dashboard and create some variables (hosts, db and wait_class):

Screen Shot 2016-03-05 at 10.59.00

Screen Shot 2016-03-05 at 10.59.57

  • Now let’s create a graph:

Screen Shot 2016-03-05 at 11.04.09

  • Get the metrics:

Screen Shot 2016-03-05 at 11.06.18

  • So that the graph looks like:

Screen Shot 2016-03-05 at 11.20.18

Remarks

  • Thanks to the variables defined in grafana, we can filter on the host, database and wait class (instead of visualising all of them).
  • We can also graph the wait events we collected, or whatever you want to collect.

Conclusion

Thanks to:

  • telegraf and InfluxDB, we are able to collect and store the metrics we want to.
  • grafana, we are able to visualise the metrics the way we want to.

Update 04/21/2016:

As another example, with more informations collected:

dash1

dash2

dash3

dash4

dash5

 

Advertisements

21 thoughts on “Graphing Oracle performance metrics with Telegraf, InfluxDB and Grafana

  1. Thank you for posting this.

    I tried to follow your setup but i couldn’t display the wait event data in grafana. Can you include your Grafana dashboard json file in this post?

  2. Traceback (most recent call last):
    File “oracle_metrics.py”, line 3, in
    import cx_Oracle
    ImportError: No module named cx_Oracle

    I am getting this error.

  3. I have configured your example successfully. Thanks!!
    Can you share the scripts to collect the data for the examples of the 04/21/2016 update?

  4. Wow great post !
    I also tried to add the physical reads and writes but I can not find requests that gave graphs as good as yours. How did you get these stats ?

  5. Hi! can you share the python code for the Sessions count part? I can’t find the output values corresponding to your lasts screenshots.

  6. Great Work! thx for sharing.
    I followed your step, but when after set telegraf.conf and started collect info,
    log continue reply ” E! Error in plugin [inputs.exec]: exec: fork/exec /home/oracle/scripts/oracle_metrics.sh: permission denied for command ‘/home/oracle/scripts/oracle_metrics.sh’ ”
    but Im already used ‘chmod +x’ to make sure script can run, and it’s work fine in manually.

    1. It looks like the owner of the telegraf process (most probably telegraf) does not have the privs to launch the oracle scripts. Extend the privs or change the telegraf owner to root (can be done through the telegraf init config file). telegraf process owner was root at the time of the writing of this blog post but has changed to telegraf since that time.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s