SystemTap for PostgreSQL Toolkit

Introduction

The purpose of this post is to share some SystemTap tools that were initially written for Oracle and have been adapted for PostgreSQL.

The tools are:

  • pg_schedtimes.stp: To track the time spent in various states (run, sleep, iowait, queued)
  • pg_page_faults.stp: To report the total number of page faults and splits them into Major or Minor faults as well as Read or Write access
  • pg_traffic.stp: To track the I/O (vfs, block) and Network (tcp, udp, nfs) traffic

Those tools are able to group the SystemTap probes by client connection (per database or user) and by server process.

Grouping the probes

As described in the documentation, on most platforms PostgreSQL modifies its command title as reported by ps, so that individual server processes can readily be identified.

For example on my lab, the processes are:

# ps -ef | grep postgres:
postgres   1460   1447  0 09:35 ?        00:00:00 postgres: logger process
postgres   1462   1447  0 09:35 ?        00:00:00 postgres: checkpointer process
postgres   1463   1447  0 09:35 ?        00:00:00 postgres: writer process
postgres   1464   1447  0 09:35 ?        00:00:00 postgres: wal writer process
postgres   1465   1447  0 09:35 ?        00:00:00 postgres: autovacuum launcher process
postgres   1466   1447  0 09:35 ?        00:00:00 postgres: stats collector process
postgres   7981   1447  0 12:56 ?        00:00:00 postgres: postgres postgres [local] idle
postgres   7983   1447  0 12:56 ?        00:00:00 postgres: bdt postgres 172.16.170.1(56175) idle
postgres   7984   1447  0 12:56 ?        00:00:00 postgres: bdt bdtdb 172.16.170.1(56203) idle
  • The first six processes are background worker processes
  • Each of the remaining processes is a server process handling one client connection. Each such process sets its command line display in the form “postgres: user database host activity”

That said, we can fetch the command line of the processes that trigger the probe events thanks to the cmdline_str() function and:

  • filter the processes
  • extract the piece of information to be used to group the probes

So let’s write two embedded C functions to extract the relevant piece of information from each of the command lines described above.

Functions

get_pg_dbname:

function get_pg_dbname:string (mystr:string) %{
char *ptr;
char *ptr2;
int  ch = ' ';
char substr_res[500];
char *strargs = STAP_ARG_mystr;

/* ptr points to the space before the database name
   (the second space in "postgres: user database host activity") */
ptr = strchr( strchr( strargs , ch) + 1 , ch);
/* ptr2 points to the space right after the database name */
ptr2 = strchr( ptr + 1 , ch);
strncpy (substr_res, ptr, ptr2 - ptr);
substr_res[ptr2 - ptr] = '\0';
/* return the field, skipping the leading space */
snprintf(STAP_RETVALUE, MAXSTRINGLEN, "%s", substr_res + 1);
%}

This function extracts the database name from any “postgres: user database host activity” string.

get_pg_user_proc:

function get_pg_user_proc:string (mystr:string) %{
char *ptr;
char *ptr2;
int ch = ' ';
char substr_res[500];
char *strargs = STAP_ARG_mystr;

/* ptr points to the space after "postgres:" */
ptr = strchr( strargs , ch);
/* ptr2 points to the space right after the first field (user or procname) */
ptr2 = strchr( ptr + 1 , ch);
strncpy (substr_res, ptr, ptr2 - ptr);
substr_res[ptr2 - ptr] = '\0';
/* return the field, skipping the leading space */
snprintf(STAP_RETVALUE, MAXSTRINGLEN, "%s", substr_res + 1);
%}

This function extracts:

  • the user from any “postgres: user database host activity” string
  • the procname from any “postgres: procname <any string> process” string

Keeping in mind that the SystemTap aggregation operator is “<<<” (as explained here), we can use those 2 functions to aggregate within the probes by passing cmdline_str() as the parameter.
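
For instance, a simplified probe body could route each event to the right bucket. The sketch below counts vfs.read calls, assuming the postgres uid is passed as the first command line argument ($1); the real scripts contain more logic, notably to separate the background worker processes (whose titles would not parse as “postgres: user database …”) thanks to get_pg_user_proc:

```systemtap
global reads

probe vfs.read
{
  if (uid() == $1 && isinstr(cmdline_str(), "postgres:") == 1) {
    # a client backend: group by database name
    reads[get_pg_dbname(cmdline_str())] <<< 1
  } else {
    reads["NOT_PG"] <<< 1
  }
}
```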

Let’s see the usage and output of those tools.

pg_schedtimes.stp

Usage

 $> stap -g ./pg_schedtimes.stp  <pg uid> <refresh time ms> <db|user> <details|nodetails>

 <db|user>: group by db or user
 <details|nodetails>: process server details or grouped all together

Output example

The postgres userid is 26, and we want to see the time spent in various states by the client connections (grouped by database) and by all the worker processes.

$> stap -g ./pg_schedtimes.stp 26 10000 db details

-----------------------------------------------------------------------
                   run(us)  sleep(us) iowait(us) queued(us)  total(us)
-----------------------------------------------------------------------
NOT_PG         :      34849  430368652       1433      35876  430440810
logger         :         81    9986664          0        446    9987191
checkpointer   :         24    9986622          0        543    9987189
writer         :        331    9986227          0        629    9987187
wal            :        248    9986279          0        657    9987184
autovacuum     :        862    9983132          0       3188    9987182
stats          :       2210    9981339          0       3631    9987180
postgres       :      11058    9975156          0        948    9987162
bdtdb          :         13    9986338          0        809    9987160

I can see the client connections grouped by the bdtdb and postgres databases, the worker processes and all that is not related to PostgreSQL (NOT_PG).

pg_page_faults.stp

Usage

 $> stap -g ./pg_page_faults.stp  <pg uid> <refresh time ms> <db|user> <details|nodetails>

 <db|user>: group by db or user
 <details|nodetails>: process server details or grouped all together

Output example

The postgres userid is 26, and we want to see the page faults by client connections grouped by user, without worker process details.

$> stap -g ./pg_page_faults.stp 26 10000 user nodetails

------------------------------------------------------------------------------------------
                READS_PFLT   WRITES_PFLT  TOTAL_PFLT   MAJOR_PFLT   MINOR_PFLT
------------------------------------------------------------------------------------------
bdt            : 0            294          294          0            294
NOT_PG         : 0            71           71           0            71
PG SERVER      : 0            3            3            0            3

I can see the client connections grouped by the bdt user, the worker processes grouped all together as “PG SERVER” and all that is not related to PostgreSQL (NOT_PG).

pg_traffic.stp

Usage

 $> stap -g ./pg_traffic.stp <pg uid> <refresh time ms> <io|network|both> <db|user> <details|nodetails>

 <io|network|both>: see io, network or both activity
 <db|user>: group by db or user
 <details|nodetails>: process server details or grouped all together

Output example

The postgres userid is 26, and we want to see the I/O activity by the client connections (grouped by database) and by all the worker processes.

$> stap -g ./pg_traffic.stp 26 10000 io db details

-------------------------------------------------------------------------------------------------------------
|                                                          I/O                                              |
-------------------------------------------------------------------------------------------------------------
                                  READS                     |                     WRITES                    |
                                                            |                                               |
                       VFS                    BLOCK         |          VFS                    BLOCK         |
            | NB                 KB | NB                 KB | NB                 KB | NB                 KB |
            | --                 -- | --                 -- | --                 -- | --                 -- |
postgres    | 189               380 | 0                   0 | 0                   0 | 0                   0 |
bdtdb       | 38                127 | 0                   0 | 0                   0 | 0                   0 |
NOT_PG      | 79                  2 | 0                   0 | 10                  2 | 0                   0 |
autovacuum  | 49                141 | 0                   0 | 2                   0 | 0                   0 |
stats       | 0                   0 | 0                   0 | 42                 96 | 0                   0 |
-------------------------------------------------------------------------------------------------------------

I can see the client connections grouped by the bdtdb and postgres databases, the worker processes and all that is not related to PostgreSQL (NOT_PG).

For information, the network output produces something like:

-------------------------------------------------------------------------------------------------------------------------------------------------------------
|                                                                                Network                                                                    |
-------------------------------------------------------------------------------------------------------------------------------------------------------------
                                              RECV                                  |                                 SENT                                  |
                                                                                    |                                                                       |
                       TCP                     UDP                     NFS          |          TCP                     UDP                     NFS          |
            | NB                 KB | NB                 KB | NB                 KB | NB                 KB | NB                 KB | NB                 KB |
            | --                 -- | --                 -- | --                 -- | --                 -- | --                 -- | --                 -- |
postgres    | 95                760 | 0                   0 | 0                   0 | 52                  8 | 0                   0 | 0                   0 |
NOT_PG      | 6                  48 | 0                   0 | 0                   0 | 3                   3 | 0                   0 | 0                   0 |
bdtdb       | 10                 80 | 0                   0 | 0                   0 | 5                   0 | 0                   0 | 0                   0 |
stats       | 0                   0 | 0                   0 | 0                   0 | 0                   0 | 0                   0 | 0                   0 |
-------------------------------------------------------------------------------------------------------------------------------------------------------------

Source code

The source code is available in this github repository.

Conclusion

Thanks to the embedded C functions we have been able to aggregate the probes and display the information by worker processes or client connections (grouped by database or user).


SystemTap and Oracle RDBMS: I/O and Network Traffic

Now that I am able to aggregate SystemTap probes by Oracle database, let’s focus on I/O and Network traffic.

For this purpose a new SystemTap script (traffic_per_db.stp) has been created and added to this github repository.

traffic_per_db

This script tracks the I/O and Network traffic per database and also groups the traffic that is not related to any database.

Usage:

$> stap -g ./traffic_per_db.stp <oracle uid> <refresh time ms> <io|network|both>

Output example (I/O only):

$> stap -g ./traffic_per_db.stp 54321 5000 io

-------------------------------------------------------------------------------------------------------------
|                                                          I/O                                              |
-------------------------------------------------------------------------------------------------------------
                                  READS                     |                     WRITES                    |
                                                            |                                               |
                       VFS                    BLOCK         |          VFS                    BLOCK         |
            | NB                 KB | NB                 KB | NB                 KB | NB                 KB |
            | --                 -- | --                 -- | --                 -- | --                 -- |
BDTS        | 6203            49280 | 0                   0 | 11                 64 | 12                128 |
NOT_A_DB    | 45                  3 | 0                   0 | 15                  2 | 0                   0 |
-------------------------------------------------------------------------------------------------------------

For this example the database files are located on a file system. In this output, we can see that the read I/Os are served by the file system cache: there are VFS reads but no block reads.

Output example (Network only):

Example 1: The database files are located on Kernel NFS (kNFS)

$> stap -g ./traffic_per_db.stp 54321 5000 network
-------------------------------------------------------------------------------------------------------------------------------------------------------------
|                                                                                Network                                                                    |
-------------------------------------------------------------------------------------------------------------------------------------------------------------
                                              RECV                                  |                                 SENT                                  |
                                                                                    |                                                                       |
                       TCP                     UDP                     NFS          |          TCP                     UDP                     NFS          |
            | NB                 KB | NB                 KB | NB                 KB | NB                 KB | NB                 KB | NB                 KB |
            | --                 -- | --                 -- | --                 -- | --                 -- | --                 -- | --                 -- |
NOT_A_DB    | 4                  32 | 0                   0 | 61                706 | 1943              252 | 0                   0 | 5                  80 |
BVDB        | 0                   0 | 0                   0 | 1623            16825 | 113                13 | 0                   0 | 170              1437 |
-------------------------------------------------------------------------------------------------------------------------------------------------------------

As expected, we can observe NFS traffic at the database level.

Example 2: The database files are located on Direct NFS (dNFS)

$> stap -g ./traffic_per_db.stp 54321 5000 network

-------------------------------------------------------------------------------------------------------------------------------------------------------------
|                                                                                Network                                                                    |
-------------------------------------------------------------------------------------------------------------------------------------------------------------
                                              RECV                                  |                                 SENT                                  |
                                                                                    |                                                                       |
                       TCP                     UDP                     NFS          |          TCP                     UDP                     NFS          |
            | NB                 KB | NB                 KB | NB                 KB | NB                 KB | NB                 KB | NB                 KB |
            | --                 -- | --                 -- | --                 -- | --                 -- | --                 -- | --                 -- |
BVDB        | 3810            18934 | 0                   0 | 0                   0 | 2059             1787 | 0                   0 | 0                   0 |
NOT_A_DB    | 3                  24 | 0                   0 | 0                   0 | 4                   2 | 0                   0 | 0                   0 |
-------------------------------------------------------------------------------------------------------------------------------------------------------------

As you can see, the NFS traffic has been replaced by TCP traffic at the database level.

Remarks

  • In this post the word database stands for “all the foreground and background processes linked to an Oracle database”.
  • In a consolidated environment, having a view per database can be very useful.
  • This script displays the traffic per database and also reports the traffic that is not related to any database (“NOT_A_DB”).
  • The probes documentation can be found here.

Conclusion

We are able to track the I/O and the Network traffic at the database level.

SystemTap and Oracle RDBMS: Page Faults

Introduction

Now that I am able to aggregate SystemTap probes by Oracle database, let’s focus on page faults.

For this purpose a new SystemTap script (page_faults_per_db.stp) has been created and added to this github repository.

page_faults_per_db

This script tracks the page faults per database. It reports the total number of page faults and splits them into Major or Minor faults as well as Read or Write access.
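
Under the hood, the classification can rely on the vm.pagefault tapset probes. The following is only a sketch of the idea, not the actual script: it omits the per-database grouping shown earlier and assumes the vm.pagefault probe aliases are available in your kernel’s tapset:

```systemtap
global faults

probe vm.pagefault
{
  # write_access distinguishes read faults from write faults
  faults[write_access ? "write" : "read"] <<< 1
}

probe vm.pagefault.return
{
  # fault_type tells major faults from minor ones
  faults[vm_fault_contains(fault_type, VM_FAULT_MAJOR) ? "major" : "minor"] <<< 1
}
```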

Usage:

$> stap -g ./page_faults_per_db.stp <oracle uid> <refresh time ms>

Output example:

$> stap -g ./page_faults_per_db.stp 54321 10000

---------------------------------------------------------------------------------------
DBNAME      : READS_PFLT   WRITES_PFLT  TOTAL_PFLT   MAJOR_PFLT   MINOR_PFLT
---------------------------------------------------------------------------------------
BDT         : 30418        22526        52944        22           52922
NOT_A_DB    : 773          1088         1861         4            1858

Remarks

  • In this post the word database stands for “all the foreground and background processes linked to an Oracle database”.
  • In a consolidated environment, having a view per database can be very useful.
  • Should you be interested in this subject, then do read Hatem Mahmoud’s post.

Conclusion

We are able to track and group the page faults at the database level.

SystemTap and Oracle RDBMS: Time in Various States, VFS I/O and Block I/O

Introduction

Now that I am able to aggregate SystemTap probes by Oracle database, it’s time to gather several scripts into a toolkit. The toolkit is available in this github repository.

Let’s describe 3 new members of the toolkit:

  • schedtimes_per_db.stp: To track time databases spend in various states
  • vfsio_per_db.stp: To track I/O by database through the Virtual File System (vfs) layer
  • blkio_per_db.stp: To track I/O by database through the block IO layer

schedtimes_per_db

This script tracks the time databases spend in various states. It also reports the time spent by non-Oracle processes.

Usage:

$> stap -g ./schedtimes_per_db.stp <oracle uid> <refresh time ms>

Output example:

$> stap -g ./schedtimes_per_db.stp 54321 10000

------------------------------------------------------------------
DBNAME    :    run(us)  sleep(us) iowait(us) queued(us)  total(us)
------------------------------------------------------------------
NOT_A_DB  :     447327  200561911       1328     517522  201528088
BDT       :      42277  316189082          0      69355  316300714
VBDT      :      74426  326694570          0      77489  326846485


vfsio_per_db

This script tracks the database I/O through the VFS layer. It also reports the I/O in this layer for non-Oracle processes.
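
As a sketch of the accounting (grouping omitted for brevity, and not the actual script), the VFS layer can be observed with the vfs.read and vfs.write return probes, where $return holds the number of bytes actually transferred:

```systemtap
global vfsio

probe vfs.read.return
{
  if ($return > 0)
    vfsio["read"] <<< $return   # bytes actually read at the VFS layer
}

probe vfs.write.return
{
  if ($return > 0)
    vfsio["write"] <<< $return  # bytes actually written at the VFS layer
}
```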

Usage:

$> stap -g ./vfsio_per_db.stp <oracle uid> <refresh time ms>

Output example:

$> stap -g ./vfsio_per_db.stp 54321 10000

------------------------------------------------------------------------
DBNAME      : NB_READ   READ_KB   NB_WRITE  WRITE_KB  NB_TOTAL  TOTAL_KB
------------------------------------------------------------------------
BDTS        : 110       347       6         96        116       443
NOT_A_DB    : 89        11        2         0         91        11

blkio_per_db

This script tracks the database I/O through the block IO layer. It also reports the I/O in this layer for non-Oracle processes.
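
At this layer the accounting can hook the ioblock.request probe, which exposes the request size and direction. Again, this is only a sketch (grouping omitted), not the actual script:

```systemtap
global blkio

probe ioblock.request
{
  # size is the request size in bytes; bio_rw_str(rw) returns the direction
  blkio[bio_rw_str(rw)] <<< size
}
```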

Usage:

$> stap -g ./blkio_per_db.stp <oracle uid> <refresh time ms>

Output example:

$> stap -g ./blkio_per_db.stp 54321 10000

------------------------------------------------------------------------
DBNAME      : NB_READ   READ_KB   NB_WRITE  WRITE_KB  NB_TOTAL  TOTAL_KB
------------------------------------------------------------------------
BDTS        : 9690      110768    18        192       9708      110960
NOT_A_DB    : 0         0         6         560       6         560

Remarks

  • The schedtimes_per_db script is mainly inspired by this one (full credit goes to the authors).
  • Why is it interesting to look at the VFS layer? Answers are in this awesome File System Latency series (see parts 1, 2, 3, 4 and 5) from Brendan Gregg.
  • In this post the word database stands for “all the foreground and background processes linked to an Oracle database”.
  • In a consolidated environment, having a view per database can be very useful.

Conclusion

The toolkit has been created and 3 new members are part of it. Expect it to grow a lot.

SystemTap and Oracle RDBMS: Aggregate by database

Introduction

The purpose of this post is to share a way to aggregate by Oracle database within SystemTap probes. Let’s describe a simple use case to make things clear.

Use Case

Let’s say that I want to get the number and the total size of TCP messages that have been sent and received by an Oracle database. To do so, let’s use 2 probes (tcp.sendmsg and tcp.recvmsg) and fetch the command line of the processes that trigger the event thanks to the cmdline_str() function. In the case of a process related to an Oracle database, the cmdline_str() output would look like one of those 2:

  • ora_<something>_<instance_name>
  • oracle<instance_name> (LOCAL=<something>)

So let’s write two embedded C functions to extract the instance name from each of the 2 strings described above.

Functions

get_oracle_name_b:

For example, this function would extract BDT from “ora_dbw0_BDT” or any “ora_<something>_BDT” string.

The code is the following:

function get_oracle_name_b:string (mystr:string) %{
char *ptr;
int  ch = '_';
char *strargs = STAP_ARG_mystr;

/* ptr points to the second '_' ("ora_dbw0_BDT" -> "_BDT") */
ptr = strchr( strchr( strargs , ch) + 1 , ch);
/* return everything after it */
snprintf(STAP_RETVALUE, MAXSTRINGLEN, "%s", ptr + 1);
%}

get_oracle_name_f:

For example, this function would extract BDT from “oracleBDT (LOCAL=NO)”, “oracleBDT (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))” or any “oracleBDT (LOCAL=<something>” string.

The code is the following:

function get_oracle_name_f:string (mystr:string) %{
char *ptr;
int  ch = ' ';
char substr_res[30];
char *strargs = STAP_ARG_mystr;

/* ptr points to the first space ("oracleBDT (LOCAL=NO)" -> " (LOCAL=NO)") */
ptr = strchr( strargs, ch );
/* copy what sits between "oracle" (6 characters) and that space */
strncpy (substr_res, strargs + 6, ptr - strargs - 6);
substr_res[ptr - strargs - 6] = '\0';
snprintf(STAP_RETVALUE, MAXSTRINGLEN, "%s", substr_res);
%}

Keeping in mind that the SystemTap aggregation operator is “<<<” (as explained here), we can use those 2 functions to aggregate within the probes by instance name (passing cmdline_str() as the parameter) that way:

probe tcp.recvmsg
{
  if ( isinstr(cmdline_str() , "ora_" ) == 1 && uid() == orauid) {
    tcprecv[get_oracle_name_b(cmdline_str())] <<< size
  } else if ( isinstr(cmdline_str() , "LOCAL=" ) == 1 && uid() == orauid) {
    tcprecv[get_oracle_name_f(cmdline_str())] <<< size
  } else {
    tcprecv["NOT_A_DB"] <<< size
  }
}

probe tcp.sendmsg
{
  if ( isinstr(cmdline_str() , "ora_" ) == 1 && uid() == orauid) {
    tcpsend[get_oracle_name_b(cmdline_str())] <<< size
  } else if ( isinstr(cmdline_str() , "LOCAL=" ) == 1 && uid() == orauid) {
    tcpsend[get_oracle_name_f(cmdline_str())] <<< size
  } else {
    tcpsend["NOT_A_DB"] <<< size
  }
}

As you can see, anything that is not related to an Oracle database is recorded and displayed as “NOT_A_DB”.
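
The aggregates are then typically flushed by a timer probe. This is only a sketch of the reporting side (assuming the refresh time in milliseconds is passed as the second command line argument, and ignoring databases that would appear in only one of the two aggregates):

```systemtap
probe timer.ms($2)
{
  printf("%-15s %9s %9s %9s %9s\n", "NAME", "NB_SENT", "SENT_KB", "NB_RECV", "RECV_KB")
  foreach (db in tcpsend) {
    printf("%-15s %9d %9d %9d %9d\n", db,
           @count(tcpsend[db]), @sum(tcpsend[db])/1024,
           @count(tcprecv[db]), @sum(tcprecv[db])/1024)
  }
  delete tcpsend
  delete tcprecv
}
```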

Based on this, the tcp_by_db.stp script has been created.

tcp_by_db.stp: script usage and output example

Usage:

$> stap -g ./tcp_by_db.stp <oracle uid> <refresh time ms>

Output Example:

$> stap -g ./tcp_by_db.stp 54321 10000
---------------------------------------------------------
NAME              NB_SENT   SENT_KB   NB_RECV   RECV_KB
---------------------------------------------------------
VBDT              5439      8231      10939     64154
NOT_A_DB          19        4         41        128
BDT               19        50        35        259

---------------------------------------------------------
NAME              NB_SENT   SENT_KB   NB_RECV   RECV_KB
---------------------------------------------------------
VBDT              267       109       391       2854
NOT_A_DB          102       19        116       680
BDT               26        55        47        326

---------------------------------------------------------
NAME              NB_SENT   SENT_KB   NB_RECV   RECV_KB
---------------------------------------------------------
VBDT              340       176       510       2940
NOT_A_DB          150       8         151       1165
BDT               42        77        61        423

Remarks:

  • The oracle uid on my machine is 54321
  • The refresh time has been set to 10 seconds
  • You can see the aggregation for 2 databases on my machine and also for all that is not an Oracle database

Whole Script source code

The whole code is available at:

Conclusion

Thanks to the embedded C functions we have been able to aggregate by Oracle database within SystemTap probes.

Running 12.1.0.2 Oracle Database on Docker

Introduction

As you may have noticed, a few weeks ago Oracle released Docker build files for the Oracle Database on Github, along with an associated blog post. Should you read the blog post, you would notice that you need to download the oracle binaries manually.

The purpose of this post is to provide a modified version of the original build files (focusing on the 12.1.0.2 EE database only), so that the oracle binaries are downloaded during the build of the image (thanks to Maris Elsins’s getMOSPatch script).

Create the 12.1.0.2 database docker image

The manual steps are:

Install the docker engine (you can find detailed instructions here); basically you have to add the yum repository and launch:

root@docker# yum install docker-engine

Change the maximum container size to 20 GB:

root@docker# docker daemon --storage-opt dm.basesize=20G

Clone those files from github:

root@docker# git clone https://github.com/bdrouvot/oracledb-docker.git

Update the oracledb-docker/Dockerfile file (ENV section only) with the appropriate values:

  • ORACLE_PWD=”<put_the_password_you_want>”
  • MOSU=”<your_oracle_support_username>”
  • MOSP=”<your_oracle_support_password>”

Build the Image:

root@docker# docker build --force-rm=true --no-cache=true -t oracle/database:12.1.0.2 .

That’s it, now we can:

Use the Image

Simply start a docker container that way:

root@docker# docker run -p 1521:1521 --name bdt12c oracle/database:12.1.0.2

The host that is running the docker engine is “docker”. You can connect to the database as you would normally, for example using Oracle SQL Developer:

(screenshot: SQL Developer connection settings)

Note that the Hostname is “docker”, that is to say the one that is hosting the docker engine.
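
For command line clients, the same connection data can go into a tnsnames.ora entry. This is only a sketch: the BDT12C alias and the ORCLCDB service name are assumptions on my part, so use the SID/service name you actually configured in the Dockerfile:

```
BDT12C =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = docker)(PORT = 1521))
    (CONNECT_DATA = (SERVICE_NAME = ORCLCDB))
  )
```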

Remarks

  • At the time of this writing, Oracle Database on Docker is NOT supported by Oracle. Use these files at your own discretion.
  • If you are interested in this kind of stuff, then you should also read Frits Hoogland’s blog post.
  • The Dockerfile used is very close to the one provided by Oracle (Gerald Venzl). Only a few things have been modified to download the oracle binaries during the image creation.
  • Thanks to Maris Elsins for getMOSPatch.