Measure the impact of DRBD on your PostgreSQL database thanks to pgio (the SLOB method for PostgreSQL)

You might want to replicate a PostgreSQL database with DRBD. In that case, you should measure the impact of your DRBD setup, especially if you plan to use DRBD in synchronous mode. As I am a lucky beta tester of pgio (the SLOB method for PostgreSQL), let’s use it to measure that impact.

DRBD configuration

The purpose of this post is not to explain how to set up DRBD. The configuration used here is the following:

  • Primary host: ubdrbd1
  • Secondary host: ubdrbd2

Configuration:

root@ubdrbd1:/etc# cat drbd.conf
global {
        usage-count no;
}
common {
        protocol C;
}
resource postgresql {
        on ubdrbd1 {
                device /dev/drbd1;
                disk /dev/sdb1;
                address 172.16.170.153:7789;
                meta-disk internal;
        }
        on ubdrbd2 {
                device /dev/drbd1;
                disk /dev/sdb1;
                address 172.16.170.155:7789;
                meta-disk internal;
        }
}

The C protocol is the synchronous one: local write operations on the primary node are considered completed only after both the local and the remote disk writes (on /dev/sdb1) have been confirmed.
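
For reference, initializing and promoting such a resource typically looks like the following (a sketch; the exact steps depend on your DRBD version):

drbdadm create-md postgresql       # on both nodes: write the internal metadata on /dev/sdb1
drbdadm up postgresql              # on both nodes: attach the backing disk and connect the peers
drbdadm primary --force postgresql # on ubdrbd1 only: initial promotion, triggers the initial sync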

To ensure that PostgreSQL writes to this replicated device, let’s create a filesystem on it:

root@ubdrbd1:/etc# mkfs.ext4 /dev/drbd1
mke2fs 1.44.1 (24-Mar-2018)
Creating filesystem with 5242455 4k blocks and 1310720 inodes
Filesystem UUID: a7cc203b-39fa-46b3-a707-f330f72ca5b1
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

And mount it, in sync mode, on the default PostgreSQL data location:

root@ubdrbd1:~# mkdir /var/lib/postgresql
root@ubdrbd1:~# mount -o sync /dev/drbd1 /var/lib/postgresql
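
Note that this mount will not survive a reboot. If you want it in /etc/fstab, a hedged sketch could be (noauto, because the device can only be mounted once DRBD is up and this node is primary):

/dev/drbd1  /var/lib/postgresql  ext4  sync,noauto  0  0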

pgio setup

Once PostgreSQL is up and running, we can set up pgio. It is as easy as this:

postgres@ubdrbd1:~/$ tar -xvf pgio-0.9.tar
postgres@ubdrbd1:~/$ cd pgio

Then edit the configuration file to suit your needs. In our case, the configuration file is the following:

postgres@ubdrbd1:~/pgio$ cat pgio.conf
UPDATE_PCT=10
RUN_TIME=120
NUM_SCHEMAS=1
NUM_THREADS=1
WORK_UNIT=255
UPDATE_WORK_UNIT=8
SCALE=200M
DBNAME=pgio
CONNECT_STRING="pgio"
CREATE_BASE_TABLE=TRUE

As you can see:

  • I want 1 schema and 1 thread per schema
  • I have set UPDATE_PCT so that 10% of the calls will do an UPDATE (to test writes, and so the DRBD replication impact)
  • I kept the default work unit to read 255 blocks and, for those 10% of updates, to update 8 blocks only
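
Before creating the tablespace below, note that its location must already exist, be empty, and be owned by the postgres user. A minimal sketch to prepare it (the path sits on the DRBD-backed mount created earlier):

mkdir /var/lib/postgresql/pgdata
chown postgres:postgres /var/lib/postgresql/pgdata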

Now, create the pgio tablespace and pgio database:

postgres=# create tablespace pgio location '/var/lib/postgresql/pgdata';
CREATE TABLESPACE
postgres=# create database pgio tablespace pgio;
CREATE DATABASE

and run setup.sh to load the data:

postgres@ubdrbd1:~/pgio$ ./setup.sh

Job info:      Loading 200M scale into 1 schemas as per pgio.conf->NUM_SCHEMAS.
Batching info: Loading 1 schemas per batch as per pgio.conf->NUM_THREADS.
Base table loading time: 3 seconds.
Waiting for batch. Global schema count: 1. Elapsed: 0 seconds.
Waiting for batch. Global schema count: 1. Elapsed: 13 seconds.

Group data loading phase complete.         Elapsed: 13 seconds.

Check the content:

postgres@ubdrbd1:~/pgio$ echo '\d+' | psql pgio
                      List of relations
 Schema |   Name    | Type  |  Owner   |  Size  | Description
--------+-----------+-------+----------+--------+-------------
 public | pgio1     | table | postgres | 200 MB |
 public | pgio_base | table | postgres | 29 MB  |
(2 rows)

Now, let’s run pgio twice:

  • One time with DRBD not active
  • One time with DRBD active

For each test, the pgio database will be recreated and the setup re-run, and the same run duration will be used, so that the exact same workload is compared.
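
Concretely, resetting between the two runs boils down to something like this (a sketch):

postgres=# drop database pgio;
postgres=# create database pgio tablespace pgio;

followed by a new ./setup.sh run.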

First run: DRBD not active

The result is the following:

postgres@ubdrbd1:~/pgio$ ./runit.sh
Date: Sat Jun 2 05:09:46 UTC 2018
Database connect string: "pgio".
Shared buffers: 128MB.
Testing 1 schemas with 1 thread(s) accessing 200M (25600 blocks) of each schema.
Running iostat, vmstat and mpstat on current host--in background.
Launching sessions. 1 schema(s) will be accessed by 1 thread(s) each.
pg_stat_database stats:
datname| blks_hit| blks_read|tup_returned|tup_fetched|tup_updated
BEFORE: pgio | 98053 | 34640 | 34309 | 6780 | 8
AFTER: pgio | 19467796 | 11956992 | 30807460 | 29665212 | 115478
DBNAME: pgio. 1 schemas, 1 threads(each). Run time: 120 seconds. RIOPS >99352< CACHE_HITS/s >161414<

As you can see, 115470 tuples have been updated during this 120-second run without DRBD in place.

Second run: DRBD active

For the purpose of this post, let’s simulate “slowness” on the DRBD replication. To do so, we can apply some throttling on the DRBD target device thanks to the blkio cgroup. I suggest reading Frits Hoogland’s post to see how it can be implemented.

My cgroup configuration on the DRBD secondary server is the following:

root@ubdrbd2:~# ls -l /dev/sdb
brw-rw---- 1 root disk 8, 16 Jun 1 15:16 /dev/sdb

root@ubdrbd2:~# cat /etc/cgconfig.conf
mount {
        blkio = /cgroup/blkio;
}

group iothrottle {
        blkio {
                blkio.throttle.write_iops_device="8:16 500";
        }
}

root@ubdrbd2:~# cat /etc/cgrules.conf
* blkio /iothrottle

This means: the number of writes per second on /dev/sdb will be limited to 500 for all processes.
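
With the libcgroup tooling, this configuration is typically applied with something like the following (a sketch; the exact service names differ across distributions):

cgconfigparser -l /etc/cgconfig.conf    # create the /cgroup/blkio hierarchy and the iothrottle group
cgrulesengd                             # start the rules daemon that enforces /etc/cgrules.conf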

Let’s check that DRBD is active:

root@ubdrbd1:~# drbdsetup status --verbose --statistics
postgresql role:Primary suspended:no
    write-ordering:flush
  volume:0 minor:1 disk:UpToDate
      size:20969820 read:453661 written:6472288 al-writes:215 bm-writes:0 upper-pending:4 lower-pending:0 al-suspended:no blocked:no
  peer connection:Connected role:Secondary congested:no
    volume:0 replication:Established peer-disk:UpToDate resync-suspended:no
        received:0 sent:5161968 out-of-sync:0 pending:4 unacked:0

root@ubdrbd2:~# drbdsetup status --verbose --statistics
postgresql role:Secondary suspended:no
    write-ordering:flush
  volume:0 minor:1 disk:UpToDate
      size:20969820 read:0 written:5296960 al-writes:0 bm-writes:0 upper-pending:0 lower-pending:2 al-suspended:no blocked:no
  peer connection:Connected role:Primary congested:no
    volume:0 replication:Established peer-disk:UpToDate resync-suspended:no
        received:5296968 sent:0 out-of-sync:0 pending:0 unacked:2

With this configuration in place, let’s run pgio on the source machine:

postgres@ubdrbd1:~/pgio$ ./runit.sh
Date: Sat Jun 2 05:37:03 UTC 2018
Database connect string: "pgio".
Shared buffers: 128MB.
Testing 1 schemas with 1 thread(s) accessing 200M (25600 blocks) of each schema.
Running iostat, vmstat and mpstat on current host--in background.
Launching sessions. 1 schema(s) will be accessed by 1 thread(s) each.
pg_stat_database stats:
datname| blks_hit| blks_read|tup_returned|tup_fetched|tup_updated
BEFORE: pgio | 111461 | 44099 | 35967 | 5640 | 6
AFTER: pgio | 3879414 | 2132201 | 5799491 | 5766888 | 22425
DBNAME: pgio. 1 schemas, 1 threads(each). Run time: 120 seconds. RIOPS >17400< CACHE_HITS/s >31399<

As you can see, 22419 tuples have been updated during this 120-second run with DRBD and blkio throttling in place (far less than the 115470 observed in the first test).

pgio also provides snapshots of iostat, vmstat and mpstat, so it is easy to check that the throttling was in place (writes per second < 500 on the source, due to the throttling on the target):

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdb              0.00  425.42      0.00      3.20     0.00    84.41   0.00  16.56    0.00    0.18   0.08     0.00     7.71   0.16   6.64
sdb              0.00  415.18      0.00      3.16     0.00    74.92   0.00  15.29    0.00    0.20   0.08     0.00     7.80   0.17   7.26
.
.

compared to the first run, where neither DRBD nor throttling was in place:

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdb              0.00 1311.07      0.00     10.14     0.00    62.98   0.00   4.58    0.00    0.10   0.14     0.00     7.92   0.10  13.70
sdb              0.00 1252.05      0.00      9.67     0.00    62.33   0.00   4.74    0.00    0.11   0.13     0.00     7.91   0.10  13.01
.
.


So?

Thanks to pgio, we have been able to measure the impact of our DRBD replication setup on PostgreSQL by executing the exact same workload during both tests. We simulated a poorly performing DRBD replication by using blkio throttling on the secondary node.

Want to read more about pgio?

Sneak Preview of pgio (The SLOB Method for PostgreSQL) Part I: The Beta pgio README File.

Sneak Preview of pgio (The SLOB Method for PostgreSQL) Part II: Bulk Data Loading.

Sneak Preview of pgio (The SLOB Method for PostgreSQL) Part III: Link To The Full README file for Beta pgio v0.9.

Sneak Preview of pgio (The SLOB Method for PostgreSQL) Part IV: How To Reduce The Amount of Memory In The Linux Page Cache For Testing Purposes.

https://blog.dbi-services.com/postgres-the-fsync-issue-and-pgio-the-slob-method-for-postgresql/



Visualize PostgreSQL index file with pgdfv

Introduction

In the previous blog post, pgdfv (PostgreSQL data file visualizer) was introduced. At that time the utility was only able to display data files. It is now also able to display index files. If you are not familiar with PostgreSQL block internals, I suggest reading Frits Hoogland’s study in this series of blog posts.

The utility usage is:

$ ./pgdfv
-df     Path to a datafile (mandatory if indexfile is used)
-if     Path to an indexfile
-b      Block size (default 8192)

As you can see, you can now specify an index file. In that case, the following information will be displayed:

  • The percentage of free space within an index page
  • The percentage of current rows an index page refers to
  • The percentage of HOT redirect rows an index page refers to

It works for B-tree indexes and displays this information for leaf blocks only.

As a picture is worth a thousand words, let’s see some examples.
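
Note that the inspection queries used below (get_raw_page, heap_page_items, bt_page_items) come from the pageinspect extension; if it is not already installed:

pgdbv=# create extension pageinspect;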

Examples

Let’s create a table, insert 4 rows, and create an index:

pgdbv=# create table bdtable(id int not null, f varchar(30) );
CREATE TABLE
pgdbv=# insert into bdtable ( id, f ) values (1, 'aaaaaaaaaa'), (2, 'bbbbbbbbbb'), (3, 'cccccccccc'), (4, 'dddddddddd');
INSERT 0 4
pgdbv=# create index bdtindex on bdtable (f);
CREATE INDEX

and inspect the table and index content:

pgdbv=# select * from heap_page_items(get_raw_page('bdtable',0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
  1 |   8152 |        1 |     39 |   2489 |      0 |        0 | (0,1)  |           2 |       2050 |     24 |        |       | \x010000001761616161616161616161
  2 |   8112 |        1 |     39 |   2489 |      0 |        0 | (0,2)  |           2 |       2050 |     24 |        |       | \x020000001762626262626262626262
  3 |   8072 |        1 |     39 |   2489 |      0 |        0 | (0,3)  |           2 |       2050 |     24 |        |       | \x030000001763636363636363636363
  4 |   8032 |        1 |     39 |   2489 |      0 |        0 | (0,4)  |           2 |       2050 |     24 |        |       | \x040000001764646464646464646464
(4 rows)

pgdbv=# select * from bt_page_items('bdtindex',1);
 itemoffset | ctid  | itemlen | nulls | vars |                      data
------------+-------+---------+-------+------+-------------------------------------------------
          1 | (0,1) |      24 | f     | t    | 17 61 61 61 61 61 61 61 61 61 61 00 00 00 00 00
          2 | (0,2) |      24 | f     | t    | 17 62 62 62 62 62 62 62 62 62 62 00 00 00 00 00
          3 | (0,3) |      24 | f     | t    | 17 63 63 63 63 63 63 63 63 63 63 00 00 00 00 00
          4 | (0,4) |      24 | f     | t    | 17 64 64 64 64 64 64 64 64 64 64 00 00 00 00 00
(4 rows)

In PostgreSQL, each table is stored in a separate file. When a table exceeds 1 GB, it is divided into gigabyte-sized segments.

Let’s check which file contains the table and which one contains the index:

pgdbv=# SELECT pg_relation_filepath('bdtable');
 pg_relation_filepath
----------------------
 base/16416/16497
(1 row)

pgdbv=# SELECT pg_relation_filepath('bdtindex');
 pg_relation_filepath
----------------------
 base/16416/16503
(1 row)

So we can use the utility on those files this way:
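
That is, something like the following (a sketch; $PGDATA stands for your data directory, an assumption about your layout):

$ ./pgdfv -df $PGDATA/base/16416/16497 -if $PGDATA/base/16416/16503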

So the data block has been displayed (same behavior as in the previous blog post), as well as 2 index blocks.

The index block display can be:

  • a letter or a question mark: M for the meta-page, R for the root page, and ? for “non-root, non-meta and non-leaf” pages.
  • a number: it represents the percentage of HOT redirect rows the index page refers to (instead of unused rows, as is the case for data blocks).

The number is displayed only for leaf blocks.

So, one leaf index block has been detected with more than 75% of free space (hence the green color), more than 50% of the rows the index refers to are current (100% in our case, as t_xmax = 0 for all the rows), and HOT redirects are less than 10% (0 is displayed; 0% in our case, as there are no rows with lp_flags = 2).

Let’s update 3 rows: one update on the indexed column and 2 updates on the non-indexed column:

pgdbv=# update bdtable set f='aqwzsxedc' where id =1;
UPDATE 1
pgdbv=# update bdtable set id=id+10 where id in (2,3);
UPDATE 2
pgdbv=# checkpoint;
CHECKPOINT
pgdbv=# select * from heap_page_items(get_raw_page('bdtable',0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
  1 |   8152 |        1 |     39 |   2489 |   2490 |        0 | (0,5)  |           2 |       1282 |     24 |        |       | \x010000001761616161616161616161
  2 |   8112 |        1 |     39 |   2489 |   2491 |        0 | (0,6)  |       16386 |        258 |     24 |        |       | \x020000001762626262626262626262
  3 |   8072 |        1 |     39 |   2489 |   2491 |        0 | (0,7)  |       16386 |        258 |     24 |        |       | \x030000001763636363636363636363
  4 |   8032 |        1 |     39 |   2489 |      0 |        0 | (0,4)  |           2 |       2306 |     24 |        |       | \x040000001764646464646464646464
  5 |   7992 |        1 |     38 |   2490 |      0 |        0 | (0,5)  |           2 |      10498 |     24 |        |       | \x01000000156171777a7378656463
  6 |   7952 |        1 |     39 |   2491 |      0 |        0 | (0,6)  |       32770 |      10242 |     24 |        |       | \x0c0000001762626262626262626262
  7 |   7912 |        1 |     39 |   2491 |      0 |        0 | (0,7)  |       32770 |      10242 |     24 |        |       | \x0d0000001763636363636363636363
(7 rows)

pgdbv=# select * from bt_page_items('bdtindex',1);
 itemoffset | ctid  | itemlen | nulls | vars |                      data
------------+-------+---------+-------+------+-------------------------------------------------
          1 | (0,1) |      24 | f     | t    | 17 61 61 61 61 61 61 61 61 61 61 00 00 00 00 00
          2 | (0,5) |      24 | f     | t    | 15 61 71 77 7a 73 78 65 64 63 00 00 00 00 00 00
          3 | (0,2) |      24 | f     | t    | 17 62 62 62 62 62 62 62 62 62 62 00 00 00 00 00
          4 | (0,3) |      24 | f     | t    | 17 63 63 63 63 63 63 63 63 63 63 00 00 00 00 00
          5 | (0,4) |      24 | f     | t    | 17 64 64 64 64 64 64 64 64 64 64 00 00 00 00 00
(5 rows)

And launch the tool again:

As you can see, we still have more than 75% of free space in the leaf index block, but the way the color is displayed has changed (because now less than 50% of the rows the index refers to are current, that is, have t_xmax = 0), and HOT redirects are still less than 10% (0 is displayed).

Let’s vacuum the table:

pgdbv=# vacuum  bdtable;
VACUUM
pgdbv=# checkpoint;
CHECKPOINT
pgdbv=# select * from heap_page_items(get_raw_page('bdtable',0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
  1 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |
  2 |      6 |        2 |      0 |        |        |          |        |             |            |        |        |       |
  3 |      7 |        2 |      0 |        |        |          |        |             |            |        |        |       |
  4 |   8152 |        1 |     39 |   2489 |      0 |        0 | (0,4)  |           2 |       2306 |     24 |        |       | \x040000001764646464646464646464
  5 |   8112 |        1 |     38 |   2490 |      0 |        0 | (0,5)  |           2 |      10498 |     24 |        |       | \x01000000156171777a7378656463
  6 |   8072 |        1 |     39 |   2491 |      0 |        0 | (0,6)  |       32770 |      10498 |     24 |        |       | \x0c0000001762626262626262626262
  7 |   8032 |        1 |     39 |   2491 |      0 |        0 | (0,7)  |       32770 |      10498 |     24 |        |       | \x0d0000001763636363636363636363
(7 rows)

pgdbv=# select * from bt_page_items('bdtindex',1);
 itemoffset | ctid  | itemlen | nulls | vars |                      data
------------+-------+---------+-------+------+-------------------------------------------------
          1 | (0,5) |      24 | f     | t    | 15 61 71 77 7a 73 78 65 64 63 00 00 00 00 00 00
          2 | (0,2) |      24 | f     | t    | 17 62 62 62 62 62 62 62 62 62 62 00 00 00 00 00
          3 | (0,3) |      24 | f     | t    | 17 63 63 63 63 63 63 63 63 63 63 00 00 00 00 00
          4 | (0,4) |      24 | f     | t    | 17 64 64 64 64 64 64 64 64 64 64 00 00 00 00 00
(4 rows)

and launch the utility:

As you can see, we still have more than 75% of free space, and now more than 50% of the rows the index refers to are current. Between 50 and 60% of the rows the index refers to are HOT redirects (so 5 is displayed): ctids (0,2) and (0,3), that is 2 out of the 4 rows the index refers to.

The legend and summary are dynamic and depend on the contents of the scanned blocks (data and index).

For example on a table made of 757 blocks, you could end up with something like:

The same table, after a VACUUM FULL, would produce:

Remarks

  • The tool is available in this repository.
  • The tool works for B-tree indexes only.
  • The tool is 100% inspired by Frits Hoogland’s blog post series and by odbv (written by Kamil Stawiarski), which can be used to visualize Oracle database blocks.

Conclusion

pgdfv can now be used to visualize PostgreSQL data and index file pages.

Welcome to pgdfv: PostgreSQL data file visualizer

Introduction

As you may know, a PostgreSQL database page contains a lot of information, as documented here. A great study has been done by Frits Hoogland in this series of blog posts. I strongly recommend reading Frits’ series before reading this blog post (unless you are already familiar with PostgreSQL block internals).

By reading the contents of a page we can extract:

  • The percentage of free space within a page
  • The percentage of current rows within a page
  • The percentage of unused rows within a page

Welcome to pgdfv

pgdfv stands for PostgreSQL data file visualizer. It helps to visualize data file pages in an easy way.

For each block:

  • A color is assigned (depending on the percentage of free space)
  • The color can be displayed in 2 ways (depending on whether more than 50 percent of the rows are current)
  • A number is assigned (based on the percentage of unused rows)

At the end the utility provides a summary for all the blocks visited. As a picture is worth a thousand words, let’s see some examples.

Examples

Let’s create a table, insert 4 rows in it and inspect its content:

pgdbv=# create table bdtable(id int not null, f varchar(30) );
CREATE TABLE
pgdbv=# insert into bdtable ( id, f ) values (1, 'aaaaaaaaaa');
INSERT 0 1
pgdbv=# insert into bdtable ( id, f ) values (2, 'aaaaaaaaaa');
INSERT 0 1
pgdbv=# insert into bdtable ( id, f ) values (3, 'aaaaaaaaaa');
INSERT 0 1
pgdbv=# insert into bdtable ( id, f ) values (4, 'aaaaaaaaaa');
INSERT 0 1
pgdbv=# checkpoint;
CHECKPOINT
pgdbv=# select * from heap_page_items(get_raw_page('bdtable',0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
  1 |   8152 |        1 |     39 |   1945 |      0 |        0 | (0,1)  |           2 |       2050 |     24 |        |       | \x010000001761616161616161616161
  2 |   8112 |        1 |     39 |   1946 |      0 |        0 | (0,2)  |           2 |       2050 |     24 |        |       | \x020000001761616161616161616161
  3 |   8072 |        1 |     39 |   1947 |      0 |        0 | (0,3)  |           2 |       2050 |     24 |        |       | \x030000001761616161616161616161
  4 |   8032 |        1 |     39 |   1948 |      0 |        0 | (0,4)  |           2 |       2050 |     24 |        |       | \x040000001761616161616161616161
(4 rows)

In PostgreSQL, each table is stored in a separate file. When a table exceeds 1 GB, it is divided into gigabyte-sized segments.

Let’s check which file contains the table:

pgdbv=# SELECT pg_relation_filepath('bdtable');
 pg_relation_filepath
----------------------
 base/16416/16448

Let’s use the utility on this file:
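
That is, something like the following (a sketch; $PGDATA stands for your data directory):

$ ./pgdfv -df $PGDATA/base/16416/16448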

So one block is detected, with more than 75% of free space (hence the green color), more than 50% of the rows are current (100% in our case, as t_xmax = 0 for all the rows), and unused rows are less than 10% (0 is displayed; 0% in our case, as there are no rows with lp_flags = 0).

Let’s delete 3 rows:

pgdbv=# delete from bdtable where id <=3;
DELETE 3
pgdbv=# checkpoint;
CHECKPOINT
pgdbv=# select * from heap_page_items(get_raw_page('bdtable',0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
  1 |   8152 |        1 |     39 |   1945 |   1949 |        0 | (0,1)  |        8194 |        258 |     24 |        |       | \x010000001761616161616161616161
  2 |   8112 |        1 |     39 |   1946 |   1949 |        0 | (0,2)  |        8194 |        258 |     24 |        |       | \x020000001761616161616161616161
  3 |   8072 |        1 |     39 |   1947 |   1949 |        0 | (0,3)  |        8194 |        258 |     24 |        |       | \x030000001761616161616161616161
  4 |   8032 |        1 |     39 |   1948 |      0 |        0 | (0,4)  |           2 |       2306 |     24 |        |       | \x040000001761616161616161616161
(4 rows)

and launch the tool again:

As you can see, we still have more than 75% of free space in the block, but the way the color is displayed has changed (because now less than 50% of the rows are current, that is, have t_xmax = 0), and unused rows are still less than 10% (0 is displayed).

Let’s vacuum the table:

pgdbv=# vacuum bdtable;
VACUUM
pgdbv=# checkpoint;
CHECKPOINT
pgdbv=# select * from heap_page_items(get_raw_page('bdtable',0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
  1 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |
  2 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |
  3 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |
  4 |   8152 |        1 |     39 |   1948 |      0 |        0 | (0,4)  |           2 |       2306 |     24 |        |       | \x040000001761616161616161616161
(4 rows)

and launch the utility:

As you can see, we still have more than 75% of free space, less than 50% of the rows are current (only one in our case), and now between 70 and 80% of the rows in the block are unused (so 7 is displayed).

The legend and summary are dynamic and depend on the contents of the scanned blocks.

For example, on a newly created table (made of 355 blocks) you could end up with something like:

Then, you delete half of the rows:

Then, once the table has been vacuumed:

And once new rows have been inserted:

Remarks

  • The tool is available in this repository.
  • The tool is 100% inspired by Frits Hoogland’s blog post series and by odbv (written by Kamil Stawiarski), which can be used to visualize Oracle database blocks.

Conclusion

pgdfv is a new utility that can be used to visualize PostgreSQL data file pages.

SystemTap for PostgreSQL Toolkit

Introduction

The purpose of this post is to share some SystemTap tools that were initially written for Oracle and have been adapted for PostgreSQL.

The tools are:

  • pg_schedtimes.stp: to track time spent in various states (run, sleep, iowait, queued)
  • pg_page_faults.stp: to report the total number of page faults and split them into major or minor faults, as well as read or write access
  • pg_traffic.stp: to track the I/O (vfs, block) and network (tcp, udp, nfs) traffic

These tools are able to group the SystemTap probes per client connection (by database or user) and per server process.

Grouping the probes

As described in the documentation, on most platforms PostgreSQL modifies its command title as reported by ps, so that individual server processes can readily be identified.

For example on my lab, the processes are:

# ps -ef | grep postgres:
postgres   1460   1447  0 09:35 ?        00:00:00 postgres: logger process
postgres   1462   1447  0 09:35 ?        00:00:00 postgres: checkpointer process
postgres   1463   1447  0 09:35 ?        00:00:00 postgres: writer process
postgres   1464   1447  0 09:35 ?        00:00:00 postgres: wal writer process
postgres   1465   1447  0 09:35 ?        00:00:00 postgres: autovacuum launcher process
postgres   1466   1447  0 09:35 ?        00:00:00 postgres: stats collector process
postgres   7981   1447  0 12:56 ?        00:00:00 postgres: postgres postgres [local] idle
postgres   7983   1447  0 12:56 ?        00:00:00 postgres: bdt postgres 172.16.170.1(56175) idle
postgres   7984   1447  0 12:56 ?        00:00:00 postgres: bdt bdtdb 172.16.170.1(56203) idle
  • The first six processes are background worker processes
  • Each of the remaining processes is a server process handling one client connection. Each such process sets its command line display in the form “postgres: user database host activity”

That said, we can fetch the command line of the process that triggers the probe event thanks to the cmdline_str() function, and:

  • filter the processes
  • extract the piece of information to be used to group the probes

So let’s write two embedded C functions to extract the relevant piece of information from each of the command line formats described above.

Functions

get_pg_dbname:

function get_pg_dbname:string (mystr:string) %{
    char *ptr;
    char *ptr2;
    int  ch = ' ';
    char substr_res[500];
    char *strargs = STAP_ARG_mystr;

    /* mystr is expected to look like "postgres: user database host activity" */
    /* ptr: the 2nd space, right before the database name */
    ptr = strchr( strchr( strargs , ch) + 1 , ch);
    /* ptr2: the 3rd space, right after the database name */
    ptr2 = strchr( ptr + 1 , ch);
    /* copy " database" and null-terminate it */
    strncpy (substr_res, ptr, ptr2 - ptr);
    substr_res[ptr2 - ptr] = '\0';
    /* return the database name, skipping the leading space */
    snprintf(STAP_RETVALUE, MAXSTRINGLEN, "%s", substr_res + 1);
%}

This function extracts the database from any “postgres: user database host activity” string.

get_pg_user_proc:

function get_pg_user_proc:string (mystr:string) %{
    char *ptr;
    char *ptr2;
    int ch = ' ';
    char substr_res[500];
    char *strargs = STAP_ARG_mystr;

    /* mystr is expected to look like "postgres: user database host activity"
       or "postgres: procname <any string> process" */
    /* ptr: the 1st space, right after "postgres:" */
    ptr = strchr( strargs , ch);
    /* ptr2: the 2nd space, right after the user (or procname) */
    ptr2 = strchr( ptr + 1 , ch);
    /* copy " user" (or " procname") and null-terminate it */
    strncpy (substr_res, ptr, ptr2 - ptr);
    substr_res[ptr2 - ptr] = '\0';
    /* return it without the leading space */
    snprintf(STAP_RETVALUE, MAXSTRINGLEN, "%s", substr_res + 1);
%}

This function extracts:

  • the user from any “postgres: user database host activity” string
  • the procname from any “postgres: procname <any string> process” string

Keeping in mind that the SystemTap aggregation operator is “<<<” (as explained here), we can use those two functions to aggregate within the probes by passing cmdline_str() as the parameter.
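
To illustrate the pattern, here is a minimal hedged sketch (the probe point, filtering and output are simplified illustrations, not the actual source of the tools):

global reads

# aggregate vfs read sizes per database, for processes owned by the postgres uid ($1)
probe vfs.read.return {
  if (uid() == $1 && isinstr(cmdline_str(), "postgres:") && $return > 0) {
    reads[get_pg_dbname(cmdline_str())] <<< $return
  }
}

# print and reset the aggregates every $2 milliseconds
probe timer.ms($2) {
  foreach (db in reads) {
    printf("%-15s reads: %d KB: %d\n", db, @count(reads[db]), @sum(reads[db]) / 1024)
  }
  delete reads
}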

Let’s see the usage and output of those tools.

pg_schedtimes.stp

Usage

 $> stap -g ./pg_schedtimes.stp  <pg uid> <refresh time ms> <db|user> <details|nodetails>

 <db|user>: group by db or user
 <details|nodetails>: process server details or grouped all together

Output example

The postgres userid is 26, and we want to see the time spent in various states by the client connections (grouped by database) and all the worker processes.

$> stap -g ./pg_schedtimes.stp 26 10000 db details

-----------------------------------------------------------------------
                   run(us)  sleep(us) iowait(us) queued(us)  total(us)
-----------------------------------------------------------------------
NOT_PG         :      34849  430368652       1433      35876  430440810
logger         :         81    9986664          0        446    9987191
checkpointer   :         24    9986622          0        543    9987189
writer         :        331    9986227          0        629    9987187
wal            :        248    9986279          0        657    9987184
autovacuum     :        862    9983132          0       3188    9987182
stats          :       2210    9981339          0       3631    9987180
postgres       :      11058    9975156          0        948    9987162
bdtdb          :         13    9986338          0        809    9987160

I can see the client connections grouped by the bdtdb and postgres databases, the worker processes and all that is not related to PostgreSQL (NOT_PG).

pg_page_faults.stp

Usage

 $> stap -g ./pg_page_faults.stp  <pg uid> <refresh time ms> <db|user> <details|nodetails>

 <db|user>: group by db or user
 <details|nodetails>: process server details or grouped all together

Output example

The postgres userid is 26, and we want to see the page faults by client connections grouped by user, with no worker process details.

$> stap -g ./pg_page_faults.stp 26 10000 user nodetails

------------------------------------------------------------------------------------------
                READS_PFLT   WRITES_PFLT  TOTAL_PFLT   MAJOR_PFLT   MINOR_PFLT
------------------------------------------------------------------------------------------
bdt            : 0            294          294          0            294
NOT_PG         : 0            71           71           0            71
PG SERVER      : 0            3            3            0            3

I can see the client connections grouped by the bdt user, the worker processes grouped all together as “PG SERVER” and all that is not related to PostgreSQL (NOT_PG).

pg_traffic.stp

Usage

 $> stap -g ./pg_traffic.stp <pg uid> <refresh time ms> <io|network|both> <db|user> <details|nodetails>

 <io|network|both>: see io, network or both activity
 <db|user>: group by db or user
 <details|nodetails>: process server details or grouped all together

Output example

The postgres userid is 26, and we want to see the I/O activity by client connections grouped by database and all the worker processes.

$> stap -g ./pg_traffic.stp 26 10000 io db details

-------------------------------------------------------------------------------------------------------------
|                                                          I/O                                              |
-------------------------------------------------------------------------------------------------------------
                                  READS                     |                     WRITES                    |
                                                            |                                               |
                       VFS                    BLOCK         |          VFS                    BLOCK         |
            | NB                 KB | NB                 KB | NB                 KB | NB                 KB |
            | --                 -- | --                 -- | --                 -- | --                 -- |
postgres    | 189               380 | 0                   0 | 0                   0 | 0                   0 |
bdtdb       | 38                127 | 0                   0 | 0                   0 | 0                   0 |
NOT_PG      | 79                  2 | 0                   0 | 10                  2 | 0                   0 |
autovacuum  | 49                141 | 0                   0 | 2                   0 | 0                   0 |
stats       | 0                   0 | 0                   0 | 42                 96 | 0                   0 |
-------------------------------------------------------------------------------------------------------------

I can see the client connections grouped by the bdtdb and postgres databases, the worker processes and all that is not related to PostgreSQL (NOT_PG).

For information, the network output produces something like:

-------------------------------------------------------------------------------------------------------------------------------------------------------------
|                                                                                Network                                                                    |
-------------------------------------------------------------------------------------------------------------------------------------------------------------
                                              RECV                                  |                                 SENT                                  |
                                                                                    |                                                                       |
                       TCP                     UDP                     NFS          |          TCP                     UDP                     NFS          |
            | NB                 KB | NB                 KB | NB                 KB | NB                 KB | NB                 KB | NB                 KB |
            | --                 -- | --                 -- | --                 -- | --                 -- | --                 -- | --                 -- |
postgres    | 95                760 | 0                   0 | 0                   0 | 52                  8 | 0                   0 | 0                   0 |
NOT_PG      | 6                  48 | 0                   0 | 0                   0 | 3                   3 | 0                   0 | 0                   0 |
bdtdb       | 10                 80 | 0                   0 | 0                   0 | 5                   0 | 0                   0 | 0                   0 |
stats       | 0                   0 | 0                   0 | 0                   0 | 0                   0 | 0                   0 | 0                   0 |
-------------------------------------------------------------------------------------------------------------------------------------------------------------

Source code

The source code is available in this GitHub repository.

Conclusion

Thanks to the embedded C functions, we have been able to aggregate the probes and display the information per worker process or client connection (grouped by database or user).