Retrieve PostgreSQL variable-length storage information thanks to pageinspect

Introduction

In PostgreSQL, a variable-length datatype value can be stored in-line or out-of-line (in a TOAST table). It can also be stored compressed or uncompressed (see the documentation for more details).

Let’s make use of the pageinspect extension and the information about variable-length datatypes found in postgres.h to build a query that retrieves the tuples’ variable-length storage information.
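
Note that pageinspect ships with the standard contrib modules and requires superuser privileges; if it is not already installed, create it first:

postgres=# create extension pageinspect;
CREATE EXTENSION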

The query

The query is the following:

$ cat toast_info.sql
select
t_ctid,
-- See postgres.h
CASE
  WHEN (fo is NULL) THEN 'null'
  -- VARATT_IS_EXTERNAL_ONDISK: VARATT_IS_EXTERNAL (fo = 'x01') && tag == VARTAG_ONDISK (x12)
  -- rawsize - VARHDRSZ > extsize
  WHEN (fo = 'x01') AND (tag = 'x12') AND (osize - 4 > ssize) THEN 'toasted (compressed)'
  -- rawsize - VARHDRSZ <= extsize
  WHEN (fo = 'x01') AND (tag = 'x12') AND (osize - 4 <= ssize) THEN 'toasted (uncompressed)'
  -- VARATT_IS_EXTERNAL_INDIRECT: VARATT_IS_EXTERNAL && tag == VARTAG_INDIRECT (x01)
  WHEN (fo = 'x01') AND (tag = 'x01') then 'indirect in-memory'
  -- VARATT_IS_EXTERNAL_EXPANDED: VARATT_IS_EXTERNAL && VARTAG_IS_EXPANDED(VARTAG_EXTERNAL)
  WHEN (fo = 'x01') AND (tag = 'x02' OR tag = 'x03') then 'expanded in-memory'
  -- VARATT_IS_SHORT (va_header & 0x01) == 0x01)
  WHEN (fo & 'x01' = 'x01') THEN 'short in-line'
  -- VARATT_IS_COMPRESSED (va_header & 0x03) == 0x02)
  WHEN (fo & 'x03' = 'x02') THEN 'long in-line (compressed)'
  ELSE 'long in-line (uncompressed)'
END as toast_info
from
(
select
page_items.t_ctid,
substr(page_items.t_attrs[1]::text,2,3)::bit(8) as fo,
('x'||substr(page_items.t_attrs[1]::text,5,2))::bit(8) as tag,
('x'||regexp_replace(substr(page_items.t_attrs[1]::text,7,8),'(\w\w)(\w\w)(\w\w)(\w\w)','\4\3\2\1'))::bit(32)::int as osize,
('x'||regexp_replace(substr(page_items.t_attrs[1]::text,15,8),'(\w\w)(\w\w)(\w\w)(\w\w)','\4\3\2\1'))::bit(32)::int as ssize
from
generate_series(0, pg_relation_size('bdttoast'::regclass::text) / 8192 - 1) blkno ,
heap_page_item_attrs(get_raw_page('bdttoast',blkno::int), 'bdttoast'::regclass) as page_items
) as hp;

As you can see, the query focuses on the bdttoast table. Let’s create this table and put some data in it to see the query in action.

Let’s see the query in action

Let’s create this table:

postgres=# CREATE TABLE bdttoast ( message text );
CREATE TABLE

The message field storage type is “extended”:

postgres=# \d+ bdttoast
                                 Table "public.bdttoast"
 Column  | Type | Collation | Nullable | Default | Storage  | Stats target | Description
---------+------+-----------+----------+---------+----------+--------------+-------------
 message | text |           |          |         | extended |              |

extended allows both compression and out-of-line storage. This is the default for most TOAST-able data types. Compression will be attempted first, then out-of-line storage if the row is still too big.
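
As a side note, the per-column mode can be read straight from pg_attribute, where attstorage is 'p' for plain, 'm' for main, 'e' for external and 'x' for extended:

select attname, attstorage
from pg_attribute
where attrelid = 'bdttoast'::regclass
and attnum > 0;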

Let’s add one tuple:

postgres=# INSERT INTO bdttoast VALUES ('default');
INSERT 0 1

and check how the message field has been stored thanks to the query:

postgres=# \i toast_info.sql
 t_ctid |  toast_info
--------+---------------
 (0,1)  | short in-line
(1 row)

As you can see, it has been stored in-line.
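
A quick cross-check without pageinspect: pg_column_size() returns the on-disk size of a datum, and a short varlena header is a single byte, so the 7-character value should report 8 bytes:

select ctid, pg_column_size(message) from bdttoast;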

Add another tuple with more data in the message field, and check its storage information:

postgres=# INSERT INTO bdttoast VALUES (repeat('a',10000));
INSERT 0 1
postgres=# \i toast_info.sql
 t_ctid |        toast_info
--------+---------------------------
 (0,1)  | short in-line
 (0,2)  | long in-line (compressed)
(2 rows)

As you can see, this field value has been stored long in-line and compressed.

Add another tuple with even more data in the message field, and check its storage information:

postgres=# INSERT INTO bdttoast VALUES (repeat('b',1000000));
INSERT 0 1
postgres=# \i toast_info.sql
 t_ctid |        toast_info
--------+---------------------------
 (0,1)  | short in-line
 (0,2)  | long in-line (compressed)
 (0,3)  | toasted (compressed)
(3 rows)

This time it has been TOAST-ed and compressed.
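
To confirm that out-of-line storage is now in use, we can also look at the table’s TOAST relation and its size:

select reltoastrelid::regclass as toast_table,
pg_relation_size(reltoastrelid) as toast_bytes
from pg_class
where relname = 'bdttoast';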

Let’s change the message column storage to external:

postgres=# ALTER TABLE bdttoast ALTER COLUMN message SET STORAGE EXTERNAL;
ALTER TABLE
postgres=# \d+ bdttoast
                                 Table "public.bdttoast"
 Column  | Type | Collation | Nullable | Default | Storage  | Stats target | Description
---------+------+-----------+----------+---------+----------+--------------+-------------
 message | text |           |          |         | external |              |

“external” allows out-of-line storage but not compression.

Now, let’s add 3 tuples with the same message field sizes as the 3 previously added and compare the storage information.

Let’s add one tuple with the same message size as the one previously stored as “short in-line”:

postgres=# INSERT INTO bdttoast VALUES ('externa');
INSERT 0 1
postgres=# \i toast_info.sql
 t_ctid |        toast_info
--------+---------------------------
 (0,1)  | short in-line
 (0,2)  | long in-line (compressed)
 (0,3)  | toasted (compressed)
 (0,4)  | short in-line
(4 rows)

This one is still short in-line (same as t_ctid (0,1)).

Let’s add one tuple with the same message size as the one previously stored as “long in-line (compressed)”:

postgres=# INSERT INTO bdttoast VALUES (repeat('c',10000));
INSERT 0 1
postgres=# \i toast_info.sql
 t_ctid |        toast_info
--------+---------------------------
 (0,1)  | short in-line
 (0,2)  | long in-line (compressed)
 (0,3)  | toasted (compressed)
 (0,4)  | short in-line
 (0,5)  | toasted (uncompressed)
(5 rows)

This one is TOAST-ed and uncompressed with storage external (compare with t_ctid (0,2), stored with storage extended).

Let’s add one tuple with the same message size as the one previously stored as “toasted (compressed)”:

postgres=# INSERT INTO bdttoast VALUES (repeat('d',1000000));
INSERT 0 1
postgres=# \i toast_info.sql
 t_ctid |        toast_info
--------+---------------------------
 (0,1)  | short in-line
 (0,2)  | long in-line (compressed)
 (0,3)  | toasted (compressed)
 (0,4)  | short in-line
 (0,5)  | toasted (uncompressed)
 (0,6)  | toasted (uncompressed)
(6 rows)

This one is TOAST-ed and uncompressed with storage external (compare with t_ctid (0,3), stored with storage extended).

Remarks

  • Use this query on little-endian machines only (the byte order differs on big-endian machines, which would break the decoding)
  • t_attrs[1] is used in the query to retrieve the information because the message field is the 1st column of the relation; to adapt the query to another column, look up its position as sketched below
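
A sketch of such a lookup, assuming we want the position (attnum) of the message column:

select attnum
from pg_attribute
where attrelid = 'bdttoast'::regclass
and attname = 'message';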

Conclusion

Thanks to the pageinspect extension we have been able to write a query to retrieve variable-length storage information. We have been able to compare how our data has been stored depending on the column storage being used (extended or external).

Get toast chunk_id from the user table tuples or from the toast index thanks to pageinspect

Introduction

TOAST stands for “The Oversized-Attribute Storage Technique” and allows breaking up large field values into multiple physical rows (see the PostgreSQL documentation for more details).

The goal of this post is to provide a way to retrieve TOAST information from the user table tuples or from the TOAST index without querying the TOAST table directly.

We will be able to link the user table tuples to the toast pages by using user table tuples and toast index data (not querying the toast at all).

Build the playground

Create a table with a TOAST-able field:

toasted=# CREATE TABLE bdttab (id int,message text,age int);
CREATE TABLE

toasted=# \d+ bdttab
                                   Table "public.bdttab"
 Column  |  Type   | Collation | Nullable | Default | Storage  | Stats target | Description
---------+---------+-----------+----------+---------+----------+--------------+-------------
 id      | integer |           |          |         | plain    |              |
 message | text    |           |          |         | extended |              |
 age     | integer |           |          |         | plain    |              |
Access method: heap

Insert 10 rows:

toasted=# INSERT INTO bdttab
SELECT 1,(SELECT
string_agg(chr(floor(random() * 26)::int + 65), '')
FROM generate_series(1,10000)),6
FROM generate_series(1,10);
INSERT 0 10

Check the TOAST table and TOAST index names:

toasted=# select r.relname, t.relname as toast, i.relname as toast_index
from pg_class r, pg_class i, pg_index d, pg_class t
where r.relname = 'bdttab'
and d.indrelid = r.reltoastrelid
and i.oid = d.indexrelid
and t.oid = r.reltoastrelid;
 relname |      toast      |      toast_index
---------+-----------------+-----------------------
 bdttab  | pg_toast_192881 | pg_toast_192881_index
(1 row)

Retrieve the chunk_id for each tuple directly from the user table

The chunk_id is part of the tuple’s data, as explained in this slide (coming from this presentation).

Note that the toast relation id is also part of the tuple data.
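
For reference, here is a sketch of the on-disk toast pointer layout, based on struct varatt_external in postgres.h (18 bytes in total, multi-byte fields stored little-endian on x86):

-- byte 1        0x01           varlena header: external value
-- byte 2        0x12           vartag: VARTAG_ONDISK
-- bytes 3-6     va_rawsize     original data size (including the 4-byte header)
-- bytes 7-10    va_extsize     size of the data as stored externally
-- bytes 11-14   va_valueid     the chunk_id (unique within the toast table)
-- bytes 15-18   va_toastrelid  oid of the toast table containing the value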

As you can see from the slide, we can get the information with pageinspect, so let’s use it to build a query to retrieve the chunk_id and the toast relation id from the tuples:

Create the extension:

toasted=# create extension pageinspect;
CREATE EXTENSION

and query the user table to get the information:

toasted=# select
page_item_attrs.t_ctid,
page_item_attrs.t_attrs[2],
substr(substr(page_item_attrs.t_attrs[2],octet_length(page_item_attrs.t_attrs[2])-7,4)::text,3) as substr_for_chunk_id,
('x'||regexp_replace(substr(substr(page_item_attrs.t_attrs[2],octet_length(page_item_attrs.t_attrs[2])-7,4)::text,3),'(\w\w)(\w\w)(\w\w)(\w\w)','\4\3\2\1'))::bit(32)::int as chunk_id,
substr(substr(page_item_attrs.t_attrs[2],octet_length(page_item_attrs.t_attrs[2])-3,4)::text,3) as substr_for_toast_relid,
('x'||regexp_replace(substr(substr(page_item_attrs.t_attrs[2],octet_length(page_item_attrs.t_attrs[2])-3,4)::text,3),'(\w\w)(\w\w)(\w\w)(\w\w)','\4\3\2\1'))::bit(32)::int as toast_relid
FROM
generate_series(0, pg_relation_size('bdttab'::regclass::text) / 8192 - 1) blkno ,
heap_page_item_attrs(get_raw_page('bdttab', blkno::int), 'bdttab'::regclass) as page_item_attrs
where
substr(page_item_attrs.t_attrs[2]::text,3,2)='01';
 t_ctid |                t_attrs                 | substr_for_chunk_id | chunk_id | substr_for_toast_relid | toast_relid
--------+----------------------------------------+---------------------+----------+------------------------+-------------
 (0,1)  | \x0112142700001027000077f1020074f10200 | 77f10200            |   192887 | 74f10200               |      192884
 (0,2)  | \x0112142700001027000078f1020074f10200 | 78f10200            |   192888 | 74f10200               |      192884
 (0,3)  | \x0112142700001027000079f1020074f10200 | 79f10200            |   192889 | 74f10200               |      192884
 (0,4)  | \x011214270000102700007af1020074f10200 | 7af10200            |   192890 | 74f10200               |      192884
 (0,5)  | \x011214270000102700007bf1020074f10200 | 7bf10200            |   192891 | 74f10200               |      192884
 (0,6)  | \x011214270000102700007cf1020074f10200 | 7cf10200            |   192892 | 74f10200               |      192884
 (0,7)  | \x011214270000102700007df1020074f10200 | 7df10200            |   192893 | 74f10200               |      192884
 (0,8)  | \x011214270000102700007ef1020074f10200 | 7ef10200            |   192894 | 74f10200               |      192884
 (0,9)  | \x011214270000102700007ff1020074f10200 | 7ff10200            |   192895 | 74f10200               |      192884
 (0,10) | \x0112142700001027000080f1020074f10200 | 80f10200            |   192896 | 74f10200               |      192884
(10 rows)

As you can see, we are able to get the chunk_id and the toast relation id from the tuples.

Let’s verify that those values make sense, first checking the toast relation id:

toasted=# select relname from pg_class where oid=192884;
     relname
-----------------
 pg_toast_192881
(1 row)

and then checking the chunk_ids from the toast itself:

toasted=# select distinct(chunk_id) from pg_toast.pg_toast_192881;
 chunk_id
----------
   192894
   192895
   192889
   192888
   192893
   192892
   192896
   192891
   192887
   192890
(10 rows)

The values match the ones we got directly from the user table tuples.

Retrieve the chunk_id and chunk_seq directly from the toast index

The TOAST index contains this information too, so let’s query it:

toasted=# select
page_items.ctid,
page_items.data,
('x'||regexp_replace(substr(page_items.data,1,11),'(\w\w) (\w\w) (\w\w) (\w\w)','\4\3\2\1'))::bit(32)::int as chunk_id,
('x'||regexp_replace(substr(page_items.data,13,23),'(\w\w) (\w\w) (\w\w) (\w\w)','\4\3\2\1'))::bit(32)::int as chunk_seq
FROM
generate_series(1, pg_relation_size('pg_toast.pg_toast_192881_index'::regclass::text) / 8192 - 1) blkno ,
bt_page_items('pg_toast.pg_toast_192881_index', blkno::int) as page_items;

  ctid  |          data           | chunk_id | chunk_seq
--------+-------------------------+----------+-----------
 (0,1)  | 77 f1 02 00 00 00 00 00 |   192887 |         0
 (0,2)  | 77 f1 02 00 01 00 00 00 |   192887 |         1
 (0,3)  | 77 f1 02 00 02 00 00 00 |   192887 |         2
 (0,4)  | 77 f1 02 00 03 00 00 00 |   192887 |         3
 (1,1)  | 77 f1 02 00 04 00 00 00 |   192887 |         4
 (1,2)  | 77 f1 02 00 05 00 00 00 |   192887 |         5
 (1,3)  | 78 f1 02 00 00 00 00 00 |   192888 |         0
 (1,4)  | 78 f1 02 00 01 00 00 00 |   192888 |         1
 (2,1)  | 78 f1 02 00 02 00 00 00 |   192888 |         2
 (2,2)  | 78 f1 02 00 03 00 00 00 |   192888 |         3
 (2,3)  | 78 f1 02 00 04 00 00 00 |   192888 |         4
 (2,4)  | 78 f1 02 00 05 00 00 00 |   192888 |         5
 (3,1)  | 79 f1 02 00 00 00 00 00 |   192889 |         0
 (3,2)  | 79 f1 02 00 01 00 00 00 |   192889 |         1
 (3,3)  | 79 f1 02 00 02 00 00 00 |   192889 |         2
 (3,4)  | 79 f1 02 00 03 00 00 00 |   192889 |         3
 (4,1)  | 79 f1 02 00 04 00 00 00 |   192889 |         4
 (4,2)  | 79 f1 02 00 05 00 00 00 |   192889 |         5
 (4,3)  | 7a f1 02 00 00 00 00 00 |   192890 |         0
 (4,4)  | 7a f1 02 00 01 00 00 00 |   192890 |         1
 (5,1)  | 7a f1 02 00 02 00 00 00 |   192890 |         2
 (5,2)  | 7a f1 02 00 03 00 00 00 |   192890 |         3
.
.
.
 (12,3) | 7f f1 02 00 02 00 00 00 |   192895 |         2
 (12,4) | 7f f1 02 00 03 00 00 00 |   192895 |         3
 (13,1) | 7f f1 02 00 04 00 00 00 |   192895 |         4
 (13,2) | 7f f1 02 00 05 00 00 00 |   192895 |         5
 (13,3) | 80 f1 02 00 00 00 00 00 |   192896 |         0
 (13,4) | 80 f1 02 00 01 00 00 00 |   192896 |         1
 (14,1) | 80 f1 02 00 02 00 00 00 |   192896 |         2
 (14,2) | 80 f1 02 00 03 00 00 00 |   192896 |         3
 (14,3) | 80 f1 02 00 04 00 00 00 |   192896 |         4
 (14,4) | 80 f1 02 00 05 00 00 00 |   192896 |         5
(60 rows)

Note that the chunk_ids coming from the index, the user table tuples and the toast itself match.

Use case example: a TOAST page is corrupted and I want to know which tuples from the user table are affected

Say page 12 of the TOAST table is corrupted; we could then get the affected chunk_id and chunk_seq from the TOAST index this way (by filtering on the ctid):

toasted=# select
page_items.ctid,
page_items.data,
('x'||regexp_replace(substr(page_items.data,1,11),'(\w\w) (\w\w) (\w\w) (\w\w)','\4\3\2\1'))::bit(32)::int as chunk_id,
('x'||regexp_replace(substr(page_items.data,13,23),'(\w\w) (\w\w) (\w\w) (\w\w)','\4\3\2\1'))::bit(32)::int as chunk_seq
FROM
generate_series(1, pg_relation_size('pg_toast.pg_toast_192881_index'::regclass::text) / 8192 - 1) blkno ,
bt_page_items('pg_toast.pg_toast_192881_index', blkno::int) as page_items where ctid::text like '(12,%';

  ctid  |          data           | chunk_id | chunk_seq
--------+-------------------------+----------+-----------
 (12,1) | 7f f1 02 00 00 00 00 00 |   192895 |         0
 (12,2) | 7f f1 02 00 01 00 00 00 |   192895 |         1
 (12,3) | 7f f1 02 00 02 00 00 00 |   192895 |         2
 (12,4) | 7f f1 02 00 03 00 00 00 |   192895 |         3
(4 rows)

And then look back at the affected user table tuples by filtering on the chunk_id:

toasted=# select
page_item_attrs.t_ctid,
page_item_attrs.t_attrs[2],
substr(substr(page_item_attrs.t_attrs[2],octet_length(page_item_attrs.t_attrs[2])-7,4)::text,3) as substr_for_chunk_id,
('x'||regexp_replace(substr(substr(page_item_attrs.t_attrs[2],octet_length(page_item_attrs.t_attrs[2])-7,4)::text,3),'(\w\w)(\w\w)(\w\w)(\w\w)','\4\3\2\1'))::bit(32)::int as chunk_id,
substr(substr(page_item_attrs.t_attrs[2],octet_length(page_item_attrs.t_attrs[2])-3,4)::text,3) as substr_for_toast_relid,
('x'||regexp_replace(substr(substr(page_item_attrs.t_attrs[2],octet_length(page_item_attrs.t_attrs[2])-3,4)::text,3),'(\w\w)(\w\w)(\w\w)(\w\w)','\4\3\2\1'))::bit(32)::int as toast_relid
FROM
generate_series(0, pg_relation_size('bdttab'::regclass::text) / 8192 - 1) blkno ,
heap_page_item_attrs(get_raw_page('bdttab', blkno::int), 'bdttab'::regclass) as page_item_attrs
where
substr(page_item_attrs.t_attrs[2]::text,3,2)='01' and ('x'||regexp_replace(substr(substr(page_item_attrs.t_attrs[2],octet_length(page_item_attrs.t_attrs[2])-7,4)::text,3),'(\w\w)(\w\w)(\w\w)(\w\w)','\4\3\2\1'))::bit(32)::int = 192895;

 t_ctid |                t_attrs                 | substr_for_chunk_id | chunk_id | substr_for_toast_relid | toast_relid
--------+----------------------------------------+---------------------+----------+------------------------+-------------
 (0,9)  | \x011214270000102700007ff1020074f10200 | 7ff10200            |   192895 | 74f10200               |      192884
(1 row)

So we know that the tuple with ctid (0,9) is linked to the corrupted TOAST page.

Remarks

  • t_attrs[2] is used in the query to retrieve the information from the user table. This is because the TOAST-able field (message) is the 2nd column of the relation.
  • substr(page_item_attrs.t_attrs[2]::text,3,2)='01' is used to filter the user table tuples. This is because 0x01 is used as the TOAST-ed value marker (see the slide).

Conclusion

Thanks to pageinspect we have been able to retrieve the chunk_id directly from the user table tuples. We have also been able to link the user table tuples to the TOAST pages by using the user table tuples and TOAST index data (not querying the TOAST table at all).

Sampling pg_stat_statements based on the active sessions and their associated queryid

Introduction

Now that we have the ability to sample and record the active sessions and their associated queryid with the pg_active_session_history view (see this blog post), it would be interesting to have insights into the query statistics at the time the sessions were active.

PostgreSQL provides query statistics through the pg_stat_statements view. We could query the pg_active_session_history and pg_stat_statements views and join them on the queryid field, but the query statistics would be:

  • cumulative
  • the ones at the time we would launch this query

So it would not provide the query statistics as they were at the time the sessions were active.

What’s new?

To get more granular query statistics, the pgsentinel extension has evolved so that it now samples pg_stat_statements:

  • at the same time it is sampling the active sessions
  • only for the queryids that were associated with an active session (if any) during the sampling

The samples are recorded into a new pg_stat_statements_history view.

This view looks like:

                    View "public.pg_stat_statements_history"
       Column        |           Type           | Collation | Nullable | Default 
---------------------+--------------------------+-----------+----------+---------
 ash_time            | timestamp with time zone |           |          | 
 userid              | oid                      |           |          | 
 dbid                | oid                      |           |          | 
 queryid             | bigint                   |           |          | 
 calls               | bigint                   |           |          | 
 total_time          | double precision         |           |          | 
 rows                | bigint                   |           |          | 
 shared_blks_hit     | bigint                   |           |          | 
 shared_blks_read    | bigint                   |           |          | 
 shared_blks_dirtied | bigint                   |           |          | 
 shared_blks_written | bigint                   |           |          | 
 local_blks_hit      | bigint                   |           |          | 
 local_blks_read     | bigint                   |           |          | 
 local_blks_dirtied  | bigint                   |           |          | 
 local_blks_written  | bigint                   |           |          | 
 temp_blks_read      | bigint                   |           |          | 
 temp_blks_written   | bigint                   |           |          | 
 blk_read_time       | double precision         |           |          | 
 blk_write_time      | double precision         |           |          | 

Remarks:

  • The field descriptions are the same as for pg_stat_statements (except for the ash_time one, which is the time of the active session history sampling)
  • As for pg_active_session_history, the pg_stat_statements_history view is implemented as an in-memory ring buffer where the number of samples to be kept is configurable (thanks to the pgsentinel_pgssh.max_entries parameter)
  • The data collected are still cumulative metrics, but you can make use of window functions in PostgreSQL to compute the delta between samples (and then get accurate statistics for the queries between two samples)

For example, we could get, per queryid and ash_time, the rows per second, calls per second and rows per call this way:

select ash_time,queryid,delta_rows/seconds "rows_per_second",delta_calls/seconds "calls_per_second",delta_rows/delta_calls "rows_per_call"
from(
SELECT ash_time,queryid,
EXTRACT(EPOCH FROM ash_time::timestamp) - lag (EXTRACT(EPOCH FROM ash_time::timestamp))
OVER (
        PARTITION BY pgssh.queryid
        ORDER BY ash_time
        ASC) as "seconds",
rows-lag(rows)
        OVER (
        PARTITION BY pgssh.queryid
        ORDER BY ash_time
        ASC) as "delta_rows",
calls-lag(calls)
        OVER (
        PARTITION BY pgssh.queryid
        ORDER BY ash_time
        ASC) as "delta_calls"
    FROM pg_stat_statements_history pgssh) as delta
where delta_calls > 0 and seconds > 0
order by ash_time desc;

           ash_time            |      queryid       | rows_per_second  | calls_per_second | rows_per_call
-------------------------------+--------------------+------------------+------------------+----------------
 2019-07-06 11:09:48.629218+00 | 250416904599144140 | 10322.0031121842 | 10322.0031121842 |              1
 2019-07-06 11:09:47.627184+00 | 250416904599144140 | 10331.3930170891 | 10331.3930170891 |              1
 2019-07-06 11:09:46.625383+00 | 250416904599144140 | 10257.7574710211 | 10257.7574710211 |              1
 2019-07-06 11:09:42.620219+00 | 250416904599144140 |  10296.311364551 |  10296.311364551 |              1
 2019-07-06 11:09:41.618404+00 | 250416904599144140 | 10271.6737455877 | 10271.6737455877 |              1
 2019-07-06 11:09:36.612209+00 | 250416904599144140 | 10291.1563299622 | 10291.1563299622 |              1
 2019-07-06 11:09:35.610378+00 | 250416904599144140 | 10308.9798914136 | 10308.9798914136 |              1
 2019-07-06 11:09:33.607367+00 | 250416904599144140 |  10251.230955397 |  10251.230955397 |              1
 2019-07-06 11:09:31.604193+00 | 250416904599144140 | 10284.3551339058 | 10284.3551339058 |              1
 2019-07-06 11:09:30.60238+00  | 250416904599144140 | 10277.4631222064 | 10277.4631222064 |              1
 2019-07-06 11:09:24.595353+00 | 250416904599144140 | 10283.3919912856 | 10283.3919912856 |              1
 2019-07-06 11:09:22.59222+00  | 250416904599144140 | 10271.3534800552 | 10271.3534800552 |              1
 2019-07-06 11:09:21.59021+00  | 250416904599144140 | 10300.1104655978 | 10300.1104655978 |              1
 2019-07-06 11:09:20.588376+00 | 250416904599144140 | 10343.9790974522 | 10343.9790974522 |              1
 2019-07-06 11:09:16.583341+00 | 250416904599144140 | 10276.5525289304 | 10276.5525289304 |              1

Remarks:

  • It’s good practice to normalize the statistics per second (as the sampling interval might change)
  • With that level of information we can understand the database activity in the past (thanks to the active sessions sampling) and get statistics per query at the time they were active
  • pgsentinel is available in this github repository

Conclusion

The pgsentinel extension now provides:

  • Active session history (through the pg_active_session_history view)
  • Queries statistics history (through the pg_stat_statements_history view), recorded at the exact same time as their associated active sessions

Active Session History in PostgreSQL: blocker and wait chain

While the active session history extension for PostgreSQL (pgsentinel) is still in beta, new information has been added to it.

The pg_active_session_history view is currently made of:

                   View "public.pg_active_session_history"
      Column      |           Type           | Collation | Nullable | Default
------------------+--------------------------+-----------+----------+---------
 ash_time         | timestamp with time zone |           |          |
 datid            | oid                      |           |          |
 datname          | text                     |           |          |
 pid              | integer                  |           |          |
 usesysid         | oid                      |           |          |
 usename          | text                     |           |          |
 application_name | text                     |           |          |
 client_addr      | text                     |           |          |
 client_hostname  | text                     |           |          |
 client_port      | integer                  |           |          |
 backend_start    | timestamp with time zone |           |          |
 xact_start       | timestamp with time zone |           |          |
 query_start      | timestamp with time zone |           |          |
 state_change     | timestamp with time zone |           |          |
 wait_event_type  | text                     |           |          |
 wait_event       | text                     |           |          |
 state            | text                     |           |          |
 backend_xid      | xid                      |           |          |
 backend_xmin     | xid                      |           |          |
 top_level_query  | text                     |           |          |
 query            | text                     |           |          |
 cmdtype          | text                     |           |          |
 queryid          | bigint                   |           |          |
 backend_type     | text                     |           |          |
 blockers         | integer                  |           |          |
 blockerpid       | integer                  |           |          |
 blocker_state    | text                     |           |          |

You could see it as samplings of pg_stat_activity providing more information:

  • ash_time: the sampling time
  • top_level_query: the top level statement (in case PL/pgSQL is used)
  • query: the statement being executed (not normalised as in pg_stat_statements, meaning you see the actual values)
  • cmdtype: the statement type (SELECT,UPDATE,INSERT,DELETE,UTILITY,UNKNOWN,NOTHING)
  • queryid: the queryid of the statement which links to pg_stat_statements
  • blockers: the number of blockers
  • blockerpid: the pid of the blocker (if blockers = 1), the pid of one blocker (if blockers > 1)
  • blocker_state: state of the blocker (state of the blockerpid)

Thanks to the queryid field, you are able to link the session activity with the SQL activity.
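
For example, a minimal sketch (assuming the pg_stat_statements extension is installed) joining the samples to the statements on queryid:

select ash.ash_time, ash.pid, ash.wait_event, pgss.query
from pg_active_session_history ash
join pg_stat_statements pgss using (queryid)
order by ash.ash_time desc
limit 10;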

The information related to the blocking activity (if any) has been added recently. Why? To easily drill down in case of sessions being blocked.

Let’s see how we could display some interesting information in case of blocked session(s), for example:

  • The wait chain
  • The time (in seconds) spent waiting in this chain
  • The percentage of the total wait time that this chain represents

As PostgreSQL provides recursive queries and window functions, let’s make use of them to write this query:

postgres@pgu:~$ cat pg_ash_wait_chain.sql
WITH RECURSIVE search_wait_chain(ash_time,pid, blockerpid, wait_event_type,wait_event,level, path)
AS (
          SELECT ash_time,pid, blockerpid, wait_event_type,wait_event, 1 AS level,
          'pid:'||pid||' ('||wait_event_type||' : '||wait_event||') ->'||'pid:'||blockerpid AS path
          from pg_active_session_history WHERE blockers > 0
        union ALL
          SELECT p.ash_time,p.pid, p.blockerpid, p.wait_event_type,p.wait_event, swc.level + 1 AS level,
          'pid:'||p.pid||' ('||p.wait_event_type||' : '||p.wait_event||') ->'||swc.path AS path
          FROM pg_active_session_history p, search_wait_chain swc
          WHERE p.blockerpid = swc.pid and p.ash_time = swc.ash_time and p.blockers > 0
)
select round(100 * count(*) / cnt)||'%' as "% of total wait",count(*) as seconds,path as wait_chain  from (
        SELECT  pid,wait_event,path,sum(count) over() as cnt from (
                select ash_time,level,pid,wait_event,path,count(*) as count, max(level) over(partition by ash_time,pid) as max_level
                FROM search_wait_chain where level > 0 group by ash_time,level,pid,wait_event,path
        ) as all_wait_chain
        where level=max_level
) as wait_chain
group by path,cnt
order by count(*) desc;

Let’s launch this query while only one session is being blocked by another one:

postgres@pgu:~$ psql -f pg_ash_wait_chain.sql
 % of total wait | seconds |                 wait_chain
-----------------+---------+--------------------------------------------
 100%            |      23 | pid:1890 (Lock : transactionid) ->pid:1888
(1 row)

It means that pid 1890 has been waiting for 23 seconds on the transactionid wait event while being blocked by pid 1888. This wait chain represents 100% of the blocking activity time.

Now another session comes into the game; let’s query the active session history view one more time:

postgres@pgu:~$ psql -f pg_ash_wait_chain.sql
 % of total wait | seconds |                                  wait_chain
-----------------+---------+------------------------------------------------------------------------------
 88%             |     208 | pid:1890 (Lock : transactionid) ->pid:1888
 12%             |      29 | pid:1913 (Lock : transactionid) ->pid:1890 (Lock : transactionid) ->pid:1888
(2 rows)

So we still see our first blocking chain, but it is no longer the only one (it now represents 88% of the blocking activity time).

We can see a new chain that represents 12% of the blocking activity time:

  • pid 1913 (waiting on transactionid) has been blocked for 29 seconds by pid 1890 (waiting on transactionid), which is itself blocked by pid 1888.

Let’s commit the transaction held by pid 1888 and launch the query again twice:

postgres@pgu:~$ psql -f pg_ash_wait_chain.sql
 % of total wait | seconds |                                  wait_chain
-----------------+---------+------------------------------------------------------------------------------
 57%             |     582 | pid:1890 (Lock : transactionid) ->pid:1888
 40%             |     403 | pid:1913 (Lock : transactionid) ->pid:1890 (Lock : transactionid) ->pid:1888
 3%              |      32 | pid:1913 (Lock : transactionid) ->pid:1890
(3 rows)

postgres@pgu:~$ psql -f pg_ash_wait_chain.sql
 % of total wait | seconds |                                  wait_chain
-----------------+---------+------------------------------------------------------------------------------
 57%             |     582 | pid:1890 (Lock : transactionid) ->pid:1888
 40%             |     403 | pid:1913 (Lock : transactionid) ->pid:1890 (Lock : transactionid) ->pid:1888
 3%              |      33 | pid:1913 (Lock : transactionid) ->pid:1890
(3 rows)

As you can see, the first two chains are still displayed (as the query does not filter on ash_time) but are not waiting anymore (seconds does not increase), while the last (new) one is still waiting (seconds increases).
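
If you only want to see chains that are currently building up, a minimal tweak is to add a time filter to the anchor member of the recursive CTE (a sketch, keeping only the last 5 minutes of samples):

          from pg_active_session_history
          WHERE blockers > 0
          and ash_time > now() - interval '5 minutes'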

Conclusion

We have seen how the blocking information part of the pg_active_session_history view could help to drill down in case of blocking activity.

PostgreSQL Active Session History extension testing with Docker

Introduction

You may have noticed that a beta version of the pgsentinel extension (providing active session history for PostgreSQL) is publicly available in this github repository.

We can rely on Docker to facilitate and automate the testing of the extension on a particular PostgreSQL version (which has to be >= 10).

Let’s share a Dockerfile so that we can easily build a Docker image for testing pgsentinel (on a PostgreSQL version of our choice).

Description

The Dockerfile used to provision a pgsentinel testing Docker image:

  • downloads the postgresql source code (version of your choice)
  • compiles and installs it
  • downloads the pgsentinel extension
  • compiles and installs it
  • compiles and installs the pg_stat_statements extension

Dockerfile github repository

The dockerfile is available in this github repository.

Usage

3 arguments can be used:

  • PG_VERSION (default: 10.5)
  • PG_ASH_SAMPLING (default: 1)
  • PG_ASH_MAX_ENTRIES (default: 10000)
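
Each argument can be overridden at build time with --build-arg; for instance (a sketch, the values are just examples):

docker build -t pgsentinel-testing -f Dockerfile-pgsentinel-testing \
--build-arg PG_VERSION=10.5 \
--build-arg PG_ASH_SAMPLING=1 \
--build-arg PG_ASH_MAX_ENTRIES=100000 .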

Example to build an image

Let’s build a pgsentinel testing Docker image for PostgreSQL version 11beta3:

[root@bdtdocker dockerfile]# docker build -t pgsentinel-testing -f Dockerfile-pgsentinel-testing --build-arg PG_VERSION=11beta3 --force-rm=true --no-cache=true .

Once done, let’s run a container:

[root@bdtdocker dockerfile]# docker run -d -p 5432:5432 --name pgsentinel pgsentinel-testing

and verify that the pg_active_session_history view is available:

[root@bdtdocker dockerfile]# docker exec -it pgsentinel psql -c "\d pg_active_session_history"
                   View "public.pg_active_session_history"
      Column      |           Type           | Collation | Nullable | Default
------------------+--------------------------+-----------+----------+---------
 ash_time         | timestamp with time zone |           |          |
 datid            | oid                      |           |          |
 datname          | text                     |           |          |
 pid              | integer                  |           |          |
 usesysid         | oid                      |           |          |
 usename          | text                     |           |          |
 application_name | text                     |           |          |
 client_addr      | text                     |           |          |
 client_hostname  | text                     |           |          |
 client_port      | integer                  |           |          |
 backend_start    | timestamp with time zone |           |          |
 xact_start       | timestamp with time zone |           |          |
 query_start      | timestamp with time zone |           |          |
 state_change     | timestamp with time zone |           |          |
 wait_event_type  | text                     |           |          |
 wait_event       | text                     |           |          |
 state            | text                     |           |          |
 backend_xid      | xid                      |           |          |
 backend_xmin     | xid                      |           |          |
 top_level_query  | text                     |           |          |
 query            | text                     |           |          |
 cmdtype          | text                     |           |          |
 queryid          | bigint                   |           |          |
 backend_type     | text                     |           |          |
 blockers         | integer                  |           |          |
 blockerpid       | integer                  |           |          |
 blocker_state    | text                     |           |          |

We can now test the extension’s behavior on the PostgreSQL version of our choice (11beta3 in this example).

PostgreSQL Active Session History extension is now publicly available

Publicly available

A quick one to let you know that the pgsentinel extension providing active session history sampling is now publicly available.

You can find more details on it in this previous blog post and also from some early beta testers.

Thank you Franck Pachot and Daniel Westermann for beta testing and sharing.

Where to find it?

The extension is available from the pgsentinel github repository. Please keep in mind that at the time of this writing the extension is still in beta, so it may contain some bugs: don’t hesitate to raise issues on github with your bug reports.

Where to follow pgsentinel stuff?

On the website, twitter or github. More stuff to come, stay tuned.

Please do contribute

If you’re lacking some functionality, then you’re welcome to make pull requests.

PostgreSQL Active Session History (ash): welcome to the pg_active_session_history view (part of the pgsentinel extension)

Why active session history?

What if you could record and query a history of the active sessions? Wouldn’t it be useful for performance tuning activities?

With active session history in place, you could have a look at the “near past” database activity. You could answer questions like:

  • What wait events type were taking most time?
  • What wait events were taking most time?
  • Which application name was taking most time?
  • What was a session doing?
  • What does a SQL statement wait for?
  • How many sessions were running on CPU?
  • Which database was taking most time?
  • Which backend type was taking most time?
  • Which wait event was a session waiting on?
  • And so on….

What does it look like?

Let’s have a look at the pg_active_session_history view (more details on how to create it later on):

postgres@pgu:~$ /usr/local/pgsql/bin/psql -c "\d pg_active_session_history"
                   View "public.pg_active_session_history"
      Column      |           Type           | Collation | Nullable | Default
------------------+--------------------------+-----------+----------+---------
 ash_time         | timestamp with time zone |           |          |
 datid            | oid                      |           |          |
 datname          | text                     |           |          |
 pid              | integer                  |           |          |
 usesysid         | oid                      |           |          |
 usename          | text                     |           |          |
 application_name | text                     |           |          |
 client_addr      | text                     |           |          |
 client_hostname  | text                     |           |          |
 client_port      | integer                  |           |          |
 backend_start    | timestamp with time zone |           |          |
 xact_start       | timestamp with time zone |           |          |
 query_start      | timestamp with time zone |           |          |
 state_change     | timestamp with time zone |           |          |
 wait_event_type  | text                     |           |          |
 wait_event       | text                     |           |          |
 state            | text                     |           |          |
 backend_xid      | xid                      |           |          |
 backend_xmin     | xid                      |           |          |
 top_level_query  | text                     |           |          |
 query            | text                     |           |          |
 queryid          | bigint                   |           |          |
 backend_type     | text                     |           |          |

You could see it as samplings of pg_stat_activity (one second interval) providing more information:

  • ash_time: the sampling time
  • top_level_query: the top level statement (in case PL/pgSQL is used)
  • query: the statement being executed (not normalised as in pg_stat_statements, meaning you see the actual values)
  • queryid: the queryid of the statement (the one coming from pg_stat_statements)

Thanks to the queryid field, you will be able to link the session activity with the SQL activity (addressing the point reported by Franck Pachot in this blog post: PGIO, PG_STAT_ACTIVITY and PG_STAT_STATEMENTS).

Installation

pgsentinel uses the pg_stat_statements extension (officially bundled with PostgreSQL) for tracking which queries get executed in your database.

So, add the following entries to your postgresql.conf:

shared_preload_libraries = 'pg_stat_statements,pgsentinel'

# Increase the max size of the query strings Postgres records
track_activity_query_size = 2048

# Track statements generated by stored procedures as well
pg_stat_statements.track = all

Restart the PostgreSQL daemon and create the extension:

postgres@pgu:~$ /usr/local/pgsql/bin/psql -c "create extension pgsentinel;"
CREATE EXTENSION

Now, the pg_active_session_history view has been created and the activity is being recorded in it.

Remarks

  • No objects are created in the PostgreSQL instance (except the view). The data being queried through the pg_active_session_history view are held fully in memory
  • You are able to choose the maximum number of records; once it is reached, the data rotate (oldest records are aged out). See the sketch below
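
A sketch of the corresponding postgresql.conf entries; the parameter names are an assumption here (check the pgsentinel repository for the authoritative ones):

# assumed GUC names, to be verified against the pgsentinel documentation
pgsentinel_ash.sampling_period = 1   # sampling interval in seconds
pgsentinel_ash.max_entries = 10000   # ring buffer size for pg_active_session_history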

Next Steps

  • The library and the source code will be available soon on github in the pgsentinel github repository (under the GNU Affero General Public License v3.0): please do contribute!
  • Once the source code is published, more information on how to create the pgsentinel library will be available
  • Once the source code is published, more information on how to define the maximum of records will be available
  • The extension is named “pgsentinel” because more cool stuff will be added in it in the near future

Conclusion

At pgsentinel we really love performance tuning activities. We will do our best to add valuable performance tuning stuff to the PostgreSQL community: stay tuned!

Update 07/14/2018:

The extension is now publicly available.