Chasing Postgres Performance

2023-12-02 Kevin Olynyk

Part 1 in a Series on Chasing Postgres Performance

Over my years as a systems administrator, I’ve lost count of the number of databases that I’ve set up, and all of the idiosyncratic one-off scenarios that I’ve been forced to deal with.

It’s not often that I get to work with a brand new database and find out what works best right off the bat. I’m usually dealing with applications that have arcane, extremely specific requirements (read: legacy software), and so it’s expected that the database is set up in a certain way.

Enter Postgres

Microsoft SQL Server is what I deal with on a daily basis; however, most of the open source software I work with has dropped support for SQL Server entirely in favor of MySQL/MariaDB and Postgres.

I’ll leave installing Postgres as an exercise for the reader, as this is not a setup tutorial but a journey into the world of performance maximization.

With many forays into SQL Server behind me, I felt ready to tackle Postgres. I first checked the documentation to see what I could find out about the WAL (write-ahead log) and the data directories.

Initial Setup

I created two extra disks on my VM using the VirtIO driver. I then formatted the disks using XFS, which is what Postgres recommends.

I was able to format the partitions without issue, and I set the block sizes as one does when trying to eke out a bit more performance.

mkfs.xfs -b size=8k -L PGDATA /dev/vdb1
mkfs.xfs -b size=64k -L PGWAL /dev/vdc1

I then went to mount the drives and realized that I had made a critical mistake.

I was getting one of my favourite nondescript Linux error messages:

mount: Function not implemented

Frustrated, I assumed this was something to do with SELinux, as it usually is, but after a bit of digging I realized that the default Linux kernel that ships with most distros cannot mount XFS filesystems with block sizes larger than its 4 KiB page size. So I reformatted the partitions with 4 KiB blocks using the default options, which will have to suffice for now, and tried again.
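
Concretely, the do-over was just a matter of re-running mkfs and letting it pick its defaults (the -f flag forces mkfs.xfs to overwrite the existing filesystems; I kept the labels from before):

mkfs.xfs -f -L PGDATA /dev/vdb1   # 4K blocks are the XFS default, so no size needs to be specified
mkfs.xfs -f -L PGWAL /dev/vdc1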

The next roadblock I faced came when trying to mount the WAL partition, which led to a chicken-and-egg scenario: I needed to create the WAL directory /var/lib/pgsql/16/data/pg_wal as a mount point, but once this was done, initializing Postgres failed because the data directory /var/lib/pgsql/16/data was not empty.

This took an embarrassingly long time to figure out, because on the surface postgresql-16-setup initdb appeared to work fine and returned no error when run. It was only after examining the log at /var/lib/pgsql/16/initdb.log that I realized my mistake. After unmounting the pg_wal directory, I was finally able to get Postgres to initialize.
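
If you hit the same wall, the fix is short; depending on how you created the mount point, you may also need to remove the now-empty pg_wal directory so that initdb sees a truly empty data directory:

umount /var/lib/pgsql/16/data/pg_wal           # get the WAL partition out of the way
rmdir /var/lib/pgsql/16/data/pg_wal            # initdb insists on an empty data directory
/usr/pgsql-16/bin/postgresql-16-setup initdb   # re-run the initialization
tail /var/lib/pgsql/16/initdb.log              # the setup script won't complain on its own, but the log will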

I copied the contents of the pg_wal directory to a temporary location, mounted the WAL drive, and copied the WAL back to its new home. Finally, Postgres started and I could begin gathering some baseline results.
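
In command form, the shuffle went something like this; the temporary path is arbitrary, and I’m mounting by the PGWAL label set earlier:

cp -a /var/lib/pgsql/16/data/pg_wal /var/lib/pgsql/pg_wal.tmp    # stash the freshly created WAL
mount LABEL=PGWAL /var/lib/pgsql/16/data/pg_wal                  # mount the dedicated WAL partition over the directory
cp -a /var/lib/pgsql/pg_wal.tmp/. /var/lib/pgsql/16/data/pg_wal  # copy the WAL onto its new home
chown -R postgres:postgres /var/lib/pgsql/16/data/pg_wal         # ownership and permissions have to survive the move
chmod 700 /var/lib/pgsql/16/data/pg_wal
restorecon -R /var/lib/pgsql/16/data/pg_wal                      # keep SELinux from ruining the party
systemctl start postgresql-16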

Benchmarking

I’m probably not approaching my testing in the most scientific way possible; I’m just trying to get quick and dirty results. After initializing the test database at scale factor 50 using pgbench -s50 -i, I run a couple of tests using pgbench -c10 -j2 -t1000 and average the results.
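
For anyone playing along at home, the whole loop is just two commands; I’m assuming here that they’re run as the postgres user against the default postgres database:

sudo -u postgres pgbench -s50 -i postgres           # build the pgbench tables at scale factor 50
sudo -u postgres pgbench -c10 -j2 -t1000 postgres   # 10 clients, 2 worker threads, 1000 transactions each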

Here are some of my test results. Keep in mind that I’m not using the most modern hardware available: my Postgres instances are running on Proxmox 8 on a Dell R610 server backed by SATA SSDs, and both are running Rocky 9.2 with 2 vCPUs and 2 GiB of RAM:

Fresh Postgres Install w/Separate WAL

latency average = 7.623 ms
initial connection time = 21.266 ms
tps = 1311.869917 (without initial connection time)

‘Tuned’ Postgres 14 in Production

latency average = 47.313 ms
initial connection time = 7.313 ms
tps = 457.992021 (without initial connection time)

I’ll be the first to admit, this is not a fair comparison. But to its credit, I did the testing for this on a relatively quiet day when the databases were doing little more than idling.

Separating the WAL seems to have a significant impact on performance right out of the gate: we see nearly 3x the transactions per second just by moving the WAL to a dedicated device. (I’m sure the jump in version has some impact as well, but I’ll have to repeat the testing with Postgres 14 to be sure.) Let’s mess with some settings and see what we get.

Postgres 16 w/512MB Shared Memory Set and 9GB WAL

latency average = 7.070 ms
initial connection time = 32.068 ms
tps = 1414.403750 (without initial connection time)

As one might expect, throwing a bit more memory at the problem speeds up the database a little bit.
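
For anyone following along, the tweaks boil down to a couple of ALTER SYSTEM calls (shared_buffers for the shared memory, max_wal_size for the WAL ceiling; adjust the values to taste) followed by a restart, since shared_buffers can’t be changed on a running server:

sudo -u postgres psql -c "ALTER SYSTEM SET shared_buffers = '512MB';"
sudo -u postgres psql -c "ALTER SYSTEM SET max_wal_size = '9GB';"
systemctl restart postgresql-16   # shared_buffers only takes effect after a restart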

Same Settings w/LZ4 WAL Compression

latency average = 6.464 ms
initial connection time = 26.453 ms
tps = 1546.999069 (without initial connection time)
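
For the record, turning on the compression is a one-line change, and wal_compression only needs a configuration reload rather than a full restart:

sudo -u postgres psql -c "ALTER SYSTEM SET wal_compression = 'lz4';"
sudo -u postgres psql -c "SELECT pg_reload_conf();"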

Not too bad, especially when we remember that this is running on shared hardware that was essentially salvaged from a dumpster. A very quick test on an AWS m5.xlarge instance running Postgres 14 will yield about 15,000 transactions per second and cost $156.00 a month; that money will run my server for an entire year and allow me to spin up 6-7 of these instances on top of everything else that I am already running.

I’m going to keep tuning this configuration, so keep an eye out for part two of this series, where I’ll dig into even more optimizations.