Managing PostgreSQL pg_wal Growth: Troubleshooting and Solutions

Recently, I’ve noticed that my TimescaleDB PostgreSQL clusters have experienced significant growth in their pg_wal volumes, sometimes expanding to over 20GB or even filling the data disk entirely. When the data volume runs out of space, the database becomes unavailable. My goal is to spend less time fixing and worrying about these databases.

The first step is identifying what causes the pg_wal folder to exceed its expected size (about 2GB). Potential causes include replication lag, incorrect WAL configuration, stalled replication slots, and failed WAL archiving.
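
Before digging into specific causes, it helps to confirm how large pg_wal actually is and how many files it holds. A quick check from psql (pg_ls_waldir() ships with PostgreSQL 10 and later; running du -sh against the pg_wal directory works just as well):

-- Total size and file count of the WAL directory, as seen by the server
SELECT pg_size_pretty(sum(size)) AS pg_wal_size,
       count(*)                  AS wal_files
FROM pg_ls_waldir();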

Troubleshooting steps

High replication lag can cause WAL size to inflate. To check for this, we can examine the pg_stat_replication view. The query below provides statistics for each WAL sender process streaming to a replica:

SELECT * FROM pg_stat_replication;

I will focus on write_lag, flush_lag, and replay_lag. If these values are high, it could indicate an issue with replicas streaming data from the primary node.

Example output:

postgres=# SELECT * FROM pg_stat_replication;
  pid   | usesysid |  usename   | application_name | client_addr |               client_hostname                | client_port |         backend_start         | backend_xmin |   state   |  sent_lsn   |  write_lsn  |  flush_lsn  | replay_lsn  |    write_lag    |    flush_lag    |   replay_lag    | sync_priority | sync_state |          reply_time
--------+----------+------------+------------------+-------------+----------------------------------------------+-------------+-------------------------------+--------------+-----------+-------------+-------------+-------------+-------------+-----------------+-----------------+-----------------+---------------+------------+-------------------------------
 434726 |    16384 | replicator | timescale-2      | 10.0.0.111  | timescale-2.c.gumband-v2-production.internal |       29188 | 2024-10-22 13:20:51.299949+00 |              | streaming | 1D/6B679988 | 1D/6B679988 | 1D/6B679988 | 1D/6B679988 | 00:00:00.000575 | 00:00:00.001661 | 00:00:00.001782 |             0 | async      | 2024-11-13 16:34:01.159582+00
 434744 |    16384 | replicator | timescale-3      | 10.0.0.112  | timescale-3.c.gumband-v2-production.internal |       45948 | 2024-10-22 13:20:53.585643+00 |              | streaming | 1D/6B679988 | 1D/6B679988 | 1D/6B679988 | 1D/6B679988 | 00:00:00.000618 | 00:00:00.001797 | 00:00:00.001901 |             0 | async      | 2024-11-13 16:34:00.357485+00

In my case, the write_lag, flush_lag, and replay_lag are all very low, indicating that the replicas are keeping up with the primary server’s WAL generation. This appears to be OK.
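
For good measure, the same view can also express lag as bytes of WAL rather than time, which is sometimes easier to reason about (pg_wal_lsn_diff() and pg_current_wal_lsn() are standard functions on PostgreSQL 10 and later):

-- How far behind each replica is, measured in bytes of WAL
SELECT application_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag_bytes
FROM pg_stat_replication;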

WAL Configuration

My /etc/patroni/patroni.yml configuration is set using an Ansible script, so I don’t expect any issues here. However, it’s good to double-check that my values are set accordingly.

Here are my main wal settings:

min_wal_size: "512MB" 
max_wal_size: "1GB" 
wal_level: "replica" 
wal_keep_size: "2GB" 
max_wal_senders: "10"
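
A quick way to double-check that the running server actually picked up these values is to query pg_settings:

-- Compare the live values against what Patroni is supposed to set
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('min_wal_size', 'max_wal_size', 'wal_level', 'wal_keep_size', 'max_wal_senders');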

These settings do not explain why the primary node’s pg_wal folder is 18GB. With max_wal_size at 1GB and wal_keep_size at 2GB, the directory should normally stay in the low single-digit gigabytes, so something else must be retaining WAL. Let’s keep going.

Check Replication Slot Activity

We may be able to glean some information from replication slot activity. An inactive slot pins WAL at its restart_lsn, so a forgotten or orphaned slot is a common reason for a pg_wal directory that keeps growing. Let’s list all replication slots and verify their activity status.

SELECT slot_name, active, restart_lsn FROM pg_replication_slots;
postgres=# SELECT slot_name, active, restart_lsn FROM pg_replication_slots;
  slot_name  | active | restart_lsn
-------------+--------+-------------
 timescale_2 | t      | 1D/6B8FA368
 timescale_3 | t      | 1D/6B8FA368

Both replication slots (timescale_2 and timescale_3) are active, which is good because it means they are being used by the cluster’s replicas.
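
Even an active slot will pin WAL at its restart_lsn, so it’s also worth checking how much WAL each slot is actually holding back:

-- WAL retained on disk for each replication slot
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;

If either slot were holding back tens of gigabytes, that would be the culprit.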

Checking PostgreSQL logs for hints

Ahh, PostgreSQL logs. I should have checked here first! Looking at the logs, the issue is pretty obvious:

ERROR: [041]: unable to get info for path/file '/tmp/pgbackrest/main.stop': [13] Permission denied
2024-11-13 16:46:40 UTC [434699-211272] LOG: archive command failed with exit code 41
2024-11-13 16:46:40 UTC [434699-211273] DETAIL: The failed archive command was: pgbackrest --stanza=main archive-push pg_wal/000000030000001900000003
2024-11-13 16:46:40 UTC [434699-211274] WARNING: archiving write-ahead log file "000000030000001900000003" failed too many times, will try again later
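
The same failure is also visible from inside the database: pg_stat_archiver keeps a running count of failed archive attempts and the WAL file that is stuck:

-- Archiver health: a climbing failed_count means WAL will pile up in pg_wal
SELECT archived_count, last_archived_wal, failed_count, last_failed_wal, last_failed_time
FROM pg_stat_archiver;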

I use pgbackrest to back up my clusters: a full backup runs once a week and a differential backup runs every night. According to the logs, postgres is unable to write to the /tmp/pgbackrest folder. pgbackrest places its lock files there when performing backups and archiving.

Long story short, pgbackrest’s default lock-path is /tmp/pgbackrest. When a script running as root triggers pgbackrest, that folder gets created owned by root, which makes it inaccessible to the postgres user.
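
A quick look at the directory on the node confirms it:

ls -ld /tmp/pgbackrest
# if the owner shows up as root:root rather than postgres:postgres, archive-push will keep failing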

Let’s Fix it!

Option 1

First, let’s set the owner of the problem directory to postgres:postgres.

sudo chown postgres:postgres /tmp/pgbackrest

Now I can run a full backup with pgbackrest --stanza=main backup --type=full.

Note: The backup may fail to start a couple of times while postgres works through and recycles the backlog of unneeded WAL segments. At least, this is what happened in my case.
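
Once archiving succeeds again, PostgreSQL can archive and recycle the old segments, and re-running the earlier size check should show pg_wal shrinking back toward its normal footprint:

SELECT pg_size_pretty(sum(size)) AS pg_wal_size FROM pg_ls_waldir();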

Making it Repeatable

To ensure that this tmp directory is always created with postgres as the owner, we can use a tool called systemd-tmpfiles. At boot, it will make sure the specified folder is present with the correct ownership and permissions. Our conf file, /etc/tmpfiles.d/postgres.conf, would look like this:

d /tmp/pgbackrest 750 postgres postgres -

We can test that this works by deleting the folder, then running:

sudo systemd-tmpfiles --create /etc/tmpfiles.d/postgres.conf

After a reboot, the /tmp/pgbackrest folder will be recreated with the correct ownership and permissions.

Option 2

There is a more explicit and declarative way of doing this: setting lock-path in pgbackrest.conf. Adding lock-path=/path/to/dir to the pgbackrest.conf file points the lock files at a directory outside /tmp, so the folder survives reboots and we only have to create it and set permissions once.

I deployed pgbackrest with a custom ansible playbook, so I can update the config in one place and propagate it to all of my nodes.

First, we will update the pgbackrest.conf file:

[global]
repo1-type=gcs
repo1-gcs-bucket={{ gcs_bucket_name }}
repo1-gcs-endpoint=storage.googleapis.com
repo1-gcs-key-type=service
repo1-gcs-key=/path/to/key
repo1-retention-full=1
# Here is where we can add the lock-path
lock-path=/home/postgres/pgbackrest-backup

[main]
pg1-path=/mnt/postgresql/15/main
pg1-user=postgres

Second, I will adjust my deploy_pgbackrest.yaml to make sure the /home/postgres/pgbackrest-backup folder is present on the system with the correct permissions.

    - name: Create pgbackrest-backup directory
      file:
        path: /home/postgres/pgbackrest-backup
        state: directory
        owner: postgres
        group: postgres
        mode: "750"

This step should take care of it.
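
For completeness, the pgbackrest.conf itself can be rendered from a template in the same playbook. A minimal sketch of such a task (the src and dest paths here are illustrative, adjust them to your layout):

    - name: Deploy pgbackrest configuration
      template:
        src: pgbackrest.conf.j2
        dest: /etc/pgbackrest/pgbackrest.conf
        owner: postgres
        group: postgres
        mode: "0640"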

I can deploy the updated config with ansible-playbook deploy_pgbackrest.yaml --ask-vault-pass.

Now we don’t have to worry about the lock-path directory being deleted from the /tmp directory when the system is rebooted.
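
As a final sanity check, pgbackrest’s check command verifies that archiving and the repository are working end to end with the new configuration:

pgbackrest --stanza=main check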

This approach is preferable because the steps are well-defined in my Ansible script, and the pgbackrest.conf file itself documents both where the backups go and where pgbackrest keeps its lock files, which makes troubleshooting cleaner and more straightforward.

Takeaways

This experience was a good reminder of the importance of monitoring and maintaining our database environments. Fortunately, this issue only affected my non-production PostgreSQL clusters, allowing me to address it without impacting production. In production, I have alerting mechanisms to notify me if disks exceed 80% capacity, nodes become unavailable, or backups fail to complete.

This was a fun way to spend an afternoon, and it cleared up a recurring frustration in our testing environments. Moving forward, I may consider implementing alerts for our non-production clusters, although I’m cautious about potential alert fatigue obscuring critical notifications.