PostgreSQL high availability using Pgpool-II
In a BI environment, it is a challenge to update data while still providing consistent reporting. This article describes a solution for a system based on a PostgreSQL infrastructure.
Introduction
The requirement was to update individual nodes of the cluster and still have consistent reporting available. After some research, Pgpool-II was identified as a good candidate to fulfill this requirement. Pgpool-II is an open-source proxy sitting between PostgreSQL client applications and PostgreSQL servers. It provides many features, among them load balancing, connection pooling, automated failover and a watchdog.
The goal was to verify the technical feasibility of using Pgpool-II for load balancing and to manage the availability of the cluster nodes. In the selected scenario (see Fig. 1), an ETL process is executed on an additional PostgreSQL server (host_1), which is not part of the Pgpool-II managed cluster. After the refresh job has finished successfully on host_1, the updated data is copied to the “reporting cluster”, consisting of three PostgreSQL instances (host_2, host_3, host_4). The end-user interfaces are configured to connect to Pgpool-II as a proxy (host_0).
This post describes the configuration of a PostgreSQL high availability cluster environment using Pgpool-II for load balancing, and further describes the update process of the “reporting cluster”. Ubuntu 18.04.3 LTS was installed on all hosts. All instances use PostgreSQL version 10.10 and are configured to listen on the default port 5432. The Pgpool-II package is not included in the Ubuntu 18.04 repositories. The source code can be downloaded from the Pgpool-II website and compiled as described in the installation documentation. The Pgpool-II version used was 4.1.0.
Configuration of Pgpool-II
All described steps are executed as root (sudo -s) on host_0.
After the installation of Pgpool-II, go to the folder /usr/local/etc/ and create the files pgpool.conf and pcp.conf as copies of the corresponding <file>.conf.sample files.
Additionally, create the directory /var/log/pgpool.
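As a minimal sketch, assuming the default source installation prefix /usr/local used throughout this post, these preparation steps look like this:
# preparation on host_0: copy the sample configuration files and create the log directory
cd /usr/local/etc/
cp pgpool.conf.sample pgpool.conf
cp pcp.conf.sample pcp.conf
mkdir -p /var/log/pgpool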
The modifications to the file /usr/local/etc/pgpool.conf are:
parameter | default | new value |
---|---|---|
listen_addresses | localhost | * |
port | 9999 | <every_free_port_you_like> |
sr_check_user | nobody | <db_user> |
sr_check_password | '' | <password_of_db_user> |
pid_file_name | '/var/run/pgpool/pgpool.pid' | '/tmp/pgpool.pid' |
logdir | '/tmp' | '/var/log/pgpool' |
load_balance_mode | 'off' | 'on' |
replication_mode | 'off' | 'on' |
At the end of the file, add a section for each PostgreSQL instance:
backend_hostname<?> = '<ip_address_or_DNS_name_of_PostgreSQL_node>'
backend_port<?> = 5432
backend_weight<?> = 1
backend_data_directory<?> = '/var/lib/postgresql/10/main'
backend_flag<?> = 'ALLOW_TO_FAILOVER'
The placeholder <?> represents an integer sequence starting from 0. The integer is equal to the cluster node id used by Pgpool-II.
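For the scenario described above, host_2, host_3 and host_4 become the cluster nodes 0, 1 and 2. As an example, the backend section for the first node looks as follows; the other nodes follow the same pattern with the suffixes 1 and 2:
# example backend definition for cluster node 0 (host_2); nodes 1 and 2 are defined analogously
backend_hostname0 = 'host_2'
backend_port0 = 5432
backend_weight0 = 1
backend_data_directory0 = '/var/lib/postgresql/10/main'
backend_flag0 = 'ALLOW_TO_FAILOVER'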
Executing the following command on the command line
pg_md5 -m -u <db_user> -f /usr/local/etc/pgpool.conf -p
prompts for the password of <db_user> and adds the following line into /usr/local/etc/pool_passwd:
<db_user>:md5<md5_hash_of_db_user_password>
Pgpool-II can now be started from the command line with
pgpool -n
and stopped:
pgpool -m fast stop
Configuration to execute pcp commands
To execute the pcp commands from the shell, the file /usr/local/etc/pcp.conf has to be modified too.
The following command
pg_md5 -p
prompts for the password of <db_user> and returns the MD5 hash.
Add the following line to /usr/local/etc/pcp.conf:
<db_user>:<md5_hash_of_password_of_db_user>
If the execution of pcp commands raises the error message
error while loading shared libraries: libpcp.so.1: cannot open shared object file: No such file or directory
create a new file /etc/ld.so.conf.d/pgpool-II.conf and add the line
/usr/local/lib
Then reload the dynamic linker cache with
ldconfig
Set up Pgpool-II as a daemon
Create the file /lib/systemd/system/pgpool2.service as follows:
[Unit]
Description=pgpool-II
Requires=pgpool2-helper.service
[Service]
ExecStart=/usr/local/bin/pgpool -n
ExecReload=/bin/kill -HUP $MAINPID
KillSignal=SIGINT
StandardOutput=syslog
SyslogFacility=local0
[Install]
WantedBy=multi-user.target
Create the file /lib/systemd/system/pgpool2-helper.service and insert:
[Unit]
Description=Helper service for pgpool-II to create /var/run/pgpool
After=network.target
Before=pgpool2.service
[Service]
ExecStart=/bin/mkdir -p /var/run/pgpool
ExecReload=/bin/kill -HUP $MAINPID
KillSignal=SIGINT
StandardOutput=syslog
SyslogFacility=local0
[Install]
WantedBy=multi-user.target
Afterwards, create the symbolic links:
ln -s /lib/systemd/system/pgpool2.service /etc/systemd/system/pgpool2.service
ln -s /lib/systemd/system/pgpool2-helper.service /etc/systemd/system/pgpool2-helper.service
and reload systemd
systemctl daemon-reload
To start the Pgpool-II daemon automatically during server startup, execute
systemctl enable pgpool2-helper.service
systemctl enable pgpool2.service
To start the Pgpool-II daemon manually, execute
systemctl start pgpool2-helper.service
systemctl start pgpool2.service
Now modify the file /usr/local/etc/pgpool.conf and restart Pgpool-II:
parameter | old value | new value |
---|---|---|
pid_file_name | '/tmp/pgpool.pid' | '/var/run/pgpool/pgpool.pid' |
socket_dir | '/tmp' | '/var/run/pgpool' |
pcp_socket_dir | '/tmp' | '/var/run/pgpool' |
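Assuming the systemd units created above are in place, the restart can be done with:
systemctl restart pgpool2.service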
Test load balancing
Before the load balancing test can be started, the following line has to be added to the file /etc/postgresql/10/main/pg_hba.conf on all PostgreSQL instances, and the instances have to be restarted:
host all <dbuser> <host_name_of_pgpool> md5
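As a hypothetical example for one node (the service name postgresql@10-main is an assumption based on Ubuntu's default PostgreSQL packaging):
# append the Pgpool-II host to pg_hba.conf and restart the instance
echo "host all <dbuser> <host_name_of_pgpool> md5" >> /etc/postgresql/10/main/pg_hba.conf
systemctl restart postgresql@10-main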
The tool pgbench can be used to test the configuration and the load balancing. First initialize the PostgreSQL databases by executing the following command on each PostgreSQL instance:
pgbench -i -h <PostgreSQL_host> -p <port> -U <db_user> <database>
Log into PostgreSQL via Pgpool-II and check the current status of the pool nodes:
psql -h <host_of_pgpool> -p <port_pgpool_is_listen> -U <db_user> -d <database>
<database>=# show pool_nodes;
Before starting the load balance test, the output is:
node_id | hostname | port | status | select_cnt |
---|---|---|---|---|
0 | host_2 | 5432 | up | 0 |
1 | host_3 | 5432 | up | 0 |
2 | host_4 | 5432 | up | 0 |
Keep in mind that the column node_id represents the value <?> of the variable backend_hostname<?> in the file /usr/local/etc/pgpool.conf (see above).
To test the load balancing, execute on a PostgreSQL node:
pgbench -h <host_of_pgpool> -p <port_pgpool_is_listen> -c 10 -j 10 -S -T60 -U <db_user> <database>
In this example, 10 clients (-c 10) with 10 worker threads (-j 10) execute select-only transactions (-S) for 60 seconds (-T60).
To check the load balancing, execute the show pool_nodes; command again. The output looks like this:
node_id | hostname | port | status | select_cnt |
---|---|---|---|---|
0 | host_2 | 5432 | up | 212090 |
1 | host_3 | 5432 | up | 126605 |
2 | host_4 | 5432 | up | 84033 |
To verify that database connections exist, you can also execute
ps aux | grep pgpool
The output looks like the following table:
USER | PID | %CPU | %MEM | VSZ | RSS | TTY | STAT | START | TIME | COMMAND |
---|---|---|---|---|---|---|---|---|---|---|
root | 6292 | 0.0 | 0.1 | 622672 | 4356 | ? | S | 10:13 | 0:00 | pgpool: <db_user> <database> host_3(51270) idle |
root | 6295 | 0.0 | 0.1 | 622672 | 4356 | ? | S | 10:13 | 0:00 | pgpool: <db_user> <database> host_2(51270) idle |
root | 6298 | 0.0 | 0.1 | 622672 | 4356 | ? | S | 10:13 | 0:00 | pgpool: <db_user> <database> host_4(51270) idle |
root | 6308 | 0.0 | 0.0 | 622672 | 4356 | ? | S | 10:13 | 0:00 | pgpool: wait for connection request |
root | 6309 | 0.0 | 0.0 | 622672 | 4356 | ? | S | 10:13 | 0:00 | pgpool: wait for connection request |
root | 6310 | 0.0 | 0.0 | 622672 | 4356 | ? | S | 10:13 | 0:00 | pgpool: wait for connection request |
root | 6311 | 0.0 | 0.0 | 622672 | 4356 | ? | S | 10:13 | 0:00 | pgpool: wait for connection request |
root | 6312 | 0.2 | 0.3 | 231676 | 5126 | ? | S | 10:13 | 0:06 | pgpool: <db_user> <database> host_4(51270) SELECT |
root | 6313 | 0.2 | 0.3 | 231676 | 5126 | ? | S | 10:13 | 0:05 | pgpool: <db_user> <database> host_4(51270) SELECT |
PIDs 6292, 6295 and 6298 are connected but currently idle sessions, PIDs 6308 to 6311 are Pgpool-II child processes waiting for new connection requests, and PIDs 6312 and 6313 are active sessions executing a SELECT.
Description of the refresh process
This section describes the implementation of the database node refresh process. To keep a long story short, not all steps are fully explained. This section should only give an impression of how this process was implemented.
The communication between the different servers is done via SSH. Therefore, SSH keys must be generated (ssh-keygen) and the public key must be copied (ssh-copy-id) to the PostgreSQL servers host_1, host_2, host_3 and host_4.
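A minimal sketch of this key distribution on host_0 (the OS user is a placeholder):
# generate a key pair on host_0 and copy the public key to all PostgreSQL servers
ssh-keygen -t rsa
for h in host_1 host_2 host_3 host_4; do
    ssh-copy-id "<os_user>@${h}"
done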
To simplify the communication, create the file ~/.ssh/config on host_0. For each PostgreSQL node, add the following entries:
host <alias_that_can_be_used_in_bash_scripts>
hostname <ip_address_or_DNS_name_of_the_PostgreSQL_instance>
user <name_of_OS_user_in_which_context_the_scripts_will_be_executed>
IdentityFile <path_to_the_id_rsa_file>
A valid entry could look like:
host node0
hostname host_2
user bob
IdentityFile ~/.ssh/id_rsa
The value of the variable host is set to node<node_id>, where <node_id> represents the Pgpool-II cluster node number (see the output of the command show pool_nodes;) and corresponds to the suffix of backend_hostname<?> in the file /usr/local/etc/pgpool.conf.
On host_0, the master Bash script manage_pgpool.sh is stored. The purpose of this script is to manage the refresh of the “reporting cluster”. In the first step, the Bash script export_PG.sh is started on host_1 in order to export the data using pg_dump. The folder containing the latest dump is written to the file .outfile, which is then read by the master script. To get the folder where the latest dump is stored, the script reads .outfile by executing the following command:
dumpfolder=$(ssh etl_server cat .outfile)
# content of .outfile on host_1 is stored in variable dumpfolder (dumpfolder_on_host_1)
#
# etl_server is the host variable in file ~/.ssh/config of host_0
# Example ~/.ssh/config:
# host etl_server
# hostname host_1
# user bob
# IdentityFile ~/.ssh/id_rsa
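For illustration, a hypothetical export_PG.sh on host_1 could look like the following sketch; the dump format, paths and file names are assumptions, not the original script:
# export_PG.sh (sketch): dump the database and record the dump folder in ~/.outfile
dumpfolder="${HOME}/dumps/$(date +%Y%m%d_%H%M%S)"
mkdir -p "${dumpfolder}"
pg_dump -Fc -U <db_user> -d <database> -f "${dumpfolder}/<database>.dump"
echo "${dumpfolder}" > "${HOME}/.outfile"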
In the next step, the script determines the number of configured PostgreSQL nodes by executing the Pgpool-II command:
pcp_node_count -h <host_of_pgpool> -U <db_user> -w
To execute the pcp commands successfully, the ~/.pcppass file on the Pgpool-II host has to be customized:
*:*:*:<password>
(hostname, port and user are not restricted and are therefore replaced by wildcards)
The command scp copies the dump from host_1 to the configured PostgreSQL instances (host_2, host_3, host_4). For the scp, the ~/.ssh/config described above is important because the host aliases defined there are used in the command:
scp -r "<host_alias_defined_in_file>:${dumpfolder_on_host_1}" "node${i}:${dumpfolder_on_host_X}"
Because the number of nodes in the Pgpool-II cluster is known from the result of the pcp_node_count command, the copy job can loop over all nodes. In order for this to work, the host entries in the file ~/.ssh/config must be named node${X}, i.e. node0, node1 and so on (see the sketch below).
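Putting these pieces together, the copy step of manage_pgpool.sh could be sketched as follows; the variable names are assumptions, and the dump is assumed to be copied to the same path on each node:
# sketch of the copy loop: distribute the latest dump to every configured cluster node
dumpfolder=$(ssh etl_server cat .outfile)
node_count=$(pcp_node_count -h <host_of_pgpool> -U <db_user> -w)
for (( i=0; i<node_count; i++ )); do
    scp -r "etl_server:${dumpfolder}" "node${i}:${dumpfolder}"
done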
After the copy to all nodes has finished, half of the nodes are refreshed in the next step. Each of these nodes is detached from the cluster and updated. The idea behind this procedure is to keep reporting consistent on the not-yet-updated data. After the first half of the nodes has been refreshed, the not-yet-refreshed nodes are deactivated and the refreshed nodes are attached again. Then the other half of the nodes is updated. The weak point of this process is that during the refresh not all nodes are available, which could lead to a performance bottleneck. To reduce the effect on end users, the update of the cluster should be done when the number of report executions is low, i.e. overnight or over the weekend.
First, the PostgreSQL node is detached from Pgpool-II by executing
pcp_detach_node -n <node_id> -h <host_of_pgpool> -U <db_user> -w
then the import_PG.sh script is started on the remote host via SSH.
Keep in mind that <node_id> is the node id shown by the command show pool_nodes; and corresponds to the configuration in the file /usr/local/etc/pgpool.conf.
To do the import using pg_restore, the file ~/.pgpass on the PostgreSQL servers should be configured as follows:
*:*:<database>:<db_user>:<password>
(hostname and port are not specified)
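A hypothetical restore call inside import_PG.sh could then look like this; the flags and the dump file name are assumptions, and thanks to the ~/.pgpass entry above no password prompt is needed:
# import_PG.sh (sketch): restore the copied dump into the local instance
dumpfolder="$1"
pg_restore -h localhost -U <db_user> -d <database> --clean --if-exists "${dumpfolder}/<database>.dump"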
These steps are repeated iteratively, i.e. the number of active nodes is reduced step by step. When the dumps have been imported into half of the cluster nodes, the detached nodes are attached again:
pcp_attach_node -n <node_id> -h <host_of_pgpool> -U <db_user> -w
After the refreshed nodes are enabled, all unrefreshed nodes are detached in one step and refreshed. After the refresh has finished, these nodes are attached again. The advantage of this process is its scalability: if an additional node is added to the cluster or a node is removed from it, no changes to the refresh process have to be implemented, because the number of configured nodes is determined by the script and all steps are performed iteratively.
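Under the assumptions above, the overall refresh logic of manage_pgpool.sh can be sketched roughly as follows (a simplified sketch, not the original script):
# sketch of the two-phase refresh in manage_pgpool.sh
node_count=$(pcp_node_count -h <host_of_pgpool> -U <db_user> -w)
half=$(( node_count / 2 ))

# phase 1: detach and refresh the first half, the remaining nodes keep serving consistent reports
for (( i=0; i<half; i++ )); do
    pcp_detach_node -n "${i}" -h <host_of_pgpool> -U <db_user> -w
    ssh "node${i}" "bash import_PG.sh '${dumpfolder}'"
done

# swap: re-attach the refreshed nodes, then detach the not-yet-refreshed ones
for (( i=0; i<half; i++ )); do
    pcp_attach_node -n "${i}" -h <host_of_pgpool> -U <db_user> -w
done
for (( i=half; i<node_count; i++ )); do
    pcp_detach_node -n "${i}" -h <host_of_pgpool> -U <db_user> -w
done

# phase 2: refresh the remaining nodes and re-attach them
for (( i=half; i<node_count; i++ )); do
    ssh "node${i}" "bash import_PG.sh '${dumpfolder}'"
    pcp_attach_node -n "${i}" -h <host_of_pgpool> -U <db_user> -w
done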
Summary
All in all, it can be said that we achieved our goal. We were able to set up a system that automatically transfers the changes of a backend database to a database cluster managed with Pgpool-II. By using the pcp commands of Pgpool-II, a scripted, automated process flow can be implemented with a variable number of nodes.
The title image was published by Janosch Diggelmann under the Unsplash License.