From f2bb24970ea31109bd317b274e2a520ea6233bb2 Mon Sep 17 00:00:00 2001
From: Dyson Simmons
Date: Wed, 1 Apr 2020 13:20:58 +0100
Subject: [PATCH] Fix wait for PostgreSQL in draupnir-finalise-image (#80)

The draupnir-finalise-image script waits for PostgreSQL to start
accepting connections before issuing commands against it. It starts
postgres via pg_ctl with the wait (-w) flag. This waits for PostgreSQL
to accept connections, but by default it only waits 60 seconds. If
startup takes longer than 60 seconds, pg_ctl exits nonzero and the
script exits.

Following the pg_ctl start command, the script then looped for up to
10 minutes, reading the PostgreSQL log for the message that it was
ready for connections. This loop could never function as intended:
the wait flag on pg_ctl either ensures that PostgreSQL is accepting
connections, or pg_ctl exits nonzero and therefore the script exits
as well.

Removing the loop and raising the pg_ctl wait timeout from the default
60 seconds to 10 minutes provides the intended behaviour.
---
 CHANGELOG.md                |  6 ++++--
 DRAUPNIR_VERSION            |  2 +-
 cmd/draupnir-finalise-image | 28 +++++++++++-----------------
 3 files changed, 16 insertions(+), 20 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index f7ad450e..81ed7f4f 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,8 +1,10 @@
 Changelog
 =========
 
-Unreleased
-----------
+5.0.1
+-----
+- Use pg_ctl wait with a timeout in draupnir-finalise-image script to wait until
+  PostgreSQL is ready to accept connections.
 
 5.0.0
 -----
diff --git a/DRAUPNIR_VERSION b/DRAUPNIR_VERSION
index 0062ac97..6b244dcd 100644
--- a/DRAUPNIR_VERSION
+++ b/DRAUPNIR_VERSION
@@ -1 +1 @@
-5.0.0
+5.0.1
diff --git a/cmd/draupnir-finalise-image b/cmd/draupnir-finalise-image
index 8517b1be..4f35f52f 100755
--- a/cmd/draupnir-finalise-image
+++ b/cmd/draupnir-finalise-image
@@ -100,23 +100,17 @@ EOF
 LOG_FILE="/var/log/postgresql/image_${ID}"
 
 # Start postgres
-sudo -u postgres $PG_CTL -w -D "$UPLOAD_PATH" -o "-p $PORT" -l "${LOG_FILE}" start
-
-# We need to wait for postgres to boot and announce that the recovery has
-# completed. Ideally WAL recovery shouldn't take long, but for high volume
-# databases Postgres needs a window to catch-up from the last checkpoint.
-TIMEOUT=600 # 10m
-sudo -u postgres touch "${LOG_FILE}" # otherwise we'll fail grep'ing the file
-until grep "database system is ready to accept connections" "${LOG_FILE}"
-do
-  if [ $(( TIMEOUT-- )) -eq 0 ];
-  then
-    cat "${LOG_FILE}" >&2
-    echo "Postgres recovery failed, timed out waiting for recovery" >&2
-    exit 255
-  fi
-  sleep 1
-done
+
+# We need to wait (-w) for postgres to boot and accept
+# connections before continuing. Ideally WAL recovery shouldn't take long, but
+# for high-volume databases Postgres needs a window to catch up from the last
+# checkpoint.
+
+# If startup doesn't complete within the timeout (-t <seconds>) then pg_ctl
+# exits with a nonzero exit status. Note that the startup will continue in the
+# background and may eventually succeed - all the nonzero exit has done here is
+# notify that it didn't happen within the timeout.
+sudo -u postgres $PG_CTL -w -t 600 -D "$UPLOAD_PATH" -o "-p $PORT" -l "${LOG_FILE}" start
 
 # Create a user to perform admin operations with
 sudo -u postgres createuser --port="$PORT" --createdb --createrole --superuser draupnir-admin
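The removed loop printed the server log before exiting on timeout. If that diagnostic is still wanted, the pg_ctl exit status can be checked directly instead. A minimal sketch, assuming bash; `start_postgres` is a hypothetical helper name, and the sketch leaves out the `sudo -u postgres` wrapper the real script uses:

```shell
#!/usr/bin/env bash
# Sketch only: start_postgres is a hypothetical helper, not part of Draupnir.
set -euo pipefail

# Start PostgreSQL with pg_ctl's wait (-w) and timeout (-t, in seconds) flags.
# pg_ctl exits nonzero if the server is not accepting connections within the
# timeout; in that case, print the server log to stderr and return nonzero.
start_postgres() {
  local pg_ctl=$1 data_dir=$2 port=$3 log_file=$4 timeout=${5:-600}

  if ! "$pg_ctl" -w -t "$timeout" -D "$data_dir" -o "-p $port" -l "$log_file" start; then
    # Startup may still complete in the background; this only reports that
    # it did not finish within the timeout.
    cat "$log_file" >&2
    echo "PostgreSQL was not accepting connections after ${timeout}s" >&2
    return 1
  fi
}
```

In the patched script the equivalent call would be `start_postgres "$PG_CTL" "$UPLOAD_PATH" "$PORT" "${LOG_FILE}" 600`, run as the postgres user. Under `set -e`, a nonzero return still stops the script, matching the behaviour described in the commit message.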