neon endpoint list shows incorrect LSN for hot standby #5825

Closed
alexanderlaw opened this issue Nov 8, 2023 · 6 comments · Fixed by #10931
Labels
c/compute (Component: compute, excluding postgres itself)
m/good_first_issue (Moment: when doing your first Neon contributions)
t/bug (Issue Type: Bug)

Comments

@alexanderlaw
Contributor

Steps to reproduce

cargo neon init --force
cargo neon start
ten="10000000000000000000000000000001"
tim="20000000000000000000000000000002"
cargo neon tenant create --tenant-id=$ten --timeline-id=$tim --pg-version 16 --set-default
cargo neon endpoint create main --pg-version 16 --pg-port 15432
echo "max_connections = 1000" >> .neon/endpoints/main/postgresql.conf # to hinder recovery 
cargo neon endpoint start main
cargo neon endpoint start main-rep --pg-version 16 --pg-port 25432 --hot-standby true
cargo neon endpoint list
psql -p 15432 -c "SELECT 1 i INTO TABLE test"
curl -s -X PUT "http://127.0.0.1:9898/v1/tenant/$ten/timeline/$tim/checkpoint"; echo ""; 
psql -p 15432 -c "SELECT pg_current_wal_flush_lsn()"

sleep 5
cargo neon endpoint list
psql -p 25432 -c "SELECT pg_last_wal_replay_lsn()"
psql -p 25432 -c "SELECT * FROM test"

Actual result

 ENDPOINT  ADDRESS          TIMELINE                          BRANCH NAME  LSN        STATUS  
 main      127.0.0.1:15432  20000000000000000000000000000002  main         0/14F5BB0  running 
 main-rep  127.0.0.1:25432  20000000000000000000000000000002  main         0/14F5BB0  running 
SELECT 1
null
 pg_current_wal_flush_lsn 
--------------------------
 0/1515640
(1 row)

    Finished dev [optimized + debuginfo] target(s) in 0.33s
     Running `target/debug/neon_local endpoint list`
 ENDPOINT  ADDRESS          TIMELINE                          BRANCH NAME  LSN        STATUS  
 main      127.0.0.1:15432  20000000000000000000000000000002  main         0/15196E0  running 
 main-rep  127.0.0.1:25432  20000000000000000000000000000002  main         0/15196E0  running 
 pg_last_wal_replay_lsn 
------------------------
 0/14F5BB0
(1 row)

ERROR:  relation "test" does not exist
LINE 1: SELECT * FROM test

Expected result

The LSN shown for a hot standby reflects the standby's actual replay position.

Logs, links

.neon/endpoints/main-rep/compute.log contains:

...
2023-11-08 12:14:04.391 GMT [1558062] LOG:  started streaming WAL from primary at 0/1000000 on timeline 1
2023-11-08 12:14:04.415 GMT [1558061] LOG:  redo starts at 0/14F5BB0
2023-11-08 12:14:04.415 GMT [1558061] WARNING:  hot standby is not possible because of insufficient parameter settings
2023-11-08 12:14:04.415 GMT [1558061] DETAIL:  max_connections = 100 is a lower setting than on the primary server, where its value was 1000.
2023-11-08 12:14:04.415 GMT [1558061] CONTEXT:  WAL redo at 0/14F5BB0 for XLOG/PARAMETER_CHANGE: max_connections=1000 max_worker_processes=8 max_wal_senders=10 max_prepared_xacts=0 max_locks_per_xact=64 wal_level=logical wal_log_hints=off track_commit_timestamp=off
2023-11-08 12:14:04.415 GMT [1558061] LOG:  recovery has paused
...

That is, recovery was paused (as expected), but neon endpoint list shows a newer LSN for main-rep, as if recovery were proceeding normally.
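
To put a number on the mismatch, here is a minimal sketch (not part of neon_local) that converts PostgreSQL's textual `hi/lo` hexadecimal LSN format into a byte position and subtracts the two values from the output above. The helper names `lsn_to_int` and `lsn_diff` are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of neon_local or PostgreSQL.

# Convert an LSN in textual "hi/lo" hex form (e.g. 0/14F5BB0) into a single
# 64-bit byte position: (hi << 32) + lo.
lsn_to_int() (
    hi=${1%%/*}   # part before the slash
    lo=${1##*/}   # part after the slash
    echo $(( (0x$hi << 32) + 0x$lo ))
)

# Print how many bytes the first LSN is ahead of the second.
lsn_diff() (
    echo $(( $(lsn_to_int "$1") - $(lsn_to_int "$2") ))
)

listed="0/15196E0"    # LSN column shown for main-rep by `endpoint list`
replayed="0/14F5BB0"  # pg_last_wal_replay_lsn() reported by main-rep

echo "endpoint list is $(lsn_diff "$listed" "$replayed") bytes ahead of replay"
```

With the values above, the listed LSN is 146224 bytes past the point the standby has actually replayed. On a live endpoint the same subtraction can be done in SQL with `pg_wal_lsn_diff()`.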

@alexanderlaw alexanderlaw added the t/bug Issue Type: Bug label Nov 8, 2023
@jcsp jcsp added c/compute Component: compute, excluding postgres itself and removed c/cloud/compute labels Jun 19, 2024
@ololobus ololobus added the m/good_first_issue Moment: when doing your first Neon contributions label Jan 7, 2025
@myrrc myrrc self-assigned this Jan 7, 2025
@ololobus
Member

@alexanderlaw is this still a problem?

@alexanderlaw
Contributor Author

Yes, I've reproduced it with the updated script:

cargo neon init --force remove-all-contents
cargo neon start
ten="10000000000000000000000000000001"
tim="20000000000000000000000000000002"
cargo neon tenant create --tenant-id=$ten --timeline-id=$tim --pg-version 16 --set-default
cargo neon endpoint create main --pg-version 16 --pg-port 15432

cargo neon endpoint start main
createdb test -p 15432

cargo neon endpoint create main-rep --pg-version 16 --pg-port 25432 --hot-standby true
echo "recovery_min_apply_delay = 30000" >> .neon/endpoints/main-rep/postgresql.conf # to hinder recovery
cargo neon endpoint start main-rep || true
cargo neon endpoint list

psql -p 15432 -c "SELECT 1 i INTO TABLE test"
curl -s -X PUT "http://127.0.0.1:9898/v1/tenant/$ten/timeline/$tim/checkpoint"; echo "";
psql -p 15432 -c "SELECT pg_current_wal_flush_lsn()"


sleep 5
cargo neon endpoint list
psql -p 25432 -c "SELECT pg_last_wal_replay_lsn()"
psql -p 25432 -c "SELECT * FROM test"
 ENDPOINT  ADDRESS          TIMELINE                          BRANCH NAME  LSN        STATUS  
 main      127.0.0.1:15432  20000000000000000000000000000002  main         0/203F9E8  running 
 main-rep  127.0.0.1:25432  20000000000000000000000000000002  main         0/203F9E8  running 
 pg_last_wal_replay_lsn 
------------------------
 0/203D800
(1 row)

ERROR:  relation "test" does not exist
LINE 1: SELECT * FROM test
                      ^

@ololobus
Member

Thanks!

@thesuhas
Contributor

Posting an updated script that worked for me to replicate this issue (needed appropriate role and db specified):

cargo neon init --force remove-all-contents
cargo neon start
ten="10000000000000000000000000000001"
tim="20000000000000000000000000000002"
cargo neon tenant create --tenant-id=$ten --timeline-id=$tim --pg-version 16 --set-default
cargo neon endpoint create main --pg-version 16 --pg-port 15432

cargo neon endpoint start main
createdb test -p 15432 -U cloud_admin

cargo neon endpoint create main-rep --pg-version 16 --pg-port 25432 --hot-standby true
echo "recovery_min_apply_delay = 300000" >> .neon/endpoints/main-rep/postgresql.conf # to hinder recovery
cargo neon endpoint start main-rep || true
cargo neon endpoint list

psql -p 15432 -c "SELECT 1 i INTO TABLE test" -U cloud_admin -d test
curl -s -X PUT "http://127.0.0.1:9898/v1/tenant/$ten/timeline/$tim/checkpoint"; echo "";
psql -p 15432 -c "SELECT pg_current_wal_flush_lsn()" -U cloud_admin -d test


sleep 5
cargo neon endpoint list
psql -p 25432 -c "SELECT pg_last_wal_replay_lsn()" -U cloud_admin -d test
psql -p 25432 -c "SELECT * FROM test" -U cloud_admin -d test

@skyzh
Member

skyzh commented Feb 20, 2025

The LSN column in the endpoint list command means different things for static computes and for primary/hot standby computes. For a static compute, it is the LSN the endpoint was started at (the endpoint time travels to that specific LSN). For primary and hot standby computes, it currently displays last_record_lsn from the pageserver. That LSN is not correlated with the compute's pg_last_wal_replay_lsn: it can lag behind if the pageserver has not finished ingesting the safekeeper data, or be newer than the compute LSN because the hot standby has not replayed that WAL yet.

I would suggest only setting the column when the compute is a static compute. Otherwise the LSN does not make sense.
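
The suggested rule can be sketched as a small shell function. This is illustrative only: the real logic lives in neon_local, and the function name `lsn_column`, the mode names, and the `-` placeholder are assumptions made for this sketch.

```shell
#!/bin/sh
# Illustrative only: not the neon_local implementation. The mode names and
# the "-" placeholder are assumptions made for this sketch.

# Print the value for the LSN column, given an endpoint mode and the LSN
# that would otherwise be shown.
lsn_column() {
    case "$1" in
        # A static endpoint is pinned to the LSN it was started at, so
        # that LSN is meaningful to display.
        static) echo "$2" ;;
        # For primary and hot standby endpoints the pageserver's
        # last_record_lsn says nothing about the compute, so show nothing.
        primary|hot-standby) echo "-" ;;
        *) echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

lsn_column static 0/14F5BB0
lsn_column hot-standby 0/15196E0
lsn_column primary 0/15196E0
```

Leaving the column empty for non-static endpoints avoids implying that the value tracks the replica's replay progress.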

@ololobus
Member

This week:

github-merge-queue bot pushed a commit that referenced this issue Feb 25, 2025
…0931)

## Problem

`neon endpoint list` shows a different LSN than what the state of the
replica is. This is mainly down to what we define as LSN in this output.
If we define it as the LSN that a compute was started with, it only
makes sense to show it for static computes.

## Summary of changes

Removed the output of `last_record_lsn` for primary/hot standby
computes.

Closes: #5825

---------

Co-authored-by: Tristan Partin <tristan@neon.tech>