Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: only showing LSN for static computes in neon endpoint list #10931

Merged
merged 14 commits into from
Feb 25, 2025

Conversation

thesuhas
Copy link
Contributor

@thesuhas thesuhas commented Feb 21, 2025

Problem

neon endpoint list shows a different LSN than what the state of the replica is. This is mainly down to what we define as LSN in this output. If we define it as the LSN that a compute was started with, it only makes sense to show it for static computes.

Summary of changes

Removed the output of last_record_lsn for primary/hot standby computes.

Closes: #5825

@thesuhas thesuhas linked an issue Feb 21, 2025 that may be closed by this pull request
@thesuhas thesuhas requested a review from a team as a code owner February 21, 2025 15:49
@thesuhas thesuhas requested review from hlinnaka and MMeent February 21, 2025 15:49
Copy link

If this PR added a GUC in the Postgres fork or neon extension,
please regenerate the Postgres settings in the cloud repo:

make NEON_WORKDIR=path/to/neon/checkout \
  -C goapp/internal/shareddomain/postgres generate

If you're an external contributor, a Neon employee will assist in
making sure this step is done.

@thesuhas thesuhas force-pushed the thesuhas/incorrect_lsn_neon_endpoint_list branch from 7671090 to 42db511 Compare February 21, 2025 15:51
@MMeent
Copy link
Contributor

MMeent commented Feb 21, 2025

If we define it as the LSN that a compute was started with, it only makes sense to show it for static computes.

That doesn't make sense at all. We can know what LSN a read-only compute started with, why hide it?

@thesuhas
Copy link
Contributor Author

thesuhas commented Feb 21, 2025

If we define it as the LSN that a compute was started with, it only makes sense to show it for static computes.

That doesn't make sense at all. We can know what LSN a read-only compute started with, why hide it?

From my understanding, the LSN currently being shown is the last_record_lsn which specifies the last LSN persisted to the disk.
When it comes to read-only computes that are not static, how do you get the LSN that it starts from? This only contains info that a compute was started at if it is a read-only static compute. There are no LSNs in the TimelineInfo that specify the LSN that a compute was started at (if it is not static).

@MMeent
Copy link
Contributor

MMeent commented Feb 21, 2025

Not in TLI, no, but the logs of the compute will contain the LSN of the basebackup (which is as good as the last_record_lsn)

But why change this? I don't get the need to hide this info?

neon endpoint list shows a different LSN than what the state of the replica is.

I'd say that's expected; data will change while you're still receiving it. We can't instantly teleport data into the terminal, but I think this is the next best thing.

@thesuhas
Copy link
Contributor Author

Not in TLI, no, but the logs of the compute will contain the LSN of the basebackup (which is as good as the last_record_lsn)

But why change this? I don't get the need to hide this info?

neon endpoint list shows a different LSN than what the state of the replica is.

I'd say that's expected; data will change while you're still receiving it. We can't instantly teleport data into the terminal, but I think this is the next best thing.

I think it stemmed from the fact that what the LSN table means differs based on the type of compute, as discussed here.
I'm open to either, based on whatever is determined to be most appropriate.

@hlinnaka
Copy link
Contributor

I agree that displaying pageserver's last_record_lsn for a replica makes little sense. It's very misleading if the replica is lagging behind.

If we can easily display the last replay LSN (i.e. select pg_last_wal_replay_lsn ()), that would be best. For consistency with that, though, it'd then be nice to show last written LSN for the primary, rather than the pageserver's last_record_lsn.

If that's non-trivial, then I agree it's better to hide the LSN field than display a misleading value like we do today.

@skyzh
Copy link
Member

skyzh commented Feb 21, 2025

If we can easily display the last replay LSN

Then we need to use Postgres client to do that. I don't think it's trivial. And given the neon_local tool is supposed to replicate the real cplane, and cplane doesn't have/use this functionality, I'd rather just hide the field.

@skyzh
Copy link
Member

skyzh commented Feb 21, 2025

And we can also have inactive primary/hotstandbys in the list, which we cannot connect with psql...

Copy link

github-actions bot commented Feb 21, 2025

7755 tests run: 7375 passed, 0 failed, 380 skipped (full report)


Flaky tests (4)

Postgres 17

Postgres 15

Code coverage* (full report)

  • functions: 32.8% (8638 of 26360 functions)
  • lines: 48.6% (72947 of 150077 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
5e90b06 at 2025-02-25T19:33:05.457Z :recycle:

Copy link
Member

@ololobus ololobus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, unless there are strong objections, but looks like Chi and Heikki +1'ed as well

Showing it for primary and replica is more confusing than useful. And making to to follow the actual compute LSN is tricky

@thesuhas thesuhas added this pull request to the merge queue Feb 25, 2025
Merged via the queue into main with commit d056062 Feb 25, 2025
88 checks passed
@thesuhas thesuhas deleted the thesuhas/incorrect_lsn_neon_endpoint_list branch February 25, 2025 19:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

neon endpoint list shows incorrect LSN for hot standby
6 participants