-
Notifications
You must be signed in to change notification settings - Fork 496
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PS/Prefetch: Use a timeout for reading data from TCP #10834
Conversation
7744 tests run: 7366 passed, 0 failed, 378 skipped (full report)Code coverage* (full report)
* collected from Rust tests only The comment gets automatically updated with the latest test results
e456cc9 at 2025-02-27T13:31:47.712Z :recycle: |
65b6124
to
cce5d5d
Compare
If this PR added a GUC in the Postgres fork or
If you're an external contributor, a Neon employee will assist in |
f94533c
to
f24e99e
Compare
6db9ede
to
93ab153
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM from the storage side
0caec61
to
b86b759
Compare
b86b759
to
7214004
Compare
This still makes me pretty squirmish. Are you sure you might not "re-enter" some parts of the prefetching code? The For example, can an interrupt happen while we're in a Does the guard protect some specific fields of MyPState? |
Took me a while to understand, but I got it now:
The above cannot happen, because prefetch_pump_state() would find that there are no in-flight requests in the ring. So that's OK, although feels a bit fragile. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see a concrete problem with this even though I'm a bit squirmish.
Let's add a GUC for this though. I'd love to have this in staging and pre-prod for a while, and perhaps do a slow roll out to production too, starting with the endpoints that have experienced problems with the socket buffers filling up.
I think a simple check in reconfigure_timeout_if_needed
to do nothing if the GUC is not set would do the trick. Or since we're adding a GUC, perhaps make it an integer GUC to replace the PS_BACKGROUND_DELAY constant. Then we can also experiment making it more or less aggressive.
This reduces pressure on OS TCP buffers, reducing TCP backpressure and therefore flush times in upstream systems like PageServer. The timeout triggers every 100 ms (compile-time constant) if there are any outstanding getpage requests.
That's better than a builtin constant, as it allows us to better control the behaviour of this feature.
7214004
to
db031d7
Compare
@hlinnaka Did you mean something like this latest? (i.e. commit 2 of 3) |
- Adjust comment on readahead_getpage_pull_timeout_ms - Use 0 for disabling the timeout, not both 0 and -1 - Limit the timeout to 5 minutes
This reduces pressure on OS TCP buffers, reducing flush times in other systems like PageServer.
Problem
Summary of changes