LFC prewarming with only light locking #10312
base: main
Conversation
7396 tests run: 7016 passed, 0 failed, 380 skipped (full report). Flaky tests (5): Postgres 17, Postgres 14.
Code coverage* (full report)
* collected from Rust tests only. The comment gets automatically updated with the latest test results.
538c06f at 2025-01-21T18:02:52.188Z
 * - Writers to the chunk will wait for the prewarm worker to finish
 *   writing in the chunk before starting their own writes into
 *   as-of-yet unwritten blocks in the chunk.
 * - Readers won't be blocked (but can decide to wait for prewarm
There seems to be a contradiction with 3) ("readers are never blocked"). I understand what you mean, but still the difference between "reader is blocked" and "reader has to wait" is obscure. Also, "can decide to wait" sounds unclear: it makes it look like there is some AI in the reader which decides whether to wait or not :)
I understand what you mean, but still the difference between "reader is blocked" and "reader has to wait" is obscure
The reader has the option to wait: if the page isn't present, it can behave as if the LFC doesn't contain the page, but it can also wait for the prewarming to be done and see whether the prewarm worker put the page into the LFC. Either option is correct from the LFC's perspective.
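To make the two reader strategies concrete, here is a minimal sketch (in Python for brevity; names such as `LfcChunk` and `read_block` are illustrative, not the actual neon LFC code) of a reader that either treats an in-progress prewarm as a cache miss or opts in to waiting:

```python
import threading

class LfcChunk:
    """Toy model of one LFC chunk that is being prewarmed (hypothetical)."""

    def __init__(self):
        self.lock = threading.Lock()
        self.prewarm_done = threading.Condition(self.lock)
        self.state = "PREWARM_STARTED"
        self.blocks = {}  # blkno -> page bytes

    def finish_prewarm(self, pages):
        with self.lock:
            self.blocks.update(pages)
            self.state = "IDLE"
            self.prewarm_done.notify_all()

def read_block(chunk, blkno, wait_for_prewarm):
    """Both strategies are correct from the LFC's perspective:
    a miss simply falls through to a pageserver read."""
    with chunk.lock:
        if blkno in chunk.blocks:
            return chunk.blocks[blkno]        # hit: the reader is never blocked
        if wait_for_prewarm:
            while chunk.state == "PREWARM_STARTED":
                chunk.prewarm_done.wait()     # opt in to waiting on the prewarmer
            return chunk.blocks.get(blkno)    # hit, or a definitive miss
        return None                           # treat the in-flight prewarm as a miss
```

The "decision" Konstantin mentions is just the `wait_for_prewarm` flag here; it could be hard-coded or derived from locally available data.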
Also "can decide to wait" sounds unclear - looks like there is some AI in reader which decides whether to wait or not:)
Well, we could do that. Or we can hard-code the decision, or determine whether we're going to wait based on locally available data. I don't mind either; I just wanted to point out that this option is available as a potential optimization.
 * 3. Readers are never blocked
 * 4. Only potentially conflicting writers are blocked, *but are not ignored*.
 *
 * The only consideration is that in hot-standby mode, we *can not* ignore
Sorry, I do not understand what is specific about hot-standby mode: why can we not ignore writes to the LFC only in this case? As far as I understand, writers are not ignored in any case: a writer just has to wait until prewarm is completed.
The issue is the following:
- The prewarm worker starts work on page X, "pinning" the LFC chunk, and requests the page at LSN Y
- The redo worker needs to modify page X for the xlog record at LSN Y+1, but page X isn't in shared buffers or the LFC, so it only updates the LwLSN
- The prewarm worker had the page pinned, so concurrent modifications of the page would have gone through the LFC. The page isn't yet in the LFC, so "no modification happened", and the worker loads the requested version Y of page X into the LFC, even though the latest version of the page is Y+1. New readers will not see that latest version and will instead read the stale version Y until the page is evicted from both the LFC and shared_buffers.
Note how this "skip replay of non-local pages" optimization breaks when the chunk is marked for prewarming, because redo skips replay of records for buffers not currently in memory. To protect against this, I made sure that the "no replay for non-buffered pages" optimization is not applied when the page might currently be getting sideloaded into the LFC (sideloaded, because no shared buffer lock is acquired).
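The interleaving can be modeled in a few lines (a toy Python model, not the actual redo or LFC code; `lfc`, `lwlsn`, and `redo` are hypothetical stand-ins for the structures involved):

```python
# Toy model of the hot-standby race: redo skips replay for pages that are in
# neither shared buffers nor the LFC and only bumps the last-written LSN
# (LwLSN), which breaks if a prewarm worker is concurrently sideloading an
# older page version into the LFC.

lfc = {}              # page id -> (lsn, content)
lwlsn = {}            # page id -> last-written LSN
shared_buffers = set()

def redo(page, lsn):
    # The "skip replay of non-local pages" optimization:
    if page not in shared_buffers and page not in lfc:
        lwlsn[page] = lsn  # record that the page changed, replay nothing
    # (else: the WAL record would be applied to the in-memory copy)

Y = 100

# 1. The prewarm worker pins the chunk for page X and requests version at LSN Y.
requested_lsn = Y
# 2. Redo must modify X at LSN Y+1; X is in neither buffers nor the LFC,
#    so it only updates the LwLSN.
redo("X", Y + 1)
# 3. The prewarm worker installs the (now stale) version Y into the LFC.
lfc["X"] = (requested_lsn, f"content@{Y}")

# A reader now finds X in the LFC and gets version Y, although the LwLSN
# says the latest version is Y+1: a stale read.
stale = lfc["X"][0] < lwlsn["X"]
```

The fix in this PR is, in this model, to disable the early-return branch of `redo` whenever the chunk for X is marked as being sideloaded.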
 * 5. Read all to-be-retrieved blocks of the chunk into local memory.
 * 6. With exclusively locked LFC:
 *    1. Mark chunk as LFC_PREWARM_STARTED
 *    2. Copy this backend's prewarmer worker number (if relevant) into new
It is true for prewarm, but if we are going to use the same mechanism for storing prefetch results in the LFC as soon as they arrive, the number of writers can be as large as the number of backends (max_connections), which can be thousands.
True, but I wouldn't expect them all to be working on the LFC concurrently. Similar to the shared buffers partition locks and the WAL writer locks, I think we only need a limited number of concurrent slots to satisfy the workload (e.g. 128), as we only really need the lock once we're ready to start loading data.
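A bounded slot pool along those lines might be sketched as follows (a Python illustration; `PrewarmSlots` and the pool size are assumptions for the example, not the PR's actual API):

```python
import threading

NUM_PREWARM_SLOTS = 128  # assumed pool size; the comment above suggests e.g. 128

class PrewarmSlots:
    """Hypothetical bounded pool of sideloader slots (not the actual neon code)."""

    def __init__(self, n=NUM_PREWARM_SLOTS):
        self._free = list(range(n))
        self._lock = threading.Lock()
        self._available = threading.Condition(self._lock)

    def acquire(self):
        # A backend claims a slot only once its data is in local memory and it
        # is ready to write, so slots are held briefly and a small pool suffices.
        with self._lock:
            while not self._free:
                self._available.wait()  # all slots busy: wait, don't fail
            return self._free.pop()

    def release(self, slot):
        with self._lock:
            self._free.append(slot)
            self._available.notify()

# A pool of 2 slots for demonstration:
slots = PrewarmSlots(2)
a = slots.acquire()
b = slots.acquire()
slots.release(a)
c = slots.acquire()  # reuses the freed slot; concurrency never exceeds the pool
```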
 *    e.g. 8, as we will likely have only very few workers concurrently
 *    prewarming the LFC),
 *    3. Copy the LFC entry's refcount into the per-prewarmworker state's
 *       'wait_count', reduced by 1 (to account for the bgworker itself).
Once again, if this mechanism is used not only for prewarm, but also for normal backends, then maintaining per-backend state is undesirable.
Yes, see #10312 (comment): we can limit this to a bounded number of concurrent backends.
Note that we'll only need a prewarm state once a backend has the prefetched data in memory (there's no need to hold an LFC entry for a not-yet-received prefetch entry; actually, holding one is dangerous for other backends that want to prefetch a set of pages from that chunk). Moreover, the turnaround time for prewarming is expected to be very small: it's quite unlikely you'll wait long on concurrent backends to finish their work, as there are at most BLOCKS_PER_CHUNK backends doing IO on a given LFC chunk, and each such IO probably won't take very long.
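The wait_count handoff described above (snapshot the chunk's refcount minus the prewarmer's own pin, then sleep on a `chunkrelease` condition variable until every pre-existing pin is dropped) could be sketched like this (a hypothetical Python model; the real code would use PostgreSQL condition variables in shared memory):

```python
import threading

class ChunkPrewarmState:
    """Toy model of the per-prewarmworker state (names are illustrative)."""

    def __init__(self, refcount):
        self.lock = threading.Lock()
        self.chunkrelease = threading.Condition(self.lock)
        self.wait_count = refcount - 1  # exclude the prewarmer's own pin

    def unpin(self):
        # Called by each backend that held a pin when prewarm started.
        with self.lock:
            self.wait_count -= 1
            if self.wait_count == 0:
                self.chunkrelease.notify()  # last pinner signals chunkrelease

    def wait_for_release(self):
        # Prewarmer: wait on chunkrelease until wait_count reaches 0.
        with self.lock:
            while self.wait_count > 0:
                self.chunkrelease.wait()

# The prewarmer plus 2 other pinners; both pinners unpin concurrently.
state = ChunkPrewarmState(refcount=3)
for _ in range(2):
    threading.Thread(target=state.unpin).start()
state.wait_for_release()  # returns once both concurrent pins are dropped
```

Because the waiter re-checks `wait_count` in a loop under the lock, a notify that fires before the prewarmer starts waiting is not lost.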
 *    4. If the current chunk's prewarm worker's wait count is 0, signal
 *       chunkrelease.
 * 9. Prewarmer:
 *    1. Wait on new condition variable `chunkrelease` until the wait_count
I am not sure that having max_connections condition variables (taking into account not only the prewarm case) is OK.
Maybe we should instead maintain some turnstile of condition variables...
Would you agree that this would be resolved by limiting the number of backends that can behave like a prewarm worker?
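For completeness, the turnstile idea could take roughly this shape (a Python sketch; the pool size and hashing scheme are assumptions for illustration, not the PR's design): keep a small fixed array of condition variables and map each LFC chunk onto one by hash. Unrelated chunks may then share a CV and cause spurious wakeups, which is safe as long as waiters re-check their predicate in a loop.

```python
import threading

N_CVS = 8  # small fixed pool instead of max_connections condition variables

_lock = threading.Lock()
_cvs = [threading.Condition(_lock) for _ in range(N_CVS)]

def cv_for_chunk(chunk_id):
    """Map a chunk onto one of the pooled condition variables by hash."""
    return _cvs[hash(chunk_id) % N_CVS]

# Distinct chunks can map to the same condition variable; that only means a
# waiter may wake spuriously and loop back to sleep, never that a wakeup is
# delivered to the wrong predicate.
shared = cv_for_chunk(0) is cv_for_chunk(N_CVS)
```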
We allow extraction and loading of LFC contents through SQL-callable functions. The LFC sideloading is based on these principles:
- While sideloading, the page won't get evicted (e.g. through chunk recycling)
- Access to existing LFC contents won't be negatively impacted by sideloading
- New writes to the LFC may block, but only while sideloading is ready to write to the chunk.
- Prewarming/sideloading only adds data to the LFC; it won't evict pages.

Co-authored-by: Matthias van de Meent <matthias@neon.tech>
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
…#10442)

Problem

Prefetch is performed locally, so different backends can request the same pages from the pageserver. Such duplicated requests increase pageserver load and network traffic. Making prefetch global seems very difficult and undesirable, because different queries can access chunks at different speeds. Storing prefetched chunks in the LFC will not completely eliminate duplicates, but it can minimise such requests. The problem with storing prefetch results in the LFC is that in this case the page is not protected by a shared buffer lock, so we have to perform extra synchronisation on the LFC side. See: https://neondb.slack.com/archives/C0875PUD0LC/p1736772890602029?thread_ts=1736762541.116949&cid=C0875PUD0LC

@MMeent implementation of prewarm: see #10312

Summary of changes

Use condition variables to synchronize access to LFC entries.

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Problem
We want to prewarm the LFC, preferably without blocking any other processes (or at least blocking as few processes as possible).
Summary of changes
See
file_cache_internal.h
for implementation details.