
LFC prewarming with only light locking #10312

Open

MMeent wants to merge 1 commit into main from MMeent/lfc-prewarm-light-locks
Conversation

@MMeent (Contributor) commented Jan 8, 2025

Problem

We want to prewarm the LFC, preferably without blocking any other processes (or at least blocking as few processes as possible).

Summary of changes

See file_cache_internal.h for implementation details.

@MMeent force-pushed the MMeent/lfc-prewarm-light-locks branch from 985f2be to 1a391eb on January 8, 2025 18:56

github-actions bot commented Jan 8, 2025

7396 tests run: 7016 passed, 0 failed, 380 skipped (full report)


Flaky tests (5)

  • Postgres 17
  • Postgres 14

Code coverage* (full report)

  • functions: 33.7% (8451 of 25111 functions)
  • lines: 49.2% (70814 of 143967 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results.
538c06f at 2025-01-21T18:02:52.188Z ♻️

@MMeent force-pushed the MMeent/lfc-prewarm-light-locks branch 2 times, most recently from 0f93d01 to 94560df on January 10, 2025 22:11
@MMeent marked this pull request as ready for review January 13, 2025 12:36
@MMeent requested review from a team as code owners January 13, 2025 12:36
* - Writers to the chunk will wait for the prewarm worker to finish
* writing in the chunk before starting their own writes into
* as-of-yet unwritten blocks in the chunk.
* - Readers won't be blocked (but can decide to wait for prewarm
Contributor commented:

There seems to be a contradiction with 3), "readers are never blocked". I understand what you mean, but still the difference between "reader is blocked" and "reader has to wait" is obscure. Also, "can decide to wait" sounds unclear: it looks like there is some AI in the reader which decides whether to wait or not :)

@MMeent (Author) replied:

> I understand what you mean, but still the difference between "reader is blocked" and "reader has to wait" is obscure

The reader has the option to wait: it can behave as if the LFC doesn't contain the page when the page isn't present, or it can wait for prewarming to finish and see whether the prewarm worker put the page into the LFC. Either option is correct from the LFC's perspective.
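
A minimal sketch of those two reader behaviors; all helper names here (`lfc_block_present`, `lfc_read_block`, `prewarm_wait_for_completion`) are hypothetical, not this PR's actual API:

```c
#include "postgres.h"
#include "storage/block.h"

/* Hypothetical helpers; not the actual API of this PR. */
extern bool lfc_block_present(struct LfcEntry *entry, BlockNumber blkno);
extern bool lfc_read_block(struct LfcEntry *entry, BlockNumber blkno, char *buffer);
extern void prewarm_wait_for_completion(struct LfcEntry *entry);

/*
 * Returns true if the page was served from the LFC. On a miss in a chunk
 * that is being prewarmed, the caller picks one of two equally correct
 * behaviors: treat it as a plain cache miss, or wait for the prewarm
 * worker and re-check.
 */
static bool
lfc_read_during_prewarm(struct LfcEntry *entry, BlockNumber blkno,
						char *buffer, bool willing_to_wait)
{
	if (lfc_block_present(entry, blkno))
		return lfc_read_block(entry, blkno, buffer);	/* no blocking */

	if (!willing_to_wait)
		return false;			/* behave as if the LFC misses the page */

	prewarm_wait_for_completion(entry);
	return lfc_block_present(entry, blkno) &&
		lfc_read_block(entry, blkno, buffer);
}
```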

@MMeent (Author) replied:


Also "can decide to wait" sounds unclear - looks like there is some AI in reader which decides whether to wait or not:)

Well, we could do that. Or we can hard-code the decision and determine whether we're going to wait based on locally available data. I don't mind either way; I just wanted to point out that this option is available as a potential optimization.

* 3. Readers are never blocked
* 4. Only potentially conflicting writers are blocked, *but are not ignored*.
*
* The only consideration is that in hot-standby mode, we *can not* ignore
Contributor commented:


Sorry, I do not understand what is specific about hot-standby mode: why can we not ignore writes to the LFC only in this case? As far as I understand, writers are not ignored in any case: a writer just has to wait until prewarm is completed.

@MMeent (Author) replied:


The issue is the following:

  • The prewarm worker starts work on page X, "pinning" the LFC chunk, and requests the page at LSN Y.
  • The redo worker needs to modify page X for the xlog record at LSN Y+1, but page X isn't in shared buffers or the LFC, so it only updates the LwLSN.
  • The prewarm worker had the page pinned, so concurrent modifications of the page would have gone through the LFC. The page isn't yet in the LFC, so "no modification happened", and the worker loads the requested version Y of page X into the LFC while the latest version of the page is Y+1. New readers will not see that latest version and will instead read version Y until the buffer is evicted from both the LFC and shared_buffers.

Note how the "skip replay of non-local pages" optimization breaks when the chunk is marked for prewarming, because redo skips replaying records for buffers not currently in memory. To protect against this, I made sure that the "no replay for non-buffered pages" optimization is not applied while the page might currently be getting sideloaded into the LFC ("sideloaded", because no shared buffer lock is acquired).
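
A minimal sketch of that guard, assuming a hypothetical `lfc_chunk_marked_for_prewarm()` helper (the PR's real check lives elsewhere):

```c
#include "postgres.h"
#include "storage/buf_internals.h"

/* Hypothetical helper: is the containing LFC chunk being sideloaded? */
extern bool lfc_chunk_marked_for_prewarm(const BufferTag *tag);

/*
 * Hot-standby redo: may we skip replaying a record for a page that is in
 * neither shared buffers nor the LFC, and only advance the LwLSN?
 */
static bool
redo_can_skip_nonbuffered_page(const BufferTag *tag)
{
	/*
	 * If a prewarm worker might currently be sideloading this chunk, it
	 * could still install an older version of the page after we skipped
	 * the record, so the optimization must not be applied.
	 */
	if (lfc_chunk_marked_for_prewarm(tag))
		return false;

	return true;
}
```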

* 5. Read all to-be-retrieved blocks of the chunk into local memory.
* 6. With exclusively locked LFC:
* 1. Mark chunk as LFC_PREWARM_STARTED
* 2. Copy this backend's prewarmer worker number (if relevant) into new
Contributor commented:


It is true for prewarm, but if we are going to use the same mechanism for storing prefetch results in the LFC as soon as they arrive, the number of writers can be as large as the number of backends (max_connections), which can be thousands.

@MMeent (Author) replied:


True, but I wouldn't expect them all to be working on the LFC concurrently. Similar to the shared buffers partition locks and the WAL insert locks, I think we only need a limited number of concurrent slots to satisfy the workload (e.g. 128), as we only really need the lock once we're ready to start loading data.
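
A sketch of such a fixed slot pool; the constant, struct layout, and names are illustrative assumptions, not this PR's code:

```c
#include "postgres.h"
#include "port/atomics.h"
#include "storage/condition_variable.h"
#include "storage/spin.h"

#define NUM_PREWARM_SLOTS 128	/* assumed pool size, per the comment above */

typedef struct PrewarmSlot
{
	pg_atomic_uint32 in_use;		/* 0 = free, 1 = claimed */
	slock_t		mutex;				/* protects wait_count */
	int			wait_count;			/* readers the prewarmer must wait for */
	ConditionVariable chunkrelease;	/* signalled when wait_count hits 0 */
} PrewarmSlot;

static PrewarmSlot *prewarm_slots;	/* allocated in shared memory at startup */

/* Claim a slot just before writing data into the chunk; -1 if all busy. */
static int
prewarm_slot_acquire(void)
{
	for (int i = 0; i < NUM_PREWARM_SLOTS; i++)
	{
		uint32		expected = 0;

		if (pg_atomic_compare_exchange_u32(&prewarm_slots[i].in_use,
										   &expected, 1))
			return i;
	}
	return -1;
}
```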

* e.g. 8, as we will likely have only very few workers concurrently
* prewarming the LFC),
* 3. Copy the LFC entry's refcount into per-prewarmworker state's
* 'wait_count', reduced by 1 (to account for the bgworker itself).
Contributor commented:


Once again, if this mechanism is used not only for prewarm but also for normal backends, then maintaining per-backend state is not desirable.

@MMeent (Author) replied:


Yes, see #10312 (comment): we can limit this to a fixed number of concurrent backends.

Note that we'll only need a prewarm state once a backend has the prefetched data in memory (there's no need to hold an LFC entry for a not-yet-received prefetch entry; in fact, holding one is dangerous for other backends that want to prefetch a set of pages from that chunk). The turnaround time for prewarming is also expected to be very small: it's quite unlikely you'll wait long for concurrent backends to finish their work, as there are reasonably at most BLOCKS_PER_CHUNK backends doing IO on an LFC chunk, and each such IO probably won't take very long.
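
Continuing the hypothetical PrewarmSlot sketch above, the release path for a tracked reader could look like this: decrement wait_count and wake the prewarmer once no tracked readers remain.

```c
#include "postgres.h"
#include "storage/condition_variable.h"
#include "storage/spin.h"

/* Uses the illustrative PrewarmSlot from the earlier sketch. */
static void
lfc_entry_unpin_tracked(PrewarmSlot *slot)
{
	bool		last = false;

	SpinLockAcquire(&slot->mutex);
	Assert(slot->wait_count > 0);
	if (--slot->wait_count == 0)
		last = true;
	SpinLockRelease(&slot->mutex);

	/* Wake the prewarmer waiting on chunkrelease; no readers remain. */
	if (last)
		ConditionVariableBroadcast(&slot->chunkrelease);
}
```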

* 4. If the current chunk's prewarm worker's wait count is 0, signal
* chunkrelease.
* 9. Prewarmer:
* 1. wait on new condition variable `chunkrelease` until the wait_count
Contributor commented:


I am not sure that having max_connections condition variables (taking into account not only the prewarm case) is OK.
Maybe we should instead maintain some turnstile of condition variables...

@MMeent (Author) replied:


Would you agree that this would be resolved by limiting the number of backends that can behave like a prewarm worker?
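
For illustration, a sketch of the prewarmer-side wait using PostgreSQL's condition variable API, with one CV per slot (rather than per backend), matching the limited-slots idea; the names and the wait event are placeholders, not this PR's code:

```c
#include "postgres.h"
#include "storage/condition_variable.h"
#include "storage/spin.h"
#include "utils/wait_event.h"

/* Uses the illustrative PrewarmSlot from the earlier sketches. */
static void
prewarm_wait_for_readers(PrewarmSlot *slot)
{
	ConditionVariablePrepareToSleep(&slot->chunkrelease);
	for (;;)
	{
		bool		done;

		SpinLockAcquire(&slot->mutex);
		done = (slot->wait_count == 0);
		SpinLockRelease(&slot->mutex);

		if (done)
			break;

		/* WAIT_EVENT_BUFFER_IO is a stand-in wait event for the sketch. */
		ConditionVariableSleep(&slot->chunkrelease, WAIT_EVENT_BUFFER_IO);
	}
	ConditionVariableCancelSleep();
}
```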

We allow extraction and loading of LFC contents through SQL-callable
functions.

The LFC sideloading is based on these principles:

- While sideloading, the page won't get evicted (e.g. through chunk recycling)
- Access to existing LFC contents won't be negatively impacted by sideloading
- New writes to the LFC may block, but only while the sideloader is actually
  writing to the chunk.
- Prewarming/sideloading only adds data to the LFC; it won't evict pages.

Co-authored-by: Matthias van de Meent <matthias@neon.tech>
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
@MMeent force-pushed the MMeent/lfc-prewarm-light-locks branch from 94560df to 538c06f on January 21, 2025 15:03
github-merge-queue bot pushed a commit that referenced this pull request Feb 21, 2025
…#10442)

## Problem

Prefetch is performed locally, so different backends can request the same
pages from PS.
Such duplicated requests increase the load on the page server and network
traffic.
Making prefetch global seems to be very difficult and undesirable,
because different queries can access chunks at different speeds. Storing
prefetched chunks in the LFC will not completely eliminate duplicates, but can
minimise such requests.

The problem with storing prefetch results in the LFC is that in this case the
page is not protected by a shared buffer lock.
So we will have to perform extra synchronisation on the LFC side.

See:

https://neondb.slack.com/archives/C0875PUD0LC/p1736772890602029?thread_ts=1736762541.116949&cid=C0875PUD0LC

@MMeent's implementation of prewarm: see #10312


## Summary of changes

Use condition variables to synchronize access to LFC entries.

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>