LFC prewarming with only light locking #10312
base: main
Conversation
7396 tests run: 7016 passed, 0 failed, 380 skipped (full report). Flaky tests (5): Postgres 17, Postgres 14.
Code coverage* (full report)
* collected from Rust tests only. The comment gets automatically updated with the latest test results.
538c06f at 2025-01-21T18:02:52.188Z
 * - Writers to the chunk will wait for the prewarm worker to finish
 *   writing in the chunk before starting their own writes into
 *   as-of-yet unwritten blocks in the chunk.
 * - Readers won't be blocked (but can decide to wait for prewarm
There seems to be a contradiction with 3) ("readers are never blocked"). I understand what you mean, but still the difference between "reader is blocked" and "reader has to wait" is obscure. Also, "can decide to wait" sounds unclear: it makes it look like there is some AI in the reader which decides whether to wait or not :)
I understand what you mean, but still the difference between "reader is blocked" and "reader has to wait" is obscure
The reader has the option to wait: if the page isn't present, it can behave as if the LFC doesn't contain the page, but it can also wait for the prewarming to be done and see whether the prewarm worker put the page into the LFC. Either option is correct from the LFC's perspective.
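To make the two reader strategies concrete, here is a minimal sketch (in Python for brevity; names such as `LfcChunk` and `read_block` are illustrative, not the actual neon LFC code) of a reader that either treats an in-progress prewarm as a cache miss or opts in to waiting:

```python
import threading

class LfcChunk:
    """Toy model of one LFC chunk that is being prewarmed (hypothetical)."""

    def __init__(self):
        self.lock = threading.Lock()
        self.prewarm_done = threading.Condition(self.lock)
        self.state = "PREWARM_STARTED"
        self.blocks = {}  # blkno -> page bytes

    def finish_prewarm(self, pages):
        with self.lock:
            self.blocks.update(pages)
            self.state = "IDLE"
            self.prewarm_done.notify_all()

def read_block(chunk, blkno, wait_for_prewarm):
    """Both strategies are correct from the LFC's perspective:
    a miss simply falls through to a pageserver read."""
    with chunk.lock:
        if blkno in chunk.blocks:
            return chunk.blocks[blkno]        # hit: the reader is never blocked
        if wait_for_prewarm:
            while chunk.state == "PREWARM_STARTED":
                chunk.prewarm_done.wait()     # opt in to waiting on the prewarmer
            return chunk.blocks.get(blkno)    # hit, or a definitive miss
        return None                           # treat the in-flight prewarm as a miss
```

The "decision" Konstantin mentions is just the `wait_for_prewarm` flag here; it could be hard-coded or derived from locally available data.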
Also "can decide to wait" sounds unclear - looks like there is some AI in reader which decides whether to wait or not:)
Well, we could do that. Or we can hard-code the decision, or determine whether we're going to wait based on locally available data. I don't mind either; I just wanted to point out that this option is available as a potential optimization.
 * 3. Readers are never blocked
 * 4. Only potentially conflicting writers are blocked, *but are not ignored*.
 *
 * The only consideration is that in hot-standby mode, we *can not* ignore
Sorry, I do not understand what is specific about hot-standby mode: why can we not ignore writes to the LFC only in this case? As far as I understand, writers are not ignored in any case: a writer just has to wait until prewarm is completed.
The issue is the following:
- The prewarm worker starts work on page X, "pinning" the LFC chunk, and requests the page at LSN Y
- The redo worker needs to modify page X for the xlog record at LSN Y+1, but page X isn't in shared buffers or the LFC, so it only updates the LwLSN
- The prewarm worker had the page pinned, so concurrent modifications of the page would have gone through the LFC. The page isn't yet in the LFC, so "no modification happened", and the worker loads the requested version Y of page X into the LFC, even though the latest version of the page is Y+1. New readers will not see that latest version and will instead read the stale version Y until the page is evicted from both the LFC and shared_buffers.
Note how this "skip replay of non-local pages" optimization breaks when the chunk is marked for prewarming, because redo skips replay of records for buffers not currently in memory. To protect against this, I made sure that the "no replay for non-buffered pages" optimization is not applied when the page might currently be getting sideloaded into the LFC (sideloaded, because no shared buffer lock is acquired).
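The interleaving can be modeled in a few lines (a toy Python model, not the actual redo or LFC code; `lfc`, `lwlsn`, and `redo` are hypothetical stand-ins for the structures involved):

```python
# Toy model of the hot-standby race: redo skips replay for pages that are in
# neither shared buffers nor the LFC and only bumps the last-written LSN
# (LwLSN), which breaks if a prewarm worker is concurrently sideloading an
# older page version into the LFC.

lfc = {}              # page id -> (lsn, content)
lwlsn = {}            # page id -> last-written LSN
shared_buffers = set()

def redo(page, lsn):
    # The "skip replay of non-local pages" optimization:
    if page not in shared_buffers and page not in lfc:
        lwlsn[page] = lsn  # record that the page changed, replay nothing
    # (else: the WAL record would be applied to the in-memory copy)

Y = 100

# 1. The prewarm worker pins the chunk for page X and requests version at LSN Y.
requested_lsn = Y
# 2. Redo must modify X at LSN Y+1; X is in neither buffers nor the LFC,
#    so it only updates the LwLSN.
redo("X", Y + 1)
# 3. The prewarm worker installs the (now stale) version Y into the LFC.
lfc["X"] = (requested_lsn, f"content@{Y}")

# A reader now finds X in the LFC and gets version Y, although the LwLSN
# says the latest version is Y+1: a stale read.
stale = lfc["X"][0] < lwlsn["X"]
```

The fix in this PR is, in this model, to disable the early-return branch of `redo` whenever the chunk for X is marked as being sideloaded.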
 * 5. Read all to-be-retrieved blocks of the chunk into local memory.
 * 6. With exclusively locked LFC:
 *    1. Mark chunk as LFC_PREWARM_STARTED
 *    2. Copy this backend's prewarmer worker number (if relevant) into new
It is true for prewarm, but if we are going to use the same mechanism for storing prefetch results in the LFC as soon as they arrive, the number of writers can be as large as the number of backends (max_connections), which can be thousands.
True, but I wouldn't expect them all to be working on the LFC concurrently. Similar to the shared buffers partition locks and the WAL writer locks, I think we only need a limited number of concurrent slots to satisfy the workload (e.g. 128), as we only really need the lock once we're ready to start loading data.
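A bounded slot pool along those lines might be sketched as follows (a Python illustration; `PrewarmSlots` and the pool size are assumptions for the example, not the PR's actual API):

```python
import threading

NUM_PREWARM_SLOTS = 128  # assumed pool size; the comment above suggests e.g. 128

class PrewarmSlots:
    """Hypothetical bounded pool of sideloader slots (not the actual neon code)."""

    def __init__(self, n=NUM_PREWARM_SLOTS):
        self._free = list(range(n))
        self._lock = threading.Lock()
        self._available = threading.Condition(self._lock)

    def acquire(self):
        # A backend claims a slot only once its data is in local memory and it
        # is ready to write, so slots are held briefly and a small pool suffices.
        with self._lock:
            while not self._free:
                self._available.wait()  # all slots busy: wait, don't fail
            return self._free.pop()

    def release(self, slot):
        with self._lock:
            self._free.append(slot)
            self._available.notify()

# A pool of 2 slots for demonstration:
slots = PrewarmSlots(2)
a = slots.acquire()
b = slots.acquire()
slots.release(a)
c = slots.acquire()  # reuses the freed slot; concurrency never exceeds the pool
```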
 *    e.g. 8, as we will likely have only very few workers concurrently
 *    prewarming the LFC),
 *    3. Copy the LFC entry's refcount into the per-prewarmworker state's
 *       'wait_count', reduced by 1 (to account for the bgworker itself).
Once again, if this mechanism is used not only for prewarm, but also for normal backends, then maintaining per-backend state is undesirable.
Yes, see #10312 (comment): we can limit this to a bounded number of concurrent backends.
Note that we'll only need a prewarm state once a backend has the prefetched data in memory (there's no need to hold an LFC entry for a not-yet-received prefetch entry; actually, holding one is dangerous for other backends that want to prefetch a set of pages from that chunk). Moreover, the turnaround time for prewarming is expected to be very small: it's quite unlikely you'll wait long on concurrent backends to finish their work, as there are at most BLOCKS_PER_CHUNK backends doing IO on a given LFC chunk, and each such IO probably won't take very long.
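The wait_count handoff described above (snapshot the chunk's refcount minus the prewarmer's own pin, then sleep on a `chunkrelease` condition variable until every pre-existing pin is dropped) could be sketched like this (a hypothetical Python model; the real code would use PostgreSQL condition variables in shared memory):

```python
import threading

class ChunkPrewarmState:
    """Toy model of the per-prewarmworker state (names are illustrative)."""

    def __init__(self, refcount):
        self.lock = threading.Lock()
        self.chunkrelease = threading.Condition(self.lock)
        self.wait_count = refcount - 1  # exclude the prewarmer's own pin

    def unpin(self):
        # Called by each backend that held a pin when prewarm started.
        with self.lock:
            self.wait_count -= 1
            if self.wait_count == 0:
                self.chunkrelease.notify()  # last pinner signals chunkrelease

    def wait_for_release(self):
        # Prewarmer: wait on chunkrelease until wait_count reaches 0.
        with self.lock:
            while self.wait_count > 0:
                self.chunkrelease.wait()

# The prewarmer plus 2 other pinners; both pinners unpin concurrently.
state = ChunkPrewarmState(refcount=3)
for _ in range(2):
    threading.Thread(target=state.unpin).start()
state.wait_for_release()  # returns once both concurrent pins are dropped
```

Because the waiter re-checks `wait_count` in a loop under the lock, a notify that fires before the prewarmer starts waiting is not lost.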
 *    4. If the current chunk's prewarm worker's wait count is 0, signal
 *       chunkrelease.
 * 9. Prewarmer:
 *    1. Wait on new condition variable `chunkrelease` until the wait_count
I am not sure that having max_connections condition variables (taking into account not only the prewarm case) is OK.
Maybe we should instead maintain some turnstile of condition variables...
Would you agree that this would be resolved by limiting the number of backends that can behave like a prewarm worker?
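For completeness, the turnstile idea could take roughly this shape (a Python sketch; the pool size and hashing scheme are assumptions for illustration, not the PR's design): keep a small fixed array of condition variables and map each LFC chunk onto one by hash. Unrelated chunks may then share a CV and cause spurious wakeups, which is safe as long as waiters re-check their predicate in a loop.

```python
import threading

N_CVS = 8  # small fixed pool instead of max_connections condition variables

_lock = threading.Lock()
_cvs = [threading.Condition(_lock) for _ in range(N_CVS)]

def cv_for_chunk(chunk_id):
    """Map a chunk onto one of the pooled condition variables by hash."""
    return _cvs[hash(chunk_id) % N_CVS]

# Distinct chunks can map to the same condition variable; that only means a
# waiter may wake spuriously and loop back to sleep, never that a wakeup is
# delivered to the wrong predicate.
shared = cv_for_chunk(0) is cv_for_chunk(N_CVS)
```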
We allow extraction and loading of LFC contents through SQL-callable functions. The LFC sideloading is based on these principles:
- While sideloading, the page won't get evicted (e.g. through chunk recycling)
- Access to existing LFC contents won't be negatively impacted by sideloading
- New writes to the LFC may block, but only while sideloading is ready to write to the chunk.
- Prewarming/sideloading only adds data to the LFC; it won't evict pages.

Co-authored-by: Matthias van de Meent <matthias@neon.tech>
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
…#10442)

Problem

Prefetch is performed locally, so different backends can request the same pages from the pageserver. Such duplicated requests increase pageserver load and network traffic. Making prefetch global seems very difficult and undesirable, because different queries can access chunks at different speeds. Storing prefetched chunks in the LFC will not completely eliminate duplicates, but it can minimise such requests. The problem with storing prefetch results in the LFC is that in this case the page is not protected by a shared buffer lock, so we have to perform extra synchronisation on the LFC side. See: https://neondb.slack.com/archives/C0875PUD0LC/p1736772890602029?thread_ts=1736762541.116949&cid=C0875PUD0LC

@MMeent implementation of prewarm: see #10312

Summary of changes

Use condition variables to synchronize access to LFC entries.

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Problem
We want to prewarm the LFC, preferably without blocking any other processes (or at least blocking as few processes as possible).
Summary of changes
See
file_cache_internal.h
for implementation details.