pageserver: getpage requests sometimes skip reading recently written image layers #9185
Comments
A cleaner reproducer that uses layer eviction + on-demand downloads to prove which layers are touched by a getpage request: https://github.com/neondatabase/neon/tree/jcsp/layer-map-search-at-image-lsn-3

This test does reads at exactly the LSN of the image layer, but I can also reproduce the issue with some writes between generating the image layer and doing the read, so this is not something that only occurs when reading exactly at the image layer's LSN. I suspect our reads are skipping the image layer until the next time we freeze the ephemeral layer.
Perhaps this piece of logic is at fault in `get_vectored_reconstruct_data_timeline`:
...because `lsn_range` is being constructed from the absolute start of the layer. Our ...
## Problem

These tests can encounter a bug in the pageserver read path (#9185) which occurs under the very specific circumstances that the tests create, but is very unlikely to happen in the field. We will fix the bug, but in the meantime let's un-flake the tests.

Related: #10720

## Summary of changes

- Permit "could not find data for key" errors in tests affected by #9185
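As a rough illustration of what "permitting" an error in a test can look like, here is a minimal, self-contained sketch of a regex allow-list applied to log lines. The names `ALLOWED_ERRORS` and `unexpected_errors`, and the sample log lines, are hypothetical and are not taken from the neon test suite; only the message text "could not find data for key" comes from the change description above.

```python
import re

# Hypothetical allow-list: error messages a test tolerates while #9185 is open.
ALLOWED_ERRORS = [r".*could not find data for key.*"]


def unexpected_errors(log_lines, allowed=ALLOWED_ERRORS):
    """Return ERROR lines that do not match any allow-listed pattern."""
    patterns = [re.compile(p) for p in allowed]
    return [
        line
        for line in log_lines
        if "ERROR" in line and not any(p.match(line) for p in patterns)
    ]


# Made-up log lines, for illustration only.
sample = [
    "ERROR could not find data for key 0001 at LSN 0/20",
    "ERROR disk full",
    "INFO request served",
]
print(unexpected_errors(sample))  # ['ERROR disk full']
```

A test using this kind of filter would fail only on errors left in the returned list, so the known read-path error no longer makes it flaky.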
Via investigation of #9058 -- in that issue, it was observed that layers before recently written image layers were being visited by getpage requests.
It seems like under some circumstances, a getpage request to the exact same LSN where an image layer exists can fail to hit that image layer. Not clear if being at the exact same LSN is important or not: it might just be that we don't hit image layers for reads until the current in-memory layer is closed?
Lots of uncertainty here, not claiming to have conclusively diagnosed this.
Branch with experimental test:
https://github.com/neondatabase/neon/tree/jcsp/layer-map-search-at-image-lsn-2
In that branch, there are some log lines hacked in to record which layers are visited at INFO level. In the test, there is a checkpoint line commented out:
The presence or absence of in-memory layers shouldn't make any difference to whether reads hit an image layer, but apparently it does.
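To make the suspected behaviour concrete, here is a toy model of a downward layer search. It is not the pageserver's code: `Layer`, `visited_layers` and the `clamp_to_image` flag are hypothetical, and the model assumes (as speculated in this issue, not confirmed) that after visiting the open in-memory layer the search resumes from that layer's absolute start LSN.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Layer:
    name: str
    kind: str       # "image", "delta" or "inmemory"
    lsn_start: int  # first LSN covered by the layer


def visited_layers(
    historic: List[Layer],
    inmemory: Optional[Layer],
    read_lsn: int,
    clamp_to_image: bool,
) -> List[str]:
    """Walk downward from read_lsn and list the layers touched (toy model)."""
    visited = []
    upper = read_lsn + 1  # exclusive upper bound of the remaining search
    if inmemory is not None and inmemory.lsn_start < upper:
        visited.append(inmemory.name)
        floor = inmemory.lsn_start  # assumed behaviour: resume at layer start
        if clamp_to_image:
            # Hypothetical fix: never descend past the newest image layer.
            image_lsns = [
                l.lsn_start
                for l in historic
                if l.kind == "image" and l.lsn_start < upper
            ]
            if image_lsns:
                floor = max(floor, max(image_lsns) + 1)
        upper = floor
    while True:
        below = [l for l in historic if l.lsn_start < upper]
        if not below:
            return visited
        layer = max(below, key=lambda l: l.lsn_start)
        visited.append(layer.name)
        if layer.kind == "image":
            return visited  # an image layer completes the read
        upper = layer.lsn_start


historic = [Layer("delta@0", "delta", 0), Layer("image@20", "image", 20)]
open_layer = Layer("inmemory@10", "inmemory", 10)

print(visited_layers(historic, open_layer, 20, clamp_to_image=False))
# ['inmemory@10', 'delta@0']   (image layer skipped)
print(visited_layers(historic, None, 20, clamp_to_image=False))
# ['image@20']                 (no in-memory layer: image is hit)
print(visited_layers(historic, open_layer, 20, clamp_to_image=True))
# ['inmemory@10', 'image@20']
```

In this model the image layer is skipped whenever an in-memory layer that started below the image's LSN is still open, which matches the observation that a checkpoint changes which layers are visited. Whether the real read path behaves this way is exactly what remains to be confirmed.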