
pageserver: "could not find data for key" in test_scrubber_physical_gc_ancestors #10720

Open

jcsp opened this issue Feb 7, 2025 · 5 comments · May be fixed by #11000 or #10861
Assignees: VladLazar
Labels: c/storage/pageserver (Component: storage: pageserver), m/stability-feb25, t/bug (Issue Type: Bug), triaged (bugs that were already triaged)

Comments

@jcsp
Contributor

jcsp commented Feb 7, 2025

https://neon-github-public-dev.s3.amazonaws.com/reports/main/13196464234/index.html#testresult/628b11d330ec56c6/retries

CRITICAL: could not ingest record at 0/15C4FF8
Hint: use scripts/check_allowed_errors.sh to test any new allowed_error you add

Full error:

timeline_id=bddb5abb9b9209fd64bf7b9bba3d9c08}:connection{node_id=1}: CRITICAL: could not ingest record at 0/15C4FF8

Caused by:
    could not find data for key 020000000000000000000000000000000000 (shard ShardNumber(0)) at LSN 0/15C4FF9, request LSN 0/15C4FF8, ancestor 0/0
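Reading the error: pageserver LSNs are printed as two 32-bit hex halves, and the lookup LSN in the message is exactly one above the request LSN, presumably because the read path searches up to an exclusive bound just past the request. A quick sketch of that arithmetic; the helper functions below are mine, not pageserver code.

```python
# Sketch only (not pageserver code): decode the "hi/lo" LSN notation from the
# error above and check that the lookup LSN is request LSN + 1.

def parse_lsn(s: str) -> int:
    """Parse an LSN printed as '<hi>/<lo>' (hex) into a 64-bit integer."""
    hi, lo = s.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def format_lsn(v: int) -> str:
    return f"{v >> 32:X}/{v & 0xFFFFFFFF:X}"

request_lsn = parse_lsn("0/15C4FF8")  # "request LSN" from the error
lookup_lsn = parse_lsn("0/15C4FF9")   # "at LSN" from the error

assert lookup_lsn == request_lsn + 1
print(format_lsn(request_lsn), "->", format_lsn(lookup_lsn))
```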

jcsp added the c/storage/pageserver (Component: storage: pageserver) and t/bug (Issue Type: Bug) labels on Feb 7, 2025
jcsp changed the title from "could not find data for key" in test_scrubber_physical_gc_ancestors to pageserver: "could not find data for key" in test_scrubber_physical_gc_ancestors on Feb 7, 2025
jcsp self-assigned this on Feb 7, 2025
@jcsp
Contributor Author

jcsp commented Feb 10, 2025

Also seen for key 020000000000000000000000000000000000 in test_sharding_gc in #10741

@jcsp
Contributor Author

jcsp commented Feb 10, 2025

Very strangely, I can see the TWOPHASEDIR_KEY being included in image layer generation on non-zero shards as expected, but then subsequently not being found when read back during ingest.

@jcsp
Contributor Author

jcsp commented Feb 10, 2025

On a failure of test_sharding_gc I can see that there is an image layer covering the key: 000000068000000000000017720000000001-030000000000000000000000000000000002__00000000016A8E98-v1-00000001

and with pagectl I can see that the key is indeed in the file.

However, the ingest code is trying to read at an older LSN than the image layer was created at.
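To make the failure mode concrete, here is a toy model (illustrative types and numbers only, not the pageserver's actual layer code; the LSNs are borrowed loosely from the logs above): an image layer materialized at LSN X can only serve reads at or above X, so once GC has removed the history below X, a read at an older LSN has nothing left to hit.

```python
# Toy model, not pageserver code: layers keyed by LSN, showing why a read below
# the image layer's LSN fails once the older history has been GC'd away.

from dataclasses import dataclass

@dataclass
class ImageLayer:
    lsn: int          # LSN the image was materialized at
    keys: set

@dataclass
class DeltaLayer:
    lsn_start: int
    lsn_end: int      # exclusive
    keys: set

def read(key, request_lsn, images, deltas):
    # An image at LSN X can only serve reads at or above X.
    for img in sorted(images, key=lambda i: i.lsn, reverse=True):
        if img.lsn <= request_lsn and key in img.keys:
            return f"served from image@{img.lsn:#x}"
    # Otherwise we need delta history covering the requested LSN (simplified).
    for d in deltas:
        if d.lsn_start <= request_lsn < d.lsn_end and key in d.keys:
            return f"served from delta@[{d.lsn_start:#x},{d.lsn_end:#x})"
    raise LookupError(f"could not find data for key {key} at request LSN {request_lsn:#x}")

image = ImageLayer(lsn=0x016A8E98, keys={"TWOPHASEDIR_KEY"})  # LSN from the layer file name above
deltas = []   # GC with pitr=0 removed the older delta history

try:
    # Ingest needs the key at an LSN *older* than the image layer was created at.
    read("TWOPHASEDIR_KEY", 0x015C4FF8, [image], deltas)
except LookupError as e:
    print(e)  # mirrors the "could not find data for key" failure
```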

@jcsp
Contributor Author

jcsp commented Feb 10, 2025

I think this is quite... niche. You need a shard that (roughly sketched in code after this list):

  1. passes through some WAL region where there is no data for it
  2. then gets a freeze_and_flush that advances its disk_consistent_lsn up to the highest LSN it has seen (the "Advancing disk_consistent_lsn past WAL ingest gap" code path)
  3. then does ingest some data and opens an in-memory layer, but that in-memory layer starts from OpenLayerManager.next_open_layer_at, which is lower than where we actually ingested up to
  4. then generates an image layer at disk_consistent_lsn, which now falls within the ephemeral layer range
  5. then runs GC with pitr=0, which may remove layers below the image layer just generated
  6. finally, ingests a record type whose handling tries to read a key that is not present in the ephemeral layer.
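A rough sketch of that sequence (made-up variable names, not the pageserver's API): the essential inconsistency is that the open in-memory layer's start LSN lags behind disk_consistent_lsn after a flush over an ingest gap, so a later image layer lands inside the open layer's nominal range.

```python
# Rough sketch of steps 1-6; names and LSN values are made up.

next_open_layer_at = 0x0100      # recorded before the gap
disk_consistent_lsn = 0x0100

# Steps 1+2: a WAL region with no data for this shard, then freeze_and_flush
# advances disk_consistent_lsn past the gap without opening any layer.
disk_consistent_lsn = 0x0500

# Step 3: real data arrives; the in-memory layer opens at next_open_layer_at,
# which is well below the LSN we have actually ingested up to.
open_layer_start = next_open_layer_at
ingested_up_to = 0x0600

# Step 4: image layer generation happens at disk_consistent_lsn, which now
# falls inside the open layer's nominal range.
image_layer_lsn = disk_consistent_lsn
assert open_layer_start < image_layer_lsn <= ingested_up_to

# Step 5: GC with pitr=0 may now drop layers below image_layer_lsn.
# Step 6: an ingest record that needs a key absent from the open layer has to
# read below image_layer_lsn, where nothing remains -> "could not find data for key".
print(hex(open_layer_start), hex(image_layer_lsn), hex(ingested_up_to))
```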

The reason this happens at all in certain tests is:

  • they're hammering the pageserver with cycles of write, checkpoint, write, checkpoint -- so steps 1+2+3 are quite likely.
  • then, to exercise GC, they intentionally do image layer generation and GC in quick succession (roughly as sketched below).
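For reference, that test pattern looks roughly like this (hedged sketch; helper and parameter names are assumed to resemble the Python test harness, not copied from the actual tests):

```python
# Hedged sketch of the test pattern: write/checkpoint cycles, then image layer
# generation and GC back-to-back. Names are assumptions, not the real tests.

def hammer(endpoint, ps_http, tenant_id, timeline_id, cycles=10):
    for _ in range(cycles):
        endpoint.safe_psql("INSERT INTO t SELECT generate_series(1, 1000)")  # write
        ps_http.timeline_checkpoint(tenant_id, timeline_id)                  # checkpoint

    # Then exercise GC: force image layer creation and GC in quick succession.
    ps_http.timeline_compact(tenant_id, timeline_id, force_image_layer_creation=True)
    ps_http.timeline_gc(tenant_id, timeline_id, gc_horizon=0)
```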

jcsp added a commit that referenced this issue Feb 10, 2025
github-merge-queue bot pushed a commit that referenced this issue Feb 11, 2025
## Problem

These tests can encounter a bug in the pageserver read path (#9185)
which occurs under the very specific circumstances that the tests
create, but is very unlikely to happen in the field.

We will fix the bug, but in the meantime let's un-flake the tests.

Related: #10720

## Summary of changes

- Permit "could not find data for key" errors in tests affected by #9185
@jcsp
Contributor Author

jcsp commented Feb 11, 2025

disk_consistent_lsn is weird:

  • We use it for backpressure (~10GB threshold; rough sketch after this list)
  • It doesn't literally mean what it used to mean: it can advance without flushing an L0 for shards.
  • There's no such thing as "disk consistent" any more, we always load from remote storage.
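On the backpressure point: the check is essentially a lag comparison against that ~10GB threshold. A very rough sketch, with the names and the exact comparison being my assumptions rather than the real compute/pageserver logic:

```python
# Illustrative only: backpressure keyed on how far disk_consistent_lsn lags the
# last ingested LSN. Threshold from the ~10GB figure above; names are made up.

MAX_FLUSH_LAG = 10 * 1024**3  # ~10 GB, expressed in LSN bytes

def should_backpressure(last_record_lsn: int, disk_consistent_lsn: int) -> bool:
    return (last_record_lsn - disk_consistent_lsn) > MAX_FLUSH_LAG

print(should_backpressure(last_record_lsn=11 * 1024**3, disk_consistent_lsn=0))  # True
```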

github-merge-queue bot pushed a commit that referenced this issue Feb 13, 2025
## Problem

In #10752 I used an overly-strict regex that only ignored error on a
particular key.

## Summary of changes

- Drop key from regex so it matches all such errors
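In other words, the pattern just loses the key. A sketch of the before/after, with the strings assumed rather than copied from the PR; the example key on the last line is borrowed from the layer file name earlier in this thread, purely for illustration:

```python
import re

# Before (overly strict): only matched the error for one specific key.
too_strict = r".*could not find data for key 020000000000000000000000000000000000.*"
# After: match the error for any key.
relaxed = r".*could not find data for key.*"

# A same-shaped error on a different key is caught only by the relaxed pattern.
line = "CRITICAL: could not find data for key 000000068000000000000017720000000001 (shard ShardNumber(1)) ..."
assert not re.match(too_strict, line)
assert re.match(relaxed, line)
print("relaxed pattern matches:", bool(re.match(relaxed, line)))
```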
jcsp added a commit that referenced this issue Feb 17, 2025
jcsp added the m/stability-feb25 and triaged (bugs that were already triaged) labels on Feb 19, 2025
jcsp assigned VladLazar and unassigned jcsp on Feb 21, 2025