
pageserver: "could not find data for key" in test_scrubber_physical_gc_ancestors #10720

Open

jcsp opened this issue Feb 7, 2025 · 5 comments · May be fixed by #11000 or #10861
Assignees: VladLazar
Labels: c/storage/pageserver (Component: storage: pageserver), m/stability-feb25, t/bug (Issue Type: Bug), triaged (bugs that were already triaged)

Comments

@jcsp
Contributor

jcsp commented Feb 7, 2025

https://neon-github-public-dev.s3.amazonaws.com/reports/main/13196464234/index.html#testresult/628b11d330ec56c6/retries

CRITICAL: could not ingest record at 0/15C4FF8
Hint: use scripts/check_allowed_errors.sh to test any new allowed_error you add

Full error:

timeline_id=bddb5abb9b9209fd64bf7b9bba3d9c08}:connection{node_id=1}: CRITICAL: could not ingest record at 0/15C4FF8

Caused by:
    could not find data for key 020000000000000000000000000000000000 (shard ShardNumber(0)) at LSN 0/15C4FF9, request LSN 0/15C4FF8, ancestor 0/0
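Reading the error: pageserver LSNs are printed as two 32-bit hex halves, and the lookup LSN in the message is exactly one above the request LSN, presumably because the read path searches up to an exclusive bound just past the request. A quick sketch of that arithmetic; the helper functions below are mine, not pageserver code.

```python
# Sketch only (not pageserver code): decode the "hi/lo" LSN notation from the
# error above and check that the lookup LSN is request LSN + 1.

def parse_lsn(s: str) -> int:
    """Parse an LSN printed as '<hi>/<lo>' (hex) into a 64-bit integer."""
    hi, lo = s.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def format_lsn(v: int) -> str:
    return f"{v >> 32:X}/{v & 0xFFFFFFFF:X}"

request_lsn = parse_lsn("0/15C4FF8")  # "request LSN" from the error
lookup_lsn = parse_lsn("0/15C4FF9")   # "at LSN" from the error

assert lookup_lsn == request_lsn + 1
print(format_lsn(request_lsn), "->", format_lsn(lookup_lsn))
```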

jcsp added the c/storage/pageserver (Component: storage: pageserver) and t/bug (Issue Type: Bug) labels on Feb 7, 2025
jcsp changed the title from "could not find data for key" in test_scrubber_physical_gc_ancestors to pageserver: "could not find data for key" in test_scrubber_physical_gc_ancestors on Feb 7, 2025
jcsp self-assigned this on Feb 7, 2025
@jcsp
Contributor Author

jcsp commented Feb 10, 2025

Also seen for key 020000000000000000000000000000000000 in test_sharding_gc in #10741

@jcsp
Contributor Author

jcsp commented Feb 10, 2025

Very strangely, I can see the TWOPHASEDIR_KEY being included in image layer generation on non-zero shards as expected, but then subsequently not being found when read back during ingest.

@jcsp
Contributor Author

jcsp commented Feb 10, 2025

On a failure of test_sharding_gc I can see that there is an image layer covering the key: 000000068000000000000017720000000001-030000000000000000000000000000000002__00000000016A8E98-v1-00000001

and with pagectl I can see that the key is indeed in the file.

However, the ingest code is trying to read at an older LSN than the image layer was created at.
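To make the failure mode concrete, here is a toy model (illustrative types and numbers only, not the pageserver's actual layer code; the LSNs are borrowed loosely from the logs above): an image layer materialized at LSN X can only serve reads at or above X, so once GC has removed the history below X, a read at an older LSN has nothing left to hit.

```python
# Toy model, not pageserver code: layers keyed by LSN, showing why a read below
# the image layer's LSN fails once the older history has been GC'd away.

from dataclasses import dataclass

@dataclass
class ImageLayer:
    lsn: int          # LSN the image was materialized at
    keys: set

@dataclass
class DeltaLayer:
    lsn_start: int
    lsn_end: int      # exclusive
    keys: set

def read(key, request_lsn, images, deltas):
    # An image at LSN X can only serve reads at or above X.
    for img in sorted(images, key=lambda i: i.lsn, reverse=True):
        if img.lsn <= request_lsn and key in img.keys:
            return f"served from image@{img.lsn:#x}"
    # Otherwise we need delta history covering the requested LSN (simplified).
    for d in deltas:
        if d.lsn_start <= request_lsn < d.lsn_end and key in d.keys:
            return f"served from delta@[{d.lsn_start:#x},{d.lsn_end:#x})"
    raise LookupError(f"could not find data for key {key} at request LSN {request_lsn:#x}")

image = ImageLayer(lsn=0x016A8E98, keys={"TWOPHASEDIR_KEY"})  # LSN from the layer file name above
deltas = []   # GC with pitr=0 removed the older delta history

try:
    # Ingest needs the key at an LSN *older* than the image layer was created at.
    read("TWOPHASEDIR_KEY", 0x015C4FF8, [image], deltas)
except LookupError as e:
    print(e)  # mirrors the "could not find data for key" failure
```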

@jcsp
Contributor Author

jcsp commented Feb 10, 2025

I think this is quite... niche. You need a shard that (roughly sketched in code after this list):

  1. passes through some WAL region where there is no data for it
  2. then gets a freeze_and_flush that advances its disk_consistent_lsn up to the highest LSN it has seen (the "Advancing disk_consistent_lsn past WAL ingest gap" code path)
  3. then does ingest some data and opens an in-memory layer, but that in-memory layer starts from OpenLayerManager.next_open_layer_at, which is lower than where we actually ingested up to
  4. then generates an image layer at disk_consistent_lsn, which now falls within the ephemeral layer range
  5. then runs GC with pitr=0, which may remove layers below the image layer just generated
  6. finally, ingests a record type whose handling tries to read a key that is not present in the ephemeral layer.
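A rough sketch of that sequence (made-up variable names, not the pageserver's API): the essential inconsistency is that the open in-memory layer's start LSN lags behind disk_consistent_lsn after a flush over an ingest gap, so a later image layer lands inside the open layer's nominal range.

```python
# Rough sketch of steps 1-6; names and LSN values are made up.

next_open_layer_at = 0x0100      # recorded before the gap
disk_consistent_lsn = 0x0100

# Steps 1+2: a WAL region with no data for this shard, then freeze_and_flush
# advances disk_consistent_lsn past the gap without opening any layer.
disk_consistent_lsn = 0x0500

# Step 3: real data arrives; the in-memory layer opens at next_open_layer_at,
# which is well below the LSN we have actually ingested up to.
open_layer_start = next_open_layer_at
ingested_up_to = 0x0600

# Step 4: image layer generation happens at disk_consistent_lsn, which now
# falls inside the open layer's nominal range.
image_layer_lsn = disk_consistent_lsn
assert open_layer_start < image_layer_lsn <= ingested_up_to

# Step 5: GC with pitr=0 may now drop layers below image_layer_lsn.
# Step 6: an ingest record that needs a key absent from the open layer has to
# read below image_layer_lsn, where nothing remains -> "could not find data for key".
print(hex(open_layer_start), hex(image_layer_lsn), hex(ingested_up_to))
```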

The reason this happens at all in certain tests is:

  • they're hammering the pageserver with cycles of write, checkpoint, write, checkpoint -- so steps 1+2+3 are quite likely.
  • then, to exercise GC, they intentionally do image layer generation and GC in quick succession (roughly as sketched below).
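For reference, that test pattern looks roughly like this (hedged sketch; helper and parameter names are assumed to resemble the Python test harness, not copied from the actual tests):

```python
# Hedged sketch of the test pattern: write/checkpoint cycles, then image layer
# generation and GC back-to-back. Names are assumptions, not the real tests.

def hammer(endpoint, ps_http, tenant_id, timeline_id, cycles=10):
    for _ in range(cycles):
        endpoint.safe_psql("INSERT INTO t SELECT generate_series(1, 1000)")  # write
        ps_http.timeline_checkpoint(tenant_id, timeline_id)                  # checkpoint

    # Then exercise GC: force image layer creation and GC in quick succession.
    ps_http.timeline_compact(tenant_id, timeline_id, force_image_layer_creation=True)
    ps_http.timeline_gc(tenant_id, timeline_id, gc_horizon=0)
```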

jcsp added a commit that referenced this issue Feb 10, 2025
github-merge-queue bot pushed a commit that referenced this issue Feb 11, 2025
## Problem

These tests can encounter a bug in the pageserver read path (#9185)
which occurs under the very specific circumstances that the tests
create, but is very unlikely to happen in the field.

We will fix the bug, but in the meantime let's un-flake the tests.

Related: #10720

## Summary of changes

- Permit "could not find data for key" errors in tests affected by #9185
@jcsp
Contributor Author

jcsp commented Feb 11, 2025

disk_consistent_lsn is weird:

  • We use it for backpressure (~10GB threshold; rough sketch after this list)
  • It doesn't literally mean what it used to mean: it can advance without flushing an L0 for shards.
  • There's no such thing as "disk consistent" any more, we always load from remote storage.
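On the backpressure point: the check is essentially a lag comparison against that ~10GB threshold. A very rough sketch, with the names and the exact comparison being my assumptions rather than the real compute/pageserver logic:

```python
# Illustrative only: backpressure keyed on how far disk_consistent_lsn lags the
# last ingested LSN. Threshold from the ~10GB figure above; names are made up.

MAX_FLUSH_LAG = 10 * 1024**3  # ~10 GB, expressed in LSN bytes

def should_backpressure(last_record_lsn: int, disk_consistent_lsn: int) -> bool:
    return (last_record_lsn - disk_consistent_lsn) > MAX_FLUSH_LAG

print(should_backpressure(last_record_lsn=11 * 1024**3, disk_consistent_lsn=0))  # True
```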

github-merge-queue bot pushed a commit that referenced this issue Feb 13, 2025
## Problem

In #10752 I used an overly-strict regex that only ignored error on a
particular key.

## Summary of changes

- Drop key from regex so it matches all such errors
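In other words, the pattern just loses the key. A sketch of the before/after, with the strings assumed rather than copied from the PR; the example key on the last line is borrowed from the layer file name earlier in this thread, purely for illustration:

```python
import re

# Before (overly strict): only matched the error for one specific key.
too_strict = r".*could not find data for key 020000000000000000000000000000000000.*"
# After: match the error for any key.
relaxed = r".*could not find data for key.*"

# A same-shaped error on a different key is caught only by the relaxed pattern.
line = "CRITICAL: could not find data for key 000000068000000000000017720000000001 (shard ShardNumber(1)) ..."
assert not re.match(too_strict, line)
assert re.match(relaxed, line)
print("relaxed pattern matches:", bool(re.match(relaxed, line)))
```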
jcsp added a commit that referenced this issue Feb 17, 2025
jcsp added the m/stability-feb25 and triaged (bugs that were already triaged) labels on Feb 19, 2025
jcsp assigned VladLazar and unassigned jcsp on Feb 21, 2025