Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

for #10993: cancellation sensitivity for buffered writer #11052

Draft
wants to merge 22 commits into
base: problame/enspc-must-not-use-after-returned-error
Choose a base branch
from

Conversation

problame
Copy link
Contributor

@problame problame commented Feb 28, 2025

In

I added infinite retries for buffered writer flush IOs, primarily to gracefully handle ENOSPC but more generally so that the buffered writer is not left in a state where reads from the surrounding InMemoryLayer cause panics, but stall instead until the read succeeds. See the #10993 description for details.

However, I didn't add cancellation sensitivity, which is concerning because then there is no way to detach a timeline/tenant tenant that is encountering the problem.

This PR

  • addresses a TODO that highlights that flush loop is in fact infallible with the infinite retries
  • adds sensitivity to Timeline::cancel to the flush loop
  • and fixes the InMemoryLayer/EphemeralFile/BufferedWriter almagamate to remain read-available after flush loop is cancelled.

The support for read-availability after cancellation is is necessary so that reads from the InMemoryLayer that are already queued up behind the RwLock that wraps the BufferedWriter won't panic because of the mutable=None that we leave behind in case the flush loop gets cancelled.

Alternatives

One might think that we can drop #10993 + the cancellation sensitivity added in this PR, using only the changes that add read-availability after flush error.

The problem here is that read-availability sounds good but is really quite useless, because we cannot ingest new WAL and hence very soon, reads from compute are going to wait indefinitely on wait_lsn, because ingest isn't progressing.

Thus, having the infinite flush retry still retries still make more sense because they're just "slowness" to the user

This avoids any sort of panic when detaching a
Remaining read-available allows in-flight getpage

One might think

The problem with cancellation sensitivity is the same as with handling IO errors.
is that it's basically

… mutable = None condition that we're fixing in this PR. We can go fix _that_ as welll, but, before we do that, explore alternatives
…gain the mutable = None condition that we're fixing in this PR. We can go fix _that_ as welll, but, before we do that, explore alternatives"

This reverts commit ee9d68c.
…st in the result to go away"

This reverts commit a53987c.
This reverts commit 1b7d4ab2a6fd732cb08baa67baed55e94cc52be4.
…ble() always present; nice in theory but causes lost writes if flush() is retried
…ake mutable() always present; nice in theory but causes lost writes if flush() is retried"

This reverts commit 5086c561c426521074e758a2f9206ced98e73901.
…xposes again the mutable = None condition that we're fixing in this PR. We can go fix _that_ as welll, but, before we do that, explore alternatives
…; walreceiver shouldn't be retrying any writes, and will cause a panic if it does
@problame problame changed the title for #10993: cancellation sensitivy for buffered writer for #10993: cancellation sensitivity for buffered writer Feb 28, 2025
@problame problame requested review from VladLazar and jcsp February 28, 2025 17:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant