buffered writer: handle write errors by retrying all write IO errors indefinitely #10993
base: main
Conversation
7744 tests run: 7366 passed, 0 failed, 378 skipped (full report)
Code coverage* (full report)
* collected from Rust tests only
The comment gets automatically updated with the latest test results.
80f2c26 at 2025-02-28T12:26:29.622Z :recycle:
cc @alexanderlaw, please test your scenario from #10856 against this PR. I wasn't able to do it exactly like in your instructions --- I only put the […]. I would love a regression test that actually exercises the ENOSPC case, but I would like to hear the Storage reviewers' feedback on the trade-offs I made in this PR first, before I spend the time implementing the regression test.
I've tested e935aff and see the improvement:
pageserver.log contains:
Regarding tests for ENOSPC (and perhaps other errors), I can also suggest emulating such errors using LD_PRELOAD. (I didn't encounter such tests in postgres; maybe there are some challenges with it, but I personally tried that approach: https://www.postgresql.org/message-id/f9bebfe6-cee4-ed87-d4e6-29b5ca4be08d@gmail.com)
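To make the suggestion concrete, here is a hypothetical sketch of such a shim as a Rust cdylib (the FAIL_WRITES_ENOSPC toggle and the fd filter are made up for illustration; a real test hook would have to target the pageserver's actual write paths, e.g. direct IO):

```rust
// Hypothetical LD_PRELOAD shim: build as a cdylib and run the target with
// LD_PRELOAD pointing at it. When FAIL_WRITES_ENOSPC is set, write() calls
// on non-stdio fds fail with ENOSPC; otherwise the syscall is forwarded.
use libc::{c_int, c_void, size_t, ssize_t};

#[no_mangle]
pub unsafe extern "C" fn write(fd: c_int, buf: *const c_void, count: size_t) -> ssize_t {
    // Crude filter: leave stdin/stdout/stderr alone, only fail "real" files.
    if std::env::var_os("FAIL_WRITES_ENOSPC").is_some() && fd > 2 {
        *libc::__errno_location() = libc::ENOSPC;
        return -1;
    }
    // Bypass this wrapper by issuing the syscall directly.
    libc::syscall(libc::SYS_write, fd, buf, count) as ssize_t
}
```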
This PR is ready to merge IMO. Reviewers, please review 🙏
@@ -229,6 +229,7 @@ async fn download_object(
    || IoBufferMut::with_capacity(super::BUFFER_SIZE),
    gate.enter().map_err(|_| DownloadError::Cancelled)?,
    ctx,
    info_span!(parent: None, "download_object_buffered_writer", %dst_path),
I haven't checked the ingest code, but are these writers kind of short-lived (like one WAL record)?
Just thinking about the perf impact of the extra span.
// then we can't shut down the timeline/tenant/pageserver cleanly because
// upper layers of the Pageserver write path are holding the gate open for EphemeralFile.
//
// TODO: use utils::backoff::retry once async closures are actually usable
Didn't async closures just get stabilized in 1.85? (If this comment is calling out a specific outstanding issue, then let's mention or link it.)
I tried in this PR; they're not good yet: cbf7354
https://neondb.slack.com/archives/C0277TKAJCA/p1740579692226909
This is the commit where I gave up: a03f335
utils::backoff::exponential_backoff(attempt, 1.0, 10.0, &NO_CANCELLATION).await;
ControlFlow::Continue(())
}
.instrument(info_span!("flush_attempt", %attempt))
A little concerned about extra spans this deep in the I/O stack -- I guess the idea is that in the rare event where we retry here, we want to know who/what is retrying rather than it just looking like a hang. Is there some way we can do that without constructing the spans on the happy path?
What we are really paying for on the happy path is the span lifecycle.
There are atomic operations when creating and exiting the span (to determine the right subscriber).
The span allocation itself is pooled, so it should be fairly cheap.
I don't know if this would show up without benchmarking. We could only instrument if attempt > 1
and not worry about it.
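For illustration, the "only instrument if attempt > 1" idea could look roughly like this (hypothetical helper, not the PR's actual code):

```rust
// Sketch: skip span construction entirely on the first attempt (happy path)
// and only pay the span lifecycle cost when we are actually retrying.
use std::future::Future;
use tracing::{info_span, Instrument};

async fn run_flush_attempt<F, Fut, T>(attempt: u32, f: F) -> T
where
    F: FnOnce() -> Fut,
    Fut: Future<Output = T>,
{
    if attempt > 1 {
        // Retry path: attach a span so the retry shows up in logs/traces.
        f().instrument(info_span!("flush_attempt", %attempt)).await
    } else {
        // Happy path: no span allocation, no enter/exit bookkeeping.
        f().await
    }
}
```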
Code looks fine to me.
> As pointed out in the Context section, throwing away an InMemoryLayer that encountered an error and reingesting the WAL is a lot of complexity that IMO isn't justified for such an edge case.
I think this is the truly robust thing to do. I'm aware it's complicated and not saying you should do it now. It would handle not only IO errors, but most bugs/errors in the ingest path too.
What's your risk assessment on rolling out without cancellation sensitivity on retries?
To me it feels a bit spicy: it's an infinite loop we can't break out of nicely.
I disagree that it's more robust to retry locally: throwing away the InMemoryLayer and reingesting from the SKs adds more load to the system, and not just on the one node that is experiencing IO errors, but also on the SKs. Killing ourselves on write IO failures and letting secondaries retry on their own nodes, which presumably are not experiencing IO failures, is much more robust, I think.
There were two review comments pertaining to spans. There are two concerns with spans:
I added two spans. The first span is for the top-level flush background task future, i.e.,
The second span is within the flush task, for each flush attempt. Both spans were added for better error context so that, if the code indeed retries, those retries are discoverable. I guess the alternatives to the current way are:
I'll try and see if
Under heavy ingest load, secondaries will have to re-ingest too (potentially more than the primary).
Setup
Baseline
SUT:
Results
Baseline
SUT
Conclusion
No meaningful change.
So, the takeaway from […]. But it doesn't test for contention under concurrency. Can we contend on anything across multiple tasks? Both the Span::enter and drop(Entered) code paths do their refcount increments and decrements only against the uniquely owned span object. Probably the costliest part of span enter/exit is tracing's sharded.rs iteration over the entire span stack. But the span stack has depth 0 or 1, respectively, for our flusher task. So that's cheap as well.
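If we ever want numbers for this, a rough standalone micro-benchmark could look like the sketch below (uses the tracing and tracing_subscriber crates; not part of this PR and not a substitute for bench_ingest):

```rust
// Rough micro-benchmark: measure create + enter + exit of a span like the
// one added in this PR, with a real subscriber installed so the span is
// not disabled. Numbers are only indicative.
use std::time::Instant;
use tracing::info_span;

fn main() {
    tracing_subscriber::fmt()
        .with_max_level(tracing::Level::INFO)
        .init();

    let iters: u32 = 1_000_000;
    let start = Instant::now();
    for attempt in 0..iters {
        let span = info_span!("flush_attempt", %attempt);
        let _entered = span.enter();
        // The span is exited when `_entered` drops at the end of the iteration.
    }
    println!("avg per span: {:?}", start.elapsed() / iters);
}
```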
Regarding concerns about cancellation: I'll try to hack in cancellation, but the trouble here is to not end up in the […]
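As a strawman for that, a cancellation check between retry attempts could look roughly like this (do_flush is a placeholder, and what state to leave the writer in when we give up is exactly the open question; uses tokio and tokio_util::sync::CancellationToken):

```rust
// Sketch of a cancellation-aware retry loop between flush attempts.
use std::io;
use std::time::Duration;
use tokio_util::sync::CancellationToken;

async fn do_flush() -> io::Result<()> {
    Ok(()) // placeholder for the real write-back of the frozen buffer
}

async fn flush_with_retries(cancel: &CancellationToken) -> Result<(), &'static str> {
    let mut attempt: u32 = 0;
    loop {
        attempt += 1;
        match do_flush().await {
            Ok(()) => return Ok(()),
            Err(e) => {
                tracing::warn!(attempt, "flush failed, will retry: {e}");
                // Sleep before the next attempt, but wake up early on cancellation.
                tokio::select! {
                    _ = cancel.cancelled() => return Err("cancelled while retrying flush"),
                    _ = tokio::time::sleep(Duration::from_secs(1)) => {}
                }
            }
        }
    }
}
```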
Problem
If the Pageserver ingest path (InMemoryLayer => EphemeralFile => BufferedWriter) encounters ENOSPC or any other write IO error when flushing the mutable buffer of the BufferedWriter, the buffered writer is left in a state where subsequent reads from the InMemoryLayer will cause a "must not use after we returned an error" panic.
The reason is that the IO error causes the FlushHandle::flush function to fail at channel.recv() and bail with the flush error, which leaves BufferedWriter::flush with BufferedWriter::mutable = None; subsequent operations then observe mutable = None and cause the panic.
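To illustrate the failure mode, here is a toy model (not the pageserver's actual types) of the invariant that gets violated:

```rust
// Toy model: a failed flush leaves `mutable` as None, and the next use of
// the writer hits the expect() and panics.
struct ToyBufferedWriter {
    mutable: Option<Vec<u8>>,
}

impl ToyBufferedWriter {
    fn write(&mut self, data: &[u8]) {
        let buf = self
            .mutable
            .as_mut()
            .expect("must not use after we returned an error");
        buf.extend_from_slice(data);
    }

    fn flush(&mut self) -> std::io::Result<()> {
        // Pre-PR behavior, simulated: take the buffer, and if the IO fails,
        // return the error without putting a buffer back.
        let _buf = self
            .mutable
            .take()
            .expect("must not use after we returned an error");
        // Pretend the write-back hit ENOSPC; `self.mutable` stays None, so
        // the next write() panics.
        Err(std::io::Error::other("simulated ENOSPC"))
    }
}
```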
Context
It has always been the contract that writes against the BufferedWriter API must not be retried, because the writer/stream-style/append-only interface makes no atomicity guarantees ("On error, did nothing or a piece of the buffer get appended?"). The idea was that the error would bubble up to upper layers that can throw away the buffered writer and create a new one. (See our internal error handling policy document on how to handle e.g. ENOSPC.)
That might be true for delta/image layer writers, I haven't checked. But it's certainly not true for the ingest path: there are no provisions to throw away an InMemoryLayer that encountered a write error and reingest the WAL already written to it. Adding such higher-level retries would involve resetting last_record_lsn to a lower value and restarting the walreceiver. The code isn't flexible enough to do that, and such complexity likely isn't worth it given that write errors are rare.
Solution
The solution in this PR is to retry any failing write operation indefinitely inside the buffered writer flush task, except of course those that are fatal as per maybe_fatal_err. Retrying indefinitely ensures that BufferedWriter::mutable is never left None in the case of IO errors, thereby solving the problem described above. It's a clear improvement over the status quo.
However, while we're retrying, we build up backpressure, because the flush is only double-buffered, not infinitely buffered. Backpressure here is generally good to avoid resource exhaustion, but it blocks reads and hence stalls GetPage requests, because InMemoryLayer reads and writes are mutually exclusive. That's orthogonal to the problem that is solved here, though.
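A self-contained sketch of the retry-forever approach described in this section (do_write_back and is_fatal are placeholders; the actual PR uses utils::backoff::exponential_backoff and maybe_fatal_err rather than the hand-rolled backoff below):

```rust
// Sketch of the flush task's retry loop: non-fatal IO errors are retried
// indefinitely with capped exponential backoff, so the frozen buffer is
// never dropped and `mutable` is never left None due to an IO error.
use std::io;
use std::time::Duration;

async fn do_write_back(_buf: &[u8]) -> io::Result<()> {
    Ok(()) // placeholder for the real write of the frozen buffer
}

fn is_fatal(_e: &io::Error) -> bool {
    false // placeholder for maybe_fatal_err-style classification
}

async fn flush_with_infinite_retry(buf: &[u8]) {
    let mut attempt: u32 = 0;
    loop {
        attempt += 1;
        match do_write_back(buf).await {
            Ok(()) => return,
            Err(e) if is_fatal(&e) => panic!("fatal IO error while flushing: {e}"),
            Err(e) => {
                eprintln!("flush attempt {attempt} failed, retrying: {e}");
                // Capped exponential backoff: 1s, 2s, 4s, ... up to 10s.
                let secs = f64::min(10.0, f64::powi(2.0, (attempt.min(16) - 1) as i32));
                tokio::time::sleep(Duration::from_secs_f64(secs)).await;
            }
        }
    }
}
```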
Caveats
Note that there are some remaining conditions in the flush background task where it can bail with an error. I have annotated one of them with a TODO comment. Hence FlushHandle::flush is still fallible, and the overall scenario of leaving mutable = None on the bail path is still possible. We can clean that up in a later commit.
Note also that retrying indefinitely is great for temporary errors like ENOSPC, but likely undesirable in case the std::io::Error we get is really due to a higher-level logic bug. For example, we could fail to flush because the timeline or tenant directory got deleted and VirtualFile's reopen fails with ENOENT.
Note finally that cancellation is not respected while we're retrying. This means we will block timeline/tenant/pageserver shutdown. The reason is that the existing cancellation story for the buffered writer background task is to recv from the flush op channel until the sending side (FlushHandle) is explicitly shut down or dropped. Failing to handle cancellation carries the operational risk that if even a single timeline gets stuck because of a logic bug such as the one laid out above, we must restart the whole pageserver process.
Alternatives Considered
As pointed out in the Context section, throwing away an InMemoryLayer that encountered an error and reingesting the WAL is a lot of complexity that IMO isn't justified for such an edge case. Also, it's wasteful. I think it's a local optimum.
A more general and simpler solution for ENOSPC is to abort() the process and run eviction on startup, before bringing up the rest of the pageserver. I argued for it in the past, and the pro arguments are still valid and complete:
https://neondb.slack.com/archives/C033RQ5SPDH/p1716896265296329
The trouble at the time was implementing eviction on startup. However, maybe things are simpler now that we are fully storcon-managed and all tenants have secondaries. For example, if the pageserver abort()s on ENOSPC and then simply doesn't respond to storcon heartbeats while running eviction on startup, storcon will fail tenants over to the secondary anyway, giving us all the time we need to clean up.
The downside is that if there's a systemic space management bug, the above proposal will just propagate the problem to other nodes. But I imagine that, because of the delays involved with filling up disks, the system might reach a half-stable state, providing operators more time to react.
Demo
Intermediary commit a03f335121480afc0171b0f34606bdf929e962c5 is demoed in this (internal) screen recording: https://drive.google.com/file/d/1nBC6lFV2himQ8vRXDXrY30yfWmI2JL5J/view?usp=drive_link
Perf Testing
Ran bench_ingest on tmpfs; no measurable difference. Spans are uniquely owned by the flush task, and the span stack isn't too deep, so enter and exit should be cheap. Plus, each flush takes ~150us with direct IO enabled, so it's not that high-frequency an event anyway.
Refs