buffered writer: handle write errors by retrying all write IO errors indefinitely #10993

Open · wants to merge 6 commits into main
Conversation

@problame (Contributor) commented Feb 26, 2025

Problem

If the Pageserver ingest path (InMemoryLayer=>EphemeralFile=>BufferedWriter)
encounters ENOSPC or any other write IO error when flushing the mutable buffer
of the BufferedWriter, the buffered writer is left in a state where subsequent reads
from the InMemoryLayer will cause a "must not use after we returned an error" panic.

The reason is that

  1. the flush background task bails on flush failure,
  2. causing the FlushHandle::flush function to fail at channel.recv() and
  3. causing the FlushHandle::flush function to bail with the flush error,
  4. leaving its caller BufferedWriter::flush with BufferedWriter::mutable = None,
  5. once the InMemoryLayer's RwLock::write guard is dropped, subsequent reads can enter,
  6. those reads find mutable = None and cause the panic (sketched in the code below).
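For illustration, a minimal sketch of that failure chain (hypothetical, heavily simplified types, not the actual pageserver code): the mutable buffer is taken out for flushing and only put back if the flush succeeds, so a failed flush strands the writer with mutable = None and the next read trips the invariant check.

// Heavily simplified stand-ins for BufferedWriter / FlushHandle (hypothetical types).
struct BufferedWriter {
    // Buffer currently accepting appends; None only while a flush is in flight.
    mutable: Option<Vec<u8>>,
}

impl BufferedWriter {
    async fn flush(&mut self) -> std::io::Result<()> {
        let full = self
            .mutable
            .take()
            .expect("must not use after we returned an error");
        // Stand-in for FlushHandle::flush: hand the full buffer to the background
        // flush task and wait for a clean buffer back. If the task bailed on a
        // write IO error, recv() fails and we bail too ...
        let clean = submit_to_flush_task(full).await?;
        // ... so this line is never reached and `mutable` stays None.
        self.mutable = Some(clean);
        Ok(())
    }

    fn read_buffered(&self) -> &[u8] {
        // InMemoryLayer reads end up here once the write lock is released;
        // with `mutable` left as None, this is the panic from the issue.
        self.mutable
            .as_deref()
            .expect("must not use after we returned an error")
    }
}

async fn submit_to_flush_task(_full: Vec<u8>) -> std::io::Result<Vec<u8>> {
    // Simulate the flush background task bailing on a write error.
    Err(std::io::Error::other("No space left on device (os error 28)"))
}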

Context

It has always been the contract that writes against the BufferedWriter API
must not be retried because the writer/stream-style/append-only interface makes no atomicity guarantees ("On error, did nothing or a piece of the buffer get appended?").

The idea was that the error would bubble up to upper layers that can throw away the buffered writer and create a new one. (See our internal error handling policy document on how to handle e.g. ENOSPC).

That might be true for delta/image layer writers; I haven't checked.

But it's certainly not true for the ingest path: there are no provisions to throw away an InMemoryLayer that encountered a write error and reingest the WAL already written to it.

Adding such higher-level retries would involve resetting last_record_lsn to a lower value and restarting the walreceiver. The code isn't flexible enough to do that, and such complexity likely isn't worth it given that write errors are rare.

Solution

The solution in this PR is to retry any failing write operation indefinitely inside the buffered writer flush task, except of course those that are fatal as per maybe_fatal_err.

Retrying indefinitely ensures that BufferedWriter::mutable is never left None in the case of IO errors, thereby solving the problem described above.

It's a clear improvement over the status quo.

However, while we're retrying, we build up backpressure because the flush is only double-buffered, not infinitely buffered.

Backpressure here is generally good to avoid resource exhaustion, but blocks reads and hence stalls GetPage requests because InMemoryLayer reads and writes are mutually exclusive.
That's orthogonal to the problem that is solved here, though.
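As a rough sketch (placeholder names, not the actual pageserver helpers; the real code goes through the owned-buffers-io write path, maybe_fatal_err, and utils::backoff), the flush task now behaves like this:

use std::time::Duration;

// Placeholder for the actual buffer write performed by the flush task.
async fn write_buffer_to_disk(buf: &[u8]) -> std::io::Result<()> {
    let _ = buf;
    Ok(())
}

// Placeholder fatality check; the real code routes errors through maybe_fatal_err,
// which handles errors that must never be retried.
fn is_fatal_io_error(_err: &std::io::Error) -> bool {
    false
}

async fn flush_with_indefinite_retry(buf: &[u8]) {
    let mut attempt: u32 = 0;
    loop {
        attempt += 1;
        match write_buffer_to_disk(buf).await {
            Ok(()) => return,
            Err(err) if is_fatal_io_error(&err) => {
                panic!("fatal IO error, not retrying: {err}");
            }
            Err(err) => {
                tracing::warn!(attempt, %err, "error flushing buffered writer buffer to disk, retrying after backoff");
                // Capped exponential backoff, loosely modelled on
                // utils::backoff::exponential_backoff(attempt, 1.0, 10.0, ..).
                let secs = 2u64.saturating_pow(attempt.min(4)).min(10);
                tokio::time::sleep(Duration::from_secs(secs)).await;
            }
        }
    }
}

Because the loop only returns on success, the writer always gets a clean buffer back for non-fatal errors and mutable is never left None.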

Caveats

Note that there are some remaining conditions in the flush background task where it can bail with an error. I have annotated one of them with a TODO comment.
FlushHandle::flush is therefore still fallible, so the overall scenario of leaving mutable = None on the bail path is still possible.
We can clean that up in a later commit.

Note also that retrying indefinitely is great for temporary errors like ENOSPC but likely undesirable in case the std::io::Error we get is really due to higher-level logic bugs.
For example, we could fail to flush because the timeline or tenant directory got deleted and VirtualFile's reopen fails with ENOENT.

Note finally that cancellation is not respected while we're retrying.
This means we will block timeline/tenant/pageserver shutdown.
The reason is that the existing cancellation story for the buffered writer background task was to recv from the flush op channel until the sending side (FlushHandle) is explicitly shut down or dropped.

Failing to handle cancellation carries the operational risk that if even a single timeline gets stuck because of a logic bug such as the one laid out above, we must restart the whole pageserver process.

Alternatives Considered

As pointed out in the Context section, throwing away an InMemoryLayer that encountered an error and reingesting the WAL is a lot of complexity that IMO isn't justified for such an edge case.
Also, it's wasteful.
I think it's a local optimum.

A more general and simpler solution for ENOSPC is to abort() the process and run eviction on startup before bringing up the rest of pageserver.
I argued for it in the past; the pro arguments are still valid and complete:
https://neondb.slack.com/archives/C033RQ5SPDH/p1716896265296329
The trouble at the time was implementing eviction on startup.
However, maybe things are simpler now that we are fully storcon-managed and all tenants have secondaries.
For example, if the pageserver abort()s on ENOSPC and then simply doesn't respond to storcon heartbeats while running eviction on startup, storcon will fail tenants over to the secondary anyway, giving us all the time we need to clean up.

The downside is that if there's a systemic space management bug, the above proposal will just propagate the problem to other nodes. But I imagine that because of the delays involved with filling up disks, the system might reach a half-stable state, providing operators more time to react.

Demo

Intermediary commit a03f335121480afc0171b0f34606bdf929e962c5 is demoed in this (internal) screen recording: https://drive.google.com/file/d/1nBC6lFV2himQ8vRXDXrY30yfWmI2JL5J/view?usp=drive_link

Perf Testing

Ran bench_ingest on tmpfs, no measurable difference.

Spans are uniquely owned by the flush task, and the span stack isn't too deep, so enter and exit should be cheap.
Plus, each flush takes ~150us with direct IO enabled, so it's not a high-frequency event anyway.

Refs


github-actions bot commented Feb 26, 2025

7744 tests run: 7366 passed, 0 failed, 378 skipped (full report)


Flaky tests (1)

Postgres 16

Code coverage* (full report)

  • functions: 32.8% (8643 of 26363 functions)
  • lines: 48.6% (73227 of 150587 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
80f2c26 at 2025-02-28T12:26:29.622Z :recycle:

@problame problame changed the title WIP: buffer writer: internally retry all write IO errors indefinitely buffered writer: internally retry all write IO errors indefinitely Feb 26, 2025
@problame problame changed the title buffered writer: internally retry all write IO errors indefinitely buffered writer: handle write errors by retrying all write IO errors indefinitely Feb 26, 2025
@problame problame requested review from VladLazar and jcsp February 26, 2025 18:21
@problame problame marked this pull request as ready for review February 26, 2025 18:46
@problame problame requested a review from a team as a code owner February 26, 2025 18:46
@problame (Contributor Author) commented:

cc @alexanderlaw, please test your scenario from #10856 against this PR.

I wasn't able to do it exactly as in your instructions: I only put the pageserver_1/tenants directory on a tmpfs; if I put the whole .neon on tmpfs, I get an error in the compute / walproposer code (which is another thing you could look into, separately, but not in this issue or PR).

I would love a regression test that actually exercises the ENOSPC case.
But setting up a disk image is platform-specific.
And filesystem quotas are hard, too.
So I guess the only good option is a failpoint-based test (see the sketch below).
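Such a test could hook into the flush path roughly like this (the failpoint name and helper below are invented for this sketch, not part of this PR); it uses the fail crate the pageserver already employs for failpoints:

// Hypothetical failpoint hook in the flush path. When the failpoint is enabled
// with a "return" action, the flush sees an injected ENOSPC without the disk
// actually being full.
fn maybe_inject_flush_enospc() -> std::io::Result<()> {
    fail::fail_point!("buffered-writer-flush-enospc", |_| {
        Err(std::io::Error::from_raw_os_error(28)) // ENOSPC
    });
    Ok(())
}

A regression test would then enable the failpoint via the failpoint management API, write enough data to force a flush, assert that reads keep working (no panic), disable the failpoint, and assert that the writer recovers.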

Would like to hear the Storage reviewers' feedback on the trade-offs I made in this PR first, before I spend the time implementing the regression test.

@alexanderlaw (Contributor) commented:

I've tested e935aff and see the improvement:

-rw-rw-r-- 1 a a 28K Feb 27 08:14 /tmp/tmpfs/.neon/pageserver_1/pageserver.log
-rw-rw-r-- 1 a a 31K Feb 27 08:15 /tmp/tmpfs/.neon/pageserver_1/pageserver.log

pageserver.log contains:

2025-02-27T06:14:22.539406Z  INFO compaction_loop{tenant_id=daa3745f15757989cef9f2ca2f87a043 shard_id=0000}:run:compact_timeline{timeline_id=8d9538523fdd13c05213165f28fc5f60}:create_image_layers{lsn=0/79E3D80 mode=Try}: metadata key compaction: trigger_generation=false, delta_files_accessed=1, total_kb_retrieved=0, total_keys_retrieved=0, read_time=0.000156982s
2025-02-27T06:14:28.154686Z  WARN ephemeral_file_buffered_writer{tenant_id=daa3745f15757989cef9f2ca2f87a043 shard_id=0000 timeline_id=8d9538523fdd13c05213165f28fc5f60 path=/tmp/tmpfs/.neon/pageserver_1/tenants/daa3745f15757989cef9f2ca2f87a043/timelines/8d9538523fdd13c05213165f28fc5f60/ephemeral-2}:flush_attempt{attempt=1}: \
 error flushing buffered writer buffer to disk, retrying after backoff err=No space left on device (os error 28)
2025-02-27T06:14:28.154768Z  INFO ephemeral_file_buffered_writer{tenant_id=daa3745f15757989cef9f2ca2f87a043 shard_id=0000 timeline_id=8d9538523fdd13c05213165f28fc5f60 path=/tmp/tmpfs/.neon/pageserver_1/tenants/daa3745f15757989cef9f2ca2f87a043/timelines/8d9538523fdd13c05213165f28fc5f60/ephemeral-2}:flush_attempt{attempt=1}: Backoff: waiting 2 seconds before processing with the task
2025-02-27T06:14:30.156184Z  INFO ephemeral_file_buffered_writer{tenant_id=daa3745f15757989cef9f2ca2f87a043 shard_id=0000 timeline_id=8d9538523fdd13c05213165f28fc5f60 path=/tmp/tmpfs/.neon/pageserver_1/tenants/daa3745f15757989cef9f2ca2f87a043/timelines/8d9538523fdd13c05213165f28fc5f60/ephemeral-2}:flush_attempt{attempt=2}: retrying flush
2025-02-27T06:14:30.157023Z  WARN ephemeral_file_buffered_writer{tenant_id=daa3745f15757989cef9f2ca2f87a043 shard_id=0000 timeline_id=8d9538523fdd13c05213165f28fc5f60 path=/tmp/tmpfs/.neon/pageserver_1/tenants/daa3745f15757989cef9f2ca2f87a043/timelines/8d9538523fdd13c05213165f28fc5f60/ephemeral-2}:flush_attempt{attempt=2}: \
 error flushing buffered writer buffer to disk, retrying after backoff err=No space left on device (os error 28)
2025-02-27T06:14:30.157055Z  INFO ephemeral_file_buffered_writer{tenant_id=daa3745f15757989cef9f2ca2f87a043 shard_id=0000 timeline_id=8d9538523fdd13c05213165f28fc5f60 path=/tmp/tmpfs/.neon/pageserver_1/tenants/daa3745f15757989cef9f2ca2f87a043/timelines/8d9538523fdd13c05213165f28fc5f60/ephemeral-2}:flush_attempt{attempt=2}: Backoff: waiting 4 seconds before processing with the task
...
### no other WARN/ERROR/panic messages ###

Regarding tests for ENOSPC (and perhaps other errors), I can also suggest emulating such errors using LD_PRELOAD. (I haven't encountered such tests in Postgres, maybe there are some challenges with it, but I personally tried that approach: https://www.postgresql.org/message-id/f9bebfe6-cee4-ed87-d4e6-29b5ca4be08d@gmail.com)
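For reference, such an LD_PRELOAD shim could look like the following sketch (everything here is an assumption: Rust, built as a cdylib on Linux, edition 2021 for the plain #[no_mangle] attribute). Note it only intercepts plain write(2); writes issued through other syscalls such as pwritev2 or io_uring would not be affected.

// Error-injection shim: fail write(2) with ENOSPC for every non-stdio file
// descriptor, forwarding stdout/stderr writes to the real libc write so the
// target process can still log. Build as a cdylib and point LD_PRELOAD at the
// resulting .so.
use libc::{c_char, c_int, c_void, size_t, ssize_t, ENOSPC, RTLD_NEXT};

type WriteFn = unsafe extern "C" fn(c_int, *const c_void, size_t) -> ssize_t;

#[no_mangle]
pub unsafe extern "C" fn write(fd: c_int, buf: *const c_void, count: size_t) -> ssize_t {
    if fd > 2 {
        // Pretend the disk is full for all file writes.
        *libc::__errno_location() = ENOSPC;
        return -1;
    }
    // Look up and call the real write(2) for stdio.
    let real = libc::dlsym(RTLD_NEXT, b"write\0".as_ptr() as *const c_char);
    let real: WriteFn = std::mem::transmute(real);
    real(fd, buf, count)
}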

@problame (Contributor Author) commented:

This PR is ready to merge IMO. Reviewers, please review 🙏

@@ -229,6 +229,7 @@ async fn download_object(
|| IoBufferMut::with_capacity(super::BUFFER_SIZE),
gate.enter().map_err(|_| DownloadError::Cancelled)?,
ctx,
info_span!(parent: None, "download_object_buffered_writer", %dst_path),
Review comment (Contributor):
I haven't checked the ingest code, but are these writers kind of short-lived (like one wal record?)

Just thinking about perf impact of the extra span

// then we can't shut down the timeline/tenant/pageserver cleanly because
// upper layers of the Pageserver write path are holding the gate open for EphemeralFile.
//
// TODO: use utils::backoff::retry once async closures are actually usable
Review comment (Contributor):
didn't async closures just get stabilized in 1.85? (if this comment is calling out a specific outstanding issue then let's mention or link it)

@problame (Contributor Author) replied:
This is the commit where I gave up: a03f335

utils::backoff::exponential_backoff(attempt, 1.0, 10.0, &NO_CANCELLATION).await;
ControlFlow::Continue(())
}
.instrument(info_span!("flush_attempt", %attempt))
Review comment (Contributor):
A little concerned about extra spans this deep in the I/O stack -- I guess the idea is that in the rare event where we retry here, we want to know who/what is retrying rather than it just looking like a hang. Is there some way we can do that without constructing the spans on the happy path?

Reply (Contributor):
What we are really paying for on the happy path is the span lifecycle.
There are atomic operations when creating and exiting the span (to determine the right subscriber).
The span allocation itself is pooled, so it should be fairly cheap.

I don't know if this would show up without benchmarking. We could instrument only if attempt > 1
and not worry about it (sketch below).
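For example, a sketch of the "only instrument if attempt > 1" idea (not what the PR does): use a disabled span on the first attempt so the happy path pays neither span creation nor enter/exit, and only retries get the flush_attempt span.

use tracing::{info_span, Instrument, Span};

async fn run_flush_attempt(attempt: u32) {
    // Real span only when retrying; Span::none() is disabled and effectively
    // free to enter/exit, so the first (happy-path) attempt skips the bookkeeping.
    let span = if attempt > 1 {
        info_span!("flush_attempt", %attempt)
    } else {
        Span::none()
    };
    async {
        // ... perform the write and, on failure, the backoff here ...
    }
    .instrument(span)
    .await
}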

@VladLazar (Contributor) left a review comment:
Code looks fine to me.


As pointed out in the Context section, throwing away an InMemoryLayer that encountered an error and reingesting the WAL is a lot of complexity that IMO isn't justified for such an edge case.

I think this is the truly robust thing to do. I'm aware it's complicated and not saying you should do it now. It would handle not only IO errors, but most bugs/errors in the ingest path too.


What's your risk assessment on rolling out without cancellation sensitivity on retries?
To me it feels a bit spicy: it's an infinite loop we can't break out of nicely.


@problame (Contributor Author) commented:

As pointed out in the Context section, throwing away an InMemoryLayer that encountered an error and reingesting the WAL is a lot of complexity that IMO isn't justified for such an edge case.

I think this is the truly robust thing to do. I'm aware it's complicated and not saying you should do it now. It would handle not only IO errors, but most bugs/errors in the ingest path too.

I disagree that it's more robust to retry locally: throwing away InMemoryLayer and reingesting from SKs adds more load to the system, and not just the one node that is experiencing IO errors, but also the SKs.

Killing ourselves on write IO failures and letting secondaries retry, on their other nodes, which presumably are not experiencing IO failures, is much more robust I think.

@problame (Contributor Author) commented:

There were two review comments pertaining to spans.

There are two concerns with spans:

  • Allocations: these are negligible except in very tight loops because spans are slab-allocated. If a span lives for >1ms, it's a no-brainer I think.
  • Entering/exiting overhead: that one is real, and we've seen it hurt e.g. getpage latency in the past (which in total takes hundreds of microseconds, usually).

I added two spans.

The first span is for the top level flush background task future. I.e.,

  1. download_object_buffered_writer (on-demand download)
  2. ephemeral_file_buffered_writer (InMemoryLayer)

The second span is within the flush task, for each flush attempt.
That one is quite short-lived (one write, i.e., 150us approximately).

Both spans were added for better error context so that, if the code indeed retries, those retries are discoverable.
Having tenant_id, shard_id, timeline_id, plus the filename, in error messages is super important.
Without them, the log messages emitted by the flush task would be at top level and thereby not show up in log searches for a particular tenant/timeline.

I guess the alternatives to the current way are:

  1. only creating and/or entering spans once we are on the error path where we log something
  2. no spans at all

I'll try and see if bench_ingest shows a difference before addressing those.

@VladLazar (Contributor) commented:

As pointed out in the Context section, throwing away an InMemoryLayer that encountered an error and reingesting the WAL is a lot of complexity that IMO isn't justified for such an edge case.
I think this is the truly robust thing to do. I'm aware it's complicated and not saying you should do it now. It would handle not only IO errors, but most bugs/errors in the ingest path too.

I disagree that it's more robust to retry locally: throwing away InMemoryLayer and reingesting from SKs adds more load to the system, and not just the one node that is experiencing IO errors, but also the SKs.

Killing ourselves on write IO failures and letting secondaries retry, on their other nodes, which presumably are not experiencing IO failures, is much more robust I think.

Under heavy ingest load, secondaries will have to re-ingest too (potentially more than the primary).
But, yeah, I agree with you that io errors are a special case.

@problame (Contributor Author) commented Feb 28, 2025:

I'll try and see if bench_ingest shows a difference before addressing those.

Setup

bench_ingest on a tmpfs mount; Hetzner AX102

Baseline: 40ad42d55

SUT: e935aff9e98493411067c94da5c3233f88389f23

Results

Baseline

ingest-small-values/ingest 128MB/100b seq              time:   [860.20 ms 864.20 ms 869.95 ms]       thrpt:  [147.13 MiB/s 148.11 MiB/s 148.80 MiB/s]
ingest-small-values/ingest 128MB/100b rand                    time:   [1.2399 s 1.2434 s 1.2470 s]   thrpt:  [102.65 MiB/s 102.94 MiB/s 103.23 MiB/s]
ingest-small-values/ingest 128MB/100b rand-1024keys                  time:   [870.53 ms 875.29 ms 881.19 ms]       thrpt:  [145.26 MiB/s 146.24 MiB/s 147.04 MiB/s]
ingest-small-values/ingest 128MB/100b seq, no delta                  time:   [268.79 ms 269.47 ms 270.11 ms]       thrpt:  [473.89 MiB/s 475.01 MiB/s 476.21 MiB/s]
ingest-big-values/ingest 128MB/8k seq                  time:   [223.95 ms 233.94 ms 244.07 ms]       thrpt:  [524.44 MiB/s 547.16 MiB/s 571.56 MiB/s]
ingest-big-values/ingest 128MB/8k seq, no delta               time:   [72.977 ms 80.812 ms 84.147 ms]       thrpt:  [1.4855 GiB/s 1.5468 GiB/s 1.7129 GiB/s]

SUT

ingest-small-values/ingest 128MB/100b seq               time:   [855.04 ms 876.75 ms 907.91 ms]       thrpt:  [140.98 MiB/s 145.99 MiB/s 149.70 MiB/s]
ingest-small-values/ingest 128MB/100b rand                     time:   [1.2298 s 1.2351 s 1.2405 s]   thrpt:  [103.18 MiB/s 103.63 MiB/s 104.08 MiB/s]
ingest-small-values/ingest 128MB/100b rand-1024keys                   time:   [877.36 ms 889.81 ms 901.81 ms]       thrpt:  [141.94 MiB/s 143.85 MiB/s 145.89 MiB/s]
ingest-small-values/ingest 128MB/100b seq, no delta                   time:   [272.56 ms 274.23 ms 276.00 ms]       thrpt:  [463.77 MiB/s 466.77 MiB/s 469.62 MiB/s]
ingest-big-values/ingest 128MB/8k seq                   time:   [217.68 ms 224.14 ms 229.71 ms]       thrpt:  [557.23 MiB/s 571.08 MiB/s 588.02 MiB/s]
ingest-big-values/ingest 128MB/8k seq, no delta                time:   [71.053 ms 72.785 ms 73.958 ms]       thrpt:  [1.6902 GiB/s 1.7174 GiB/s 1.7593 GiB/s]

Conclusion

No meaningful change.
Even on tmpfs, bench_ingest has about 5% noise for me in some runs (sometimes just ±1%).

@problame (Contributor Author) commented:

So, takeaway from bench_ingest is that in the isolated use case, the span allocation and enter/exit overhead is cheap.

But it doesn't test for contention under concurrency.

Can we contend on anything across multiple tasks?
I think not because both Span objects are uniquely owned by the BufferedWriter instance (we don't clone() them anywhere).

So, both the Span::enter and drop(Entered) code paths will do their refcount inc's and dec's only against the uniquely owned span object.

Probably the costliest part of span exit/enter is tracing::...::sharded.rs's iteration over the entire span stack. But the span stack has depth 0 or 1, respectively, for our flusher task. So, that's cheap as well.

@problame (Contributor Author) commented:

Regarding concerns about cancellation:

I'll try to hack in cancellation, but the tricky part is not to end up in the mutable = None situation again.

Successfully merging this pull request may close these issues.

pageserver produces panic errors "must not use after we returned an error" after I/O error (ENOSPC)