buffered writer: handle write errors by retrying all write IO errors indefinitely #10993

Open · wants to merge 6 commits into main
Conversation

@problame (Contributor) commented Feb 26, 2025

Problem

If the Pageserver ingest path (InMemoryLayer=>EphemeralFile=>BufferedWriter)
encounters ENOSPC or any other write IO error when flushing the mutable buffer
of the BufferedWriter, the buffered writer is left in a state where subsequent reads
from the InMemoryLayer will cause a "must not use after we returned an error" panic.

The reason is that

  1. the flush background task bails on flush failure,
  2. causing the FlushHandle::flush function to fail at channel.recv() and
  3. causing the FlushHandle::flush function to bail with the flush error,
  4. leaving its caller BufferedWriter::flush with BufferedWriter::mutable = None,
  5. once the InMemoryLayer's RwLock::write guard is dropped, subsequent reads can enter,
  6. those reads find mutable = None and cause the panic (sketched in the code below).
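For illustration, a minimal sketch of that failure chain (hypothetical, heavily simplified types, not the actual pageserver code): the mutable buffer is taken out for flushing and only put back if the flush succeeds, so a failed flush strands the writer with mutable = None and the next read trips the invariant check.

// Heavily simplified stand-ins for BufferedWriter / FlushHandle (hypothetical types).
struct BufferedWriter {
    // Buffer currently accepting appends; None only while a flush is in flight.
    mutable: Option<Vec<u8>>,
}

impl BufferedWriter {
    async fn flush(&mut self) -> std::io::Result<()> {
        let full = self
            .mutable
            .take()
            .expect("must not use after we returned an error");
        // Stand-in for FlushHandle::flush: hand the full buffer to the background
        // flush task and wait for a clean buffer back. If the task bailed on a
        // write IO error, recv() fails and we bail too ...
        let clean = submit_to_flush_task(full).await?;
        // ... so this line is never reached and `mutable` stays None.
        self.mutable = Some(clean);
        Ok(())
    }

    fn read_buffered(&self) -> &[u8] {
        // InMemoryLayer reads end up here once the write lock is released;
        // with `mutable` left as None, this is the panic from the issue.
        self.mutable
            .as_deref()
            .expect("must not use after we returned an error")
    }
}

async fn submit_to_flush_task(_full: Vec<u8>) -> std::io::Result<Vec<u8>> {
    // Simulate the flush background task bailing on a write error.
    Err(std::io::Error::other("No space left on device (os error 28)"))
}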

Context

It has always been the contract that writes against the BufferedWriter API
must not be retried because the writer/stream-style/append-only interface makes no atomicity guarantees ("On error, did nothing or a piece of the buffer get appended?").

The idea was that the error would bubble up to upper layers that can throw away the buffered writer and create a new one. (See our internal error handling policy document on how to handle e.g. ENOSPC).

That might be true for delta/image layer writers; I haven't checked.

But it's certainly not true for the ingest path: there are no provisions to throw away an InMemoryLayer that encountered a write error and reingest the WAL already written to it.

Adding such higher-level retries would involve resetting last_record_lsn to a lower value and restarting the walreceiver. The code isn't flexible enough to do that, and such complexity likely isn't worth it given that write errors are rare.

Solution

The solution in this PR is to retry any failing write operation indefinitely inside the buffered writer flush task, except of course those that are fatal as per maybe_fatal_err.

Retrying indefinitely ensures that BufferedWriter::mutable is never left None in the case of IO errors, thereby solving the problem described above.

It's a clear improvement over the status quo.

However, while we're retrying, we build up backpressure because the flush is only double-buffered, not infinitely buffered.

Backpressure here is generally good to avoid resource exhaustion, but blocks reads and hence stalls GetPage requests because InMemoryLayer reads and writes are mutually exclusive.
That's orthogonal to the problem that is solved here, though.
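As a rough sketch (placeholder names, not the actual pageserver helpers; the real code goes through the owned-buffers-io write path, maybe_fatal_err, and utils::backoff), the flush task now behaves like this:

use std::time::Duration;

// Placeholder for the actual buffer write performed by the flush task.
async fn write_buffer_to_disk(buf: &[u8]) -> std::io::Result<()> {
    let _ = buf;
    Ok(())
}

// Placeholder fatality check; the real code routes errors through maybe_fatal_err,
// which handles errors that must never be retried.
fn is_fatal_io_error(_err: &std::io::Error) -> bool {
    false
}

async fn flush_with_indefinite_retry(buf: &[u8]) {
    let mut attempt: u32 = 0;
    loop {
        attempt += 1;
        match write_buffer_to_disk(buf).await {
            Ok(()) => return,
            Err(err) if is_fatal_io_error(&err) => {
                panic!("fatal IO error, not retrying: {err}");
            }
            Err(err) => {
                tracing::warn!(attempt, %err, "error flushing buffered writer buffer to disk, retrying after backoff");
                // Capped exponential backoff, loosely modelled on
                // utils::backoff::exponential_backoff(attempt, 1.0, 10.0, ..).
                let secs = 2u64.saturating_pow(attempt.min(4)).min(10);
                tokio::time::sleep(Duration::from_secs(secs)).await;
            }
        }
    }
}

Because the loop only returns on success, the writer always gets a clean buffer back for non-fatal errors and mutable is never left None.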

Caveats

Note that there are some remaining conditions in the flush background task where it can bail with an error. I have annotated one of them with a TODO comment.
FlushHandle::flush is therefore still fallible, so the overall scenario of leaving mutable = None on the bail path is still possible.
We can clean that up in a later commit.

Note also that retrying indefinitely is great for temporary errors like ENOSPC but likely undesirable in case the std::io::Error we get is really due to higher-level logic bugs.
For example, we could fail to flush because the timeline or tenant directory got deleted and VirtualFile's reopen fails with ENOENT.

Note finally that cancellation is not respected while we're retrying.
This means we will block timeline/tenant/pageserver shutdown.
The reason is that the existing cancellation story for the buffered writer background task was to recv from the flush op channel until the sending side (FlushHandle) is explicitly shut down or dropped.

Failing to handle cancellation carries the operational risk that if even a single timeline gets stuck because of a logic bug such as the one laid out above, we must restart the whole pageserver process.

Alternatives Considered

As pointed out in the Context section, throwing away an InMemoryLayer that encountered an error and reingesting the WAL is a lot of complexity that IMO isn't justified for such an edge case.
Also, it's wasteful.
I think it's a local optimum.

A more general and simpler solution for ENOSPC is to abort() the process and run eviction on startup before bringing up the rest of pageserver.
I argued for it in the past; the pro arguments are still valid and complete:
https://neondb.slack.com/archives/C033RQ5SPDH/p1716896265296329
The trouble at the time was implementing eviction on startup.
However, maybe things are simpler now that we are fully storcon-managed and all tenants have secondaries.
For example, if the pageserver abort()s on ENOSPC and then simply doesn't respond to storcon heartbeats while running eviction on startup, storcon will fail tenants over to the secondary anyway, giving us all the time we need to clean up.

The downside is that if there's a systemic space management bug, the above proposal will just propagate the problem to other nodes. But I imagine that because of the delays involved with filling up disks, the system might reach a half-stable state, providing operators more time to react.

Demo

Intermediary commit a03f335121480afc0171b0f34606bdf929e962c5 is demoed in this (internal) screen recording: https://drive.google.com/file/d/1nBC6lFV2himQ8vRXDXrY30yfWmI2JL5J/view?usp=drive_link

Perf Testing

Ran bench_ingest on tmpfs, no measurable difference.

Spans are uniquely owned by the flush task, and the span stack isn't too deep, so enter and exit should be cheap.
Plus, each flush takes ~150us with direct IO enabled, so it's not a high-frequency event anyway.

Refs


github-actions bot commented Feb 26, 2025

7744 tests run: 7366 passed, 0 failed, 378 skipped (full report)


Flaky tests (1)

Postgres 16

Code coverage* (full report)

  • functions: 32.8% (8643 of 26363 functions)
  • lines: 48.6% (73227 of 150587 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
80f2c26 at 2025-02-28T12:26:29.622Z :recycle:

@problame problame changed the title WIP: buffer writer: internally retry all write IO errors indefinitely buffered writer: internally retry all write IO errors indefinitely Feb 26, 2025
@problame problame changed the title buffered writer: internally retry all write IO errors indefinitely buffered writer: handle write errors by retrying all write IO errors indefinitely Feb 26, 2025
@problame problame requested review from VladLazar and jcsp February 26, 2025 18:21
@problame problame marked this pull request as ready for review February 26, 2025 18:46
@problame problame requested a review from a team as a code owner February 26, 2025 18:46
@problame (Contributor Author) commented:

cc @alexanderlaw, please test your scenario from #10856 against this PR.

I wasn't able to do it exactly as in your instructions: I only put the pageserver_1/tenants directory on a tmpfs; if I put the whole .neon on tmpfs, I get an error in the compute / walproposer code (which is another thing you could look into, separately, but not in this issue or PR).

I would love a regression test that actually exercises the ENOSPC case.
But setting up a disk image is platform-specific.
And filesystem quotas are hard, too.
So I guess the only good option is a failpoint-based test (see the sketch below).
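Such a test could hook into the flush path roughly like this (the failpoint name and helper below are invented for this sketch, not part of this PR); it uses the fail crate the pageserver already employs for failpoints:

// Hypothetical failpoint hook in the flush path. When the failpoint is enabled
// with a "return" action, the flush sees an injected ENOSPC without the disk
// actually being full.
fn maybe_inject_flush_enospc() -> std::io::Result<()> {
    fail::fail_point!("buffered-writer-flush-enospc", |_| {
        Err(std::io::Error::from_raw_os_error(28)) // ENOSPC
    });
    Ok(())
}

A regression test would then enable the failpoint via the failpoint management API, write enough data to force a flush, assert that reads keep working (no panic), disable the failpoint, and assert that the writer recovers.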

Would like to hear the Storage reviewers' feedback on the trade-offs I made in this PR first, before I spend the time implementing the regression test.

@alexanderlaw (Contributor) commented:

I've tested e935aff and see the improvement:

-rw-rw-r-- 1 a a 28K Feb 27 08:14 /tmp/tmpfs/.neon/pageserver_1/pageserver.log
-rw-rw-r-- 1 a a 31K Feb 27 08:15 /tmp/tmpfs/.neon/pageserver_1/pageserver.log

pageserver.log contains:

2025-02-27T06:14:22.539406Z  INFO compaction_loop{tenant_id=daa3745f15757989cef9f2ca2f87a043 shard_id=0000}:run:compact_timeline{timeline_id=8d9538523fdd13c05213165f28fc5f60}:create_image_layers{lsn=0/79E3D80 mode=Try}: metadata key compaction: trigger_generation=false, delta_files_accessed=1, total_kb_retrieved=0, total_keys_retrieved=0, read_time=0.000156982s
2025-02-27T06:14:28.154686Z  WARN ephemeral_file_buffered_writer{tenant_id=daa3745f15757989cef9f2ca2f87a043 shard_id=0000 timeline_id=8d9538523fdd13c05213165f28fc5f60 path=/tmp/tmpfs/.neon/pageserver_1/tenants/daa3745f15757989cef9f2ca2f87a043/timelines/8d9538523fdd13c05213165f28fc5f60/ephemeral-2}:flush_attempt{attempt=1}: \
 error flushing buffered writer buffer to disk, retrying after backoff err=No space left on device (os error 28)
2025-02-27T06:14:28.154768Z  INFO ephemeral_file_buffered_writer{tenant_id=daa3745f15757989cef9f2ca2f87a043 shard_id=0000 timeline_id=8d9538523fdd13c05213165f28fc5f60 path=/tmp/tmpfs/.neon/pageserver_1/tenants/daa3745f15757989cef9f2ca2f87a043/timelines/8d9538523fdd13c05213165f28fc5f60/ephemeral-2}:flush_attempt{attempt=1}: Backoff: waiting 2 seconds before processing with the task
2025-02-27T06:14:30.156184Z  INFO ephemeral_file_buffered_writer{tenant_id=daa3745f15757989cef9f2ca2f87a043 shard_id=0000 timeline_id=8d9538523fdd13c05213165f28fc5f60 path=/tmp/tmpfs/.neon/pageserver_1/tenants/daa3745f15757989cef9f2ca2f87a043/timelines/8d9538523fdd13c05213165f28fc5f60/ephemeral-2}:flush_attempt{attempt=2}: retrying flush
2025-02-27T06:14:30.157023Z  WARN ephemeral_file_buffered_writer{tenant_id=daa3745f15757989cef9f2ca2f87a043 shard_id=0000 timeline_id=8d9538523fdd13c05213165f28fc5f60 path=/tmp/tmpfs/.neon/pageserver_1/tenants/daa3745f15757989cef9f2ca2f87a043/timelines/8d9538523fdd13c05213165f28fc5f60/ephemeral-2}:flush_attempt{attempt=2}: \
 error flushing buffered writer buffer to disk, retrying after backoff err=No space left on device (os error 28)
2025-02-27T06:14:30.157055Z  INFO ephemeral_file_buffered_writer{tenant_id=daa3745f15757989cef9f2ca2f87a043 shard_id=0000 timeline_id=8d9538523fdd13c05213165f28fc5f60 path=/tmp/tmpfs/.neon/pageserver_1/tenants/daa3745f15757989cef9f2ca2f87a043/timelines/8d9538523fdd13c05213165f28fc5f60/ephemeral-2}:flush_attempt{attempt=2}: Backoff: waiting 4 seconds before processing with the task
...
### no other WARN/ERROR/panic messages ###

Regarding tests for ENOSPC (and perhaps other errors), I can also suggest emulating such errors using LD_PRELOAD. (I haven't encountered such tests in Postgres, maybe there are some challenges with it, but I personally tried that approach: https://www.postgresql.org/message-id/f9bebfe6-cee4-ed87-d4e6-29b5ca4be08d@gmail.com)
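For reference, such an LD_PRELOAD shim could look like the following sketch (everything here is an assumption: Rust, built as a cdylib on Linux, edition 2021 for the plain #[no_mangle] attribute). Note it only intercepts plain write(2); writes issued through other syscalls such as pwritev2 or io_uring would not be affected.

// Error-injection shim: fail write(2) with ENOSPC for every non-stdio file
// descriptor, forwarding stdout/stderr writes to the real libc write so the
// target process can still log. Build as a cdylib and point LD_PRELOAD at the
// resulting .so.
use libc::{c_char, c_int, c_void, size_t, ssize_t, ENOSPC, RTLD_NEXT};

type WriteFn = unsafe extern "C" fn(c_int, *const c_void, size_t) -> ssize_t;

#[no_mangle]
pub unsafe extern "C" fn write(fd: c_int, buf: *const c_void, count: size_t) -> ssize_t {
    if fd > 2 {
        // Pretend the disk is full for all file writes.
        *libc::__errno_location() = ENOSPC;
        return -1;
    }
    // Look up and call the real write(2) for stdio.
    let real = libc::dlsym(RTLD_NEXT, b"write\0".as_ptr() as *const c_char);
    let real: WriteFn = std::mem::transmute(real);
    real(fd, buf, count)
}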

@problame (Contributor Author) commented:

This PR is ready to merge IMO. Reviewers, please review 🙏

@@ -229,6 +229,7 @@ async fn download_object(
|| IoBufferMut::with_capacity(super::BUFFER_SIZE),
gate.enter().map_err(|_| DownloadError::Cancelled)?,
ctx,
info_span!(parent: None, "download_object_buffered_writer", %dst_path),
Review comment (Contributor):
I haven't checked the ingest code, but are these writers kind of short-lived (like one wal record?)

Just thinking about perf impact of the extra span

// then we can't shut down the timeline/tenant/pageserver cleanly because
// upper layers of the Pageserver write path are holding the gate open for EphemeralFile.
//
// TODO: use utils::backoff::retry once async closures are actually usable
Review comment (Contributor):
didn't async closures just get stabilized in 1.85? (if this comment is calling out a specific outstanding issue then let's mention or link it)

@problame (Contributor Author) replied:
This is the commit where I gave up: a03f335

utils::backoff::exponential_backoff(attempt, 1.0, 10.0, &NO_CANCELLATION).await;
ControlFlow::Continue(())
}
.instrument(info_span!("flush_attempt", %attempt))
Review comment (Contributor):
A little concerned about extra spans this deep in the I/O stack -- I guess the idea is that in the rare event where we retry here, we want to know who/what is retrying rather than it just looking like a hang. Is there some way we can do that without constructing the spans on the happy path?

Reply (Contributor):
What we are really paying for on the happy path is the span lifecycle.
There are atomic operations when creating and exiting the span (to determine the right subscriber).
The span allocation itself is pooled, so it should be fairly cheap.

I don't know if this would show up without benchmarking. We could instrument only if attempt > 1
and not worry about it (sketch below).
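For example, a sketch of the "only instrument if attempt > 1" idea (not what the PR does): use a disabled span on the first attempt so the happy path pays neither span creation nor enter/exit, and only retries get the flush_attempt span.

use tracing::{info_span, Instrument, Span};

async fn run_flush_attempt(attempt: u32) {
    // Real span only when retrying; Span::none() is disabled and effectively
    // free to enter/exit, so the first (happy-path) attempt skips the bookkeeping.
    let span = if attempt > 1 {
        info_span!("flush_attempt", %attempt)
    } else {
        Span::none()
    };
    async {
        // ... perform the write and, on failure, the backoff here ...
    }
    .instrument(span)
    .await
}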

@VladLazar (Contributor) left a review comment:
Code looks fine to me.


As pointed out in the Context section, throwing away an InMemoryLayer that encountered an error and reingesting the WAL is a lot of complexity that IMO isn't justified for such an edge case.

I think this is the truly robust thing to do. I'm aware it's complicated and not saying you should do it now. It would handle not only IO errors, but most bugs/errors in the ingest path too.


What's your risk assessment on rolling out without cancellation sensitivity on retries?
To me it feels a bit spicy: it's an infinite loop we can't break out of nicely.


@problame (Contributor Author) commented:

As pointed out in the Context section, throwing away an InMemoryLayer that encountered an error and reingesting the WAL is a lot of complexity that IMO isn't justified for such an edge case.

I think this is the truly robust thing to do. I'm aware it's complicated and not saying you should do it now. It would handle not only IO errors, but most bugs/errors in the ingest path too.

I disagree that it's more robust to retry locally: throwing away InMemoryLayer and reingesting from SKs adds more load to the system, and not just the one node that is experiencing IO errors, but also the SKs.

Killing ourselves on write IO failures and letting secondaries retry, on their other nodes, which presumably are not experiencing IO failures, is much more robust I think.

@problame (Contributor Author) commented:

There were two review comments pertaining to spans.

There are two concerns with spans:

  • Allocations: these are negligible except in very tight loops because spans are slab-allocated. If a span lives for >1ms, it's a no-brainer I think.
  • Entering/exiting overhead: that one is real, and we've seen it hurt e.g. getpage latency in the past (which in total takes hundreds of microseconds, usually).

I added two spans.

The first span is for the top level flush background task future. I.e.,

  1. download_object_buffered_writer (on-demand download)
  2. ephemeral_file_buffered_writer (InMemoryLayer)

The second span is within the flush task, for each flush attempt.
That one is quite short-lived (one write, i.e., 150us approximately).

Both spans were added for better error context so that, if the code indeed retries, those retries are discoverable.
Having tenant_id, shard_id, timeline_id, plus the filename, in error messages is super important.
Without them, the log messages emitted by the flush task would be at top level and thereby not show up in log searches for a particular tenant/timeline.

I guess the alternatives to the current way are:

  1. only creating and/or entering spans once we are on the error path where we log something
  2. no spans at all

I'll try and see if bench_ingest shows a difference before addressing those.

@VladLazar (Contributor) commented:

As pointed out in the Context section, throwing away an InMemoryLayer that encountered an error and reingesting the WAL is a lot of complexity that IMO isn't justified for such an edge case.
I think this is the truly robust thing to do. I'm aware it's complicated and not saying you should do it now. It would handle not only IO errors, but most bugs/errors in the ingest path too.

I disagree that it's more robust to retry locally: throwing away InMemoryLayer and reingesting from SKs adds more load to the system, and not just the one node that is experiencing IO errors, but also the SKs.

Killing ourselves on write IO failures and letting secondaries retry, on their other nodes, which presumably are not experiencing IO failures, is much more robust I think.

Under heavy ingest load, secondaries will have to re-ingest too (potentially more than the primary).
But, yeah, I agree with you that io errors are a special case.

@problame (Contributor Author) commented Feb 28, 2025:

I'll try and see if bench_ingest shows a difference before addressing those.

Setup

bench_ingest on a tmpfs mount; Hetzner AX102

Baseline: 40ad42d55

SUT: e935aff9e98493411067c94da5c3233f88389f23

Results

Baseline

ingest-small-values/ingest 128MB/100b seq              time:   [860.20 ms 864.20 ms 869.95 ms]       thrpt:  [147.13 MiB/s 148.11 MiB/s 148.80 MiB/s]
ingest-small-values/ingest 128MB/100b rand                    time:   [1.2399 s 1.2434 s 1.2470 s]   thrpt:  [102.65 MiB/s 102.94 MiB/s 103.23 MiB/s]
ingest-small-values/ingest 128MB/100b rand-1024keys                  time:   [870.53 ms 875.29 ms 881.19 ms]       thrpt:  [145.26 MiB/s 146.24 MiB/s 147.04 MiB/s]
ingest-small-values/ingest 128MB/100b seq, no delta                  time:   [268.79 ms 269.47 ms 270.11 ms]       thrpt:  [473.89 MiB/s 475.01 MiB/s 476.21 MiB/s]
ingest-big-values/ingest 128MB/8k seq                  time:   [223.95 ms 233.94 ms 244.07 ms]       thrpt:  [524.44 MiB/s 547.16 MiB/s 571.56 MiB/s]
ingest-big-values/ingest 128MB/8k seq, no delta               time:   [72.977 ms 80.812 ms 84.147 ms]       thrpt:  [1.4855 GiB/s 1.5468 GiB/s 1.7129 GiB/s]

SUT

ingest-small-values/ingest 128MB/100b seq               time:   [855.04 ms 876.75 ms 907.91 ms]       thrpt:  [140.98 MiB/s 145.99 MiB/s 149.70 MiB/s]
ingest-small-values/ingest 128MB/100b rand                     time:   [1.2298 s 1.2351 s 1.2405 s]   thrpt:  [103.18 MiB/s 103.63 MiB/s 104.08 MiB/s]
ingest-small-values/ingest 128MB/100b rand-1024keys                   time:   [877.36 ms 889.81 ms 901.81 ms]       thrpt:  [141.94 MiB/s 143.85 MiB/s 145.89 MiB/s]
ingest-small-values/ingest 128MB/100b seq, no delta                   time:   [272.56 ms 274.23 ms 276.00 ms]       thrpt:  [463.77 MiB/s 466.77 MiB/s 469.62 MiB/s]
ingest-big-values/ingest 128MB/8k seq                   time:   [217.68 ms 224.14 ms 229.71 ms]       thrpt:  [557.23 MiB/s 571.08 MiB/s 588.02 MiB/s]
ingest-big-values/ingest 128MB/8k seq, no delta                time:   [71.053 ms 72.785 ms 73.958 ms]       thrpt:  [1.6902 GiB/s 1.7174 GiB/s 1.7593 GiB/s]

Conclusion

No meaningful change.
Even on tmpfs, bench_ingest has about 5% noise for me in some runs (sometimes just ±1%).

@problame (Contributor Author) commented:

So, takeaway from bench_ingest is that in the isolated use case, the span allocation and enter/exit overhead is cheap.

But it doesn't test for contention under concurrency.

Can we contend on anything across multiple tasks?
I think not because both Span objects are uniquely owned by the BufferedWriter instance (we don't clone() them anywhere).

So, both the Span::enter and drop(Entered) code paths will do their refcount inc's and dec's only against the uniquely owned span object.

Probably the costliest part of span exit/enter is tracing::...::sharded.rs's iteration over the entire span stack. But the span stack has depth 0 or 1, respectively, for our flusher task. So, that's cheap as well.

@problame (Contributor Author) commented:

Regarding concerns about cancellation:

I'll try to hack in cancellation, but the tricky part is not to end up in the mutable = None situation again.

Successfully merging this pull request may close these issues.

pageserver produces panic errors "must not use after we returned an error" after I/O error (ENOSPC)