
pageserver produces panic errors "must not use after we returned an error" after I/O error (ENOSPC) #10856

Open
alexanderlaw opened this issue Feb 17, 2025 · 4 comments · May be fixed by #10993
Labels
c/storage/pageserver (Component: storage: pageserver), t/bug (Issue Type: Bug)

Comments

@alexanderlaw
Contributor

The following script:

mkdir /tmp/tmpfs && sudo mount -t tmpfs -o size=2G tmpfs /tmp/tmpfs # one-time action

rm -rf .neon /tmp/tmpfs/.neon || true
mkdir /tmp/tmpfs/.neon
ln -s /tmp/tmpfs/.neon .neon

cargo neon init --force empty-dir-ok
cargo neon start
cargo neon tenant create --pg-version 17 --set-default
cargo neon endpoint create --pg-version 17 main
cargo neon endpoint start main

dd if=/dev/zero of=/tmp/tmpfs/filler bs=1M count=1500
timeout 60 pgbench -i -I dtGp -s 10 || true
rm /tmp/tmpfs/filler
ls -lh /tmp/tmpfs/.neon/pageserver_1/pageserver.log
sleep 60
ls -lh /tmp/tmpfs/.neon/pageserver_1/pageserver.log

cargo neon stop

makes the pageserver fill its log with thousands of buffer-related panic messages while failing to recover from a transient ENOSPC condition:

-rw-rw-r-- 1 a a 61K Feb 17 17:33 /tmp/tmpfs/.neon/pageserver_1/pageserver.log
-rw-rw-r-- 1 a a 998M Feb 17 17:34 /tmp/tmpfs/.neon/pageserver_1/pageserver.log
2025-02-17T15:33:04.198848Z ERROR page_service_conn_main{peer_addr=127.0.0.1:35970 application_name=1452227}:process_query{tenant_id=45d5e91e4e5502c80ab985104c4fa4f1 timeline_id=876f7eb02c86cb75058cbb27f3387e68}:handle_pagerequests:request:handle_get_page_request{rel=1663/5/1259 blkno=0 req_lsn=FFFFFFFF/FFFFFFFF shard_id=0000}:handle_get_page_at_lsn_request_batched:panic{thread=compute request worker location=pageserver/src/virtual_file/owned_buffers_io/write.rs:132:14}:\
 must not use after we returned an error
2025-02-17T15:33:04.201331Z ERROR page_service_conn_main{peer_addr=127.0.0.1:35984 application_name=1452227}:process_query{tenant_id=45d5e91e4e5502c80ab985104c4fa4f1 timeline_id=876f7eb02c86cb75058cbb27f3387e68}:handle_pagerequests:request:handle_get_page_request{rel=1663/5/1259 blkno=0 req_lsn=FFFFFFFF/FFFFFFFF shard_id=0000}:handle_get_page_at_lsn_request_batched:panic{thread=compute request worker location=pageserver/src/virtual_file/owned_buffers_io/write.rs:132:14}:\
 must not use after we returned an error
2025-02-17T15:33:04.204146Z ERROR page_service_conn_main{peer_addr=127.0.0.1:35988 application_name=1452227}:process_query{tenant_id=45d5e91e4e5502c80ab985104c4fa4f1 timeline_id=876f7eb02c86cb75058cbb27f3387e68}:handle_pagerequests:request:handle_get_page_request{rel=1663/5/1259 blkno=0 req_lsn=FFFFFFFF/FFFFFFFF shard_id=0000}:handle_get_page_at_lsn_request_batched:panic{thread=compute request worker location=pageserver/src/virtual_file/owned_buffers_io/write.rs:132:14}:\
 must not use after we returned an error
grep ':panic{' /tmp/tmpfs/.neon/pageserver_1/pageserver.log | wc -l
31562

The backtrace of the error is:

   0: utils::logging::tracing_panic_hook
             at /src/neon-t/libs/utils/src/logging.rs:236:21
   1: core::ops::function::Fn::call
             at /rustc/e71f9a9a98b0faf423844bf0ba7438f29dc27d58/library/core/src/ops/function.rs:79:5
   2: <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call
             at /rustc/e71f9a9a98b0faf423844bf0ba7438f29dc27d58/library/alloc/src/boxed.rs:1986:9
   3: std::panicking::rust_panic_with_hook
             at /rustc/e71f9a9a98b0faf423844bf0ba7438f29dc27d58/library/std/src/panicking.rs:809:13
   4: std::panicking::begin_panic_handler::{{closure}}
             at /rustc/e71f9a9a98b0faf423844bf0ba7438f29dc27d58/library/std/src/panicking.rs:674:13
   5: std::sys::backtrace::__rust_end_short_backtrace
             at /rustc/e71f9a9a98b0faf423844bf0ba7438f29dc27d58/library/std/src/sys/backtrace.rs:170:18
   6: rust_begin_unwind
             at /rustc/e71f9a9a98b0faf423844bf0ba7438f29dc27d58/library/std/src/panicking.rs:665:5
   7: core::panicking::panic_fmt
             at /rustc/e71f9a9a98b0faf423844bf0ba7438f29dc27d58/library/core/src/panicking.rs:76:14
   8: core::panicking::panic_display
             at /rustc/e71f9a9a98b0faf423844bf0ba7438f29dc27d58/library/core/src/panicking.rs:269:5
   9: core::option::expect_failed 
             at /rustc/e71f9a9a98b0faf423844bf0ba7438f29dc27d58/library/core/src/option.rs:2018:5 
  10: core::option::Option<T>::expect
             at /rustc/e71f9a9a98b0faf423844bf0ba7438f29dc27d58/library/core/src/option.rs:933:21
  11: pageserver::virtual_file::owned_buffers_io::write::BufferedWriter<B,W>::mutable
             at /src/neon-t/pageserver/src/virtual_file/owned_buffers_io/write.rs:130:9
  12: pageserver::virtual_file::owned_buffers_io::write::BufferedWriter<B,W>::inspect_mutable
             at /src/neon-t/pageserver/src/virtual_file/owned_buffers_io/write.rs:100:9
  13: <pageserver::tenant::ephemeral_file::EphemeralFile as pageserver::tenant::storage_layer::inmemory_layer::vectored_dio_read::File>::read_exact_at_eof_ok::{{closure}}
             at /src/neon-t/pageserver/src/tenant/ephemeral_file.rs:185:23
  14: pageserver::tenant::storage_layer::inmemory_layer::vectored_dio_read::execute::{{closure}}
             at /src/neon-t/pageserver/src/tenant/storage_layer/inmemory_layer/vectored_dio_read.rs:249:95
  15: pageserver::tenant::storage_layer::inmemory_layer::InMemoryLayer::get_values_reconstruct_data::{{closure}}::{{closure}}
             at /src/neon-t/pageserver/src/tenant/storage_layer/inmemory_layer.rs:483:22
  16: pageserver::tenant::storage_layer::IoConcurrency::spawn_io::{{closure}}
             at /src/neon-t/pageserver/src/tenant/storage_layer.rs:492:46
  17: pageserver::tenant::storage_layer::ValuesReconstructState::spawn_io::{{closure}}
             at /src/neon-t/pageserver/src/tenant/storage_layer.rs:623:43
  18: pageserver::tenant::storage_layer::inmemory_layer::InMemoryLayer::get_values_reconstruct_data::{{closure}}
             at /src/neon-t/pageserver/src/tenant/storage_layer/inmemory_layer.rs:511:14
  19: pageserver::tenant::storage_layer::ReadableLayer::get_values_reconstruct_data::{{closure}}
             at /src/neon-t/pageserver/src/tenant/storage_layer.rs:879:22
...
@alexanderlaw added the c/storage/pageserver (Component: storage: pageserver) and t/bug (Issue Type: Bug) labels on Feb 17, 2025
@problame
Contributor

Coming back to this.

Your scenario adds filesystem capacity usage outside of the pageserver's control.

Such a scenario is unsupported: the pageserver assumes that capacity usage outside of its working directory is constant.

Within its working directory, the pageserver is responsible for dealing with ENOSPC.
The current strategy is that we evict layer files if we run low on disk space ("disk-usage-based eviction").

See this module comment:

//! This module implements the pageserver-global disk-usage-based layer eviction task.
//!
//! # Mechanics
//!
//! Function `launch_disk_usage_global_eviction_task` starts a pageserver-global background
//! loop that evicts layers in response to a shortage of available bytes
//! in the $repo/tenants directory's filesystem.
//!
//! The loop runs periodically at a configurable `period`.
//!
//! Each loop iteration uses `statvfs` to determine filesystem-level space usage.
//! It compares the returned usage data against two different types of thresholds.
//! The iteration tries to evict layers until app-internal accounting says we should be below the thresholds.
//! We cross-check this internal accounting with the real world by making another `statvfs` at the end of the iteration.
//! We're good if that second statvfs shows that we're _actually_ below the configured thresholds.
//! If we're still above one or more thresholds, we emit a warning log message, leaving it to the operator to investigate further.
//!
//! # Eviction Policy
//!
//! There are two thresholds:
//! `max_usage_pct` is the relative available space, expressed in percent of the total filesystem space.
//! If the actual usage is higher, the threshold is exceeded.
//! `min_avail_bytes` is the absolute available space in bytes.
//! If the actual usage is lower, the threshold is exceeded.
//! If either of these thresholds is exceeded, the system is considered to have "disk pressure", and eviction
//! is performed on the next iteration, to release disk space and bring the usage below the thresholds again.
//! The iteration evicts layers in LRU fashion, but, with a weak reservation per tenant.
//! The reservation is to keep the most recently accessed X bytes per tenant resident.
//! If we cannot relieve pressure by evicting layers outside of the reservation, we
//! start evicting layers that are part of the reservation, LRU first.
//!
//! The value for the per-tenant reservation is referred to as `tenant_min_resident_size`
//! throughout the code, but, no actual variable carries that name.
//! The per-tenant default value is the `max(tenant's layer file sizes, regardless of local or remote)`.
//! The idea is to allow at least one layer to be resident per tenant, to ensure it can make forward progress
//! during page reconstruction.
//! An alternative default for all tenants can be specified in the `tenant_config` section of the config.
//! Lastly, each tenant can have an override in their respective tenant config (`min_resident_size_override`).
// Implementation notes:
// - The `#[allow(dead_code)]` above various structs are to suppress warnings about only the Debug impl
// reading these fields. We use the Debug impl for semi-structured logging, though.

Our error handling policy is to bubble up ENOSPC so that disk-usage-based eviction can continue to function:

/~https://github.com/neondatabase/docs/blob/c870a50bc099d82444947a353fb302c761949c94/src/storage/handling_io_and_logical_errors.md#L36-L43
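
As a rough illustration of that policy (a hypothetical helper, not the pageserver's actual I/O code): a write path classifies ENOSPC and hands the error back to the caller instead of panicking, so that the eviction task stays alive and can free space. This assumes io::ErrorKind::StorageFull, which is available on recent stable Rust:

use std::io::{self, Write};

// Hypothetical sketch: propagate ENOSPC to the caller instead of aborting, so the
// disk-usage-based eviction task keeps running and can release disk space.
fn append_chunk(file: &mut impl Write, chunk: &[u8]) -> io::Result<()> {
    file.write_all(chunk).map_err(|e| {
        if e.kind() == io::ErrorKind::StorageFull {
            // ENOSPC: log it and bubble it up; the caller fails the current request,
            // and a later write can be attempted once eviction has relieved the pressure.
            eprintln!("append failed with ENOSPC, bubbling up: {e}");
        }
        e
    })
}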


So, with all of that being said, the behavior you are describing (log flooded with backtraces) is exactly what I'd expect to happen.

Do you think the behavior should be different?

@alexanderlaw
Contributor Author

alexanderlaw commented Feb 26, 2025

I'm concerned about two things:

  1. I'm not sure whether the expectation that the operation is eventually retried (as written in the referenced document) is satisfied in this case. Maybe there are other scenarios where ENOSPC is handled as a transient error, though.

  2. Doesn't the message "must not use after we returned an error" indicate a situation that is actually expected in the buffer-related code (ENOSPC aside)?

Yeah, if ENOSPC on one disk partition didn't lead to filling up another disk partition with logs, it would be less questionable to me, but maybe I can just make pageserver abort() for local testing.

@problame
Contributor

I'm digging into the "must not use after we returned an error" condition separately; that's clearly a pageserver bug.
But any fix will still result in an error being bubbled up.
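
For context, the backtrace in the issue points at a "poison on error" pattern that, heavily simplified and with hypothetical structure, looks roughly like this: the writer gives up its buffer once a write fails, and every later access trips the expect() with exactly the message seen in the logs.

// Heavily simplified sketch; the real BufferedWriter in owned_buffers_io/write.rs differs.
struct SketchBufferedWriter {
    // Set to None once a write has failed ("poisoned").
    buf: Option<Vec<u8>>,
}

// Stand-in for the actual write to disk, which can fail with e.g. ENOSPC.
fn flush_to_disk(_buf: &[u8]) -> std::io::Result<()> {
    Ok(())
}

impl SketchBufferedWriter {
    // The read path also goes through this accessor, which is why reads panic too.
    fn mutable(&self) -> &Vec<u8> {
        self.buf
            .as_ref()
            .expect("must not use after we returned an error")
    }

    fn write(&mut self, data: &[u8]) -> std::io::Result<()> {
        let buf = self
            .buf
            .as_mut()
            .expect("must not use after we returned an error");
        buf.extend_from_slice(data);
        match flush_to_disk(buf) {
            Ok(()) => Ok(()),
            Err(e) => {
                // Drop the buffer so any further use of this writer panics instead
                // of operating on inconsistent state.
                self.buf = None;
                Err(e)
            }
        }
    }
}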

I'm not sure whether the expectation of being eventually retried (as written in the document referenced) is satisfied in this case. Maybe there are other scenarios where ENOSPC is handled as a transient error, though.

I agree.

However, the layer write path is hard to make retryable, because all of the internal interfaces are append-only and hide the offset.

The locally optimal solution would be to throw away the half-written layer and, with some back-off, retry writing it, hoping that the ENOSPC condition has gone away.

But we can't easily throw the half-written layer away because the buffers that have already been successfully written are already freed, so we'd have to re-seed them from memory.

Overall, IMO it's not worth the effort to achieve that local optimum, considering how rare ENOSPC is in practice.
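
For illustration only, here is a generic retry-with-back-off sketch of that local optimum; the helper and its signature are hypothetical, and the real layer writer is async and considerably more involved:

use std::time::Duration;

// Hypothetical sketch: discard the half-written output and retry the whole write
// with back-off, hoping the ENOSPC condition has cleared in the meantime.
fn write_layer_with_backoff(
    mut write_attempt: impl FnMut() -> std::io::Result<()>,
    max_attempts: u32,
) -> std::io::Result<()> {
    let mut delay = Duration::from_secs(1);
    let mut last_err = None;
    for _ in 0..max_attempts {
        match write_attempt() {
            Ok(()) => return Ok(()),
            Err(e) => {
                // Each attempt must start from scratch: the closure is expected to
                // delete its half-written file and re-produce all of the data.
                last_err = Some(e);
                std::thread::sleep(delay);
                delay = (delay * 2).min(Duration::from_secs(60));
            }
        }
    }
    Err(last_err.expect("max_attempts must be greater than zero"))
}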


So, my action items are to fix the "must not use after we returned an error" bug and to check that my claims about the local optimum are true, but not to work towards the global optimum.


Regarding a global optimum: I think it is to just die immediately on ENOSPC and run eviction on startup.
I argued for this in the past; the pro arguments are still valid and complete:
https://neondb.slack.com/archives/C033RQ5SPDH/p1716896265296329

The trouble at the time was implementing eviction on startup.

However, maybe things are simpler now that we're fully storcon-managed and all tenants have secondaries.
For example, if we kill ourselves because of ENOSPC and then simply don't respond to storcon heartbeats while we're running eviction on startup, storcon will fail tenants over to the secondary anyway, giving us all the time we need to clean up.

Of course, if there's a systemic space-management bug, that will just propagate the problem to other nodes. But I think that's an orthogonal problem? And because of the delays involved in filling up disks, it buys us more time to react (e.g. roll back the bad code).

@jcsp wdyt, worth re-opening that discussion?

@jcsp
Contributor

jcsp commented Feb 26, 2025

Regarding a global optimum: I think it is to just die immediately on ENOSPC and run eviction on startup.

I'm open to it. The fragile part is making sure that none of the code that runs around startup, before eviction, uses an I/O helper that has the panic-on-ENOSPC behavior.
