pageserver produces panic errors "must not use after we returned an error" after I/O error (ENOSPC) #10856
Comments
Coming back to this. Your scenario adds filesystem capacity usage outside of the control of pageserver. Such a scenario is unsupported: Pageserver assumes that capacity usage outside of the pageserver's working directory is constant. Within its working directory, Pageserver is responsible for dealing with ENOSPC. See the module comment in neon/pageserver/src/disk_usage_eviction_task.rs (lines 1 to 43 in 920040e).
Our error handling policy is to bubble up ENOSPC so that disk-usage-based eviction can continue to function. So, with all of that being said, the behavior you are describing (log flooded with backtraces) is exactly what I'd expect to happen. Do you think the behavior should be different?
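For readers following along, here is a minimal sketch of what "bubble up ENOSPC" can look like in plain Rust. It is not the pageserver's actual code: `write_chunk`, `Action`, and `classify` are hypothetical names, and the errno value 28 assumes a Linux target.

```rust
use std::io::{self, Write};

/// Errno for "No space left on device". Assumption: Linux, where ENOSPC is 28.
const ENOSPC: i32 = 28;

/// Returns true if the I/O error reports an out-of-space condition.
fn is_enospc(err: &io::Error) -> bool {
    err.raw_os_error() == Some(ENOSPC)
}

/// Hypothetical low-level write helper. It makes no attempt to handle
/// ENOSPC locally; any error is returned to the caller unchanged.
fn write_chunk<W: Write>(dst: &mut W, chunk: &[u8]) -> io::Result<()> {
    dst.write_all(chunk)
}

/// Hypothetical decision taken by a higher-level caller.
#[derive(Debug, PartialEq)]
enum Action {
    /// The write succeeded.
    Continue,
    /// The disk is full; leave it to disk-usage-based eviction to free space.
    WaitForEviction,
    /// Any other I/O error.
    Fail,
}

/// Maps the result of a write to the action the caller should take.
fn classify(res: &io::Result<()>) -> Action {
    match res {
        Ok(()) => Action::Continue,
        Err(e) if is_enospc(e) => Action::WaitForEviction,
        Err(_) => Action::Fail,
    }
}
```

The point of the split is that the low-level helper stays simple and only the top-level caller decides what an out-of-space error means.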
I'm concerned about two things:
Yeah, if ENOSPC on one disk partition didn't lead to filling up another disk partition with logs, it would be less questionable to me, but maybe I can just perform abort() in pageserver for local testing.
I'm digging into the
I agree. However, the layer write path is hard to make retryable because all the internal interfaces are append-only and hide the offset. The locally optimal solution would be to throw away the half-written layer and, with some back-off, retry writing it, hoping that the ENOSPC has gone away. But we can't easily throw the half-written layer away because the buffers that have already been successfully written are already freed, so we'd have to re-seed them from memory. Overall, IMO it's not worth the effort to achieve that local optimum, considering how rare ENOSPC is in practice. So, my action items are to fix the
Regarding a global optimum: I think it is to just die immediately on ENOSPC and run eviction on startup. The trouble at the time was implementing eviction on startup. However, maybe things are simpler now that we're fully storcon-managed & all tenants have secondaries. Of course, if there's a systemic space management bug, that will just propagate the problem to other nodes. But I think that's an orthogonal problem? And because of the delays involved with filling up disks, it buys us more time to react (e.g. roll back the bad code). @jcsp wdyt, worth re-opening that discussion?
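To make the "locally optimal" option concrete, here is a rough sketch of retrying a whole write attempt with exponential back-off on ENOSPC. This is an illustration only, not pageserver code: `retry_on_enospc` is a hypothetical helper, the errno value 28 assumes Linux, and the closure is assumed to restart the write from scratch (that is, to re-seed its buffers from memory), which is exactly the part described above as hard.

```rust
use std::io;
use std::thread::sleep;
use std::time::Duration;

/// Errno for "No space left on device". Assumption: Linux, where ENOSPC is 28.
const ENOSPC: i32 = 28;

/// Hypothetical helper: runs `attempt` until it succeeds, fails with a
/// non-ENOSPC error, or has been tried `max_attempts` times.
/// Each call to `attempt` must redo the entire write from the beginning.
fn retry_on_enospc<T, F>(
    mut attempt: F,
    max_attempts: u32,
    initial_backoff: Duration,
) -> io::Result<T>
where
    F: FnMut() -> io::Result<T>,
{
    let mut backoff = initial_backoff;
    let mut tries = 0;
    loop {
        tries += 1;
        match attempt() {
            Ok(value) => return Ok(value),
            // Out of space and attempts remain: wait, then retry with a
            // doubled delay in the hope that space has been freed.
            Err(e) if e.raw_os_error() == Some(ENOSPC) && tries < max_attempts => {
                sleep(backoff);
                backoff = backoff.saturating_mul(2);
            }
            // Any other error, or attempts exhausted: bubble it up.
            Err(e) => return Err(e),
        }
    }
}
```

A caller would wrap the full layer write in the closure, for example `retry_on_enospc(|| write_layer_from_memory(), 5, Duration::from_millis(100))`, where `write_layer_from_memory` is again a made-up name.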
I'm open to it. The fragile part is making sure that none of the code that runs around startup + before eviction will use an I/O helper that has the panic-on-ENOSPC behavior.
The following script:
makes pageserver fill its log with thousands of panic buffer-related messages while being unable to recover after a transient ENOSPC condition:
The backtrace of the error is: