[Merged by Bors] - TaskPool Panic Handling #6443
Conversation
`crates/bevy_tasks/src/task_pool.rs` (outdated):

```rust
local_executor.tick().await;
    }
};
// Use unwrap_err because we expect a Closed error
```
This comment is no longer true.
It is accurate once again, haha.
The error-bubbling approach and the log-and-ignore pattern both seem useful for different use cases. Abort on panic seems... interesting, but it really feels like something that should be opt-in (and I'm not sure it's worth the pervasive complexity of maintaining both paths). What would it take to be able to restart tasks if we discover that they failed?
I would use all of them for different use cases:
The executor will drop the task, as it may be in an unrecoverable state after the panic. IMO users should handle the panic and reschedule the task(s) if they want to retry the same computation.
Some further investigation:
These two findings, combined with the current implementation, suggest that we can just straight up ignore any errors from the executor. The Abort-on-Panic approach can be taken when we start diving deeper into making our own async executor.
Yep, based on your investigations this seems like a significant improvement already.
I'd much rather merge this and then add more sophisticated strategies later as needed than block on designing them right now.
Looks good to me, although this isn't my area of expertise.
@hymm if you disable cargo test's output capture, do you see any panic messages?
LGTM
Edit: Didn't see the comments about the panicking.
I think the hierarchy test is failing due to some weird interaction between the 2 scopes. I tried to reproduce tasks disappearing with just the task pools and was unable to:

```rust
AsyncComputeTaskPool::get()
    .spawn(async {
        loop {
            info!("this is a task");
            sleep(Duration::from_millis(1000));
            yield_now().await;
        }
    })
    .detach();

AsyncComputeTaskPool::get()
    .spawn(async {
        sleep(Duration::from_millis(100));
        panic!("this is a panic");
    })
    .detach();
```

If we're trying to get this in before 0.9, I'd be happy to approve this with the hierarchy test changes reverted, as I can confirm that this PR is preventing the thread from getting killed and is an improvement over the status quo. In the meantime I'm going to investigate some more and see if I can get a reproduction using 2 scopes.
Done. Can you file a separate issue to track this?
bors r+
# Objective

Right now, the `TaskPool` implementation allows panics to permanently kill worker threads. This is currently non-recoverable without using a `std::panic::catch_unwind` in every scheduled task. This is poor ergonomics and even poorer developer experience. This is exacerbated by #2250, as these threads are global and cannot be replaced after initialization.

Removes the need for temporary fixes like #4998. Fixes #4996. Fixes #6081. Fixes #5285. Fixes #5054. Supersedes #2307.

## Solution

The current solution is to wrap `Executor::run` in `TaskPool` with a `catch_unwind`, discarding the potential panic. This was taken straight from [smol](/~https://github.com/smol-rs/smol/blob/404c7bcc0aea59b82d7347058043b8de7133241c/src/spawn.rs#L44)'s current implementation.

~~However, this is not entirely ideal as:~~

- ~~the panic is not signaled to the awaiting task. We would need to change `Task<T>` to use `async_task::FallibleTask` internally, and even then it doesn't signal *why* it panicked, just that it did.~~ (See below.)
- ~~no error is logged of any kind~~ (See below.)
- ~~it's unclear if it drops other tasks in the executor~~ (it does not)
- ~~This allows the ECS parallel executor to keep chugging even though a system's task has been dropped. This inevitably leads to deadlock in the executor.~~ Assuming we don't catch the unwind in `ParallelExecutor`, this will naturally kill the main thread.

### Alternatives

A final solution will likely incorporate elements of any or all of the following.

#### ~~Log and Ignore~~

~~Log the panic, drop the task, keep chugging. This only addresses the discoverability of the panic. The process will continue to run, probably deadlocking the executor. tokio's detached tasks operate in this fashion.~~ Panics already do this by default, even when caught by `catch_unwind`.

#### ~~`catch_unwind` in `ParallelExecutor`~~

~~Add another layer catching system-level panics in the `ParallelExecutor`. How the executor continues when a core dependency of many systems fails to run is up for debate.~~ `async_task::Task` bubbles up panics already; this will transitively push panics all the way to the main thread.

#### ~~Emulate/Copy `tokio::JoinHandle` with `Task<T>`~~

~~`tokio::JoinHandle<T>` bubbles up the panic from the underlying task when awaited. This can be transitively applied across other APIs that also use `Task<T>`, like `Query::par_for_each` and `TaskPool::scope`, bubbling up the panic until it's either caught or it reaches the main thread.~~ `async_task::Task` bubbles up panics already; this will transitively push panics all the way to the main thread.

#### Abort on Panic

The nuclear option. Log the error and abort the entire process when any thread in the task pool panics. This definitely avoids any additional infrastructure for passing the panic around, and might actually lead to more efficient code as any unwinding is optimized out. However, it gives the developer zero options for dealing with the issue, is a seemingly poor choice for debuggability, and prevents graceful shutdown of the process. Potentially an option for handling very low-level task management (a la #4740). Roughly takes the shape of:

```rust
struct AbortOnPanic;

impl Drop for AbortOnPanic {
    fn drop(&mut self) {
        std::process::abort();
    }
}

let guard = AbortOnPanic;
// Run task
std::mem::forget(guard);
```

---

## Changelog

Changed: `bevy_tasks::TaskPool`'s threads will no longer terminate permanently when a task scheduled onto them panics.

Changed: `bevy_tasks::Task` and `bevy_tasks::Scope` will propagate panics in the spawned tasks/scopes to the parent thread.
Pull request successfully merged into main. Build succeeded:
FYI, I'm not sure #5054 should have been closed: I think you'll still get different panic messages on the main thread depending on whether a system was run on the main thread or not.
will do.