[Merged by Bors] - TaskPool Panic Handling #6443
Conversation
`crates/bevy_tasks/src/task_pool.rs` (outdated):

```rust
local_executor.tick().await;
    }
};
// Use unwrap_err because we expect a Closed error
```
This comment is no longer true.
It is accurate once again, haha.
The error-bubbling approach and the log-and-ignore pattern both seem useful for different use cases. Abort on panic seems... interesting, but it really feels like something that should be opt-in (and I'm not sure it's worth the pervasive complexity of maintaining both paths). What would it take to be able to restart tasks if we discover that they failed?
I would use all of them for different use cases:
The executor will drop the task, as it may be in an unrecoverable state after the panic. IMO users should handle the panic and reschedule the task(s) if they want to retry the same computation.
Some further investigation:
These two findings, combined with the current implementation, suggest that we can just straight up ignore any errors from the executor. The Abort-on-Panic approach can be taken when we start diving deeper into making our own async executor.
Yep, based on your investigations this seems like a significant improvement already.
I'd much rather merge this and then add more sophisticated strategies later as needed than block on designing them right now.
Looks good to me, although this isn't my area of expertise.
@hymm if you disable cargo test's output capture, do you see any panic messages?
LGTM
Edit: Didn't see the comments about the panicking.
I think the hierarchy test is failing due to some weird interaction between the 2 scopes. I tried to reproduce tasks disappearing with just the task pools and was unable to:

```rust
AsyncComputeTaskPool::get()
    .spawn(async {
        loop {
            info!("this is a task");
            sleep(Duration::from_millis(1000));
            yield_now().await;
        }
    })
    .detach();

AsyncComputeTaskPool::get()
    .spawn(async {
        sleep(Duration::from_millis(100));
        panic!("this is a panic");
    })
    .detach();
```

If we're trying to get this in before 0.9, I'd be happy to approve this with the hierarchy test changes reverted, as I can confirm that this PR is preventing the thread from getting killed and is an improvement over the status quo. In the meantime I'm going to investigate some more and see if I can get a reproduction using 2 scopes.
Done. Can you file a separate issue to track this?
bors r+
# Objective

Right now, the `TaskPool` implementation allows panics to permanently kill worker threads. This is currently non-recoverable without using a `std::panic::catch_unwind` in every scheduled task. This is poor ergonomics and even poorer developer experience. This is exacerbated by #2250, as these threads are global and cannot be replaced after initialization.

Removes the need for temporary fixes like #4998. Fixes #4996. Fixes #6081. Fixes #5285. Fixes #5054. Supersedes #2307.

## Solution

The current solution is to wrap `Executor::run` in `TaskPool` with a `catch_unwind`, discarding the potential panic. This was taken straight from [smol](/~https://github.com/smol-rs/smol/blob/404c7bcc0aea59b82d7347058043b8de7133241c/src/spawn.rs#L44)'s current implementation.

~~However, this is not entirely ideal as:~~

- ~~the panic is not signaled to the awaiting task. We would need to change `Task<T>` to use `async_task::FallibleTask` internally, and even then it doesn't signal *why* it panicked, just that it did.~~ (See below.)
- ~~no error is logged of any kind~~ (See below.)
- ~~it's unclear if it drops other tasks in the executor~~ (it does not)
- ~~This allows the ECS parallel executor to keep chugging even though a system's task has been dropped. This inevitably leads to deadlock in the executor.~~ Assuming we don't catch the unwind in `ParallelExecutor`, this will naturally kill the main thread.

### Alternatives

A final solution will likely incorporate elements of any or all of the following.

#### ~~Log and Ignore~~

~~Log the panic, drop the task, keep chugging. This only addresses the discoverability of the panic. The process will continue to run, probably deadlocking the executor. tokio's detached tasks operate in this fashion.~~ Panics already do this by default, even when caught by `catch_unwind`.

#### ~~`catch_unwind` in `ParallelExecutor`~~

~~Add another layer catching system-level panics in the `ParallelExecutor`. How the executor continues when a core dependency of many systems fails to run is up for debate.~~ `async_task::Task` bubbles up panics already; this will transitively push panics all the way to the main thread.

#### ~~Emulate/Copy `tokio::JoinHandle` with `Task<T>`~~

~~`tokio::JoinHandle<T>` bubbles up the panic from the underlying task when awaited. This can be transitively applied across other APIs that also use `Task<T>`, like `Query::par_for_each` and `TaskPool::scope`, bubbling up the panic until it's either caught or it reaches the main thread.~~ `async_task::Task` bubbles up panics already; this will transitively push panics all the way to the main thread.

#### Abort on Panic

The nuclear option. Log the error and abort the entire process when any thread in the task pool panics. This definitely avoids any additional infrastructure for passing the panic around, and might actually lead to more efficient code as any unwinding is optimized out. However, it gives the developer zero options for dealing with the issue, is a seemingly poor choice for debuggability, and prevents graceful shutdown of the process. Potentially an option for handling very low-level task management (a la #4740). Roughly takes the shape of:

```rust
struct AbortOnPanic;

impl Drop for AbortOnPanic {
    fn drop(&mut self) {
        std::process::abort();
    }
}

let guard = AbortOnPanic;
// Run task
std::mem::forget(guard);
```

---

## Changelog

Changed: `bevy_tasks::TaskPool`'s threads will no longer terminate permanently when a task scheduled onto them panics.

Changed: `bevy_tasks::Task` and `bevy_tasks::Scope` will propagate panics in the spawned tasks/scopes to the parent thread.
Pull request successfully merged into main. Build succeeded:
FYI, I'm not sure #5054 should have been closed: I think you'll still get different panic messages on the main thread depending on whether a system was run on the main thread or not.
will do.