Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Merged by Bors] - TaskPool Panic Handling #6443

Closed
wants to merge 9 commits into from

Conversation

james7132
Copy link
Member

@james7132 james7132 commented Nov 1, 2022

Objective

Right now, the TaskPool implementation allows panics to permanently kill worker threads upon panicking. This is currently non-recoverable without using a std::panic::catch_unwind in every scheduled task. This is poor ergonomics and even poorer developer experience. This is exacerbated by #2250 as these threads are global and cannot be replaced after initialization.

Removes the need for temporary fixes like #4998. Fixes #4996. Fixes #6081. Fixes #5285. Fixes #5054. Supersedes #2307.

Solution

The current solution is to wrap Executor::run in TaskPool with a catch_unwind, and discarding the potential panic. This was taken straight from smol's current implementation. However, this is not entirely ideal as:

  • the signaled to the awaiting task. We would need to change Task<T> to use async_task::FallibleTask internally, and even then it doesn't signal why it panicked, just that it did. (See below).
  • no error is logged of any kind (See below)
  • it's unclear if it drops other tasks in the executor (it does not)
  • This allows the ECS parallel executor to keep chugging even though a system's task has been dropped. This inevitably leads to deadlock in the executor. Assuming we don't catch the unwind in ParallelExecutor, this will naturally kill the main thread.

Alternatives

A final solution likely will incorporate elements of any or all of the following.

Log and Ignore

Log the panic, drop the task, keep chugging. This only addresses the discoverability of the panic. The process will continue to run, probably deadlocking the executor. tokio's detatched tasks operate in this fashion.

Panics already do this by default, even when caught by catch_unwind.

catch_unwind in ParallelExecutor

Add another layer catching system-level panics into the ParallelExecutor. How the executor continues when a core dependency of many systems fails to run is up for debate.

async_task::Task bubbles up panics already, this will transitively push panics all the way to the main thread.

Emulate/Copy tokio::JoinHandle with Task<T>

tokio::JoinHandle<T> bubbles up the panic from the underlying task when awaited. This can be transitively applied across other APIs that also use Task<T> like Query::par_for_each and TaskPool::scope, bubbling up the panic until it's either caught or it reaches the main thread.

async_task::Task bubbles up panics already, this will transitively push panics all the way to the main thread.

Abort on Panic

The nuclear option. Log the error, abort the entire process on any thread in the task pool panicking. Definitely avoids any additional infrastructure for passing the panic around, and might actually lead to more efficient code as any unwinding is optimized out. However gives the developer zero options for dealing with the issue, a seemingly poor choice for debuggability, and prevents graceful shutdown of the process. Potentially an option for handling very low-level task management (a la #4740). Roughly takes the shape of:

struct AbortOnPanic;

impl Drop for AbortOnPanic {
   fn drop(&mut self) {
     abort!();
   }
}

let guard = AbortOnPanic;
// Run task
std::mem::forget(AbortOnPanic);

Changelog

Changed: bevy_tasks::TaskPool's threads will no longer terminate permanently when a task scheduled onto them panics.
Changed: bevy_tasks::Task andbevy_tasks::Scope will propagate panics in the spawned tasks/scopes to the parent thread.

@james7132 james7132 requested a review from hymm November 1, 2022 23:35
@james7132 james7132 added C-Bug An unexpected or incorrect behavior A-Tasks Tools for parallel and async work labels Nov 1, 2022
local_executor.tick().await;
}
};
// Use unwrap_err because we expect a Closed error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is no longer true.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is accurate once again, haha.

@alice-i-cecile
Copy link
Member

The error bubbling approach and the log-and-ignore patterns both seem useful for different use cases. Abort on panic seems... interesting, but really feels like something that should be opt-in (and I'm not sure it's worth the pervasive complexity maintaining both paths would take).

What would it take to be able to restart tasks if we discover that they failed?

@james7132
Copy link
Member Author

The error bubbling approach and the log-and-ignore patterns both seem useful for different use cases. Abort on panic seems... interesting, but really feels like something that should be opt-in (and I'm not sure it's worth the pervasive complexity maintaining both paths would take).

I would use all of them for different use cases:

  • error bubbling should be the default user-facing approach, it's by far the most flexible one.
  • log-and-ignore should be used whenever the user detatches a task
  • abort on panic should be the default when dealing with panics internal to bevy_tasks (a la Internalize task distinction into TaskPool #4740)

What would it take to be able to restart tasks if we discover that they failed?

The executor will drop the task, as it may be in a unrecoverable state after the panic. IMO users should handle the panic and reschedule the task(s) if they want to retry the same computation.

@james7132
Copy link
Member Author

Some further investigation:

  • Panics are by default logged even when caught with catch_unwind
  • async_task::Task will propagate the panic if the task has been dropped. Right now this is only possible if the original task panicked.

These two combined with the current implementation suggests that we can just straight up ignore any errors from catch_unwind as it will both log and bubble up as is, even for ParallelExecutor. Almost tempted to say we should merge this with a few small cleanups.

The Abort-on-Panic approach can be taken when we start diving deeper into making our own async executor.

@james7132 james7132 marked this pull request as ready for review November 2, 2022 05:27
@james7132 james7132 force-pushed the task-pool-panicking branch from 5ca752f to 70a0cc9 Compare November 2, 2022 06:09
@james7132 james7132 added the P-High This is particularly urgent, and deserves immediate attention label Nov 2, 2022
@james7132
Copy link
Member Author

Setting this to P-High since #2307 was labeled that.

Copy link
Member

@alice-i-cecile alice-i-cecile left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, based on your investigations this seems like a significant improvement already.

I'd much rather merge this and then add more sophisticated strategies later as needed than block on designing them right now.

Copy link
Contributor

@PROMETHIA-27 PROMETHIA-27 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, although this isn't my area of expertise.

crates/bevy_transform/src/systems.rs Outdated Show resolved Hide resolved
crates/bevy_ecs/src/system/system_piping.rs Show resolved Hide resolved
@james7132 james7132 added the S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it label Nov 2, 2022
@hymm
Copy link
Contributor

hymm commented Nov 2, 2022

getting intermittent failures on bevy_transform tests again with this PR
image

@hymm hymm removed the S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it label Nov 2, 2022
@james7132 james7132 requested a review from hymm November 2, 2022 20:48
@james7132
Copy link
Member Author

@hymm if you disable cargo test's output capture, do you see any panic messages?

@hymm
Copy link
Contributor

hymm commented Nov 2, 2022

a little hard to decipher as the output between different threads is interleaving, but I don't think there's any new info here
image

@hymm
Copy link
Contributor

hymm commented Nov 2, 2022

ran with quiet on too and it's more readable
image

Not sure why there isn't a thread 'Compute Task Pool (0)' panicked at error message

Copy link
Member

@JoJoJet JoJoJet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Edit: Didn't see the comments about the panicking.

@hymm
Copy link
Contributor

hymm commented Nov 2, 2022

I think the hierarchy test is failing due to some weird interaction between the 2 scopes. I tried to reproduce tasks disappearing with just the task pools and was unable to.

    AsyncComputeTaskPool::get()
        .spawn(async {
            loop {
                info!("this is a task");
                sleep(Duration::from_millis(1000));
                yield_now().await;
            }
        })
        .detach();
    AsyncComputeTaskPool::get()
        .spawn(async {
            sleep(Duration::from_millis(100));
            panic!("this is a panic");
        })
        .detach();

If we're trying to get this in before 0.9, I'd be happy to approve this with the hierarchy test changes reverted. As I can confirm that this PR is preventing the thread from getting killed and is an improvement over the status quo. In the meantime I'm going to investigate some more and see if I can get a reproduction using 2 scopes.

@james7132 james7132 added the S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it label Nov 2, 2022
@james7132
Copy link
Member Author

If we're trying to get this in before 0.9, I'd be happy to approve this with the hierarchy test changes reverted. As I can confirm that this PR is preventing the thread from getting killed and is an improvement over the status quo. In the meantime I'm going to investigate some more and see if I can get a reproduction using 2 scopes.

Done. Can you file a separate issue to track this?

@alice-i-cecile
Copy link
Member

bors r+

bors bot pushed a commit that referenced this pull request Nov 2, 2022
# Objective
Right now, the `TaskPool` implementation allows panics to permanently kill worker threads upon panicking. This is currently non-recoverable without using a `std::panic::catch_unwind` in every scheduled task. This is poor ergonomics and even poorer developer experience. This is exacerbated by #2250 as these threads are global and cannot be replaced after initialization.

Removes the need for temporary fixes like #4998. Fixes #4996. Fixes #6081. Fixes #5285. Fixes #5054. Supersedes #2307.

## Solution

The current solution is to wrap `Executor::run` in `TaskPool` with a `catch_unwind`, and discarding the potential panic. This was taken straight from [smol](/~https://github.com/smol-rs/smol/blob/404c7bcc0aea59b82d7347058043b8de7133241c/src/spawn.rs#L44)'s current implementation. ~~However, this is not entirely ideal as:~~
 
 - ~~the signaled to the awaiting task. We would need to change `Task<T>` to use `async_task::FallibleTask` internally, and even then it doesn't signal *why* it panicked, just that it did.~~ (See below).
 - ~~no error is logged of any kind~~ (See below)
 - ~~it's unclear if it drops other tasks in the executor~~ (it does not)
 - ~~This allows the ECS parallel executor to keep chugging even though a system's task has been dropped. This inevitably leads to deadlock in the executor.~~ Assuming we don't catch the unwind in ParallelExecutor, this will naturally kill the main thread.

### Alternatives
A final solution likely will incorporate elements of any or all of the following.

#### ~~Log and Ignore~~
~~Log the panic, drop the task, keep chugging. This only addresses the discoverability of the panic. The process will continue to run, probably deadlocking the executor. tokio's detatched tasks operate in this fashion.~~

Panics already do this by default, even when caught by `catch_unwind`.

#### ~~`catch_unwind` in `ParallelExecutor`~~
~~Add another layer catching system-level panics into the `ParallelExecutor`. How the executor continues when a core dependency of many systems fails to run is up for debate.~~

`async_task::Task`  bubbles up panics already, this will transitively push panics all the way to the main thread.

#### ~~Emulate/Copy `tokio::JoinHandle` with `Task<T>`~~
~~`tokio::JoinHandle<T>` bubbles up the panic from the underlying task when awaited. This can be transitively applied across other APIs that also use `Task<T>` like `Query::par_for_each` and `TaskPool::scope`, bubbling up the panic until it's either caught or it reaches the main thread.~~

`async_task::Task`  bubbles up panics already, this will transitively push panics all the way to the main thread.

#### Abort on Panic
The nuclear option. Log the error, abort the entire process on any thread in the task pool panicking. Definitely avoids any additional infrastructure for passing the panic around, and might actually lead to more efficient code as any unwinding is optimized out. However gives the developer zero options for dealing with the issue, a seemingly poor choice for debuggability, and prevents graceful shutdown of the process. Potentially an option for handling very low-level task management (a la #4740). Roughly takes the shape of:

```rust
struct AbortOnPanic;

impl Drop for AbortOnPanic {
   fn drop(&mut self) {
     abort!();
   }
}

let guard = AbortOnPanic;
// Run task
std::mem::forget(AbortOnPanic);
```

---

## Changelog

Changed: `bevy_tasks::TaskPool`'s threads  will no longer terminate permanently when a task scheduled onto them panics.
Changed: `bevy_tasks::Task` and`bevy_tasks::Scope` will propagate panics in the spawned tasks/scopes to the parent thread.
@bors bors bot changed the title TaskPool Panic Handling [Merged by Bors] - TaskPool Panic Handling Nov 2, 2022
@bors bors bot closed this Nov 2, 2022
@hymm
Copy link
Contributor

hymm commented Nov 3, 2022

FYI I'm not sure #5054 should have been closed as I think you'll still get different panic messages on the main thread depending on if a system was run on the main thread or not.

Can you file a separate issue to track this?

will do.

@hymm
Copy link
Contributor

hymm commented Nov 3, 2022

Can you file a separate issue to track this?

#6453

ItsDoot pushed a commit to ItsDoot/bevy that referenced this pull request Feb 1, 2023
# Objective
Right now, the `TaskPool` implementation allows panics to permanently kill worker threads upon panicking. This is currently non-recoverable without using a `std::panic::catch_unwind` in every scheduled task. This is poor ergonomics and even poorer developer experience. This is exacerbated by bevyengine#2250 as these threads are global and cannot be replaced after initialization.

Removes the need for temporary fixes like bevyengine#4998. Fixes bevyengine#4996. Fixes bevyengine#6081. Fixes bevyengine#5285. Fixes bevyengine#5054. Supersedes bevyengine#2307.

## Solution

The current solution is to wrap `Executor::run` in `TaskPool` with a `catch_unwind`, and discarding the potential panic. This was taken straight from [smol](/~https://github.com/smol-rs/smol/blob/404c7bcc0aea59b82d7347058043b8de7133241c/src/spawn.rs#L44)'s current implementation. ~~However, this is not entirely ideal as:~~
 
 - ~~the signaled to the awaiting task. We would need to change `Task<T>` to use `async_task::FallibleTask` internally, and even then it doesn't signal *why* it panicked, just that it did.~~ (See below).
 - ~~no error is logged of any kind~~ (See below)
 - ~~it's unclear if it drops other tasks in the executor~~ (it does not)
 - ~~This allows the ECS parallel executor to keep chugging even though a system's task has been dropped. This inevitably leads to deadlock in the executor.~~ Assuming we don't catch the unwind in ParallelExecutor, this will naturally kill the main thread.

### Alternatives
A final solution likely will incorporate elements of any or all of the following.

#### ~~Log and Ignore~~
~~Log the panic, drop the task, keep chugging. This only addresses the discoverability of the panic. The process will continue to run, probably deadlocking the executor. tokio's detatched tasks operate in this fashion.~~

Panics already do this by default, even when caught by `catch_unwind`.

#### ~~`catch_unwind` in `ParallelExecutor`~~
~~Add another layer catching system-level panics into the `ParallelExecutor`. How the executor continues when a core dependency of many systems fails to run is up for debate.~~

`async_task::Task`  bubbles up panics already, this will transitively push panics all the way to the main thread.

#### ~~Emulate/Copy `tokio::JoinHandle` with `Task<T>`~~
~~`tokio::JoinHandle<T>` bubbles up the panic from the underlying task when awaited. This can be transitively applied across other APIs that also use `Task<T>` like `Query::par_for_each` and `TaskPool::scope`, bubbling up the panic until it's either caught or it reaches the main thread.~~

`async_task::Task`  bubbles up panics already, this will transitively push panics all the way to the main thread.

#### Abort on Panic
The nuclear option. Log the error, abort the entire process on any thread in the task pool panicking. Definitely avoids any additional infrastructure for passing the panic around, and might actually lead to more efficient code as any unwinding is optimized out. However gives the developer zero options for dealing with the issue, a seemingly poor choice for debuggability, and prevents graceful shutdown of the process. Potentially an option for handling very low-level task management (a la bevyengine#4740). Roughly takes the shape of:

```rust
struct AbortOnPanic;

impl Drop for AbortOnPanic {
   fn drop(&mut self) {
     abort!();
   }
}

let guard = AbortOnPanic;
// Run task
std::mem::forget(AbortOnPanic);
```

---

## Changelog

Changed: `bevy_tasks::TaskPool`'s threads  will no longer terminate permanently when a task scheduled onto them panics.
Changed: `bevy_tasks::Task` and`bevy_tasks::Scope` will propagate panics in the spawned tasks/scopes to the parent thread.
@james7132 james7132 deleted the task-pool-panicking branch March 14, 2023 04:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-Tasks Tools for parallel and async work C-Bug An unexpected or incorrect behavior P-High This is particularly urgent, and deserves immediate attention S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it
Projects
None yet
5 participants