Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Intermittent test failure on systems::test::correct_transforms_when_no_children in bevy_transform #6081

Closed
targrub opened this issue Sep 24, 2022 · 5 comments
Labels
A-Hierarchy Parent-child entity hierarchies A-Tasks Tools for parallel and async work C-Bug An unexpected or incorrect behavior C-Testing A change that impacts how we test Bevy or how users test their apps

Comments

@targrub
Copy link
Contributor

targrub commented Sep 24, 2022

Bevy version

Any recent commit, like fb74ca3 or a09dd03

Relevant system information

Windows 11, stable 1.64.0

What you did

cargo run -p ci on my machine

What went wrong

     Running unittests src\lib.rs (target\debug\deps\bevy_transform-15b5c6f3f18cfb6b.exe)

running 5 tests
test systems::test::did_propagate_command_buffer ... ok
test systems::test::did_propagate ... ok
test systems::test::correct_children ... ok
test systems::test::correct_transforms_when_no_children ... FAILED
test systems::test::panic_when_hierarchy_cycle - should panic ... ok

failures:

---- systems::test::correct_transforms_when_no_children stdout ----
thread 'systems::test::correct_transforms_when_no_children' panicked at 'assertion failed: `(left == right)`
  left: `0v0`,
 right: `2v0`: Malformed hierarchy. This probably means that your hierarchy has been improperly maintained, or contains a cycle', crates\bevy_transform\src\systems.rs:69:9
stack backtrace:
   0: std::panicking::begin_panic_handler
             at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library\std\src\panicking.rs:584
   1: core::panicking::panic_fmt
             at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library\core\src\panicking.rs:142
   2: core::fmt::Arguments::new_v1
             at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library\core\src\fmt\mod.rs:394
   3: core::panicking::assert_failed_inner
             at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library\core\src\panicking.rs:218
   4: core::panicking::assert_failed<bevy_ecs::entity::Entity,bevy_ecs::entity::Entity>
             at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52\library\core\src\panicking.rs:181
   5: bevy_transform::systems::propagate_recursive
             at .\src\systems.rs:69
   6: bevy_transform::systems::transform_propagate_system
             at .\src\systems.rs:38
   7: core::ops::function::FnMut::call_mut<void (*)(bevy_ecs::system::query::Query<tuple$<enum$<core::option::Option<tuple$<ref$<bevy_hierarchy::components::children::Children>,bevy_ecs::query::filter::Changed<bevy_hierarchy::components::children::Children> > >
             at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52\library\core\src\ops\function.rs:164
   8: core::ops::function::impls::impl$3::call_mut<tuple$<bevy_ecs::system::query::Query<tuple$<enum$<core::option::Option<tuple$<ref$<bevy_hierarchy::components::children::Children>,bevy_ecs::query::filter::Changed<bevy_hierarchy::components::children::Childre
             at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52\library\core\src\ops\function.rs:290
  ...
  29: bevy_transform::systems::test::correct_transforms_when_no_children
             at .\src\systems.rs:303
  30: bevy_transform::systems::test::correct_transforms_when_no_children::closure$0
             at .\src\systems.rs:278
  31: core::ops::function::FnOnce::call_once<bevy_transform::systems::test::correct_transforms_when_no_children::closure_env$0,tuple$<> >
             at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52\library\core\src\ops\function.rs:248
  32: core::ops::function::FnOnce::call_once
             at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library\core\src\ops\function.rs:248
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.


failures:
    systems::test::correct_transforms_when_no_children

test result: FAILED. 4 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.04s

error: test failed, to rerun pass '-p bevy_transform --lib'
thread 'main' panicked at 'Please fix failing tests in output above.: command exited with non-zero code `cargo test --workspace --lib --bins --tests --benches`: 101', tools\ci\src\main.rs:103:14
stack backtrace:
   0: std::panicking::begin_panic_handler
             at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library\std\src\panicking.rs:584
   1: core::panicking::panic_fmt
             at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library\core\src\panicking.rs:142
   2: core::result::unwrap_failed
             at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library\core\src\result.rs:1814
   3: enum$<core::result::Result<tuple$<>,xshell::error::Error>, 1, 18446744073709551615, Err>::expect<tuple$<>,xshell::error::Error>
             at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52\library\core\src\result.rs:1064
   4: ci::main
             at .\tools\ci\src\main.rs:101
   5: core::ops::function::FnOnce::call_once<void (*)(),tuple$<> >
             at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52\library\core\src\ops\function.rs:248
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
error: process didn't exit successfully: `target\debug\ci.exe` (exit code: 101)

Additional information

The same command almost always works ok (i.e., the test passes) if I run it immediately afterwards.

@targrub targrub added C-Bug An unexpected or incorrect behavior S-Needs-Triage This issue needs to be labelled labels Sep 24, 2022
@hymm
Copy link
Contributor

hymm commented Sep 24, 2022

I think this is the same issue as #6036.

@targrub
Copy link
Contributor Author

targrub commented Sep 24, 2022

Interesting. I've never had a failure in the did_propagate test, always in this one. But the commonality of the assert, Malformed hierarchy. This probably means that your hierarchy has been improperly maintained, is a gun producing some smoke.

@hymm
Copy link
Contributor

hymm commented Sep 24, 2022

it's an issue with the panicking test. When the test panics it causes other test's systems not to run if they're on the thread that panics.

@alice-i-cecile alice-i-cecile added C-Testing A change that impacts how we test Bevy or how users test their apps A-Tasks Tools for parallel and async work A-Hierarchy Parent-child entity hierarchies and removed S-Needs-Triage This issue needs to be labelled labels Sep 26, 2022
@oddfacade
Copy link
Contributor

i think this may also affect systems::test::correct_children as i'm getting an intermittent failure there too.

@mockersf
Copy link
Member

mockersf commented Oct 5, 2022

closing for #5285

@mockersf mockersf closed this as completed Oct 5, 2022
bors bot pushed a commit that referenced this issue Nov 2, 2022
# Objective
Right now, the `TaskPool` implementation allows panics to permanently kill worker threads upon panicking. This is currently non-recoverable without using a `std::panic::catch_unwind` in every scheduled task. This is poor ergonomics and even poorer developer experience. This is exacerbated by #2250 as these threads are global and cannot be replaced after initialization.

Removes the need for temporary fixes like #4998. Fixes #4996. Fixes #6081. Fixes #5285. Fixes #5054. Supersedes #2307.

## Solution

The current solution is to wrap `Executor::run` in `TaskPool` with a `catch_unwind`, and discarding the potential panic. This was taken straight from [smol](/~https://github.com/smol-rs/smol/blob/404c7bcc0aea59b82d7347058043b8de7133241c/src/spawn.rs#L44)'s current implementation. ~~However, this is not entirely ideal as:~~
 
 - ~~the signaled to the awaiting task. We would need to change `Task<T>` to use `async_task::FallibleTask` internally, and even then it doesn't signal *why* it panicked, just that it did.~~ (See below).
 - ~~no error is logged of any kind~~ (See below)
 - ~~it's unclear if it drops other tasks in the executor~~ (it does not)
 - ~~This allows the ECS parallel executor to keep chugging even though a system's task has been dropped. This inevitably leads to deadlock in the executor.~~ Assuming we don't catch the unwind in ParallelExecutor, this will naturally kill the main thread.

### Alternatives
A final solution likely will incorporate elements of any or all of the following.

#### ~~Log and Ignore~~
~~Log the panic, drop the task, keep chugging. This only addresses the discoverability of the panic. The process will continue to run, probably deadlocking the executor. tokio's detatched tasks operate in this fashion.~~

Panics already do this by default, even when caught by `catch_unwind`.

#### ~~`catch_unwind` in `ParallelExecutor`~~
~~Add another layer catching system-level panics into the `ParallelExecutor`. How the executor continues when a core dependency of many systems fails to run is up for debate.~~

`async_task::Task`  bubbles up panics already, this will transitively push panics all the way to the main thread.

#### ~~Emulate/Copy `tokio::JoinHandle` with `Task<T>`~~
~~`tokio::JoinHandle<T>` bubbles up the panic from the underlying task when awaited. This can be transitively applied across other APIs that also use `Task<T>` like `Query::par_for_each` and `TaskPool::scope`, bubbling up the panic until it's either caught or it reaches the main thread.~~

`async_task::Task`  bubbles up panics already, this will transitively push panics all the way to the main thread.

#### Abort on Panic
The nuclear option. Log the error, abort the entire process on any thread in the task pool panicking. Definitely avoids any additional infrastructure for passing the panic around, and might actually lead to more efficient code as any unwinding is optimized out. However gives the developer zero options for dealing with the issue, a seemingly poor choice for debuggability, and prevents graceful shutdown of the process. Potentially an option for handling very low-level task management (a la #4740). Roughly takes the shape of:

```rust
struct AbortOnPanic;

impl Drop for AbortOnPanic {
   fn drop(&mut self) {
     abort!();
   }
}

let guard = AbortOnPanic;
// Run task
std::mem::forget(AbortOnPanic);
```

---

## Changelog

Changed: `bevy_tasks::TaskPool`'s threads  will no longer terminate permanently when a task scheduled onto them panics.
Changed: `bevy_tasks::Task` and`bevy_tasks::Scope` will propagate panics in the spawned tasks/scopes to the parent thread.
ItsDoot pushed a commit to ItsDoot/bevy that referenced this issue Feb 1, 2023
# Objective
Right now, the `TaskPool` implementation allows panics to permanently kill worker threads upon panicking. This is currently non-recoverable without using a `std::panic::catch_unwind` in every scheduled task. This is poor ergonomics and even poorer developer experience. This is exacerbated by bevyengine#2250 as these threads are global and cannot be replaced after initialization.

Removes the need for temporary fixes like bevyengine#4998. Fixes bevyengine#4996. Fixes bevyengine#6081. Fixes bevyengine#5285. Fixes bevyengine#5054. Supersedes bevyengine#2307.

## Solution

The current solution is to wrap `Executor::run` in `TaskPool` with a `catch_unwind`, and discarding the potential panic. This was taken straight from [smol](/~https://github.com/smol-rs/smol/blob/404c7bcc0aea59b82d7347058043b8de7133241c/src/spawn.rs#L44)'s current implementation. ~~However, this is not entirely ideal as:~~
 
 - ~~the signaled to the awaiting task. We would need to change `Task<T>` to use `async_task::FallibleTask` internally, and even then it doesn't signal *why* it panicked, just that it did.~~ (See below).
 - ~~no error is logged of any kind~~ (See below)
 - ~~it's unclear if it drops other tasks in the executor~~ (it does not)
 - ~~This allows the ECS parallel executor to keep chugging even though a system's task has been dropped. This inevitably leads to deadlock in the executor.~~ Assuming we don't catch the unwind in ParallelExecutor, this will naturally kill the main thread.

### Alternatives
A final solution likely will incorporate elements of any or all of the following.

#### ~~Log and Ignore~~
~~Log the panic, drop the task, keep chugging. This only addresses the discoverability of the panic. The process will continue to run, probably deadlocking the executor. tokio's detatched tasks operate in this fashion.~~

Panics already do this by default, even when caught by `catch_unwind`.

#### ~~`catch_unwind` in `ParallelExecutor`~~
~~Add another layer catching system-level panics into the `ParallelExecutor`. How the executor continues when a core dependency of many systems fails to run is up for debate.~~

`async_task::Task`  bubbles up panics already, this will transitively push panics all the way to the main thread.

#### ~~Emulate/Copy `tokio::JoinHandle` with `Task<T>`~~
~~`tokio::JoinHandle<T>` bubbles up the panic from the underlying task when awaited. This can be transitively applied across other APIs that also use `Task<T>` like `Query::par_for_each` and `TaskPool::scope`, bubbling up the panic until it's either caught or it reaches the main thread.~~

`async_task::Task`  bubbles up panics already, this will transitively push panics all the way to the main thread.

#### Abort on Panic
The nuclear option. Log the error, abort the entire process on any thread in the task pool panicking. Definitely avoids any additional infrastructure for passing the panic around, and might actually lead to more efficient code as any unwinding is optimized out. However gives the developer zero options for dealing with the issue, a seemingly poor choice for debuggability, and prevents graceful shutdown of the process. Potentially an option for handling very low-level task management (a la bevyengine#4740). Roughly takes the shape of:

```rust
struct AbortOnPanic;

impl Drop for AbortOnPanic {
   fn drop(&mut self) {
     abort!();
   }
}

let guard = AbortOnPanic;
// Run task
std::mem::forget(AbortOnPanic);
```

---

## Changelog

Changed: `bevy_tasks::TaskPool`'s threads  will no longer terminate permanently when a task scheduled onto them panics.
Changed: `bevy_tasks::Task` and`bevy_tasks::Scope` will propagate panics in the spawned tasks/scopes to the parent thread.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-Hierarchy Parent-child entity hierarchies A-Tasks Tools for parallel and async work C-Bug An unexpected or incorrect behavior C-Testing A change that impacts how we test Bevy or how users test their apps
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants