ParU: demo intermittently stuck indefinitely #474

Closed
mmuetzel opened this issue Oct 29, 2023 · 11 comments

@mmuetzel
Contributor

Describe the bug
When running make demos for ParU, execution occasionally gets stuck indefinitely in paru_demo.

To Reproduce
Build ParU and run make demos.
It only happens occasionally, so it might be a threading issue. I'll try to attach a debugger to the process the next time it happens. Maybe that can give a clue as to where it is stuck.

Expected behavior
The demo executable terminates in a finite time.

Desktop (please complete the following information):

  • OS: Windows
  • compiler: gcc.exe (Rev2, Built by MSYS2 project) 13.2.0
  • BLAS and LAPACK library: OpenBLAS with OpenMP threading
@DrTimothyAldenDavis
Owner

We haven't tested ParU on Windows until now, if I recall. This is using OpenMP 4.5 for ParU itself, correct?

@Aznaveh : We're making progress on getting the ParU cmake build system updated to the latest SuiteSparse, thanks to @mmuetzel's help.

One solution would be to disable OpenMP for ParU itself on Windows entirely, since it might take some time to track this down. ParU would then use parallelism inside the BLAS and LAPACK alone, on Windows. That's not ideal since ParU is meant to be able to factorize many frontal matrices in parallel, but this would at least be a stable temporary solution in order to get to a first stable release.
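
A minimal sketch of what "no OpenMP in ParU itself" could look like at the source level, assuming the code is written so that it still compiles when OpenMP is off. The names `frontal_threads` and `process_fronts` are hypothetical and are not ParU functions; the loop body only stands in for independent frontal-matrix work.

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: threads available for the library's own parallel
// regions. Without OpenMP this is 1, and any remaining parallelism would
// have to come from the BLAS/LAPACK library.
int frontal_threads() {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

// Hypothetical stand-in for processing independent fronts.
// When OpenMP is disabled the pragma is ignored and the loop runs serially.
long process_fronts(int nfronts) {
    assert(nfronts >= 0);
    long done = 0;
    #pragma omp parallel for reduction(+:done)
    for (int f = 0; f < nfronts; ++f) {
        done += 1;  // placeholder for the per-front work
    }
    return done;
}
```

Built without the OpenMP flag, `frontal_threads()` returns 1 and `process_fronts` runs as an ordinary serial loop, which is the behavior the temporary workaround would rely on.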

@mmuetzel
Contributor Author

mmuetzel commented Nov 1, 2023

Seems to have happened in CI here:
https://github.com/DrTimothyAldenDavis/SuiteSparse/actions/runs/6711348727/job/18238519007?pr=486#step:9:6007

The job timed out after 6 hours.

@mmuetzel
Contributor Author

mmuetzel commented Nov 3, 2023

Happened again: https://github.com/DrTimothyAldenDavis/SuiteSparse/actions/runs/6736593096/job/18312283038#step:9:4262

So far, this happened once for MINGW32 and three times for MINGW64 (afaict). It didn't happen on CLANG* or MSVC runners.
I wonder if that is already a pattern and what could be the reason for this.
Just speculating: There are two different (slightly incompatible) C runtimes for current versions of Windows. The (older) MSVCRT and the (newer) UCRT. The MINGW* environments are using the older MSVCRT. The other Windows runners are using the newer UCRT. (See also: https://www.msys2.org/docs/environments/)
Maybe something doesn't work quite right with the older MSVCRT?

This is still very speculative. But maybe that "pattern" solidifies if we wait a bit longer...

@DrTimothyAldenDavis
Owner

I will try to track it down, early next week or so.

ParU has some parallel data structures and I'm guessing we're missing a #pragma omp flush somewhere.

Some sort of race condition, anyway.
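
To illustrate the kind of race being guessed at here (this is not ParU code, and the actual cause was not confirmed in this thread), consider a shared counter of pending child tasks, where whichever task drops it to zero is supposed to continue with the parent. If the decrement and the read are not one atomic step, two threads can both miss the zero and nothing ever continues, which looks like a hang. The function and variable names below are hypothetical.

```cpp
#include <cassert>

// Hypothetical sketch: nchildren tasks finish in parallel; exactly one of
// them should observe that it was the last to finish.
int count_last_finishers(int nchildren) {
    assert(nchildren > 0);
    int pending = nchildren;  // shared: children still running
    int last_seen = 0;        // shared: how many tasks saw pending hit 0

    #pragma omp parallel for
    for (int c = 0; c < nchildren; ++c) {
        int remaining;
        // Decrement and read back in a single atomic step, so no two
        // threads can see the same value.
        #pragma omp atomic capture
        remaining = --pending;

        if (remaining == 0) {
            #pragma omp atomic update
            ++last_seen;
        }
    }
    return last_seen;  // 1 when the synchronization is correct
}
```

With the atomic capture in place, exactly one task sees `remaining == 0`. Splitting the decrement and the read into separate unsynchronized statements is the sort of bug that would only show up intermittently.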

@DrTimothyAldenDavis
Owner

ParU has extensive internal debug code that can print out data structures and status, run asserts, etc. I will turn that on (it takes a code edit, if I recall) and then the log should show me where it's stuck.

Tied up most of today though

@mmuetzel
Contributor Author

mmuetzel commented Nov 3, 2023

I opened #494 to keep the runners from being blocked for the full 6 hours in case this (or something similar) happens again.

@DrTimothyAldenDavis
Owner

I'm working on debugging this now, by enabling the ParU debug mode with its extensive printing. I forced on the GraphBLAS COMPACT mode to speed up the tests, temporarily.

I wonder if the old MSVCRT libraries are thread-safe. ParU uses various C++ libraries in parallel, in multiple threads, to do things inside individual openmp threads. If those libraries are not thread-safe, then this will fail.
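
As a rough sketch of the concern: if some library call turned out not to be safe to run from several OpenMP threads at once, one blunt workaround is to serialize just that call with a named critical section. The example below uses a shared `std::vector` purely as a stand-in for such a call; `collect_ids` and `ids_guard` are hypothetical names, and nothing here is taken from ParU.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch: many threads append to one shared container.
// std::vector::push_back on a shared vector is not thread-safe, so the
// call is wrapped in a named critical section.
std::vector<int> collect_ids(int n) {
    assert(n >= 0);
    std::vector<int> ids;
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        #pragma omp critical(ids_guard)
        ids.push_back(i);  // only one thread at a time executes this
    }
    return ids;  // order is unspecified when run in parallel
}
```

This trades parallel speed for safety, so it is more of a diagnostic than a fix: if the hang disappears once a suspect call is serialized, that call becomes the prime candidate.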

@mmuetzel
Contributor Author

mmuetzel commented Nov 4, 2023

I don't know if the MSVCRT libraries are different to UCRT when it comes to thread safety. I didn't find anything in this respect online. (But I might have used the wrong search terms.)

It might also be the compiler (not the C runtime) that makes the difference. On their "Environments" page (link in the comment above), they list "Native support for TLS (Thread-local storage)" for the LLVM/Clang compiler. That might mean that GCC does not have native support for TLS (whatever "native" means in these circumstances).
Could that be the reason why it's getting stuck intermittently? Does ParU use TLS?
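
For context, this is roughly what thread-local storage looks like in C++. Whether ParU or its dependencies rely on it is not established in this thread; the sketch only shows the language feature in question, and all names in it are hypothetical.

```cpp
#include <cassert>

// Hypothetical per-thread counter: every thread gets its own copy.
thread_local long tls_calls = 0;

// Stand-in for a routine that keeps some per-thread state.
void do_work() { ++tls_calls; }

// Runs ntasks calls to do_work() and adds up the per-thread counters.
long run_tasks(int ntasks) {
    assert(ntasks >= 0);
    long total = 0;
    #pragma omp parallel reduction(+:total)
    {
        tls_calls = 0;          // reset this thread's private copy
        #pragma omp for
        for (int i = 0; i < ntasks; ++i) {
            do_work();
        }
        total += tls_calls;     // each thread contributes its own count
    }
    return total;
}
```

Each thread increments only its own `tls_calls`, so no locking is needed and the per-thread counts add up to `ntasks`. How the compiler and runtime implement that per-thread copy is exactly where toolchains can differ.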

@mmuetzel
Contributor Author

mmuetzel commented Nov 4, 2023

I asked on the MSYS2 Discord. @mati865 proposed trying to statically link and/or build with Clang in the MINGW* environments of MSYS2 for a test.

mmuetzel — Today at 4:47 PM
Is TLS working with GCC on Windows?
On https://www.msys2.org/docs/environments/, it says for the LLVM/Clang environments:
Native support for TLS (Thread-local storage)
Does that mean it doesn't work with GCC?
mati865 — Today at 4:55 PM
GCC has only emuTLS (emulated TLS) which is a can of worms. It might work reasonably well within a single binary but when you try to use a TLS variable from another DLL it breaks in various ways.
mmuetzel — Today at 4:57 PM
Ah. Thanks. SuiteSparse is seeing random deadlocks with a recently added library in GCC environments. That could be an explanation maybe...
Does that mean that static linking might help?
mati865 — Today at 4:58 PM
I'd rather expect it to be the same issue LLVM testsuite has despite proper native TLS. TL;DR exiting from within DLLs is tricky and likely to hang in the destructors IIRC.
mati865 — Today at 4:59 PM
If it's the TLS then it might help, if it's the issue that I mentioned it will definitely help.
mmuetzel — Today at 5:00 PM
These random deadlocks didn't occur (yet?) on LLVM/Clang environments for that particular project.
mati865 — Today at 5:01 PM
Oh, that might be destructors ordering issue which is also borked on GCC but dunno whether it's caused by TLS.
I don't remember the details but you might try with Clang from mingw64/ucrt64 env. It uses emuTLS for compatibility with GCC.
IIRC in specific cases GCC's destructors were totally messed up, Clang's with emuTLS were not quite right and only "native" Clang worked fine.

mmuetzel added a commit to mmuetzel/SuiteSparse that referenced this issue Nov 5, 2023
GCC with emuTLS on Windows might have trouble with shared linking.
See DrTimothyAldenDavis#474.
@mmuetzel
Contributor Author

There weren't any cases where the CI got stuck on this issue in a while.
Assuming this was fixed by the changes in #500.
