[pacman] getting stuck on clang-aarch64 #4340
Comments
Does the CI script terminate msys2 processes after update? See https://github.com/msys2/setup-msys2/blob/8b0d40b8912601756301a7b3de7752d5dba969cd/main.js#L408 In that main.js file, |
There is also a bug (presumably in msys2-runtime), which I have never been able to debug, that manifests as processes hanging around when they should have exited. This usually seems to happen when pacman uses gpgme to attempt to validate signatures. As a workaround, I disable the validation of database signatures in pacman.conf, because the database signature verification seems to happen every time pacman is run, whereas package verification is only done when a package is being installed. I still see occasional hangups in package verification though. See also msys2/msys2-autobuild#62 which I think is the closest thing to an existing bug tracking this. Workarounds I currently apply:
REM https://github.com/msys2/msys2-autobuild/issues/62
CALL C:\msys64\msys2_shell.cmd -defterm -no-start -c "mkdir -p /etc/pacman.d/hooks && touch /etc/pacman.d/hooks/texinfo-{install,remove}.hook"
REM the caret is messing with CMD parsing, try it another way
C:\msys64\usr\bin\sed.exe -i -e 's/^^\(SigLevel\s\+=\s\+Required\)\s*$/\1 DatabaseNever/' /etc/pacman.conf |
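For anyone applying the same `SigLevel` change from inside an MSYS2 shell rather than from CMD, the caret should not need doubling. A minimal sketch, assuming GNU sed and working on a scratch file so the effect can be checked before touching /etc/pacman.conf; the function name `disable_db_sig_check` and the sample file contents are hypothetical:

```shell
# Hypothetical helper: append DatabaseNever to a "SigLevel = Required" line
# in the file given as the first argument (GNU sed assumed for \s and \+).
disable_db_sig_check() {
  sed -i -e 's/^\(SigLevel\s\+=\s\+Required\)\s*$/\1 DatabaseNever/' "$1"
}

# Try it on a scratch file instead of the real /etc/pacman.conf.
conf=$(mktemp)
printf '[options]\nSigLevel    = Required\n' > "$conf"
disable_db_sig_check "$conf"
cat "$conf"
```

After the call, the `SigLevel` line in the scratch file should end in `Required DatabaseNever`. Running the helper a second time should leave the file unchanged, since the pattern only matches a line that ends in `Required`.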
We are now doing something similar, which seems to get us further, but it then runs into a database lock. On my own Arm Dev Kit, I just updated pacman, which went without problems. However, it is now stuck compiling part of GIMP (I think I've seen this before too). So this getting stuck is probably not specific to pacman. Looking in Process Explorer, the innermost process is env.exe. Could it be related to reading/setting env vars, which I seem to remember can have problems with multiple threads? |
The database lock file |
Hi. The database lock is being investigated externally (this is not an MSYS2 bug). The problem is that, even when the database is not a concern, we can't kill pacman easily. See: https://gitlab.gnome.org/GNOME/gimp/-/jobs/3458478#L157 |
The msys2 related processes should be terminated outside of msys2 environment. For example, |
I tried before with taskkill but the exit code makes the job fail. |
OK, I am out of ideas then. By the way, @hmartinez82 has done some great work of porting apps to aarch64. He may suggest some ideas. |
I wish somebody with good knowledge of low-level debugging on WoA (and/or of Cygwin) could debug this, I have tried and had no luck (I always got an error getting the context of the main thread, from every debugger I tried: windbg, gdb, lldb). |
I even see this happening, randomly, on my personal laptop when using pacman. @Biswa96 I'm not a low-level debugger. Actually I'm thinking about installing Tailscale in that VM and letting someone else with more expertise take a look. |
Hi all! Now we have new runners contributed by Arm Ltd., in addition to the one by @hmartinez82. And this pacman getting stuck issue is also happening randomly on their runners. Is there anything we could tell the admins at Arm to look for in order to help you debug this issue? |
@Jehan I'm glad they are having it too, so we now know it's not just my runner. I don't know what the issue is. |
MSYS2 pacman gets randomly stuck on Windows/Aarch64. The actual issue is still being investigated by upstream projects, but it's bad for us right now, to the point that there are discussions in #10729 about removing Aarch64 support from the Windows installer (even though it was only added recently!). This is an attempt at a workaround. Instead of getting stuck forever and waiting until the whole job times out (per GitLab CI settings), I time out the pacman command within our script and try again, up to 2 more times. Hopefully one of the calls will succeed. See: msys2/MSYS2-packages#4340
MSYS2 pacman gets randomly stuck on Windows/Aarch64. The actual issue is still being investigated by upstream projects, but it's bad for us right now, to the point that there are discussions in #10729 about removing Aarch64 support from the Windows installer (even though it was only added recently!). This is an attempt at a workaround. Instead of getting stuck forever and waiting until the whole job times out (per GitLab CI settings), I time out the pacman command after 3 minutes within our script and try again, up to 2 more times. Hopefully one of the calls will succeed. I also send a SIGKILL through the timeout (though I have no idea how signals translate to Windows processes) and run taskkill again after this, which may seem like overkill. Interestingly I get output for both, which seems to indicate that the kill succeeds in both cases (because of several processes?). Anyway, this is clearly a bit of random code that is not completely understood, but the inability to test all this locally doesn't help, so it's good enough for the time being. See: msys2/MSYS2-packages#4340
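The timeout-and-retry idea described in this commit message can be sketched roughly as follows. This is not GIMP's actual CI script: the function name `run_with_retries`, its argument order, and the values in the usage note are hypothetical, and the sketch assumes a `timeout` command with the `-s` option (as in GNU coreutils) is on the PATH. How the KILL signal maps onto Windows processes under MSYS2 is left open here, just as it is in the message above.

```shell
# Hypothetical helper: run a command under a time limit and retry it
# when it fails or is killed for exceeding the limit.
# Usage: run_with_retries <attempts> <time-limit> <command> [args...]
run_with_retries() {
  attempts=$1
  limit=$2
  shift 2
  try=1
  while [ "$try" -le "$attempts" ]; do
    # -s KILL asks timeout to send SIGKILL once the limit expires;
    # timeout then returns a non-zero status, which triggers a retry.
    if timeout -s KILL "$limit" "$@"; then
      return 0
    fi
    echo "attempt $try of $attempts failed or timed out: $*" >&2
    try=$((try + 1))
  done
  return 1
}
```

With this helper, something like `run_with_retries 3 3m pacman --noconfirm -Suy` would correspond to one initial attempt plus up to 2 retries, each limited to 3 minutes. The extra taskkill step mentioned above is not shown.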
…64 jobs. This is the command suggested by MSYS2 developers here: msys2/MSYS2-packages#4340 (comment) They also say to run it outside the MSYS2 environment, which is why it's in the CI rules, not in the shell script.
…64 jobs. This is the command suggested by MSYS2 developers here: msys2/MSYS2-packages#4340 (comment) They also say to run it outside the MSYS2 environment, which is why it's in the CI rules, not in the shell script. Honestly at this point, it feels like we are just stacking weird workarounds to get it to fail not too often. ;-(
Just a heads-up: this has been fixed by @jeremyd2019's patch that made it into |
After the fix, pacman is getting stuck on x64 and x86 MSYSTEMs at the gpg phase on @lazka's runners |
@jeremyd2019 Have you experienced this? 🤔 |
There's this in
I have seen some differences with other overlapped I/O functions between x64 and arm64 systems, namely Could it be that |
Got two hangs in CI just now:
gpgme/libgpg-error also got updated and pacman rebuilt recently (should we open a new issue for this?) |
Which reminds me that we had a stuck job in autobuild some days ago (2024-11-17): https://github.com/msys2/msys2-autobuild/actions/runs/11876484140/job/33094866915 in a different place though (?) |
Probably. I am going to suspect the CancelSynchronousIo call, without any further evidence. Maybe there is some other synchronous io on the thread that is canceled, that does not result in the thread exiting as expected. It looks like it's hanging in pacman/gpgme so maybe I'll start a loop of pacman -Suu and see if it hangs so I can debug. |
If you think it's worth keeping the CancelSynchronousIo call only for ARM64, with the decision at runtime then IsWow64Process2 is an option. |
I don't think it's necessary at all: the SuspendThread/GetThreadContext dance fixed the ARM64 issue, and the CancelSynchronousIo call was an attempt to avoid the need for TerminateThread altogether. If I can get some insight into what's going wrong, and it is due to CancelSynchronousIo somehow, it'd be better to just revert that part of the patch. I sort of suspect whoever got that stack trace in the image above attached to the "wrong" process in the tree, though. That stack looks like a process in the |
I was able to reproduce and poke in the debugger. wait_thread is happily waiting in Using gdb to call |
Probably good to let @dscho know as well |
Probably makes sense to undo the |
It appears this is causing hangs on native x86_64 in similar scenarios as the hangs on ARM64. Addresses: msys2/MSYS2-packages#4340
It appears this is causing hangs on native x86_64 in similar scenarios as the hangs on ARM64. Addresses: msys2/MSYS2-packages#4340 (comment)
It appears this is causing hangs on native x86_64 in similar scenarios as the hangs on ARM64, because `CancelSynchronousIo` is returning `TRUE` but not canceling the `ReadFile` call as expected. Addresses: msys2/MSYS2-packages#4340 (comment) Fixes: b091b47 ("cygthread: suspend thread before terminating.")
I did a quick commit last night (my time) and fired off some tests on x86_64 and ARM64 while I slept. There were no hangs on either architecture, though I saw a couple of errors like this on ARM64: these didn't cause any other hang, error output, or abort the while loop... |
It appears this is causing hangs on native x86_64 in similar scenarios as the hangs on ARM64, because `CancelSynchronousIo` is returning `TRUE` but not canceling the `ReadFile` call as expected. Addresses: msys2/MSYS2-packages#4340 (comment) Fixes: b091b47 ("cygthread: suspend thread before terminating.") Signed-off-by: Jeremy Drake <cygwin@jdrake.com>
This change seems to have caused hangs on x86_64, so let's revert it. Addresses msys2#4340 (comment) and corresponds to msys2/msys2-runtime#243. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
This change seems to have caused hangs on x86_64, so let's revert it. Addresses #4340 (comment) and corresponds to msys2/msys2-runtime#243. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
The fix for the x86_64 hang is merged now, and it looks like msys2-runtime 3.5.4-7 is already in the repo. |
Looks like a hang on https://github.com/msys2/msys2-autobuild/actions/runs/11990401626/job/33427879361 not sure why though. UPDATE: maybe not hung, just uploading really really slowly? |
Yes it looks like it goes like 1 package per 20 minutes or something. |
It appears this is causing hangs on native x86_64 in similar scenarios as the hangs on ARM64, because `CancelSynchronousIo` is returning `TRUE` but not canceling the `ReadFile` call as expected. Cherry-picked from msys2/msys2-runtime's 2eb6be14ee (Cygwin: revert use of CancelSyncronousIo on wait_thread., 2024-11-21). Addresses: msys2/MSYS2-packages#4340 (comment) Fixes: b091b47b9e56 ("cygthread: suspend thread before terminating.") Signed-off-by: Jeremy Drake <cygwin@jdrake.com> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
This change seems to have caused hangs on x86_64, so let's revert it. Addresses msys2/MSYS2-packages#4340 (comment) and corresponds to msys2/msys2-runtime#243. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Description / Steps to reproduce the issue
For about a month, GIMP's aarch64 CI runner has been getting stuck when running `pacman --noconfirm -Suy`.
The last job that succeeded was Dec 11 and another from the same day and any later one is failing.
Most of the time (e.g. here) it already seems to stop before the databases are updated:
Sometimes it gets a little further:
When testing on my Ms Dev kit now it did not get stuck (but I do remember seeing that sometimes in the past). However, when checking with Process Explorer, I do see that after pacman closed the terminal, the pacman and conhost processes are still running.
Expected behavior
Pacman finishes after doing its thing.
Actual behavior
Pacman gets stuck
Verification
Windows Version
MSYS64_NT-10.0-22621
Are you willing to submit a PR?
No response