CRASH from new glibc 2.35 rseq on any app (-disable_rseq solves) #5431
Comments
Inlining stacktrace for searching:
|
Could you clarify which instrumentation is slow? The -disable_rseq option should not affect tool-added instrumentation. It affects the application's code paths, disabling things like per-cpu caches, which can affect performance by forcing per-thread caches, but we're talking small percentages on large many-threaded applications. I would expect zero observable impact on the apps listed above. Could you show precise command lines and times?
I'm currently testing I must correct myself, the slowdown was because I was still running a debug build of DynamoRIO. Now with the release build, I don't experience much difference in performance between my host machine running |
I should have suggested |
I can reproduce this on Ubuntu 22.04 on some apps, like |
Here's the stack trace for the crash:
I think there might be an issue in our heuristics for locating the thread's struct rseq ( |
I see that we set In dynamorio/core/unix/rseq_linux.c Line 732 in c233329
In dynamorio/core/unix/rseq_linux.c Line 675 in c233329
Also, we expect both of them to give the same value, and if they don't we throw this error: dynamorio/core/unix/rseq_linux.c Line 687 in c233329
I found the following glibc thread that discusses "put the rseq area into struct pthread, not into a initial-exec TLS" |
Are they at least following the convention of having Static TLS is assumed for marking our own region restartable (https://dynamorio.org/page_rseq.html#autotoc_md267) and that assumption is supposedly checked (I think the error you point to above) -- why didn't the error check fire? |
Probably not. I'm also seeing |
That is not good: if the conventions are not followed we might have no hope of handling it at all. We would need to push for changes to glibc. I wonder whether Mathieu Desnoyers or others who set up the conventions were involved in any of the glibc decisions.
Re-examining this: the original conventions may have only said the data had to be in the ELF file in that section, not inside a loadable segment. Is there such a section here in the file, and the problem is just that the section is not in a loaded segment?
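The question above (is the section present in the file but outside any loaded segment?) comes down to a range check between a section header and the program headers. The following is a minimal sketch of that check, not DynamoRIO's code: the function name section_is_loaded is hypothetical, and it assumes a 64-bit ELF whose headers have already been read into memory.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the section's address range lies entirely
 * inside one PT_LOAD segment, and 0 otherwise. */
int
section_is_loaded(const Elf64_Shdr *sec, const Elf64_Phdr *phdrs, size_t num_phdrs)
{
    assert(sec != NULL && (phdrs != NULL || num_phdrs == 0));
    /* A section without SHF_ALLOC does not occupy memory at run time. */
    if ((sec->sh_flags & SHF_ALLOC) == 0)
        return 0;
    for (size_t i = 0; i < num_phdrs; i++) {
        const Elf64_Phdr *ph = &phdrs[i];
        if (ph->p_type != PT_LOAD)
            continue;
        /* The section must start and end within the segment's memory image. */
        if (sec->sh_addr >= ph->p_vaddr &&
            sec->sh_addr + sec->sh_size <= ph->p_vaddr + ph->p_memsz)
            return 1;
    }
    return 0;
}
```

A caller would locate the section of interest by name through the section header string table and pass its header here. A result of 0 for a section that exists in the file would match the scenario raised in the comment.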
Adds a workaround for the SIGFPE in glibc 2.34+ __libc_early_init() by setting two ld.so globals located via hardcoded offsets, making this fragile and considered temporary. (Improvements might include decoding __libc_early_init or other functions to find the offsets, which is also fragile; making runtime options to set them for a non-rebuild fix; disabling the call to __libc_early_init which doesn't seem to be needed for 2.34).

Tested on glibc 2.34 where every libc-using client crashes with SIGFPE but they work with this fix.

Adds an Ubuntu22 GA CI run but it has many failures due to the rseq issue #5431. Adds a workaround for this by having drrun set -disable_rseq if it detects glibc 2.35+. Even with this we have a number of test failures so for now we use a label to just run 4 sanity-check tests. This should be enough to detect glibc changes that break the offsets here.

Issue: #5437, #5431
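The "set -disable_rseq if it detects glibc 2.35+" step can be illustrated with a version comparison. This is a sketch under assumptions, not the code drrun uses: version_at_least and should_disable_rseq are hypothetical names, and it checks the libc of the running process via glibc's gnu_get_libc_version(), which matches the target application's libc only when both use the system glibc.

```c
#include <assert.h>
#include <stdio.h>

#ifdef __GLIBC__
#    include <gnu/libc-version.h> /* gnu_get_libc_version() */
#endif

/* Hypothetical helper: returns 1 if a "major.minor" version string is at least
 * min_major.min_minor, and 0 if it is older or cannot be parsed. */
int
version_at_least(const char *version, int min_major, int min_minor)
{
    int major = 0, minor = 0;
    assert(version != NULL);
    if (sscanf(version, "%d.%d", &major, &minor) != 2)
        return 0;
    return major > min_major || (major == min_major && minor >= min_minor);
}

/* Hypothetical launcher-side policy: returns 1 if the running libc is
 * glibc 2.35 or newer, the versions that register their own struct rseq. */
int
should_disable_rseq(void)
{
#ifdef __GLIBC__
    return version_at_least(gnu_get_libc_version(), 2, 35);
#else
    return 0; /* Not glibc, so the glibc 2.35 rseq registration does not apply. */
#endif
}
```

Comparing major and minor as integers matters here: a plain string comparison would order "2.4" after "2.35".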
Updates DR to cacb5424e for workarounds for 2 Ubuntu22 issues (glibc SIGFPE and rseq failure). Issue: DynamoRIO/dynamorio#5437, DynamoRIO/dynamorio#5431
A few observations:
This prevents glibc from registering its own rseq_cs; essentially disables the glibc rseq support. Looking at the Old:
New:
To support older glibcs, I think some of this fix needs to be guarded based on the value set for |
Update: I have a local fix where I detect whether glibc's rseq support is enabled, and if it is, adjust some logic of locating the struct rseq and processing modules. I'll try pushing it for review today.
Fixes issues with DR's rseq handling in glibc 2.35+.

Glibc 2.35 added support for the Linux rseq feature. See https://lwn.net/Articles/883104/ for details. TL;DR: glibc registers its own struct rseq at init time, and stores its offset from the thread pointer in __rseq_offset. The glibc-registered struct rseq is present in the struct pthread. If glibc's rseq support isn't available, either due to some issue or because the user disabled it by exporting GLIBC_TUNABLES=glibc.pthread.rseq=0, it will set __rseq_size to zero.

Improves the heuristic to find the registered struct rseq. For the glibc-support case: on AArch64, it is at a negative offset from the app lib seg base, whereas on x86 it's at a positive offset. On both AArch64 and x86, the offset has the opposite sign from what it would have if the app registered the struct rseq manually in its static TLS (which happens for older glibc and when glibc's rseq support is disabled).

Detects whether the glibc rseq support is enabled by looking at the sign of the struct rseq offset.

Removes the drrun -disable_rseq workaround added by #5695.

Adjusts the linux.rseq test to get the struct rseq registered by glibc, when it's available. Also fixes some issues in the test.

Adds the Ubuntu_22 tag to rseq tests so that they are enabled. Our Ubuntu-20 CI tests the case without rseq support in glibc, where the app registers the struct rseq. This also helps test the case where the app is not using glibc. Also, our Ubuntu-22 CI tests the case with glibc rseq support. Manually tested the disabled rseq support case on glibc 2.35, but not adding a CI version of it.

Fixes #5431
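The sign-based detection described in this commit message can be sketched as a small classifier. This is an illustration only, not the code in rseq_linux.c: arch_t, glibc_rseq_enabled, and rseq_address are hypothetical names, and the offset is assumed to be measured from the app's lib segment base to the registered struct rseq.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical architecture tag for this sketch. */
typedef enum { ARCH_X86, ARCH_AARCH64 } arch_t;

/* Returns 1 if the sign of the struct rseq offset indicates that glibc
 * registered it (inside struct pthread), and 0 if it indicates that the app
 * registered it in static TLS.
 * Per the description: glibc-registered is positive on x86 and negative on
 * AArch64, and app-registered has the opposite sign on each. */
int
glibc_rseq_enabled(arch_t arch, ptrdiff_t rseq_offset)
{
    assert(arch == ARCH_X86 || arch == ARCH_AARCH64);
    if (arch == ARCH_X86)
        return rseq_offset > 0;
    return rseq_offset < 0;
}

/* Computes the struct rseq address from the segment base and a signed offset. */
void *
rseq_address(void *seg_base, ptrdiff_t rseq_offset)
{
    return (char *)seg_base + rseq_offset;
}
```

Keeping the offset as a signed ptrdiff_t is the important design point: storing it in an unsigned type would discard the sign that the heuristic depends on.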
Pulls in a fix for DynamoRIO/dynamorio#5431 where the rseq feature in glibc 2.35 broke DR's rseq support.
Describe the bug
A crash occurs when running DynamoRIO to instrument an application.
./drrun -- grep (no client at all) results in a crash.
Note that running trivial applications like ls or a simple hello world program does not result in a crash.
Small list of applications that also do not work: vim, vi (it crashes at the moment you type anything), less, more.
Small list of applications that do work: ls, uname, cat.
To Reproduce
Steps to reproduce the behavior:
grep that comes with every Linux distribution.
./drrun -- grep
See above
I can reproduce on a fresh Arch Linux environment.
Same result with or without client: crash
Same result
Expected behavior
No crash, correct instrumentation.
Versions
What version of DynamoRIO are you using?
Tested the 9.0.1 release and also a fresh build on master.
Does the latest build from https://github.com/DynamoRIO/dynamorio/releases solve the problem?
No
What operating system version are you running on?
Manjaro Linux (derivative of Arch Linux)
Is your application 32-bit or 64-bit?
64 bit
Operating System: Manjaro Linux
KDE Plasma Version: 5.24.3
KDE Frameworks Version: 5.91.0
Qt Version: 5.15.3
Kernel Version: 5.16.14-1-MANJARO (64-bit)
Graphics Platform: X11
Processors: 8 × Intel® Core™ i5-8250U CPU @ 1.60GHz
Memory: 7.6 GiB of RAM
Graphics Processor: Mesa Intel® UHD Graphics 620
Additional context
This is the same bug as described in https://groups.google.com/g/dynamorio-users/c/eq5zD824QwY
The problem might be related to rseq.
Also, one observation I made is that it could be related to the recent update of Arch Linux to glibc version 2.35. As a small test I downgraded to 2.33 and the crash did not occur. However, this is not a solution, as it breaks almost all applications that need the new version to run.
Running drrun with -disable_rseq also fixes the problem. However, with this flag the instrumentation is dead slow, to say the least.
Logs and backtrace of the crash:
log.0.32805.txt
grep.0.32805.txt
'bt' and 'bt full'.txt