
CRASH from new glibc 2.35 rseq on any app (-disable_rseq solves) #5431

Closed
derchr opened this issue Mar 25, 2022 · 13 comments · Fixed by #5711

Comments

@derchr
Contributor

derchr commented Mar 25, 2022

Describe the bug
Running DynamoRIO to instrument any application results in a crash.
./drrun -- grep (no client at all) results in:

<Starting application /usr/bin/grep (331303)>
<Initial options = -no_dynamic_options -code_api -stack_size 56K -signal_stack_size 32K -max_elide_jmp 0 -max_elide_call 0 -early_inject -emulate_brk -no_inline_ignored_syscalls -native_exec_default_list '' -no_native_exec_managed_code -no_indcall2direct >
<Paste into GDB to debug DynamoRIO clients:
set confirm off
add-symbol-file '/home/derek/Git/dynamorio/build/lib64/debug/libdynamorio.so' 0x00007f972794a000
>
<(1+x) Handling our fault in a TRY at 0x00007f9727bbd98c>
<spurious rep/repne prefix @0x00007f97278a9650 (f3 0f 1e fa): >
<Application /usr/bin/grep (331303).  DynamoRIO internal crash at PC 0x00007f9727c26fda.  Please report this at http://dynamorio.org/issues/.  Program aborted.
Received SIGSEGV at pc 0x00007f9727c26fda in thread 331303
Base: 0x00007f97278ff000
Registers:eax=0x00007f94e391c0a0 ebx=0x0000000000050e27 ecx=0x00007f9727c29258 edx=0x00000000000009a0
        esi=0x00007f94e392b948 edi=0x0000000000001003 esp=0x00007f94e392b9d0 ebp=0x00007f94e392b9f0
        r8 =0x0000000000000000 r9 =0x0000000000000000 r10=0x00007f94e392b948 r11=0x0000000000000246
        r12=0x00007f97238c66d0 r13=0x00007f97238c66c7 r14=0x00007f97238c66cf r15=0x00007f97238c66dd
        eflags=0x0000000000010206
version 9.0.19068, custom build
-no_dynamic_options -code_api -stack_size 56K -signal_stack_size 32K -max_elide_jmp 0 -max_elide_call 0 -early_inject -emulate_brk -no_inline_ignored_syscalls -native_exec_default_list '' -no_native_exec_managed_code -no_indcall2direct 
0x00007f94e392b9f0 0x00007f9727c27254
0x00007f94e392ba10 0x00007f972798076b
0x00007f94e392ba70 0x00007f9727981bad
0x00007f94e392bab0 0x00007f9727ab66f9
0x00007f94e392bb10 0x00007f9727ab6bee
0x00007f94e392bb60 0x00007f9727978977
0x00007f94e392bba0 0x00007f9727a986a2
0x00007f94e392bd10 0x00007f9727a9b147
0x00007f94e392bf10 0x00007f97279cff0b
0x00007f94e392bff0 0x00007f96e388bf05
0x00007ffe2cb9c140 0x00007f9727639e0c
0x00007f97238c64e0 0x00007f97238d0f54>

Note that running trivial applications like ls or a simple hello-world program does not result in a crash.
A small list of applications that also do not work:
vim, vi (it crashes the moment you type anything), less, more.

A small list of applications that do work:
ls, uname, cat.

To Reproduce
Steps to reproduce the behavior:

  1. Pointer to a minimized application (ideally the source code for it and instructions on which toolchain it was built with).
    grep that comes with every Linux distribution.
  2. Precise command line for running the application.
    ./drrun -- grep
  3. Exact output or incorrect behavior.
    See above

I can reproduce on a fresh Arch Linux environment.

  • What happens when you run without any client?
    Same result with or without a client: crash
  • What happens when you run with debug build ("-debug" flag to drrun/drconfig/drinject)?
    Same result

Expected behavior
No crash, correct instrumentation.

Versions

  • What version of DynamoRIO are you using?
    Tested the 9.0.1 release and also a fresh build on master.

  • Does the latest build from /~https://github.com/DynamoRIO/dynamorio/releases solve the problem?
    No

  • What operating system version are you running on?
    Manjaro Linux (derivative of Arch Linux)

  • Is your application 32-bit or 64-bit?
    64-bit

Operating System: Manjaro Linux
KDE Plasma Version: 5.24.3
KDE Frameworks Version: 5.91.0
Qt Version: 5.15.3
Kernel Version: 5.16.14-1-MANJARO (64-bit)
Graphics Platform: X11
Processors: 8 × Intel® Core™ i5-8250U CPU @ 1.60GHz
Memory: 7.6 GiB of RAM
Graphics Processor: Mesa Intel® UHD Graphics 620

Additional context
This is the same bug as described in https://groups.google.com/g/dynamorio-users/c/eq5zD824QwY
The problem might be related to rseq.

Also, one observation I made: this could be related to the recent update of Arch Linux to glibc version 2.35. As a small test I downgraded to 2.33 and the crash did not occur. However, this is not a solution, as it breaks almost all applications that need the new version to run.

Running drrun with -disable_rseq also fixes the problem. However, with this flag the instrumentation is dead slow, to say the least.

Logs and backtrace of the crash:
log.0.32805.txt
grep.0.32805.txt
'bt' and 'bt full'.txt

@derekbruening
Contributor

Inlining stacktrace for searching:

(gdb) bt
#0  0x00007ffff7e8cdc8 in rseq_clear_tls_ptr (dcontext=0x7ffdb3b65080)
    at /home/derek/Git/dynamorio/core/unix/rseq_linux.c:221
#1  0x00007ffff7e8d042 in rseq_shared_fragment_flushtime_update (dcontext=0x7ffdb3b65080)
    at /home/derek/Git/dynamorio/core/unix/rseq_linux.c:280
#2  0x00007ffff7be676b in check_flush_queue (dcontext=0x7ffdb3b65080, was_I_flushed=0x0)
    at /home/derek/Git/dynamorio/core/fragment.c:5436
#3  0x00007ffff7be7bad in enter_couldbelinking (dcontext=0x7ffdb3b65080, was_I_flushed=0x0,
    cache_transition=false) at /home/derek/Git/dynamorio/core/fragment.c:5784
#4  0x00007ffff7d1c605 in move_lazy_list_to_pending_delete (dcontext=0x7ffdb3b65080)
    at /home/derek/Git/dynamorio/core/vmareas.c:9226
#5  0x00007ffff7d1cafa in add_to_lazy_deletion_list (dcontext=0x7ffdb3b65080, f=0x7ffdb3c3cb80)
    at /home/derek/Git/dynamorio/core/vmareas.c:9313
#6  0x00007ffff7bde977 in fragment_remove_shared_no_flush (dcontext=0x7ffdb3b65080, f=0x7ffdb3c3cb80)
    at /home/derek/Git/dynamorio/core/fragment.c:3222
#7  0x00007ffff7cfe5ae in end_and_emit_trace (dcontext=0x7ffdb3b65080, cur_f=0x7ffdb3c40408)
    at /home/derek/Git/dynamorio/core/monitor.c:1520
#8  0x00007ffff7d01053 in monitor_cache_enter (dcontext=0x7ffdb3b65080, f=0x7ffdb3c40408)
    at /home/derek/Git/dynamorio/core/monitor.c:2030
#9  0x00007ffff7c35f0b in d_r_dispatch (dcontext=0x7ffdb3b65080)
    at /home/derek/Git/dynamorio/core/dispatch.c:193

@derekbruening
Contributor

Running drrun with -disable_rseq also fixes the problem. However, with this flag the instrumentation is dead slow, to say the least.

Could you clarify: which instrumentation is slow? This should not affect tool-added instrumentation; it affects the application's code paths, disabling things like per-cpu caches, which can affect performance by forcing per-thread caches. But we're talking small percentages on large many-threaded applications; I would expect zero observable impact on the apps listed above. Could you show precise command lines and times?

@derchr
Contributor Author

derchr commented Mar 25, 2022

Could you clarify: which instrumentation is slow?

I'm currently testing ./drrun without any tool / client active.

I must correct myself: the slowdown was because I was still running a debug build of DynamoRIO. Now, with the release build, I don't experience much difference in performance between my host machine running ./drrun -disable_rseq -- vim and my Docker container (Debian) running ./drrun -- vim.
With -disable_rseq I'm now even able to run graphical applications without significant slowdown.

@derekbruening
Contributor

With -disable_rseq I'm now even able to run graphical applications without significant slowdown.

I should have suggested -disable_rseq earlier: glad you found it. So there is a workaround. I would say that this gives pretty high confidence this is an rseq issue with the new glibc => updating the title.

@derekbruening derekbruening changed the title CRASH when running DynamoRIO even without any client CRASH from new glibc 2.35 rseq on any app (-disable_rseq solves) Mar 25, 2022
@derekbruening
Contributor

I can reproduce this on Ubuntu 22.04 on some apps, like grep as reported, but not on all apps: ls and a hello-world program work fine. Will investigate further.

@abhinav92003
Contributor

Here's the stack trace for the crash:

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7e89909 in rseq_clear_tls_ptr (dcontext=0x7ffdb3b86080) at /home/abhinav92003/dynamorio-master/core/unix/rseq_linux.c:221
221         if (is_dynamo_address((byte *)(ptr_uint_t)app_rseq->rseq_cs.ptr64))
(gdb) bt
#0  0x00007ffff7e89909 in rseq_clear_tls_ptr (dcontext=0x7ffdb3b86080) at /home/abhinav92003/dynamorio-master/core/unix/rseq_linux.c:221
#1  0x00007ffff7e89b97 in rseq_shared_fragment_flushtime_update (dcontext=0x7ffdb3b86080) at /home/abhinav92003/dynamorio-master/core/unix/rseq_linux.c:280
#2  0x00007ffff7bdfb7a in check_flush_queue (dcontext=0x7ffdb3b86080, was_I_flushed=0x0) at /home/abhinav92003/dynamorio-master/core/fragment.c:5436
#3  0x00007ffff7be0fe4 in enter_couldbelinking (dcontext=0x7ffdb3b86080, was_I_flushed=0x0, cache_transition=false) at /home/abhinav92003/dynamorio-master/core/fragment.c:5784
#4  0x00007ffff7d1671a in move_lazy_list_to_pending_delete (dcontext=0x7ffdb3b86080) at /home/abhinav92003/dynamorio-master/core/vmareas.c:9229
#5  0x00007ffff7d16c13 in add_to_lazy_deletion_list (dcontext=0x7ffdb3b86080, f=0x7ffdb3c1f928) at /home/abhinav92003/dynamorio-master/core/vmareas.c:9316
#6  0x00007ffff7bd7d02 in fragment_remove_shared_no_flush (dcontext=0x7ffdb3b86080, f=0x7ffdb3c1f928) at /home/abhinav92003/dynamorio-master/core/fragment.c:3222
#7  0x00007ffff7cf842c in end_and_emit_trace (dcontext=0x7ffdb3b86080, cur_f=0x7ffdb3c20798) at /home/abhinav92003/dynamorio-master/core/monitor.c:1520
#8  0x00007ffff7cfaee5 in monitor_cache_enter (dcontext=0x7ffdb3b86080, f=0x7ffdb3c20798) at /home/abhinav92003/dynamorio-master/core/monitor.c:2030
#9  0x00007ffff7c2f6fe in d_r_dispatch (dcontext=0x7ffdb3b86080) at /home/abhinav92003/dynamorio-master/core/dispatch.c:193
#10 0x00007fffb3afef05 in ?? ()
#11 0x0000000000000000 in ?? ()

app_rseq is not readable, which causes the crash.

217         struct rseq *app_rseq = (struct rseq *)(base + rseq_tls_offset);
...
(gdb) p base
$1 = (byte *) 0x7ffdb3b9c700 ""
(gdb) p app_rseq
$2 = (struct rseq *) 0x7ffdb3b9d0a0
(gdb) memquery app_rseq
7ffdb3b9d000-7ffdb3b9f000 ---p 00000000 00:00 0
(gdb) memquery base
7ffdb3b9b000-7ffdb3b9d000 rw-p 00000000 00:00 0

I think there might be an issue in our heuristics for locating the thread's struct rseq (rseq_tls_offset), in rseq_locate_tls_offset.
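
To make the failing computation concrete, here is a minimal C reconstruction of what the gdb session shows. The DR-internal names and types (byte, ptr_uint_t, is_dynamo_address, get_segment_base, LIB_SEG_TLS) are assumed from the DR codebase; this is a sketch, not the verbatim source:

    /* Sketch of the faulting path in rseq_clear_tls_ptr(). */
    byte *base = get_segment_base(LIB_SEG_TLS); /* 0x7ffdb3b9c700 above: an rw-p page */
    struct rseq *app_rseq =
        (struct rseq *)(base + rseq_tls_offset); /* 0x7ffdb3b9d0a0: a ---p page */
    /* rseq_linux.c:221: the read of app_rseq->rseq_cs.ptr64 faults. */
    if (is_dynamo_address((byte *)(ptr_uint_t)app_rseq->rseq_cs.ptr64)) {
        /* never reached: SIGSEGV in the condition above */
    }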

@abhinav92003
Contributor

I see that we set rseq_tls_offset in two places:

In rseq_locate_rseq_regions during init/takeover:

ATOMIC_4BYTE_WRITE(&rseq_tls_offset, offset, false);

In rseq_process_syscall, after an rseq syscall:

int prior = atomic_exchange_int(&rseq_tls_offset, offset);

Also, we expect both of them to give the same value; if they don't, we throw this error:

"struct rseq is not always in static thread-local storage");

I found the following glibc thread that discusses "put the rseq area into struct pthread, not into a initial-exec TLS"
https://sourceware.org/pipermail/libc-alpha/2021-November/133221.html. I wonder if this is causing some assumption to break on our side.
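
For context, a rough sketch of how those two sites are meant to interact (the exact check in DR may differ, and the error-path name here is hypothetical):

    /* Site 1: rseq_locate_rseq_regions, heuristic discovery at init/takeover. */
    ATOMIC_4BYTE_WRITE(&rseq_tls_offset, offset, false);

    /* Site 2: rseq_process_syscall, after observing the app's SYS_rseq. */
    int prior = atomic_exchange_int(&rseq_tls_offset, offset);
    /* Both sites are expected to produce the same offset; if not, DR raises
     * the error quoted above. */
    if (prior != 0 && prior != offset)
        report_fatal("struct rseq is not always in static thread-local storage");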

@derekbruening
Contributor

Are they at least following the convention of having __rseq_cs or __rseq_table sections in the ELF file (https://dynamorio.org/page_rseq.html#autotoc_md251)?

Static TLS is assumed for marking our own region restartable (https://dynamorio.org/page_rseq.html#autotoc_md267) and that assumption is supposedly checked (I think the error you point to above) -- why didn't the error check fire?

@abhinav92003
Contributor

rseq_process_syscall and rseq_locate_tls_offset compute rseq_tls_offset relative to get_app_segment_base; however, rseq_clear_tls_ptr adds it to get_segment_base, which is wrong since we need the app's segment base. It works when I change it to get_app_segment_base.
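
In other words, the fix is essentially a one-line change in rseq_clear_tls_ptr (a sketch, assuming the LIB_SEG_TLS selector these helpers take elsewhere in DR):

    /* Before (crashes): DR's own segment base, so base + rseq_tls_offset
     * points into the wrong, here unreadable, region:
     *   byte *base = get_segment_base(LIB_SEG_TLS);
     * After (works): the app's segment base, matching how rseq_tls_offset
     * was computed in rseq_process_syscall and rseq_locate_tls_offset: */
    byte *base = get_app_segment_base(LIB_SEG_TLS);
    struct rseq *app_rseq = (struct rseq *)(base + rseq_tls_offset);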

Are they at least following the convention of having __rseq_cs or __rseq_table sections in the ELF file (https://dynamorio.org/page_rseq.html#autotoc_md251)?

Probably not. linux.rseq is still failing because rseq_process_elf_sections reports "__rseq_cs_ptr_array's entries are not in a loaded segment"

I'm also seeing linux.rseq_table and linux.rseq_noarray fail with "Rseq signature is unreadable".

@derekbruening
Contributor

That is not good: if the conventions are not followed, we might have no hope of handling it at all... we would need to push for changes to glibc. I wonder whether Mathieu Desnoyers or others who set up the conventions were involved in any of the glibc decisions.

@derekbruening
Contributor

Re-examining this: the original conventions may have said only that the data had to be in the ELF file in that section, not inside a loadable segment. Is there such a section here in the file, with the problem just being that the section is not in a loaded segment?

derekbruening added a commit that referenced this issue Oct 22, 2022
Adds a workaround for the SIGFPE in glibc 2.34+ __libc_early_init() by
setting two ld.so globals located via hardcoded offsets, an approach
that is fragile and considered temporary.  (Improvements might include
decoding __libc_early_init or other functions to find the offsets, which
is also fragile; adding runtime options to set them for a non-rebuild
fix; or disabling the call to __libc_early_init, which doesn't seem to
be needed for 2.34.)

Tested on glibc 2.34 where every libc-using client crashes with SIGFPE
but they work with this fix.

Adds an Ubuntu22 GA CI run but it has many failures due to the rseq
issue #5431.  Adds a workaround for this by having drrun set -disable_rseq
if it detects glibc 2.35+.  Even with this we have a number of test failures
so for now we use a label to just run 4 sanity-check tests.  This should
be enough to detect glibc changes that break the offsets here.

Issue: #5437, #5431
derekbruening added a commit to DynamoRIO/drmemory that referenced this issue Oct 22, 2022
Updates DR to cacb5424e for workarounds for 2 Ubuntu22 issues (glibc
SIGFPE and rseq failure).

Issue: DynamoRIO/dynamorio#5437, DynamoRIO/dynamorio#5431
@abhinav92003
Contributor

abhinav92003 commented Oct 25, 2022

A few observations:

  • Another, probably cleaner workaround than -disable_rseq is:
GLIBC_TUNABLES=glibc.pthread.rseq=0
export GLIBC_TUNABLES

This prevents glibc from registering its own struct rseq; it essentially disables glibc's rseq support.
The rseq and burst tests work fine with this. (A standalone check for whether glibc's registration is active is sketched at the end of this comment.)

Looking at the suite/tests/bin/linux.rseq ELF file on an older glibc (2.31) vs newer glibc (2.35), I can see that the relevant sections are still there.

Old:

$ objdump -s -j __rseq_cs_ptr_array suite/tests/bin/linux.rseq

suite/tests/bin/linux.rseq:     file format elf64-x86-64

Contents of section __rseq_cs_ptr_array:
 50a0 20500000 00000000 40500000 00000000   P......@P......
 50b0 60500000 00000000 80500000 00000000  `P.......P......

$ objdump -s -j __rseq_cs suite/tests/bin/linux.rseq

suite/tests/bin/linux.rseq:     file format elf64-x86-64

Contents of section __rseq_cs:
 5020 00000000 00000000 9a170000 00000000  ................
 5030 34000000 00000000 d3170000 00000000  4...............
 5040 00000000 00000000 2f190000 00000000  ......../.......
 5050 37000000 00000000 6c190000 00000000  7.......l.......
 5060 00000000 00000000 8f1a0000 00000000  ................
 5070 12000000 00000000 a71a0000 00000000  ................
 5080 00000000 00000000 181b0000 00000000  ................
 5090 48000000 00000000 661b0000 00000000  H.......f.......

New:

$ objdump -s -j __rseq_cs_ptr_array suite/tests/bin/linux.rseq

suite/tests/bin/linux.rseq:     file format elf64-x86-64

Contents of section __rseq_cs_ptr_array:
 5140 c0500000 00000000 e0500000 00000000  .P.......P......
 5150 00510000 00000000 20510000 00000000  .Q...... Q......

$ objdump -s -j __rseq_cs suite/tests/bin/linux.rseq

suite/tests/bin/linux.rseq:     file format elf64-x86-64

Contents of section __rseq_cs:
 50c0 00000000 00000000 6d170000 00000000  ........m.......
 50d0 34000000 00000000 a6170000 00000000  4...............
 50e0 00000000 00000000 14190000 00000000  ................
 50f0 37000000 00000000 51190000 00000000  7.......Q.......
 5100 00000000 00000000 871a0000 00000000  ................
 5110 12000000 00000000 9f1a0000 00000000  ................
 5120 00000000 00000000 151b0000 00000000  ................
 5130 48000000 00000000 631b0000 00000000  H.......c.......

  • For the "__rseq_cs_ptr_array's entries are not in a loaded segment" error: I was able to get the rseq_cs by hacking around in rseq_process_elf_sections, specifically by forcing this statement to always execute:

    entry_offs = load_offs;

  • The final issue I'm running into is, I think, that the struct rseq is no longer in static TLS. (It causes an EINVAL in rseq tests and "struct rseq is not in static thread-local storage" in burst tests.) This matches what is documented at https://sourceware.org/pipermail/libc-alpha/2021-November/133221.html. Also, https://lwn.net/Articles/883104/ says that the rseq registered by glibc is stored in the Thread Control Block maintained by glibc. I'm working on modifying rseq_locate_tls_offset for this.

To support older glibcs, I think some of this fix needs to be guarded based on the value set for glibc.pthread.rseq.
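
For reference, glibc 2.35 exports the registration details as the symbols __rseq_offset, __rseq_size, and __rseq_flags, declared in sys/rseq.h. The standalone program below (an illustration, not DR code) shows whether registration is active and where the glibc-registered struct rseq lives; it assumes a compiler providing __builtin_thread_pointer (e.g. GCC 11+ on x86-64):

    #include <stdio.h>
    #include <sys/rseq.h> /* glibc 2.35+: __rseq_offset, __rseq_size, __rseq_flags */

    int main(void)
    {
        /* __rseq_size is zero when glibc's rseq support is unavailable or
         * disabled, e.g. via GLIBC_TUNABLES=glibc.pthread.rseq=0. */
        if (__rseq_size == 0) {
            printf("glibc rseq registration is disabled\n");
            return 0;
        }
        /* The registered struct rseq lives inside glibc's struct pthread,
         * __rseq_offset bytes from the thread pointer. */
        struct rseq *rs = (struct rseq *)((char *)__builtin_thread_pointer() +
                                          __rseq_offset);
        printf("struct rseq at %p (offset %td, size %u), cpu_id %u\n",
               (void *)rs, __rseq_offset, __rseq_size, rs->cpu_id);
        return 0;
    }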

@abhinav92003
Contributor

Update: I have a local fix where I detect whether glibc's rseq support is enabled and, if it is, adjust some of the logic for locating the struct rseq and for processing modules. I'll try pushing it for review today.

abhinav92003 added a commit that referenced this issue Nov 18, 2022
Fixes issues with DR's rseq handling in glibc 2.35+.

Glibc 2.35 added support for the Linux rseq feature. See
https://lwn.net/Articles/883104/ for details. TL;DR: glibc registers
its own struct rseq at init time, and stores its offset from the
thread pointer in __rseq_offset. The glibc-registered struct rseq is
present in the struct pthread. If glibc's rseq support isn't
available, either due to some issue or because the user disabled
it by exporting GLIBC_TUNABLES=glibc.pthread.rseq=0, it will
set __rseq_size to zero.

Improves the heuristic to find the registered struct rseq. For the
glibc-support case: on AArch64, it is at a negative offset from the app
library segment base, whereas on x86 it's at a positive offset. On both
AArch64 and x86, the offset has the opposite sign from what it would
have if the app registered the struct rseq manually in its static TLS
(which happens for older glibc and when glibc's rseq support
is disabled).

Detects whether the glibc rseq support is enabled by looking at
the sign of the struct rseq offset.

Removes the drrun -disable_rseq workaround added by #5695.

Adjusts the linux.rseq test to get the struct rseq registered by
glibc, when it's available. Also fixes some issues in the test.

Adds the Ubuntu_22 tag to rseq tests so that they are enabled.

Our Ubuntu-20 CI tests the case without rseq support in glibc,
where the app registers the struct rseq. This also helps test the
case where the app is not using glibc.

Also, our Ubuntu-22 CI tests the case with glibc rseq support.
Manually tested the disabled rseq support case on glibc 2.35,
but not adding a CI version of it.

Fixes #5431
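
A sketch of the sign-based detection described in the commit message above (names are illustrative, not DR's actual code). On x86-64 (TLS variant II) static TLS sits below the thread pointer and struct pthread above it, so a glibc-registered struct rseq shows up at a positive offset; AArch64 (variant I) mirrors the layout:

    #include <stdbool.h>
    #include <stddef.h>

    /* Assumes the offset was measured from the app's TLS segment base,
     * as discussed in this thread. */
    bool glibc_registered_rseq(ptrdiff_t offset_from_app_seg_base)
    {
    #ifdef __x86_64__
        /* An app-registered static-TLS rseq would have a negative offset. */
        return offset_from_app_seg_base > 0;
    #else /* AArch64: opposite sign. */
        return offset_from_app_seg_base < 0;
    #endif
    }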
derekbruening added a commit to DynamoRIO/drmemory that referenced this issue Nov 18, 2022
Pulls in a fix for DynamoRIO/dynamorio#5431 where the rseq feature in
glibc 2.35 broke DR's rseq support.