A bug of the up-to-date develop branch code. #2475

Closed
NHZlX opened this issue Jun 15, 2017 · 11 comments

@NHZlX
Contributor

NHZlX commented Jun 15, 2017

When I run the book/03.image_classification demo, it crashes with a segmentation fault. Here is the log:

[xzl@03.image_classification]$ python train.py 
I0615 17:25:26.157414  9625 Util.cpp:166] commandline:  --use_gpu=True --trainer_count=1 
*** Aborted at 1497518726 (unix time) try "date -d @1497518726" if you are using GNU date ***
Segmentation fault
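
The abort line prints a raw Unix timestamp and suggests decoding it with GNU date. As a minimal sketch, the same decoding can be done in Python 3 (the thread itself uses Python 2.7). The helper name decode_abort_time is hypothetical, and the UTC+8 offset is an assumption inferred from the 17:25:26 glog timestamp on the line above it:

```python
from datetime import datetime, timedelta, timezone


def decode_abort_time(unix_seconds, utc_offset_hours=0):
    """Convert the Unix time from the "*** Aborted at ..." line to a datetime.

    utc_offset_hours is an assumption about the machine's timezone.
    """
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(unix_seconds, tz=tz)


# Timestamp copied from the log above.
abort_utc = decode_abort_time(1497518726)
# Assumed UTC+8, so that it can be compared with the glog line.
abort_local = decode_abort_time(1497518726, utc_offset_hours=8)

print(abort_utc.isoformat())    # 2017-06-15T09:25:26+00:00
print(abort_local.isoformat())  # 2017-06-15T17:25:26+08:00
```

Under that assumption the abort happens in the same second as the Util.cpp:166 log line, which is consistent with the crash occurring during initialization.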

The crash can be reproduced with paddle.init() alone:

[xzl@ 03.image_classification]$ python
Python 2.7.3 (default, May 25 2017, 20:23:14) 
[GCC 4.4.6 20120305 (Red Hat 4.4.6-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import paddle.v2 as paddle
>>> paddle.init()
I0615 17:31:36.499310 31343 Util.cpp:166] commandline:  
*** Aborted at 1497519096 (unix time) try "date -d @1497519096" if you are using GNU date ***
Segmentation fault

Here is the gdb backtrace:

(gdb) bt
#0  0x00007fffeea5790c in paddle::runInitFunctions() () at /home/xingzhaolong/.jumbo/opt/gcc48/include/c++/4.8.3/mutex:776
#1  0x00007fffeea59a53 in paddle::initMain(int, char**) () at /home/xingzhaolong/pr/temp/paddle_me/paddle/utils/Util.cpp:199
#2  0x00007fffeeaed721 in initPaddle(int, char**) () at /home/xingzhaolong/pr/temp/paddle_me/paddle/api/Util.cpp:28
#3  0x00007fffee7730f9 in _wrap_initPaddle () at /home/xingzhaolong/pr/temp/paddle_me/build/paddle/api/PaddlePYTHON_wrap.cxx:15504
#4  0x00007ffff7d1d3a3 in ext_do_call (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4331
#5  PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2705
#6  0x00007ffff7d1f130 in PyEval_EvalCodeEx (co=Traceback (most recent call last):
  File "/home/xingzhaolong/.jumbo/lib/go/src/runtime/runtime-gdb.py", line 303, in ifacematcher
    if is_iface(val) or is_eface(val):
  File "/home/xingzhaolong/.jumbo/lib/go/src/runtime/runtime-gdb.py", line 205, in is_iface
    except gdb.error:
AttributeError: 'module' object has no attribute 'error'
0x7ffff1c5e730, globals=<value optimized out>, locals=<value optimized out>, args=<value optimized out>, argcount=Traceback (most recent call last):
  File "/home/xingzhaolong/.jumbo/lib/go/src/runtime/runtime-gdb.py", line 303, in ifacematcher
    if is_iface(val) or is_eface(val):
  File "/home/xingzhaolong/.jumbo/lib/go/src/runtime/runtime-gdb.py", line 205, in is_iface
    except gdb.error:
AttributeError: 'module' object has no attribute 'error'
1, kws=<value optimized out>, kwcount=Traceback (most recent call last):
  File "/home/xingzhaolong/.jumbo/lib/go/src/runtime/runtime-gdb.py", line 303, in ifacematcher
    if is_iface(val) or is_eface(val):
  File "/home/xingzhaolong/.jumbo/lib/go/src/runtime/runtime-gdb.py", line 205, in is_iface
    except gdb.error:
AttributeError: 'module' object has no attribute 'error'
0, defs=Traceback (most recent call last):
  File "/home/xingzhaolong/.jumbo/lib/go/src/runtime/runtime-gdb.py", line 303, in ifacematcher
    if is_iface(val) or is_eface(val):
  File "/home/xingzhaolong/.jumbo/lib/go/src/runtime/runtime-gdb.py", line 205, in is_iface
    except gdb.error:
AttributeError: 'module' object has no attribute 'error'
0x0, defcount=Traceback (most recent call last):
  File "/home/xingzhaolong/.jumbo/lib/go/src/runtime/runtime-gdb.py", line 303, in ifacematcher
    if is_iface(val) or is_eface(val):
  File "/home/xingzhaolong/.jumbo/lib/go/src/runtime/runtime-gdb.py", line 205, in is_iface
    except gdb.error:
AttributeError: 'module' object has no attribute 'error'
0, closure=Traceback (most recent call last):
  File "/home/xingzhaolong/.jumbo/lib/go/src/runtime/runtime-gdb.py", line 303, in ifacematcher
    if is_iface(val) or is_eface(val):
  File "/home/xingzhaolong/.jumbo/lib/go/src/runtime/runtime-gdb.py", line 205, in is_iface
    except gdb.error:
AttributeError: 'module' object has no attribute 'error'
0x0) at Python/ceval.c:3253
#7  0x00007ffff7d1d4a1 in fast_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4117
#8  call_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4042
---Type <return> to continue, or q <return> to quit---q

I hope this can be resolved soon. Thank you!

@jacquesqiao jacquesqiao self-assigned this Jun 15, 2017
@helinwang helinwang self-assigned this Jun 15, 2017
@helinwang
Contributor

helinwang commented Jun 15, 2017

It runs fine on our Docker image built from the latest code. (To make sure the image is up to date, I started a new CI run to build and push it.)

root@99ecb469e8b2:/data/03.image_classification# python train.py 
I0615 18:59:50.055052    14 Util.cpp:166] commandline:  --use_gpu=True --trainer_count=1 
[INFO 2017-06-15 18:59:52,162 layers.py:2245] output for __conv_0__: c = 64, h = 32, w = 32, size = 65536
[INFO 2017-06-15 18:59:52,163 layers.py:2245] output for __conv_1__: c = 64, h = 32, w = 32, size = 65536

I think the most important part of the stack trace is the top few lines:

#0  0x00007fffeea5790c in paddle::runInitFunctions() () at /home/xingzhaolong/.jumbo/opt/gcc48/include/c++/4.8.3/mutex:776
#1  0x00007fffeea59a53 in paddle::initMain(int, char**) () at /home/xingzhaolong/pr/temp/paddle_me/paddle/utils/Util.cpp:199

I noticed that the bottom few lines contain an error (probably unrelated to the crash):

#6  0x00007ffff7d1f130 in PyEval_EvalCodeEx (co=Traceback (most recent call last):
  File "/home/xingzhaolong/.jumbo/lib/go/src/runtime/runtime-gdb.py", line 303, in ifacematcher
    if is_iface(val) or is_eface(val):
  File "/home/xingzhaolong/.jumbo/lib/go/src/runtime/runtime-gdb.py", line 205, in is_iface
    except gdb.error:

This error is Go-related, but I don't think it is what causes the crash. To fix it, can you update your gdb? According to golang/go#10359, GDB versions older than 7.3 have this problem (GDB 7.3 was released in July 2011).
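
To check whether a machine is affected by that gdb limitation, the version can be read from the output of gdb --version. The following is only a sketch in Python 3: the function names are hypothetical, and it assumes that the first dotted number on the first line of the banner is the GDB version, which may not hold for every distribution's banner format.

```python
import re
import subprocess

# Threshold taken from the comment above (golang/go#10359).
MIN_GDB_VERSION = (7, 3)


def parse_gdb_version(banner):
    """Return (major, minor) parsed from `gdb --version` output, or None.

    Assumption: the first dotted number on the first line is the GDB version.
    """
    lines = banner.splitlines()
    if not lines:
        return None
    match = re.search(r"(\d+)\.(\d+)", lines[0])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def gdb_is_new_enough(banner, minimum=MIN_GDB_VERSION):
    """True when the parsed version is at least `minimum`."""
    version = parse_gdb_version(banner)
    return version is not None and version >= minimum


def read_gdb_banner():
    """Run `gdb --version`; return its output, or "" if gdb is not on PATH."""
    try:
        result = subprocess.run(
            ["gdb", "--version"], capture_output=True, text=True, check=False
        )
    except FileNotFoundError:
        return ""
    return result.stdout


if __name__ == "__main__":
    banner = read_gdb_banner()
    print(parse_gdb_version(banner), gdb_is_new_enough(banner))
```

Versions are compared as integer tuples rather than strings, so that 7.11 is correctly treated as newer than 7.3.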

Back to the problem: the stack trace shows the crash at gcc48/include/c++/4.8.3/mutex:776. Here is the code:

766  extern "C" void __once_proxy(void);
767
768  /// call_once
769  template<typename _Callable, typename... _Args>
770    void
771    call_once(once_flag& __once, _Callable&& __f, _Args&&... __args)
772    {
773#ifdef _GLIBCXX_HAVE_TLS
774      auto __bound_functor = std::__bind_simple(std::forward<_Callable>(__f),
775          std::forward<_Args>(__args)...);
776      __once_callable = &__bound_functor;
777      __once_call = &__once_call_impl<decltype(__bound_functor)>;
778#else
779      unique_lock<mutex> __functor_lock(__get_once_mutex());
780      auto __callable = std::__bind_simple(std::forward<_Callable>(__f),
781          std::forward<_Args>(__args)...);
782      __once_functor = [&]() { __callable(); };
783      __set_once_functor_lock_ptr(&__functor_lock);
784#endif
785
786      int __e = __gthread_once(&(__once._M_once), &__once_proxy);
787
788#ifndef _GLIBCXX_HAVE_TLS
789      if (__functor_lock)
790        __set_once_functor_lock_ptr(0);
791#endif
792
793      if (__e)
794        __throw_system_error(__e);
795    }
796#endif // _GLIBCXX_HAS_GTHREADS

I cannot reproduce the crash on our Docker image, so it works on Ubuntu 16.04 with GCC 5.4. I suspect that a change in our compiler or linker commands causes this crash on certain OS and GCC versions.

@jacquesqiao
Member

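We have confirmed that the problem is with the Python installed by the company's internal jumbo tool; a self-compiled Python 2.7 works fine. We are looking for a suitable way to install it.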

@lcy-seso
Contributor

I used a self-compiled Python and hit the same problem. That Python was compiled with the GCC from jumbo.

@jacquesqiao
Member

fixed: #2530
This shows that compiling with the GCC from the company's jumbo environment is the problem.

@lcy-seso lcy-seso reopened this Jun 20, 2017
@lcy-seso
Contributor

We are currently working around this problem by not compiling Go. Isn't there still a potential risk?

@helinwang
Contributor

helinwang commented Jun 20, 2017

We need to figure out what exactly happened. It works fine on the Docker image. I will try to reproduce it on a Baidu dev machine that uses jumbo.

@jacquesqiao
Member

@helinwang Right. Apart from Python compiled with the company's default GCC, every other setup works fine, including Docker, Mac, and Python compiled with the GCC under /opt/compiler.

@wangkuiyi
Collaborator

@jacquesqiao noticed 10 hours ago that binaries generated by the GCC installed by Jumbo depend on /lib64/libpthread.so.0:

[screenshot a]

while binaries generated by /opt/compiler/gcc* depend on /opt/compiler/gcc*/lib/libpthread.so.0:

[screenshot b]

He suspects that the problem comes from machine initialization, which installed a defective /lib64/libpthread.so.0.
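
The dependency described above can be checked by looking at which libpthread.so.0 a binary resolves to in the output of ldd. Here is a minimal Python 3 sketch that extracts that path; the function name resolved_library is hypothetical, the load addresses in the sample are made up for illustration, and it assumes the usual "name => path (address)" line format of ldd on Linux:

```python
import subprocess


def resolved_library(ldd_output, soname="libpthread.so.0"):
    """Return the path that `soname` resolves to in ldd output, or None."""
    for line in ldd_output.splitlines():
        parts = line.split("=>")
        if len(parts) != 2 or parts[0].strip() != soname:
            continue
        # Drop the trailing "(0x...)" load address, if present.
        path = parts[1].strip().split(" (")[0].strip()
        if not path or path == "not found" or path.startswith("("):
            return None
        return path
    return None


def run_ldd(binary_path):
    """Run `ldd` on a binary and return its standard output as text."""
    result = subprocess.run(
        ["ldd", binary_path], capture_output=True, text=True, check=False
    )
    return result.stdout


# Illustrative ldd output; the addresses are invented for this example.
SAMPLE = (
    "\tlinux-vdso.so.1 =>  (0x00007fff00000000)\n"
    "\tlibpthread.so.0 => /lib64/libpthread.so.0 (0x00007f0000000000)\n"
    "\tlibc.so.6 => /lib64/libc.so.6 (0x00007f0000200000)\n"
)

print(resolved_library(SAMPLE))  # /lib64/libpthread.so.0
```

Running run_ldd on the Python binary or on the Paddle shared library and comparing the result between the two toolchains would show whether they really pick up different copies of libpthread.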

@helinwang
Contributor

helinwang commented Jun 22, 2017

@NHZlX Could you let me know how you loaded the debug symbols? When I run

$ gdb python
(gdb) run train.py

I get:

(gdb) run train.py
Starting program: /home/helin/.jumbo/bin/python train.py
warning: Unable to find dynamic linker breakpoint function.
GDB will be unable to debug shared library initializers
and track explicitly loaded dynamic code.
warning: Could not load shared library symbols for 7 libraries, e.g. /home/helin/.jumbo/lib/libpython2.7.so.1.0.
Use the "info sharedlibrary" command to see the complete listing.
Do you need "set solib-search-path" or "set sysroot"?
I0622 09:57:38.023835 11685 Util.cpp:166] commandline:  --use_gpu=False --trainer_count=1 

Program received signal SIGSEGV, Segmentation fault.
0x00007fffeefebc8c in ?? ()

So no symbols are loaded.

@NHZlX
Contributor Author

NHZlX commented Jun 22, 2017

@helinwang
(gdb) bt
#0 0x00007fffeea5790c in paddle::runInitFunctions() () at /home/xingzhaolong/.jumbo/opt/gcc48/include/c++/4.8.3/mutex:776

@lcy-seso
Contributor

I have run into the same problem of missing debug symbols.
Adding the following to .gdbinit solved it on the dev machine:

set sysroot /
set auto-load safe-path /
