Too large max depth value in _recursive_fork_recordio #12619

Open · caiqi opened this issue Sep 20, 2018 · 17 comments

caiqi commented Sep 20, 2018

It seems that 1000 is too large a max depth for _recursive_fork_recordio in
/~https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/gluon/data/dataloader.py#L178
With len(obj.__dict__.items()) >= 2, this function can be called more than 2 ** 1000 times.

The following code in /~https://github.com/dmlc/gluon-cv/blob/master/scripts/detection/ssd/train_ssd.py#L96 in gluon-cv causes "RecursionError: maximum recursion depth exceeded in comparison" on Windows 10 with the latest build. The reason is that the dataset object holds a HybridSequential object, and the HybridSequential contains many children. This function was introduced in #12554. Would it be OK to stop recursing when obj is not an instance of mx.gluon.data.dataset.Dataset?
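For illustration, here is a hedged sketch of the early exit suggested above; the function name, the smaller depth cap, and the exact RecordIO re-open logic are assumptions for the sketch, not the actual dataloader.py implementation:

# Hypothetical guard: only recurse into Dataset attributes, so a
# HybridSequential (or any other large object graph) held by the dataset is
# never walked and cannot blow up the recursion depth.
from mxnet import recordio
from mxnet.gluon.data import dataset

def _fork_recordio_in_datasets(obj, depth, max_depth=100):
    if depth >= max_depth:
        return
    if isinstance(obj, recordio.MXRecordIO):
        # forked workers must re-obtain their own file handle
        obj.close()
        obj.open()
    elif isinstance(obj, dataset.Dataset):
        for v in obj.__dict__.values():
            _fork_recordio_in_datasets(v, depth + 1, max_depth)
    # anything else (gluon blocks, transforms, plain attributes) is skipped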

stu1130 (Contributor) commented Sep 20, 2018

Thanks for submitting the issue @caiqi
@mxnet-label-bot [data-loading]

eric-haibin-lin (Member) commented:
@zhreshold

zhreshold (Member) commented:
see #12622

Angzz commented Sep 24, 2018

@zhreshold I changed the code from your commit, but the error still exists.

zhreshold (Member) commented:
@Angzz Which OS? Can you print this for me so I can debug?

import sys
print(sys.getrecursionlimit())

Angzz commented Sep 25, 2018

@zhreshold Ubuntu 16.04. I printed the info you mentioned above with Python 2, and the output is 1000.
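A possible stopgap, not the fix that was eventually shipped: since the default limit reported above is 1000 while the traversal allows up to max_depth = 1000 frames, the training script can raise Python's recursion limit before building the DataLoader. This only hides the symptom; capping the search depth is the real fix.

import sys

# Stopgap sketch: give the traversal enough stack frames to reach its own
# max_depth cap instead of tripping Python's default limit of 1000.
sys.setrecursionlimit(5000)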

zhreshold (Member) commented:
Okay, I modified the search depth to be less aggressive.

Angzz commented Sep 25, 2018

@zhreshold OK, I will update to the MXNet pre-release version and run an experiment, thanks.

Angzz commented Sep 25, 2018

After updating to 1.3.1b20180925, an error occurs when training SSD with COCO, while VOC is normal:

---------------- train log and error log ------------------

INFO:root:Start training from [Epoch 0]
[19:54:19] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:109: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[19:54:28] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:109: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
python: malloc.c:3722: _int_malloc: Assertion `(unsigned long) (size) >= (unsigned long) (nb)' failed. *** Error in `python': malloc(): memory corruption: 0x00007fe3d29b3690 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fe5b37c87e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x8213e)[0x7fe5b37d313e]
/lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x54)[0x7fe5b37d5184]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(_Znwm+0x18)[0x7fe5af411e78]
/home/liang/.local/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x407eb0)[0x7fe52630beb0]
/home/liang/.local/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x40d7c9)[0x7fe5263117c9]
/home/liang/.local/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x2b88458)[0x7fe528a8c458]
/home/liang/.local/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x2adcb29)[0x7fe5289e0b29]
/home/liang/.local/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x2ae6544)[0x7fe5289ea544]
/home/liang/.local/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x2aea6c2)[0x7fe5289ee6c2]
/home/liang/.local/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x2ae6c64)[0x7fe5289eac64]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80)[0x7fe5af43cc80]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7fe5b3b226ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fe5b385841d]
======= Memory map: ========
00400000-006de000 r-xp 00000000 103:02 16254177 /usr/bin/python2.7
008dd000-008de000 r--p 002dd000 103:02 16254177 /usr/bin/python2.7
008de000-00955000 rw-p 002de000 103:02 16254177 /usr/bin/python2.7
00955000-00978000 rw-p 00000000 00:00 0
00c8d000-a94d5000 rw-p 00000000 00:00 0 [heap]
a94d5000-a9806000 rw-p 00000000 00:00 0 [heap]
200000000-200200000 rw-s 00000000 00:06 456 /dev/nvidiactl
200200000-200400000 ---p 00000000 00:00 0
200400000-200404000 rw-s 00000000 00:06 456 /dev/nvidiactl
200404000-200600000 ---p 00000000 00:00 0
200600000-200a00000 rw-s 00000000 00:06 456 /dev/nvidiactl
200a00000-201800000 ---p 00000000 00:00 0
201800000-201804000 rw-s 00000000 00:06 456 /dev/nvidiactl
201804000-201a00000 ---p 00000000 00:00 0
201a00000-201e00000 rw-s 00000000 00:06 456 /dev/nvidiactl
201e00000-201e04000 rw-s 00000000 00:06 456 /dev/nvidiactl
201e04000-202000000 ---p 00000000 00:00 0
202000000-202400000 rw-s 00000000 00:06 456 /dev/nvidiactl
202400000-202404000 rw-s 00000000 00:06 456 /dev/nvidiactl
202404000-202600000 ---p 00000000 00:00 0
202600000-202a00000 rw-s 00000000 00:06 456 /dev/nvidiactl
202a00000-202a04000 rw-s 00000000 00:06 456 /dev/nvidiactl
202a04000-202c00000 ---p 00000000 00:00 0
202c00000-203000000 rw-s 00000000 00:06 456 /dev/nvidiactl
203000000-203004000 rw-s 00000000 00:06 456 /dev/nvidiactl
203004000-203200000 ---p 00000000 00:00 0
203200000-203600000 rw-s 00000000 00:06 456 /dev/nvidiactl
203600000-203604000 rw-s 00000000 00:06 456 /dev/nvidiactl
203604000-203800000 ---p 00000000 00:00 0
203800000-203c00000 rw-s 00000000 00:06 456 /dev/nvidiactl
203c00000-203c04000 rw-s 00000000 00:06 456 /dev/nvidiactl
203c04000-203e00000 ---p 00000000 00:00 0
203e00000-204200000 rw-s 00000000 00:06 456 /dev/nvidiactl
204200000-204204000 rw-s 00000000 00:06 456 /dev/nvidiactl
204204000-204400000 ---p 00000000 00:00 0
204400000-204800000 rw-s 00000000 00:06 456 /dev/nvidiactl
204800000-204804000 rw-s 00000000 00:06 456 /dev/nvidiactl
204804000-204a00000 ---p 00000000 00:00 0
204a00000-204e00000 rw-s 00000000 00:06 456 /dev/nvidiactl
204e00000-204e04000 rw-s 00000000 00:06 456 /dev/nvidiactl
204e04000-205000000 ---p 00000000 00:00 0
205000000-205400000 rw-s 00000000 00:06 456 /dev/nvidiactl
205400000-205404000 rw-s 00000000 00:06 456 /dev/nvidiactl
205404000-205600000 ---p 00000000 00:00 0
205600000-205a00000 rw-s 00000000 00:06 456 /dev/nvidiactl
205a00000-205a04000 rw-s 00000000 00:06 456 /dev/nvidiactl
205a04000-205c00000 ---p 00000000 00:00 0
205c00000-206000000 rw-s 00000000 00:06 456 /dev/nvidiactl
206000000-206004000 rw-s 00000000 00:06 456 /dev/nvidiactl
206004000-206200000 ---p 00000000 00:00 0
206200000-206600000 rw-s 00000000 00:06 456 /dev/nvidiactl
206600000-206604000 rw-s 00000000 00:06 456 /dev/nvidiactl
206604000-206800000 ---p 00000000 00:00 0
206800000-206c00000 rw-s 00000000 00:06 456 /dev/nvidiactl
206c00000-206c04000 rw-s 00000000 00:06 456 /dev/nvidiactl
206c04000-206e00000 ---p 00000000 00:00 0
206e00000-207200000 rw-s 00000000 00:06 456 /dev/nvidiactl
207200000-207400000 ---p 00000000 00:00 0
207400000-207600000 rw-s 00000000 00:06 456 /dev/nvidiactl
207600000-207800000 rw-s 00000000 00:06 456 /dev/nvidiactl
207800000-207a00000 ---p 00000000 00:00 0
207a00000-207a04000 rw-s 00000000 00:06 456 /dev/nvidiactl
207a04000-207c00000 ---p 00000000 00:00 0
207c00000-208000000 rw-s 00000000 00:06 456 /dev/nvidiactl
208000000-208e00000 ---p 00000000 00:00 0
208e00000-208e04000 rw-s 00000000 00:06 456 /dev/nvidiactl
208e04000-209000000 ---p 00000000 00:00 0
209000000-209400000 rw-s 00000000 00:06 456 /dev/nvidiactl
209400000-209404000 rw-s 00000000 00:06 456 /dev/nvidiactl
209404000-209600000 ---p 00000000 00:00 0
209600000-209a00000 rw-s 00000000 00:06 456 /dev/nvidiactl
209a00000-209a04000 rw-s 00000000 00:06 456 /dev/nvidiactl
209a04000-209c00000 ---p 00000000 00:00 0
209c00000-20a000000 rw-s 00000000 00:06 456 /dev/nvidiactl
20a000000-20a004000 rw-s 00000000 00:06 456 /dev/nvidiactl

zhreshold (Member) commented:
@Angzz Would disabling these lines help? /~https://github.com/apache/incubator-mxnet/blob/29ac19124555ca838f5f3a01da638eda221b07b2/python/mxnet/gluon/data/dataloader.py#L181-L183

Are you using RecordFiles? If not, this code path has nothing to do with your JPEG images.
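One way to try this without editing the installed package (a debugging sketch, under the assumption that the helper is still a module-level function, as the tracebacks in this thread show) is to stub it out at runtime before creating the DataLoader:

import mxnet.gluon.data.dataloader as dataloader_mod

# Debugging sketch only: no-op the recursive RecordIO re-fork to check whether
# this code path is what crashes the workers. Do not keep this in real training
# code when RecordFileDataset is used, because forked workers would then share
# a single file handle.
dataloader_mod._recursive_fork_recordio = lambda obj, depth, max_depth=1000: None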

Angzz commented Sep 26, 2018

@zhreshold Sorry, I don't understand why these lines should be deleted; if they are deleted, won't the recursive mechanism stop working? I do not use RecordFiles, just the images downloaded by the script gluoncv/datasets/mscoco.py. By the way, I find the trouble always occurs with COCO but not VOC. I suspect that when the number of image files reaches a certain amount (as with COCO), the multiprocessing in the DataLoader does not work well (just like PyTorch) and the failures become more aggressive. Lastly, thanks for your reply and the awesome work ^_^.

Angzz commented Sep 26, 2018

When training reaches epoch 13 on COCO, another error occurs:

[13:44:22] src/resource.cc:262: Ignore CUDA Error [13:44:22] src/storage/storage.cc:65: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading CUDA: initialization error

Stack trace returned 10 entries:
[bt] (0) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x379e1a) [0x7fadcc375e1a]
[bt] (1) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x37a451) [0x7fadcc376451]
[bt] (2) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x3024ddd) [0x7fadcf020ddd]
[bt] (3) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x302cab8) [0x7fadcf028ab8]
[bt] (4) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x302ff2c) [0x7fadcf02bf2c]
[bt] (5) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x297291d) [0x7fadce96e91d]
[bt] (6) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x297c414) [0x7fadce978414]
[bt] (7) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2987585) [0x7fadce983585]
[bt] (8) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x29731f8) [0x7fadce96f1f8]
[bt] (9) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2973d24) [0x7fadce96fd24]

[13:44:22] src/engine/threaded_engine_perdevice.cc:99: Ignore CUDA Error [13:44:22] /home/travis/build/dmlc/mxnet-distro/mxnet-build/3rdparty/mshadow/mshadow/./tensor_gpu-inl.h:35: Check failed: e == cudaSuccess CUDA: initialization error

Stack trace returned 10 entries:
[bt] (0) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x379e1a) [0x7fadcc375e1a]
[bt] (1) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x37a451) [0x7fadcc376451]
[bt] (2) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x297aea8) [0x7fadce976ea8]
[bt] (3) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2987572) [0x7fadce983572]
[bt] (4) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x29731f8) [0x7fadce96f1f8]
[bt] (5) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2973d24) [0x7fadce96fd24]
[bt] (6) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x3030221) [0x7fadcf02c221]
[bt] (7) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x30302e2) [0x7fadcf02c2e2]
[bt] (8) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x37d45a) [0x7fadcc37945a]
[bt] (9) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x30345c9) [0x7fadcf0305c9]

[13:44:22] src/resource.cc:262: Ignore CUDA Error [13:44:22] src/storage/storage.cc:65: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading CUDA: initialization error

Stack trace returned 10 entries:
[bt] (0) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x379e1a) [0x7fadcc375e1a]
[bt] (1) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x37a451) [0x7fadcc376451]
[bt] (2) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x3024ddd) [0x7fadcf020ddd]
[bt] (3) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x302cab8) [0x7fadcf028ab8]
[bt] (4) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x302ff2c) [0x7fadcf02bf2c]
[bt] (5) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x297291d) [0x7fadce96e91d]
[bt] (6) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x297c414) [0x7fadce978414]
[bt] (7) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2987585) [0x7fadce983585]
[bt] (8) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x29731f8) [0x7fadce96f1f8]
[bt] (9) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2973d24) [0x7fadce96fd24]

terminate called after throwing an instance of 'std::system_error'
what(): Invalid argument
Segmentation fault (core dumped)

Angzz commented Sep 26, 2018

I finally solved this problem by following this link: r9y9/gantts#14, but I don't know why it works.

zhreshold (Member) commented:
@Angzz Not sure why; maybe it is Python-related. However, it is not relevant to this thread, so I am going to close this issue. Let me know if anyone is still getting the original recursion error.

RuRo (Contributor) commented Dec 25, 2018

Hi, I am getting a very similar error:

Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.6/dist-packages/mxnet/gluon/data/dataloader.py", line 178, in worker_loop
    _recursive_fork_recordio(dataset, 0, 1000)
  File "/usr/local/lib/python3.6/dist-packages/mxnet/gluon/data/dataloader.py", line 173, in _recursive_fork_recordio
    _recursive_fork_recordio(v, depth + 1, max_depth)
  File "/usr/local/lib/python3.6/dist-packages/mxnet/gluon/data/dataloader.py", line 173, in _recursive_fork_recordio
    _recursive_fork_recordio(v, depth + 1, max_depth)
  File "/usr/local/lib/python3.6/dist-packages/mxnet/gluon/data/dataloader.py", line 173, in _recursive_fork_recordio
    _recursive_fork_recordio(v, depth + 1, max_depth)
  [Previous line repeated 970 more times]
  File "/usr/local/lib/python3.6/dist-packages/mxnet/gluon/data/dataloader.py", line 166, in _recursive_fork_recordio
    if depth >= max_depth:
RecursionError: maximum recursion depth exceeded in comparison

I am using the latest cu90mkl Docker image (MXNet version 1.3.1). Unfortunately, I can't provide the exact code for legal reasons.

I have a custom class that inherits from mxnet.gluon.data.Dataset. During the call to __getitem__, a number of transforms are applied. To speed this up, I tried wrapping the transforms in an mxnet.gluon.data.vision.transforms.Compose, which broke the DataLoader.

Applying the transforms sequentially works fine, but composing them results in a RecursionError.
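A minimal sketch of the setup described above (placeholder data and class names, assuming the 1.3.x Gluon API; whether it actually reproduces the error may depend on the transforms involved):

# Minimal sketch of the failing pattern: a custom Dataset that stores a
# transforms.Compose (a HybridSequential with nested children) as an
# attribute, read through a multi-worker DataLoader.
import mxnet as mx
from mxnet.gluon.data import Dataset, DataLoader
from mxnet.gluon.data.vision import transforms

class ToyDataset(Dataset):
    def __init__(self):
        # Composing the transforms is what triggers the deep attribute walk
        # in _recursive_fork_recordio; applying them one by one does not.
        self._transform = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize(0.5, 0.5),
        ])

    def __len__(self):
        return 16

    def __getitem__(self, idx):
        img = mx.nd.zeros((64, 64, 3), dtype='uint8')  # placeholder image
        return self._transform(img)

# num_workers > 0 makes each worker call _recursive_fork_recordio(dataset, 0, 1000)
loader = DataLoader(ToyDataset(), batch_size=4, num_workers=2)
for batch in loader:
    pass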

aaronmarkham (Contributor) commented:
Reopening this issue since it looks like we have a public example now in the lipnet code that can be used to figure out what's going on...

aaronmarkham reopened this Jan 4, 2019
Demohai commented Feb 11, 2019

@RuRo has your problem been solved?
