Too large max depth value in _recursive_fork_recordio #12619

caiqi · 2018-09-20T18:13:31Z

It seems that 1000 is too large a max depth for _recursive_fork_recordio in
https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/gluon/data/dataloader.py#L178
As soon as len(obj.__dict__.items()) > 2 at each level, this function can be called more than 2 ** 1000 times.

The following code in https://github.com/dmlc/gluon-cv/blob/master/scripts/detection/ssd/train_ssd.py#L96 in gluon-cv causes "RecursionError: maximum recursion depth exceeded in comparison" on Windows 10 with the latest build. The reason I found is that the dataset object contains a HybridSequential object, and the HybridSequential has many children. This function was introduced in commit #12554. Would it be OK to return early from this function when obj is not an instance of mx.gluon.data.dataset.Dataset?
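The failure mode described above can be reproduced without MXNet. The following is a minimal sketch, not the actual dataloader code: `Node`, `Dataset`, `walk`, and `build_chain` are hypothetical names introduced only for illustration. It shows that a depth cap equal to the interpreter's recursion limit does not prevent a RecursionError, while a smaller cap or an early type check does.

```python
import sys


class Node:
    """Hypothetical stand-in for an object that holds another object as an attribute."""

    def __init__(self, child=None):
        self.child = child


class Dataset:
    """Hypothetical marker base class; this is NOT the MXNet Dataset class."""


def walk(obj, depth, max_depth, only_type=None):
    """Depth-limited walk over obj.__dict__; returns the number of calls made."""
    if depth >= max_depth:
        return 1
    if only_type is not None and not isinstance(obj, only_type):
        return 1  # stop early: do not descend into unrelated objects
    calls = 1
    for value in getattr(obj, "__dict__", {}).values():
        calls += walk(value, depth + 1, max_depth, only_type)
    return calls


def build_chain(length):
    """Build `length` nested Node objects iteratively (no recursion needed)."""
    node = None
    for _ in range(length):
        node = Node(node)
    return node


limit = sys.getrecursionlimit()
chain = build_chain(limit + 100)

try:
    walk(chain, 0, limit)  # depth cap equal to the interpreter limit
    overflowed = False
except RecursionError:
    overflowed = True

shallow_calls = walk(chain, 0, 50)              # small cap: the walk stops at depth 50
guarded_calls = walk(chain, 0, limit, Dataset)  # type guard: the walk stops at the root
```

In this sketch the cap only helps if it is comfortably below `sys.getrecursionlimit()`, because the caller's own frames also count toward the limit. The type guard is the variant proposed in the question above.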


stu1130 · 2018-09-20T21:47:53Z

Thanks for submitting the issue @caiqi
@mxnet-label-bot [data-loading]

eric-haibin-lin · 2018-09-23T06:18:57Z

@zhreshold

zhreshold · 2018-09-23T06:28:29Z

see #12622

Angzz · 2018-09-24T02:51:41Z

@zhreshold I applied the change from your commit, but the error still exists.

zhreshold · 2018-09-24T18:19:18Z

@Angzz What os? Can you print this for me to debug?

import sys
print(sys.getrecursionlimit())

Angzz · 2018-09-25T01:35:22Z

@zhreshold
Ubuntu 16.04. I printed the value you asked for using python2, and the output is 1000.

zhreshold · 2018-09-25T01:40:09Z

Okay, I modified the search depth to be less aggressive.
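One way to make a search depth less aggressive is to cap it below the interpreter's recursion limit. This is only a sketch of the idea and not necessarily what the fix does: `conservative_max_depth` and its `margin` default are hypothetical.

```python
import sys


def conservative_max_depth(requested, margin=100):
    """Cap a requested search depth below the interpreter's recursion limit.

    `margin` is an assumed safety buffer for frames already on the call stack.
    """
    limit = sys.getrecursionlimit()
    return max(0, min(requested, limit - margin))


# With a recursion limit of 1000 (the value reported above), a requested
# depth of 1000 is reduced so the search cannot exhaust the stack by itself.
depth = conservative_max_depth(1000)
```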

Angzz · 2018-09-25T01:47:56Z

@zhreshold OK, I will update to the MXNet pre-release version and run an experiment, thanks.

Angzz · 2018-09-25T12:08:10Z

After updating to 1.3.1b20180925, an error occurs when training SSD with COCO, but VOC works normally:

---------------- train log and error log ------------------

INFO:root:Start training from [Epoch 0]
[19:54:19] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:109: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[19:54:28] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:109: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
python: malloc.c:3722: _int_malloc: Assertion `(unsigned long) (size) >= (unsigned long) (nb)' failed.
*** Error in `python': malloc(): memory corruption: 0x00007fe3d29b3690 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fe5b37c87e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x8213e)[0x7fe5b37d313e]
/lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x54)[0x7fe5b37d5184]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(_Znwm+0x18)[0x7fe5af411e78]
/home/liang/.local/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x407eb0)[0x7fe52630beb0]
/home/liang/.local/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x40d7c9)[0x7fe5263117c9]
/home/liang/.local/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x2b88458)[0x7fe528a8c458]
/home/liang/.local/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x2adcb29)[0x7fe5289e0b29]
/home/liang/.local/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x2ae6544)[0x7fe5289ea544]
/home/liang/.local/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x2aea6c2)[0x7fe5289ee6c2]
/home/liang/.local/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x2ae6c64)[0x7fe5289eac64]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80)[0x7fe5af43cc80]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7fe5b3b226ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fe5b385841d]
======= Memory map: ========
00400000-006de000 r-xp 00000000 103:02 16254177 /usr/bin/python2.7
008dd000-008de000 r--p 002dd000 103:02 16254177 /usr/bin/python2.7
008de000-00955000 rw-p 002de000 103:02 16254177 /usr/bin/python2.7
00955000-00978000 rw-p 00000000 00:00 0
00c8d000-a94d5000 rw-p 00000000 00:00 0 [heap]
a94d5000-a9806000 rw-p 00000000 00:00 0 [heap]
200000000-200200000 rw-s 00000000 00:06 456 /dev/nvidiactl
200200000-200400000 ---p 00000000 00:00 0
200400000-200404000 rw-s 00000000 00:06 456 /dev/nvidiactl
200404000-200600000 ---p 00000000 00:00 0
200600000-200a00000 rw-s 00000000 00:06 456 /dev/nvidiactl
200a00000-201800000 ---p 00000000 00:00 0
201800000-201804000 rw-s 00000000 00:06 456 /dev/nvidiactl
201804000-201a00000 ---p 00000000 00:00 0
201a00000-201e00000 rw-s 00000000 00:06 456 /dev/nvidiactl
201e00000-201e04000 rw-s 00000000 00:06 456 /dev/nvidiactl
201e04000-202000000 ---p 00000000 00:00 0
202000000-202400000 rw-s 00000000 00:06 456 /dev/nvidiactl
202400000-202404000 rw-s 00000000 00:06 456 /dev/nvidiactl
202404000-202600000 ---p 00000000 00:00 0
202600000-202a00000 rw-s 00000000 00:06 456 /dev/nvidiactl
202a00000-202a04000 rw-s 00000000 00:06 456 /dev/nvidiactl
202a04000-202c00000 ---p 00000000 00:00 0
202c00000-203000000 rw-s 00000000 00:06 456 /dev/nvidiactl
203000000-203004000 rw-s 00000000 00:06 456 /dev/nvidiactl
203004000-203200000 ---p 00000000 00:00 0
203200000-203600000 rw-s 00000000 00:06 456 /dev/nvidiactl
203600000-203604000 rw-s 00000000 00:06 456 /dev/nvidiactl
203604000-203800000 ---p 00000000 00:00 0
203800000-203c00000 rw-s 00000000 00:06 456 /dev/nvidiactl
203c00000-203c04000 rw-s 00000000 00:06 456 /dev/nvidiactl
203c04000-203e00000 ---p 00000000 00:00 0
203e00000-204200000 rw-s 00000000 00:06 456 /dev/nvidiactl
204200000-204204000 rw-s 00000000 00:06 456 /dev/nvidiactl
204204000-204400000 ---p 00000000 00:00 0
204400000-204800000 rw-s 00000000 00:06 456 /dev/nvidiactl
204800000-204804000 rw-s 00000000 00:06 456 /dev/nvidiactl
204804000-204a00000 ---p 00000000 00:00 0
204a00000-204e00000 rw-s 00000000 00:06 456 /dev/nvidiactl
204e00000-204e04000 rw-s 00000000 00:06 456 /dev/nvidiactl
204e04000-205000000 ---p 00000000 00:00 0
205000000-205400000 rw-s 00000000 00:06 456 /dev/nvidiactl
205400000-205404000 rw-s 00000000 00:06 456 /dev/nvidiactl
205404000-205600000 ---p 00000000 00:00 0
205600000-205a00000 rw-s 00000000 00:06 456 /dev/nvidiactl
205a00000-205a04000 rw-s 00000000 00:06 456 /dev/nvidiactl
205a04000-205c00000 ---p 00000000 00:00 0
205c00000-206000000 rw-s 00000000 00:06 456 /dev/nvidiactl
206000000-206004000 rw-s 00000000 00:06 456 /dev/nvidiactl
206004000-206200000 ---p 00000000 00:00 0
206200000-206600000 rw-s 00000000 00:06 456 /dev/nvidiactl
206600000-206604000 rw-s 00000000 00:06 456 /dev/nvidiactl
206604000-206800000 ---p 00000000 00:00 0
206800000-206c00000 rw-s 00000000 00:06 456 /dev/nvidiactl
206c00000-206c04000 rw-s 00000000 00:06 456 /dev/nvidiactl
206c04000-206e00000 ---p 00000000 00:00 0
206e00000-207200000 rw-s 00000000 00:06 456 /dev/nvidiactl
207200000-207400000 ---p 00000000 00:00 0
207400000-207600000 rw-s 00000000 00:06 456 /dev/nvidiactl
207600000-207800000 rw-s 00000000 00:06 456 /dev/nvidiactl
207800000-207a00000 ---p 00000000 00:00 0
207a00000-207a04000 rw-s 00000000 00:06 456 /dev/nvidiactl
207a04000-207c00000 ---p 00000000 00:00 0
207c00000-208000000 rw-s 00000000 00:06 456 /dev/nvidiactl
208000000-208e00000 ---p 00000000 00:00 0
208e00000-208e04000 rw-s 00000000 00:06 456 /dev/nvidiactl
208e04000-209000000 ---p 00000000 00:00 0
209000000-209400000 rw-s 00000000 00:06 456 /dev/nvidiactl
209400000-209404000 rw-s 00000000 00:06 456 /dev/nvidiactl
209404000-209600000 ---p 00000000 00:00 0
209600000-209a00000 rw-s 00000000 00:06 456 /dev/nvidiactl
209a00000-209a04000 rw-s 00000000 00:06 456 /dev/nvidiactl
209a04000-209c00000 ---p 00000000 00:00 0
209c00000-20a000000 rw-s 00000000 00:06 456 /dev/nvidiactl
20a000000-20a004000 rw-s 00000000 00:06 456 /dev/nvidiactl

zhreshold · 2018-09-25T17:57:23Z

@Angzz Would disabling these lines help? https://github.com/apache/incubator-mxnet/blob/29ac19124555ca838f5f3a01da638eda221b07b2/python/mxnet/gluon/data/dataloader.py#L181-L183

Are you using RecordFiles? If not, it has nothing to do with JPEG images.

Angzz · 2018-09-26T02:06:33Z

@zhreshold Sorry, I don't understand why these lines should be deleted. If they are deleted, won't the recursive mechanism stop working? I am not using RecordFiles, just the images downloaded by the script gluoncv/datasets/mscoco.py. By the way, the trouble always occurs with COCO but not with VOC. I suspect that once the number of image files reaches a certain size (as with COCO), the multiprocessing in the DataLoader stops working well (as in PyTorch) and becomes more aggressive. Finally, thanks for your reply and the awesome work ^_^.

Angzz · 2018-09-26T06:03:18Z

When training reached epoch 13 on COCO, another error occurred:

[13:44:22] src/resource.cc:262: Ignore CUDA Error [13:44:22] src/storage/storage.cc:65: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading CUDA: initialization error

Stack trace returned 10 entries:
[bt] (0) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x379e1a) [0x7fadcc375e1a]
[bt] (1) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x37a451) [0x7fadcc376451]
[bt] (2) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x3024ddd) [0x7fadcf020ddd]
[bt] (3) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x302cab8) [0x7fadcf028ab8]
[bt] (4) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x302ff2c) [0x7fadcf02bf2c]
[bt] (5) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x297291d) [0x7fadce96e91d]
[bt] (6) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x297c414) [0x7fadce978414]
[bt] (7) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2987585) [0x7fadce983585]
[bt] (8) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x29731f8) [0x7fadce96f1f8]
[bt] (9) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2973d24) [0x7fadce96fd24]

[13:44:22] src/engine/threaded_engine_perdevice.cc:99: Ignore CUDA Error [13:44:22] /home/travis/build/dmlc/mxnet-distro/mxnet-build/3rdparty/mshadow/mshadow/./tensor_gpu-inl.h:35: Check failed: e == cudaSuccess CUDA: initialization error

Stack trace returned 10 entries:
[bt] (0) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x379e1a) [0x7fadcc375e1a]
[bt] (1) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x37a451) [0x7fadcc376451]
[bt] (2) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x297aea8) [0x7fadce976ea8]
[bt] (3) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2987572) [0x7fadce983572]
[bt] (4) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x29731f8) [0x7fadce96f1f8]
[bt] (5) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2973d24) [0x7fadce96fd24]
[bt] (6) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x3030221) [0x7fadcf02c221]
[bt] (7) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x30302e2) [0x7fadcf02c2e2]
[bt] (8) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x37d45a) [0x7fadcc37945a]
[bt] (9) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x30345c9) [0x7fadcf0305c9]

[13:44:22] src/resource.cc:262: Ignore CUDA Error [13:44:22] src/storage/storage.cc:65: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading CUDA: initialization error

Stack trace returned 10 entries:
[bt] (0) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x379e1a) [0x7fadcc375e1a]
[bt] (1) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x37a451) [0x7fadcc376451]
[bt] (2) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x3024ddd) [0x7fadcf020ddd]
[bt] (3) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x302cab8) [0x7fadcf028ab8]
[bt] (4) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x302ff2c) [0x7fadcf02bf2c]
[bt] (5) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x297291d) [0x7fadce96e91d]
[bt] (6) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x297c414) [0x7fadce978414]
[bt] (7) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2987585) [0x7fadce983585]
[bt] (8) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x29731f8) [0x7fadce96f1f8]
[bt] (9) /home/liang/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2973d24) [0x7fadce96fd24]

terminate called after throwing an instance of 'std::system_error'
what(): Invalid argument
Segmentation fault (core dumped)

Angzz · 2018-09-26T14:45:37Z

I finally solved this problem by following this link:
r9y9/gantts#14
but I don't know why it works.

zhreshold · 2018-09-26T17:50:16Z

@Angzz Not sure why, maybe python related. However, it is not relevant to this thread. I am going to close this issue. Let me know if anyone is still getting the same original recursion error.

RuRo · 2018-12-25T09:50:45Z

Hi, I am getting a very similar error:

Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.6/dist-packages/mxnet/gluon/data/dataloader.py", line 178, in worker_loop
    _recursive_fork_recordio(dataset, 0, 1000)
  File "/usr/local/lib/python3.6/dist-packages/mxnet/gluon/data/dataloader.py", line 173, in _recursive_fork_recordio
    _recursive_fork_recordio(v, depth + 1, max_depth)
  File "/usr/local/lib/python3.6/dist-packages/mxnet/gluon/data/dataloader.py", line 173, in _recursive_fork_recordio
    _recursive_fork_recordio(v, depth + 1, max_depth)
  File "/usr/local/lib/python3.6/dist-packages/mxnet/gluon/data/dataloader.py", line 173, in _recursive_fork_recordio
    _recursive_fork_recordio(v, depth + 1, max_depth)
  [Previous line repeated 970 more times]
  File "/usr/local/lib/python3.6/dist-packages/mxnet/gluon/data/dataloader.py", line 166, in _recursive_fork_recordio
    if depth >= max_depth:
RecursionError: maximum recursion depth exceeded in comparison

I am using the latest cu90mkl docker (mxnet version 1.3.1). Unfortunately, I can't provide you with the exact code, because of legal reasons.

I have a custom class that inherits from mxnet.gluon.data.Dataset. During the call to __getitem__, a bunch of transforms are called. To speed this up, I've tried wrapping the transforms in a mxnet.gluon.data.vision.transforms.Compose, which broke the DataLoader.

Just applying the transforms sequentially works fine, but Composing them results in a RecursionError.
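For reference, the sequential variant that is reported to work can be sketched in plain Python. This is an illustration only: `SequentialDataset`, `scale`, and `shift` are hypothetical, and the class does not inherit from or use any MXNet API.

```python
class SequentialDataset:
    """Hypothetical dataset that applies its transforms one after another."""

    def __init__(self, items, transforms):
        self._items = list(items)
        self._transforms = list(transforms)  # plain callables, applied in order

    def __len__(self):
        return len(self._items)

    def __getitem__(self, idx):
        sample = self._items[idx]
        for transform in self._transforms:
            sample = transform(sample)
        return sample


def scale(x):
    """Hypothetical transform: double the sample."""
    return x * 2


def shift(x):
    """Hypothetical transform: add one to the sample."""
    return x + 1


dataset = SequentialDataset([1, 2, 3], [scale, shift])
samples = [dataset[i] for i in range(len(dataset))]
```

Here the transforms are kept in a plain list and applied in a loop inside `__getitem__`, rather than being wrapped in a single composed object.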

aaronmarkham · 2019-01-04T16:19:10Z

Reopening this issue since it looks like we have a public example now in the lipnet code that can be used to figure out what's going on...

Demohai · 2019-02-11T07:33:43Z

@RuRo has your problem been solved?

marcoabreu added the Data-loading label Sep 20, 2018

zhreshold closed this as completed Sep 26, 2018

soeque1 mentioned this issue Jan 4, 2019: Update lip reading example #13647 (merged)

aaronmarkham reopened this Jan 4, 2019
