This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Reading/ignoring corrupt images with Gluon data loader (imdecode error cannot be captured) #12280

Closed
mqtlam opened this issue Aug 21, 2018 · 10 comments · Fixed by dmlc/dmlc-core#517

Comments

@mqtlam
Contributor

mqtlam commented Aug 21, 2018

Description

Short Version

mxnet.image.imdecode crashes and hangs when loading certain corrupt images using Gluon data loader. One possible workaround is to wrap a try/except block around imdecode, but Python try/except cannot capture MXNetError.

Long Version

I am working with a dataset so large that it is impractical to clean all images beforehand. Currently, when using the Gluon data loader, loading a corrupt image fails in imdecode with an MXNetError exception (see Error Message below) and then hangs. Ultimately, I would like the Gluon data loader to ignore corrupt images instead of crashing.

My idea for working around this issue is as follows: wrap imdecode in a try/except block and, whenever an exception occurs, simply return a dummy image (and label). Given the dummy image/label during training, I can skip backpropagation for that sample. I've tried that (see What have you tried to solve it? below), but it does not work because Python try/except cannot capture MXNetError.

I think there should be a mechanism to capture an error from imdecode or imread (both from mxnet.image) rather than crashing, unless I am missing something.
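The dummy-sample idea described above can be sketched independently of MXNet. This is a minimal illustration, not MXNet's API: `load_or_sentinel`, `drop_sentinels`, and `fake_decode` are hypothetical names, and the decoder is passed in as a callable that is assumed to raise before it returns when a file is corrupt. The sentinel value is the one used in the snippet later in this issue.

```python
from typing import Any, Callable, List, Tuple

MISSING_LABEL = -1234  # sentinel value taken from the issue's own snippet


def load_or_sentinel(
    path: str,
    label: int,
    decode: Callable[[str], Any],
    dummy: Any,
) -> Tuple[Any, int]:
    """Return (image, label), or (dummy, MISSING_LABEL) if decoding fails.

    `decode` is a hypothetical callable supplied by the caller; it must raise
    on a corrupt file *before returning* for the except clause to see it.
    """
    try:
        return decode(path), label
    except Exception:  # deliberately broad: any decode failure maps to the sentinel
        return dummy, MISSING_LABEL


def drop_sentinels(samples: List[Tuple[Any, int]]) -> List[Tuple[Any, int]]:
    """Remove placeholder samples so they never contribute to the loss."""
    return [(img, lbl) for img, lbl in samples if lbl != MISSING_LABEL]


def fake_decode(path: str) -> bytes:
    """Stand-in decoder for the demo: rejects paths containing 'corrupt'."""
    if "corrupt" in path:
        raise ValueError("Decoding failed. Invalid image file.")
    return b"pixels:" + path.encode()


batch = [
    load_or_sentinel(p, lbl, fake_decode, dummy=b"")
    for p, lbl in [("car/0001.jpg", 0), ("car/corrupt.jpg", 0), ("bus/123.jpg", 1)]
]
kept = drop_sentinels(batch)
print(len(batch), len(kept))
```

The sketch only works if the decoder fails synchronously. With an MXNet decoder, one would presumably need to force the result inside `decode` (for example by calling `wait_to_read()` on the returned NDArray) so that the error is raised within the try block; that is an assumption about usage, not something verified in this thread.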

Environment info (Required)

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    16
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping:              1
CPU MHz:               2699.804
CPU max MHz:           3000.0000
CPU min MHz:           1200.0000
BogoMIPS:              4600.11
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              46080K
NUMA node0 CPU(s):     0-31
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt
----------Python Info----------
Version      : 3.7.0
Compiler     : GCC 7.2.0
Build        : ('default', 'Jun 28 2018 13:15:42')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 18.0
Directory    : /home/ubuntu/anaconda3/envs/mxnet_latest/lib/python3.7/site-packages/pip
----------MXNet Info-----------
Version      : 1.3.0
Directory    : /home/ubuntu/incubator-mxnet/python/mxnet
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform     : Linux-4.4.0-1062-aws-x86_64-with-debian-stretch-sid
system       : Linux
node         : ip-172-31-35-198
release      : 4.4.0-1062-aws
version      : #71-Ubuntu SMP Fri Jun 15 10:07:39 UTC 2018
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0022 sec, LOAD: 0.4228 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0008 sec, LOAD: 0.3465 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0006 sec, LOAD: 0.3446 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0005 sec, LOAD: 0.1155 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0005 sec, LOAD: 0.0623 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0004 sec, LOAD: 0.0202 sec.

Package used (Python/R/Scala/Julia): Python

Build info (Required if built from source)

Compiler (gcc/clang/mingw/visual studio): gcc

MXNet commit hash: a6ecb59

Build config:

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.

#-------------------------------------------------------------------------------
#  Template configuration for compiling mxnet
#
#  If you want to change the configuration, please use the following
#  steps. Assume you are on the root directory of mxnet. First copy the this
#  file so that any local changes will be ignored by git
#
#  $ cp make/config.mk .
#
#  Next modify the according entries, and then compile by
#
#  $ make
#
#  or build in parallel with 8 threads
#
#  $ make -j8
#-------------------------------------------------------------------------------

#---------------------
# choice of compiler
#--------------------

ifndef CC
export CC = gcc
endif
ifndef CXX
export CXX = g++
endif
ifndef NVCC
export NVCC = nvcc
endif

# whether compile with options for MXNet developer
DEV = 0

# whether compile with debug
DEBUG = 0

# whether to turn on segfault signal handler to log the stack trace
USE_SIGNAL_HANDLER =

# the additional link flags you want to add
ADD_LDFLAGS =

# the additional compile flags you want to add
ADD_CFLAGS =

#---------------------------------------------
# matrix computation libraries for CPU/GPU
#---------------------------------------------

# whether use CUDA during compile
USE_CUDA = 0

# add the path to CUDA library to link and compile flag
# if you have already add them to environment variable, leave it as NONE
# USE_CUDA_PATH = /usr/local/cuda
USE_CUDA_PATH = NONE

# whether to enable CUDA runtime compilation
ENABLE_CUDA_RTC = 1

# whether use CuDNN R3 library
USE_CUDNN = 0

#whether to use NCCL library
USE_NCCL = 0
#add the path to NCCL library
USE_NCCL_PATH = NONE

# whether use opencv during compilation
# you can disable it, however, you will not able to use
# imbin iterator
USE_OPENCV = 1

#whether use libjpeg-turbo for image decode without OpenCV wrapper
USE_LIBJPEG_TURBO = 0
#add the path to libjpeg-turbo library
USE_LIBJPEG_TURBO_PATH = NONE

# use openmp for parallelization
USE_OPENMP = 1

# whether use MKL-DNN library
USE_MKLDNN = 0

# whether use NNPACK library
USE_NNPACK = 0

# choose the version of blas you want to use
# can be: mkl, blas, atlas, openblas
# in default use atlas for linux while apple for osx
UNAME_S := $(shell uname -s)
ifeq ($(UNAME_S), Darwin)
USE_BLAS = apple
else
USE_BLAS = atlas
endif

# whether use lapack during compilation
# only effective when compiled with blas versions openblas/apple/atlas/mkl
USE_LAPACK = 1

# path to lapack library in case of a non-standard installation
USE_LAPACK_PATH =

# add path to intel library, you may need it for MKL, if you did not add the path
# to environment variable
USE_INTEL_PATH = NONE

# If use MKL only for BLAS, choose static link automatically to allow python wrapper
ifeq ($(USE_BLAS), mkl)
USE_STATIC_MKL = 1
else
USE_STATIC_MKL = NONE
endif

#----------------------------
# Settings for power and arm arch
#----------------------------
ARCH := $(shell uname -a)
ifneq (,$(filter $(ARCH), armv6l armv7l powerpc64le ppc64le aarch64))
	USE_SSE=0
	USE_F16C=0
else
	USE_SSE=1
endif

#----------------------------
# F16C instruction support for faster arithmetic of fp16 on CPU
#----------------------------
# For distributed training with fp16, this helps even if training on GPUs
# If left empty, checks CPU support and turns it on.
# For cross compilation, please check support for F16C on target device and turn off if necessary.
USE_F16C =

#----------------------------
# distributed computing
#----------------------------

# whether or not to enable multi-machine supporting
USE_DIST_KVSTORE = 0

# whether or not allow to read and write HDFS directly. If yes, then hadoop is
# required
USE_HDFS = 0

# path to libjvm.so. required if USE_HDFS=1
LIBJVM=$(JAVA_HOME)/jre/lib/amd64/server

# whether or not allow to read and write AWS S3 directly. If yes, then
# libcurl4-openssl-dev is required, it can be installed on Ubuntu by
# sudo apt-get install -y libcurl4-openssl-dev
USE_S3 = 0

#----------------------------
# performance settings
#----------------------------
# Use operator tuning
USE_OPERATOR_TUNING = 1

# Use gperftools if found
USE_GPERFTOOLS = 1

# Use JEMalloc if found, and not using gperftools
USE_JEMALLOC = 1

#----------------------------
# additional operators
#----------------------------

# path to folders containing projects specific operators that you don't want to put in src/operators
EXTRA_OPERATORS =

#----------------------------
# other features
#----------------------------

# Create C++ interface package
USE_CPP_PACKAGE = 0

#----------------------------
# plugins
#----------------------------

# whether to use caffe integration. This requires installing caffe.
# You also need to add CAFFE_PATH/build/lib to your LD_LIBRARY_PATH
# CAFFE_PATH = $(HOME)/caffe
# MXNET_PLUGINS += plugin/caffe/caffe.mk

# WARPCTC_PATH = $(HOME)/warp-ctc
# MXNET_PLUGINS += plugin/warpctc/warpctc.mk

# whether to use sframe integration. This requires build sframe
# git@github.com:dato-code/SFrame.git
# SFRAME_PATH = $(HOME)/SFrame
# MXNET_PLUGINS += plugin/sframe/plugin.mk

Error Message:

Process Process-2:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/mxnet_latest/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/ubuntu/anaconda3/envs/mxnet_latest/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/incubator-mxnet/python/mxnet/gluon/data/dataloader.py", line 170, in worker_loop
    data_queue.put((idx, batch))
  File "/home/ubuntu/anaconda3/envs/mxnet_latest/lib/python3.7/multiprocessing/queues.py", line 358, in put
    obj = _ForkingPickler.dumps(obj)
  File "/home/ubuntu/anaconda3/envs/mxnet_latest/lib/python3.7/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/home/ubuntu/incubator-mxnet/python/mxnet/gluon/data/dataloader.py", line 63, in reduce_ndarray
    pid, fd, shape, dtype = data._to_shared_mem()
  File "/home/ubuntu/incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 200, in _to_shared_mem
    self.handle, ctypes.byref(shared_pid), ctypes.byref(shared_id)))
  File "/home/ubuntu/incubator-mxnet/python/mxnet/base.py", line 255, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [19:31:59] src/io/image_io.cc:162: Check failed: !dst.empty() Decoding failed. Invalid image file.
 
Stack trace returned 10 entries:
[bt] (0) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5b) [0x7ff3b5cb5a0b]
[bt] (1) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7ff3b5cb6578]
[bt] (2) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::io::ImdecodeImpl(int, bool, void*, unsigned long, mxnet::NDArray*)+0x4c6) [0x7ff3b83e6fd6]
[bt] (3) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(+0x3a279db) [0x7ff3b8a4b9db]
[bt] (4) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x8e5) [0x7ff3b8a45e35]
[bt] (5) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#1}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0xe2) [0x7ff3b8a5c642]
[bt] (6) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run()+0x4a) [0x7ff3b8a4543a]
[bt] (7) /home/ubuntu/anaconda3/envs/mxnet_latest/bin/../lib/libstdc++.so.6(+0xafc5c) [0x7ff40d4cfc5c]
[bt] (8) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7ff41b5f66ba]
[bt] (9) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7ff41b32c41d]

Minimum reproducible example

Run train_imagenet.py from GluonCV (https://github.com/dmlc/gluon-cv/blob/master/scripts/classification/imagenet/train_imagenet.py with commit hash 863f19bc86cda0f785b97c39a360fbd8cbd1b0e1) on a training dataset with corrupted images (e.g., an image with 0 bytes).
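For the zero-byte case mentioned above, a cheap header check can flag suspect files without decoding them. This is a rough sketch using only the standard library; `looks_like_image` and `find_suspect_files` are hypothetical helper names, and the check covers only JPEG and PNG signatures. A file that passes may still fail to decode (for example, if it is truncated after the header).

```python
import os
import tempfile
from typing import List

# Leading "magic" bytes for the two formats the issue's dataset accepts.
JPEG_MAGIC = b"\xff\xd8\xff"
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"


def looks_like_image(path: str) -> bool:
    """Cheap check: non-empty file whose first bytes match JPEG or PNG."""
    try:
        with open(path, "rb") as f:
            head = f.read(8)
    except OSError:
        return False
    return head.startswith(JPEG_MAGIC) or head.startswith(PNG_MAGIC)


def find_suspect_files(root: str) -> List[str]:
    """Walk `root` and list image-named files that fail the header check."""
    suspects = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.lower().endswith((".jpg", ".jpeg", ".png")):
                full = os.path.join(dirpath, name)
                if not looks_like_image(full):
                    suspects.append(full)
    return sorted(suspects)


# Demo on a throwaway directory: one zero-byte file, one file with a PNG header.
demo_root = tempfile.mkdtemp()
open(os.path.join(demo_root, "empty.jpg"), "wb").close()
with open(os.path.join(demo_root, "ok.png"), "wb") as f:
    f.write(PNG_MAGIC + b"placeholder")
suspects = find_suspect_files(demo_root)
print([os.path.basename(p) for p in suspects])
```

Because it reads at most 8 bytes per file, a scan like this is far cheaper than decoding every image, which may make it usable even on datasets too large to clean fully.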

What have you tried to solve it?

  1. I modified ImageFolderDataset below so that it could handle corrupt images in theory. The try/except does not capture MXNetError.
import os
import warnings

import mxnet as mx
from mxnet import gluon, image

DEFAULT_IMAGE_SIZE = 224
DEFAULT_MISSING_LABELS_SENTINEL = -1234

class ImageFolderDataset(gluon.data.Dataset):
    """A dataset for loading image files stored in a folder structure like::
        root/car/0001.jpg
        root/car/xxxa.jpg
        root/car/yyyb.jpg
        root/bus/123.jpg
        root/bus/023.jpg
        root/bus/wwww.jpg
    Parameters
    ----------
    root : str
        Path to root directory.
    flag : {0, 1}, default 1
        If 0, always convert loaded images to greyscale (1 channel).
        If 1, always convert loaded images to colored (3 channels).
    transform : callable, default None
        A function that takes data and label and transforms them:
    ::
        transform = lambda data, label: (data.astype(np.float32)/255, label)
    Attributes
    ----------
    synsets : list
        List of class names. `synsets[i]` is the name for the integer label `i`
    items : list of tuples
        List of all images in (filename, label) pairs.
    """
    def __init__(self, root, flag=1, transform=None, missing_sentinel=DEFAULT_MISSING_LABELS_SENTINEL):
        self._root = os.path.expanduser(root)
        self._flag = flag
        self._transform = transform
        self._missing_sentinel = missing_sentinel
        self._exts = tuple(['.jpg', '.jpeg', '.png'])
        self._list_images(self._root)
 
    def _list_images(self, root):
        self.synsets = []
        self.items = []
 
        for folder in sorted(os.listdir(root)):
            path = os.path.join(root, folder)
            if not os.path.isdir(path):
                warnings.warn('Ignoring %s, which is not a directory.'%path, stacklevel=3)
                continue
            label = len(self.synsets)
            self.synsets.append(folder)
            for filename in sorted(os.listdir(path)):
                filename = os.path.join(path, filename)
                ext = os.path.splitext(filename)[1]
                if ext.lower() not in self._exts:
                    warnings.warn('Ignoring %s of type %s. Only support %s'%(
                        filename, ext, ', '.join(self._exts)))
                    continue
                self.items.append((filename, label))
 
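    def __getitem__(self, idx):
        file_name = self.items[idx][0]
        if os.path.exists(file_name) and file_name.endswith(self._exts):
            try:
                # Judging by the traceback above, the decode runs on an engine
                # thread and the failure is raised later (in _to_shared_mem),
                # outside this block, so the except clause never sees it.
                img = image.imread(file_name, self._flag)
                label = self.items[idx][1]
            except Exception:  # avoid a bare except, which also traps KeyboardInterrupt
                img = mx.nd.zeros((3, DEFAULT_IMAGE_SIZE, DEFAULT_IMAGE_SIZE))
                label = self._missing_sentinel
        else:
            # Missing file or unsupported extension: return the placeholder sample.
            img = mx.nd.zeros((3, DEFAULT_IMAGE_SIZE, DEFAULT_IMAGE_SIZE))
            label = self._missing_sentinel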
 
        if self._transform is not None:
            return self._transform(img, label)
        return img, label
 
    def __len__(self):
        return len(self.items)
  2. I replaced image.imread with cv2.imread (using OpenCV directly) in the above code. It seemed to work on some images but still crashes eventually, which may mean mxnet.image.imdecode is also being called somewhere else. I have not explored this yet.
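One plausible reading of the traceback in the Error Message section is that the decode is executed on an engine thread and its failure is only raised at a later synchronization point (`_to_shared_mem`), which would explain why a try/except around the call itself sees nothing. The snippet below is an analogy using Python's standard `concurrent.futures`, not MXNet's engine: it shows how an error in deferred work escapes a try block around the submission and surfaces only where the result is requested. The `decode` function is a hypothetical stand-in.

```python
from concurrent.futures import ThreadPoolExecutor


def decode(data: bytes) -> bytes:
    """Stand-in decoder: fails on empty input, like a zero-byte image."""
    if not data:
        raise ValueError("Decoding failed. Invalid image file.")
    return data


caught_at = None
with ThreadPoolExecutor(max_workers=1) as pool:
    try:
        # submit() only schedules the work and returns a Future immediately.
        pending = pool.submit(decode, b"")
    except ValueError:
        caught_at = "submit"  # never reached: nothing is raised here

    try:
        # Synchronization point: the worker's exception is re-raised here.
        pending.result()
    except ValueError:
        caught_at = "result"

print(caught_at)
```

If MXNet behaves analogously, forcing synchronization inside the try block (for instance with `wait_to_read()` or `asnumpy()` on the decoded NDArray) should move the error to where it can be caught. Treat that as an assumption to verify against the installed MXNet version.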
@zhreshold
Member

@anirudhacharya Can you have a look at the error catching problem?

@ankkhedia
Contributor

@mxnet-label-bot : [Gluon, Bug]

@piyushghai
Contributor

@mqtlam Seems like the PR for the fix got merged recently.
Can you verify if your problem is resolved now ?

@vrakesh
Contributor

vrakesh commented Nov 26, 2018

@mqtlam A PR was merged fixing this issue, requesting a verification of resolution.

@vrakesh
Contributor

vrakesh commented Nov 26, 2018

@mxnet-label-bot add [Pending Requester Info]

@vrakesh
Contributor

vrakesh commented Dec 26, 2018

@mqtlam Hi, since the issue has been resolved in a PR, I am requesting committers to close this issue. Feel free to reopen it if the error persists on your side.

@sandeep-krishnamurthy Requesting to close this issue since it has been resolved in a PR

@Tigerwander

Is this issue still open?

@leezu
Contributor

leezu commented Dec 20, 2019

@Tigerwander this issue should be fixed. Please open a new issue

@Tigerwander

Hey, have you solved this problem? I am now running into the same problem and have not found a satisfactory answer.

@fucker007

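Yes, this problem makes me feel that MXNet is a bit cumbersome, like wrappers nested inside wrappers. You cannot get straight to the goal.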
