Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

example/image-classification/common/util.py get_gpus() exception #14199

Closed
tahouse opened this issue Feb 19, 2019 · 2 comments · Fixed by #14212
Closed

example/image-classification/common/util.py get_gpus() exception #14199

tahouse opened this issue Feb 19, 2019 · 2 comments · Fixed by #14212

Comments

@tahouse
Copy link

tahouse commented Feb 19, 2019

Note: Providing complete information in the most concise form is the best way to get help. This issue template serves as the checklist for essential information to most of the technical issues and bug reports. For non-technical issues and feature requests, feel free to present the information in what you believe is the best form.

For Q & A and discussion, please start a discussion thread at https://discuss.mxnet.io

Description

(Brief description of the problem in no more than 2 sentences.)
The try-except logic surrounding the call to nvidia-smi is not working for me. The exception type is not an OSError so it is not being caught. Instead, my exception type from missing nvidia-smi is "subprocess.CalledProcessError". Problem is fixed by just catching all exceptions, but you'll want to specify the mentioned exception.

Environment info (Required)

Running AWS Deep Learning AMI v21 (Ubuntu 16.04) on a c5.18xlarge. Conda pre-installed MXNet 1.3.1. No NVIDIA drivers present.

This issue has been raised before, but the try-except block does not catch this particular exception. See #8969

What to do:
1. Download the diagnosis script from https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/diagnose.py
2. Run the script using `python diagnose.py` and paste its output here.

ubuntu@ip-10-0-243-26:~/github/incubator-mxnet/example/image-classification$ python diagnose.py
----------Python Info----------
Version      : 3.6.5
Compiler     : GCC 7.2.0
Build        : ('default', 'Apr 29 2018 16:14:56')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 10.0.1
Directory    : /home/ubuntu/anaconda3/lib/python3.6/site-packages/pip
----------MXNet Info-----------
No MXNet installed.
----------System Info----------
Platform     : Linux-4.4.0-1075-aws-x86_64-with-debian-stretch-sid
system       : Linux
node         : ip-10-0-243-26
release      : 4.4.0-1075-aws
version      : #85-Ubuntu SMP Thu Jan 17 17:15:12 UTC 2019
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                72
On-line CPU(s) list:   0-71
Thread(s) per core:    2
Core(s) per socket:    18
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
Stepping:              3
CPU MHz:               3000.000
BogoMIPS:              6000.00
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                72
On-line CPU(s) list:   0-71
Thread(s) per core:    2
Core(s) per socket:    18
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
Stepping:              3
CPU MHz:               3000.000
BogoMIPS:              6000.00
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              25344K
NUMA node0 CPU(s):     0-17,36-53
NUMA node1 CPU(s):     18-35,54-71
Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f rdseed adx smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat pku
----------Network Test----------
Setting timeout: 10
Timing for MXNet: /~https://github.com/apache/incubator-mxnet, DNS: 0.0029 sec, LOAD: 0.4007 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.3205 sec, LOAD: 0.4716 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.5678 sec, LOAD: 0.3928 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0683 sec, LOAD: 0.1077 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0011 sec, LOAD: 0.3291 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0205 sec, LOAD: 0.1399 sec.

Package used (Python/R/Scala/Julia):
(I'm using ...)
Python

For Scala user, please provide:

  1. Java version: (java -version)
  2. Maven version: (mvn -version)
  3. Scala runtime if applicable: (scala -version)

For R user, please provide R sessionInfo():

Build info (Required if built from source)

Compiler (gcc/clang/mingw/visual studio):

MXNet commit hash:
(Paste the output of git rev-parse HEAD here.)

3a6fe22

Build config:
(Paste the content of config.mk, or the build command.)

Error Message:

(Paste the complete error message, including stack trace.)

ubuntu@ip-10-0-243-26:~/github/incubator-mxnet/example/image-classification$ python benchmark_score.py --network resnet-50                                                        
Traceback (most recent call last):
  File "benchmark_score.py", line 104, in <module>
    devs = [mx.gpu(0)] if len(get_gpus()) > 0 else []
  File "/home/ubuntu/github/incubator-mxnet/example/image-classification/common/util.py", line 53, in get_gpus
    re = subprocess.check_output(["nvidia-smi", "-L"], universal_newlines=True)
  File "/home/ubuntu/anaconda3/lib/python3.6/subprocess.py", line 336, in check_output
    **kwargs).stdout
  File "/home/ubuntu/anaconda3/lib/python3.6/subprocess.py", line 418, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['nvidia-smi', '-L']' returned non-zero exit status 9.

Minimum reproducible example

(If you are using your own code, please provide a short script that reproduces the error. Otherwise, please provide link to the existing example.)

see above using benchmark_score.py

Steps to reproduce

(Paste the commands you ran that produced the error.)

What have you tried to solve it?

  1. Modified line 54 of util.py to except:
@mxnet-label-bot
Copy link
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Example

@tahouse tahouse changed the title example/image-classification/common/util.py get_gpu exception example/image-classification/common/util.py get_gpus() exception Feb 19, 2019
@frankfliu
Copy link
Contributor

@mxnet-label-bot add [Bug, python]

frankfliu added a commit to frankfliu/incubator-mxnet that referenced this issue Feb 20, 2019
frankfliu added a commit to frankfliu/incubator-mxnet that referenced this issue Feb 21, 2019
frankfliu added a commit to frankfliu/incubator-mxnet that referenced this issue Feb 21, 2019
1. Added get_gpus() and get_gpu_memory() API to python binding.
2. Update example script to use proper API for getting gpu numbers.
frankfliu added a commit to frankfliu/incubator-mxnet that referenced this issue Feb 21, 2019
1. Added get_gpus() and get_gpu_memory() API to python binding.
2. Update example script to use proper API for getting gpu numbers.
frankfliu added a commit to frankfliu/incubator-mxnet that referenced this issue Feb 21, 2019
1. Added get_gpus() and get_gpu_memory() API to python binding.
2. Update example script to use proper API for getting gpu numbers.
frankfliu added a commit to frankfliu/incubator-mxnet that referenced this issue Feb 23, 2019
1. Added get_gpus() and get_gpu_memory() API to python binding.
2. Update example script to use proper API for getting gpu numbers.
frankfliu added a commit to frankfliu/incubator-mxnet that referenced this issue Feb 25, 2019
1. Added get_gpus() and get_gpu_memory() API to python binding.
2. Update example script to use proper API for getting gpu numbers.
frankfliu added a commit to frankfliu/incubator-mxnet that referenced this issue Mar 4, 2019
1. Added get_gpus() and get_gpu_memory() API to python binding.
2. Update example script to use proper API for getting gpu numbers.
wkcn pushed a commit that referenced this issue Mar 7, 2019
* Fixes #14199: use proper API get number of gpus.

1. Added get_gpus() and get_gpu_memory() API to python binding.
2. Update example script to use proper API for getting gpu numbers.

* retrigger CI
vdantu pushed a commit to vdantu/incubator-mxnet that referenced this issue Mar 31, 2019
…he#14212)

* Fixes apache#14199: use proper API get number of gpus.

1. Added get_gpus() and get_gpu_memory() API to python binding.
2. Update example script to use proper API for getting gpu numbers.

* retrigger CI
nswamy pushed a commit that referenced this issue Apr 5, 2019
* Fixes #14199: use proper API get number of gpus.

1. Added get_gpus() and get_gpu_memory() API to python binding.
2. Update example script to use proper API for getting gpu numbers.

* retrigger CI
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this issue Jun 23, 2019
…he#14212)

* Fixes apache#14199: use proper API get number of gpus.

1. Added get_gpus() and get_gpu_memory() API to python binding.
2. Update example script to use proper API for getting gpu numbers.

* retrigger CI
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants