
cuda8.0 can't train YOLOv3 Loss : nan #366

Closed
ShoufaChen opened this issue Oct 7, 2018 · 20 comments

Comments

@ShoufaChen

When I use CUDA 8.0 and run the YOLOv3 script on 2 GPUs, changing only the batch size to 32, I get nan loss:

INFO:root:[Epoch 0][Batch 99], LR: 5.99E-05, Speed: 31.597 samples/sec, ObjLoss=nan, BoxCenterLoss=nan, BoxScaleLoss=nan, ClassLoss=nan
INFO:root:[Epoch 0][Batch 199], LR: 1.20E-04, Speed: 32.253 samples/sec, ObjLoss=nan, BoxCenterLoss=nan, BoxScaleLoss=nan, ClassLoss=nan
INFO:root:[Epoch 0][Batch 299], LR: 1.81E-04, Speed: 31.947 samples/sec, ObjLoss=nan, BoxCenterLoss=nan, BoxScaleLoss=nan, ClassLoss=nan

When I comment out the net.hybridize() calls in train() and validate() as mentioned here, I can run it with a proper loss, but at the cost of training speed.
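
For reference, this is roughly the workaround I mean (a minimal sketch, not the exact code in train_yolo3.py):

from gluoncv import model_zoo

# Build the network as usual, but skip hybridization so the forward/backward
# pass runs in imperative mode (slower, but the loss stays finite for me).
net = model_zoo.get_model('yolo3_darknet53_voc', pretrained_base=True)
# net.hybridize()  # disabled: the hybridized run produces nan on my CUDA 8.0 setup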

Besides, if I use batch-size=4, the loss does not become nan even with net.hybridize(), so I guess it is not the smaller batch size that causes the nan.

CUDA 9.0 with batch-size=32 also works fine.

@zhreshold
Member

I don't think this is related to CUDA or hybridize. If you are getting random nan values, especially in the first iterations, it is probably related to the warm-up setting. Warm-up is a must-have for YOLOv3 models; you can increase this number /~https://github.com/dmlc/gluon-cv/blob/master/scripts/detection/yolo/train_yolo3.py#L156 to make training more stable.

In a new PR I have made it a command-line argument, so it will be more convenient to tweak.
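
For intuition, the warm-up just ramps the learning rate linearly from roughly zero to the base LR over the first iterations; a minimal sketch of the idea (illustrative only, not the exact code in train_yolo3.py, and batches_per_epoch is a made-up value):

base_lr = 0.001
warmup_epochs = 2            # increase this if you see nan in the first epochs
batches_per_epoch = 1000     # hypothetical; depends on dataset size and batch size
warmup_iters = warmup_epochs * batches_per_epoch

def warmup_lr(iteration):
    # Linearly ramp the LR from ~0 to base_lr over warmup_iters, then hold it.
    return base_lr * min(1.0, (iteration + 1) / warmup_iters)

# Inside the training loop, before each update, something like:
# trainer.set_learning_rate(warmup_lr(global_step))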

@zhreshold
Member

Let me know if the warm-up setting turns out to be useless on CUDA 8.

@ShoufaChen
Author

Thank you very much for your reply. I'll try tweaking the warm-up argument on CUDA 8 later, since the GPU is busy right now.

I am also quite curious why exactly the same code and settings run properly on CUDA 9.0.

@ShoufaChen
Author

ShoufaChen commented Oct 9, 2018

I just ran the same code and settings on another Ubuntu 16.04 machine, and it works properly on CUDA 8.0 without changing the warm-up argument 😕 😕
It seems the nan error occurs randomly.

Another small flaw: the linked yolo3_voc train script (the 416 one) is for the COCO dataset rather than Pascal VOC. 😃

@zhreshold
Member

Script fixed, thanks for spotting.

Every training run is randomized, so you will see some random behavior. I suggest increasing the warm-up epochs to reduce the chance of nan; otherwise it's not predictable.
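
For example, once you are on a version with the command-line flag, something along these lines (the GPU ids and batch size here are just placeholders):

python train_yolo3.py --gpus 0,1 --batch-size 32 --warmup-epochs 4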

@ShoufaChen
Author

OK, thank you very much.

@nicklhy

nicklhy commented Oct 9, 2018

@zhreshold, I still get nan loss from the very first log message after changing warmup_epochs from 2 to larger values like 5 or 10.

➜  mx-yolov3 git:(master) ✗ python3 train_yolo3.py --data-root /mnt/workspace/shared_datasets/COCO --dataset coco --gpus 0,1,2,3 --num-workers 10 --syncbn
loading annotations into memory...
Done (t=16.28s)
creating index...
index created!
loading annotations into memory...
Done (t=0.47s)
creating index...
index created!
INFO:root:Namespace(batch_size=64, data_root='/mnt/workspace/shared_datasets/COCO', data_shape=416, dataset='coco', epochs=200, gpus='0,1,2,3', log_interval=100, lr=0.001, lr_decay=0.1, lr_decay_epoch='160,180', momentum=0.9, network='darknet53', num_samples=117266, num_workers=10, resume='', save_interval=10, save_prefix='yolo3_darknet53_coco', seed=233, start_epoch=0, syncbn=True, val_interval=1, wd=0.0005)
INFO:root:Start training from [Epoch 0]
[14:10:42] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:109: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
INFO:root:[Epoch 0][Batch 99], LR: 2.70E-06, Speed: 84.074 samples/sec, ObjLoss=nan, BoxCenterLoss=nan, BoxScaleLoss=nan, ClassLoss=nan
INFO:root:[Epoch 0][Batch 199], LR: 5.43E-06, Speed: 58.834 samples/sec, ObjLoss=nan, BoxCenterLoss=nan, BoxScaleLoss=nan, ClassLoss=nan
INFO:root:[Epoch 0][Batch 299], LR: 8.16E-06, Speed: 87.941 samples/sec, ObjLoss=nan, BoxCenterLoss=nan, BoxScaleLoss=nan, ClassLoss=nan

(I added the --data-root argument to specify the dataset's root directory.)

@zhreshold
Member

@nicklhy Was that on CUDA 9 or later?

@nicklhy

nicklhy commented Oct 10, 2018

@zhreshold CUDA 8.0, cuDNN 7.0.5, Titan Xp.

I am wondering whether there is a specific requirement on the cuDNN version?

@ShoufaChen
Author

ShoufaChen commented Oct 10, 2018

@zhreshold @nicklhy
I tried CUDA 8.0 on a Titan Xp; both with cuDNN 7.1.3 and without cuDNN, the nan problem appears.

@zhreshold
Member

Due to the mixed environments and versions, I am not able to locate the problem. We also have several fixes to the YOLO network and training script that were merged to master recently.

Could you try the latest master and report whether any of your combinations still gets nan, even with --warmup-epochs 10 or so?

I'd appreciate it very much.

@ShoufaChen @nicklhy

@nicklhy

nicklhy commented Oct 11, 2018

@zhreshold, I just tried the newest GluonCV with mxnet_cu80-1.3.0.post0. The nan loss still occurs with --warmup-epochs 10. The training script is invoked as below:

python3 train_yolo3_new.py --data-root /mnt/workspace/shared_datasets/VOC --dataset voc --gpus 0,1,2,3 --num-workers 10 --syncbn --batch-size 32 --warmup-epochs 10

BTW, GPU memory usage seems to be much higher than with the old version. I cannot use the default batch size (64) with 4 Titan Xp GPUs now.

@ShoufaChen
Author

I am sorry, but I removed my mxnet-cu80 environment because there is little space left on my machine.

@kuonangzhe
Contributor

I tested YOLOv3 on my own dataset and also hit the nan problem. The default initial lr is 0.001; when I set it to half of that, 0.0005, training becomes normal with no nan. The learning rate might be the cause.
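
For example, since the script already exposes --lr (see the Namespace output above), I just pass the halved value on the command line (the GPU ids and batch size here are placeholders for my usual run):

python train_yolo3.py --gpus 0,1,2,3 --batch-size 32 --lr 0.0005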

@weiaicunzai

Same problem here: running the tutorial demo that trains YOLOv3 on the Pascal VOC dataset also produces nan loss, on a P40 with CUDA 8.0.

@weiaicunzai

Never mind, I changed the warm-up argument and now everything works fine.

@wshuail

wshuail commented Nov 16, 2018

Hey guys, make sure your GPU driver is compatible with your CUDA version.
This happened to me before I updated the driver to the latest version.
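
A quick way to check is nvidia-smi, which prints the installed driver version (and, on recent drivers, the highest CUDA version that driver supports):

nvidia-smi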

@ymm4739

ymm4739 commented Dec 26, 2018

I trained on my own dataset and also hit the nan problem, with an lr of 0.001 or larger on CUDA 9.0. When I changed the lr to 0.0005, it worked. Maybe the lr was too large?

@BackT0TheFuture

Driver Version: 410.48
CUDA 10
CUDNN 7.4.2.24
MXNET mxnet-cu100mkl

@zhreshold Same problem here.

@zhreshold
Member

Just an update: the root cause has been found and the fix has been merged to master: apache/mxnet#14209

If you use a pip package built from master (a nightly build), you hopefully won't hit the same problem any more.
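
Something along these lines should work (the exact package name depends on your CUDA version, and the availability of nightly wheels may change over time):

pip install --pre --upgrade mxnet-cu100mkl
pip install --upgrade git+/~https://github.com/dmlc/gluon-cv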
