cuda8.0 can't train YOLOv3, Loss: nan #366
I don't think this is related to CUDA or hybridize. If you are getting random NaN, especially in the beginning iterations, it's probably related to the warm-up setting. Warm-up is a must-have for YOLOv3 models; you can increase this number (/~https://github.com/dmlc/gluon-cv/blob/master/scripts/detection/yolo/train_yolo3.py#L156) to make it more stable. In a new PR I have made it a command-line argument, so it will be more convenient to tweak.
Let me know if the warm-up stuff is useless on CUDA 8.
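For reference, a rough sketch of what a linear warm-up looks like (this is an illustration, not the exact `train_yolo3.py` code; names like `warmup_iters` and the loop wiring are assumptions):

```python
base_lr = 0.001
warmup_iters = 1000  # raise this to make the first YOLOv3 iterations more stable

def warmup_lr(step):
    """Ramp the learning rate linearly from ~0 up to base_lr over warmup_iters."""
    if step < warmup_iters:
        return base_lr * (step + 1) / warmup_iters
    return base_lr

# Inside the training loop, before each trainer.step(batch_size):
#     trainer.set_learning_rate(warmup_lr(global_step))
```

The idea is that the first updates are taken with a very small learning rate, so the large initial YOLOv3 losses don't blow up to NaN.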
Thank you very much for your reply. I'll try to tweak the warm-up argument on CUDA 8 later, as the GPU is busy right now. And I am quite curious why exactly the same code and settings can run properly on CUDA 9.0.
I just ran the same code and settings on another Ubuntu 16.04 machine, and it works properly without changing the warm-up argument, on cuda8.0 😕 😕 Another small flaw is that the linked yolo3_voc training script (the 416 one) is for the COCO dataset rather than Pascal VOC. 😃
Script fixed, thanks for spotting. Every training process is randomized, so you will get random behavior. I suggest you increase the warm-up epochs to reduce the chance of NaN; otherwise it's not predictable.
OK, thank you very much. |
@zhreshold, still got NaN loss from the first log message after changing the warm-up setting. (I added the …)
@nicklhy Was that on CUDA 9 or later?
@zhreshold CUDA 8.0, cuDNN 7.0.5, Titan XP. I am wondering if there is a specific requirement on the cuDNN version?
@zhreshold @nicklhy
Due to the mixed envs and versions, I am not able to locate the problem. Also, we have several fixes to the YOLO network and training script which have been merged to master recently. Can you guys try the latest master and report whether any of the combinations still gets NaN, even with `net.hybridize()`? I'd appreciate it very much.
@zhreshold, just tried the newest gluoncv with mxnet_cu80-1.3.0.post0. The NaN loss still exists with `net.hybridize()`.
BTW, the GPU memory usage seems to be much larger than in the old version. I cannot use the default batch size (64) with 4 Titan XP GPUs now.
I am sorry, but I removed my mxnet-cu80 env because there was little space left on my computer.
I tested YOLOv3 on my own dataset, and there was also a NaN problem. I checked that the default initial lr was 0.001. When I set it to half of that, 0.0005, the training becomes normal with no NaN problem. So the learning rate might be the cause.
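For reference, a minimal sketch of creating a Gluon trainer with the smaller initial lr (not the actual `train_yolo3.py` code; the network and the optimizer settings other than the halved learning rate are placeholders):

```python
import mxnet as mx
from mxnet import gluon

net = gluon.nn.Dense(10)   # stand-in for the YOLOv3 network
net.initialize()

# Halve the default 0.001 learning rate, as described above; wd/momentum values
# here are illustrative, not taken from the training script.
trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': 0.0005, 'wd': 0.0005, 'momentum': 0.9})
```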
Same problem here: running the demo from the tutorial to train YOLOv3 on the pascal_voc dataset also raises a NaN loss, using a P40 with cuda8.0.
Never mind, I changed the warm-up args and then everything works fine.
Hey guys, make sure your GPU driver is compatible with your CUDA version.
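One quick way to sanity-check the driver/CUDA combination from Python (assuming MXNet >= 1.3, where `mx.context.num_gpus()` is available):

```python
import mxnet as mx

# If the driver and the CUDA toolkit don't match, this typically reports 0 GPUs
# (or the CUDA build of MXNet fails to load at import time).
print('MXNet version:', mx.__version__)
print('visible GPUs :', mx.context.num_gpus())
```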
I trained on my own dataset, and there was also a NaN problem. lr was 0.001 or bigger, with cuda9.0. When I changed lr to 0.0005, it worked. Maybe the lr was too big?
Driver Version: 410.48. @zhreshold same problem.
Just an update: the root cause has been found and the fix has been merged to master: apache/mxnet#14209. By using the master/nightly-built pip package, hopefully you won't meet the same problem any more.
When I use cuda8.0 and run the yolov3 script with 2 GPUs, changing only the batch-size to 32, I get a NaN loss.

When I comment out the `net.hybridize()` calls in `train()` and `validate()`, as mentioned here, I can run it with a proper loss, but at the cost of training speed. Besides, if I use `batch-size=4`, the loss won't become `nan` with `net.hybridize()`, so I guess it is not the smaller batch size that results in `nan`. cuda9.0 with `batch-size=32` is also OK.