the results of AMASS are very different trained on torch1.6 and torch1.7 #3

lingtengqiu · 2022-01-08T06:17:45Z

I am very confused about the AMASS results on subject 50002 trained on different versions of pytorch.

Your environment setup suggests installing pytorch==1.7.0. However, the results I get when training on pytorch 1.7 are significantly worse than the results I get when training on pytorch==1.6.0.

The training loss and evaluation IoU are shown below (blue: trained on torch==1.6.0; red: trained on 1.7.0):

Looking forward to your reply.

lingtengqiu · 2022-01-08T18:22:32Z

I found that the main issue is that the Broyden method cannot converge to cvg_thresh (1e-5). When training on an RTX3090 with pytorch 1.7.0, many of the residuals (xd_opt - xd_tgt) only converge to about 1e-4. I do not know how to solve it; could you give me some suggestions?
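
For readers unfamiliar with the solver, here is a minimal sketch of a Broyden root finder with a convergence threshold. It is not the repository's implementation: the function name `broyden_solve`, the toy linear `residual` (standing in for xd_opt - xd_tgt), and the identity initial inverse Jacobian are all assumptions made for illustration. It only shows how a point is declared converged when its residual norm falls below `cvg_thresh`.

```python
import math

def broyden_solve(g, x0, J_inv0, cvg_thresh=1e-5, max_steps=50):
    """Search for x with g(x) ~ 0 using the 'good' Broyden update of the inverse Jacobian.

    Returns (x, residual_norm, converged). Hypothetical helper, for illustration only.
    """
    n = len(x0)
    x = list(x0)
    J_inv = [row[:] for row in J_inv0]
    gx = g(x)
    for _ in range(max_steps):
        if math.sqrt(sum(v * v for v in gx)) < cvg_thresh:
            break
        # Quasi-Newton step: dx = -J_inv @ g(x)
        dx = [-sum(J_inv[i][j] * gx[j] for j in range(n)) for i in range(n)]
        x = [x[i] + dx[i] for i in range(n)]
        g_new = g(x)
        dg = [g_new[i] - gx[i] for i in range(n)]
        gx = g_new
        # Rank-one update: J_inv += (dx - J_inv dg) (dx^T J_inv) / (dx^T J_inv dg)
        J_dg = [sum(J_inv[i][j] * dg[j] for j in range(n)) for i in range(n)]
        dx_J = [sum(dx[i] * J_inv[i][j] for i in range(n)) for j in range(n)]
        denom = sum(dx[i] * J_dg[i] for i in range(n))
        if abs(denom) < 1e-30:
            break  # update is undefined; stop with the current estimate
        for i in range(n):
            for j in range(n):
                J_inv[i][j] += (dx[i] - J_dg[i]) * dx_J[j] / denom
    err = math.sqrt(sum(v * v for v in gx))
    return x, err, err < cvg_thresh

def residual(x):
    # Toy stand-in for (xd_opt - xd_tgt): a 2x2 linear system A x - b.
    return [2.0 * x[0] + 1.0 * x[1] - 1.0,
            1.0 * x[0] + 3.0 * x[1] - 2.0]

identity = [[1.0, 0.0], [0.0, 1.0]]
x, err, converged = broyden_solve(residual, [0.0, 0.0], identity)
print(x, err, converged)
```

A point whose residual stalls at around 1e-4, as described above, would come back with `converged == False` under the default `cvg_thresh=1e-5`.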

xuchen-ethz · 2022-01-14T13:59:07Z

Hi @lingtengqiu, I actually encountered the same problem on an RTX3080. After I switched to an RTX2080, the problem no longer occurred. It seems to be a low-level, possibly even hardware-dependent, problem. I am sorry that I don't really know how to solve this for RTX30xx.

The training curve for PyTorch1.6 seems good. Was it also trained on an RTX3090? And how does the result look qualitatively? If the qualitative result looks good, then I would suggest just using PyTorch1.6. If there are incompatibilities between the code and PyTorch1.6, I am happy to resolve them.

lingtengqiu · 2022-01-15T07:38:26Z

Thanks for your detailed answer.

The training curve for pytorch1.6 comes from models trained on an RTX2080 and an A100. I found that the main issue is that the Broyden method on the RTX-30 series cannot converge to 1e-5, which means the number of valid verts is small and many points cannot be trained. If I change cvg_thresh to 1e-3, the training curve on RTX-30 looks good. My guess is that the low-level numerical computation on the 3090 is implemented differently from other devices.
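
To illustrate how the threshold controls the number of valid points, here is a toy sketch. The residual values are made up, and `valid_fraction` is a hypothetical helper rather than a function from the repository; the only assumption carried over from the thread is that a point counts as valid when its residual norm is below `cvg_thresh`.

```python
def valid_fraction(residual_norms, cvg_thresh):
    """Fraction of points whose residual norm is below the convergence threshold."""
    valid = [r < cvg_thresh for r in residual_norms]
    return sum(valid) / len(valid)

# Made-up residual norms: two fully converged points, three that stall
# around 1e-4, and one that clearly failed.
residual_norms = [2e-6, 8e-6, 1e-4, 3e-4, 9e-5, 5e-2]

strict = valid_fraction(residual_norms, cvg_thresh=1e-5)  # only the first two pass
loose = valid_fraction(residual_norms, cvg_thresh=1e-3)   # the stalled points pass too
print(strict, loose)
```

With the strict threshold, the stalled points are excluded from training; with the loose one they are included, whether or not they are correct correspondences.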

xuchen-ethz · 2022-01-15T10:43:10Z

Thank you for the information. Then it's probably easiest to switch GPUs. Given that the A100 also works fine for you, I suspect that any GPU with a CUDA compute capability below 8.6 should be okay (see https://en.wikipedia.org/wiki/CUDA#:~:text=A100%2040GB%2C%20A30-,8.6,-GA102%2C%20GA104%2C%20GA106).

In my experience, setting a larger cvg_thresh (e.g. 1e-3) on RTX30xx can cause problems: you might get artifacts (noisy spikes) on the output mesh, because the loose cvg_thresh could introduce false correspondences.
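
A quick way to check a machine against this guess is to compare its compute capability with 8.6. The sketch below is only a convenience: `may_be_affected` and `current_gpu_capability` are hypothetical helpers, and the 8.6 cutoff is the suspicion stated above, not a confirmed boundary. `torch.cuda.get_device_capability()` returns a `(major, minor)` tuple for the current device.

```python
def may_be_affected(capability, first_affected=(8, 6)):
    """True if a (major, minor) compute capability is at or above the suspected cutoff."""
    return tuple(capability) >= tuple(first_affected)

def current_gpu_capability():
    """Return (major, minor) for the current CUDA device, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability()

# Compute capabilities of the GPUs mentioned in this thread.
known_gpus = {
    "RTX 2080": (7, 5),
    "A100": (8, 0),
    "RTX 3080": (8, 6),
    "RTX 3090": (8, 6),
}
for name, cap in known_gpus.items():
    print(name, cap, "suspect" if may_be_affected(cap) else "expected to be fine")
```

On a training machine, `may_be_affected(current_gpu_capability())` gives the same answer for the installed GPU, provided `current_gpu_capability()` does not return `None`.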

xuchen-ethz closed this as completed Jan 19, 2022

zhengyuf mentioned this issue Jun 17, 2022

Unstable Training? White imges as result. zhengyuf/IMavatar#3

Closed

taconite mentioned this issue Sep 4, 2024

Error results using monocular config on zju-mocap taconite/arah-release#37

Open
