MXNet TensorRT generates wrong results #13113
Comments
@mxnet-label-bot [Gluon, model zoo]
Interesting. Thanks for the report, I'll have to dig into the issue.
This is due to the batch normalization layer.
Can you give some more details on why this is due to batch normalization? I understand that TensorRT makes optimizations that can impact floating-point precision, but that should not change the final output.
Still working on this. Here's my output when I run it with the latest versions of TensorRT and ONNXtoTensorRT: https://gist.github.com/KellenSunderland/7e486514fa02388619bcf4ee1614c18b
@NRauschmayr So it looks like I've got a version that seems to be working as expected. I'll try to work my way backwards to reproduce the error you're seeing. Would you be able to let me know which GPU you're using when you hit this error?
Just thinking a bit more about batch normalization: after optimization, I wonder if we even run batch normalization at all? Its parameters are fixed values at inference time, so I would assume TRT scales the relevant params at engine build time and then doesn't run batch normalization at runtime at all.
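For reference, the folding described above can be illustrated with a short NumPy sketch (illustrative only, not code from this thread; all variable names are made up): at inference, batch norm is a fixed per-channel affine transform, so its scale and shift can be absorbed into the preceding convolution's weights and bias when the engine is built.

```python
import numpy as np

# At inference, batch norm computes y = gamma * (x - mean) / sqrt(var + eps) + beta,
# a fixed per-channel affine transform. An engine builder can fold it into the
# preceding convolution by rescaling the conv weights and adjusting the bias.
rng = np.random.default_rng(0)
c_in, c_out, k = 3, 8, 3
W = rng.standard_normal((c_out, c_in, k, k)).astype(np.float32)   # conv weights
b = rng.standard_normal(c_out).astype(np.float32)                 # conv bias
gamma = rng.standard_normal(c_out).astype(np.float32)             # BN scale
beta = rng.standard_normal(c_out).astype(np.float32)              # BN shift
mean = rng.standard_normal(c_out).astype(np.float32)              # BN running mean
var = rng.random(c_out).astype(np.float32)                        # BN running variance
eps = 1e-5

scale = gamma / np.sqrt(var + eps)            # per-output-channel scale factor
W_folded = W * scale[:, None, None, None]     # fold the scale into the conv weights
b_folded = (b - mean) * scale + beta          # fold the shift into the conv bias

# conv(x, W_folded, b_folded) is then equivalent to BN(conv(x, W, b)),
# so no separate batch norm kernel needs to run at inference time.
```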
I am closing this issue because of /~https://github.com/apache/incubator-mxnet/pull/13310/files. Thanks Kellen for your help and for fixing this so quickly! To give a brief summary of some offline discussions between me and Kellen: I initially reported that I got values on the order of 10^30 or NaN. It turns out that I was running on a GPU model that was not supported. When switching to a p3 instance, I got more reasonable-looking results, but the feature vectors would still differ quite a bit; in a very few cases this difference would lead to a mispredicted class. I tested the update on ResNet and DenseNet and I am getting the correct results now.
Following up from this thread: https://discuss.mxnet.io/t/mxnet-tensorrt-result-different/2139
I tested it myself, and MXNet TensorRT seems to work fine for VGG16 and other models, but not for any of the ResNet models from the Gluon model zoo.
I used the mxnet/tensorrt Docker image.
Here is an example to reproduce the problem:
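A minimal repro sketch along these lines, assuming the MXNet 1.3-era contrib TensorRT API (`mx.contrib.tensorrt.tensorrt_bind`) that shipped in the mxnet/tensorrt image at the time; the model choice (resnet18_v2) and the script details are illustrative rather than the reporter's exact code:

```python
import os
import numpy as np
import mxnet as mx
from mxnet.gluon.model_zoo import vision

ctx = mx.gpu(0)
batch_shape = (1, 3, 224, 224)
x = mx.nd.random.uniform(shape=batch_shape, ctx=ctx)

# Export a pretrained ResNet from the Gluon model zoo to symbol + params.
net = vision.resnet18_v2(pretrained=True)
net.hybridize()
net(mx.nd.zeros(batch_shape))
net.export('resnet18_v2')
sym, arg_params, aux_params = mx.model.load_checkpoint('resnet18_v2', 0)

# Baseline: plain MXNet executor.
os.environ['MXNET_USE_TENSORRT'] = '0'
executor = sym.simple_bind(ctx=ctx, data=batch_shape, grad_req='null')
executor.copy_params_from(arg_params, aux_params)
y_mx = executor.forward(is_train=False, data=x)[0].asnumpy()

# TensorRT-backed executor (1.3-era contrib API; later releases changed this integration).
os.environ['MXNET_USE_TENSORRT'] = '1'
all_params = {k: v.as_in_context(ctx) for k, v in arg_params.items()}
all_params.update({k: v.as_in_context(ctx) for k, v in aux_params.items()})
trt_executor = mx.contrib.tensorrt.tensorrt_bind(sym, ctx=ctx, all_params=all_params,
                                                 data=batch_shape, grad_req='null',
                                                 force_rebind=True)
y_trt = trt_executor.forward(is_train=False, data=x)[0].asnumpy()

# The two outputs should agree closely; in the reported bug they do not.
print('max abs diff:', np.abs(y_mx - y_trt).max())
```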
The TensorRT version delivers either NaN or values on the order of 10^+30.