This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

MXNet TensorRT generates wrong results #13113

Closed
NRauschmayr opened this issue Nov 5, 2018 · 9 comments · Fixed by #13310

Comments

@NRauschmayr
Contributor

Following up from this thread: https://discuss.mxnet.io/t/mxnet-tensorrt-result-different/2139

I tested it myself: MXNet TensorRT seems to work fine for VGG16 and other models, but not for any of the ResNet models from the Gluon model zoo.

I used the mxnet/tensorrt Docker image.

Here is an example to reproduce the problem:

from mxnet.gluon.model_zoo import vision
import time
import os
import mxnet as mx
import numpy as np
from collections import namedtuple

batch_shape = (1, 3, 224, 224)
def get_image(url, show=False):
    fname = mx.test_utils.download(url, fname=url.split('/')[-1].split('?')[0])
    img = mx.image.imread(fname)
    img = mx.image.imresize(img, 224, 224) # resize
    img = img.transpose((2, 0, 1)) # Channel first
    img = img.expand_dims(axis=0) # batchify
    return img/255.0
url = 'https://github.com/dmlc/web-data/blob/master/mxnet/doc/tutorials/python/predict_image/cat.jpg?raw=true'

input_data = get_image(url, show=True)
#resnet18 = vision.vgg16(pretrained=True)
resnet18 = vision.resnet18_v2(pretrained=True)
resnet18.hybridize()
resnet18.forward(mx.nd.zeros(batch_shape))
resnet18.export('resnet18_v2')
sym, arg_params, aux_params = mx.model.load_checkpoint('resnet18_v2', 0)

# Execute with MXNet
os.environ['MXNET_USE_TENSORRT'] = '0'
executor = sym.simple_bind(ctx=mx.gpu(0), data=batch_shape, grad_req='null', force_rebind=True)
executor.copy_params_from(arg_params, aux_params)
input_data = input_data.as_in_context(mx.gpu())
y= executor.forward(is_train=False, data=input_data)
print (y[0].asnumpy())

# Execute with TensorRT
os.environ['MXNET_USE_TENSORRT'] = '1'
arg_params.update(aux_params)
all_params = dict([(k, v.as_in_context(mx.gpu(0))) for k, v in arg_params.items()])
executor = mx.contrib.tensorrt.tensorrt_bind(sym, ctx=mx.gpu(0), all_params=all_params,
                                             data=batch_shape, grad_req='null', force_rebind=True)

y = executor.forward(is_train=False, data=input_data)
print (y[0].asnumpy())

The TensorRT version delivers either NaN or values on the order of 1e+30.
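To make the mismatch easier to quantify than by reading two printed arrays, the two outputs can be compared numerically. The following is only a sketch: `compare_outputs` and its tolerance defaults are hypothetical names and values chosen for illustration, not part of MXNet or TensorRT.

```python
import numpy as np


def compare_outputs(reference, candidate, rtol=1e-3, atol=1e-3):
    """Summarize how far `candidate` deviates from `reference`.

    Hypothetical helper: the tolerances are illustrative assumptions,
    not thresholds recommended by MXNet or TensorRT.
    """
    reference = np.asarray(reference, dtype=np.float64)
    candidate = np.asarray(candidate, dtype=np.float64)

    # NaN or inf anywhere in the candidate output is reported separately.
    all_finite = bool(np.isfinite(candidate).all())
    if all_finite:
        max_abs_diff = float(np.max(np.abs(reference - candidate)))
    else:
        max_abs_diff = float("inf")

    # Compare the predicted class (argmax over the last axis) per sample.
    same_top1 = bool(
        np.array_equal(
            np.argmax(reference, axis=-1), np.argmax(candidate, axis=-1)
        )
    )

    return {
        "all_finite": all_finite,
        "max_abs_diff": max_abs_diff,
        "same_top1": same_top1,
        "allclose": bool(np.allclose(reference, candidate, rtol=rtol, atol=atol)),
    }
```

In the script above, this would mean keeping the first result (for example as `y_mxnet = y[0].asnumpy()`, a name introduced here) before rebinding with TensorRT, then calling `compare_outputs(y_mxnet, y[0].asnumpy())`.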

@frankfliu
Contributor

@mxnet-label-bot [Gluon, model zoo]

@marcoabreu
Contributor

@KellenSunderland

@KellenSunderland
Contributor

Interesting. Thanks for the report, I'll have to dig in to the issue.

@hariag
Contributor

hariag commented Nov 7, 2018

This is due to the batch normalization layer.

@NRauschmayr
Contributor Author

Can you give some more details on why this is due to batch normalization? I understand that TensorRT makes optimizations that can affect floating-point precision, but they should not change the final output.

@KellenSunderland
Contributor

Still working on this. Here's my output when I run it with the latest versions of TensorRT and ONNXtoTensorRT: https://gist.github.com/KellenSunderland/7e486514fa02388619bcf4ee1614c18b

@KellenSunderland
Contributor

@NRauschmayr So it looks like I've got a version that seems to be working as expected. I'll try to work my way backwards to reproduce the error you're seeing. Could you let me know which GPU you're using when you get this error?

@KellenSunderland
Contributor

Just thinking a bit more about batch normalization. After optimization, I wonder if we even run batch normalization at all? At inference time its parameters are fixed values, so I would assume TRT scales the relevant params at engine build time and then doesn't run batch normalization at runtime.
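For readers following the batch normalization discussion, here is a small sketch of the arithmetic behind that assumption: with fixed statistics, inference-mode batch normalization reduces to a per-channel scale and shift that can be computed once ahead of time. This only illustrates the math; it is not a description of what TensorRT does internally, and `fold_batch_norm`, `batch_norm_inference`, and the `eps` default are names and values assumed for the example.

```python
import numpy as np


def batch_norm_inference(x, gamma, beta, mean, var, eps=1e-5):
    """Reference inference-mode batch norm over an NCHW tensor."""
    shape = (1, -1, 1, 1)  # broadcast per-channel params over N, H, W
    normalized = (x - mean.reshape(shape)) / np.sqrt(var.reshape(shape) + eps)
    return gamma.reshape(shape) * normalized + beta.reshape(shape)


def fold_batch_norm(gamma, beta, mean, var, eps=1e-5):
    """Precompute the per-channel affine transform equivalent to batch norm.

    y = gamma * (x - mean) / sqrt(var + eps) + beta
      = scale * x + shift
    """
    scale = gamma / np.sqrt(var + eps)
    shift = beta - mean * scale
    return scale, shift


# Illustrative data only: random input and parameters for 3 channels.
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 4, 4))
gamma = rng.standard_normal(3)
beta = rng.standard_normal(3)
mean = rng.standard_normal(3)
var = rng.random(3) + 0.5  # keep variances positive

scale, shift = fold_batch_norm(gamma, beta, mean, var)
folded = x * scale.reshape(1, -1, 1, 1) + shift.reshape(1, -1, 1, 1)
reference = batch_norm_inference(x, gamma, beta, mean, var)
```

Under these assumptions `folded` and `reference` agree up to floating-point rounding, which is why a folded layer should not change the final prediction by itself.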

@NRauschmayr
Contributor Author

I am closing this issue because of https://github.com/apache/incubator-mxnet/pull/13310/files. Thanks Kellen for your help and for fixing this so quickly!

Just to give a brief summary of some offline discussions between me and Kellen: initially I reported that I got NaN or values on the order of 1e+30. It turns out that I was running on a GPU model that was not supported. When I switched to a p3 instance, I got more reasonable-looking results, but the feature vectors still differed quite a bit. In a few cases this difference led to a mispredicted class. I tested the update on ResNet and DenseNet, and I am getting the correct results now.


5 participants