MXNet TensorRT generates wrong results #13113
Comments
@mxnet-label-bot [Gluon, model zoo]
Interesting. Thanks for the report, I'll have to dig into the issue.
This is due to the batch normalization layer.
Can you give some more details on why this is due to batch normalization? I understand that TensorRT makes optimizations that can impact floating-point precision, but that should not change the final output.
Still working on this. Here's my output when I run it with the latest versions of TensorRT and ONNXtoTensorRT: https://gist.github.com/KellenSunderland/7e486514fa02388619bcf4ee1614c18b
@NRauschmayr So it looks like I've got a version that seems to be working as expected. I'll try to work my way backwards to reproduce the error you're seeing. Would you be able to let me know which GPU you're using when you hit this error?
Just thinking a bit more about batch normalization: after optimization, I wonder if we even run batch normalization at all? Its parameters are fixed values at inference time, so I would assume TRT scales the relevant params at engine build time and then doesn't run batch normalization at runtime at all.
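For reference, the folding described above can be illustrated with a short NumPy sketch (illustrative only, not code from this thread; all variable names are made up): at inference, batch norm is a fixed per-channel affine transform, so its scale and shift can be absorbed into the preceding convolution's weights and bias when the engine is built.

```python
import numpy as np

# At inference, batch norm computes y = gamma * (x - mean) / sqrt(var + eps) + beta,
# a fixed per-channel affine transform. An engine builder can fold it into the
# preceding convolution by rescaling the conv weights and adjusting the bias.
rng = np.random.default_rng(0)
c_in, c_out, k = 3, 8, 3
W = rng.standard_normal((c_out, c_in, k, k)).astype(np.float32)   # conv weights
b = rng.standard_normal(c_out).astype(np.float32)                 # conv bias
gamma = rng.standard_normal(c_out).astype(np.float32)             # BN scale
beta = rng.standard_normal(c_out).astype(np.float32)              # BN shift
mean = rng.standard_normal(c_out).astype(np.float32)              # BN running mean
var = rng.random(c_out).astype(np.float32)                        # BN running variance
eps = 1e-5

scale = gamma / np.sqrt(var + eps)            # per-output-channel scale factor
W_folded = W * scale[:, None, None, None]     # fold the scale into the conv weights
b_folded = (b - mean) * scale + beta          # fold the shift into the conv bias

# conv(x, W_folded, b_folded) is then equivalent to BN(conv(x, W, b)),
# so no separate batch norm kernel needs to run at inference time.
```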
I am closing this issue because of /~https://github.com/apache/incubator-mxnet/pull/13310/files. Thanks Kellen for your help and for fixing this so quickly! To give a brief summary of some offline discussions between me and Kellen: I initially reported that I got values on the order of 10^30 or NaN. It turns out that I was running on a GPU model that was not supported. When switching to a p3 instance, I got more reasonable-looking results, but the feature vectors would still differ quite a bit; in a very few cases this difference would lead to a mispredicted class. I tested the update on ResNet and DenseNet and I am getting the correct results now.
Following up from this thread: https://discuss.mxnet.io/t/mxnet-tensorrt-result-different/2139
I tested it myself, and MXNet TensorRT seems to work fine for VGG16 and other models, but not for any of the ResNet models from the Gluon model zoo.
I used the mxnet/tensorrt Docker image.
Here is an example to reproduce the problem:
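A minimal repro sketch along these lines, assuming the MXNet 1.3-era contrib TensorRT API (`mx.contrib.tensorrt.tensorrt_bind`) that shipped in the mxnet/tensorrt image at the time; the model choice (resnet18_v2) and the script details are illustrative rather than the reporter's exact code:

```python
import os
import numpy as np
import mxnet as mx
from mxnet.gluon.model_zoo import vision

ctx = mx.gpu(0)
batch_shape = (1, 3, 224, 224)
x = mx.nd.random.uniform(shape=batch_shape, ctx=ctx)

# Export a pretrained ResNet from the Gluon model zoo to symbol + params.
net = vision.resnet18_v2(pretrained=True)
net.hybridize()
net(mx.nd.zeros(batch_shape))
net.export('resnet18_v2')
sym, arg_params, aux_params = mx.model.load_checkpoint('resnet18_v2', 0)

# Baseline: plain MXNet executor.
os.environ['MXNET_USE_TENSORRT'] = '0'
executor = sym.simple_bind(ctx=ctx, data=batch_shape, grad_req='null')
executor.copy_params_from(arg_params, aux_params)
y_mx = executor.forward(is_train=False, data=x)[0].asnumpy()

# TensorRT-backed executor (1.3-era contrib API; later releases changed this integration).
os.environ['MXNET_USE_TENSORRT'] = '1'
all_params = {k: v.as_in_context(ctx) for k, v in arg_params.items()}
all_params.update({k: v.as_in_context(ctx) for k, v in aux_params.items()})
trt_executor = mx.contrib.tensorrt.tensorrt_bind(sym, ctx=ctx, all_params=all_params,
                                                 data=batch_shape, grad_req='null',
                                                 force_rebind=True)
y_trt = trt_executor.forward(is_train=False, data=x)[0].asnumpy()

# The two outputs should agree closely; in the reported bug they do not.
print('max abs diff:', np.abs(y_mx - y_trt).max())
```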
The TensorRT version delivers either NaN or values on the order of 10^+30.