This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Why FP16 training speed is too slow on Tesla T4 in Gluon? #13709

Closed
PistonY opened this issue Dec 21, 2018 · 10 comments

Comments


PistonY commented Dec 21, 2018

Hi, I tried to train with FP16 on a Tesla T4, but its speed is slower than a GTX 1070 with FP32.
Could you please give me some suggestions to solve this?
The T4 runs mxnet-cu100mkl and the GTX 1070 runs mxnet-cu90mkl.
Here are my script and logs:
code: https://gist.github.com/PistonY/8dfcefdc46b747afd4d18b37f9a18665
logs:
T4 log:

INFO:root:Iter 390. Loss: 2.14372, Train RMSE 0.23653.Time 00:05:47.lr 0.019948717948717953
INFO:root:Test Loss: 1.935017, Test acc 0.327200.
INFO:root:Iter 780. Loss: 1.89404, Train RMSE 0.22111.Time 00:05:52.lr 0.03994871794871795
INFO:root:Test Loss: 1.460350, Test acc 0.473100.
INFO:root:Iter 1170. Loss: 1.72982, Train RMSE 0.20837.Time 00:05:49.lr 0.05994871794871795
INFO:root:Test Loss: 1.288763, Test acc 0.559500.
INFO:root:Iter 1560. Loss: 1.57620, Train RMSE 0.19388.Time 00:05:48.lr 0.07994871794871795
INFO:root:Test Loss: 1.856537, Test acc 0.530100.

GTX 1070 log:

INFO:root:Epoch 0, Iter 390. Loss: 2.12699, Train RMSE 0.23722.Time 00:03:00.lr 0.019948717948717953
INFO:root:Test Loss: 1.746372, Test acc 0.361800.
PistonY changed the title from "Why FP16 training speed is too slow on Tesla T4?" to "Why FP16 training speed is too slow on Tesla T4 in Gluon?" Dec 21, 2018
@eric-haibin-lin
Member


PistonY commented Jan 2, 2019

@eric-haibin-lin Hi, I tested it with the MXNet profiler; here are my script and the results. It looks good.
But when I actually use it, it's still slow.
Here is the script.
I print the time around this block:

st_t = time()
with autograd.record():
    output = train_net(trans.astype(dtype, copy=False))
    loss = Loss(output, labels.astype(dtype, copy=False))
loss.backward()
trainer.step(batch_size)
end_t = time()
print(end_t - st_t)

With FP16:

float16
Start training with mixup.
[15:23:12] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
2.0626039505004883
0.26385951042175293
0.2520616054534912
0.2604227066040039
0.25570082664489746
0.26578330993652344
0.25952720642089844
0.2606792449951172
0.2637202739715576
0.3433563709259033
0.2613410949707031

With FP32:

float32
Start training with mixup.
[15:36:23] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
0.8311986923217773
0.20481181144714355
0.278350830078125
0.18038034439086914
0.21913409233093262
0.2587764263153076
0.17470550537109375
0.21522021293640137
0.2749063968658447
0.2962362766265869
0.2280411720275879
0.37300872802734375
0.18066024780273438
0.28769636154174805
0.2858397960662842
0.28676462173461914
0.24347591400146484
0.23549628257751465
0.29531288146972656
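
One caveat about this way of timing: MXNet executes operators asynchronously, so the difference between two time() calls largely measures how long it takes to enqueue the work, not how long the GPU takes to finish it. Below is a minimal synchronized sketch of the same loop body; it assumes the train_net, trans, labels, Loss, trainer, dtype, and batch_size defined in the script above.

from time import time
import mxnet as mx
from mxnet import autograd

st_t = time()
with autograd.record():
    output = train_net(trans.astype(dtype, copy=False))
    loss = Loss(output, labels.astype(dtype, copy=False))
loss.backward()
trainer.step(batch_size)
mx.nd.waitall()      # block until all queued GPU work has actually finished
end_t = time()
print(end_t - st_t)  # now reflects compute time, not just enqueue time

With the barrier in place, the FP16 and FP32 numbers become directly comparable.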


PistonY commented Jan 2, 2019

And if you need to, you can just run the script.
It only needs gluoncv (--pre).


PistonY commented Jan 14, 2019

@eric-haibin-lin Hello?
Did you test FP16 on any NVIDIA Turing architecture device?
If you did, could you please give me a test script?

PistonY closed this as completed Jan 16, 2019
@eric-haibin-lin
Member

Sorry, I've been busy with a submission deadline. Did you test with a fixed input to see whether it's bottlenecked by data loading?


PistonY commented Jan 29, 2019

OK, I'll test it later.

PistonY reopened this Jan 29, 2019

PistonY commented Jan 30, 2019

I tried using a fixed input. FP32 works well, but FP16 runs out of memory.
This is my script:

from mxnet import nd, autograd
from mxnet import gluon
from mxnet.gluon import loss as gloss
from gluoncv.model_zoo import *
import mxnet as mx
import time

ctx = mx.gpu(0)

data = nd.random.normal(shape=(64, 3, 224, 224), ctx=ctx)
label = nd.random.randint(low=0, high=1, shape=(64, 1), ctx=ctx)

net = resnet101_v2()
net.hybridize()
net.initialize(ctx=ctx)

net(data)

test_num = 500
dtype = 'float16'    # float32 or float16
if dtype != 'float32':
    net.cast(dtype)
Loss = gloss.SoftmaxCrossEntropyLoss()
trainer = gluon.Trainer(net.collect_params(),
                        'nag', {'learning_rate': 0.1, 'momentum': 0.9,
                                'multi_precision': True  # when fp16 is enabled
                                })
sta = time.time()
for _ in range(test_num):
    with autograd.record():
        output = net(data.astype(dtype, copy=False))
        loss = Loss(output, label.astype(dtype, copy=False))
    loss.backward()
    trainer.step(128)
end = time.time()
print(end - sta)

The MXNet version is 1.5.0 (--pre).
When training with FP32, it uses 9921 MB of memory and takes 75 s.
But when I tested with FP16, memory usage started around 7000 MB and kept growing until it ran out of memory.
I don't know why; it looks like the memory isn't being freed.
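
One thing worth checking (this is an assumption about the cause, not something verified on this exact script): because the MXNet engine is asynchronous, the Python loop can enqueue iterations much faster than the GPU completes them, and every queued iteration keeps its arrays alive, which can look exactly like memory that is never freed. A minimal sketch that adds one synchronization point per iteration to keep the backlog bounded, reusing the net, data, label, Loss, trainer, dtype, and test_num from the script above:

sta = time.time()
for _ in range(test_num):
    with autograd.record():
        output = net(data.astype(dtype, copy=False))
        loss = Loss(output, label.astype(dtype, copy=False))
    loss.backward()
    trainer.step(128)
    loss.wait_to_read()  # synchronize once per iteration so the queue (and its memory) stays bounded
mx.nd.waitall()          # include any remaining GPU work in the measurement
end = time.time()
print(end - sta)

If memory still grows without bound after this change, the problem is likely elsewhere.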


PistonY commented Jan 30, 2019

And I tried running only the forward pass:

sta = time.time()
for _ in range(test_num):
    with autograd.record():
        output = net(data.astype(dtype, copy=False))
        # loss = Loss(output, lable.astype(dtype, copy=False))
    # loss.backward()
    # trainer.step(128)
end = time.time()

FP32 costs 7.83 s.
FP16 costs 18.9 s.
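
The same timing caveat applies to these forward-only numbers: output is never read, so nothing in the loop forces the GPU to finish before end = time.time(). A synchronized variant, under the same assumptions as above:

sta = time.time()
for _ in range(test_num):
    with autograd.record():
        output = net(data.astype(dtype, copy=False))
mx.nd.waitall()  # wait for every forward pass to finish before stopping the clock
end = time.time()
print(end - sta)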

@eric-haibin-lin
Member

Were you using self-attention blocks with the batch_dot operator? There was an improvement for FP16 in #13716.
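
For context, batch_dot is MXNet's batched matrix multiply, which self-attention blocks typically use for the query-key and attention-value products. A minimal, self-contained FP16 example (the shapes here are arbitrary, and a GPU at mx.gpu(0) is assumed):

import mxnet as mx

ctx = mx.gpu(0)
# attention-style scores: (batch, M, K) x (batch, K, N) -> (batch, M, N)
q = mx.nd.random.normal(shape=(32, 64, 128), ctx=ctx).astype('float16')
k = mx.nd.random.normal(shape=(32, 128, 64), ctx=ctx).astype('float16')
scores = mx.nd.batch_dot(q, k)
scores.wait_to_read()
print(scores.shape, scores.dtype)  # (32, 64, 64) float16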

PistonY closed this as completed Feb 18, 2019

PistonY commented Feb 18, 2019

Thanks, it worked.
