This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Why FP16 training speed is too slow on Tesla T4 in Gluon? #13709

Closed
PistonY opened this issue Dec 21, 2018 · 10 comments

Comments


PistonY commented Dec 21, 2018

Hi, I tried to train with FP16 on a Tesla T4, but its speed is slower than a GTX 1070 with FP32.
Could you please give me some suggestions to solve this?
The T4 runs mxnet-cu100mkl and the GTX 1070 runs mxnet-cu90mkl.
Here are my script and logs:
code: https://gist.github.com/PistonY/8dfcefdc46b747afd4d18b37f9a18665
logs:
T4 log:

INFO:root:Iter 390. Loss: 2.14372, Train RMSE 0.23653.Time 00:05:47.lr 0.019948717948717953
INFO:root:Test Loss: 1.935017, Test acc 0.327200.
INFO:root:Iter 780. Loss: 1.89404, Train RMSE 0.22111.Time 00:05:52.lr 0.03994871794871795
INFO:root:Test Loss: 1.460350, Test acc 0.473100.
INFO:root:Iter 1170. Loss: 1.72982, Train RMSE 0.20837.Time 00:05:49.lr 0.05994871794871795
INFO:root:Test Loss: 1.288763, Test acc 0.559500.
INFO:root:Iter 1560. Loss: 1.57620, Train RMSE 0.19388.Time 00:05:48.lr 0.07994871794871795
INFO:root:Test Loss: 1.856537, Test acc 0.530100.

GTX 1070 log:

INFO:root:Epoch 0, Iter 390. Loss: 2.12699, Train RMSE 0.23722.Time 00:03:00.lr 0.019948717948717953
INFO:root:Test Loss: 1.746372, Test acc 0.361800.
PistonY changed the title from "Why FP16 training speed is too slow on Tesla T4?" to "Why FP16 training speed is too slow on Tesla T4 in Gluon?" Dec 21, 2018
@eric-haibin-lin
Member


PistonY commented Jan 2, 2019

@eric-haibin-lin Hi, I tested it with the MXNet profiler; here are my script and the results. It looks good.
But when I actually use it, it's still slow.
Here is the script.
I print the time around this block:

st_t = time()
with autograd.record():
    output = train_net(trans.astype(dtype, copy=False))
    loss = Loss(output, labels.astype(dtype, copy=False))
loss.backward()
trainer.step(batch_size)
end_t = time()
print(end_t - st_t)

With FP16:

float16
Start training with mixup.
[15:23:12] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
2.0626039505004883
0.26385951042175293
0.2520616054534912
0.2604227066040039
0.25570082664489746
0.26578330993652344
0.25952720642089844
0.2606792449951172
0.2637202739715576
0.3433563709259033
0.2613410949707031

With FP32:

float32
Start training with mixup.
[15:36:23] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
0.8311986923217773
0.20481181144714355
0.278350830078125
0.18038034439086914
0.21913409233093262
0.2587764263153076
0.17470550537109375
0.21522021293640137
0.2749063968658447
0.2962362766265869
0.2280411720275879
0.37300872802734375
0.18066024780273438
0.28769636154174805
0.2858397960662842
0.28676462173461914
0.24347591400146484
0.23549628257751465
0.29531288146972656
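
One caveat about this way of timing: MXNet executes operators asynchronously, so the difference between two time() calls largely measures how long it takes to enqueue the work, not how long the GPU takes to finish it. Below is a minimal synchronized sketch of the same loop body; it assumes the train_net, trans, labels, Loss, trainer, dtype, and batch_size defined in the script above.

from time import time
import mxnet as mx
from mxnet import autograd

st_t = time()
with autograd.record():
    output = train_net(trans.astype(dtype, copy=False))
    loss = Loss(output, labels.astype(dtype, copy=False))
loss.backward()
trainer.step(batch_size)
mx.nd.waitall()      # block until all queued GPU work has actually finished
end_t = time()
print(end_t - st_t)  # now reflects compute time, not just enqueue time

With the barrier in place, the FP16 and FP32 numbers become directly comparable.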


PistonY commented Jan 2, 2019

And if you need to, you can just run the script.
It only needs gluoncv (--pre).


PistonY commented Jan 14, 2019

@eric-haibin-lin Hello?
Did you test FP16 on any NVIDIA Turing architecture device?
If you did, could you please give me a test script?

PistonY closed this as completed Jan 16, 2019
@eric-haibin-lin
Member

Sorry, I've been busy with a submission deadline. Did you test with a fixed input to see whether it's bottlenecked by data loading?


PistonY commented Jan 29, 2019

OK, I'll test it later.

PistonY reopened this Jan 29, 2019

PistonY commented Jan 30, 2019

I tried using a fixed input. FP32 works well, but FP16 runs out of memory.
This is my script:

from mxnet import nd, autograd
from mxnet import gluon
from mxnet.gluon import loss as gloss
from gluoncv.model_zoo import *
import mxnet as mx
import time

ctx = mx.gpu(0)

data = nd.random.normal(shape=(64, 3, 224, 224), ctx=ctx)
label = nd.random.randint(low=0, high=1, shape=(64, 1), ctx=ctx)

net = resnet101_v2()
net.hybridize()
net.initialize(ctx=ctx)

net(data)

test_num = 500
dtype = 'float16'    # float32 or float16
if dtype != 'float32':
    net.cast(dtype)
Loss = gloss.SoftmaxCrossEntropyLoss()
trainer = gluon.Trainer(net.collect_params(),
                        'nag', {'learning_rate': 0.1, 'momentum': 0.9,
                                'multi_precision': True  # when fp16 is enabled
                                })
sta = time.time()
for _ in range(test_num):
    with autograd.record():
        output = net(data.astype(dtype, copy=False))
        loss = Loss(output, label.astype(dtype, copy=False))
    loss.backward()
    trainer.step(128)
end = time.time()
print(end - sta)

The MXNet version is 1.5.0 (--pre).
When training with FP32, it uses 9921 MB of memory and takes 75 s.
But when I tested with FP16, memory usage started around 7000 MB and kept growing until it ran out of memory.
I don't know why; it looks like the memory isn't being freed.
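
One thing worth checking (this is an assumption about the cause, not something verified on this exact script): because the MXNet engine is asynchronous, the Python loop can enqueue iterations much faster than the GPU completes them, and every queued iteration keeps its arrays alive, which can look exactly like memory that is never freed. A minimal sketch that adds one synchronization point per iteration to keep the backlog bounded, reusing the net, data, label, Loss, trainer, dtype, and test_num from the script above:

sta = time.time()
for _ in range(test_num):
    with autograd.record():
        output = net(data.astype(dtype, copy=False))
        loss = Loss(output, label.astype(dtype, copy=False))
    loss.backward()
    trainer.step(128)
    loss.wait_to_read()  # synchronize once per iteration so the queue (and its memory) stays bounded
mx.nd.waitall()          # include any remaining GPU work in the measurement
end = time.time()
print(end - sta)

If memory still grows without bound after this change, the problem is likely elsewhere.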


PistonY commented Jan 30, 2019

And I tried running only the forward pass:

sta = time.time()
for _ in range(test_num):
    with autograd.record():
        output = net(data.astype(dtype, copy=False))
        # loss = Loss(output, lable.astype(dtype, copy=False))
    # loss.backward()
    # trainer.step(128)
end = time.time()

FP32 costs 7.83 s.
FP16 costs 18.9 s.
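
The same timing caveat applies to these forward-only numbers: output is never read, so nothing in the loop forces the GPU to finish before end = time.time(). A synchronized variant, under the same assumptions as above:

sta = time.time()
for _ in range(test_num):
    with autograd.record():
        output = net(data.astype(dtype, copy=False))
mx.nd.waitall()  # wait for every forward pass to finish before stopping the clock
end = time.time()
print(end - sta)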

@eric-haibin-lin
Member

Were you using self-attention blocks with the batch_dot operator? There was an improvement for FP16 in #13716.
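
For context, batch_dot is MXNet's batched matrix multiply, which self-attention blocks typically use for the query-key and attention-value products. A minimal, self-contained FP16 example (the shapes here are arbitrary, and a GPU at mx.gpu(0) is assumed):

import mxnet as mx

ctx = mx.gpu(0)
# attention-style scores: (batch, M, K) x (batch, K, N) -> (batch, M, N)
q = mx.nd.random.normal(shape=(32, 64, 128), ctx=ctx).astype('float16')
k = mx.nd.random.normal(shape=(32, 128, 64), ctx=ctx).astype('float16')
scores = mx.nd.batch_dot(q, k)
scores.wait_to_read()
print(scores.shape, scores.dtype)  # (32, 64, 64) float16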

PistonY closed this as completed Feb 18, 2019

PistonY commented Feb 18, 2019

Thanks, it worked.
