
Low GPU utilization on python/paddle/v2/fluid/tests/book/test_label_semantic_roles.py #7652

Closed
guru4elephant opened this issue Jan 18, 2018 · 5 comments

@guru4elephant
Member

guru4elephant commented Jan 18, 2018

  • Compile Version: 0.11.0

  • Device: GPU, a P40 card

  • Script: python/paddle/v2/fluid/tests/book/test_label_semantic_roles.py

  • I changed the script on line 178 from
    place = fluid.CPUPlace() to
    place = fluid.CUDAPlace(0)
    (a minimal sketch of this change follows the list)

  • The example runs normally, but GPU utilization is only about 20% on a single card. I suspect some ops are not running on the GPU. Is there any way to check whether all ops are running on GPU devices?
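
The change is just swapping the place the executor is constructed with; a minimal sketch, assuming the usual executor setup in the book examples (the surrounding training code is omitted):

    import paddle.v2.fluid as fluid

    # Original: run everything on the CPU
    # place = fluid.CPUPlace()
    # Changed: run on the first GPU instead
    place = fluid.CUDAPlace(0)
    exe = fluid.Executor(place)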

@jacquesqiao
Member

jacquesqiao commented Jan 18, 2018

Yes, currently the linear_chain_crf and crf_decoding operators can only run on the CPU. The framework automatically copies memory from GPU to CPU when it finds an operator that can only run on the CPU while its input tensors are on the GPU.
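
If you want to check which operators lack a GPU kernel, one option (a sketch, assuming the core.op_support_gpu binding that fluid exposes in this version) is to walk the ops of the program:

    import paddle.v2.fluid as fluid
    import paddle.v2.fluid.core as core

    prog = fluid.default_main_program()
    for op in prog.global_block().ops:
        if not core.op_support_gpu(op.type):
            # these ops will be scheduled on the CPU even with CUDAPlace(0)
            print("op '%s' has no GPU kernel" % op.type)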

@lcy-seso lcy-seso self-assigned this Jan 18, 2018
@jacquesqiao
Member

I have tested the performance: when running on CUDAPlace, the speed is about twice that of running on CPUPlace. You can check this yourself.
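
A rough way to reproduce such a comparison (a sketch; train_one_pass is a hypothetical helper standing in for one pass of the training loop in the test script, not part of it):

    import time
    import paddle.v2.fluid as fluid

    def timed_run(place, train_one_pass):
        # train_one_pass: hypothetical helper that runs one training pass
        # of test_label_semantic_roles.py with the given executor
        exe = fluid.Executor(place)
        start = time.time()
        train_one_pass(exe)
        return time.time() - start

    # cpu_time = timed_run(fluid.CPUPlace(), train_one_pass)
    # gpu_time = timed_run(fluid.CUDAPlace(0), train_one_pass)
    # print("speedup: %.2fx" % (cpu_time / gpu_time))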

@lcy-seso
Contributor

lcy-seso commented Jan 18, 2018

@jacquesqiao I will remove the copy operation in the linear_chain_crf op to avoid this repeated copy.

related to #7654

@guru4elephant guru4elephant assigned wangkuiyi and unassigned lcy-seso Jan 19, 2018
@lcy-seso lcy-seso assigned lcy-seso and unassigned wangkuiyi Jan 19, 2018
@lcy-seso
Contributor

The memory copy inside the linear_chain_crf_op is removed in PR #7675. This fixes the problem that, when this operator runs on the GPU, both the framework and the operator itself copy the inputs from GPU memory to CPU memory and then copy the results back to GPU memory.

@jshower
Contributor

jshower commented Mar 21, 2018

We used the profiling tool to look at the execution time of the different ops; the results are as follows.

Event                             Calls       Total       Min.        Max.        Ave.        
thread0::linear_chain_crf_grad    12          12908.2     128.992     1273.86     1075.68     
thread0::crf_decoding             12          3697        36.998      370.332     308.084     
thread0::linear_chain_crf         12          3407.43     33.9113     346.214     283.953     
thread0::elementwise_add_grad     216         2215.51     0.483488    13.0267     10.257      
thread0::mul_grad                 288         971.093     0.516384    8.40723     3.37185     
thread0::lstm_grad                72          924.668     10.1028     13.9453     12.8426     
thread0::lstm                     96          598.993     2.96118     11.5723     6.23952     
thread0::mul                      384         456.164     0.08368     3.00365     1.18793     
thread0::sum                      321         406.066     0.066144    5.77043     1.265       
thread0::scale                    360         168.86      0.045376    4.14739     0.469056    
thread0::tanh_grad                216         112.893     0.067488    0.622016    0.522653    
thread0::tanh                     288         110.25      0.016864    0.50464     0.382813    
thread0::elementwise_add          288         109.787     0.02304     0.568384    0.381203    
thread0::sgd                      678         30.7195     0.008288    0.347968    0.045309    
thread0::lookup_table_grad        72          23.3073     0.162976    0.73408     0.323712    
thread0::lookup_table             96          21.4618     0.06256     0.349184    0.223561    
thread0::chunk_eval               12          19.0456     0.26656     1.97776     1.58713     
thread0::mean                     9           2.16704     0.224384    0.268192    0.240782    
thread0::elementwise_mul          30          1.55699     0.009568    0.103264    0.0518997   
thread0::mean_grad                12          1.38957     0.100224    0.169664    0.115797

As shown above, the CRF-related ops account for most of the total execution time. This is because CRF runs on the CPU; a follow-up is to provide a GPU implementation of CRF to address this.
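
For reference, a per-op summary like the one above can be collected with the fluid profiler (a sketch, assuming the profiler.profiler context manager available in this version; the tiny program below is only a stand-in for the training program built by test_label_semantic_roles.py):

    import numpy
    import paddle.v2.fluid as fluid
    import paddle.v2.fluid.profiler as profiler

    # stand-in network; in practice, wrap the real training loop instead
    x = fluid.layers.data(name='x', shape=[13], dtype='float32')
    y = fluid.layers.fc(input=x, size=1)

    place = fluid.CUDAPlace(0)
    exe = fluid.Executor(place)
    exe.run(fluid.default_startup_program())

    # Profile a handful of mini-batches; a per-op summary sorted by
    # total time is printed when the block exits.
    with profiler.profiler('GPU', 'total'):
        for _ in range(10):
            exe.run(fluid.default_main_program(),
                    feed={'x': numpy.random.random((32, 13)).astype('float32')},
                    fetch_list=[y])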

@jshower jshower closed this as completed Mar 21, 2018