Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Support multi-threading for Custom Operator #14363

Merged
merged 3 commits into from
Mar 15, 2019

Conversation

wkcn
Copy link
Member

@wkcn wkcn commented Mar 8, 2019

Description

Hi! I found that there was single-thread to execute custom operator. It drops the performance.
Is it a better way to support custom operator in multi-thread?

Test Case:

import mxnet as mx
import time

class TestOP(mx.operator.CustomOp):
    def __init__(self):
        super(TestOP, self).__init__()
    def forward(self, is_train, req, in_data, out_data, aux):
        print("sleep 5s")
        time.sleep(5)
        print("sleep finished")
        out_data[0][:] = in_data[0]
    def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
        pass

@mx.operator.register("test_op")
class TestOPProp(mx.operator.CustomOpProp):
    def __init__(self):
        super(TestOPProp, self).__init__()
    def list_arguments(self):
        return ['x']
    def list_outputs(self):
        return ['y']
    def infer_shape(self, in_shape):
        return in_shape, in_shape
    def create_operator(self, ctx, shapes, dtypes):
        return TestOP()

a = mx.nd.array([1,2,3])
tic = time.time()
b = mx.nd.Custom(a, op_type='test_op')
c = mx.nd.Custom(a, op_type='test_op')
d = mx.nd.Custom(a, op_type='test_op')
mx.nd.waitall()
print(time.time() - tic)

Cost Time:
Before: 15s
After: 5s

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Support multi-threading for Custom Operator
  • Add environmental variable MXNET_CUSTOM_OP_NUM_THREADS

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@wkcn wkcn requested a review from szha as a code owner March 8, 2019 03:42
@wkcn
Copy link
Member Author

wkcn commented Mar 8, 2019

I will check whehter it is correct to use the lock.
it seems that there is no problem.

Copy link
Member

@eric-haibin-lin eric-haibin-lin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice enhancement. Two questions:

src/operator/custom/custom-inl.h Show resolved Hide resolved
src/operator/custom/custom-inl.h Show resolved Hide resolved
@vandanavk
Copy link
Contributor

@mxnet-label-bot add [Operator, pr-awaiting-review]

Copy link
Member

@anirudh2290 anirudh2290 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Nice work.

@@ -139,38 +143,51 @@ class CustomOperator {
destructing_ = true;
cv_.notify_all();
}
worker_.join();
for (auto &worker : workers_)
worker.join();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you evaluated this when the custom op is part of a bigger graph and if there is any performance impact ? Since the CustomOperator is static its lifetime is till the end of the program and destructor of customoperator gets called at the end. This means there is one thread that is waiting on the condition variable while other 11 threads tryign to obtain the lock and in a blocked state. Since these are idle threads, I am not sure if the impact will be significant but good to verify. Will also help us come up with a good default for MXNET_CUSTOM_OP_NUM_THREADS

Copy link
Member Author

@wkcn wkcn Mar 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

Sorry that I do not have any server with multiple GPUs to evaluate a big graph.

In this PR, the number of threads will increase when threads are not enough, and the maximum number is MXNET_CUSTOM_OP_NUM_THREADS.

There are always threads which get the lock and execute the operator function, so I think the idle threads do not drop the performance.

I think the maximum MXNET_CUSTOM_OP_NUM_THREADS is the number of CPU cores.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it's fine during op execution but even after custom op execution there is one thread waiting on CV and other 11 trying to acquire lock.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see.
I ever thought that decrease the number of threads when threads are idle, however it is difficult to estimate the number.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you run the example/reinforcement-learning/dqn which includes a custom op on CPU to check for performance.

Copy link
Member Author

@wkcn wkcn Mar 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be not available to check for performance on only CPU,since there Is only a computational stream usually. I will find a server with multiple GPUs and check on it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this change is mainly for performance improvements on a GPU machine, but we should be careful not to impact cpu performance for inference. Otherwise we should keep the default small.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry that I met the problem when running the RL demo:

(gdb) bt                                                                                                                                    
#0  0x00007ffff7e1ff73 in free () at /usr/lib/libc.so.6                                                                                     
#1  0x00007fffbf2c9ad1 in ALEInterface::welcomeMessage[abi:cxx11]() () at /usr/lib/python2.7/site-packages/ale_python_interface/libale_c.so 
#2  0x00007fffbf2c9d0a in ALEInterface::ALEInterface() () at /usr/lib/python2.7/site-packages/ale_python_interface/libale_c.so              
#3  0x00007fffbf2c7fbb in ALE_new () at /usr/lib/python2.7/site-packages/ale_python_interface/libale_c.so                                   
#4  0x00007ffff6d236d0 in ffi_call_unix64 () at /usr/lib/libffi.so.6                                                                        
#5  0x00007ffff6d230a0 in ffi_call () at /usr/lib/libffi.so.6                                                                               
#6  0x00007ffff6cc9d77 in _ctypes_callproc () at /usr/lib/python2.7/lib-dynload/_ctypes.so                                                 
#7  0x00007ffff6cc37d0 in  () at /usr/lib/python2.7/lib-dynload/_ctypes.so                                                                 
#8  0x00007ffff7c8cb43 in PyObject_Call () at /usr/lib/libpython2.7.so.1.0                                                                  
#9  0x00007ffff7c1c10e in PyEval_EvalFrameEx () at /usr/lib/libpython2.7.so.1.0                                                            
#10 0x00007ffff7c156ba in PyEval_EvalCodeEx () at /usr/lib/libpython2.7.so.1.0             

Copy link
Member Author

@wkcn wkcn Mar 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm checking it. The problem may be related to GPERFTOOLS and JEMALLOC. I close them and plan to rebuild MXNet.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my experiment, the performance does not drop in DQN example. The FPS keeps 1500 on my laptop with only CPU i7-7500U(2c4t).

while (!q_.empty()) {
--num_free_threads;
auto fn = q_.front();
q_.pop();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice ! Popping early will prevent the race condition.

Copy link
Member

@anirudh2290 anirudh2290 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@wkcn
Copy link
Member Author

wkcn commented Mar 15, 2019

@eric-haibin-lin @szha Hi! Is it available to merge this PR?

@szha szha merged commit 74c2274 into apache:master Mar 15, 2019
vdantu pushed a commit to vdantu/incubator-mxnet that referenced this pull request Mar 31, 2019
* Support multi-threading for Custom Operator

* update

* Update custom-inl.h
@RogerChern
Copy link

Great work!

nswamy pushed a commit that referenced this pull request Apr 5, 2019
* Support multi-threading for Custom Operator

* update

* Update custom-inl.h
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
* Support multi-threading for Custom Operator

* update

* Update custom-inl.h
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Operator pr-awaiting-review PR is waiting for code review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants