Support multi-threading for Custom Operator #14363
Conversation
I will check whether it is correct to use the lock.
Nice enhancement. Two questions:
@mxnet-label-bot add [Operator, pr-awaiting-review]
Thanks for the PR! Nice work.
@@ -139,38 +143,51 @@ class CustomOperator {
      destructing_ = true;
      cv_.notify_all();
    }
    worker_.join();
    for (auto &worker : workers_)
      worker.join();
Have you evaluated this when the custom op is part of a bigger graph, and whether there is any performance impact? Since CustomOperator is static, its lifetime extends to the end of the program and its destructor is only called then. This means one thread is waiting on the condition variable while the other 11 threads are trying to obtain the lock and sit in a blocked state. Since these are idle threads, I am not sure the impact will be significant, but it would be good to verify. It would also help us come up with a good default for MXNET_CUSTOM_OP_NUM_THREADS.
Thank you!
Sorry, I do not have a server with multiple GPUs to evaluate a big graph.
In this PR, the number of threads grows only when the existing threads are not enough, up to a maximum of MXNET_CUSTOM_OP_NUM_THREADS (a rough sketch of this spawning logic is below).
There is always a thread that acquires the lock and executes the operator function, so I do not think the idle threads hurt performance.
I think a good upper bound for MXNET_CUSTOM_OP_NUM_THREADS is the number of CPU cores.
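To make that behaviour concrete, here is a minimal sketch, not the actual CustomOperator code in this PR: names such as q_, workers_, num_free_threads_ and destructing_ mirror the ones visible in the diff, while the class name TaskQueue and everything else is illustrative. A worker is spawned only when no free thread exists and the cap has not been reached.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Illustrative sketch of the dispatch logic described above.
class TaskQueue {
 public:
  explicit TaskQueue(int max_threads) : max_threads_(max_threads) {}

  void Push(std::function<void()> fn) {
    std::unique_lock<std::mutex> lock(mutex_);
    q_.push(std::move(fn));
    // Spawn an extra worker only when every existing worker is busy and the
    // cap (e.g. the value of MXNET_CUSTOM_OP_NUM_THREADS) is not reached.
    if (num_free_threads_ == 0 &&
        static_cast<int>(workers_.size()) < max_threads_) {
      ++num_free_threads_;
      workers_.emplace_back(&TaskQueue::Worker, this);
    }
    cv_.notify_one();
  }

  ~TaskQueue() {
    {
      std::unique_lock<std::mutex> lock(mutex_);
      destructing_ = true;
      cv_.notify_all();
    }
    for (auto& worker : workers_) worker.join();
  }

 private:
  void Worker() {
    std::unique_lock<std::mutex> lock(mutex_);
    while (!destructing_) {
      while (!q_.empty()) {
        --num_free_threads_;
        auto fn = q_.front();
        q_.pop();       // pop while holding the lock so no task runs twice
        lock.unlock();
        fn();           // execute the custom-op callback without the lock
        lock.lock();
        ++num_free_threads_;
      }
      cv_.wait(lock, [this] { return !q_.empty() || destructing_; });
    }
  }

  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> q_;
  std::vector<std::thread> workers_;
  int num_free_threads_ = 0;
  int max_threads_;
  bool destructing_ = false;
};
```

Idle workers in this sketch block inside cv_.wait rather than spinning, which is why they should not cost measurable CPU time while no custom op is running.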
Yes, it's fine during op execution, but even after the custom op finishes there is one thread waiting on the condition variable and the other 11 trying to acquire the lock.
I see.
I did consider decreasing the number of threads when they are idle; however, it is difficult to estimate how many to keep.
Can you run example/reinforcement-learning/dqn, which includes a custom op, on CPU to check the performance?
It may not be possible to measure the improvement on CPU only, since there is usually just one computation stream. I will find a server with multiple GPUs and check on it.
Yes, this change is mainly a performance improvement for GPU machines, but we should be careful not to impact CPU performance for inference. Otherwise we should keep the default small.
Sorry, I hit the following problem when running the RL demo:
(gdb) bt
#0 0x00007ffff7e1ff73 in free () at /usr/lib/libc.so.6
#1 0x00007fffbf2c9ad1 in ALEInterface::welcomeMessage[abi:cxx11]() () at /usr/lib/python2.7/site-packages/ale_python_interface/libale_c.so
#2 0x00007fffbf2c9d0a in ALEInterface::ALEInterface() () at /usr/lib/python2.7/site-packages/ale_python_interface/libale_c.so
#3 0x00007fffbf2c7fbb in ALE_new () at /usr/lib/python2.7/site-packages/ale_python_interface/libale_c.so
#4 0x00007ffff6d236d0 in ffi_call_unix64 () at /usr/lib/libffi.so.6
#5 0x00007ffff6d230a0 in ffi_call () at /usr/lib/libffi.so.6
#6 0x00007ffff6cc9d77 in _ctypes_callproc () at /usr/lib/python2.7/lib-dynload/_ctypes.so
#7 0x00007ffff6cc37d0 in () at /usr/lib/python2.7/lib-dynload/_ctypes.so
#8 0x00007ffff7c8cb43 in PyObject_Call () at /usr/lib/libpython2.7.so.1.0
#9 0x00007ffff7c1c10e in PyEval_EvalFrameEx () at /usr/lib/libpython2.7.so.1.0
#10 0x00007ffff7c156ba in PyEval_EvalCodeEx () at /usr/lib/libpython2.7.so.1.0
I'm checking it. The problem may be related to GPERFTOOLS and JEMALLOC. I have disabled them and plan to rebuild MXNet.
In my experiment, the performance does not drop in the DQN example. The FPS stays at 1500 on my laptop with only a CPU, an i7-7500U (2c4t).
while (!q_.empty()) {
  --num_free_threads;
  auto fn = q_.front();
  q_.pop();
Nice! Popping early will prevent the race condition.
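For readers skimming the diff, the point is roughly this (an illustrative, self-contained example rather than the PR's code): the task must be removed from the queue while the mutex is still held, otherwise two workers could both see it at the front and execute it twice.

```cpp
#include <cstdio>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Two workers drain one queue. Each worker pops the task while it still
// holds the mutex, so the same task can never be picked up twice once the
// lock is released for execution.
int main() {
  std::mutex mutex;
  std::queue<std::function<void()>> q;
  for (int i = 0; i < 8; ++i) {
    q.push([i] { std::printf("task %d\n", i); });
  }

  auto worker = [&] {
    std::unique_lock<std::mutex> lock(mutex);
    while (!q.empty()) {
      auto fn = q.front();
      q.pop();          // pop before unlocking: this worker now owns the task
      lock.unlock();
      fn();             // run the task without holding the lock
      lock.lock();
    }
  };

  std::vector<std::thread> workers;
  for (int i = 0; i < 2; ++i) workers.emplace_back(worker);
  for (auto& w : workers) w.join();
  return 0;
}
```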
LGTM!
@eric-haibin-lin @szha Hi! Is this PR ready to be merged?
* Support multi-threading for Custom Operator
* update
* Update custom-inl.h
Great work!
Description
Hi! I found that custom operators were executed by a single thread, which hurts performance.
Would it be better to support executing custom operators with multiple threads?
Test Case:
Cost Time:
Before: 15s
After: 5s
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Changes
Added the MXNET_CUSTOM_OP_NUM_THREADS environment variable to cap the number of custom-operator worker threads (see the sketch below).
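A minimal sketch of how such a cap is typically consumed, assuming the dmlc::GetEnv helper used elsewhere in MXNet; the default value of 16 here is only an illustrative assumption, not necessarily the value chosen in this PR.

```cpp
#include <dmlc/parameter.h>  // provides dmlc::GetEnv

// Read the thread cap once at startup; 16 is an assumed fallback default.
static const int kMaxCustomOpThreads =
    dmlc::GetEnv("MXNET_CUSTOM_OP_NUM_THREADS", 16);
```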
Comments