Fix for import mxnet taking long time if multiple process launched #13602
Conversation
Doing import mxnet in multiple processes takes a very long time. Details: apache#12255. One of the reasons is the OMP tuning code, which iterates to measure the OMP tuning overhead. We are reducing this iteration count to reduce the overhead of the tuning code. We also added an environment variable so users can set the number of cores that should be used for tuning.
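For example, a user on a large machine could cap the cores used for tuning before importing mxnet (a minimal sketch; the variable is the one this PR adds, and it must be set before the library loads):

```python
import os

# Cap the number of cores the import-time OMP tuning pass considers.
# Must be set before "import mxnet", since tuning runs at library load.
os.environ["MXNET_USE_NUM_CORES_OPERATOR_TUNING"] = "3"

import mxnet as mx  # tuning now measures overhead on far fewer cores
```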
Thanks a lot for this improvement. Great catch!
Nice catch!
Still not very sure about the mechanism of operator tuning and how it contributes to other operators' computation. Do we have any performance comparison with and without operator tuning? Previously, I set USE_OPERATOR_TUNING=0 when building MXNet to avoid the import hang, following the tips here: #10560 (comment)
Good point. The operator tuning code was introduced by this PR: #8686. I don't see any numbers for a performance comparison with/without the tuning code. I think you should start a discussion on the dev thread; maybe we would be able to find some relevant historical info and then decide whether operator tuning should be enabled by default or not. Regarding this PR, it is optimizing for the cases where OPERATOR_TUNING is enabled by default and unblocking the 1.4.0 release.
LGTM.
- Values: String representation of MXNET_ENABLE_OPERATOR_TUNING environment variable
- 0=disable all
- 1=enable all
- float32, float16, float32=list of types to enable, and disable those not listed
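For instance, a hypothetical usage of the dtype-list form (set before mxnet loads, per the value semantics above):

```python
import os

# Enable tuning only for float32 and float16 kernels; dtypes not in
# the list are left untuned, per the documented semantics above.
os.environ["MXNET_ENABLE_OPERATOR_TUNING"] = "float32,float16"

import mxnet as mx
```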
Can we list the valid types here: "float32", "float16", "float64", "int8", "uint8", "int32", "int64"
@@ -56,7 +56,7 @@ namespace op {
#endif
#endif  // MXNET_NO_INLINE

#define OUTSIDE_COUNT_SHIFT 9
Does changing this impact the IsOMPFaster selection in operator_tune.h? Do we need to tweak WORKLOAD_COUNT_SHIFT too?
WORKLOAD_COUNT_SHIFT is currently 11, which means the workload count will be 2048, i.e., the operation is run 2048 times. This number could be made smaller, but IsOMPFaster doesn't look like the bottleneck for the related issue. It is the function that calculates the OMP overhead that is causing the problem.
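For reference, a sketch of the shift arithmetic being discussed (assuming, as the macro names suggest, that each count is derived as 1 << SHIFT; this is an illustration, not the MXNet source):

```python
WORKLOAD_COUNT_SHIFT = 11  # value cited in this thread
OUTSIDE_COUNT_SHIFT = 9    # value shown in this PR's diff

workload_count = 1 << WORKLOAD_COUNT_SHIFT  # 2048 repetitions of the op
outside_count = 1 << OUTSIDE_COUNT_SHIFT    # 512
print(workload_count, outside_count)        # -> 2048 512
```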
@mxnet-label-bot add[Environment Variables, Operator]
Based on the comment in #10560 ("It sometimes block executing in Ubuntu and always block executing in Windows"), several related issues, including the import hang, have been reported.
I am wondering if this change is needed. For example, since we are disabling OpenMP for child processes, why can't we also skip tuning for the child processes?
If the option to enable OMP remains, I think this is needed. If someone enables the option, this change makes loading faster because the OMP overhead is calculated faster. If we remove OMP tuning altogether, then this will automatically be removed. But removing or disabling it will require a bigger discussion about the historical reasons why it was made the default and whether there is any benchmarking.
For child processes we disable OpenMP: /~https://github.com/apache/incubator-mxnet/blob/master/src/initialize.cc#L66. This means that the number of OMP threads will be 1 for all operators executed in the child process. Tuning doesn't help much then.
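For illustration, a Python analogue of the fork-handler pattern being referenced (the real logic lives in MXNet's C++ initialize.cc; this sketch only mirrors the idea):

```python
import os

def _single_threaded_omp_in_child():
    # Mirror the behavior described above: force single-threaded OpenMP
    # in forked children so operators do not oversubscribe cores.
    os.environ["OMP_NUM_THREADS"] = "1"

# Registered handlers run in each child created via os.fork() (Python 3.7+).
os.register_at_fork(after_in_child=_single_threaded_omp_in_child)
```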
@anirudh2290
To summarize an offline discussion with @Vikas89: the auto-tuning feature has performance benefits, which have been documented in #8686 in the attached txt. As we can see, for various operators and different inputs, auto-tuning selects whether to use OMP or not, and for close to 90% of the tests it makes the right selection. @Vikas89 also tried to remove tuning for child processes, since we disable OpenMP for child processes, but because tuning is triggered during static variable initialization as part of process startup, changes to the fork handlers are not reflected in the tuning code. So we decided to stick with the existing implementation and reduce the number of iterations for the OMP overhead calculation.
Can we run the sanity performance tests again with: ./mxnet_unit_tests --gtest_filter=OMP_TUNING.EvaluateTuneTestFloat --perf
LGTM for this PR.
It's nice to see the microbenchmark shows the benefit of auto-tuning; however, I am wondering how much benefit we gain in real workloads.
We have been asked about this issue multiple times by different customers, and it does not make for a good experience.
I will spend some time looking into the code and run some benchmarks later.
tuning-perf-results.txt - For test results
Thanks, the results look good; 93 out of 96 tests make the correct selection.
@pengzhao-intel Can you please point to the performance issues related to OMP tuning?
…13602) * #12255 Doing import mxnet in multiple processes takes a very long time. Details: #12255. One of the reasons is the OMP tuning code, which iterates to find the OMP tuning overhead. We are reducing this iteration count to reduce the overhead of the tuning code. We also added an environment variable so users can set the number of cores that should be used for tuning. * cpplint fix * Adding new environment variable MXNET_USE_NUM_CORES_OPERATOR_TUNING to doc * fixing formatting in doc
@anirudh2290 This is a good summary of the issue, especially on Intel Xeon Phi systems, where 72 physical cores and 72x4 logical cores are available.
@pengzhao-intel Yes, I saw that issue. That issue checks the OMP overhead serially, so it launches the different parallel sections (with 2, ..., 18 threads) serially. So I think for 36 cores the OMP runtime would try to reuse the already-launched threads and not launch 170 threads. This can still be a problem when we fork the process into many subprocesses; we tried to disable operator tuning in subprocesses, but it wasn't trivial. Therefore the solution implemented here is a good intermediate solution for a reasonable number of forked processes. We should revisit the long-term solution of disabling tuning in forked processes, though.
- 0=disable all
- 1=enable all
- float32, float16, float32=list of types to enable, and disable those not listed
- refer: /~https://github.com/apache/incubator-mxnet/blob/master/src/operator/operator_tune-inl.h#L444
Not sure it's a good choice to put a code link here. Once operator_tune-inl.h is changed, we would probably need to revise the line number here to avoid confusion.
Ah, I forgot to add the diff where I listed all the data types. I will create a separate PR to correct this.
In case there are many cores (72 cores, as on c5.18xl), doing import mxnet in multiple processes takes a very long time. Details here: #12255
One of the reasons is the OMP tuning code, which iterates to find the OMP tuning overhead. We are reducing this iteration count to reduce the overhead of the tuning code.
We also added an environment variable where users can set the number of cores that should be used for tuning.
Reducing the number of cores used for tuning also helps reduce the number of iterations.
Successfully ran the operator tuning unit tests (OMP_TUNING*).
Ran the test code below on a c5.18xl with 72 cores to see if import mxnet is faster:
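The script itself was not preserved in this thread; a minimal sketch of such a reproduction (my reconstruction, not the author's exact code) could look like:

```python
import subprocess
import sys
import time

NUM_PROCESSES = 30  # matches the 30-process experiment described below

start = time.time()
# Launch fresh interpreters so each one pays the full "import mxnet" cost.
procs = [subprocess.Popen([sys.executable, "-c", "import mxnet"])
         for _ in range(NUM_PROCESSES)]
for p in procs:
    p.wait()
print("%d imports finished in %.1f s" % (NUM_PROCESSES, time.time() - start))
```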
Loading 30 mxnet processes takes 41 seconds with no change to the number of cores used for tuning.
Ideally, num_cores should be 3.
When I run the above code after setting the environment variable export MXNET_USE_NUM_CORES_OPERATOR_TUNING=3, the 30 mxnet processes finish loading in 2.2 seconds.