Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Fix for import mxnet taking long time if multiple process launched #13602

Merged
merged 4 commits into from
Dec 13, 2018

Conversation

Vikas-kum
Copy link
Contributor

@Vikas-kum Vikas-kum commented Dec 10, 2018

In case there are many cores(72 cores as in c5.18xl), doing import mxnet in multiple processes take very long time. Details here: #12255

One of the reason we have OMP tuning code which iterates to find OMP tune overhead. We are reducing this iteration count to reduce the overehead of tuning code.
Also, We added an environment variable where users can set the number of cores that should be used to determine tuning.
Reducing number of cores for tuning also helps in reduction of number of iterations.

Successfully ran operator tuning unit test.(OMP_TUNING*)
Ran below test code on c5.18xl with 72 cores to see if the import mxnet is faster

import os
import time
import multiprocessing
from os import getpid

def mxnet_worker():
    print("before import: pid:{}".format(getpid()))
    st_time = time.time()
    import mxnet
    end_time = time.time()
    print("after import: pid:{} time:{}".format(getpid(), end_time - st_time))

read_process = [multiprocessing.Process(target=mxnet_worker) for i in range(30)]
from threading import Thread
i=0
#while i<32:
#    t1 = Thread(target=mxnet_worker)
#    t1.start()
#    i=i+1
for p in read_process:
    p.daemon = True
#    time.sleep(3)
    p.start()
time.sleep(100000)

Loading 30 mxnet processes takes 41 seconds with no change in num_cores for tuning.

Ideally num_cores should be 3
When I run above code, after setting the environ variable export MXNET_USE_NUM_CORES_OPERATOR_TUNING=3
the 30 mxnet processes finishs loading in 2.2 seconds.

Description

(Brief description on what this PR is about)

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

doing import mxnet in multiple processes take very long time.
Details : apache#12255
One of the reason we have OMP tuning code which iterates to find OMP
tune overhead. We are reducing this iteration count to reduce the
overehead of tuning code.
Also, We added an environment variable where users can set the number
of cores that should be used to determine tuning.
Copy link
Contributor

@marcoabreu marcoabreu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for this improvement. Great catch!

src/operator/operator_tune-inl.h Show resolved Hide resolved
src/operator/operator_tune-inl.h Show resolved Hide resolved
Copy link
Member

@yuxihu yuxihu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch!

src/operator/operator_tune-inl.h Show resolved Hide resolved
Copy link
Member

@TaoLv TaoLv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still not very sure about the mechanism of operator tuning and how it contributes to other operator computation. Do we have any performance comparison between w/ and w/o operator tuning? Previously, I set USE_OPERATOR_TUNING=0 when built MXNet to avoid import hang. Followed the tips here: #10560 (comment)

@Vikas-kum Vikas-kum requested a review from szha as a code owner December 11, 2018 19:24
@Vikas-kum
Copy link
Contributor Author

Vikas-kum commented Dec 11, 2018

Still not very sure about the mechanism of operator tuning and how it contributes to other operator computation. Do we have any performance comparison between w/ and w/o operator tuning? Previously, I set USE_OPERATOR_TUNING=0 when built MXNet to avoid import hang. Followed the tips here: #10560 (comment)

Good point. The operator tuning code was introduced by this PR: #8686 , I don't see any numbers for performance comparison with/without tuning code. I think you should start a discussion in dev thread and may be we would able to find some relevant historical info and then decide whether operator tuning should be enabled as default or not.

Regarding this PR, it is optimizing for the cases where OPERATOR_TUNING is enabled by default and unblocking 4.0 release.

Copy link
Member

@yuxihu yuxihu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

- Values: String representation of MXNET_ENABLE_OPERATOR_TUNING environment variable
- 0=disable all
- 1=enable all
- float32, float16, float32=list of types to enable, and disable those not listed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we list the valid types here: "float32", "float16", "float64", "int8", "uint8", "int32", "int64"

@@ -56,7 +56,7 @@ namespace op {
#endif
#endif // MXNET_NO_INLINE

#define OUTSIDE_COUNT_SHIFT 9
Copy link
Member

@anirudh2290 anirudh2290 Dec 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does changing this impact the IsOMPFaster selection in operator_tune.h. Do we need to tweak WORKLOAD_COUNT_SHIFT too ?

Copy link
Contributor Author

@Vikas-kum Vikas-kum Dec 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Workload_count_shift is currently 11, which means workload count will be 2048.
this means that operation is done for 2048 times. This number can be made smaller, but IsOMPFaster doesn't look like bottleneck for the related issue. It is the function which is calculating OMP overhead which is causing the problem.

@roywei
Copy link
Member

roywei commented Dec 12, 2018

@mxnet-label-bot add[Environment Variables, Operator]

@marcoabreu marcoabreu added Environment Variables Issues related to setting env vars Operator labels Dec 12, 2018
@pengzhao-intel
Copy link
Contributor

pengzhao-intel commented Dec 12, 2018

Based on 10560's comment, "It sometimes block executing in Ubuntu and always block executing in Windows" and several related issues, including import hang, are reported.
Could anyone help verify the functionality of this feature (auto tuning)? @mseth10 @azai91 @lupesko
Maybe set it off by default. Any idea?
@cjolivier01 could you provide more backgrounds and how auto-tuning works?

@anirudh2290
Copy link
Member

I am wondering if this change is needed. For example, we are disabling openmp for child processes, then why cant we also skip tuning for the child processes ?

@Vikas-kum
Copy link
Contributor Author

Vikas-kum commented Dec 12, 2018

@anirudh2290

I am wondering if this change is needed. For example, we are disabling openmp for child processes, then why can't we also skip tuning for the child processes ?

if the option to enable OMP remains there, I think this is needed. If someone enables the option, this makes the loading faster as OMP overhead is calculated faster.

If we remove OMP tuning at all, then this will automatically be removed. But removing it or disabling it will require bigger discussion about historical reasons why it was made default and if there is any benchmarking.

@anirudh2290
Copy link
Member

for child processes we disable openmp : /~https://github.com/apache/incubator-mxnet/blob/master/src/initialize.cc#L66 . This means that number of omp threads will be 1 for all operators executed in the child process. Tuning doesnt help much then.

@Vikas-kum
Copy link
Contributor Author

@anirudh2290
I started a discussion in dev list for disabling openmp tuning by default, just to make sure we are not missing any proven benefit. We would go either way, disable it completely or continue with this PR, depending on input from community.

@anirudh2290
Copy link
Member

To summarize an offline discussion with @Vikas89 . The auto tuning feature has performance benefits which has been documented in #8686 in the attached txt.
tune_all.txt

As we can see for various operator and different inputs the Auto tune selects whether to use OMP or not. For close to 90% of the tests it makes the right selection.

Also @Vikas89 tried to remove tuning for child process since we are disabling openmp for child process, but since the tuning gets triggered during static variable initialization as part of process startup, changes to the fork handlers are not reflected in the tuning code. So we decided to stick to the existing implementation to reduce the number of iterations for omp overhead calculation.

Copy link
Member

@anirudh2290 anirudh2290 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we run sanity performance tests again with : ./mxnet_unit_tests --gtest_filter=OMP_TUNING.EvaluateTuneTestFloat --perf

Copy link
Contributor

@pengzhao-intel pengzhao-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for this PR.

It's nice to see the microbenchmark shows the benefit from auto-tuning; however, I am considering how much benefit we can gain in the real workload.
We have been asking multiple times for this issue from different customers and it doesn't sound like a good experience.

I will spend some time to look into the code and run some benchmarks later.

@Vikas-kum
Copy link
Contributor Author

tuning-perf-results.txt - For test results
@anirudh2290 ran test ./build/tests/cpp/mxnet_unit_tests --gtest_filter=OMP_TUNING.EvaluateTuneTestFloat --perf

@anirudh2290 anirudh2290 merged commit 090f222 into apache:master Dec 13, 2018
@anirudh2290
Copy link
Member

Thanks results look good , 93 out of 96 tests do correct selection

@anirudh2290
Copy link
Member

@pengzhao-intel Can you please point to the performance issues related to OMP tuning.

Vikas-kum added a commit to Vikas-kum/incubator-mxnet that referenced this pull request Dec 13, 2018
…pache#13602)

* apache#12255
doing import mxnet in multiple processes take very long time.
Details : apache#12255
One of the reason we have OMP tuning code which iterates to find OMP
tune overhead. We are reducing this iteration count to reduce the
overehead of tuning code.
Also, We added an environment variable where users can set the number
of cores that should be used to determine tuning.

* cpplint fix

* Adding new environment variable: MXNET_USE_NUM_CORES_OPERATOR_TUNING to doc

* fixing formatting in doc
anirudh2290 pushed a commit that referenced this pull request Dec 13, 2018
…13602)

* #12255
doing import mxnet in multiple processes take very long time.
Details : #12255
One of the reason we have OMP tuning code which iterates to find OMP
tune overhead. We are reducing this iteration count to reduce the
overehead of tuning code.
Also, We added an environment variable where users can set the number
of cores that should be used to determine tuning.

* cpplint fix

* Adding new environment variable: MXNET_USE_NUM_CORES_OPERATOR_TUNING to doc

* fixing formatting in doc
@pengzhao-intel
Copy link
Contributor

#12255 (comment)

@anirudh2290 this is a good summary of the issue, especially in Intel Xeon Phi system where 72 physical cores and 72X4 logic cores are available.

@anirudh2290
Copy link
Member

@pengzhao-intel yes i saw that issue. that issue checks for omp overhead serially so it launches different parallel section (with 2,..18 threads) serially. So I think for 36 cores omp runtime would try to reuse the already launched threads and not launch 170 threads. This can still be a problem when we fork the process into many subprocesses and we tried to disable operator tuning subprocesses but it wasn't trivial. Therefore the solution implemented here would be a good intermediate solution for a reasonable number of processes forked. We should revisit the long term solution to disable tuning in forked processes though.

- 0=disable all
- 1=enable all
- float32, float16, float32=list of types to enable, and disable those not listed
- refer : /~https://github.com/apache/incubator-mxnet/blob/master/src/operator/operator_tune-inl.h#L444
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure it's a good choice to put code link here. Once operator_tune-inl.h is changed, probably we need revise the line number here to avoid confusion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah , I forgot to add that diff where I listed all the data type. I will create a separate PR to correct this.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Environment Variables Issues related to setting env vars Operator
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants