This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Fix for import mxnet taking long time if multiple process launched #13602

Merged: 4 commits, Dec 13, 2018
14 changes: 14 additions & 0 deletions docs/faq/env_var.md
@@ -222,3 +222,17 @@ Settings for More GPU Parallelism
- Set ```MXNET_GPU_WORKER_NTHREADS``` to a larger number (e.g., 2)
- To reduce memory usage, consider setting ```MXNET_EXEC_NUM_TEMP```.
- This might not speed things up, especially for image applications, because GPU is usually fully utilized even with serialized jobs.

Settings for controlling OMP tuning
---------------------------------
- Set ```MXNET_USE_OPERATOR_TUNING=0``` to disable Operator tuning code which decides whether to use OMP or not for operator
- Values: String representation of MXNET_ENABLE_OPERATOR_TUNING environment variable
- 0=disable all
- 1=enable all
- float32, float16, float32=list of types to enable, and disable those not listed
Member:

Can we list the valid types here: "float32", "float16", "float64", "int8", "uint8", "int32", "int64"?

- refer : https://github.com/apache/incubator-mxnet/blob/master/src/operator/operator_tune-inl.h#L444
Member:

Not sure it's a good choice to put a code link here. Once operator_tune-inl.h changes, we would probably need to revise the line number here to avoid confusion.

Contributor Author:

Ah, I forgot to add the diff where I listed all the data types. I will create a separate PR to correct this.


- Set ```MXNET_USE_NUM_CORES_OPERATOR_TUNING``` to define num_cores to be used by operator tuning code.
- This reduces operator tuning overhead when there are multiple instances of mxnet running in the system and we know that
each mxnet will take only partial num_cores available with system.
- refer: https://github.com/apache/incubator-mxnet/pull/13602
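As a rough illustration of the value format documented above (`0`, `1`, or a comma-separated list of types to enable), the following sketch parses such a string. The names `TuningConfig`, `ParseTuningConfig`, and `Trim` are hypothetical and are not MXNet's actual parser; the sketch also does not validate type names against the list suggested in the review comment.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <sstream>
#include <string>

// Hypothetical helper: strip leading and trailing spaces or tabs.
inline std::string Trim(const std::string& s) {
  const std::size_t first = s.find_first_not_of(" \t");
  if (first == std::string::npos) return std::string();
  const std::size_t last = s.find_last_not_of(" \t");
  return s.substr(first, last - first + 1);
}

// Hypothetical result type for this sketch (not an MXNet type).
struct TuningConfig {
  bool enable_all = false;
  std::set<std::string> enabled_types;  // consulted only when enable_all is false

  bool IsEnabled(const std::string& type_name) const {
    assert(!type_name.empty());
    return enable_all || enabled_types.count(type_name) > 0;
  }
};

// Interprets the documented values:
//   "0"              -> disable tuning for all types
//   "1"              -> enable tuning for all types
//   "float32,int8"   -> enable only the listed types, disable the rest
inline TuningConfig ParseTuningConfig(const std::string& value) {
  TuningConfig cfg;
  const std::string trimmed = Trim(value);
  if (trimmed == "0") return cfg;  // everything stays disabled
  if (trimmed == "1") {
    cfg.enable_all = true;
    return cfg;
  }
  std::istringstream stream(trimmed);
  std::string token;
  while (std::getline(stream, token, ',')) {
    const std::string type_name = Trim(token);
    if (!type_name.empty()) cfg.enabled_types.insert(type_name);
  }
  return cfg;
}
```

With this sketch, `ParseTuningConfig("float32, float16").IsEnabled("int32")` is false, matching the "disable those not listed" rule. How an unset or empty variable is treated here (nothing enabled) is a choice of the sketch, not a statement about MXNet's default.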
5 changes: 3 additions & 2 deletions src/operator/operator_tune-inl.h
@@ -56,7 +56,7 @@ namespace op {
#endif
#endif // MXNET_NO_INLINE

-#define OUTSIDE_COUNT_SHIFT 9
anirudh2290 (Member), Dec 11, 2018:

Does changing this impact the IsOMPFaster selection in operator_tune.h? Do we need to tweak WORKLOAD_COUNT_SHIFT too?

Vikas-kum (Contributor Author), Dec 12, 2018:

WORKLOAD_COUNT_SHIFT is currently 11, which means the workload count will be 2048, so the operation is run 2048 times. This number can be made smaller, but IsOMPFaster doesn't look like the bottleneck for the related issue. The function that calculates the OMP overhead is what causes the problem.

+#define OUTSIDE_COUNT_SHIFT 3
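To make the arithmetic in this thread concrete, here is a small sketch of how a shift constant maps to an iteration count. It assumes each count is derived as `1 << shift`, which the comment above confirms only for the workload count; the constant and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Iteration count implied by a shift constant, assuming count = 1 << shift.
constexpr std::size_t CountFromShift(unsigned shift) {
  return static_cast<std::size_t>(1) << shift;
}

// Values taken from the diff and the review comment; names are illustrative.
constexpr unsigned kWorkloadCountShift = 11;   // quoted in the review comment
constexpr unsigned kOldOutsideCountShift = 9;  // value removed by this PR
constexpr unsigned kNewOutsideCountShift = 3;  // value added by this PR

// How many times fewer iterations the new shift implies.
inline std::size_t ReductionFactor(unsigned old_shift, unsigned new_shift) {
  assert(old_shift >= new_shift);
  return CountFromShift(old_shift) / CountFromShift(new_shift);
}

static_assert(CountFromShift(kWorkloadCountShift) == 2048,
              "shift 11 corresponds to 2048 iterations");
```

Under that assumption, lowering the outside shift from 9 to 3 takes the count from 512 to 8, which `ReductionFactor(kOldOutsideCountShift, kNewOutsideCountShift)` reports as a factor of 64.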

namespace tune {

@@ -356,7 +356,8 @@ class OperatorTune : public OperatorTuneByType<DType> {
static duration_t GetOMPLoopOverhead() {
// It was found empirically that OMP times was not heavily tied to number of cores,
// so take an average across all core counts
-const auto max_cores = static_cast<size_t>(omp_get_num_procs()) >> 1;
+const auto max_cores_default = static_cast<size_t>(omp_get_num_procs()) >> 1;
+const auto max_cores = dmlc::GetEnv("MXNET_USE_NUM_CORES_OPERATOR_TUNING", max_cores_default);
if (max_cores >= 2) {
std::vector<duration_t> core_times;
// Take care of any OMP lazy-init with a throwaway call
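The pattern in this hunk, an environment variable that overrides a default of half the available cores, can be sketched without the dmlc dependency. `GetEnvOrDefault` and `TuningCores` below are hypothetical stand-ins built on `std::getenv`; they are not `dmlc::GetEnv`, and the handling of malformed values is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical stand-in for an env lookup with a default: returns the parsed
// value of `name`, or `fallback` when the variable is unset, empty, or not a
// plain decimal number.
inline std::size_t GetEnvOrDefault(const char* name, std::size_t fallback) {
  assert(name != nullptr);
  const char* raw = std::getenv(name);
  if (raw == nullptr || *raw == '\0') return fallback;
  char* end = nullptr;
  const unsigned long long parsed = std::strtoull(raw, &end, 10);
  // Reject values with no digits or with trailing characters such as "4x".
  if (end == raw || *end != '\0') return fallback;
  return static_cast<std::size_t>(parsed);
}

// Mirrors the shape of the patched logic: default to half the available
// cores unless MXNET_USE_NUM_CORES_OPERATOR_TUNING overrides it.
inline std::size_t TuningCores(std::size_t available_cores) {
  const std::size_t max_cores_default = available_cores >> 1;
  return GetEnvOrDefault("MXNET_USE_NUM_CORES_OPERATOR_TUNING", max_cores_default);
}
```

In the real code the available core count comes from `omp_get_num_procs()`; it is passed in as a parameter here so the sketch compiles without OpenMP. When several processes share one machine, each could be launched with the variable set to its own core budget so that the overhead measurement loops over fewer core counts.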