This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

USE_MKLDNN=1 is default in make build (mkldnn must be explicitly turned off) #12591

Closed
wants to merge 19 commits

Conversation

@azai91 (Contributor) commented Sep 18, 2018

Description

We are migrating to include MKLDNN in the default MXNet build: USE_MKLDNN will be set to 1 by default (and must therefore be explicitly turned off on platforms where it is unsupported).

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, a README.md is added to explain what the example does, the source of the dataset, the expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • set USE_MKLDNN to 1 unless explicitly set to 0
  • set USE_MKLDNN=0 for non-MKLDNN builds in Jenkins

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@azai91 requested a review from @szha as a code owner, Sep 18, 2018 18:56
@azai91 changed the title from "update docs to start requiring cmake for building mxnet from source" to "USE_MKLDNN=1 is default in make build (mkldnn must be explicitly turned off)", Sep 18, 2018
@szha (Member) left a comment:

The pip package distribution explicitly sets mkldnn=0, so it won't be a problem there. On the other hand, I'd like to see enough evidence that this is the right thing to do first:

  1. the performance difference between a regular build and an MKLDNN build on various hardware.
  2. numeric stability and model-convergence evaluation.
  3. portability.

@azai91 (Contributor, author) commented Sep 18, 2018

Will look into portability.

@mseth10 can you publish accuracy and performance numbers with a small dataset like CIFAR?

@szha (Member) commented Sep 18, 2018

@azai91 @mseth10 if possible please get results on more use cases. CIFAR for an image classification model alone won't be representative enough.

@stu1130 (Contributor) commented Sep 18, 2018

@mxnet-label-bot[pr-awaiting-response]

@marcoabreu marcoabreu added the pr-awaiting-response PR is reviewed and waiting for contributor to respond label Sep 18, 2018
@pengzhao-intel (Contributor):

I think @juliusshufan can help provide more accuracy and performance data.

@juliusshufan (Contributor) commented Sep 20, 2018

Updating the ImageNet-1k inference accuracy, based on the Gluon model zoo (pre-trained models); the comparison target is an NVIDIA GPU.
(Models include: AlexNet, VGG16, ResNet50-v1/v2, Inception-v3, DenseNet, SqueezeNet, MobileNet-v1.0)

On Python2

| topology | CPU top-1 | CPU top-5 | GPU top-1 | GPU top-5 |
|---|---|---|---|---|
| alexnet | 0.556455 | 0.785575 | 0.556455 | 0.785523 |
| resnet50_v1 | 0.753367 | 0.926907 | 0.753367 | 0.926907 |
| resnet50_v2 | 0.761327 | 0.929354 | 0.761327 | 0.929354 |
| vgg16 | 0.720138 | 0.90662 | 0.720138 | 0.90662 |
| densenet121 | 0.736587 | 0.917328 | 0.736587 | 0.917328 |
| squeezenet1.1 | 0.561469 | 0.792099 | 0.561481 | 0.792099 |
| mobilenet1.0 | 0.693531 | 0.889003 | 0.693531 | 0.889003 |
| inceptionv3 | 0.762979 | 0.928074 | 0.762979 | 0.92814 |

On Python3

| topology | CPU top-1 | CPU top-5 | GPU top-1 | GPU top-5 |
|---|---|---|---|---|
| alexnet | 0.556455 | 0.785575 | 0.556455 | 0.785523 |
| resnet50_v1 | 0.753367 | 0.926907 | 0.753367 | 0.926907 |
| resnet50_v2 | 0.761327 | 0.929354 | 0.761327 | 0.929354 |
| vgg16 | 0.720138 | 0.90662 | 0.720138 | 0.90662 |
| densenet121 | 0.736587 | 0.917328 | 0.736587 | 0.917328 |
| squeezenet1.1 | 0.561469 | 0.792099 | 0.561481 | 0.792099 |
| mobilenet1.0 | 0.693531 | 0.889003 | 0.693531 | 0.889003 |
| inceptionv3 | 0.762979 | 0.928074 | 0.762979 | 0.92814 |
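For context, a minimal sketch of how such a Gluon model-zoo accuracy check can be run (this is not the exact script used for the tables above; `val_data` is an assumed ImageNet-1k validation iterator with standard preprocessing):

```python
# Hypothetical sketch: `val_data` is assumed to be an ImageNet-1k validation
# iterator (yielding mx.io.DataBatch) with standard preprocessing; it is not
# defined here.
import mxnet as mx
from mxnet.gluon.model_zoo import vision

ctx = mx.cpu()  # or mx.gpu(0) for the GPU column
net = vision.get_model('resnet50_v1', pretrained=True, ctx=ctx)

top1 = mx.metric.Accuracy()
top5 = mx.metric.TopKAccuracy(top_k=5)
for batch in val_data:
    out = net(batch.data[0].as_in_context(ctx))
    top1.update([batch.label[0]], [out])
    top5.update([batch.label[0]], [out])
print(top1.get(), top5.get())
```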

@juliusshufan (Contributor):

For ImageNet-1k training, we use the script under example/image-classification; training was executed on both GPU and CPU with the same hyper-parameters and no further tuning, and the training curves on GPU and CPU align closely. The following figures correspond to ResNet50 v1 and v2.
[figure: ResNet50-v1 training curves, CPU vs. GPU]
[figure: ResNet50-v2 training curves, CPU vs. GPU]

@juliusshufan (Contributor) commented Sep 24, 2018

Apart from the ImageNet-1k training test, training tests were also executed on small datasets, including:

| Dataset | Training set | Validation set | Classes | Source |
|---|---|---|---|---|
| CIFAR-10 | 50,000 | 10,000 | 10 | Released by MXNET official: http://data.mxnet.io/data/cifar10/ |
| CIFAR-100 | 50,000 | 10,000 | 100 | Released by MXNET official: http://data.mxnet.io/data/cifar100.zip |
| sampled ImageNet | 100,200 | 10,000 | 200 | Sampled from ImageNet-1k, following the structure and classes of tinyImageNet (https://www.kaggle.com/c/tiny-imagenet/) |

Since SOTA accuracy numbers are not available for these small datasets, the comparison between MXNET-MKLDNN and MXNET-GPU on convergence trends and inference accuracy is used "indirectly" as a correctness check of MXNET with the MKLDNN backend.
The tables below list the validation accuracy on CIFAR-10, CIFAR-100, and the sampled ImageNet, with comparisons against GPU, for ResNet-50, VGG16, and Inception-v3.

On ResNet-50:

| Device | HW platform | Dataset | Validation accuracy |
|---|---|---|---|
| CPU | SKX-8180 | sampled ImageNet | top-1 0.629879, top-5 0.842132 |
| GPU | GTX-1080T | sampled ImageNet | top-1 0.630609, top-5 0.840345 |
| CPU | SKX-8180 | CIFAR-10 | top-1 0.917067, top-5 0.997796 |
| GPU | GTX-1080T | CIFAR-10 | top-1 0.921474, top-5 0.998397 |
| CPU | SKX-8180 | CIFAR-100 | top-1 0.734475, top-5 0.915865 |
| GPU | GTX-1080T | CIFAR-100 | top-1 0.723257, top-5 0.913161 |

On Inception-v3 (Inception-v3 only accepts input size 299, so CIFAR is not applicable):

| Device | HW platform | Dataset | Validation accuracy |
|---|---|---|---|
| CPU | SKX-8180 | sampled ImageNet | top-1 0.684964, top-5 0.866470 |
| GPU | GTX-1080T | sampled ImageNet | top-1 0.684095, top-5 0.868890 |

On VGG-16:

| Device | HW platform | Dataset | Validation accuracy |
|---|---|---|---|
| CPU | SKX-8180 | sampled ImageNet | top-1 0.528029, top-5 0.759809 |
| GPU | GTX-1080T | sampled ImageNet | top-1 0.526834, top-5 0.761318 |
| CPU | SKX-8180 | CIFAR-10 | top-1 0.884615, top-5 0.994391 |
| GPU | GTX-1080T | CIFAR-10 | top-1 0.888622, top-5 0.995092 |
| CPU | SKX-8180 | CIFAR-100 | top-1 0.634415, top-5 0.855569 |
| GPU | GTX-1080T | CIFAR-100 | top-1 0.634916, top-5 0.855669 |

The two figures below show the top-5 validation accuracy trends collected on CPU and GPU, respectively.
On CPU:
[figure: top-5 validation accuracy trends, CPU]
On GPU:
[figure: top-5 validation accuracy trends, GPU]

@juliusshufan (Contributor) commented Sep 24, 2018

Benchmark data
The benchmark data was collected on Linux and Mac, comparing builds with and without MKLDNN. Because computation on a build without MKLDNN is too slow, only performance data for selected CNN models is listed. The benchmarking script is based on example/image-classification/benchmark_score.py, which reports throughput in images/sec.

On CentOS 7.4, pip is used for the MXNet installation, i.e. `pip install mxnet==1.3.0` vs. `pip install mxnet-mkl==1.3.0`.
(Benchmarking is executed on a 1-socket Xeon SKX-8180, 28 cores, 192 GB DDR4-2666 memory.)
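A condensed, hypothetical sketch of the kind of dummy-data throughput measurement benchmark_score.py performs; the real script uses the symbolic Module API, and the Gluon-based setup, warmup, and iteration counts here are illustrative assumptions:

```python
# Illustrative sketch only; not the benchmark_score.py code itself.
import time
import mxnet as mx
from mxnet.gluon.model_zoo import vision

def images_per_sec(name, batch_size, iters=50, ctx=mx.cpu()):
    net = vision.get_model(name)       # random weights are fine for a speed test
    net.initialize(ctx=ctx)
    net.hybridize()
    data = mx.nd.random.uniform(shape=(batch_size, 3, 224, 224), ctx=ctx)
    net(data).wait_to_read()           # warmup: build the graph, allocate memory
    start = time.time()
    for _ in range(iters):
        net(data)
    mx.nd.waitall()                    # block until the async engine drains
    return batch_size * iters / (time.time() - start)

for bs in (1, 32, 64, 128, 256):
    print('vgg16, batch %3d: %.1f images/sec' % (bs, images_per_sec('vgg16', bs)))
```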

VGG16

| Batch size | With MKLDNN (img/s) | Without MKLDNN (img/s) | Speed-up |
|---|---|---|---|
| 1 | 63.972961 | 2.776588 | 2304.01% |
| 16 | 90.132777 | 3.27203 | 2754.64% |
| 32 | 90.533301 | 3.271969 | 2766.94% |
| 64 | 90.547993 | 3.332716 | 2716.94% |
| 128 | 90.130061 | 3.303833 | 2728.05% |
| 256 | 89.474756 | 3.333387 | 2684.20% |

Inception-v3

| Batch size | With MKLDNN (img/s) | Without MKLDNN (img/s) | Speed-up |
|---|---|---|---|
| 1 | 58.965411 | 6.244512 | 944.28% |
| 16 | 168.280915 | 6.566202 | 2562.83% |
| 32 | 167.823787 | 6.421525 | 2613.46% |
| 64 | 168.746333 | 6.585618 | 2562.35% |
| 128 | 166.841938 | 6.453535 | 2585.28% |
| 256 | 162.761511 | 6.484705 | 2509.93% |

Inception-v4

| Batch size | With MKLDNN (img/s) | Without MKLDNN (img/s) | Speed-up |
|---|---|---|---|
| 1 | 32.362458 | 3.310546 | 977.56% |
| 16 | 84.847819 | 3.393066 | 2500.62% |
| 32 | 85.549374 | 3.379569 | 2531.37% |
| 64 | 86.123905 | 3.335553 | 2582.00% |
| 128 | 85.134901 | 3.334666 | 2553.03% |
| 256 | 83.655486 | 3.330463 | 2511.83% |

ResNet-50

| Batch size | With MKLDNN (img/s) | Without MKLDNN (img/s) | Speed-up |
|---|---|---|---|
| 1 | 83.434864 | 11.020557 | 757.08% |
| 16 | 194.102224 | 11.092527 | 1749.85% |
| 32 | 197.600266 | 10.904773 | 1812.05% |
| 64 | 199.251137 | 10.746266 | 1854.14% |
| 128 | 198.108861 | 10.732905 | 1845.81% |
| 256 | 196.444539 | 10.638787 | 1846.49% |

MobileNet

| Batch size | With MKLDNN (img/s) | Without MKLDNN (img/s) | Speed-up |
|---|---|---|---|
| 1 | 263.504341 | 27.284977 | 965.75% |
| 16 | 607.443174 | 27.705262 | 2192.52% |
| 32 | 614.830145 | 26.904616 | 2285.22% |
| 64 | 644.903928 | 26.844882 | 2402.33% |
| 128 | 621.659484 | 26.381861 | 2356.39% |
| 256 | 605.399741 | 26.354961 | 2297.10% |

On macOS, the default compilation configuration disables OpenMP. The tables below list the performance data for a build with MKLDNN (OpenMP enabled) and a build without MKLDNN.
(Hardware is an iMac Pro with one 8-core Xeon-W socket and 32 GB DDR4 memory.)

VGG16

| Batch size | With MKLDNN (img/s) | Without MKLDNN (img/s) | Speed-up |
|---|---|---|---|
| 1 | 20.913986 | 7.821254 | 267.40% |
| 16 | 24.273071 | 8.438211 | 287.66% |
| 32 | 24.704907 | 8.480799 | 291.30% |
| 64 | 24.94608 | 8.524874 | 292.63% |
| 128 | 25.074148 | 8.53283 | 293.86% |
| 256 | 25.2629 | 8.535707 | 295.97% |

Inception-v3

| Batch size | With MKLDNN (img/s) | Without MKLDNN (img/s) | Speed-up |
|---|---|---|---|
| 1 | 41.431404 | 10.323434 | 401.33% |
| 16 | 54.312317 | 10.665803 | 509.22% |
| 32 | 54.604119 | 10.621378 | 514.10% |
| 64 | 54.39568 | 10.605843 | 512.88% |
| 128 | 54.410785 | 10.62466 | 512.12% |
| 256 | 54.614424 | 10.616772 | 514.42% |

Inception-v4

| Batch size | With MKLDNN (img/s) | Without MKLDNN (img/s) | Speed-up |
|---|---|---|---|
| 1 | 20.715221 | 5.655873 | 366.26% |
| 16 | 26.249734 | 5.779357 | 454.20% |
| 32 | 26.197659 | 5.761883 | 454.67% |
| 64 | 26.16153 | 5.771389 | 453.30% |
| 128 | 26.247461 | 5.778834 | 454.20% |
| 256 | 26.313875 | 5.77839 | 455.38% |

ResNet-50

| Batch size | With MKLDNN (img/s) | Without MKLDNN (img/s) | Speed-up |
|---|---|---|---|
| 1 | 41.70109 | 19.246681 | 216.67% |
| 16 | 43.132788 | 20.854712 | 206.83% |
| 32 | 41.613291 | 20.570733 | 202.29% |
| 64 | 38.13329 | 20.652445 | 184.64% |
| 128 | 38.839577 | 20.685878 | 187.76% |
| 256 | 38.853521 | 20.68953 | 187.79% |

MobileNet

| Batch size | With MKLDNN (img/s) | Without MKLDNN (img/s) | Speed-up |
|---|---|---|---|
| 1 | 200.91608 | 36.047475 | 557.37% |
| 16 | 287.614019 | 37.224849 | 772.64% |
| 32 | 277.838051 | 36.914548 | 752.65% |
| 64 | 274.474078 | 36.939298 | 743.04% |
| 128 | 273.622323 | 37.04172 | 738.69% |
| 256 | 273.445636 | 36.947783 | 740.09% |

@azai91 (Contributor, author) commented Sep 24, 2018

Just verifying. The above table is for Mac?

@azai91 (Contributor, author) commented Sep 24, 2018

Can we try metrics on macOS with the AVX2 ISA? We are seeing a performance drop when enabling MKLDNN.

@juliusshufan (Contributor):

> Just verifying. The above table is for Mac?

@azai91

  1. Yes. The benchmarking data you were reviewing was collected on an iMac Pro; today I also updated the data collected on CentOS 7.4.
  2. I'll collect the Mac perf data on an AVX2 processor later on. As for the "performance drop enabling MKLDNN" you mention, I suspect it is caused by OpenMP NOT being enabled by default; @xinyu-intel is working on this and can provide more input.

@xinyu-intel (Contributor):

Hi @azai91, you can try the build method below on Mac:

```sh
brew install llvm

# .bash_profile
export LIBRARY_PATH=/usr/local/Cellar/llvm/6.0.1/lib/

# config.mk
CC=/usr/local/Cellar/llvm/6.0.1/bin/clang CXX=/usr/local/Cellar/llvm/6.0.1/bin/clang++

# mkldnn.mk, L40, before the cmake call
CC=/usr/local/Cellar/llvm/6.0.1/bin/clang CXX=/usr/local/Cellar/llvm/6.0.1/bin/clang++

# Makefile: pass -fopenmp even on Darwin
ifeq ($(USE_OPENMP), 1)
#   ifneq ($(UNAME_S), Darwin)
        CFLAGS += -fopenmp
#   endif
endif
```

@juliusshufan (Contributor) commented Sep 27, 2018

RNN-related data, including both accuracy and performance/benchmarking.
Accuracy

  1. A GNMT model implemented in gluon-nlp (scripts/nmt/train_gnmt.py), on the IWSLT2015 dataset, en-vi translation. The encoder-decoder is a 2-layer LSTM. Per the model implementation, gluon RNN cells (unfused kernels) are used, so the MKLDNN FC path is covered. The figure below shows the perplexity trends collected on both GPU and CPU with the same hyper-parameters; the two curves align very well.
     [figure: GNMT training perplexity, CPU vs. GPU]
  2. A simple RNN model, provided by the official MXNET repo (/example/rnn/bucketing) and implemented with the RNN symbol API. Training tests use a 3-layer LSTM and GRU model with the fused RNN kernel on CPU and GPU and compare the training curves; see the figures below for the training perplexity trends.
     [figure: LSTM bucketing training perplexity, CPU vs. GPU]
     [figure: GRU bucketing training perplexity, CPU vs. GPU]

Benchmarking
Thanks to the new Gluon RNN API features released in MXNET 1.3.0, dummy-data benchmarking was executed with the fused and unfused Gluon RNN APIs respectively, using MXNET with MKLDNN as the backend.
The benchmarking uses a series of predefined input shapes, on a 1-socket SKX-8180 CPU, 28 cores and 192 GB DDR4 memory. (The input size is the embedding size, which equals the hidden size by default.)
The metric is sentences per second (SPS); a sketch of such a micro-benchmark follows.
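A rough sketch of a fused-vs-unfused micro-benchmark; the shapes, warmup, and iteration counts are illustrative assumptions, not the exact harness used to produce the tables below:

```python
# Illustrative micro-benchmark only; parameters are assumptions.
import time
import mxnet as mx
from mxnet import gluon

N, T, C = 64, 25, 512            # batch size, sequence length, input/hidden size
x = mx.nd.random.uniform(shape=(N, T, C))

fused = gluon.rnn.LSTM(C, num_layers=1, layout='NTC')   # fused RNN kernel
fused.initialize()
unfused = gluon.rnn.LSTMCell(C)                         # unrolled cell
unfused.initialize()

def sentences_per_sec(fn, warmup=5, iters=20):
    for _ in range(warmup):
        fn()
    mx.nd.waitall()
    start = time.time()
    for _ in range(iters):
        fn()
    mx.nd.waitall()              # block until the async engine drains
    return N * iters / (time.time() - start)

print('fused SPS:  ', sentences_per_sec(lambda: fused(x)))
print('unfused SPS:', sentences_per_sec(
    lambda: unfused.unroll(T, x, layout='NTC', merge_outputs=True)))
```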

1-layer LSTM, fused vs. unfused

| Input shape [N, T, C, input size] | Fused (SPS) | Unfused (SPS) | Boost |
|---|---|---|---|
| [64, 15, 500, 500] | 2917.237852 | 1667.527 | 174.94% |
| [64, 20, 500, 500] | 3661.45311 | 1196.497 | 306.01% |
| [64, 25, 500, 500] | 3288.546223 | 855.2861 | 384.50% |
| [64, 30, 500, 500] | 2913.375177 | 660.5786 | 441.03% |
| [64, 35, 500, 500] | 2581.44028 | 519.6848 | 496.73% |
| [64, 40, 500, 500] | 2479.42023 | 714.7851 | 346.88% |
| [64, 45, 500, 500] | 2300.442591 | 625.1124 | 368.00% |
| [64, 50, 500, 500] | 2160.407494 | 549.2164 | 393.36% |
| [16, 25, 512, 512] | 1067.593284 | 332.028 | 321.54% |
| [32, 25, 512, 512] | 1830.461068 | 649.8168 | 281.69% |
| [64, 25, 512, 512] | 2827.429465 | 1187.243 | 238.15% |
| [128, 25, 512, 512] | 3938.397784 | 1547.932 | 254.43% |
| [16, 25, 1024, 1024] | 231.900727 | 154.7335 | 149.87% |
| [32, 25, 1024, 1024] | 429.570455 | 298.2182 | 144.05% |
| [64, 25, 1024, 1024] | 744.384772 | 480.4162 | 154.95% |
| [128, 25, 1024, 1024] | 1204.706856 | 696.3014 | 173.02% |
| [16, 25, 2048, 2048] | 52.323166 | 40.81776 | 128.19% |
| [32, 25, 2048, 2048] | 101.108405 | 78.72398 | 128.43% |
| [64, 25, 2048, 2048] | 181.117374 | 131.4923 | 137.74% |
| [128, 25, 2048, 2048] | 315.360515 | 223.4272 | 141.15% |
| [16, 25, 4096, 4096] | 12.326611 | 9.575337 | 128.73% |
| [32, 25, 4096, 4096] | 24.255487 | 18.75816 | 129.31% |
| [64, 25, 4096, 4096] | 44.229753 | 34.00344 | 130.07% |
| [128, 25, 4096, 4096] | 78.146907 | 64.36427 | 121.41% |

1-layer GRU, fused vs. unfused

| Input shape [N, T, C, input size] | Fused (SPS) | Unfused (SPS) | Boost |
|---|---|---|---|
| [64, 15, 500, 500] | 3981.266 | 1714.903 | 232.16% |
| [64, 20, 500, 500] | 3686.065 | 1316.712 | 279.94% |
| [64, 25, 500, 500] | 3430.645 | 930.4283 | 368.72% |
| [64, 30, 500, 500] | 3130.724 | 722.1599 | 433.52% |
| [64, 35, 500, 500] | 2982.695 | 692.9842 | 430.41% |
| [64, 40, 500, 500] | 2857.4 | 621.988 | 459.40% |
| [64, 45, 500, 500] | 2598.724 | 533.6256 | 486.99% |
| [64, 50, 500, 500] | 2364.662 | 498.7772 | 474.09% |
| [16, 25, 512, 512] | 1066.644 | 278.212 | 383.39% |
| [32, 25, 512, 512] | 1861.235 | 540.8459 | 344.13% |
| [64, 25, 512, 512] | 3089.303 | 1020.799 | 302.64% |
| [128, 25, 512, 512] | 4679.54 | 1636.657 | 285.92% |
| [16, 25, 1024, 1024] | 317.5073 | 163.0825 | 194.69% |
| [32, 25, 1024, 1024] | 584.9791 | 318.4931 | 183.67% |
| [64, 25, 1024, 1024] | 1051.927 | 552.1558 | 190.51% |
| [128, 25, 1024, 1024] | 1568.747 | 814.037 | 192.71% |
| [16, 25, 2048, 2048] | 64.3481 | 50.81243 | 126.64% |
| [32, 25, 2048, 2048] | 124.1267 | 99.61789 | 124.60% |
| [64, 25, 2048, 2048] | 227.109 | 170.9884 | 132.82% |
| [128, 25, 2048, 2048] | 376.7918 | 279.1985 | 134.95% |
| [16, 25, 4096, 4096] | 14.59219 | 12.47552 | 116.97% |
| [32, 25, 4096, 4096] | 28.75226 | 24.61517 | 116.81% |
| [64, 25, 4096, 4096] | 52.63095 | 44.60013 | 118.01% |
| [128, 25, 4096, 4096] | 95.56435 | 83.10091 | 115.00% |

@juliusshufan (Contributor):

@azai91 @szha That's all the data I have uploaded so far. May I have your comments on anything else that would help support setting USE_MKLDNN as the default? Thanks.

@szha (Member) commented Oct 5, 2018

While the speed-up looks solid, I noticed the following:

  1. A difference in top-1 inference accuracy for squeezenet in #12591 (comment)
  2. Higher variance in training accuracy compared to GPU, and the lack of validation accuracy, in #12591 (comment)
  3. A clear difference in accuracy in #12591 (comment)
  4. A lack of comparison between regular builds and mkl builds, which is what we should establish instead.

I also have the following questions regarding the results:

  1. What does "multi-node" mean in the second diagram in #12591 (comment)?
  2. What would be the results for more common CPUs?

Overall, I think this evaluation doesn't yet cover the most important question for this PR: can we say with confidence that, by switching to USE_MKLDNN by default, our library achieves a speed-up without losing accuracy across different CPUs?

@szha (Member) commented Oct 5, 2018

Note that for larger datasets it's unlikely that people would use it for training, so inference results with pre-trained models would suffice for the purpose of comparing mkl builds with regular builds.

@pengzhao-intel (Contributor):

Thanks for looking into our data; I agree that the inference results are more important.
@juliusshufan will follow up on your questions.

@xinyu-intel (Contributor):

@pengzhao-intel @juliusshufan please also add performance on the iMac Pro based on the build method referred to in #12724.

@lupesko (Contributor) commented Oct 8, 2018

I'd love to see this one merged, and MXNet users benefitting from improved performance on CPU, but I agree with the comments made earlier by @szha that we need a clear comparison of speed and accuracy between non-MKLDNN and MKLDNN builds.

I also suggest we document these benchmarks and results on the MXNet CWiki instead of in this issue; it will be easier to see the full, up-to-date status there. @xinyu-intel if it makes sense to you, can you please document it there?

@pengzhao-intel (Contributor):

@lupesko It's a good idea to document the benchmark results on the website rather than on GitHub.
How about creating a separate page under the docs at https://mxnet.incubator.apache.org/?
I think that is the major interface for MXNet users.

@azai91 (Contributor, author) commented Oct 11, 2018

@juliusshufan can you provide benchmarks comparing MKLDNN vs. non-MKLDNN builds?

@juliusshufan (Contributor) commented Oct 12, 2018

@azai91 Sure. Some CNN perf/benchmark data has already been posted in previous comments on this PR (my fourth comment); do you mean more model coverage? I'll also post the same content to the CWiki page.
Thanks.

@pengzhao-intel (Contributor):

@azai91 I will sync with @juliusshufan locally and will launch the benchmark over the weekend :)

@azai91 (Contributor, author) commented Oct 15, 2018

@pengzhao-intel thanks for the update. Can you list the platforms and build flags in the benchmarks as well? Let me know when you're done; I'm planning on taking a vote Tuesday or Wednesday.

@pengzhao-intel (Contributor):

Latest data updated on this wiki page:
https://cwiki.apache.org/confluence/display/MXNET/MXNet+with+Intel+MKL-DNN+-+Performance+Benchmarking

@azai91 could you rebase the code?

@azai91 (Contributor, author) commented Nov 7, 2018

Results with mobilenet

```
ubuntu@ip-172-31-5-67:~/incubator-mxnet$ MXNET_MKLDNN_ENABLED=1 python example/image-classification/benchmark_score.py
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
[... the same OMP assertion failure repeats several more times ...]
INFO:root:It may take some time to run all models, set --network to run a specific one
INFO:root:run batchsize [1, 32, 64, 128, 256] by default, set --batch-size to run a specific one
INFO:root:network: mobilenet
INFO:root:device: cpu(0)
/home/ubuntu/incubator-mxnet/python/mxnet/module/base_module.py:68: UserWarning: Data provided by label_shapes don't match names specified by label_names ([] vs. ['softmax_label'])
  warnings.warn(msg)
[23:42:08] ../src/operator/nn/mkldnn/mkldnn_base.cc:74: Allocate 8192 bytes with malloc directly
[23:42:08] ../src/operator/nn/mkldnn/mkldnn_base.cc:74: Allocate 32768 bytes with malloc directly
INFO:root:batch size  1, dtype float32, images/sec: 31.501611
INFO:root:batch size 32, dtype float32, images/sec: 194.704657
INFO:root:batch size 64, dtype float32, images/sec: 247.321861
INFO:root:batch size 128, dtype float32, images/sec: 276.045449
INFO:root:batch size 256, dtype float32, images/sec: 257.687046
```

@xinyu-intel (Contributor):

@azai91 which compiler are you using to build mxnet with mkldnn on m5a.24xlarge?

@azai91 (Contributor, author) commented Nov 8, 2018

```
ubuntu@ip-172-31-5-67:~/incubator-mxnet/build$ /usr/bin/c++ --version
c++ (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
```

@lebeg (Contributor) left a comment:

The benchmark effort is really impressive. Could we add some more information on how it was performed, i.e. which scripts were called and which models were downloaded?

I could reuse this information to perform another comparison: testing performance when compiled with different compilers (with different OpenMP libraries).

```diff
@@ -170,6 +171,7 @@ build_armv7() {
     -DCMAKE_BUILD_TYPE=Release \
     -DUSE_MKL_IF_AVAILABLE=OFF \
     -DUSE_LAPACK=OFF \
+    -DUSE_MKLDNN=0FF \
```
A reviewer (Contributor) commented on this diff:

You have changed the default behaviour for make builds, but as far as I know, for cmake it was already ON by default (if available). Why do we want to switch it explicitly OFF?

@vandanavk (Contributor):

@mxnet-label-bot add [MKLDNN]

@roywei (Member) commented Dec 11, 2018

@azai91 Thanks for the contribution, could you trigger CI again?

@apeforest (Contributor):

Do you also need to update osx.mk? Please make sure it works the same on macOS.

@sandeep-krishnamurthy (Contributor):

@azai91 - Thanks a lot for this PR.
What are the next steps here?

@mseth10 (Contributor) commented Jan 4, 2019

@azai91 can we close this PR now?

@pengzhao-intel (Contributor) commented Jan 4, 2019

@mseth10 Yes, I think so.
Next, we will work together to:

  • update the documentation and install page
  • statically link MKLDNN
  • make MKLDNN the default in the nightly build

@lupesko @sandeep-krishnamurthy @mseth10 @azai91 @TaoLv @xinyu-intel @ZhennanQin
What's your opinion?

```diff
@@ -669,7 +662,6 @@ build_ubuntu_gpu_cmake() {
     -DUSE_CUDA=1 \
     -DUSE_CUDNN=1 \
     -DUSE_MKLML_MKL=0 \
-    -DUSE_MKLDNN=0 \
```
A reviewer (Member) commented on this diff:

I don't think this is supposed to be removed.

@azai91 (Contributor, author) commented Jan 4, 2019

Closing this PR as it is a duplicate of #13681.

@azai91 azai91 closed this Jan 4, 2019
Labels: MKLDNN, pr-awaiting-response (PR is reviewed and waiting for contributor to respond)