This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

USE_MKLDNN=1 is default in make build (mkldnn must be explicitly turned off) #12591

Closed
wants to merge 19 commits

Conversation

@azai91 (Contributor) commented Sep 18, 2018

Description

We are migrating to include MKLDNN in the default MXNet build: USE_MKLDNN will be set to 1 by default (and must therefore be explicitly turned off on platforms where it is unsupported).

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, a README.md is added to explain what the example does, the source of the dataset, the expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • set USE_MKLDNN to 1 unless explicitly set to 0
  • set USE_MKLDNN=0 for non-MKLDNN builds in Jenkins

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@azai91 requested a review from @szha as a code owner, Sep 18, 2018 18:56
@azai91 changed the title from "update docs to start requiring cmake for building mxnet from source" to "USE_MKLDNN=1 is default in make build (mkldnn must be explicitly turned off)", Sep 18, 2018
@szha (Member) left a comment:

The pip package distribution explicitly sets mkldnn=0, so it won't be a problem there. On the other hand, I'd like to see enough evidence that this is the right thing to do first:

  1. the performance difference between a regular build and an MKLDNN build on various hardware.
  2. numeric stability and model-convergence evaluation.
  3. portability.

@azai91 (Contributor, author) commented Sep 18, 2018

Will look into portability.

@mseth10 can you publish accuracy and performance numbers with a small dataset like CIFAR?

@szha (Member) commented Sep 18, 2018

@azai91 @mseth10 if possible please get results on more use cases. CIFAR for an image classification model alone won't be representative enough.

@stu1130 (Contributor) commented Sep 18, 2018

@mxnet-label-bot[pr-awaiting-response]

@marcoabreu marcoabreu added the pr-awaiting-response PR is reviewed and waiting for contributor to respond label Sep 18, 2018
@pengzhao-intel (Contributor):

I think @juliusshufan can help provide more accuracy and performance data.

@juliusshufan (Contributor) commented Sep 20, 2018

Updating the ImageNet-1k inference accuracy, based on the Gluon model zoo (pre-trained models); the comparison target is an NVIDIA GPU.
(Models include: AlexNet, VGG16, ResNet50-v1/v2, Inception-v3, DenseNet, SqueezeNet, MobileNet-v1.0)

On Python2

| topology | CPU top-1 | CPU top-5 | GPU top-1 | GPU top-5 |
|---|---|---|---|---|
| alexnet | 0.556455 | 0.785575 | 0.556455 | 0.785523 |
| resnet50_v1 | 0.753367 | 0.926907 | 0.753367 | 0.926907 |
| resnet50_v2 | 0.761327 | 0.929354 | 0.761327 | 0.929354 |
| vgg16 | 0.720138 | 0.90662 | 0.720138 | 0.90662 |
| densenet121 | 0.736587 | 0.917328 | 0.736587 | 0.917328 |
| squeezenet1.1 | 0.561469 | 0.792099 | 0.561481 | 0.792099 |
| mobilenet1.0 | 0.693531 | 0.889003 | 0.693531 | 0.889003 |
| inceptionv3 | 0.762979 | 0.928074 | 0.762979 | 0.92814 |

On Python3

| topology | CPU top-1 | CPU top-5 | GPU top-1 | GPU top-5 |
|---|---|---|---|---|
| alexnet | 0.556455 | 0.785575 | 0.556455 | 0.785523 |
| resnet50_v1 | 0.753367 | 0.926907 | 0.753367 | 0.926907 |
| resnet50_v2 | 0.761327 | 0.929354 | 0.761327 | 0.929354 |
| vgg16 | 0.720138 | 0.90662 | 0.720138 | 0.90662 |
| densenet121 | 0.736587 | 0.917328 | 0.736587 | 0.917328 |
| squeezenet1.1 | 0.561469 | 0.792099 | 0.561481 | 0.792099 |
| mobilenet1.0 | 0.693531 | 0.889003 | 0.693531 | 0.889003 |
| inceptionv3 | 0.762979 | 0.928074 | 0.762979 | 0.92814 |
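For context, a minimal sketch of how such a Gluon model-zoo accuracy check can be run (this is not the exact script used for the tables above; `val_data` is an assumed ImageNet-1k validation iterator with standard preprocessing):

```python
# Hypothetical sketch: `val_data` is assumed to be an ImageNet-1k validation
# iterator (yielding mx.io.DataBatch) with standard preprocessing; it is not
# defined here.
import mxnet as mx
from mxnet.gluon.model_zoo import vision

ctx = mx.cpu()  # or mx.gpu(0) for the GPU column
net = vision.get_model('resnet50_v1', pretrained=True, ctx=ctx)

top1 = mx.metric.Accuracy()
top5 = mx.metric.TopKAccuracy(top_k=5)
for batch in val_data:
    out = net(batch.data[0].as_in_context(ctx))
    top1.update([batch.label[0]], [out])
    top5.update([batch.label[0]], [out])
print(top1.get(), top5.get())
```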

@juliusshufan (Contributor):

For ImageNet-1k training, we use the script under example/image-classification; training was executed on both GPU and CPU with the same hyper-parameters and no further tuning, and the training curves on GPU and CPU align closely. The following figures correspond to ResNet50 v1 and v2.
[figure: ResNet50-v1 training curves, CPU vs. GPU]
[figure: ResNet50-v2 training curves, CPU vs. GPU]

@juliusshufan (Contributor) commented Sep 24, 2018

Apart from the ImageNet-1k training test, training tests were also executed on small datasets, including:

| Dataset | Training set | Validation set | Classes | Source |
|---|---|---|---|---|
| CIFAR-10 | 50,000 | 10,000 | 10 | Released by MXNET official: http://data.mxnet.io/data/cifar10/ |
| CIFAR-100 | 50,000 | 10,000 | 100 | Released by MXNET official: http://data.mxnet.io/data/cifar100.zip |
| sampled ImageNet | 100,200 | 10,000 | 200 | Sampled from ImageNet-1k, following the structure and classes of tinyImageNet (https://www.kaggle.com/c/tiny-imagenet/) |

Since SOTA accuracy numbers are not available for these small datasets, the comparison between MXNET-MKLDNN and MXNET-GPU on convergence trends and inference accuracy is used "indirectly" as a correctness check of MXNET with the MKLDNN backend.
The tables below list the validation accuracy on CIFAR-10, CIFAR-100, and the sampled ImageNet, with comparisons against GPU, for ResNet-50, VGG16, and Inception-v3.

On ResNet-50:

| Device | HW platform | Dataset | Validation accuracy |
|---|---|---|---|
| CPU | SKX-8180 | sampled ImageNet | top-1 0.629879, top-5 0.842132 |
| GPU | GTX-1080T | sampled ImageNet | top-1 0.630609, top-5 0.840345 |
| CPU | SKX-8180 | CIFAR-10 | top-1 0.917067, top-5 0.997796 |
| GPU | GTX-1080T | CIFAR-10 | top-1 0.921474, top-5 0.998397 |
| CPU | SKX-8180 | CIFAR-100 | top-1 0.734475, top-5 0.915865 |
| GPU | GTX-1080T | CIFAR-100 | top-1 0.723257, top-5 0.913161 |

On Inception-v3 (Inception-v3 only accepts input size 299, so CIFAR is not applicable):

| Device | HW platform | Dataset | Validation accuracy |
|---|---|---|---|
| CPU | SKX-8180 | sampled ImageNet | top-1 0.684964, top-5 0.866470 |
| GPU | GTX-1080T | sampled ImageNet | top-1 0.684095, top-5 0.868890 |

On VGG-16:

| Device | HW platform | Dataset | Validation accuracy |
|---|---|---|---|
| CPU | SKX-8180 | sampled ImageNet | top-1 0.528029, top-5 0.759809 |
| GPU | GTX-1080T | sampled ImageNet | top-1 0.526834, top-5 0.761318 |
| CPU | SKX-8180 | CIFAR-10 | top-1 0.884615, top-5 0.994391 |
| GPU | GTX-1080T | CIFAR-10 | top-1 0.888622, top-5 0.995092 |
| CPU | SKX-8180 | CIFAR-100 | top-1 0.634415, top-5 0.855569 |
| GPU | GTX-1080T | CIFAR-100 | top-1 0.634916, top-5 0.855669 |

The two figures below show the top-5 validation accuracy trends collected on CPU and GPU, respectively.
On CPU:
[figure: top-5 validation accuracy trends, CPU]
On GPU:
[figure: top-5 validation accuracy trends, GPU]

@juliusshufan (Contributor) commented Sep 24, 2018

Benchmark data
The benchmark data was collected on Linux and Mac, comparing builds with and without MKLDNN. Because computation on a build without MKLDNN is too slow, only performance data for selected CNN models is listed. The benchmarking script is based on example/image-classification/benchmark_score.py, which reports throughput in images/sec.

On CentOS 7.4, pip is used for the MXNet installation, i.e. `pip install mxnet==1.3.0` vs. `pip install mxnet-mkl==1.3.0`.
(Benchmarking is executed on a 1-socket Xeon SKX-8180, 28 cores, 192 GB DDR4-2666 memory.)
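A condensed, hypothetical sketch of the kind of dummy-data throughput measurement benchmark_score.py performs; the real script uses the symbolic Module API, and the Gluon-based setup, warmup, and iteration counts here are illustrative assumptions:

```python
# Illustrative sketch only; not the benchmark_score.py code itself.
import time
import mxnet as mx
from mxnet.gluon.model_zoo import vision

def images_per_sec(name, batch_size, iters=50, ctx=mx.cpu()):
    net = vision.get_model(name)       # random weights are fine for a speed test
    net.initialize(ctx=ctx)
    net.hybridize()
    data = mx.nd.random.uniform(shape=(batch_size, 3, 224, 224), ctx=ctx)
    net(data).wait_to_read()           # warmup: build the graph, allocate memory
    start = time.time()
    for _ in range(iters):
        net(data)
    mx.nd.waitall()                    # block until the async engine drains
    return batch_size * iters / (time.time() - start)

for bs in (1, 32, 64, 128, 256):
    print('vgg16, batch %3d: %.1f images/sec' % (bs, images_per_sec('vgg16', bs)))
```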

VGG16

| Batch size | With MKLDNN (img/s) | Without MKLDNN (img/s) | Speed-up |
|---|---|---|---|
| 1 | 63.972961 | 2.776588 | 2304.01% |
| 16 | 90.132777 | 3.27203 | 2754.64% |
| 32 | 90.533301 | 3.271969 | 2766.94% |
| 64 | 90.547993 | 3.332716 | 2716.94% |
| 128 | 90.130061 | 3.303833 | 2728.05% |
| 256 | 89.474756 | 3.333387 | 2684.20% |

Inception-v3

| Batch size | With MKLDNN (img/s) | Without MKLDNN (img/s) | Speed-up |
|---|---|---|---|
| 1 | 58.965411 | 6.244512 | 944.28% |
| 16 | 168.280915 | 6.566202 | 2562.83% |
| 32 | 167.823787 | 6.421525 | 2613.46% |
| 64 | 168.746333 | 6.585618 | 2562.35% |
| 128 | 166.841938 | 6.453535 | 2585.28% |
| 256 | 162.761511 | 6.484705 | 2509.93% |

Inception-v4

| Batch size | With MKLDNN (img/s) | Without MKLDNN (img/s) | Speed-up |
|---|---|---|---|
| 1 | 32.362458 | 3.310546 | 977.56% |
| 16 | 84.847819 | 3.393066 | 2500.62% |
| 32 | 85.549374 | 3.379569 | 2531.37% |
| 64 | 86.123905 | 3.335553 | 2582.00% |
| 128 | 85.134901 | 3.334666 | 2553.03% |
| 256 | 83.655486 | 3.330463 | 2511.83% |

ResNet-50

| Batch size | With MKLDNN (img/s) | Without MKLDNN (img/s) | Speed-up |
|---|---|---|---|
| 1 | 83.434864 | 11.020557 | 757.08% |
| 16 | 194.102224 | 11.092527 | 1749.85% |
| 32 | 197.600266 | 10.904773 | 1812.05% |
| 64 | 199.251137 | 10.746266 | 1854.14% |
| 128 | 198.108861 | 10.732905 | 1845.81% |
| 256 | 196.444539 | 10.638787 | 1846.49% |

MobileNet

| Batch size | With MKLDNN (img/s) | Without MKLDNN (img/s) | Speed-up |
|---|---|---|---|
| 1 | 263.504341 | 27.284977 | 965.75% |
| 16 | 607.443174 | 27.705262 | 2192.52% |
| 32 | 614.830145 | 26.904616 | 2285.22% |
| 64 | 644.903928 | 26.844882 | 2402.33% |
| 128 | 621.659484 | 26.381861 | 2356.39% |
| 256 | 605.399741 | 26.354961 | 2297.10% |

On macOS, the default compilation configuration disables OpenMP. The tables below list the performance data for a build with MKLDNN (OpenMP enabled) and a build without MKLDNN.
(Hardware is an iMac Pro with one 8-core Xeon-W socket and 32 GB DDR4 memory.)

VGG16

| Batch size | With MKLDNN (img/s) | Without MKLDNN (img/s) | Speed-up |
|---|---|---|---|
| 1 | 20.913986 | 7.821254 | 267.40% |
| 16 | 24.273071 | 8.438211 | 287.66% |
| 32 | 24.704907 | 8.480799 | 291.30% |
| 64 | 24.94608 | 8.524874 | 292.63% |
| 128 | 25.074148 | 8.53283 | 293.86% |
| 256 | 25.2629 | 8.535707 | 295.97% |

Inception-v3

| Batch size | With MKLDNN (img/s) | Without MKLDNN (img/s) | Speed-up |
|---|---|---|---|
| 1 | 41.431404 | 10.323434 | 401.33% |
| 16 | 54.312317 | 10.665803 | 509.22% |
| 32 | 54.604119 | 10.621378 | 514.10% |
| 64 | 54.39568 | 10.605843 | 512.88% |
| 128 | 54.410785 | 10.62466 | 512.12% |
| 256 | 54.614424 | 10.616772 | 514.42% |

Inception-v4

| Batch size | With MKLDNN (img/s) | Without MKLDNN (img/s) | Speed-up |
|---|---|---|---|
| 1 | 20.715221 | 5.655873 | 366.26% |
| 16 | 26.249734 | 5.779357 | 454.20% |
| 32 | 26.197659 | 5.761883 | 454.67% |
| 64 | 26.16153 | 5.771389 | 453.30% |
| 128 | 26.247461 | 5.778834 | 454.20% |
| 256 | 26.313875 | 5.77839 | 455.38% |

ResNet-50

| Batch size | With MKLDNN (img/s) | Without MKLDNN (img/s) | Speed-up |
|---|---|---|---|
| 1 | 41.70109 | 19.246681 | 216.67% |
| 16 | 43.132788 | 20.854712 | 206.83% |
| 32 | 41.613291 | 20.570733 | 202.29% |
| 64 | 38.13329 | 20.652445 | 184.64% |
| 128 | 38.839577 | 20.685878 | 187.76% |
| 256 | 38.853521 | 20.68953 | 187.79% |

MobileNet

| Batch size | With MKLDNN (img/s) | Without MKLDNN (img/s) | Speed-up |
|---|---|---|---|
| 1 | 200.91608 | 36.047475 | 557.37% |
| 16 | 287.614019 | 37.224849 | 772.64% |
| 32 | 277.838051 | 36.914548 | 752.65% |
| 64 | 274.474078 | 36.939298 | 743.04% |
| 128 | 273.622323 | 37.04172 | 738.69% |
| 256 | 273.445636 | 36.947783 | 740.09% |

@azai91 (Contributor, author) commented Sep 24, 2018

Just verifying. The above table is for Mac?

@azai91 (Contributor, author) commented Sep 24, 2018

Can we try metrics on macOS with the AVX2 ISA? We are seeing a performance drop when enabling MKLDNN.

@juliusshufan (Contributor):

> Just verifying. The above table is for Mac?

@azai91

  1. Yes. The benchmarking data you were reviewing was collected on an iMac Pro; today I also updated the data collected on CentOS 7.4.
  2. I'll collect the Mac perf data on an AVX2 processor later on. As for the "performance drop enabling MKLDNN" you mention, I suspect it is caused by OpenMP NOT being enabled by default; @xinyu-intel is working on this and can provide more input.

@xinyu-intel (Contributor):

Hi @azai91, you can try the build method below on Mac:

```sh
brew install llvm

# .bash_profile
export LIBRARY_PATH=/usr/local/Cellar/llvm/6.0.1/lib/

# config.mk
CC=/usr/local/Cellar/llvm/6.0.1/bin/clang CXX=/usr/local/Cellar/llvm/6.0.1/bin/clang++

# mkldnn.mk, L40, before the cmake call
CC=/usr/local/Cellar/llvm/6.0.1/bin/clang CXX=/usr/local/Cellar/llvm/6.0.1/bin/clang++

# Makefile: pass -fopenmp even on Darwin
ifeq ($(USE_OPENMP), 1)
#   ifneq ($(UNAME_S), Darwin)
        CFLAGS += -fopenmp
#   endif
endif
```

@juliusshufan (Contributor) commented Sep 27, 2018

RNN-related data, including both accuracy and performance/benchmarking.
Accuracy

  1. A GNMT model implemented in gluon-nlp (scripts/nmt/train_gnmt.py), on the IWSLT2015 dataset, en-vi translation. The encoder-decoder is a 2-layer LSTM. Per the model implementation, gluon RNN cells (unfused kernels) are used, so the MKLDNN FC path is covered. The figure below shows the perplexity trends collected on both GPU and CPU with the same hyper-parameters; the two curves align very well.
     [figure: GNMT training perplexity, CPU vs. GPU]
  2. A simple RNN model, provided by the official MXNET repo (/example/rnn/bucketing) and implemented with the RNN symbol API. Training tests use a 3-layer LSTM and GRU model with the fused RNN kernel on CPU and GPU and compare the training curves; see the figures below for the training perplexity trends.
     [figure: LSTM bucketing training perplexity, CPU vs. GPU]
     [figure: GRU bucketing training perplexity, CPU vs. GPU]

Benchmarking
Thanks to the new Gluon RNN API features released in MXNET 1.3.0, dummy-data benchmarking was executed with the fused and unfused Gluon RNN APIs respectively, using MXNET with MKLDNN as the backend.
The benchmarking uses a series of predefined input shapes, on a 1-socket SKX-8180 CPU, 28 cores and 192 GB DDR4 memory. (The input size is the embedding size, which equals the hidden size by default.)
The metric is sentences per second (SPS); a sketch of such a micro-benchmark follows.
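A rough sketch of a fused-vs-unfused micro-benchmark; the shapes, warmup, and iteration counts are illustrative assumptions, not the exact harness used to produce the tables below:

```python
# Illustrative micro-benchmark only; parameters are assumptions.
import time
import mxnet as mx
from mxnet import gluon

N, T, C = 64, 25, 512            # batch size, sequence length, input/hidden size
x = mx.nd.random.uniform(shape=(N, T, C))

fused = gluon.rnn.LSTM(C, num_layers=1, layout='NTC')   # fused RNN kernel
fused.initialize()
unfused = gluon.rnn.LSTMCell(C)                         # unrolled cell
unfused.initialize()

def sentences_per_sec(fn, warmup=5, iters=20):
    for _ in range(warmup):
        fn()
    mx.nd.waitall()
    start = time.time()
    for _ in range(iters):
        fn()
    mx.nd.waitall()              # block until the async engine drains
    return N * iters / (time.time() - start)

print('fused SPS:  ', sentences_per_sec(lambda: fused(x)))
print('unfused SPS:', sentences_per_sec(
    lambda: unfused.unroll(T, x, layout='NTC', merge_outputs=True)))
```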

1-layer LSTM, fused vs. unfused

| Input shape [N, T, C, input size] | Fused (SPS) | Unfused (SPS) | Boost |
|---|---|---|---|
| [64, 15, 500, 500] | 2917.237852 | 1667.527 | 174.94% |
| [64, 20, 500, 500] | 3661.45311 | 1196.497 | 306.01% |
| [64, 25, 500, 500] | 3288.546223 | 855.2861 | 384.50% |
| [64, 30, 500, 500] | 2913.375177 | 660.5786 | 441.03% |
| [64, 35, 500, 500] | 2581.44028 | 519.6848 | 496.73% |
| [64, 40, 500, 500] | 2479.42023 | 714.7851 | 346.88% |
| [64, 45, 500, 500] | 2300.442591 | 625.1124 | 368.00% |
| [64, 50, 500, 500] | 2160.407494 | 549.2164 | 393.36% |
| [16, 25, 512, 512] | 1067.593284 | 332.028 | 321.54% |
| [32, 25, 512, 512] | 1830.461068 | 649.8168 | 281.69% |
| [64, 25, 512, 512] | 2827.429465 | 1187.243 | 238.15% |
| [128, 25, 512, 512] | 3938.397784 | 1547.932 | 254.43% |
| [16, 25, 1024, 1024] | 231.900727 | 154.7335 | 149.87% |
| [32, 25, 1024, 1024] | 429.570455 | 298.2182 | 144.05% |
| [64, 25, 1024, 1024] | 744.384772 | 480.4162 | 154.95% |
| [128, 25, 1024, 1024] | 1204.706856 | 696.3014 | 173.02% |
| [16, 25, 2048, 2048] | 52.323166 | 40.81776 | 128.19% |
| [32, 25, 2048, 2048] | 101.108405 | 78.72398 | 128.43% |
| [64, 25, 2048, 2048] | 181.117374 | 131.4923 | 137.74% |
| [128, 25, 2048, 2048] | 315.360515 | 223.4272 | 141.15% |
| [16, 25, 4096, 4096] | 12.326611 | 9.575337 | 128.73% |
| [32, 25, 4096, 4096] | 24.255487 | 18.75816 | 129.31% |
| [64, 25, 4096, 4096] | 44.229753 | 34.00344 | 130.07% |
| [128, 25, 4096, 4096] | 78.146907 | 64.36427 | 121.41% |

1-layer GRU, fused vs. unfused

| Input shape [N, T, C, input size] | Fused (SPS) | Unfused (SPS) | Boost |
|---|---|---|---|
| [64, 15, 500, 500] | 3981.266 | 1714.903 | 232.16% |
| [64, 20, 500, 500] | 3686.065 | 1316.712 | 279.94% |
| [64, 25, 500, 500] | 3430.645 | 930.4283 | 368.72% |
| [64, 30, 500, 500] | 3130.724 | 722.1599 | 433.52% |
| [64, 35, 500, 500] | 2982.695 | 692.9842 | 430.41% |
| [64, 40, 500, 500] | 2857.4 | 621.988 | 459.40% |
| [64, 45, 500, 500] | 2598.724 | 533.6256 | 486.99% |
| [64, 50, 500, 500] | 2364.662 | 498.7772 | 474.09% |
| [16, 25, 512, 512] | 1066.644 | 278.212 | 383.39% |
| [32, 25, 512, 512] | 1861.235 | 540.8459 | 344.13% |
| [64, 25, 512, 512] | 3089.303 | 1020.799 | 302.64% |
| [128, 25, 512, 512] | 4679.54 | 1636.657 | 285.92% |
| [16, 25, 1024, 1024] | 317.5073 | 163.0825 | 194.69% |
| [32, 25, 1024, 1024] | 584.9791 | 318.4931 | 183.67% |
| [64, 25, 1024, 1024] | 1051.927 | 552.1558 | 190.51% |
| [128, 25, 1024, 1024] | 1568.747 | 814.037 | 192.71% |
| [16, 25, 2048, 2048] | 64.3481 | 50.81243 | 126.64% |
| [32, 25, 2048, 2048] | 124.1267 | 99.61789 | 124.60% |
| [64, 25, 2048, 2048] | 227.109 | 170.9884 | 132.82% |
| [128, 25, 2048, 2048] | 376.7918 | 279.1985 | 134.95% |
| [16, 25, 4096, 4096] | 14.59219 | 12.47552 | 116.97% |
| [32, 25, 4096, 4096] | 28.75226 | 24.61517 | 116.81% |
| [64, 25, 4096, 4096] | 52.63095 | 44.60013 | 118.01% |
| [128, 25, 4096, 4096] | 95.56435 | 83.10091 | 115.00% |

@juliusshufan (Contributor):

@azai91 @szha That's all the data I have uploaded so far. May I have your comments on anything else that would help support setting USE_MKLDNN as the default? Thanks.

@szha (Member) commented Oct 5, 2018

While the speed-up looks solid, I noticed the following:

  1. A difference in top-1 inference accuracy for squeezenet in #12591 (comment)
  2. Higher variance in training accuracy compared to GPU, and the lack of validation accuracy, in #12591 (comment)
  3. A clear difference in accuracy in #12591 (comment)
  4. A lack of comparison between regular builds and mkl builds, which is what we should establish instead.

I also have the following questions regarding the results:

  1. What does "multi-node" mean in the second diagram in #12591 (comment)?
  2. What would be the results for more common CPUs?

Overall, I think this evaluation doesn't yet cover the most important question for this PR: can we say with confidence that, by switching to USE_MKLDNN by default, our library achieves a speed-up without losing accuracy across different CPUs?

@szha (Member) commented Oct 5, 2018

Note that for larger datasets it's unlikely that people would use it for training, so inference results with pre-trained models would suffice for the purpose of comparing mkl builds with regular builds.

@pengzhao-intel (Contributor):

Thanks for looking into our data; I agree that the inference results are more important.
@juliusshufan will follow up on your questions.

@xinyu-intel (Contributor):

@pengzhao-intel @juliusshufan please also add performance on the iMac Pro based on the build method referred to in #12724.

@lupesko (Contributor) commented Oct 8, 2018

I'd love to see this one merged, and MXNet users benefitting from improved performance on CPU, but I agree with the comments made earlier by @szha that we need a clear comparison of speed and accuracy between non-MKLDNN and MKLDNN builds.

I also suggest we document these benchmarks and results on the MXNet CWiki instead of in this issue; it will be easier to see the full, up-to-date status there. @xinyu-intel if it makes sense to you, can you please document it there?

@pengzhao-intel (Contributor):

@lupesko It's a good idea to document the benchmark results on the website rather than on GitHub.
How about creating a separate page under the docs at https://mxnet.incubator.apache.org/?
I think that is the major interface for MXNet users.

@azai91 (Contributor, author) commented Oct 11, 2018

@juliusshufan can you provide benchmarks comparing MKLDNN vs. non-MKLDNN builds?

@juliusshufan (Contributor) commented Oct 12, 2018

@azai91 Sure. Some CNN perf/benchmark data has already been posted in previous comments on this PR (my fourth comment); do you mean more model coverage? I'll also post the same content to the CWiki page.
Thanks.

@pengzhao-intel (Contributor):

@azai91 I will sync with @juliusshufan locally and will launch the benchmark over the weekend :)

@azai91 (Contributor, author) commented Oct 15, 2018

@pengzhao-intel thanks for the update. Can you list the platforms and build flags in the benchmarks as well? Let me know when you're done; I'm planning on taking a vote Tuesday or Wednesday.

@pengzhao-intel (Contributor):

Latest data updated on this wiki page:
https://cwiki.apache.org/confluence/display/MXNET/MXNet+with+Intel+MKL-DNN+-+Performance+Benchmarking

@azai91 could you rebase the code?

@azai91 (Contributor, author) commented Nov 7, 2018

Results with mobilenet

```
ubuntu@ip-172-31-5-67:~/incubator-mxnet$ MXNET_MKLDNN_ENABLED=1 python example/image-classification/benchmark_score.py
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
[... the same OMP assertion failure repeats several more times ...]
INFO:root:It may take some time to run all models, set --network to run a specific one
INFO:root:run batchsize [1, 32, 64, 128, 256] by default, set --batch-size to run a specific one
INFO:root:network: mobilenet
INFO:root:device: cpu(0)
/home/ubuntu/incubator-mxnet/python/mxnet/module/base_module.py:68: UserWarning: Data provided by label_shapes don't match names specified by label_names ([] vs. ['softmax_label'])
  warnings.warn(msg)
[23:42:08] ../src/operator/nn/mkldnn/mkldnn_base.cc:74: Allocate 8192 bytes with malloc directly
[23:42:08] ../src/operator/nn/mkldnn/mkldnn_base.cc:74: Allocate 32768 bytes with malloc directly
INFO:root:batch size  1, dtype float32, images/sec: 31.501611
INFO:root:batch size 32, dtype float32, images/sec: 194.704657
INFO:root:batch size 64, dtype float32, images/sec: 247.321861
INFO:root:batch size 128, dtype float32, images/sec: 276.045449
INFO:root:batch size 256, dtype float32, images/sec: 257.687046
```

@xinyu-intel (Contributor):

@azai91 which compiler are you using to build mxnet with mkldnn on m5a.24xlarge?

@azai91 (Contributor, author) commented Nov 8, 2018

```
ubuntu@ip-172-31-5-67:~/incubator-mxnet/build$ /usr/bin/c++ --version
c++ (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
```

@lebeg (Contributor) left a comment:

The benchmark effort is really impressive. Could we add some more information on how it was performed, i.e. which scripts were called and which models were downloaded?

I could reuse this information to perform another comparison: testing performance when compiled with different compilers (with different OpenMP libraries).

```diff
@@ -170,6 +171,7 @@ build_armv7() {
     -DCMAKE_BUILD_TYPE=Release \
     -DUSE_MKL_IF_AVAILABLE=OFF \
     -DUSE_LAPACK=OFF \
+    -DUSE_MKLDNN=0FF \
```
A reviewer (Contributor) commented on this diff:

You have changed the default behaviour for make builds, but as far as I know, for cmake it was already ON by default (if available). Why do we want to switch it explicitly OFF?

@vandanavk (Contributor):

@mxnet-label-bot add [MKLDNN]

@roywei (Member) commented Dec 11, 2018

@azai91 Thanks for the contribution, could you trigger CI again?

@apeforest (Contributor):

Do you also need to update osx.mk? Please make sure it works the same on macOS.

@sandeep-krishnamurthy (Contributor):

@azai91 - Thanks a lot for this PR.
What are the next steps here?

@mseth10 (Contributor) commented Jan 4, 2019

@azai91 can we close this PR now?

@pengzhao-intel (Contributor) commented Jan 4, 2019

@mseth10 Yes, I think so.
Next, we will work together to:

  • update the documentation and install page
  • statically link MKLDNN
  • make MKLDNN the default in the nightly build

@lupesko @sandeep-krishnamurthy @mseth10 @azai91 @TaoLv @xinyu-intel @ZhennanQin
What's your opinion?

```diff
@@ -669,7 +662,6 @@ build_ubuntu_gpu_cmake() {
     -DUSE_CUDA=1 \
     -DUSE_CUDNN=1 \
     -DUSE_MKLML_MKL=0 \
-    -DUSE_MKLDNN=0 \
```
A reviewer (Member) commented on this diff:

I don't think this is supposed to be removed.

@azai91 (Contributor, author) commented Jan 4, 2019

Closing this PR as it is a duplicate of #13681.

@azai91 azai91 closed this Jan 4, 2019
Labels: MKLDNN, pr-awaiting-response (PR is reviewed and waiting for contributor to respond)