-
Notifications
You must be signed in to change notification settings - Fork 6.8k
Conversation
gpu_max_val = np.max(np.abs(gpu_out.asnumpy())) | ||
eprint(model_name + ": CPU " + str(max_val) + ", GPU " + str(gpu_max_val)) | ||
assert_almost_equal(out / max_val, gpu_out.asnumpy() / max_val, rtol=1e-3, atol=1e-3) | ||
for i in range(5): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you elaborate this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is one way of reproducing the bug in #10580.
@marcoabreu can you review this PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.
ci/docker/runtime_functions.sh
Outdated
cmake \ | ||
-DUSE_CUDA=1 \ | ||
-DUSE_CUDNN=1 \ | ||
-DUSE_MKLML_MKL=1 \ | ||
-DUSE_MKLDNN=1 \ | ||
-DCMAKE_BUILD_TYPE=Release \ | ||
-DARCH_OPT_FLAGS="-mtune=generic" \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was never a problem before and is counter intuitive to the user. It should be possible to compile on one instance type and execute on another instance type without having to specify options like this one. I'd appreciate it if you could track this down.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this problem is caused by how your CI system is set up. It's compiled on C5, which supports AVX512, and runs on g3, which doesn't support AVX512. MKLDNN by default tries to optimize specifically for an architecture where it is compiled. There is similar setup for Makefile.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, so MKLDNN auto-assumes the runtime architecture based on the runtime-architecture. I think this will cause quite some problems for our users in production environments, considering everybody will build their binaries in a build fleet and deploy it in a production fleet with different configuration and hardware.
Could you please elaborate the flow how users are being made aware of this problem and how we instruct them to solve this problem? Also, where is ARCH_OPT_FLAGS being consumed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just edit CMakeLists.txt to turn on "-mtune=generic" for MKLDNN by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be good if @piiswrong @szha @cjolivier01 could look into the possible impact of that change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the same thing has been used in prepare_mkldnn.sh, which is used by Makefile.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The difference here is that the linked script is only for the compilation of MKLDNN - maybe @pengzhao-intel can elaborate. Here, you're changing the compilation of the entire MXNet library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is what you worry about, I have changed CMakeLists.txt to only compile MKLDNN with the flag.
/~https://github.com/apache/incubator-mxnet/pull/10731/files#diff-af3b638bc2a3e6c650974192a53c7291R162
-G Ninja \ | ||
/work/mxnet | ||
|
||
ninja -v | ||
# libmkldnn.so.0 is a link file. We need an actual binary file named libmkldnn.so.0. | ||
cp 3rdparty/mkldnn/src/libmkldnn.so.0 3rdparty/mkldnn/src/libmkldnn.so.0.tmp | ||
mv 3rdparty/mkldnn/src/libmkldnn.so.0.tmp 3rdparty/mkldnn/src/libmkldnn.so.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a no-op. Please elaborate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't no-op. I think my comment explain this. libmkldnn.so.0 is a link file. We need an actual binary file named libmkldnn.so.0. Jenkins can't pack a link file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, so you're making use of the fact that cp automatically resolves the symlink? I think a cleaner way would be to resolve the symlink explicitely and then override the symlink file with the original file rather than relying on implicit actions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you know a cleaner way, I'm happy to do so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What we need here is to turn a link file to a regular file, instead of finding the path to the regular file. It's a different thing. Jenkins can't wrapper a link file and mxnet wants a library file named libmkldnn.so.0, instead of libmkldnn.so.0.13 (which is what libmkldnn.so.0 points to). If what you find can turn libmkldnn.so.0 (a link file) to a regular file with the same name with a single command, can you please provide the command?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so the code will be something like this.
cp --remove-destination 3rdparty/mkldnn/src/`readlink 3rdparty/mkldnn/src/libmkldnn.so.0` 3rdparty/mkldnn/src/libmkldnn.so.0
do you think this is a preferred way or less confusing way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest, I actually prefer the existing solution over the one-liner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, no strong feelings from my side
for (int i = 0; i < desc2.data.ndims; i++) | ||
required_shape[i] = desc2.data.dims[i]; | ||
NDArray reshaped = MKLDNNDataReshape(required_shape); | ||
const mkldnn::memory *ret = reshaped.GetMKLDNNData(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is ret being destroyed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The memory is managed by MKLDNNStream. You can take a look at the implementation of GetMKLDNNData
@@ -287,10 +286,16 @@ class MKLDNNStream { | |||
return !net.empty(); | |||
} | |||
|
|||
void Submit() { | |||
if (!net.empty()) | |||
void Submit(bool cleanup = true) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please describe this argument
@marcoabreu do you have other comments? |
ci/docker/runtime_functions.sh
Outdated
@@ -313,6 +313,8 @@ build_ubuntu_amalgamation_min() { | |||
build_ubuntu_gpu_cmake_mkldnn() { | |||
set -ex | |||
cd /work/build | |||
# We need to use generic archtecture. Otherwise, MKLDNN compiled in one | |||
# CPU architecture (e.g., C5) can't run on another architecture (e.g., g3). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do these comments apply to here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to remove this. Originally, I set ARCH_OPT_FLAGS here. but I moved it to CMakeLists.txt.
/~https://github.com/apache/incubator-mxnet/pull/10731/files/01b37cecf0a57ea9d943aae0c079305d4de8452d#diff-af3b638bc2a3e6c650974192a53c7291R162
I need to move the comments as well.
@marcoabreu do you have more comments? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I'm only concerned about this optimization part and its impact. It would be good if you could make some kind of benchmark that ensures we're not losing any performance with that method.
intel people saids generic is ok. I'm going to assume they know what they are talking about |
This reverts commit 4ba436b.
* test inference multiple times. * Fix a bug in GetMKLDNNData(). * Update comments. * Handle all cases for GetMKLDNNDataReorder * avoid unnecessary message. * Add C++ unit test for NDArray. * Fix a minor bug. * Unit tests on GetMKLDNNDataReorder. * Fix lint error. * Add more test cases. * add comments for the test code. * Reorganize test code. * Fix cpp tests. * test. * Add a new Jenkins compile task. * Update jenkins. * update jenkins. * Fix a Jenkins. * Fix jenkins. * Fix jenkins. * Fix CMake for MKLDNN. * Fix jenkins. * update jenkins. * update CMake. * Fix cmake. * update CI. * add comment. * add comments. * cmake builds mkldnn with -mtune=generic by default. * adjust comments.
* test inference multiple times. * Fix a bug in GetMKLDNNData(). * Update comments. * Handle all cases for GetMKLDNNDataReorder * avoid unnecessary message. * Add C++ unit test for NDArray. * Fix a minor bug. * Unit tests on GetMKLDNNDataReorder. * Fix lint error. * Add more test cases. * add comments for the test code. * Reorganize test code. * Fix cpp tests. * test. * Add a new Jenkins compile task. * Update jenkins. * update jenkins. * Fix a Jenkins. * Fix jenkins. * Fix jenkins. * Fix CMake for MKLDNN. * Fix jenkins. * update jenkins. * update CMake. * Fix cmake. * update CI. * add comment. * add comments. * cmake builds mkldnn with -mtune=generic by default. * adjust comments.
* test inference multiple times. * Fix a bug in GetMKLDNNData(). * Update comments. * Handle all cases for GetMKLDNNDataReorder * avoid unnecessary message. * Add C++ unit test for NDArray. * Fix a minor bug. * Unit tests on GetMKLDNNDataReorder. * Fix lint error. * Add more test cases. * add comments for the test code. * Reorganize test code. * Fix cpp tests. * test. * Add a new Jenkins compile task. * Update jenkins. * update jenkins. * Fix a Jenkins. * Fix jenkins. * Fix jenkins. * Fix CMake for MKLDNN. * Fix jenkins. * update jenkins. * update CMake. * Fix cmake. * update CI. * add comment. * add comments. * cmake builds mkldnn with -mtune=generic by default. * adjust comments.
* test inference multiple times. * Fix a bug in GetMKLDNNData(). * Update comments. * Handle all cases for GetMKLDNNDataReorder * avoid unnecessary message. * Add C++ unit test for NDArray. * Fix a minor bug. * Unit tests on GetMKLDNNDataReorder. * Fix lint error. * Add more test cases. * add comments for the test code. * Reorganize test code. * Fix cpp tests. * test. * Add a new Jenkins compile task. * Update jenkins. * update jenkins. * Fix a Jenkins. * Fix jenkins. * Fix jenkins. * Fix CMake for MKLDNN. * Fix jenkins. * update jenkins. * update CMake. * Fix cmake. * update CI. * add comment. * add comments. * cmake builds mkldnn with -mtune=generic by default. * adjust comments.
* test inference multiple times. * Fix a bug in GetMKLDNNData(). * Update comments. * Handle all cases for GetMKLDNNDataReorder * avoid unnecessary message. * Add C++ unit test for NDArray. * Fix a minor bug. * Unit tests on GetMKLDNNDataReorder. * Fix lint error. * Add more test cases. * add comments for the test code. * Reorganize test code. * Fix cpp tests. * test. * Add a new Jenkins compile task. * Update jenkins. * update jenkins. * Fix a Jenkins. * Fix jenkins. * Fix jenkins. * Fix CMake for MKLDNN. * Fix jenkins. * update jenkins. * update CMake. * Fix cmake. * update CI. * add comment. * add comments. * cmake builds mkldnn with -mtune=generic by default. * adjust comments.
* test inference multiple times. * Fix a bug in GetMKLDNNData(). * Update comments. * Handle all cases for GetMKLDNNDataReorder * avoid unnecessary message. * Add C++ unit test for NDArray. * Fix a minor bug. * Unit tests on GetMKLDNNDataReorder. * Fix lint error. * Add more test cases. * add comments for the test code. * Reorganize test code. * Fix cpp tests. * test. * Add a new Jenkins compile task. * Update jenkins. * update jenkins. * Fix a Jenkins. * Fix jenkins. * Fix jenkins. * Fix CMake for MKLDNN. * Fix jenkins. * update jenkins. * update CMake. * Fix cmake. * update CI. * add comment. * add comments. * cmake builds mkldnn with -mtune=generic by default. * adjust comments. remove unnecessary tests.
* test inference multiple times. * Fix a bug in GetMKLDNNData(). * Update comments. * Handle all cases for GetMKLDNNDataReorder * avoid unnecessary message. * Add C++ unit test for NDArray. * Fix a minor bug. * Unit tests on GetMKLDNNDataReorder. * Fix lint error. * Add more test cases. * add comments for the test code. * Reorganize test code. * Fix cpp tests. * test. * Add a new Jenkins compile task. * Update jenkins. * update jenkins. * Fix a Jenkins. * Fix jenkins. * Fix jenkins. * Fix CMake for MKLDNN. * Fix jenkins. * update jenkins. * update CMake. * Fix cmake. * update CI. * add comment. * add comments. * cmake builds mkldnn with -mtune=generic by default. * adjust comments. remove unnecessary tests.
* test inference multiple times. * Fix a bug in GetMKLDNNData(). * Update comments. * Handle all cases for GetMKLDNNDataReorder * avoid unnecessary message. * Add C++ unit test for NDArray. * Fix a minor bug. * Unit tests on GetMKLDNNDataReorder. * Fix lint error. * Add more test cases. * add comments for the test code. * Reorganize test code. * Fix cpp tests. * test. * Add a new Jenkins compile task. * Update jenkins. * update jenkins. * Fix a Jenkins. * Fix jenkins. * Fix jenkins. * Fix CMake for MKLDNN. * Fix jenkins. * update jenkins. * update CMake. * Fix cmake. * update CI. * add comment. * add comments. * cmake builds mkldnn with -mtune=generic by default. * adjust comments.
Description
This is to fix a bug in #10580
The bug happens when an NDArray already has a MKLDNN memory with a special MKLDNN format. In some cases (group convolution), the MKLDNN memory uses 5 dimensions, while the NDArray still uses 4 dimensions. In this case, when GetMKLDNNData is called, SetMKLMem will discard the original MKLDNN memory due to incompatible dimensions between MKLDNN memory and NDArray. This messes up the data in the NDArray.
The test code in the PR can actually reproduce the bug in #10580
This PR also tries to add C++ unit tests to cover all possible combinations with getting MKLDNN memory from an NDArray.
Thank @TaoLv @pengzhao-intel for finding the root cause of this bug.
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Changes
Comments