Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Revert "Improve FC perf when no_bias=False" #15099

Merged
merged 1 commit into from
May 30, 2019

Conversation

roywei
Copy link
Member

@roywei roywei commented May 29, 2019

Reverts #15033
@stu1130 and I are running a binary search on the commits that may cause #15084

This seems to be the one. The reason causing CI build failure is still unknown, as this only happens with CI enviroments, builidng locally with GPU + MKLDNN is fine on master.

I m creating this PR to test if CI check can pass

@szha szha requested a review from eric-haibin-lin May 29, 2019 23:27
@szha
Copy link
Member

szha commented May 29, 2019

cc @anirudh2290

@anirudh2290
Copy link
Member

Really weird! CI check passed even on my PR. So if CI check doesn't fail, does that tell much ?

@szha
Copy link
Member

szha commented May 30, 2019

We certainly need to find the root cause on why CI wasn't reliable in this case. (e.g. could it have been just a cache issue?)

@larroy
Copy link
Contributor

larroy commented May 30, 2019

The root cause seems to be not enough shared memory in the docker container used in CI.

Using time ci/build.py --docker-registry mxnetci --platform ubuntu_build_cuda --docker-build-retries 3 --shm-size 8G /work/runtime_functions.sh build_ubuntu_gpu_mkldnn

fixes the build.

We have some files which are very slow to compile and produce big binaries, I think is excess of inline.

@larroy
Copy link
Contributor

larroy commented May 30, 2019

#15100

@szha
Copy link
Member

szha commented May 30, 2019

Hmm, if this was required, how did the original PR pass CI check?

@anirudh2290
Copy link
Member

I was able to reproduce this with make with on my local machine with command:

make DEV=1 ENABLE_TESTCOVERAGE=1 USE_CPP_PACKAGE=1 USE_BLAS=openblas USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_CUDNN=1 'CUDA_ARCH=-gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_70,code=sm_70' USE_SIGNAL_HANDLER=1 -j$(nproc)

Note: I was able to reproduce only with a clean build after make clean. for example, after building on top of previous commit i wasn't able to reproduce this.

@roywei
Copy link
Member Author

roywei commented May 30, 2019

@stu1130 has tried with ~2G and failed. It also seems 8G in #15100 failed.
I just tried ~10G passed

ci/build.py --docker-registry mxnetci --platform ubuntu_build_cuda --docker-build-retries 3 --shm-size 10000m /work/runtime_functions.sh build_ubuntu_gpu_mkldnn

We have tried with 500m before #15033, everything works fine. Everything after this commit failed with 500m.
For local tests, also able to reproduce now with @anirudh2290 's command.
So this can't be only shared memory related.

  1. why Improve FC perf when no_bias=False #15033 code change caused CI requiring more space? (it changed requiring additional temp space to not limit to MKLDNN only, but should not affect build?)
  2. why CI check passed for initial PR?
  3. how to fix local build?

I would prefer to merge this and keep #15084 open until we find root cause.

@anirudh2290
Copy link
Member

@roywei sounds good to me.

@anirudh2290 anirudh2290 merged commit 6543488 into apache:master May 30, 2019
@aaronmarkham
Copy link
Contributor

Is there are next step for those of use with PRs that failed CI due to this issue?

@anirudh2290
Copy link
Member

@aaronmarkham Rebase and retrigger CI and it should be fine.

@larroy
Copy link
Contributor

larroy commented May 30, 2019

We need to investigate what's going on, I thought it passed for me at 8G but then my PR validation failed. Could be a linker bug or an offset linking bigger than 2^32, it's still not 100% clear to me.

@larroy
Copy link
Contributor

larroy commented May 30, 2019

I'd suggest strace dump of linker

@larroy
Copy link
Contributor

larroy commented May 30, 2019

Just failed with 16G for me.

haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants