Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[WIP] Improve Ccache #11520

Closed
wants to merge 1 commit into from
Closed

Conversation

marcoabreu
Copy link
Contributor

@marcoabreu marcoabreu commented Jul 2, 2018

Description

#11516

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

Comments

  • So far, GCC and G++ are properly handled by CCache and we have proper cache hits. The problem here is that NVCC still produces cache misses.

@marcoabreu marcoabreu requested a review from szha as a code owner July 2, 2018 01:33
@@ -34,4 +34,5 @@ then
mkdir /work/mxnet
mkdir /work/build
chown -R jenkins_slave /work/
chown -R jenkins_slave /usr/local/bin
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is hacky.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that the entire Docker container is owned by root by default. Since we run in low-privilege mode inside our container (opposed to running everything as root), we have to whitelist certain directories (/work and /usr/local/bin) that the process is allowed to write. This is basically a necessary evil of the least-priviledge-policy.

An alternative would be to set these softlinks directly during the ccache installation. The problem there would then be that CCache would always be activated and we would not be able to turn it on/off during runtime. I wouldn't mind about that, but I didn't want to change the behaviour. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What requires permissions to write in /usr/local/bin ?

Copy link
Contributor Author

@marcoabreu marcoabreu Nov 28, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The symbolic link I'm creating. It's super hacky, but some buildsystems seem to directly access /usr/local/bin in order to access the compiler binary. They ignore any custom setting, so I'm using the super brutal way to just rewrite that entry by overriding it with a symlink.

This is especially the case for all the submodule dependencies we have. We can't control their compiler choices since they just hardcoded them. ps-lite and ZeroMQ do that, for example.


chmod +x cc
chmod +x cxx
# Add to the beginning of path to ensure this redirection is picked up instead
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this better than the previous solution? It looks less explicit and hacky. Can you provide some rationale?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is the recommended way by CCache: https://ccache.samba.org/manual.html#_run_modes
This way ensures that ccache is always used if the symlinks are set. Otherwise, we'd have to make a possibility in all build-paths (including third party) that they use ccache properly. This caused problems in compiling ps-lite, because that project had another third party dependency (ZeroMQ) which did not pick up the propagated compiler.

Copy link
Contributor

@larroy larroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks hacky, what is the rationale of these changes?

@marcoabreu
Copy link
Contributor Author

The problem is that the current method with the ccache wrappers does not work for third party dependencies. There are some scripts (libtool which is part of ZeroMQ which is part of 3rdparty/ps-lite) which identify the compiler by its name. If we redirect the compiler using the wrapper method, the script gets confused and falls back into a backup mode, which isn't able to compile the sub-projects properly. Also, this is the recommended way by CCache: https://ccache.samba.org/manual.html#_run_modes

@larroy
Copy link
Contributor

larroy commented Jul 2, 2018

I didn't know it's recommended by the authors. Thanks for the clarification.

@@ -489,9 +464,8 @@ build_ubuntu_amalgamation_min() {
build_ubuntu_gpu_cmake_mkldnn() {
set -ex
cd /work/build
install_ccache_wrappers
Copy link
Contributor

@lebeg lebeg Jul 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should not be any need for wrappers for cmake based builds.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea here is that we rather intercept the backend calls to the compilers instead of defining them through the frontend (aka the Cmake parameters). This allows us to be flexible with whatever compilers we define without having to think about ccache because it's handled automatically. Does that make sense?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As much as I know cmake will not like this and different problems can appear suddenly. Using cmake parameters is the recommended approach.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know and I only chose this approach after playing around a lot. The problem is that the underlying submodules we have don't respect that choice.

.gitignore Outdated
cxx
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you moved out the wrappers then this few lines are not needed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do

# of the original ones. Especially CUDA/NVCC appends itself to the beginning of the
# path and thus this redirect is ignored. This change fixes this problem
export PATH=/usr/local/bin:$PATH
ln -s ccache /usr/local/bin/gcc
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this work if already set?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you can define and configure the compilers ($CC, $CXX, $NVCC) as you want, we only intercept the backend calls to the actual compilers. This allows us to be independent of the configuration.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant if ccache soft link is already there.

Copy link
Contributor Author

@marcoabreu marcoabreu Nov 28, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, the softlink (or whatever was there previously) would just get replaced.

@sandeep-krishnamurthy sandeep-krishnamurthy added Build pr-awaiting-response PR is reviewed and waiting for contributor to respond pr-work-in-progress PR is still work in progress labels Aug 8, 2018
@nswamy
Copy link
Member

nswamy commented Aug 21, 2018

@marcoabreu are you going address the comments, if its handled in a different PR, could you please close this?

@marcoabreu
Copy link
Contributor Author

This PR is still WIP and paused due to other priorities. It is still valid.

@larroy
Copy link
Contributor

larroy commented Aug 22, 2018

Is this problem also happening with CMake builds or only in Make builds?

@szha szha removed their request for review August 22, 2018 18:57
@vandanavk
Copy link
Contributor

@sandeep-krishnamurthy Please change the label from [pr-awaiting-response, pr-work-in-progress] to [pr-awaiting-testing, pr-work-in-progress]

@sandeep-krishnamurthy sandeep-krishnamurthy added pr-awaiting-testing PR is reviewed and waiting CI build and test and removed pr-awaiting-response PR is reviewed and waiting for contributor to respond labels Sep 28, 2018
@Roshrini
Copy link
Member

Hi @marcoabreu, Any update on this PR? Are you still working on it?

@roywei
Copy link
Member

roywei commented Oct 29, 2018

ping @marcoabreu any updates?

@marcoabreu
Copy link
Contributor Author

Yes, I will come back to this as soon as I got a few hours

@lupesko
Copy link
Contributor

lupesko commented Nov 27, 2018

@marcoabreu this PR has been inactive for 3 months now.
If you are unable to make progress on it it might be better to close it.

@marcoabreu marcoabreu force-pushed the ccache-nvcc branch 7 times, most recently from 9977728 to 720caaa Compare November 28, 2018 20:35
@marcoabreu marcoabreu force-pushed the ccache-nvcc branch 2 times, most recently from 0621193 to 5a59f07 Compare November 28, 2018 20:49
@marcoabreu marcoabreu force-pushed the ccache-nvcc branch 4 times, most recently from 98a2aff to 9e1e700 Compare November 28, 2018 21:27
@marcoabreu
Copy link
Contributor Author

Replaced by #13456

@marcoabreu marcoabreu closed this Nov 29, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Build pr-awaiting-testing PR is reviewed and waiting CI build and test pr-work-in-progress PR is still work in progress
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants