Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

fix memory-related issues to enable ASAN tests #14223

Merged
merged 10 commits into from
Mar 3, 2019

Conversation

arcadiaphy
Copy link
Member

@arcadiaphy arcadiaphy commented Feb 21, 2019

Description

Continuing the discussion in #14176, this PR fixes memory-related issues detected by ASAN:

  1. DeleteVar operations are still executed in engine shutdown phase to cleanup memory
  2. Mark DeleteOperator as DeleteVar
  3. Uncomment free operation in object pool
  4. Free unreleased pointers of executer and optimizer in cpp examples
  5. Fix heap overflow in classification example

Currently ASAN tests is non blocking, after this PR, the checks are green.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@marcoabreu
Copy link
Contributor

Can you make them blocking as part of this PR?

@roywei
Copy link
Member

roywei commented Feb 23, 2019

@mxnet-label-bot add [Memory, pr-awaiting-review]

@marcoabreu marcoabreu added Memory pr-awaiting-review PR is waiting for code review labels Feb 23, 2019
@szha
Copy link
Member

szha commented Feb 23, 2019

@arcadiaphy arcadiaphy force-pushed the memory_leak branch 3 times, most recently from b5bc2da to 4f2549c Compare February 25, 2019 18:13
@arcadiaphy
Copy link
Member Author

arcadiaphy commented Feb 26, 2019

@szha After digging into the CI crash, two more problems are addressed:

  1. When destructing static variables in program exit, the CUDA driver has already been unloaded, so calling CUDA runtime API such as cudaFree results in cudaErrorCudartUnloading. Some destructors should be added noexcept(false) to properly handle the error. Submodule mshadow and dmlc-core needs to be modified accordingly for compiling, I use my forked repo to test CI process.
    /~https://github.com/dmlc/mshadow/blob/3dc80815d965b56b9a975dc27229361955bf66fe/mshadow/tensor_container.h#L74
    /~https://github.com/dmlc/mshadow/blob/3dc80815d965b56b9a975dc27229361955bf66fe/mshadow/tensor_container.h#L182
  2. Sometimes, storage is destructed before NDArray in exit, leading to program crash when accessing storage. I have added a shared_ptr in the NDArray to avoid it.
    /~https://github.com/apache/incubator-mxnet/blob/master/src/ndarray/ndarray.cc#L124

@marcoabreu ASAN tests are blocking in CI now.

@arcadiaphy arcadiaphy force-pushed the memory_leak branch 10 times, most recently from 14ad85e to 4e724c0 Compare February 26, 2019 17:06
@arcadiaphy arcadiaphy force-pushed the memory_leak branch 2 times, most recently from bfba55e to 9d20133 Compare February 26, 2019 17:17
@arcadiaphy
Copy link
Member Author

Why can't I trigger the full CI process?

@szha
Copy link
Member

szha commented Feb 27, 2019

@arcadiaphy there seems to be some problem with the CI right now.

@arcadiaphy
Copy link
Member Author

@szha Everything seems OK now, the only problem is I have changed the code in the submodule of mshadow and dmlc-core.

@marcoabreu The asan log looks clean too.
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fmiscellaneous/detail/PR-14223/10/pipeline#step-159-log-763

@szha
Copy link
Member

szha commented Feb 28, 2019

@arcadiaphy thanks! Feel free to PR those changes to the respective repos. Once merged, you can change the submodules to point to the new commits there.

@arcadiaphy
Copy link
Member Author

@szha Submodules are merged and pointed to new commits.

Copy link
Member

@eric-haibin-lin eric-haibin-lin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix! I noticed that the patch is made to the mxnet-stable branch in dmlc-core. I don't think that is a good sign - we do not want to diverge from dmlc-core master. @szha what do you think

@szha
Copy link
Member

szha commented Mar 3, 2019

I agree that mxnet-stable branch should be merged back to master ASAP. @hcho3 informed me that he's taking a look now. For now, I think that effort can be taken separate from this PR

@eric-haibin-lin eric-haibin-lin merged commit 053ffc7 into apache:master Mar 3, 2019
@arcadiaphy arcadiaphy deleted the memory_leak branch March 3, 2019 05:39
vdantu pushed a commit to vdantu/incubator-mxnet that referenced this pull request Mar 31, 2019
* fix heap overflow

* fix memory leak of optimizer and executer

* uncomment memory pool free

* run cleanup in engine shutdown phase

* make asan tests blocking

* fix abort in mxnet shutdown, use forked submodules temporally for tests

* trigger CI

* change submodule mshadow

* change submodule dmlc-core
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
* fix heap overflow

* fix memory leak of optimizer and executer

* uncomment memory pool free

* run cleanup in engine shutdown phase

* make asan tests blocking

* fix abort in mxnet shutdown, use forked submodules temporally for tests

* trigger CI

* change submodule mshadow

* change submodule dmlc-core
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Memory pr-awaiting-review PR is waiting for code review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants