This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Fix profiler check #14677

Merged
merged 10 commits into from
Apr 12, 2019


anirudh2290
Member

@anirudh2290 anirudh2290 commented Apr 11, 2019

Description

After the waitall PR was merged, the nightly tests started failing, as reported here: #14397 (comment)
This was because waitall now rethrows exceptions that were previously hidden.
It seems that operator-= for ProfileCounter is called during frees, and it checks whether the amount to be freed is greater than the amount allocated. This PR adds a check before the operator is called.

A more permanent fix would be to understand why the memory pool counters are incorrect when free is called, but this PR should stop the profiler nightly test failures.
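To illustrate the kind of guard described above, here is a minimal Python sketch, not the actual C++ change. `CounterSketch` and `record_free` are hypothetical names standing in for ProfileCounter and the free path, and clamping the counter to zero when the guard trips is an assumption made for the example.

```python
class CounterSketch:
    """Hypothetical stand-in for a profiler counter that refuses to go negative."""

    def __init__(self):
        self.value = 0

    def __iadd__(self, amount):
        self.value += amount
        return self

    def __isub__(self, amount):
        # Mirrors the check described in the PR: freeing more than was
        # recorded as allocated is treated as an error.
        if amount > self.value:
            raise ValueError("decrement exceeds counter value")
        self.value -= amount
        return self


def record_free(counter, size):
    """Guarded decrement: only call the checked operator when it is safe.

    Returns True if the counter covered the free, False otherwise.
    Clamping to zero in the failing case is an assumption of this sketch.
    """
    if size <= counter.value:
        counter -= size
        return True
    counter.value = 0
    return False


counter = CounterSketch()
counter += 100
record_free(counter, 40)   # covered by the counter
record_free(counter, 100)  # guard trips instead of raising
```

The guard keeps the profiler from raising during frees, but it also hides the mismatch, which is why the root cause still needs investigation.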

@ThomasDelteil


@anirudh2290 anirudh2290 marked this pull request as ready for review April 11, 2019 22:22
@@ -165,6 +166,46 @@ def test_multiple_waitalls():
     assert caught, "No exception thrown"
     mx.nd.waitall()

@with_seed()
def test_exc_profiler():
Contributor

This seems a bit heavy for a unit test. I think passing some nd.ones() through a simple dense layer would do the same, right?

Member Author

I have simplified it. Please take a look.

Member

@yuxihu yuxihu left a comment

This PR LGTM. But did you try to investigate why this happened? Can a handle be double freed such that the counter becomes incorrect? I noticed that we pass the handle by value when calling Free(), which is not ideal: the caller might still hold the handle's dptr, and the handle's size is not zeroed.
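The double-free concern can be sketched in Python as follows. This is a hypothetical model, not MXNet code: `Handle`, `PoolSketch`, and the explicit `copy.copy` that emulates pass-by-value are all assumptions for illustration, and the thread does not confirm that a double free is the actual cause.

```python
import copy
from dataclasses import dataclass


@dataclass
class Handle:
    """Hypothetical storage handle: a pointer-like id plus a size in bytes."""
    dptr: int
    size: int


class PoolSketch:
    """Toy memory pool that tracks allocated bytes in a signed counter."""

    def __init__(self):
        self.allocated = 0
        self._next_ptr = 1

    def alloc(self, size):
        self.allocated += size
        handle = Handle(dptr=self._next_ptr, size=size)
        self._next_ptr += 1
        return handle

    def free(self, handle):
        # Emulate C++ pass-by-value: the pool only ever sees a copy.
        local = copy.copy(handle)
        self.allocated -= local.size
        # Clearing the copy does not affect the caller's handle.
        local.dptr, local.size = 0, 0


pool = PoolSketch()
h = pool.alloc(64)
pool.free(h)  # counter back to 0, but h still reports dptr and size
pool.free(h)  # second free subtracts again and drives the counter negative
```

With an unsigned counter and a checked decrement, the second free is the point where the "freeing more than allocated" check would fire.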

@anirudh2290
Member Author

@yuxihu Good point. I didn't investigate whether the handle is getting double freed. I will check that, though this PR doesn't have to block on it.

@yuxihu
Member

yuxihu commented Apr 12, 2019

> @yuxihu Good point. I didn't investigate whether the handle is getting double freed. I will check that, though this PR doesn't have to block on it.

Yes, it is good to have the logic implemented in this PR anyway. The investigation can be done separately.

@ThomasDelteil ThomasDelteil merged commit 5fc5c27 into apache:master Apr 12, 2019
larroy pushed a commit to larroy/mxnet that referenced this pull request Apr 15, 2019
* Relax constexpr restriction

* Image classifcation mkldnn

* Check mem profiler greater than 0

* Revert "Relax constexpr restriction"

This reverts commit 5016170.

* Revert "Image classifcation mkldnn"

This reverts commit 30bfab2.

* Add test for profiler

* Simplify test
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
* Relax constexpr restriction

* Image classifcation mkldnn

* Check mem profiler greater than 0

* Revert "Relax constexpr restriction"

This reverts commit 5016170.

* Revert "Image classifcation mkldnn"

This reverts commit 30bfab2.

* Add test for profiler

* Simplify test