Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Reference engine from chunk via weak pointer #14591

Merged
merged 1 commit into from
Apr 17, 2019

Conversation

KellenSunderland
Copy link
Contributor

Description

This PR fixes an issue we're seeing in TRT integration which causes a crash on shutdown. I believe the issue should be present under other usages as well. This PR tracks a weak reference on the engine to ensure we don't try to schedule work on the engine after it has been shutdown (for example cleaning up memory from NDChunks that are still in scope).

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

@haohuanw
Copy link
Contributor

haohuanw commented Apr 2, 2019

🎆

@piyushghai
Copy link
Contributor

Thanks for your contribution @KellenSunderland .

@mxnet-label-bot Add [Numpy, pr-awaiting-review]

@marcoabreu marcoabreu added Numpy pr-awaiting-review PR is waiting for code review labels Apr 2, 2019
@KellenSunderland
Copy link
Contributor Author

Thanks @piyushghai. Not immediately clear to me why the numpy label is attached. Does this PR have a Numpy related aspect to it?

@piyushghai
Copy link
Contributor

Aah. My bad. I was adding ndarray, and I ended up typing numpy instead :/

@mxnet-label-bot Update[ndarray, pr-awaiting-review]

@marcoabreu marcoabreu added NDArray and removed Numpy labels Apr 2, 2019
@KellenSunderland
Copy link
Contributor Author

@piyushghai Gotcha! Thanks for the quick responses on the PR.

@haohuanw
Copy link
Contributor

haohuanw commented Apr 3, 2019

LGTM

@KellenSunderland
Copy link
Contributor Author

@anirudhacharya @eric-haibin-lin

Could one of you guys do a quick review (it's a pretty small change).

Copy link
Member

@anirudh2290 anirudh2290 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall the change looks good. Thanks for the fix! Can we add a cpp test for this.

@KellenSunderland
Copy link
Contributor Author

KellenSunderland commented Apr 16, 2019

@anirudh2290 Good point. Took me a while to think of how we can test this. Reproducing the issue is only possible when we have static de-allocating. A bit tricky to test without modifying the code to make non-static scoped.

I created a test that at a minimum crashes on my machine without the fix. It'll mark the test as passed but it will crash the test process (which will prevent regressions in CI I hope). Stepping through I can see it's taking the branches you'd expect w and w/o the fix. I.e. with the fix it skips the engine delete.

Output w/o fix:

Note: Google Test filter = Engine.stop_cleanup
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from Engine
[ RUN      ] Engine.stop_cleanup
[       OK ] Engine.stop_cleanup (1 ms)
[----------] 1 test from Engine (1 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (1 ms total)
[  PASSED  ] 1 test.

Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)

engine->Stop();
}
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@KellenSunderland Thanks for the nice work on the tests. I understand this may be difficult to test, and I am okay with the crash for verification. Would it be easy to pinpoint where it crashed when run with other tests ? Otherwise better to move it to a different source file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be a little difficult in fact. Let's move it to it's own test suite to hopefully make it a little more obvious for future contributors.

@KellenSunderland KellenSunderland force-pushed the trt_shutdown_crash branch 2 times, most recently from b760cae to b3c6278 Compare April 17, 2019 01:23
@KellenSunderland
Copy link
Contributor Author

KellenSunderland commented Apr 17, 2019

@anirudh2290 Rebased and moved the test into its own fixture, should be ready for another look.

@KellenSunderland KellenSunderland merged commit 3e2f752 into apache:master Apr 17, 2019
lihaofd pushed a commit to lihaofd/incubator-mxnet that referenced this pull request Apr 17, 2019
Reference engine from chunk via weak pointer (apache#14591)
kedarbellare pushed a commit to kedarbellare/incubator-mxnet that referenced this pull request Apr 20, 2019
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
NDArray pr-awaiting-review PR is waiting for code review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants