-
Notifications
You must be signed in to change notification settings - Fork 6.8k
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the quick fix! LGTM
I have two questions:
Does each process have independent threads?
Does the bug exist on Windows?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix. Great work @arcadiaphy ! Would it be possible to add a test similar to the reproducible example you provided in the issue: #14396 ?
@wkcn For the two questions:
|
I find another bug, the custom operator hangs when having unfinished async operations.
I test the script on ubuntu, it hangs when calling python CFUNCTYPE function in c++. If you add It seems that calling python function from c++ when destructing static variable or forking may lead to problems. |
c7871d6
to
5bd800c
Compare
@arcadiaphy waitall silently hides exceptions, it doesnt rethrow the exception yet. This should be fixed by #14397 . what happenswhen you add c.wait_to_read() in foo . does it still hang ? |
@anirudh2290 No hangs after adding c.wait_to_read(). It only happens in calling CFUNCTYPE when destructing CustomOperator or running pre-fork Stop function. |
@mxnet-label-bot add [Bug, Operator] |
pthread_atfork( | ||
[]() { | ||
CustomOperator::Get()->Stop(); | ||
Engine::Get()->Stop(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we call Engine stop before customop stop. This will ensure that all ops pushed to engine have executed before stopping engine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I take that back. Maybe that won't work since Stopping engine before custom op means there may be ops that havent been pushed to engine by custom op but it has stopped. Another option is to add waitall after the worker joined in CustomOp stop to make sure all pushed ops have completed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The second option doesn't work either. When it goes into Stop function, the custom worker thread has already frozen on python function, leading to the blocking of worker join. I cannot think of a simple and graceful way to complete or ignore the unfinished custom operation other than manually add waiting command in python code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you know why it is frozen on worker join ? The worker should just push the operator to engine and return.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's very easy to understand the first situation. The destruction of static variable on engine exit happens when python interpreter has been shutdown, so something bad happens when you call CFUNCTYPE function at that time.
I push a new commit to move custom Stop function to MXNotifyShutdown, forcing custom function called before python shutdown, so the first situation is all right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The second situation of forking is so mysterious, I don't have a clue now. The custom Stop function is registered in prepare fork handler, everything should be OK at that time, but the worker thread just freezes.
I write a simple example to show the program freeze, which comes from calling CFUNCTYPE in thread when preparing fork.
|
@arcadiaphy Does this also related to https://discuss.mxnet.io/t/custom-operator-cause-stuck-in-multicard-train/1452 |
I think this may be related to https://docs.python.org/3/c-api/init.html#non-python-created-threads .
We don't have any such handling for callbacks in custom ops. |
@anirudh2290 The program blocks because of GIL. According to the following docs in your link, if fork locks GIL, then acquiring GIL in another thread and join them will cause deadlock.
Additional handling should be added to correctly run callbacks:
But adding this will make mxnet link against libpython, do we really want to do this? |
@YutingZhang The problem you mentioned may be related to GIL, correctly handling GIL in c++ extension is a tricky problem, easy to cause deadlock. Now we have no such handling at all. |
@arcadiaphy yes I don't see other option apart from depending on libpython if we have to fix these issues w.r.t custom op. What do others think ? |
@anirudh2290 If we don't want to depend on libpython, then this partial fix will work fine if you don't try to fork when having unfinished custom operation. |
@arcadiaphy I think we can move forward with this PR as this is an improvement from what we have today. Can you document this as a limitation of using custom op with forking here: http://mxnet.incubator.apache.org/versions/master/tutorials/gluon/customop.html , |
@anirudh2290 Add some docs on custom op. |
@anirudh2290 Docs updated, everything OK now. |
* fix custom operation in fork * add test * fix custom stop * swap order * add docs * update doc
* fix custom operation in fork * add test * fix custom stop * swap order * add docs * update doc
* fix custom operation in fork * add test * fix custom stop * swap order * add docs * update doc
* fix custom operation in fork * add test * fix custom stop * swap order * add docs * update doc
Description
Fix #14396. Update pthread_atfork function to properly fork on custom operator.
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Changes
Comments