-
Notifications
You must be signed in to change notification settings - Fork 6.8k
Fix crash in random.shuffle operator #15041
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
I think you should fix ShuffleND as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@zheng-da ShuffleND does not use |
thanks for clarifying. |
@mxnet-label-bot add [Operator, pr-awaiting-merge] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, could you edit the description of the PR and explain why the segfault? is not clear from the description, even though the fix is clear. Why would it segfault? in the linked issue the value doesn't seem to overflow int32... is it an internal implementation problem in the stl?
@larroy yes, it's the internal implementation of random_shuffle that caused overflow. See the line: (/~https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/include/parallel/random_shuffle.h#L384 I already addeded this in the PR description |
* fix crash in random_shuffle caused by int overflow * add unit test * add comment * remove small random test to avoid CI failure
Thanks, this was not entirely clear, was a range error? |
* fix crash in random_shuffle caused by int overflow * add unit test * add comment * remove small random test to avoid CI failure
Description
This PR fix #15029
The rootcause of the problem is when NDArray is 1-d and the platform is GNU Linux, the backend implementation uses
__gnu_parallel:random_shuffle()
See: /~https://github.com/apache/incubator-mxnet/blob/master/src/operator/random/shuffle_op.cc#L53. This explains why crash did not happen in MacOS.The random_shuffle template defined in gcc (/~https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/include/parallel/random_shuffle.h#L384) is passing in
std::numeric_limits<uint32_t>::max()
to the __rng() function to generate random seed. Therefore, in our rng() function (/~https://github.com/apache/incubator-mxnet/blob/master/src/operator/random/shuffle_op.cc#L49) we cannot use our self defined data typeindex_t
and have to useuint32_t
to avoid integer overflow.Why this bug was not detected in our unit test? Because our unit test of shuffle only tests 2-D shape, which did not use parallel_shuffle in backend.
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.