This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Fix crash in random.shuffle operator #15041

Merged
merged 7 commits into from
May 23, 2019

Conversation

apeforest
Contributor

Description

This PR fixes #15029.
The root cause is that when the NDArray is 1-D and the platform is GNU/Linux, the backend implementation uses __gnu_parallel::random_shuffle(). See: https://github.com/apache/incubator-mxnet/blob/master/src/operator/random/shuffle_op.cc#L53. This explains why the crash did not happen on macOS.

The random_shuffle template defined in GCC (https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/include/parallel/random_shuffle.h#L384) passes std::numeric_limits<uint32_t>::max() to the __rng() function to generate a random seed. Therefore, in our rng() function (https://github.com/apache/incubator-mxnet/blob/master/src/operator/random/shuffle_op.cc#L49) we cannot use our self-defined data type index_t and must use uint32_t to avoid integer overflow.
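
To make the overflow concrete, here is a minimal sketch of what happens when a bound of std::numeric_limits<uint32_t>::max() is received through a signed parameter. It assumes, purely for illustration, a signed 32-bit index type (`signed_index_t` is a hypothetical stand-in, not necessarily MXNet's actual `index_t`), and the helper names are invented for this example. On common two's-complement targets the bound wraps to -1, so a range of the form [0, n - 1] becomes [0, -2], which violates the `a <= b` precondition of std::uniform_int_distribution.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical stand-in for a signed 32-bit index type (assumption for
// illustration only; not necessarily MXNet's index_t).
using signed_index_t = std::int32_t;

// The bound that, per the description above, the parallel shuffle passes
// to the RNG functor.
constexpr std::uint32_t kShuffleBound =
    std::numeric_limits<std::uint32_t>::max();

// Upper end of the range [0, n - 1] when the bound arrives as a signed index.
inline std::int64_t upper_bound_signed(signed_index_t n) {
  return static_cast<std::int64_t>(n) - 1;
}

// Upper end of the range [0, n - 1] when the bound arrives as uint32_t.
inline std::int64_t upper_bound_unsigned(std::uint32_t n) {
  return static_cast<std::int64_t>(n) - 1;
}

// Compares the two cases. The signed result assumes modular conversion,
// which GCC implements and C++20 guarantees.
inline void check_bounds() {
  // Unsigned parameter: the range stays valid, [0, 4294967294].
  assert(upper_bound_unsigned(kShuffleBound) == 4294967294LL);
  // Signed parameter: the bound wraps to -1, giving the invalid range [0, -2].
  assert(upper_bound_signed(static_cast<signed_index_t>(kShuffleBound)) == -2);
}
```

Calling `check_bounds()` from a test or `main` exercises both cases; the point is only that the parameter type of the RNG functor must be able to represent the largest bound the caller may pass.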

Why was this bug not detected by our unit tests? Because the shuffle unit test only covered 2-D shapes, which do not go through the parallel shuffle path in the backend.
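
As a rough illustration of the property a 1-D shuffle test can check, the sketch below shuffles a flat buffer with an RNG functor whose parameter is uint32_t and then verifies that the result is a permutation of the input. This is not the test added in this PR: `BoundedRng`, `shuffle_1d`, and `is_permutation_of_iota` are hypothetical names, and a plain Fisher-Yates loop stands in for the backend's parallel shuffle.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Hypothetical RNG functor: returns a uniformly drawn value in [0, n).
// Taking uint32_t means any bound up to 2^32 - 1 is representable.
struct BoundedRng {
  std::mt19937 engine;
  explicit BoundedRng(std::uint32_t seed) : engine(seed) {}
  std::uint32_t operator()(std::uint32_t n) {
    assert(n > 0);  // [0, n - 1] must be a non-empty range
    std::uniform_int_distribution<std::uint32_t> dist(0, n - 1);
    return dist(engine);
  }
};

// Fisher-Yates shuffle of a 1-D buffer driven by the functor above
// (a stand-in for the backend shuffle, not its actual implementation).
inline void shuffle_1d(std::vector<int>& data, BoundedRng& rng) {
  for (std::size_t i = data.size(); i > 1; --i) {
    const std::size_t j = rng(static_cast<std::uint32_t>(i));
    std::swap(data[i - 1], data[j]);
  }
}

// True if `shuffled` contains exactly the values 0..size-1 in any order.
inline bool is_permutation_of_iota(std::vector<int> shuffled) {
  std::sort(shuffled.begin(), shuffled.end());
  std::vector<int> expected(shuffled.size());
  std::iota(expected.begin(), expected.end(), 0);
  return shuffled == expected;
}
```

A test along these lines fills a 1-D vector with 0..N-1, calls `shuffle_1d`, and asserts `is_permutation_of_iota`. Checking the permutation property rather than a specific ordering keeps the test independent of the seed.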

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Contributor

@stu1130 stu1130 left a comment

LGTM

@zheng-da
Contributor

I think you should fix ShuffleND as well.

Contributor

@samskalicky samskalicky left a comment

LGTM

@apeforest
Contributor Author

@zheng-da ShuffleND does not use __gnu_parallel::random_shuffle(), so there is no issue.

@zheng-da
Contributor

Thanks for clarifying.

@karan6181
Contributor

@mxnet-label-bot add [Operator, pr-awaiting-merge]

@marcoabreu marcoabreu added Operator pr-awaiting-merge Review and CI is complete. Ready to Merge labels May 23, 2019
Contributor

@larroy larroy left a comment

LGTM. Could you edit the PR description to explain why it segfaults? That is not clear from the description, even though the fix is clear. Why would it segfault? In the linked issue the value doesn't seem to overflow int32... Is it an internal implementation problem in the STL?

@zheng-da zheng-da merged commit 66aa983 into apache:master May 23, 2019
@apeforest
Contributor Author

@larroy Yes, it's the internal implementation of random_shuffle that caused the overflow. See this line: https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/include/parallel/random_shuffle.h#L384

I have already added this to the PR description.

@apeforest apeforest deleted the bugfix/shuffle_crash branch May 23, 2019 20:09
apeforest added a commit that referenced this pull request May 23, 2019
* fix crash in random_shuffle caused by int overflow

* add unit test

* add comment

* remove small random test to avoid CI failure
@larroy
Contributor

larroy commented May 24, 2019

Thanks, this was not entirely clear. Was it a range error?

haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
* fix crash in random_shuffle caused by int overflow

* add unit test

* add comment

* remove small random test to avoid CI failure
Labels
Operator pr-awaiting-merge Review and CI is complete. Ready to Merge
Development

Successfully merging this pull request may close these issues.

mx.nd.random.shuffle crashes in the master branch
8 participants