This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Replaced rand_r with std::random generation #13574

Closed
wants to merge 1 commit into from

Conversation

lebeg
Contributor

@lebeg lebeg commented Dec 7, 2018

Description

rand_r is not thread-safe and is not available on Android. Part of the bigger #11148

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, a README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change or have been fixed to be compatible with it

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@lebeg lebeg requested a review from anirudh2290 as a code owner December 7, 2018 10:38
Contributor

@larroy larroy left a comment

good change, some comments.

@@ -176,13 +176,17 @@ void LstmForwardTraining(DType* ws,
   if (dropout > 0.0f) {
     #pragma omp parallel for num_threads(omp_threads)
     for (int j = 0; j < T * N * H * D; j++) {
-      int rand_data = rand_r(&seed_);
+      static thread_local std::random_device device;
Contributor

is it correct that we are seeding inside the loop?

Contributor Author

Interesting link, thanks. I would still prefer the explicit way for clarity.

Contributor Author

is it correct that we are seeding inside the loop?

Since it's static, we do it only once.

Contributor Author

For every thread.

Contributor

I see.

@@ -1890,7 +1896,11 @@ void VanillaRNNForwardTraining(DType* ws,
   if (dropout > 0.0f && l > 0) {
     #pragma omp parallel for num_threads(omp_threads)
     for (int i = 0; i < T * N * I; i++) {
-      int rand_data = rand_r(&seed_);
+      static thread_local std::random_device device;
Contributor

Could we wrap this up in a function? Or would that interfere with thread_local? It seems we are duplicating code.

Contributor Author

I'm not sure where to put it; do you have any suggestions?

Contributor

@larroy larroy left a comment

LGTM

@nswamy nswamy added and removed the pr-awaiting-review (PR is waiting for code review) label Dec 8, 2018
static thread_local std::default_random_engine generator(device());
static thread_local std::uniform_real_distribution<float> distribution;
static thread_local auto dice = std::bind(distribution, generator);
return dice();
}

Contributor

@perdasilva perdasilva Dec 9, 2018

Could you add a RandInt method and use this in rnn_impl?
I would be interested to understand what happens memory-wise with these thread_locals, and also the OpenMP parallelization, when I see you next =)

Contributor Author

The problem is that this file is in tests/.

Contributor

There's no facepalm emoticon, so I went with thumbs-up.

@Roshrini
Member

@perdasilva @larroy Is your concern answered? Is this PR good to go?

@Roshrini
Member

Roshrini commented Jan 2, 2019

@mxnet-label-bot Update [pr-awaiting-response]

@marcoabreu marcoabreu added the pr-awaiting-response (PR is reviewed and waiting for contributor to respond) label and removed the pr-awaiting-review (PR is waiting for code review) label Jan 2, 2019
@anirudhacharya
Member

@lebeg @larroy @perdasilva ping for update

@lebeg
Contributor Author

lebeg commented Jan 15, 2019

I think this PR is good to merge, can somebody do this?

@lebeg
Contributor Author

lebeg commented Jan 15, 2019

@mxnet-label-bot Update [pr-awaiting-merge]

@marcoabreu marcoabreu added the pr-awaiting-merge (Review and CI is complete. Ready to Merge) label and removed the pr-awaiting-response (PR is reviewed and waiting for contributor to respond) label Jan 15, 2019
@sandeep-krishnamurthy
Contributor

@apeforest - Can you please take a look at this PR?
Thanks @lebeg for this PR.

@perdasilva
Contributor

@anirudhacharya @lebeg all good from my side

@lebeg
Contributor Author

lebeg commented Feb 11, 2019

Can anybody merge this?

@lanking520
Member

@apeforest Could you please take a look?

@szha szha requested review from TaoLv and yzhliu February 12, 2019 23:26
Member

@szha szha left a comment

Given that we have random number generators as operator resources, the thread-local random number generator inside the CPU RNN operator looks like a hack to me.

@@ -176,13 +176,17 @@ void LstmForwardTraining(DType* ws,
   if (dropout > 0.0f) {
     #pragma omp parallel for num_threads(omp_threads)
     for (int j = 0; j < T * N * H * D; j++) {
-      int rand_data = rand_r(&seed_);
+      static thread_local std::random_device device;
Member

This should probably be addressed at the framework level by providing APIs to get random numbers. We should not expect developers who implement operators to handle thread-local variables.

Contributor Author

@eric-haibin-lin In general, I don't see why a thread-local variable is an issue - there is a separate generator for every thread, and nothing additional needs to be done to ensure thread safety. Otherwise, locking would need to be in place.

I could add a generate-random-number function to the framework API, but it wouldn't change the implementation. What do you think?

Contributor Author

And with a single function for all threads, there would also be complications with setting the seed if needed.

Member

I suggest referring to RandGenerator<xpu>, as I mentioned below. It handles the thread-safety problem without locking. The implementation is somewhat tricky, though.

@lebeg
Contributor Author

lebeg commented Feb 13, 2019

we have random number generator as operator resources

@szha can you add a bit more information on this? Do you mean there is a random number generator operator?

@szha
Member

szha commented Feb 13, 2019

@lebeg I recommend looking at the dropout implementation (in src/operator/nn/dropout-inl.h)

@TaoLv
Member

TaoLv commented Feb 14, 2019

Looks like the change is extracted from #11148. I reviewed the code before; the related comment and response are here: /~https://github.com/apache/incubator-mxnet/pull/11148/files#r210174235

@lebeg
Copy link
Contributor Author

lebeg commented Feb 14, 2019

@szha @TaoLv thanks, quoting here again for reference:

Well, the problem is not about the seed itself, but with usage of rand_r and the fact that it's multithreaded. As far as I can tell in dropout-inl.h#L85 the seed is set outside of OpenMP threads and no random number gets generated with it afterwards, so it should be fine.

@yzhliu
Member

yzhliu commented Feb 15, 2019

Usually, using a thread_local random generator is not a good idea, not only in the sense of system design, but because it can be statistically invalid.

Say you use 4 threads to fill an array A[4], and each Gaussian random generator fills one entry with the head element of its sequence - A[0] = gen1[0], A[1] = gen2[0], A[2] = gen3[0], A[3] = gen4[0]. This array does not necessarily follow a Gaussian distribution (with some specific implementations it does, but there is no guarantee). A pseudorandom sequence has to be long enough to satisfy the distribution property. Such a "bug" is rare but extremely hard to debug once it occurs (which I have run into before).

Also, it would be great to let users set a global seed, as some (maybe not-well-designed) algorithms are sensitive to the seed value. This means all random generators within the system should take the same input seed as a constructor argument. (I know this rule is currently not well followed in mxnet, though.)

This is why RandGenerator is implemented in such a complicated way.

@lebeg
Contributor Author

lebeg commented Feb 15, 2019

Thank you @yzhliu for your input! Now it's indeed much clearer to me. I will take a look at the RandGenerator implementation and modify the change accordingly.

@vandanavk
Contributor

@mxnet-label-bot update [pr-awaiting-response, Operator]

@marcoabreu marcoabreu added the Operator and pr-awaiting-response (PR is reviewed and waiting for contributor to respond) labels and removed the pr-awaiting-merge (Review and CI is complete. Ready to Merge) label Mar 7, 2019
@karan6181
Contributor

@lebeg Did you get a chance to look at the RandGenerator implementation? Can you please make the relevant code changes for that? Thank you!

@lebeg
Contributor Author

lebeg commented Mar 26, 2019

Closing this for now; no capacity to work on this right now.

@lebeg lebeg closed this Mar 26, 2019
Labels
Operator, pr-awaiting-response (PR is reviewed and waiting for contributor to respond)