-
Notifications
You must be signed in to change notification settings - Fork 6.8k
Use single-bit for mask in dropout operator #16735
base: master
Are you sure you want to change the base?
Conversation
14a3791
to
3192717
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TaoLv @PatricZhao can someone review the CPU changes?
src/operator/nn/dropout-inl.h
Outdated
}); | ||
// mask_out is set per bit position | ||
// therefore bitwise shift need to be performed here | ||
auto maskIdx = i / 8; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maskIdx -> mask_idx
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment for offset, val
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will do
src/operator/nn/dropout-inl.h
Outdated
bool maskVal = mshadow_op::threshold_eq::Map<real_t>(rand_num, pkeep); | ||
if (maskVal) { | ||
// set bit | ||
mask_out[maskIdx] |= 1U << maskOffset; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will this lead to race condition if the same maskIdx is being set by multiple threads? Shall each thread handle at least 8 bits?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. I was thinking of setting the step to 8 but forgot to update it in the macro.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After checking into it more, I found ideally this should not happen because RandGenerator<xpu>::kMinNumRandomPerThread
is 64 and therefore by design the step size inside LaunchRNG
should be a multiple of 8. But then I looked into that piece of code again and found it looks like a bug in calculating the step
. Please review my latest change in src/operator/random/sampler.h and let me know if it makes sense. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this for loop parallelized?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general I do not recommend writing code this way. There is not documenntation nor guarantee that kMinNumRandomPerThread
will always be greater than 8 in the future. Nor does the dropout operator document any assumption about the value of kMinNumRandomPerThread
. The code is delicate and will be broken if some contributor changes kMinNumRandomPerThread
to values like 4. If there's any assumption, we should add an explicit check so that it won't be broken in the future
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point. I will refactor this piece of code.
src/operator/nn/dropout-inl.h
Outdated
}); | ||
// mask_out is set per bit position | ||
// therefore bitwise shift need to be performed here | ||
auto maskIdx = i / 8; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will this lead to race condition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
potential race condition
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will help to save memory. But curious to know the performance impact.
4457579
to
78a40d5
Compare
@apeforest Thank you for the nice work! Do you have any numbers to share?
|
For GPT-2, the memory usage goes from 30GB to 26GB. For BERT, it goes from 26GB to 23GB. I didn't notice much difference in training throughput. |
@TaoLv Thanks for your review. I ran operator profiling using benchmark.opperf.utils.benchmark_utils.run_performance_test. The result shows speed up in forward but some degradation in backward pass. w/ this change:
w/o this change:
|
38c021a
to
3874110
Compare
@apeforest Thank you for testing it out. Given memory is not always a concern, can we make bit mask an option for dropout? |
@TaoLv I don't think adding an option is necessary. can we improve the backward kernel? |
@apeforest Could you please also test the operator performance with USE_BLAS=mkl? |
It will be a concern for the performance drop because we are working on model training recently. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cudnn part LGTM, one concern is the speed reported from profiler is quite different than measured from python side here
Let's make sure we know the performance impact on end to end python.
@roywei Using the test script in #13896
Using python timer to measure CPU performance with MKL: This PR:
Master:
|
Does the |
auto mask_idx = i >> 3; // div 8; | ||
uint8_t mask_offset = i & 7; // mod 8 | ||
bool mask_val = maskptr[mask_idx] & (1U << mask_offset); | ||
ingradptr[i] = outgradptr[i] * mask_val * pk_1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's also use blocking in the backward path:
const int blk_size = 64;
const int nblk = count / blk_size;
#pragma omp parallel for num_threads(nthr) schedule(static, 8)
for (index_t b = 0; b < nblk; ++b) {
for (index_t k = 0; k < blk_size; ++k) {
index_t i = b * blk_size + k;
auto mask_idx = i >> 3; // div 8;
uint8_t mask_offset = i & 7; // mod 8
bool mask_val = maskptr[mask_idx] & (1U << mask_offset);
ingradptr[i] = outgradptr[i] * mask_val * pk_1;
}
}
// tail
if (nblk * blk_size < count) {
for (index_t i = nblk * blk_size; i < count; ++i) {
auto mask_idx = i >> 3; // div 8;
uint8_t mask_offset = i & 7; // mod 8
bool mask_val = maskptr[mask_idx] & (1U << mask_offset);
ingradptr[i] = outgradptr[i] * mask_val * pk_1;
}
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After more thoughts, I think we actually don't need to do blocking in the backward pass as there is no write to maskptr and hence no cache eviction nor race condition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're writing to ingradptr
. We also hope the elements in one cache line will be handled by one openmp thread. With the original parallelization, one cache line is loaded and only one element in it is handled by the current thread. For the next thread, it need load the same cache line, and handle the next element.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However there is no read from ingradptr
, therefore this is not a case of the false sharing, right? I tried this block and didn't noticed any noticeable performance gain.
Yes, it includes backward time as my |
|
The new implementation increases both memory load and additional bit-wise operations. So performance slow down is expected. |
What algorithm is used in TF and pytorch? |
@pengzhao-intel I don't think TF has a fused dropout operator. It's implement with several small operators. See /~https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/nn_ops.py#L4456. So the backward path should go through the backward of these small operators. Hence no bit-mask there. For PyTorch, I see there is a fused one: /~https://github.com/pytorch/pytorch/blob/master/tools/autograd/templates/Functions.cpp#L634. The mask tensor should either be Boolean or has compatible type as grad. So no bit-mask either.
@apeforest , so far there is no dropout functionality in MKL or MKL-DNN. Here we just use VSL to generate random values. So even we can generate bit-mask, it will increase additional computation for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new implementation increases both memory load and additional bit-wise operations. So performance slow down is expected.
Why does it increase memory load?
Is there any plan for MKLDNN to support fast dropout with bit-mask like CuDNN? I think reducing memory consumption is quite important. CPU does not have memory capacity issue but it will be one for most GPUs and ASICs. I'd push for efficient implementation from MKLDNN in the long term.
The memory load is actually reduced even in the case of MKL, right? Please refer to the tests results in the PR description. |
If there are N elements, per the Bernoulli distribution generation in VSL, we still need to allocate memory and write |
The memory for bit-mask is not extra memory. So for the MKL dropout case, |
@PatricZhao @TaoLv what do you suggest as the resolution? If CPU performance is a concern, shall we add env_var to control the behavior? Do you agree in the long term we want to push for dropout API in MKLDNN with 1-bit mask? |
Given your concern about the performance degradation in the case of MKL dropout, I have disabled this feature when MKL dropout is used. Please review the PR again and let me know if you think this is good to go. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the turning around, @apeforest. It looks good to me in general but I notice that there are cases failing on dropout. I can approve once they get fixed. Thanks!
Hi @TaoLv and @PatricZhao I reverted my last commit of "Do not use bit-mask when MKL dropout is used." It makes the code too bristle and also involves very complicate logic to check memory allocation at runtime. Here are the main reasons: (1) MKL dropout support is currently not complete. It does not work if the input data type is smaller than int32 and it does not support broadcast option (when the option axes is specified). This limitation enforces a check at runtime which is not possible in the InferShape function e.g. In this function, I will need to check if the dtype is greater than int32 in order to use a different shape for MKL Dropout. (2) Having different Dropout engine at runtime (based on data type and ) may cause inconsistency in the mixed precision case. Introducing another difference in mask memory allocation complicates this even further. I think we should focus on enhancing MKL Dropout so that it (1) supports all the different cases as non MKL dropout (2) supports bit-mask. Please let me know what you think. Thanks! Lin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a RFC for 1-bit dropout in MKLDNN which we can leverage: oneapi-src/oneDNN#656 (comment)
Is there anyone that can take a look at this PR ? |
Description
Use single bit in mask for dropout to reduce memory.
This PR fixes #15968
Performance tests are run using the script below:
Results:
Time measured in python: #13896
@eric-haibin-lin @TaoLv @PatricZhao @ptrendx @roywei please help to review