
Why does sharing the data type in ShareBufferWith fix unit tests for mkldnn ops using non-float data types? #37306

Closed
iclementine opened this issue Nov 17, 2021 · 5 comments

iclementine commented Nov 17, 2021

@jczaja Hi, I have a question about the inplace mechanism in mkldnn ops.

Description

I was implementing an operator that can view real tensors (dtype: float32 or float64) of shape (d1, d2, ..., 2) as complex tensors (dtype: complex64 or complex128, respectively) of shape (d1, d2, ...). This requires sharing a buffer (or storage) between two tensors with different data types. (PR: #37240)
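
For illustration, here is a minimal standalone sketch of the "view" idea in plain C++ rather than Paddle code; the same buffer is read either as 2*N floats or as N complex64 values, so no copy is needed:

#include <complex>
#include <cstdio>
#include <vector>

int main() {
  // A real tensor of shape (3, 2): three (real, imag) pairs stored contiguously.
  std::vector<float> real_buf = {1.f, 2.f, 3.f, 4.f, 5.f, 6.f};

  // View the same storage as a complex64 tensor of shape (3,).
  // std::complex<float> is layout-compatible with float[2] (guaranteed since C++11),
  // so reinterpreting the buffer is enough.
  auto* complex_view = reinterpret_cast<std::complex<float>*>(real_buf.data());

  for (std::size_t i = 0; i < real_buf.size() / 2; ++i) {
    std::printf("(%g, %g)\n", complex_view[i].real(), complex_view[i].imag());
  }
  return 0;
}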

Paddle shares storage with ShareBufferWith.

void ShareBufferWith(const Tensor& tensor) {
  holder_ = tensor.holder_;  // share the underlying allocation
  offset_ = tensor.offset_;  // share the offset into that allocation
  type_ = tensor.type_;      // also copies the data type (the line in question)
}

I tried removing the line type_ = tensor.type_; and with that change I could indeed share storage between real and complex tensors. But I noticed that there was a related PR where this line was added.

PR #24853 Fix bug in ShareBufferWith causing non-fp32 inplace uts to fail

During a call to ShareBufferWith, the type was not assigned when the holders were assigned. This caused inplace UTs that use non-fp32 tensors (for example int8, uint8) to fail on FetchOp.

Also, removing this line in PR #37247 (disable copying of datatype when sharing buffer between two tensors) causes a failure of test_elementwise_add_bf16_mkldnn_op. The build log (see the attachment) shows a mismatch in memory size, possibly caused by a wrong data type.

But I think there could be a better way to solve this problem. By definition, the data type is metadata associated with a Tensor, not with the storage object.

Question

So I wonder why sharing the data type in ShareBufferWith fixes unit tests for mkldnn ops using non-float data types when the inplace strategy is enabled. If we figure this out, we could fix those tests while avoiding sharing the data type in ShareBufferWith, and then sharing storage between Tensors with different data types would be possible.

Thank you!

Attachment: build.log

jczaja commented Nov 17, 2021

@iclementine I started to analyze the situation and will come back to you soon.

jczaja commented Nov 18, 2021

@iclementine
I'm not sure if I can answer the question, as a former colleague of ours implemented that fix.

So, the data type in the situation you are describing is bf16, which is 2 bytes per element, as opposed to the float data type, which is 4 bytes per element. The unit test for the inplace scenario uses a "buffer_share" op that calls ShareBufferWith(<src bf16 tensor>). In this unit test the src bf16 tensor holds 100 elements, each of which is 2 bytes in size.
If ShareBufferWith() does not set the type to bf16 in the target tensor, then the default type (float) stays set. Later on, when fetch_handle_op is working and e.g. TensorCopy is called, check_memory_size() compares the tensor's memory size (200 bytes) with numel() * sizeof(float) (400 bytes), and since 200 is not equal to 400 the assertion fails.
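
To make that concrete, here is a simplified standalone model (hypothetical MiniTensor/DType names, not Paddle's actual classes) of why the size check fails when the type is not shared along with the holder:

#include <cstddef>
#include <cstdio>
#include <memory>

enum class DType { FP32, BF16 };
std::size_t SizeOf(DType t) { return t == DType::FP32 ? 4 : 2; }

struct MiniTensor {
  std::shared_ptr<char> holder;  // shared raw allocation
  std::size_t bytes = 0;         // size of that allocation
  std::size_t numel = 0;         // number of elements
  DType type = DType::FP32;      // defaults to float, as described above

  void ShareBufferWith(const MiniTensor& src, bool share_type) {
    holder = src.holder;
    bytes = src.bytes;
    numel = src.numel;
    if (share_type) type = src.type;  // the line added by PR #24853
  }

  // Mimics check_memory_size(): the allocation must be able to hold
  // numel elements of the recorded data type.
  bool CheckMemorySize() const { return bytes >= numel * SizeOf(type); }
};

int main() {
  MiniTensor src;
  src.numel = 100;
  src.type = DType::BF16;
  src.bytes = src.numel * SizeOf(src.type);  // 200 bytes
  src.holder.reset(new char[src.bytes], std::default_delete<char[]>());

  MiniTensor dst_no_type, dst_with_type;
  dst_no_type.ShareBufferWith(src, /*share_type=*/false);
  dst_with_type.ShareBufferWith(src, /*share_type=*/true);

  // Without type sharing: 200 < 100 * sizeof(float) = 400, so the check fails,
  // just like the bf16 unit test does on fetch.
  std::printf("without type sharing: %s\n", dst_no_type.CheckMemorySize() ? "ok" : "FAIL");
  std::printf("with type sharing:    %s\n", dst_with_type.CheckMemorySize() ? "ok" : "FAIL");
  return 0;
}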

The thing is that there was no bigger design analysis behind this solution. My colleague was just debugging, noticed that ShareBufferWith() was not transferring the type, which caused problems, and by his (and my) understanding ShareBufferWith() should transfer the data type.

If you have an idea how to fix this differently, we would be happy to read about it.

iclementine commented Nov 19, 2021

@jczaja
I think I have found the reason why not sharing the data type causes the inplace elementwise mkldnn op with non-float data types to fail.

For an operator to enable the inplace strategy, the output reuses the memory of the input in some way. But the kernel still has to take care of the data type of the output tensor.

When the kernel allocates the memory of the output (via Tensor.mutable_data<T> or platform::BinaryMKLDNNHandler<T>.AcquireDstMemory<T>(Tensor*), etc.), the data type is also set to the desired type.

But when the memory is not allocated by an allocator but shared from another allocation, there should be a separate step to set the data type. (Though I think it should be the responsibility of the operator to infer the correct data type of the returned Variable.)

So I added a step to set the data type. (When the memory of a tensor is already allocated and does not need to be re-allocated, mutable_data<T>(...) only sets the data type.)

See PR #37247 for details.
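
As a standalone sketch only (hypothetical MiniTensor/MutableData names, not Paddle's real classes), the pattern is: share only the storage, then set the data type in a separate step, the way mutable_data<T>(...) can when the existing allocation is already large enough:

#include <cstddef>
#include <cstdio>
#include <memory>

enum class DType { FP32, BF16, COMPLEX64 };
std::size_t SizeOf(DType t) {
  switch (t) {
    case DType::FP32: return 4;
    case DType::BF16: return 2;
    case DType::COMPLEX64: return 8;
  }
  return 0;
}

struct MiniTensor {
  std::shared_ptr<char> holder;
  std::size_t bytes = 0;
  std::size_t numel = 0;
  DType type = DType::FP32;

  // Share only the storage; the data type stays whatever the destination declares.
  void ShareBufferWith(const MiniTensor& src) {
    holder = src.holder;
    bytes = src.bytes;
  }

  // mutable_data-like helper: records the dtype and re-allocates only if the
  // existing buffer is too small for numel elements of that dtype.
  void* MutableData(std::size_t n, DType t) {
    numel = n;
    type = t;
    std::size_t need = n * SizeOf(t);
    if (!holder || bytes < need) {
      holder.reset(new char[need], std::default_delete<char[]>());
      bytes = need;
    }
    return holder.get();
  }
};

int main() {
  MiniTensor real;                        // a (3, 2) float32 tensor
  real.MutableData(6, DType::FP32);       // allocates 24 bytes

  MiniTensor cplx;                        // viewed as a (3,) complex64 tensor
  cplx.ShareBufferWith(real);             // storage only, no dtype copied
  cplx.MutableData(3, DType::COMPLEX64);  // separate step sets dtype, no re-allocation

  std::printf("same storage: %s\n", real.holder == cplx.holder ? "yes" : "no");
  return 0;
}

This keeps the data type as per-Tensor metadata, while the holder stays a type-agnostic allocation.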

jczaja commented Nov 19, 2021

@iclementine Ok. If there is no Tensor::set_type(), then mutable_data needs to be called. We have analogous constructions in some other ops supporting inplace, e.g. softmax and activations.

iclementine commented

@jczaja Thank you. I'm going to do the same for other operators.
