【Hackathon 4th No.27】Add the paddle.sparse.concat sparse API to Paddle #53872

Closed
wants to merge 0 commits into from

Conversation

lijingkai2023

PR types

New features

PR changes

APIs

Description

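Completes development task No. 27 of Hackathon Phase 4: https://github.com/PaddlePaddle/community/blob/master/hackthon_4th/%E3%80%90PaddlePaddle%20Hackathon%204%E3%80%91%20%E6%A0%B8%E5%BF%83%E6%A1%86%E6%9E%B6%E5%BC%80%E6%BA%90%E8%B4%A1%E7%8C%AE%20API%20%E5%BC%80%E5%8F%91%E4%BB%BB%E5%8A%A1%E5%90%88%E9%9B%86.md#task27

1. Add support for a list of sparse tensors as an operator argument, so that the dynamic-graph code and the registration logic are generated automatically (the first argument of concat is a list of sparse tensors, which the Paddle framework does not currently support).
2. Add the paddle.sparse.concat sparse API.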
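
For readers unfamiliar with the operation, the sketch below illustrates what concatenating COO tensors along an axis means. It is plain Python and not the kernel in this PR: the `Coo` tuple layout and the name `coo_concat` are hypothetical, coordinates are stored one tuple per non-zero entry for readability, and the inputs are assumed to have matching shapes outside `axis`.

```python
from typing import List, Sequence, Tuple

# Hypothetical COO container for illustration: (coordinates, values, shape).
Coo = Tuple[List[Tuple[int, ...]], List[float], Tuple[int, ...]]


def coo_concat(tensors: Sequence[Coo], axis: int) -> Coo:
    """Concatenate COO tensors along `axis` (0 <= axis < rank is assumed)."""
    base_shape = tensors[0][2]
    assert 0 <= axis < len(base_shape), "axis must already be normalized"

    out_coords: List[Tuple[int, ...]] = []
    out_values: List[float] = []
    offset = 0  # running size of the output along `axis`
    for coords, values, shape in tensors:
        for coord, value in zip(coords, values):
            shifted = list(coord)
            shifted[axis] += offset  # move the entry past the earlier inputs
            out_coords.append(tuple(shifted))
            out_values.append(value)
        offset += shape[axis]

    out_shape = list(base_shape)
    out_shape[axis] = offset
    return out_coords, out_values, tuple(out_shape)


a = ([(0, 1)], [1.0], (2, 3))
b = ([(1, 0), (1, 2)], [2.0, 3.0], (2, 3))
rows = coo_concat([a, b], axis=0)  # stack along dimension 0
cols = coo_concat([a, b], axis=1)  # stack along dimension 1
```

Only the coordinate along `axis` changes; values are copied through unchanged, so the output has as many non-zero entries as all inputs combined.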

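RFC design document: PaddlePaddle/community#504
Chinese API documentation: PaddlePaddle/docs#5886

[used AI Studio] Done: C++ operator that takes a list of sparse tensors as an argument, registration logic, and GPU build and test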

@paddle-bot

paddle-bot bot commented May 16, 2023

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@paddle-bot paddle-bot bot added contributor External developers status: proposed labels May 16, 2023
@paddle-bot

paddle-bot bot commented May 16, 2023

❌ The PR is not created using PR's template. You can refer to this Demo.
Please use PR's template, it helps save our maintainers' time so that more developers get helped.

paddle/phi/api/lib/api_gen_utils.cc (outdated, resolved)
paddle/phi/kernels/sparse/cpu/concat_grad_kernel.cc (outdated, resolved)
paddle/phi/kernels/sparse/cpu/concat_kernel.cc (outdated, resolved)
paddle/phi/kernels/sparse/cpu/concat_kernel.cc (outdated, resolved)
paddle/phi/kernels/sparse/gpu/concat_kernel.cu (outdated, resolved)
paddle/phi/kernels/sparse/gpu/concat_kernel.cu (outdated, resolved)
@lijingkai2023
Author

Calling sparse_concat in static-graph mode raises the following error:
File "C:\Users\desig\AppData\Local\Programs\Python\Python310\lib\site-packages\paddle\fluid\framework.py", line 2793, in __init__
for frame in traceback.extract_stack():

InvalidArgumentError: Operator sparse_concat's input x should contain only one variable.
  [Hint: Expected it->second.size() <= 1UL, but received it->second.size():2 > 1UL:1.] (at ..\paddle\fluid\framework\operator.cc:1129)
  [operator < sparse_concat > error]

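Judging from the log output, the failure occurs in paddle\fluid\framework\operator.cc, in function GetExpectedPhiKernelArgs, at the statement return (*arg_map_fn_)(arg_mapping_ctx);.
*arg_map_fn_ is the function SparseConcatOpArgumentMapping in paddle\phi\ops\compat\generated_sparse_sig.cc, but the error is raised before that function is ever called.

Where is the failing function (InputVar in paddle\fluid\framework\operator.cc) called from?
Is there a good way to approach this problem?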
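
The hint in the error (`it->second.size() <= 1UL`, received 2) reads like a single-variable accessor being applied to an input slot that holds a list. The toy model below only illustrates that distinction in Python. It is not Paddle code: `input_var`, `multi_input_var`, and the dictionary layout are hypothetical names chosen for this sketch.

```python
from typing import Dict, List, Optional


def input_var(inputs: Dict[str, List[str]], name: str) -> Optional[str]:
    """Single-variable accessor: rejects slots that hold more than one variable."""
    variables = inputs.get(name, [])
    if len(variables) > 1:
        raise ValueError(
            f"input {name} should contain only one variable, "
            f"received {len(variables)}"
        )
    return variables[0] if variables else None


def multi_input_var(inputs: Dict[str, List[str]], name: str) -> List[str]:
    """List accessor: returns every variable bound to the slot."""
    return list(inputs.get(name, []))


# sparse_concat binds a list of tensors to its single input slot "x".
op_inputs = {"x": ["coo_a", "coo_b"]}
all_inputs = multi_input_var(op_inputs, "x")  # works for list-valued slots

try:
    input_var(op_inputs, "x")  # mirrors the failing check
    failure = None
except ValueError as exc:
    failure = str(exc)
```

Under this reading, any helper that inspects slot `x` through the single-variable path would fail for a list input, regardless of which caller triggered it.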

@luotao1 luotao1 added the API label May 23, 2023
@zyfncg
Contributor

zyfncg commented May 24, 2023

Calling sparse_concat in static-graph mode raises the following error: File "C:\Users\desig\AppData\Local\Programs\Python\Python310\lib\site-packages\paddle\fluid\framework.py", line 2793, in __init__ for frame in traceback.extract_stack():

InvalidArgumentError: Operator sparse_concat's input x should contain only one variable.
  [Hint: Expected it->second.size() <= 1UL, but received it->second.size():2 > 1UL:1.] (at ..\paddle\fluid\framework\operator.cc:1129)
  [operator < sparse_concat > error]

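Judging from the log output, the failure occurs in paddle\fluid\framework\operator.cc, in function GetExpectedPhiKernelArgs, at the statement return (*arg_map_fn_)(arg_mapping_ctx);. *arg_map_fn_ is the function SparseConcatOpArgumentMapping in paddle\phi\ops\compat\generated_sparse_sig.cc, but the error is raised before that function is ever called.

Where is the failing function (InputVar in paddle\fluid\framework\operator.cc) called from? Is there a good way to approach this problem?

Pinning down where InputVar is called requires adding some LOG statements. It is possible that execution has already entered SparseConcatOpArgumentMapping and called a function such as IsSparseCooTensorInput.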

@lijingkai2023
Author

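The error location described above was itself found by adding LOG statements.
I added LOG statements at the start of both SparseConcatOpArgumentMapping and IsSparseCooTensorInput, but nothing was printed, so I concluded that the error is raised before SparseConcatOpArgumentMapping is entered.
I have no further ideas on how to track the problem down from here.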

@paddle-ci-bot

paddle-ci-bot bot commented May 29, 2023

Sorry to inform you that 63bbbc7's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.

@zhwesky2010
Contributor

zhwesky2010 commented Jun 1, 2023

@lijingkai2023 Hi, since the CPU kernel already runs, the GPU failure should be a problem inside the kernel itself. What is the exact error?

@luotao1
Contributor

luotao1 commented Jun 2, 2023

@zhwesky2010 The exact errors are as follows:

GPU error:
--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   sparse::concat_ad_func(std::vector<paddle::Tensor, std::allocator<paddle::Tensor> > const&, paddle::experimental::ScalarBase<paddle::Tensor>)
1   paddle::experimental::sparse::concat(std::vector<paddle::Tensor, std::allocator<paddle::Tensor> > const&, paddle::experimental::ScalarBase<paddle::Tensor> const&)
2   phi::KernelImpl<void (*)(phi::GPUContext const&, std::vector<phi::SparseCooTensor const*, std::allocator<phi::SparseCooTensor const*> > const&, paddle::experimental::ScalarBase<phi::DenseTensor> const&, phi::SparseCooTensor*), &(void phi::sparse::ConcatCooKernel<float, phi::GPUContext>(phi::GPUContext const&, std::vector<phi::SparseCooTensor const*, std::allocator<phi::SparseCooTensor const*> > const&, paddle::experimental::ScalarBase<phi::DenseTensor> const&, phi::SparseCooTensor*))>::Compute(phi::KernelContext*)
3   void phi::sparse::ConcatCooKernel<float, phi::GPUContext>(phi::GPUContext const&, std::vector<phi::SparseCooTensor const*, std::allocator<phi::SparseCooTensor const*> > const&, paddle::experimental::ScalarBase<phi::DenseTensor> const&, phi::SparseCooTensor*)
4   phi::DDim::CopyFrom(phi::DDim const&)
 
----------------------
Error Message Summary:
----------------------
FatalError: `Segmentation fault` is detected by the operating system.
  [TimeInfo: *** Aborted at 1684316520 (unix time) try "date -d @1684316520" if you are using GNU date ***]
  [SignalInfo: *** SIGSEGV (@0x48) received by PID 7618 (TID 0x7f8f715d8700) from PID 72 ***]
 
 
Static-graph call error:
File "C:\Users\desig\AppData\Local\Programs\Python\Python310\lib\site-packages\paddle\fluid\framework.py", line 2793, in __init__
for frame in traceback.extract_stack():

InvalidArgumentError: Operator sparse_concat's input x should contain only one variable.
  [Hint: Expected it->second.size() <= 1UL, but received it->second.size():2 > 1UL:1.] (at ..\paddle\fluid\framework\operator.cc:1129)
  [operator < sparse_concat > error]
For a detailed description, see the question I raised in this PR.

@lijingkai2023
Author

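Yes, execution does enter the function; the LOG statements I added were printed.
I had tried several times before without getting any log output, so I must have made a mistake somewhere.
Thanks!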

@zhwesky2010
Contributor

zhwesky2010 commented Jun 2, 2023

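@lijingkai2023 Hi, the static-graph error is caused by some remaining issues in the current code-generation mechanism. For the sparse operator tasks it is acceptable to support only dynamic-graph mode for now, because the operators are shared between dynamic and static graphs. The static-graph unit tests can also be skipped for the time being.

The dynamic-graph YAML generation looks done, since the CPU kernel runs. The GPU failure is purely a kernel problem and is unrelated to static graph. Specifically, DDim's CopyFrom function appears to have triggered an out-of-bounds access that caused the segmentation fault, so the ConcatCooKernel function still needs to be fixed, which falls within the scope of this task.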
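
The thread does not state the root cause of the out-of-bounds access. As a general illustration of the guard checks a concat kernel typically performs before touching any shape data, here is a minimal Python sketch. The function name `check_concat_shapes` and the error messages are hypothetical, and the actual fix in ConcatCooKernel may look quite different.

```python
from typing import Sequence, Tuple


def check_concat_shapes(
    shapes: Sequence[Tuple[int, ...]], axis: int
) -> Tuple[int, Tuple[int, ...]]:
    """Validate concat inputs; return (normalized axis, output shape)."""
    if not shapes:
        raise ValueError("concat needs at least one input")
    rank = len(shapes[0])
    if rank == 0:
        raise ValueError("cannot concatenate 0-dimensional inputs")
    if not -rank <= axis < rank:
        raise ValueError(f"axis {axis} is out of range for rank {rank}")
    if axis < 0:
        axis += rank  # map negative axes onto [0, rank)

    total = 0
    for i, shape in enumerate(shapes):
        if len(shape) != rank:
            raise ValueError(f"input {i} has rank {len(shape)}, expected {rank}")
        for d in range(rank):
            # Every dimension except the concat axis must agree.
            if d != axis and shape[d] != shapes[0][d]:
                raise ValueError(f"input {i} differs from input 0 in dim {d}")
        total += shape[axis]

    out_shape = list(shapes[0])
    out_shape[axis] = total
    return axis, tuple(out_shape)


norm_axis, out_shape = check_concat_shapes([(2, 3, 4)] * 3, axis=-1)
```

Failing early with a descriptive error keeps a bad axis or an empty input list from turning into an invalid memory access later in the kernel.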

@lijingkai2023
Author

OK.
I am fixing the GPU kernel now.

self.check_result(i, [2, 3, 4, 2, 3, 4, 2, 3, 4], j + 1, 'coo')


# class TestSparseConcatStatic(unittest.TestCase):
Contributor

The static-graph unit tests can be ignored for now.

Author

OK.

}

template <typename T, typename Context>
void ConcatCooKernel(const Context &dev_ctx,
Contributor

@zhwesky2010 zhwesky2010 Jun 5, 2023

Specifically, the out-of-bounds access error is triggered inside this function.

Author

Fixed.

@lijingkai2023 lijingkai2023 force-pushed the develop branch 2 times, most recently from 1a99070 to 0b1086b Compare June 5, 2023 03:53
@paddle-bot

paddle-bot bot commented Jun 5, 2023

Sorry to inform you that, after repeated discussion, your PR does not yet meet the merging standard (Reference: Paddle Custom Operator Design Doc). We are closing this PR for now; you can submit a new one. Thank you for your contribution.

6 participants