Add "enable_tensor_checker" and "disable_tensor_checker" to api list #52936

Merged
merged 20 commits into from
Apr 24, 2023

Conversation

Contributor

@AnnaTrainingG commented Apr 14, 2023

PR types

Others

PR changes

APIs

Description

Add the APIs to the API list.
Rendered result:
[image]

paddle-bot bot commented Apr 14, 2023

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@AnnaTrainingG changed the title from "Add enable_check" to "Add "enable_tensor_checker" and "disable_tensor_checker" to api list" on Apr 14, 2023
Contributor

@Xreki left a comment

#51906 (review) still has some review suggestions related to the call stack feature. Will those be addressed in the call stack PR?

@@ -2673,6 +2673,14 @@ All parameter, weight, gradient are variables in Paddle.
m.def("set_nan_inf_debug_path",
&paddle::framework::details::SetNanInfDebugPath);

// Add skipped op list
m.def("set_checked_op_list",
Contributor

Later, consider whether to create a new debugging.cc to collect these interfaces in one place. Too many functions have been added piecemeal to pybind.cc, which makes it hard for others to tell what each function is for.

Alternatively, set all the configuration parameters through a single function.

Contributor Author

ok

Contributor

Just make the change directly in this PR.

Contributor Author

Fixed.

]


class DebugMode(Enum):
"""
DebugMode is used to present the state of TensorCheckerConfig
Contributor

The sentence needs to end with a period.

Contributor Author

Fixed.

@@ -220,16 +243,16 @@ def _set_env(self, check_flag):
else:
raise ValueError("stack_height_limit must be int")

def check(self):
def check_step_id(self):
Contributor

update_step_id

Contributor Author

It is mainly used to check whether the step is valid; it returns True or False.

.replace(",", " ")
.split(" ")
)
for err_str in err_str_list:
Contributor

This unit test looks like it can never fail. A unit test has to guarantee correctness, so that nobody can break the original code unnoticed. Every unit test should check that op_type, num_nan, and num_inf match the expected values.

Contributor Author

Added asserts.

num_nan = int(err_str.split("=")[1])
elif "num_inf" in err_str:
num_inf = int(err_str.split("=")[1])
assert 0 == num_inf
Contributor

Use self.assertEqual in unit tests.

Contributor Author

Fixed.

[0.2, -1, 0.5], place=paddle.CPUPlace(), dtype='float32'
)
try:
res = paddle.pow(x, y)
Contributor

After this unit test was moved out, check_nan_inf is no longer enabled.

Contributor Author

This is there for the assert that follows.

)
# check seed
assert checker_config.initial_seed == 102
assert checker_config.seed == 102
Contributor

Use self.assertEqual consistently for checks in unit tests.

Contributor Author

Fixed.


x = paddle.to_tensor(
[0, 0, 0],
place=paddle.CPUPlace(),
Contributor

The unit tests also need to cover GPU.

Contributor Author

Added.

)
paddle.amp.debugging.enable_tensor_checker(checker_config)
try:
res = paddle.divide(y, x)
Contributor

The unit tests are all too simplistic. There need to be test cases covering forward, backward, and the optimizer, with level=3 checking enabled, to make sure no operators are missed.

Contributor Author

Fixed.

if "num_nan" in err_str:
num_nan = int(err_str.split("=")[1])
elif "num_inf" in err_str:
num_inf = int(err_str.split("=")[1])
Contributor

Consider factoring the duplicated code into a function.

Contributor Author

Fixed.
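
The review points in this part of the thread (check op_type, num_nan, and num_inf; use self.assertEqual; factor the repeated parsing into a function) could be combined roughly as below. This is only a sketch: `parse_checker_log` and `TestCheckerLog` are hypothetical names, not code from the PR, and the sample line is the checker output quoted elsewhere in this conversation.

```python
import re
import unittest

# Sample line copied from the checker output quoted in this PR.
SAMPLE_LOG = (
    "[PRECISION] [ERROR] in [device=cpu, op=elementwise_pow_grad, tensor=, "
    "dtype=fp32], numel=3, num_nan=1, num_inf=0, num_zero=0, "
    "max=2.886751e-01, min=2.000000e-01, mean=-nan"
)


def parse_checker_log(line):
    """Hypothetical helper: extract (op_type, num_nan, num_inf) from one log line."""
    # Collect every key=value pair; a value ends at a comma, "]" or whitespace.
    fields = dict(re.findall(r"(\w+)=([^,\]\s]*)", line))
    return fields["op"], int(fields["num_nan"]), int(fields["num_inf"])


class TestCheckerLog(unittest.TestCase):
    def test_counts_match_expectation(self):
        op_type, num_nan, num_inf = parse_checker_log(SAMPLE_LOG)
        # assertEqual reports both values on failure, unlike a bare assert.
        self.assertEqual(op_type, "elementwise_pow_grad")
        self.assertEqual(num_nan, 1)
        self.assertEqual(num_inf, 0)
```

Keeping the parsing in one helper means each test case only states its expected values, so a change in the log format has to be fixed in a single place.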

@AnnaTrainingG
Contributor Author

The suggestions from the call stack PR are addressed in this PR.

jzhang533 previously approved these changes Apr 19, 2023
Contributor

@jzhang533 left a comment

LGTM

Contributor

@Xreki left a comment

Please paste a preview of the new API docs into the PR description.

"""
DebugMode is used to present the state of TensorCheckerConfig.

The meaning of each DebugMode is as following
Contributor

  • Sentences need ending punctuation; the same applies below.
  • The English wording is not quite accurate. Maybe polish it with ChatGPT?

Contributor Author

Fixed.

@@ -47,14 +85,16 @@ class TensorCheckerConfig:
* enable: Whether to enable Tensor's value detection function. The default value is False, which means that these tools will never be used.
Contributor

The documentation format is wrong.

Contributor Author

Fixed.

@@ -67,29 +107,31 @@ class TensorCheckerConfig:
* enable_traceback_filtering: Whether to filter the traceback. The main purpose is to filter out the internal code call stack of the framework and only display the user code call stack
Contributor

  • The stack_height_limit docs should state what is currently supported and how to turn off the call stack.
  • enable_traceback_filtering is not supported yet, so remove it as well.

Contributor Author

Fixed.

res = paddle.pow(x, y)
paddle.autograd.backward(res, retain_graph=True)
paddle.amp.debugging.disable_tensor_checker()
#[PRECISION] [ERROR] in [device=cpu, op=elementwise_pow_grad, tensor=, dtype=fp32], numel=3, num_nan=1, num_inf=0, num_zero=0, max=2.886751e-01, min=2.000000e-01, mean=-nan
Contributor

Include the call stack as a comment as well.

Contributor Author

Added.


if self.enable:
self._set_seed(self.enable)

def keep_random(self, seed, flag):
def _keep_random(self, seed, flag):
Contributor

I left some review suggestions about function names on the previous PR; please check them again.

Contributor Author

Fixed. Without the leading _, the static check treats the function as exposed to users by default.

)
else:
raise ValueError("stack_height_limit must be int")

def check(self):
def _check_step_id(self):
Contributor

  • A leading _ marks a function as private by Python convention; functions that need to be called from outside the class should not have a leading _.
  • Suggested function name: update_and_check_step_id

Contributor Author

Fixed.
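
To make the naming discussion concrete, here is a minimal sketch of a public method that both advances the step counter and reports whether the current step falls inside debug_step. `StepWindow` is a hypothetical stand-in, not the PR's TensorCheckerConfig, and two details are assumptions: steps are counted from 1, and both ends of debug_step are inclusive (the docstring's "iterations 1 to 5" does not settle this).

```python
class StepWindow:
    """Hypothetical stand-in for the step bookkeeping discussed in this review."""

    def __init__(self, debug_step=None):
        if debug_step is not None:
            if not isinstance(debug_step, (list, tuple)) or len(debug_step) != 2:
                raise ValueError("debug_step must be a list or tuple of two ints")
            start, end = debug_step
            if start > end:
                raise ValueError("debug_step start must not exceed its end")
        self.debug_step = debug_step
        self.current_step = 0

    def update_and_check_step_id(self):
        """Advance the step counter and return True if this step should be checked."""
        self.current_step += 1
        if self.debug_step is None:
            # No window configured: check every step.
            return True
        start, end = self.debug_step
        # Assumption: both ends of the window are inclusive.
        return start <= self.current_step <= end


window = StepWindow(debug_step=[1, 5])
flags = [window.update_and_check_step_id() for _ in range(7)]
```

With these assumptions, `flags` is True for steps 1 through 5 and False for steps 6 and 7. The method has no leading underscore because it is meant to be called from outside the class, which is the point the reviewer raised.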

@@ -411,30 +437,34 @@ def enable_tensor_checker(checker_config):
"""
enable_tensor_checker(checker_config) is enables model level accuracy checking, which is used together with disables_tensor_checker() to achieve model level precision checking through the combination of these two APIs, checking the output Tensors of all operators within the specified range.


Attention:
Contributor

Attention -> Note

Contributor Author

Fixed.

@@ -411,30 +437,34 @@ def enable_tensor_checker(checker_config):
"""
enable_tensor_checker(checker_config) is enables model level accuracy checking, which is used together with disables_tensor_checker() to achieve model level precision checking through the combination of these two APIs, checking the output Tensors of all operators within the specified range.
Contributor

"is enables" is a grammatical error.

Contributor Author

Fixed.

@@ -2673,6 +2673,14 @@ All parameter, weight, gradient are variables in Paddle.
m.def("set_nan_inf_debug_path",
&paddle::framework::details::SetNanInfDebugPath);

// Add skipped op list
m.def("set_checked_op_list",
Contributor

Just make the change directly in this PR.

Contributor

@sunzhongkai588 left a comment

Please double-check whether DebugMode needs an example.


Attention:
Note:
Contributor

Do not leave a blank line below Note:, otherwise it will not be parsed.
[image]

Contributor Author

Fixed.


Attention:
Note:
Contributor

Do not leave a blank line below Note:, otherwise it will not be parsed.

Contributor Author

Fixed.


Args:
* enable: Whether to enable Tensor's value detection function. The default value is False, which means that these tools will never be used.
enable: A boolean value indicating whether to enable the detection of NaN and Inf values in tensors. The default value is False, which means that these tools will not be used.
Contributor

The parameters still need to be preceded by Args:. Please follow the template (paddle.add) and mind the indentation.

Args:
        x (Tensor): The input tensor, it's data type should be float32, float64, int32, int64.
        y (Tensor): The input tensor, it's data type should be float32, float64, int32, int64.
        name (str, optional): For details, please refer to :ref:`api_guide_Name`. Generally, no setting i

If a parameter needs further explanation, such as the four modes of debug_mode, it can be indented one level further.

Contributor Author

Fixed.

@AnnaTrainingG
Contributor Author


debug_step: A list or tuple used primarily for nan/inf checking during model training. For example, debug_step=[1,5] indicates that nan/inf checking should only be performed on model training iterations 1 to 5.
debug_step: A list or tuple used primarily for nan/inf checking during model training. For example, debug_step=[1,5] indicates that nan/inf checking should only be performed on model training iterations 1 to 5.

stack_height_limit: An integer value specifying the maximum depth of the call stack. This feature supports printing the call stack at the error location. Currently, only enabling or disabling call stack printing is supported. If you want to print the corresponding C++ call stack when NaN is detected in GPU Kernel, set stack_height_limit to 1, otherwise set it to 0.
Contributor

Align this parameter's indentation with the one above (debug_step) as well.

Contributor Author

Fixed.


checker_config = paddle.amp.debugging.TensorCheckerConfig(enable=True, debug_mode=DebugMode.CHECK_NAN_INF_AND_ABORT)
paddle.amp.debugging.enable_tensor_checker(checker_config)
.. code-block:: python
Contributor

A blank line is required below .. code-block:: python, otherwise it will be parsed incorrectly.
[image]

Contributor Author

Fixed.

sunzhongkai588 previously approved these changes Apr 21, 2023
Contributor

@sunzhongkai588 left a comment

LGTM


Args:
* enable: Whether to enable Tensor's value detection function. The default value is False, which means that these tools will never be used.
enable: A boolean value indicating whether to enable the detection of NaN and Inf values in tensors. The default value is False, which means that these tools will not be used.
debug_mode: A parameter that determines the type of debugging to be used. There are 4 available modes:
Contributor

[image]

The format here is still wrong. There is no need to expand on what each DebugMode setting does here.

Contributor Author

Fixed.


Args:
* enable: Whether to enable Tensor's value detection function. The default value is False, which means that these tools will never be used.
enable: A boolean value indicating whether to enable the detection of NaN and Inf values in tensors. The default value is False, which means that these tools will not be used.
Contributor

The format for parameter descriptions is:

  • No default value, e.g. enable (bool)
  • With a default value, e.g. debug_mode (DebugMode, optional), and add "Default is DebugMode.CHECK_NAN_INF_AND_ABORT." to the explanation that follows.

Contributor Author

Fixed.
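
The docstring conventions requested across this review (no blank line directly under Note:, an Args: header with `name (type)` or `name (type, optional)` entries, a stated default, and a blank line after `.. code-block:: python`) might look like this when put together. This is an illustrative sketch only: `make_checker_config` is a hypothetical helper, and the DebugMode stand-in defines just the one member named in the review, with an assumed value of 0.

```python
from enum import Enum


class DebugMode(Enum):
    """Minimal stand-in; only the member named in the review is defined."""

    CHECK_NAN_INF_AND_ABORT = 0  # assumed value, for illustration only


def make_checker_config(enable, debug_mode=DebugMode.CHECK_NAN_INF_AND_ABORT):
    """
    Collect tensor checker settings into a plain dict.

    Note:
        This helper is hypothetical and only illustrates the docstring layout.

    Args:
        enable (bool): Whether to enable the detection of NaN and Inf values.
        debug_mode (DebugMode, optional): The debugging mode to use. Default is DebugMode.CHECK_NAN_INF_AND_ABORT.

    Returns:
        dict: The collected settings.

    Examples:
        .. code-block:: python

            config = make_checker_config(True)
    """
    return {"enable": enable, "debug_mode": debug_mode}
```

The required parameter carries only its type, while the optional one adds `optional` and spells out its default in the description, matching the two cases the reviewer listed.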


def __init__(
self,
enable,
debug_mode=DebugMode.CHECK_NAN_INF_AND_ABORT,
dump_dir=None,
output_dir=None,
checked_op_list=None,
skipped_op_list=None,
debug_step=None,
stack_height_limit=3,
Contributor

Change the default value to 1.

Contributor Author

Fixed.

if self.initial_seed != self.seed:
self.seed = self.initial_seed

if self.seed > 4294967295 or self.seed < 0:
Contributor

Where does the number 4294967295 come from? It would be better to call the corresponding function directly here.

Contributor Author

Fixed.
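
For context, 4294967295 equals 2**32 - 1, the largest value an unsigned 32-bit integer can hold. One way to avoid the bare literal is sketched below; `UINT32_MAX` and `validate_seed` are hypothetical names, and this is not necessarily how the PR resolved the comment. NumPy's `np.iinfo(np.uint32).max` yields the same value if a library call is preferred.

```python
# Largest unsigned 32-bit integer, derived instead of hard-coded.
UINT32_MAX = (1 << 32) - 1


def validate_seed(seed):
    """Hypothetical check: return the seed if it fits in an unsigned 32-bit integer."""
    if isinstance(seed, bool) or not isinstance(seed, int):
        raise TypeError("seed must be an int")
    if seed < 0 or seed > UINT32_MAX:
        raise ValueError(f"seed must be in [0, {UINT32_MAX}], got {seed}")
    return seed
```

Naming the bound documents where the limit comes from, so a reader does not have to recognize the literal.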

Contributor

@Xreki left a comment

LGTM! Great work~

Contributor

@sunzhongkai588 left a comment

LGTM

@AnnaTrainingG AnnaTrainingG merged commit 4113871 into PaddlePaddle:develop Apr 24, 2023
lanxianghit pushed a commit that referenced this pull request Apr 25, 2023
… api list (#52936) (#53287)

Add enable_tensor_checker, disable_tensor_checker API (#52936)