Add "enable_tensor_checker" and "disable_tensor_checker" to api list #52936
Your PR was submitted successfully. Thank you for contributing to this open-source project!
#51906 (review) also contains some review suggestions related to the call-stack feature. Will those be addressed in the call-stack PR?
@@ -2673,6 +2673,14 @@ All parameter, weight, gradient are variables in Paddle.
  m.def("set_nan_inf_debug_path",
        &paddle::framework::details::SetNanInfDebugPath);

  // Add skipped op list
  m.def("set_checked_op_list",

Later on, consider creating a new debugging.cc to collect these interfaces in one place. Too many functions have been added piecemeal to pybind.cc, and it is hard for others to tell what each of them is for.
Alternatively, set all configuration parameters through a single function.
ok
Just make the change directly in this PR.
Fixed.
python/paddle/amp/debugging.py
Outdated
]


class DebugMode(Enum):
    """
    DebugMode is used to present the state of TensorCheckerConfig

The sentence needs to end with a period (`.`).
Fixed.
python/paddle/amp/debugging.py
Outdated
@@ -220,16 +243,16 @@ def _set_env(self, check_flag):
        else:
            raise ValueError("stack_height_limit must be int")

    def check(self):
    def check_step_id(self):

update_step_id
This function is mainly used to check whether the step id is valid; it returns True or False.
    .replace(",", " ")
    .split(" ")
)
for err_str in err_str_list:

This unit test looks like it can never fail. Unit tests exist to guarantee correctness and to keep others from breaking the existing code. Every unit test needs to check that op_type, num_nan, and num_inf match the expected values.
Assertions have been added.
    num_nan = int(err_str.split("=")[1])
elif "num_inf" in err_str:
    num_inf = int(err_str.split("=")[1])
assert 0 == num_inf

Use self.assertEqual in unit tests.
Fixed.
    [0.2, -1, 0.5], place=paddle.CPUPlace(), dtype='float32'
)
try:
    res = paddle.pow(x, y)

After this unit test was moved out, check_nan_inf is no longer enabled, is it?
This is there for the assert that follows.
)
# check seed
assert checker_config.initial_seed == 102
assert checker_config.seed == 102

Use self.assertEqual consistently for checks in unit tests.
Fixed.
x = paddle.to_tensor(
    [0, 0, 0],
    place=paddle.CPUPlace(),

The unit tests also need to cover GPU.
Added.
)
paddle.amp.debugging.enable_tensor_checker(checker_config)
try:
    res = paddle.divide(y, x)

The unit tests are all too simple. There need to be test cases that cover the forward pass, the backward pass, and the optimizer, with level=3 checking enabled, to make sure no operators are missed.
Fixed.
if "num_nan" in err_str:
    num_nan = int(err_str.split("=")[1])
elif "num_inf" in err_str:
    num_inf = int(err_str.split("=")[1])

Duplicated code should be wrapped in a function.
Fixed.
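The three test-related suggestions above (check op_type, num_nan and num_inf; use self.assertEqual; wrap the repeated parsing in a function) could be combined roughly as follows. This is only a sketch, not the code that was merged: `parse_checker_log` and `TestParseCheckerLog` are hypothetical names, and the sample line is the log output quoted later in this review, so the test runs without Paddle installed.

```python
import re
import unittest

# Sample line copied from the checker output quoted in this review.
SAMPLE_LOG = (
    "[PRECISION] [ERROR] in [device=cpu, op=elementwise_pow_grad, tensor=, "
    "dtype=fp32], numel=3, num_nan=1, num_inf=0, num_zero=0, "
    "max=2.886751e-01, min=2.000000e-01, mean=-nan"
)


def parse_checker_log(line):
    """Hypothetical helper: return (op_type, num_nan, num_inf) from one log line."""
    op_match = re.search(r"\bop=([^,\]\s]+)", line)
    nan_match = re.search(r"\bnum_nan=(\d+)", line)
    inf_match = re.search(r"\bnum_inf=(\d+)", line)
    if not (op_match and nan_match and inf_match):
        # Fail loudly so a changed log format cannot pass silently.
        raise ValueError(f"unrecognized checker log line: {line!r}")
    return op_match.group(1), int(nan_match.group(1)), int(inf_match.group(1))


class TestParseCheckerLog(unittest.TestCase):
    def test_sample_line(self):
        op_type, num_nan, num_inf = parse_checker_log(SAMPLE_LOG)
        self.assertEqual(op_type, "elementwise_pow_grad")
        self.assertEqual(num_nan, 1)
        self.assertEqual(num_inf, 0)

    def test_malformed_line_fails_loudly(self):
        with self.assertRaises(ValueError):
            parse_checker_log("no counters here")
```

Raising on an unrecognized line is a deliberate choice in this sketch: a test that silently skips lines it cannot parse is exactly the kind that "can never fail".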
Regarding the call-stack PR: make those changes in this PR.
6c714f5 to f9837ba
LGTM
Please paste a preview of the new API documentation into the PR description.
python/paddle/amp/debugging.py
Outdated
    """
    DebugMode is used to present the state of TensorCheckerConfig.

    The meaning of each DebugMode is as following

- Sentences need to end with punctuation; the same applies below.
- The English wording is not quite accurate; maybe polish it with ChatGPT?
Fixed.
python/paddle/amp/debugging.py
Outdated
@@ -47,14 +85,16 @@ class TensorCheckerConfig:
    * enable: Whether to enable Tensor's value detection function. The default value is False, which means that these tools will never be used.

The documentation format is wrong.
Fixed.
python/paddle/amp/debugging.py
Outdated
@@ -67,29 +107,31 @@ class TensorCheckerConfig:
    * enable_traceback_filtering: Whether to filter the traceback. The main purpose is to filter out the internal code call stack of the framework and only display the user code call stack

The stack_height_limit documentation should state what is currently supported and how to turn off the call stack.
enable_traceback_filtering is not supported yet, so remove it as well.
Fixed.
python/paddle/amp/debugging.py
Outdated
res = paddle.pow(x, y)
paddle.autograd.backward(res, retain_graph=True)
paddle.amp.debugging.disable_tensor_checker()
#[PRECISION] [ERROR] in [device=cpu, op=elementwise_pow_grad, tensor=, dtype=fp32], numel=3, num_nan=1, num_inf=0, num_zero=0, max=2.886751e-01, min=2.000000e-01, mean=-nan

Also show the call stack as a comment.
Added.
python/paddle/amp/debugging.py
Outdated
        if self.enable:
            self._set_seed(self.enable)

    def keep_random(self, seed, flag):
    def _keep_random(self, seed, flag):

Regarding the function names, I left some review suggestions on the previous PR; please check them again.
Fixed. Without the leading `_`, the static check treats the function as exposed to users by default.
python/paddle/amp/debugging.py
Outdated
            )
        else:
            raise ValueError("stack_height_limit must be int")

    def check(self):
    def _check_step_id(self):

- A leading `_` marks a function as private by convention in Python. Do not add `_` to functions that need to be called from outside the class.
- Suggested function name: update_and_check_step_id
Fixed.
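To illustrate the naming advice above, here is a minimal sketch of a class whose step bookkeeping is private while the method called from outside is public. `StepChecker` is a hypothetical stand-in, not Paddle's TensorCheckerConfig. The range semantics are also an assumption: the sketch reads `debug_step=[1, 5]` as the inclusive range of steps 1 to 5 and counts steps from 0, which the review does not confirm.

```python
class StepChecker:
    """Hypothetical stand-in for the step-tracking part of a checker config."""

    def __init__(self, debug_step=None):
        # Assumption: debug_step=[start, end] is an inclusive range of step ids.
        if debug_step is None:
            self._start_step = self._end_step = None
        else:
            if not isinstance(debug_step, (list, tuple)) or len(debug_step) != 2:
                raise ValueError("debug_step must be a list or tuple of two ints")
            start, end = debug_step
            if not (isinstance(start, int) and isinstance(end, int)):
                raise ValueError("debug_step entries must be int")
            if start > end:
                raise ValueError("debug_step must satisfy start <= end")
            self._start_step, self._end_step = start, end
        # Leading underscore: internal bookkeeping, not part of the public API.
        self._current_step_id = 0

    def update_and_check_step_id(self):
        """Advance the step counter and return True if this step should be checked.

        No leading underscore: this method is meant to be called from outside
        the class, once per training iteration.
        """
        step_id = self._current_step_id
        self._current_step_id += 1
        if self._start_step is None:
            return True  # no range configured, so check every step
        return self._start_step <= step_id <= self._end_step


checker = StepChecker(debug_step=[1, 5])
flags = [checker.update_and_check_step_id() for _ in range(7)]
```

The name says both things the method does (it updates the counter and reports a boolean), which is what the suggested `update_and_check_step_id` is getting at.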
python/paddle/amp/debugging.py
Outdated
@@ -411,30 +437,34 @@ def enable_tensor_checker(checker_config):
    """
    enable_tensor_checker(checker_config) is enables model level accuracy checking, which is used together with disables_tensor_checker() to achieve model level precision checking through the combination of these two APIs, checking the output Tensors of all operators within the specified range.


    Attention:

Attention -> Note
Fixed.
python/paddle/amp/debugging.py
Outdated
@@ -411,30 +437,34 @@ def enable_tensor_checker(checker_config):
    """
    enable_tensor_checker(checker_config) is enables model level accuracy checking, which is used together with disables_tensor_checker() to achieve model level precision checking through the combination of these two APIs, checking the output Tensors of all operators within the specified range.

"is enables" is a grammatical error.
Fixed.
Please double-check whether DebugMode needs an example.
python/paddle/amp/debugging.py
Outdated
    Attention:
    Note:

Fixed.
    Attention:
    Note:

Do not leave a blank line after `Note:`; otherwise it cannot be parsed.
Fixed.
python/paddle/amp/debugging.py
Outdated
    Args:
    * enable: Whether to enable Tensor's value detection function. The default value is False, which means that these tools will never be used.
    enable: A boolean value indicating whether to enable the detection of NaN and Inf values in tensors. The default value is False, which means that these tools will not be used.

The parameters still need to be preceded by `Args:`. Please follow the template (paddle.add) and pay attention to the indentation.
Args:
    x (Tensor): The input tensor, it's data type should be float32, float64, int32, int64.
    y (Tensor): The input tensor, it's data type should be float32, float64, int32, int64.
    name (str, optional): For details, please refer to :ref:`api_guide_Name`. Generally, no setting i
If a parameter needs further explanation, such as the four modes of debug_mode, that explanation can be indented one more level.
Fixed.
python/paddle/amp/debugging.py
Outdated
debug_step: A list or tuple used primarily for nan/inf checking during model training. For example, debug_step=[1,5] indicates that nan/inf checking should only be performed on model training iterations 1 to 5.
debug_step: A list or tuple used primarily for nan/inf checking during model training. For example, debug_step=[1,5] indicates that nan/inf checking should only be performed on model training iterations 1 to 5.

stack_height_limit: An integer value specifying the maximum depth of the call stack. This feature supports printing the call stack at the error location. Currently, only enabling or disabling call stack printing is supported. If you want to print the corresponding C++ call stack when NaN is detected in GPU Kernel, set stack_height_limit to 1, otherwise set it to 0.

Align the indentation of this parameter with the one above (debug_step) as well.
Fixed.
checker_config = paddle.amp.debugging.TensorCheckerConfig(enable=True, debug_mode=DebugMode.CHECK_NAN_INF_AND_ABORT)
paddle.amp.debugging.enable_tensor_checker(checker_config)
.. code-block:: python

Fixed.
LGTM
python/paddle/amp/debugging.py
Outdated
    Args:
    * enable: Whether to enable Tensor's value detection function. The default value is False, which means that these tools will never be used.
    enable: A boolean value indicating whether to enable the detection of NaN and Inf values in tensors. The default value is False, which means that these tools will not be used.
    debug_mode: A parameter that determines the type of debugging to be used. There are 4 available modes:

Fixed.
python/paddle/amp/debugging.py
Outdated
    Args:
    * enable: Whether to enable Tensor's value detection function. The default value is False, which means that these tools will never be used.
    enable: A boolean value indicating whether to enable the detection of NaN and Inf values in tensors. The default value is False, which means that these tools will not be used.

The format for parameter descriptions is:
- No default value, e.g. enable (bool)
- With a default value, e.g. debug_mode (DebugMode, optional), and add "Default is DebugMode.CHECK_NAN_INF_AND_ABORT." to the explanation that follows.
Fixed.
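The docstring conventions requested in the comments above (an `Args:` header, `name (type)` for required parameters, `name (type, optional)` plus a "Default is ..." sentence for parameters with defaults, and one extra indentation level for per-parameter detail) might look like the following. This is an illustrative sketch only: `Mode` and `configure` are hypothetical names, not part of Paddle, and the exact nesting that Paddle's documentation tooling expects is an assumption here.

```python
from enum import Enum


class Mode(Enum):
    """Hypothetical enum standing in for DebugMode."""

    CHECK_AND_ABORT = 0
    CHECK_ONLY = 1


def configure(enable, debug_mode=Mode.CHECK_AND_ABORT, name=None):
    """
    Build a plain dict describing a checker configuration.

    Args:
        enable (bool): Whether to enable checking.
        debug_mode (Mode, optional): How to react when a check fails. Default is Mode.CHECK_AND_ABORT.

            - Mode.CHECK_AND_ABORT: Stop at the first failure.
            - Mode.CHECK_ONLY: Report failures and continue.

        name (str, optional): A label for this configuration. Default is None.

    Returns:
        dict: The configuration values keyed by parameter name.
    """
    return {"enable": bool(enable), "debug_mode": debug_mode, "name": name}
```

Note that `enable` has no default in the signature, so it is documented without `optional`, while the two keyword parameters state their defaults explicitly.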
python/paddle/amp/debugging.py
Outdated
def __init__(
    self,
    enable,
    debug_mode=DebugMode.CHECK_NAN_INF_AND_ABORT,
    dump_dir=None,
    output_dir=None,
    checked_op_list=None,
    skipped_op_list=None,
    debug_step=None,
    stack_height_limit=3,

Change the default value to 1.
Fixed.
python/paddle/amp/debugging.py
Outdated
if self.initial_seed != self.seed:
    self.seed = self.initial_seed

if self.seed > 4294967295 or self.seed < 0:

Where does the number 4294967295 come from? It would be better to call the corresponding function here directly.
Fixed.
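For context on the question above: 4294967295 is 2**32 - 1, the largest value an unsigned 32-bit integer can hold. The review does not show which function the author ended up calling, so the following is only a sketch of one way to derive the bound instead of hard-coding it, using the standard library. `UINT32_MAX` and `validate_seed` are hypothetical names.

```python
import ctypes

# Largest unsigned 32-bit value, derived rather than written as a literal.
# ctypes integer types wrap on overflow, so -1 becomes 2**32 - 1.
UINT32_MAX = ctypes.c_uint32(-1).value


def validate_seed(seed):
    """Hypothetical helper: reject seeds outside the unsigned 32-bit range."""
    if not isinstance(seed, int) or seed < 0 or seed > UINT32_MAX:
        raise ValueError(f"seed must be an int in [0, {UINT32_MAX}], got {seed!r}")
    return seed
```

In code that already depends on NumPy, `numpy.iinfo(numpy.uint32).max` gives the same bound and states the intent just as clearly.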
LGTM! Great work~
LGTM
PR types
Others
PR changes
APIs
Description
Add the APIs to the API list.
Rendered result: