Add "enable_tensor_checker" and "disable_tensor_checker" to api list #52936

AnnaTrainingG · 2023-04-14T10:43:54Z

PR types

Others

PR changes

APIs

Description

Add the APIs to the API list.
Rendered result:

paddle-bot · 2023-04-14T10:43:58Z

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

Xreki

#51906 (review) still has some review suggestions related to the call-stack feature. Will those be addressed in the call-stack PR?

Xreki · 2023-04-14T11:08:37Z

paddle/fluid/pybind/pybind.cc

@@ -2673,6 +2673,14 @@ All parameter, weight, gradient are variables in Paddle.
  m.def("set_nan_inf_debug_path",
        &paddle::framework::details::SetNanInfDebugPath);

+  // Add skipped op list
+  m.def("set_checked_op_list",


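Later, consider whether to create a new debugging.cc that collects these interfaces in one place. Too many functions have been added piecemeal to pybind.cc, which makes it hard for others to tell what each function is for.

Alternatively, set all configuration parameters through a single function.

Let's just change it in this PR.

Done.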

Xreki · 2023-04-14T11:10:36Z

python/paddle/amp/debugging.py

 ]


 class DebugMode(Enum):
+    """
+    DebugMode is used to present the state of TensorCheckerConfig


The sentence needs a period at the end.

Xreki · 2023-04-14T11:11:59Z

python/paddle/amp/debugging.py

@@ -220,16 +243,16 @@ def _set_env(self, check_flag):
            else:
                raise ValueError("stack_height_limit must be int")

-    def check(self):
+    def check_step_id(self):


update_step_id

It is mainly used to check whether the step id is valid, and it returns True or False.

Xreki · 2023-04-14T11:15:48Z

python/paddle/fluid/tests/unittests/test_tensor_checker.py

+                .replace(",", " ")
+                .split(" ")
+            )
+            for err_str in err_str_list:


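This unit test looks like it can never fail. Unit tests need to guarantee correctness so that nobody can break the existing code later. Every unit test should check that op_type, num_nan, and num_inf match the expected values.

Added assert.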

Xreki · 2023-04-14T11:17:29Z

python/paddle/fluid/tests/unittests/test_tensor_checker.py

+                    num_nan = int(err_str.split("=")[1])
+                elif "num_inf" in err_str:
+                    num_inf = int(err_str.split("=")[1])
+            assert 0 == num_inf


Use self.assertEqual in unit tests.

Done.

Xreki · 2023-04-14T11:20:16Z

python/paddle/fluid/tests/unittests/test_tensor_checker.py

+            [0.2, -1, 0.5], place=paddle.CPUPlace(), dtype='float32'
+        )
+        try:
+            res = paddle.pow(x, y)


After this unit test was moved out, check_nan_inf is no longer enabled, is it?

This is there for the assert that follows.

Xreki · 2023-04-14T11:21:21Z

python/paddle/fluid/tests/unittests/test_tensor_checker.py

+        )
+        # check seed
+        assert checker_config.initial_seed == 102
+        assert checker_config.seed == 102


Use self.assertEqual consistently for checks in unit tests.

Xreki · 2023-04-14T11:22:40Z

python/paddle/fluid/tests/unittests/test_tensor_checker.py

+
+        x = paddle.to_tensor(
+            [0, 0, 0],
+            place=paddle.CPUPlace(),


The unit tests also need to cover GPU.

Added.

Xreki · 2023-04-14T11:23:42Z

python/paddle/fluid/tests/unittests/test_tensor_checker.py

+        )
+        paddle.amp.debugging.enable_tensor_checker(checker_config)
+        try:
+            res = paddle.divide(y, x)


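The unit tests are all too simple. They need cases that cover the forward pass, the backward pass, and the optimizer, with level=3 checking enabled, to make sure no operators are missed.

Done.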

Xreki · 2023-04-14T11:25:25Z

python/paddle/fluid/tests/unittests/test_tensor_checker.py

+                if "num_nan" in err_str:
+                    num_nan = int(err_str.split("=")[1])
+                elif "num_inf" in err_str:
+                    num_inf = int(err_str.split("=")[1])


Consider wrapping the duplicated code in a function.

Done.
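
As an illustration of the suggestions in this thread and the ones above (one shared parsing helper, self.assertEqual, and checks on op_type, num_nan, and num_inf), here is a minimal sketch. The helper name `parse_checker_log` and its return shape are assumptions, not code from the PR; the sample line is the checker output quoted in this review.

```python
import unittest

# Sample line copied from the checker output quoted in this review.
SAMPLE_LOG = (
    "[PRECISION] [ERROR] in [device=cpu, op=elementwise_pow_grad, tensor=, "
    "dtype=fp32], numel=3, num_nan=1, num_inf=0, num_zero=0, "
    "max=2.886751e-01, min=2.000000e-01, mean=-nan"
)


def parse_checker_log(message):
    """Return (op_type, num_nan, num_inf) parsed from one log line.

    Hypothetical helper: the name and return shape are assumptions.
    """
    op_type, num_nan, num_inf = None, None, None
    # Turn brackets and commas into spaces, then split into key=value tokens.
    for ch in "[],":
        message = message.replace(ch, " ")
    for token in message.split():
        key, _, value = token.partition("=")
        if key == "op":
            op_type = value
        elif key == "num_nan":
            num_nan = int(value)
        elif key == "num_inf":
            num_inf = int(value)
    return op_type, num_nan, num_inf


class TestParseCheckerLog(unittest.TestCase):
    def test_counts(self):
        # Each expectation is asserted, so a regression makes the test fail.
        op_type, num_nan, num_inf = parse_checker_log(SAMPLE_LOG)
        self.assertEqual(op_type, "elementwise_pow_grad")
        self.assertEqual(num_nan, 1)
        self.assertEqual(num_inf, 0)
```

Keeping the parsing in one function means each test case only states its expected values.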

AnnaTrainingG · 2023-04-17T02:00:29Z

The suggestions from the call-stack PR are addressed in this PR.

jzhang533

LGTM

Xreki

Paste a preview of the new API documentation into the PR description.

Xreki · 2023-04-20T02:07:19Z

python/paddle/amp/debugging.py

+    """
+    DebugMode is used to present the state of TensorCheckerConfig.
+
+    The meaning of each DebugMode is as following


Sentences must end with punctuation; the same applies below.

The English wording is not quite accurate. Maybe polish it with ChatGPT?

Done.

Xreki · 2023-04-20T02:09:40Z

python/paddle/amp/debugging.py

@@ -47,14 +85,16 @@ class TensorCheckerConfig:
    * enable: Whether to enable Tensor's value detection function. The default value is False, which means that these tools will never be used.


The documentation format is incorrect.

Done.

Xreki · 2023-04-20T02:10:15Z

python/paddle/amp/debugging.py

@@ -67,29 +107,31 @@ class TensorCheckerConfig:
    * enable_traceback_filtering: Whether to filter the traceback. The main purpose is to filter out the internal code call stack of the framework and only display the user code call stack


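The stack_height_limit documentation should state what is currently supported and how to turn off the call stack.

enable_traceback_filtering is not supported yet, so remove it as well.

Done.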

Xreki · 2023-04-20T02:11:52Z

python/paddle/amp/debugging.py

+        res = paddle.pow(x, y)
+        paddle.autograd.backward(res, retain_graph=True)
+        paddle.amp.debugging.disable_tensor_checker()
+        #[PRECISION] [ERROR] in [device=cpu, op=elementwise_pow_grad, tensor=, dtype=fp32], numel=3, num_nan=1, num_inf=0, num_zero=0, max=2.886751e-01, min=2.000000e-01, mean=-nan


Show the call stack as a comment as well.

Added.

Xreki · 2023-04-20T02:13:20Z

python/paddle/amp/debugging.py


        if self.enable:
            self._set_seed(self.enable)

-    def keep_random(self, seed, flag):
+    def _keep_random(self, seed, flag):


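Regarding function names, I left some review suggestions on the previous PR; please check them again.

Done. Without the leading _, the static check treats the method as exposed to users by default.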

Xreki · 2023-04-20T02:14:59Z

python/paddle/amp/debugging.py

                )
            else:
                raise ValueError("stack_height_limit must be int")

-    def check(self):
+    def _check_step_id(self):


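A leading _ marks a function as private by Python convention. Functions that need to be called from outside the class should not start with _.

Suggested function name: update_and_check_step_id

Done.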
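
The naming convention discussed here can be sketched as follows. `StepWindow` is a hypothetical stand-in, not the PR's class, and treating both ends of debug_step as inclusive is an assumption of this sketch.

```python
class StepWindow:
    """Hypothetical stand-in for the step bookkeeping in TensorCheckerConfig."""

    def __init__(self, debug_step=None):
        # debug_step=[start, end]; None means "check every step".
        self._debug_step = debug_step
        self._current_step = 0  # private state, not part of the public API

    def _in_window(self, step):
        # Private helper: only called from inside the class.
        if self._debug_step is None:
            return True
        start, end = self._debug_step
        return start <= step <= end  # inclusive bounds are an assumption

    def update_and_check_step_id(self):
        # Public method: advance the counter, then report whether this
        # step should be checked.
        self._current_step += 1
        return self._in_window(self._current_step)
```

The public name says that the call both advances the counter and returns a bool, which a name like check alone would not convey.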

Xreki · 2023-04-20T02:15:31Z

python/paddle/amp/debugging.py

@@ -411,30 +437,34 @@ def enable_tensor_checker(checker_config):
    """
    enable_tensor_checker(checker_config) is enables model level accuracy checking, which is used together with disables_tensor_checker() to achieve model level precision checking through the combination of these two APIs, checking the output Tensors of all operators within the specified range.

+
    Attention:


Attention -> Note

Done.

Xreki · 2023-04-20T02:16:02Z

python/paddle/amp/debugging.py

@@ -411,30 +437,34 @@ def enable_tensor_checker(checker_config):
    """
    enable_tensor_checker(checker_config) is enables model level accuracy checking, which is used together with disables_tensor_checker() to achieve model level precision checking through the combination of these two APIs, checking the output Tensors of all operators within the specified range.


"is enables" is a grammatical error.

Done.

Xreki · 2023-04-20T02:17:22Z

paddle/fluid/pybind/pybind.cc

@@ -2673,6 +2673,14 @@ All parameter, weight, gradient are variables in Paddle.
  m.def("set_nan_inf_debug_path",
        &paddle::framework::details::SetNanInfDebugPath);

+  // Add skipped op list
+  m.def("set_checked_op_list",


Let's just change it in this PR.

sunzhongkai588

Please double-check whether DebugMode needs an example.

sunzhongkai588 · 2023-04-21T06:40:43Z

python/paddle/amp/debugging.py


-    Attention:
+    Note:


Do not leave a blank line below Note:, otherwise it cannot be parsed.

Done.

sunzhongkai588 · 2023-04-21T06:42:00Z

python/paddle/amp/debugging.py


-    Attention:
+    Note:


Do not leave a blank line below Note:, otherwise it cannot be parsed.

Done.

sunzhongkai588 · 2023-04-21T06:51:18Z

python/paddle/amp/debugging.py


-    Args:
-    * enable: Whether to enable Tensor's value detection function. The default value is False, which means that these tools will never be used.
+    enable: A boolean value indicating whether to enable the detection of NaN and Inf values in tensors. The default value is False, which means that these tools will not be used.


The parameters still need to be preceded by Args:. Please follow the template (paddle.add) and watch the indentation.

Args:
    x (Tensor): The input tensor, it's data type should be float32, float64, int32, int64.
    y (Tensor): The input tensor, it's data type should be float32, float64, int32, int64.
    name (str, optional): For details, please refer to :ref:`api_guide_Name`. Generally, no setting i

If a parameter needs further explanation, such as the four modes of debug_mode, indent that explanation one more level.

Done.

AnnaTrainingG · 2023-04-21T07:24:11Z

Following the earlier precedent at https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/api/paddle/profiler/ProfilerState_cn.html, DebugMode will not include example code.

sunzhongkai588 · 2023-04-21T07:33:16Z

python/paddle/amp/debugging.py


-    debug_step: A list or tuple used primarily for nan/inf checking during model training. For example, debug_step=[1,5] indicates that nan/inf checking should only be performed on model training iterations 1 to 5.
+        debug_step: A list or tuple used primarily for nan/inf checking during model training. For example, debug_step=[1,5] indicates that nan/inf checking should only be performed on model training iterations 1 to 5.

    stack_height_limit: An integer value specifying the maximum depth of the call stack. This feature supports printing the call stack at the error location. Currently, only enabling or disabling call stack printing is supported. If you want to print the corresponding C++ call stack when NaN is detected in GPU Kernel, set stack_height_limit to 1, otherwise set it to 0.


Align this parameter's indentation with the one above (debug_step) as well.

Done.

sunzhongkai588 · 2023-04-21T09:20:14Z

python/paddle/amp/debugging.py


-           checker_config = paddle.amp.debugging.TensorCheckerConfig(enable=True, debug_mode=DebugMode.CHECK_NAN_INF_AND_ABORT)
-           paddle.amp.debugging.enable_tensor_checker(checker_config)
+        ..  code-block:: python


.. code-block:: python must be followed by a blank line, otherwise it will be parsed incorrectly.
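
The two layout rules raised by this reviewer (no blank line directly under Note:, and a blank line directly under the code-block directive) can be seen together in a small docstring. The function `scale` is purely hypothetical and exists only to carry the docstring.

```python
def scale(x, factor=2):
    """
    Multiply ``x`` by ``factor``.

    Note:
        The text starts on the line directly below ``Note:``.

    Examples:
        .. code-block:: python

            print(scale(3))
    """
    return x * factor
```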

sunzhongkai588

LGTM

Xreki · 2023-04-23T05:59:46Z

python/paddle/amp/debugging.py


    Args:
-    * enable: Whether to enable Tensor's value detection function. The default value is False, which means that these tools will never be used.
+        enable: A boolean value indicating whether to enable the detection of NaN and Inf values in tensors. The default value is False, which means that these tools will not be used.
+        debug_mode: A parameter that determines the type of debugging to be used. There are 4 available modes:


The format here is still wrong. There is no need to spell out what each DebugMode option does here.

Done.

Xreki · 2023-04-23T06:01:58Z

python/paddle/amp/debugging.py


    Args:
-    * enable: Whether to enable Tensor's value detection function. The default value is False, which means that these tools will never be used.
+        enable: A boolean value indicating whether to enable the detection of NaN and Inf values in tensors. The default value is False, which means that these tools will not be used.


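The format for parameter descriptions is:

Without a default value, e.g. enable (bool)

With a default value, e.g. debug_mode (DebugMode, optional), and the explanation that follows ends with "Default is DebugMode.CHECK_NAN_INF_AND_ABORT."

Done.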
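
A minimal sketch of the parameter format described above. `make_config` is a hypothetical helper, and the `DebugMode` enum here is a one-member stand-in whose value is arbitrary; only the names enable, debug_mode, and CHECK_NAN_INF_AND_ABORT come from the PR.

```python
from enum import Enum


class DebugMode(Enum):
    # Stand-in with a single member; the value is arbitrary in this sketch.
    CHECK_NAN_INF_AND_ABORT = 0


def make_config(enable, debug_mode=DebugMode.CHECK_NAN_INF_AND_ABORT):
    """
    Build a checker configuration (hypothetical helper for illustration).

    Args:
        enable (bool): Whether to enable NaN/Inf detection.
        debug_mode (DebugMode, optional): The kind of debugging to perform. Default is DebugMode.CHECK_NAN_INF_AND_ABORT.

    Returns:
        dict, the chosen settings.
    """
    return {"enable": enable, "debug_mode": debug_mode}
```

A required parameter carries only its type, while an optional one is marked optional and states its default at the end of the description.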

Xreki · 2023-04-23T06:02:39Z

python/paddle/amp/debugging.py


    def __init__(
        self,
        enable,
        debug_mode=DebugMode.CHECK_NAN_INF_AND_ABORT,
-        dump_dir=None,
+        output_dir=None,
        checked_op_list=None,
        skipped_op_list=None,
        debug_step=None,
        stack_height_limit=3,


Change the default value to 1.

Done.

Xreki · 2023-04-23T06:04:06Z

python/paddle/amp/debugging.py

+        if self.initial_seed != self.seed:
+            self.seed = self.initial_seed
+
+        if self.seed > 4294967295 or self.seed < 0:


Where does the number 4294967295 come from? It would be better to call the corresponding function directly here.

Done.
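
For context, 4294967295 equals 2**32 - 1, the largest unsigned 32-bit integer. The sketch below derives that bound instead of hard-coding it, using only the standard library. How the PR finally obtained the value is not shown in this conversation, so `validate_seed` is a hypothetical helper.

```python
import ctypes

# Largest value of an unsigned 32-bit integer, i.e. 2**32 - 1.
UINT32_MAX = ctypes.c_uint32(-1).value


def validate_seed(seed):
    """Hypothetical helper: reject seeds outside the uint32 range."""
    if seed < 0 or seed > UINT32_MAX:
        raise ValueError(f"seed must be in [0, {UINT32_MAX}], got {seed}")
    return seed
```

In code that already depends on NumPy, `np.iinfo(np.uint32).max` yields the same bound.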

Xreki

LGTM! Great work~

sunzhongkai588

LGTM

AnnaTrainingG changed the title ~~Add enable_check~~ Add "enable_tensor_checker" and "disable_tensor_checker" to api list Apr 14, 2023

Xreki reviewed Apr 14, 2023

update

f9837ba

AnnaTrainingG force-pushed the debug_tools_more branch from 6c714f5 to f9837ba Compare April 18, 2023 01:51

AnnaTrainingG added 10 commits April 18, 2023 02:41

update test=docs_preview

f9036a0

update test=docs_preview

5595f7e

update test=docs_preview

856a1c4

update test=docs_preview

98ddd70

update test=docs_preview path

4a7645f

update backward test=docs_preview

dec232f

ctest test=docs_preview

7080c45

ctest test=docs_preview

5dfa9ab

update

8dabb45

update backward test=docs_preview

8827aeb

jzhang533 previously approved these changes Apr 19, 2023

update test=docs_preview

df0babc

AnnaTrainingG dismissed jzhang533’s stale review via df0babc April 19, 2023 11:00

Xreki reviewed Apr 20, 2023

AnnaTrainingG added 3 commits April 20, 2023 07:11

update

9996eb1

update test=docs_preview

bb63042

update test=docs_preview

bec5f05

sunzhongkai588 reviewed Apr 21, 2023

update test=docs_preview

ce7565b

sunzhongkai588 reviewed Apr 21, 2023

update test=docs_preview

7d6b46a

sunzhongkai588 reviewed Apr 21, 2023

update test=docs_preview

3fe5e3f

sunzhongkai588 previously approved these changes Apr 21, 2023

Xreki reviewed Apr 23, 2023

Update, test=docs_preview

f9e05cc

AnnaTrainingG dismissed sunzhongkai588’s stale review via f9e05cc April 23, 2023 10:41

Update, test=docs_preview

b6833cf

Xreki approved these changes Apr 24, 2023

sunzhongkai588 approved these changes Apr 24, 2023

lanxianghit approved these changes Apr 24, 2023

AnnaTrainingG merged commit 4113871 into PaddlePaddle:develop Apr 24, 2023

AnnaTrainingG added a commit to AnnaTrainingG/Paddle that referenced this pull request Apr 24, 2023

Add "enable_tensor_checker" and "disable_tensor_checker" to api list (P…

8198ce4

…addlePaddle#52936)

AnnaTrainingG mentioned this pull request Apr 24, 2023

[Cherry-pick] Add enable_tensor_checker and disable_tensor_checker to api list #53287

Merged

lijialin03 pushed a commit to lijialin03/Paddle that referenced this pull request Apr 25, 2023

Add "enable_tensor_checker" and "disable_tensor_checker" to api list (P…

51d0a29

…addlePaddle#52936)

lanxianghit pushed a commit that referenced this pull request Apr 25, 2023

[Cherry-pick] Add enable_tensor_checker and disable_tensor_checker to…

ec77def

… api list (#52936) (#53287) Add enable_tensor_checker, disable_tensor_checker API (#52936)
