cuda: prevent task lockup on timeout error #2547

Open · rst0git wants to merge 3 commits into criu-dev from 2024-12-14-cuda-prevent-task-lockup-after-timeout-error
Conversation

@rst0git (Member) commented Dec 14, 2024

When creating a checkpoint of large models, the checkpoint action of cuda-checkpoint can exceed the CRIU timeout. This causes CRIU to fail with the following error, leaving the CUDA task in a locked state:

cuda_plugin: Checkpointing CUDA devices on pid 84145 restore_tid 84202
Error (criu/cr-dump.c:1791): Timeout reached. Try to interrupt: 0
Error (cuda_plugin.c:139): cuda_plugin: Unable to read output of cuda-checkpoint: Interrupted system call
Error (cuda_plugin.c:396): cuda_plugin: CHECKPOINT_DEVICES failed with
net: Unlock network
cuda_plugin: finished cuda_plugin stage 0 err -1
cuda_plugin: resuming devices on pid 84145
cuda_plugin: Restore thread pid 84202 found for real pid 84145
Unfreezing tasks into 1
	Unseizing 84145 into 1
Error (criu/cr-dump.c:2111): Dumping FAILED.

To fix this, we set task_info->checkpointed before invoking the checkpoint action to ensure that the CUDA task is resumed even if CRIU times out.
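For illustration only, here is a minimal, self-contained sketch of that ordering; the struct, helper function, and pid below are simplified stand-ins rather than the actual cuda_plugin.c code.

```c
#include <stdbool.h>
#include <stdio.h>

struct task_info {
	int pid;
	bool checkpointed;   /* consulted by the resume/cleanup path */
};

/* Stand-in for invoking the cuda-checkpoint "checkpoint" action;
 * in reality this call can be interrupted by CRIU's timeout. */
static int run_checkpoint_action(int pid)
{
	printf("checkpointing CUDA state of pid %d\n", pid);
	return 0;
}

static int checkpoint_device(struct task_info *ti)
{
	/* Mark the task first: if the action below is interrupted,
	 * the resume path still knows this task must be resumed. */
	ti->checkpointed = true;
	return run_checkpoint_action(ti->pid);
}

int main(void)
{
	struct task_info ti = { .pid = 84145, .checkpointed = false };
	return checkpoint_device(&ti);
}
```

The point is the ordering, not the stubs: the flag is flipped before the action that might be interrupted, so the resume path always sees it.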

@avagin (Member) commented Dec 15, 2024

LGTM

I think we need to move run_plugins(CHECKPOINT_DEVICES) out of collect_pstree(). collect_pstree() should just freeze processes.

@jesus-ramos (Contributor) commented

Unfortunately, this problem also requires a driver fix. Due to the way cuda-checkpoint currently works, killing it via Ctrl-C or an alarm timeout in the middle of an operation can cause it to get out of sync with later invocations of cuda-checkpoint: you can end up receiving stale responses from previous invocations, so this fix may not always work as intended. For example, if cuda-checkpoint is killed in the middle of a checkpoint operation that then completes behind the scenes, the following call to cuda-checkpoint to restore will return the status of the checkpoint rather than of the restore. Fixing this currently requires restarting the target application, which is not very user friendly.

I'll forward the issue along internally, though, as it's been on our radar to fix for a while.

The patch itself LGTM, and I also agree with Andrei's point about moving the checkpoint plugin call out of the pstree walk/freeze.

@rst0git force-pushed the 2024-12-14-cuda-prevent-task-lockup-after-timeout-error branch from 24c158f to 66cb6de on December 21, 2024 at 14:27
@rst0git (Member, Author) commented Dec 21, 2024

> I think we need to move run_plugins(CHECKPOINT_DEVICES) out of collect_pstree(). collect_pstree() should just freeze processes.

@avagin @jesus-ramos I've updated the pull request with this change, and I was able to confirm that CRIU no longer fails with a timeout error when checkpointing large models.
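
Roughly, the reordered dump flow might look like the self-contained sketch below; collect_pstree() and checkpoint_devices() here are stand-in stubs, not the actual cr-dump.c code, and the real error handling and dump steps are omitted.

```c
#include <stdio.h>

/* Stand-in: in CRIU this freezes/seizes the process tree. */
static int collect_pstree(void)
{
	printf("freezing the process tree\n");
	return 0;
}

/* Stand-in: in CRIU this runs the CHECKPOINT_DEVICES plugin hook
 * (e.g. the CUDA plugin calling cuda-checkpoint). */
static int checkpoint_devices(void)
{
	printf("checkpointing device state\n");
	return 0;
}

int main(void)
{
	/* 1. Freeze first: collect_pstree() now has a single job. */
	if (collect_pstree() < 0)
		return 1;

	/* 2. Checkpoint devices as a separate step, so a slow plugin
	 *    action no longer runs inside the freeze path, and only a
	 *    full `criu dump` (not pre-dump) reaches this call. */
	if (checkpoint_devices() < 0)
		return 1;

	/* ... the rest of the dump would follow here ... */
	return 0;
}
```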

criu/cr-dump.c: outdated review comment (resolved)
@rst0git force-pushed the 2024-12-14-cuda-prevent-task-lockup-after-timeout-error branch 3 times, most recently from 14ae222 to 8597e5f on January 15, 2025 at 20:42
When creating a checkpoint of large models, the `checkpoint` action of
`cuda-checkpoint` can exceed the CRIU timeout. This causes CRIU to fail
with the following error, leaving the CUDA task in a locked state:

	cuda_plugin: Checkpointing CUDA devices on pid 84145 restore_tid 84202
	Error (criu/cr-dump.c:1791): Timeout reached. Try to interrupt: 0
	Error (cuda_plugin.c:139): cuda_plugin: Unable to read output of cuda-checkpoint: Interrupted system call
	Error (cuda_plugin.c:396): cuda_plugin: CHECKPOINT_DEVICES failed with
	net: Unlock network
	cuda_plugin: finished cuda_plugin stage 0 err -1
	cuda_plugin: resuming devices on pid 84145
	cuda_plugin: Restore thread pid 84202 found for real pid 84145
	Unfreezing tasks into 1
		Unseizing 84145 into 1
	Error (criu/cr-dump.c:2111): Dumping FAILED.

To fix this, we set `task_info->checkpointed` before invoking
the `checkpoint` action to ensure that the CUDA task is resumed
even if CRIU times out.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>

Move `run_plugins(CHECKPOINT_DEVICES)` out of `collect_pstree()` to
ensure that the function's sole responsibility is to use the cgroup
freezer for the process tree. This allows us to avoid a time-out
error when checkpointing applications with large GPU state.

v2: This patch calls `checkpoint_devices()` only for `criu dump`.
Support for GPU checkpointing with `pre-dump` will be introduced in
a separate patch.

Suggested-by: Andrei Vagin <avagin@google.com>
Suggested-by: Jesus Ramos <jeramos@nvidia.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>

Temporarily disable the CUDA plugin for `criu pre-dump`.

pre-dump currently fails with the following error:

Handling VMA with the following smaps entry: 1822c000-18da5000 rw-p 00000000 00:00 0                                  [heap]
Handling VMA with the following smaps entry: 200000000-200200000 ---p 00000000 00:00 0
Handling VMA with the following smaps entry: 200200000-200400000 rw-s 00000000 00:06 895                              /dev/nvidia0
Error (criu/proc_parse.c:116): handle_device_vma plugin failed: No such file or directory
Error (criu/proc_parse.c:632): Can't handle non-regular mapping on 705693's map 200200000
Error (criu/cr-dump.c:1486): Collect mappings (pid: 705693) failed with -1

We plan to enable support for pre-dump by skipping nvidia mappings
in a separate patch.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
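
For illustration, a minimal sketch of the "opt out on pre-dump" idea described above; the stage enum and init hook below are made-up stand-ins, not the real CRIU plugin API.

```c
#include <stdio.h>

/* Stand-in stage identifiers -- not the real CRIU plugin stages. */
enum plugin_stage {
	STAGE_PRE_DUMP,
	STAGE_DUMP,
	STAGE_RESTORE,
};

static int plugin_enabled = 1;

static int cuda_plugin_init_sketch(enum plugin_stage stage)
{
	if (stage == STAGE_PRE_DUMP) {
		/* Pre-dump cannot handle the /dev/nvidia* mappings yet,
		 * so the plugin marks itself disabled instead of failing
		 * the whole dump; later hooks become no-ops. */
		plugin_enabled = 0;
		printf("cuda_plugin: disabled for pre-dump\n");
	}
	return 0;
}

int main(void)
{
	cuda_plugin_init_sketch(STAGE_PRE_DUMP);
	printf("plugin enabled: %d\n", plugin_enabled);
	return 0;
}
```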
@rst0git force-pushed the 2024-12-14-cuda-prevent-task-lockup-after-timeout-error branch from 01738b0 to 1845d48 on January 19, 2025 at 11:42