[Checkpointing] Set reasonable default for keep_latest_k checkpoints and avoid overflowing user drives...currently set to 'infinite' checkpoints
Bug description
The current default is keep_latest_k = 0, which means an unlimited number of checkpoints will be saved.
This is a poor default: it puts the user's system at risk of crashing out a run when the drive overflows. Worse, on an EC2 instance using the root drive, it will lock the user out of the system entirely, because the drive is full and the OS therefore can't reboot (ask me how I know this :). Recovering then requires real expertise: you have to create a new EC2 volume, attach it, and copy out the old one before you can finally access your files again.
By further example: I'm training Llama 3 70B, so each checkpoint is ~768 GB. The host drive is 10 TB, so turning on checkpointing every 200 iterations as part of an 8K-iteration run would overflow the drive around iteration 2600 of that run, since only ~13 checkpoints fit.
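The back-of-envelope math, in a few lines of Python (the sizes are the approximate figures from this example):

```python
ckpt_size_gb = 768        # one Llama 3 70B checkpoint, approximately
drive_size_gb = 10_000    # 10 TB host drive
ckpt_interval = 200       # iterations between checkpoint saves

max_ckpts = drive_size_gb // ckpt_size_gb   # 13 checkpoints fit on the drive
full_at_iter = max_ckpts * ckpt_interval    # drive is essentially full here
print(f"{max_ckpts} checkpoints fit; drive full around iteration {full_at_iter}")
# -> 13 checkpoints fit; drive full around iteration 2600
```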
Anyway, we should never put a user at risk of crashing their run or, worse, locking themselves out of their EC2 instance.
I recommend we default to keeping 4 checkpoints before overwriting. It would also be better, IMO, to expose this setting by default in the debug toml so users are aware it exists and can tune it as needed.
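For illustration, the keep-latest-k rotation this asks for amounts to a small pruning step after each save; with keep_latest_k = 4, the drive in the example above would never hold more than ~3 TB of checkpoints. A minimal sketch, assuming step-numbered checkpoint folders (the folder naming and the prune_checkpoints helper are hypothetical, not torchtitan's actual implementation):

```python
import re
import shutil
from pathlib import Path

def prune_checkpoints(ckpt_dir: str, keep_latest_k: int) -> None:
    """Delete all but the newest keep_latest_k checkpoint folders.

    Assumes checkpoints are saved as step-numbered directories
    (e.g. step-200, step-400, ...). keep_latest_k <= 0 keeps everything,
    mirroring the current 'infinite' behavior.
    """
    if keep_latest_k <= 0:
        return
    step_dirs = [
        d for d in Path(ckpt_dir).iterdir()
        if d.is_dir() and re.fullmatch(r"step-\d+", d.name)
    ]
    step_dirs.sort(key=lambda d: int(d.name.rsplit("-", 1)[1]))
    for old in step_dirs[:-keep_latest_k]:  # everything except the newest k
        shutil.rmtree(old)
```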
Versions
'2.7.0.dev20250228+cu128'