[Checkpointing] Set reasonable default for keep_latest_k checkpoints and avoid overflowing user drives...currently set to 'infinite' checkpoints
Bug description
The current default is keep_latest_k = 0, which means an unlimited number of checkpoints will be saved.
This is a poor default: it puts the user's system at risk of crashing out a run when the drive overflows. Worse, on an EC2 instance using the root drive, it will lock the user out of the system entirely, because the drive is full and the OS therefore can't reboot (ask me how I know this :). Recovering then requires real expertise: you have to create a new EC2 volume, attach it, and copy out the old one before you can finally access your files again.
By further example: I'm training Llama 3 70B, so each checkpoint is ~768 GB. The host drive is 10 TB, so turning on checkpointing every 200 iterations as part of an 8K-iteration run would overflow the drive around iteration 2600 of that run, since only ~13 checkpoints fit.
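The back-of-envelope math, in a few lines of Python (the sizes are the approximate figures from this example):

```python
ckpt_size_gb = 768        # one Llama 3 70B checkpoint, approximately
drive_size_gb = 10_000    # 10 TB host drive
ckpt_interval = 200       # iterations between checkpoint saves

max_ckpts = drive_size_gb // ckpt_size_gb   # 13 checkpoints fit on the drive
full_at_iter = max_ckpts * ckpt_interval    # drive is essentially full here
print(f"{max_ckpts} checkpoints fit; drive full around iteration {full_at_iter}")
# -> 13 checkpoints fit; drive full around iteration 2600
```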
Anyway, we should never put a user at risk of crashing their run or, worse, locking themselves out of their EC2 instance.
I recommend we default to keeping 4 checkpoints before overwriting. It would also be better, IMO, to expose this setting by default in the debug toml so users are aware it exists and can tune it as needed.
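For illustration, the keep-latest-k rotation this asks for amounts to a small pruning step after each save; with keep_latest_k = 4, the drive in the example above would never hold more than ~3 TB of checkpoints. A minimal sketch, assuming step-numbered checkpoint folders (the folder naming and the prune_checkpoints helper are hypothetical, not torchtitan's actual implementation):

```python
import re
import shutil
from pathlib import Path

def prune_checkpoints(ckpt_dir: str, keep_latest_k: int) -> None:
    """Delete all but the newest keep_latest_k checkpoint folders.

    Assumes checkpoints are saved as step-numbered directories
    (e.g. step-200, step-400, ...). keep_latest_k <= 0 keeps everything,
    mirroring the current 'infinite' behavior.
    """
    if keep_latest_k <= 0:
        return
    step_dirs = [
        d for d in Path(ckpt_dir).iterdir()
        if d.is_dir() and re.fullmatch(r"step-\d+", d.name)
    ]
    step_dirs.sort(key=lambda d: int(d.name.rsplit("-", 1)[1]))
    for old in step_dirs[:-keep_latest_k]:  # everything except the newest k
        shutil.rmtree(old)
```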
Versions
'2.7.0.dev20250228+cu128'