[Checkpointing] Set reasonable default for keep_latest_k checkpoints and avoid overflowing user drives...currently set to 'infinite' checkpoints #910

Open
lessw2020 opened this issue Mar 1, 2025 · 0 comments

Bug description

The current default is keep_latest_k = 0, which means 'infinite' checkpoints will be saved.
This is a poor default: it puts the user at risk of crashing a run when the drive fills up. Worse, on an EC2 system that uses the root drive, it will lock the user out of the system because the drive is full and the OS therefore can't reboot (ask me how I know about this :). Recovering then requires a lot of expertise: you have to create a new EC2 drive, attach it, and copy out the old one before you can finally get access to your files again.

As a further example: I'm training llama3 70B, and each checkpoint is ~768GB. The host drive is 10TB, which holds about 13 checkpoints, so checkpointing every 200 iters as part of an 8K-iter run would overflow the drive around iter 2600.
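The arithmetic behind that estimate can be sketched in a few lines of Python. This is only an illustration using the numbers from this issue: the function names are hypothetical, the drive is assumed to start empty, and 10TB is taken as 10,000GB.

```python
def checkpoints_that_fit(drive_gb: int, ckpt_gb: int) -> int:
    """Number of whole checkpoints that fit on an (assumed empty) drive."""
    return drive_gb // ckpt_gb


def last_safe_iter(drive_gb: int, ckpt_gb: int, interval: int) -> int:
    """Iteration of the last checkpoint that still fits on the drive."""
    return checkpoints_that_fit(drive_gb, ckpt_gb) * interval


def total_needed_gb(total_iters: int, interval: int, ckpt_gb: int) -> int:
    """Disk space needed if every checkpoint of the run is kept."""
    return (total_iters // interval) * ckpt_gb


if __name__ == "__main__":
    # Numbers from the issue: 768GB checkpoints, 10TB drive, every 200 iters.
    print(checkpoints_that_fit(10_000, 768))   # 13
    print(last_safe_iter(10_000, 768, 200))    # 2600
    print(total_needed_gb(8_000, 200, 768))    # 30720
```

With nothing pruned, the full 8K run would need roughly 30TB, about three times what the drive holds.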

Anyway, we should never put a user at risk of crashing their run or, worse, locking them out of their EC2 instance.

I recommend we default to keeping 4 checkpoints before overwriting. It would also be better, imo, to expose this setting by default in the debug toml so users are aware it exists and are likely to tune it as needed.
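For illustration, here is a minimal sketch of what keep-latest-k pruning could look like. It is not the project's actual implementation: `prune_checkpoints` is a hypothetical helper, and the `step-<N>` directory naming is an assumption made for the example. The only behavior taken from this issue is that `keep_latest_k = 0` means nothing is ever deleted.

```python
import re
import shutil
from pathlib import Path

# Assumed naming scheme for checkpoint directories, e.g. "step-200".
STEP_DIR = re.compile(r"^step-(\d+)$")


def prune_checkpoints(folder: Path, keep_latest_k: int) -> list[int]:
    """Delete all but the newest keep_latest_k checkpoints.

    Returns the step numbers that were deleted. A value of 0 (or less)
    keeps every checkpoint, matching the current default described above.
    """
    if keep_latest_k <= 0:
        return []

    # Collect the step number of every directory that looks like a checkpoint.
    steps = []
    for path in folder.iterdir():
        match = STEP_DIR.match(path.name)
        if path.is_dir() and match:
            steps.append(int(match.group(1)))
    steps.sort()

    # Everything except the last keep_latest_k entries is stale.
    stale = steps[:-keep_latest_k]
    for step in stale:
        shutil.rmtree(folder / f"step-{step}")
    return stale
```

Called after each save with `keep_latest_k=4`, this would cap disk usage at four checkpoints, about 3TB in the llama3 70B example.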

Versions

'2.7.0.dev20250228+cu128'

lessw2020 changed the title on Mar 1, 2025, from "[Checkpointing] Set reasonable default for keep_latest_k checkpoints and avoid overllowing drives...currently set to 'infinite' checkpoints" to "[Checkpointing] Set reasonable default for keep_latest_k checkpoints and avoid overflowing user drives...currently set to 'infinite' checkpoints"