Refactor Checkpointer #871

Merged: 13 commits, merged on Feb 27, 2025
Conversation

@fegin (Contributor) commented Feb 20, 2025

Stack from ghstack (oldest at bottom):

Several bug fixes, refactors, and feature improvements in preparation for the next PR (integration with TorchFT):

  1. Refactor the code for better readability.
  2. Remove the time-based checkpoint condition. It is unused, can cause deadlocks when integrating with TorchFT, and removing it makes the code simpler.
  3. Fix an async_with_pinned_memory bug.
  4. The original keep_last_k implementation can raise exceptions in certain cases and is also slow. Fix the bugs and use a background thread to purge checkpoints (a rough sketch of the idea follows below).
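
Not the PR's actual code, but a minimal sketch of the purge-thread idea under some assumptions: checkpoint folders are named step-<n>, a successful save enqueues a purge request, and a background thread deletes everything except the newest keep_latest_k folders so the training loop never blocks on slow filesystem deletes. The class and method names here are made up for illustration.

import os
import queue
import shutil
import threading

class CheckpointPurger:
    """Delete stale checkpoint folders on a background thread (illustrative sketch)."""

    def __init__(self, folder: str, keep_latest_k: int) -> None:
        self.folder = folder
        self.keep_latest_k = keep_latest_k
        self.purge_queue: queue.Queue = queue.Queue()
        self.purge_thread = threading.Thread(target=self._purge_worker, daemon=True)
        self.purge_thread.start()

    def enqueue(self) -> None:
        # Called (on rank 0 only) after a save completes; does not block training.
        self.purge_queue.put(None)

    def _purge_worker(self) -> None:
        while True:
            self.purge_queue.get()  # wait until a new checkpoint has landed
            steps = sorted(
                int(name.split("-")[1])
                for name in os.listdir(self.folder)
                if name.startswith("step-")
            )
            for step in steps[: -self.keep_latest_k]:
                path = os.path.join(self.folder, f"step-{step}")
                shutil.rmtree(path, ignore_errors=True)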

@fegin requested review from d4l3k, tianyu-l and fduwjj on February 25, 2025

@mock.patch(
    "torchtitan.components.checkpoint.dcp.async_save", side_effect=fake_async_save
)
def test_async_save_calls_async_wait(self, *_):
Contributor:

Can this test memory leak as well?

fegin (Author):

No, unfortunately. Memory leakage requires a more thorough test. I'm not sure there is an easy way to test it with unittest.
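
For readers unfamiliar with the pattern in the quoted decorator, here is a small, self-contained sketch of how mocking dcp.async_save with a fake future lets a unit test assert that the previous async save is waited on before the next one starts. The Checkpointer and FakeAsyncFuture classes below are hypothetical stand-ins, not torchtitan's real classes, and, as noted above, this style of test does not cover memory-leak detection.

import unittest
from unittest import mock

class FakeAsyncFuture:
    """Stand-in for the future returned by dcp.async_save."""
    def __init__(self) -> None:
        self.waited = False

    def result(self) -> None:
        self.waited = True

class Checkpointer:
    """Hypothetical checkpointer: waits on the previous async save before starting a new one."""
    def __init__(self, async_save) -> None:
        self._async_save = async_save
        self._last_future = None

    def save(self, state: dict, step: int) -> None:
        if self._last_future is not None:
            self._last_future.result()  # the behavior the test asserts
        self._last_future = self._async_save(state, checkpoint_id=f"step-{step}")

class TestAsyncSave(unittest.TestCase):
    def test_async_save_calls_async_wait(self):
        fake_async_save = mock.Mock(side_effect=lambda *a, **kw: FakeAsyncFuture())
        ckpt = Checkpointer(fake_async_save)
        ckpt.save({}, step=10)
        first_future = ckpt._last_future
        ckpt.save({}, step=20)
        self.assertTrue(first_future.waited)
        self.assertEqual(fake_async_save.call_count, 2)

if __name__ == "__main__":
    unittest.main()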

and self.async_mode == AsyncMode.ASYNC_WITH_PINNED_MEM
and self.staging

self.keep_latest_k > 0
and dist.get_rank() == 0
Contributor:

For my own learning: why do we only do the purge on dist.get_rank() == 0?

fegin (Author):

We assume that all ranks can access the same files; that is DCP's assumption. If we let all ranks purge, the file system will complain.
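
A hedged restatement of the condition quoted above, as a tiny helper (the function name is made up for illustration):

import torch.distributed as dist

def should_purge(keep_latest_k: int) -> bool:
    # Only rank 0 deletes old checkpoints: DCP assumes every rank sees the same
    # shared checkpoint folder, so letting all ranks delete the same files would
    # just race on the filesystem.
    return keep_latest_k > 0 and dist.is_initialized() and dist.get_rank() == 0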

@tianyu-l (Contributor) left a comment:

I'm not in a position to review every detail of this PR -- so I will leave it to others.
But it looks quite good to me.

@@ -44,6 +50,8 @@ class AsyncMode(str, enum.Enum):
ASYNC_WITH_PINNED_MEM = "async_with_pinned_mem"


# TODO: move this out from checkpoint.py and merge it with the trainer.py
# We probably want to create a Trainer objecta.
Contributor:

Suggested change:
- # We probably want to create a Trainer objecta.
+ # We probably want to create a Trainer object.

Contributor:

Hmm I still didn't get why we need a Trainer (yet).

@fegin (Author) commented Feb 26, 2025:

@tianyu-l, a Trainer class is not necessary, but it is cleaner.

class Trainer(Stateful):
    def __init__(self, job_config) -> None:
        # move all of the init code currently in train.py here
        ...

    def train(self) -> None:
        # training loop goes here
        ...
        self.checkpoint.save(state={"trainer": self})

    def state_dict(self) -> Dict[str, Any]:
        # step, log_steps, and any other trainer state
        return {"step": self.step, "log_steps": self.log_steps}

    def load_state_dict(self, sd) -> None:
        self.step = sd["step"]
        ...

While we could simplify the original train() by splitting it into two separate functions and keeping TrainerState, a single Trainer class is more natural, as it keeps the state and methods in one class/object.

Contributor:

I see, makes sense to me!

self.purge_thread.join()

@torch.no_grad()
def save(self, curr_step: int, force: bool = False) -> None:
Contributor:

Nit: IIUC, force is only for unit tests, right? Might be worth mentioning this in the comment?

fegin (Author):

It's not; it is also used when training is finished.
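
For illustration only, a self-contained sketch of that second use: the usual interval gate applies on normal steps, while force=True guarantees a final checkpoint when training finishes. total_steps, interval, and save_checkpoint are made-up stand-ins, not the real train.py or Checkpointer.save.

def run_training(total_steps: int, interval: int) -> None:
    def save_checkpoint(curr_step: int, force: bool = False) -> None:
        # Save on the usual interval, or unconditionally when forced.
        if force or curr_step % interval == 0:
            print(f"saving checkpoint at step {curr_step} (force={force})")

    for step in range(1, total_steps + 1):
        # ... one training step would run here ...
        # force=True on the last step so the final state is always saved,
        # even if it does not fall on a checkpoint interval.
        save_checkpoint(curr_step=step, force=(step == total_steps))

run_training(total_steps=10, interval=4)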

@fduwjj (Contributor) left a comment:

LGTM

@fegin merged commit 29caadc into gh/fegin/13/base on Feb 27, 2025
6 checks passed
fegin added a commit that referenced this pull request on Feb 27, 2025: Reland #871 due to ghstack issues.
K-H-Ismail pushed a commit to K-H-Ismail/torchtitan that referenced this pull request on Feb 28, 2025.