Support looped PP schedules in torchtitan #358
Conversation
- refactor some per-model logic into helper functions (Pull Request resolved: #358)
torchtitan/checkpoint.py
Outdated
"optimizer": OptimizerWrapper(model_parts, optimizers), | ||
# TODO(whc) flatten lr_schedulers using a wrapper and somehow handle resharding? | ||
# or store one per key and explicitly dont support resharding? | ||
# "lr_scheduler": lr_scheduler, |
I have to fix this part. I think @fegin had a suggestion for a workaround that would support resharding. I'm not sure whether I should do that, or just let it be saved in a way that would break resharding and not worry about resharding for now.
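For context, a minimal sketch of the "store one per key" option from the TODO above; the class name `SchedulersContainer` and the key format are illustrative, not necessarily what the PR ends up using:

```python
# Sketch only (not the PR's actual code): each lr_scheduler is saved under its
# own indexed key, and resharding is explicitly not supported.
from torch.distributed.checkpoint.stateful import Stateful


class SchedulersContainer(Stateful):
    def __init__(self, lr_schedulers):
        # one scheduler per pipeline-stage model part on this rank
        self.lr_schedulers = lr_schedulers

    def state_dict(self):
        return {
            f"lr_scheduler_{idx}": s.state_dict()
            for idx, s in enumerate(self.lr_schedulers)
        }

    def load_state_dict(self, state_dict):
        # keys only line up if the loading job has the same sharding, so a
        # mismatch surfaces as a hard error instead of silent corruption
        for idx, s in enumerate(self.lr_schedulers):
            s.load_state_dict(state_dict[f"lr_scheduler_{idx}"])
```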
which is guaranteed for the model by correct pipeline splitting and for the optimizer by the flattening
support described in (1).

3. LR schedulers also index model states like optimizers and would need to be flattened properly to support
I did it this way because I thought that if we are not supporting resharding of lr_scheduler, then I may as well save each one. If I save each one, then at load time I get a form of assertion from DCP: if the runtime loading the checkpoint has the same number of ranks, they will match up and load OK; if not, it will throw an error.
I could switch back to the version where I save only one copy, but then I have to do some validation up front. Do you think this way is better?
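For comparison, a rough sketch of the "save only one copy" alternative and the up-front validation it would need; the helper name is hypothetical:

```python
# Hypothetical helper illustrating the "save one copy + validate" alternative,
# not code from this PR.
def pick_single_lr_scheduler_state(lr_schedulers):
    states = [s.state_dict() for s in lr_schedulers]
    # All schedulers are stepped in lockstep, so their states should be
    # identical; verify that before collapsing them into a single entry.
    if any(st != states[0] for st in states[1:]):
        raise RuntimeError("lr_scheduler states diverged; cannot save a single copy")
    return states[0]
```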
@fegin
If we are not supporting resharding, then this implementation is better.
The pipelining part looks good to me. I left two minor comments.
@@ -26,7 +33,7 @@ def build_pipeline_schedule(job_config, parallel_dims, stages, loss_fn):
     n_microbatches = job_config.experimental.pipeline_parallel_degree

     return schedule_class(
-        stage,
+        stages if looped_schedule else stages[0],
The init function of ScheduleInterleaved1F1B takes a list of _PipelineStageBase.
If we only pass stages[0], will that cause ScheduleInterleaved1F1B to fail?
Well, the point of this code is that the schedule class could be either a simple schedule or a looped schedule.
If it's a looped schedule, like Interleaved1F1B, then we must pass 'stages'.
If it's a simple schedule, then we must have just one stage, so 'stages[0]' is appropriate.
But we should never have Interleaved1F1B(stages[0]).
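To make the dispatch concrete, here is a sketch assuming the schedule base classes come from `torch.distributed.pipelining.schedules` (as in recent PyTorch); `construct_schedule` and deriving `looped_schedule` from the class are illustrative simplifications, not the exact torchtitan code:

```python
from torch.distributed.pipelining.schedules import (
    PipelineScheduleMulti,
    PipelineScheduleSingle,
)


def construct_schedule(schedule_class, stages, n_microbatches, loss_fn):
    looped_schedule = issubclass(schedule_class, PipelineScheduleMulti)
    if not looped_schedule:
        assert issubclass(schedule_class, PipelineScheduleSingle)
    return schedule_class(
        # looped schedules (e.g. ScheduleInterleaved1F1B) take a list of stages;
        # simple schedules (e.g. Schedule1F1B) take exactly one stage
        stages if looped_schedule else stages[0],
        n_microbatches=n_microbatches,
        loss_fn=loss_fn,
    )
```

Deriving `looped_schedule` from the class keeps the conditional and the chosen schedule class from drifting apart.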
Stack from ghstack (oldest at bottom):