Skip data loading for middle PP ranks #411

wconstab · 2024-06-17T22:06:23Z

Stack from ghstack (oldest at bottom):

The first and last PP ranks need to perform data loading to fetch matching
input_ids and labels. Middle ranks only receive activations from the
previous stage, so they can skip it.

As an alternative, PP could pass the 'labels' from stage 0 through
all the stages, but unless the dataloader is overburdened enough to
become the bottleneck, this would likely be worse.

A downside to skipping data loading for middle ranks is added complexity
in train.py, including in how metrics are handled.
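
A minimal sketch of the rank check described above, not the actual train.py change. It assumes one pipeline stage per rank (looped schedules, where a rank holds several stages, would need a different check). The names `needs_data`, `next_batch`, `pp_rank`, and `pp_size` are hypothetical, and plain lists stand in for tensors.

```python
from typing import Iterator, Optional, Tuple

# (input_ids, labels); plain lists stand in for tensors in this sketch.
Batch = Tuple[list, list]


def needs_data(pp_rank: int, pp_size: int) -> bool:
    """Return True for the first and last pipeline ranks.

    Assumption: one stage per rank, so rank 0 consumes input_ids and
    rank pp_size - 1 consumes labels for the loss.
    """
    return pp_rank == 0 or pp_rank == pp_size - 1


def next_batch(
    data_iter: Iterator[Batch], pp_rank: int, pp_size: int
) -> Optional[Batch]:
    """Advance the data iterator only on ranks that need the batch.

    Middle ranks return None and never touch the iterator.
    """
    if needs_data(pp_rank, pp_size):
        return next(data_iter)
    return None
```

The training loop would then have to tolerate a `None` batch on middle ranks, which is the added complexity mentioned above.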

wps and mfu are derived from the number of words in the batch,
which is no longer known on ranks that skip data loading. This may be OK:
the last rank is the most interesting one to look at logs from, since it
also reports the loss, and it would still compute MFU/WPS. However, if
there are imbalances in WPS between ranks, we'd ideally have a way to
see that via the metrics.
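
One way the metrics handling could look, as a sketch under stated assumptions rather than what this PR implements: ranks that skipped data loading report no WPS/MFU instead of a misleading zero. The function names, the `None` convention, and the MFU formula (achieved FLOP/s divided by peak FLOP/s) are assumptions made for illustration.

```python
from typing import Optional


def words_per_second(ntokens: Optional[int], elapsed_s: float) -> Optional[float]:
    """Throughput for this rank, or None when the batch size is unknown.

    ntokens is None on ranks that skipped data loading (assumed convention).
    """
    if ntokens is None or elapsed_s <= 0:
        return None
    return ntokens / elapsed_s


def mfu(
    wps: Optional[float], flop_per_token: float, peak_flops: float
) -> Optional[float]:
    """Model FLOPs utilization as a fraction, or None when wps is unknown.

    Assumed formula: (flop_per_token * wps) / peak_flops.
    """
    if wps is None:
        return None
    return flop_per_token * wps / peak_flops
```

With this convention, the logging code can skip the WPS/MFU fields when they are `None`. It does not address the cross-rank imbalance concern, since middle ranks still report nothing.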


wconstab · 2024-06-21T16:41:44Z

Honestly, I'm not sure whether we want to land this PR. It is not urgent in any case, and we could run some experiments to decide whether it's more critical to reduce data-loader stress or to keep compute/comms balanced per rank and ensure we avoid timeouts. Closing for now.

cc @tianyu-l @wanchaol @awgu

Update 8e140f6

This was referenced Jun 17, 2024

Change debugmodel to have 8 layers #403

Merged

Prepare train.py for model chunks for pipelining #406

Merged

Support looped PP schedules in torchtitan #358

Merged

facebook-github-bot added the CLA Signed label Jun 17, 2024

Update c0bd954

Update 0ac5426

Update d219326

Update 31956c7

Update 25ebbee

Update 0f93057

wconstab closed this Jun 21, 2024
