Make metrics logging work for pipeline parallelism #383
Conversation
train.py (Outdated)

```python
metric_logger = build_metric_logger(job_config)
if parallel_dims.pp_enabled:
    pp_size = pp_mesh.size()
    metrics_log_rank = int((world_mesh.size() // pp_size) * (pp_size - 1))
```
Does this rank computation assume that PP is outermost? If so, should we assert/check for that?
Yeah, it does. I wonder if there is a better place to do the assert; I'll add it here for now.
I also don't like doing it this way. I might want to propose adding a DeviceMesh API to help do a calculation like this more robustly.
I moved this into a util and added an assert.
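For reference, a minimal sketch of what such a util could look like; the function name, signature, and the `pp_is_outermost` flag are illustrative, not the exact code that landed:

```python
def get_metrics_log_rank(world_size: int, pp_size: int, pp_is_outermost: bool) -> int:
    """Global rank that should log loss metrics.

    Rank 0 when pipeline parallelism is off; otherwise the local rank 0 of
    the last pipeline stage, since that stage is the one computing the loss.
    """
    # The arithmetic below only holds if pp is the outermost mesh dimension,
    # i.e. the last pp stage owns the final contiguous block of global ranks.
    assert pp_is_outermost, "metrics log rank calculation assumes pp is the outermost mesh dim"
    if pp_size <= 1:
        return 0
    ranks_per_stage = world_size // pp_size
    return ranks_per_stage * (pp_size - 1)
```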
We probably need to extend DeviceMesh to make calculating a specific rank easier.
Yes, my thoughts exactly. I would like to discuss this offline. I couldn't quickly think of what the best API proposal for DeviceMesh would be, so I went this route instead.
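One possible shape for a mesh-based version of the same calculation, assuming the world mesh was built with "pp" as the outermost dimension and leaning on DeviceMesh exposing its underlying rank tensor as `.mesh` (a sketch, not a proposed API):

```python
def metrics_log_rank_from_mesh(world_mesh) -> int:
    # Slicing the outermost ("pp") dimension at its last index yields the
    # global ranks of the last pipeline stage; its first element is that
    # stage's local rank 0.
    return int(world_mesh.mesh[-1].flatten()[0].item())
```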
Stack from ghstack (oldest at bottom):
Avoid complicating the UX and leave the status quo of 2 user-selectable behaviors:

- log from rank 0 (the default)
- log from all ranks (not the default)

Modify the meaning of 'log from rank 0' to log from rank 0 in non-pipeline-parallel runs, and log from the local rank 0 within the last pipeline-parallel stage group if PP is enabled. (Note: earlier pipeline stages still produce some metrics like MFU/memory, but do not compute loss.)
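A hedged sketch of how those two behaviors could fit together; `log_from_all_ranks`, `world_size`, and `pp_size` are illustrative parameters rather than torchtitan's actual config fields or helper names:

```python
import torch.distributed as dist

def should_log_metrics(log_from_all_ranks: bool, world_size: int, pp_size: int) -> bool:
    # Non-default behavior: every rank builds a real metric logger.
    if log_from_all_ranks:
        return True
    # Default "log from rank 0" behavior: plain rank 0 without PP, otherwise
    # local rank 0 of the last pipeline stage, which is where loss is computed.
    target_rank = 0 if pp_size <= 1 else (world_size // pp_size) * (pp_size - 1)
    return dist.get_rank() == target_rank
```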