
[float8] add float8 training benchmarking scripts #1802

Merged: 2 commits into main on Mar 1, 2025
Conversation

@danielvegamyhre (Contributor) commented on Feb 28, 2025

Summary

  • Add a bash script which kicks off a torchtitan Llama3 8b training run (with configurable params), then calls a python script to parse the logs and calculate the median tok/sec and peak memory usage
  • Add a README with usage instructions

Usage

Usage: TORCHTITAN_ROOT=<directory> ./float8_training_benchmark.sh
Optional parameters, configurable via environment variables (a sketch of how a wrapper script might apply them follows below):
 * FLOAT8_RECIPE: "rowwise" or "tensorwise". If set, use float8 training with the specified recipe; otherwise, use bf16 mixed precision training.
 * BATCH_SIZE: defaults to 1.
 * STEPS: defaults to 100.
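
For illustration, here is a minimal sketch of how such a wrapper might apply these defaults and kick off the run. The torchtitan entrypoint, the CLI override flags, and the log-parsing helper name below are assumptions made for this example, not necessarily what the script in this PR uses.

#!/bin/bash
# Sketch only: apply defaults, launch torchtitan, then parse the captured logs.
set -euo pipefail

: "${TORCHTITAN_ROOT:?Set TORCHTITAN_ROOT to your torchtitan checkout}"
BATCH_SIZE="${BATCH_SIZE:-1}"      # default documented above
STEPS="${STEPS:-100}"              # default documented above
LOG_FILE="/tmp/float8_training_log.txt"

EXTRA_ARGS=""
if [ -n "${FLOAT8_RECIPE:-}" ]; then
    # hypothetical flag; the real script wires the recipe through torchtitan's float8 config
    EXTRA_ARGS="--float8.recipe_name ${FLOAT8_RECIPE}"
fi

cd "${TORCHTITAN_ROOT}"
# run_train.sh and the override flags are assumed names for this sketch
./run_train.sh --training.batch_size "${BATCH_SIZE}" --training.steps "${STEPS}" ${EXTRA_ARGS} 2>&1 | tee "${LOG_FILE}"

# hypothetical helper that computes median tok/sec and peak memory from the log
python parse_torchtitan_logs.py --log-file "${LOG_FILE}"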

Example

~/ao/torchao/float8/benchmarking (bench)]$ TORCHTITAN_ROOT=${HOME}/torchtitan FLOAT8_RECIPE=rowwise STEPS=50 ./float8_training_benchmark.sh
...
...
...
[rank0]:[titan] 2025-02-28 14:33:20,373 - root - INFO - step:  1  loss: 12.2643  memory: 40.25GiB(42.37%)  tps: 590  tflops: 34.19  mfu: 3.46%
[rank0]:[titan] 2025-02-28 14:33:20,374 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:[titan] 2025-02-28 14:33:31,600 - root - INFO - step: 10  loss:  9.9184  memory: 47.79GiB(50.30%)  tps: 6,567  tflops: 380.35  mfu: 38.46%
[rank0]:[titan] 2025-02-28 14:33:43,781 - root - INFO - step: 20  loss:  8.3637  memory: 47.79GiB(50.30%)  tps: 6,726  tflops: 389.54  mfu: 39.39%
[rank0]:[titan] 2025-02-28 14:33:55,992 - root - INFO - step: 30  loss:  7.6944  memory: 47.79GiB(50.30%)  tps: 6,709  tflops: 388.53  mfu: 39.28%
[rank0]:[titan] 2025-02-28 14:34:08,279 - root - INFO - step: 40  loss:  7.3230  memory: 47.79GiB(50.30%)  tps: 6,668  tflops: 386.16  mfu: 39.05%
[rank0]:[titan] 2025-02-28 14:34:19,041 - root - INFO - [GC] Peforming periodical GC collection. 0.02 seconds.
[rank0]:[titan] 2025-02-28 14:34:20,586 - root - INFO - step: 50  loss:  7.0842  memory: 47.79GiB(50.30%)  tps: 6,657  tflops: 385.52  mfu: 38.98%
[rank0]:[titan] 2025-02-28 14:34:20,586 - root - INFO - Sleeping 2 seconds for other ranks to complete
[rank0]:[titan] 2025-02-28 14:34:22,587 - root - INFO - Training completed
[rank0]:NCCL version 2.25.1+cuda12.2

=====================================================
 Calculating training performance metrics
=====================================================
Median Tokens/Second (excluding step 1): 6668.0
Max Memory Usage: 47.79 GiB
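
As a rough illustration of the parsing step, the snippet below extracts the tps and memory fields from log lines in the format shown above and reports the median tokens/second (excluding step 1) and the peak memory. It is a simplified stand-in (plain awk, requiring gawk for asort), assuming the log lines keep the "step: ... memory: ...GiB ... tps: ..." layout; it is not the python script this PR adds.

#!/bin/bash
# Sketch: parse torchtitan step logs of the form shown above.
LOG_FILE="${1:-/tmp/float8_training_log.txt}"

grep -E 'step: +[0-9]+ +loss' "${LOG_FILE}" | awk '
{
    # pick out the values following the "step:", "memory:" and "tps:" labels
    for (i = 1; i <= NF; i++) {
        if ($i == "step:")   { step = $(i+1) + 0 }
        if ($i == "memory:") { mem = $(i+1) + 0 }             # "47.79GiB(50.30%)" -> 47.79
        if ($i == "tps:")    { tps = $(i+1); gsub(",", "", tps); tps += 0 }
    }
    if (step != 1) { tps_vals[n++] = tps }                    # exclude warmup step 1
    if (mem > max_mem) { max_mem = mem }
}
END {
    asort(tps_vals)                                           # gawk extension
    if (n % 2) { median = tps_vals[int(n / 2) + 1] }
    else       { median = (tps_vals[n / 2] + tps_vals[n / 2 + 1]) / 2 }
    printf "Median Tokens/Second (excluding step 1): %.1f\n", median
    printf "Max Memory Usage: %.2f GiB\n", max_mem
}'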

@danielvegamyhre added the float8, topic: improvement, and training labels on Feb 28, 2025
pytorch-bot (bot) commented on Feb 28, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1802

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures

As of commit 51a9780 with merge base 890e0ac:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Feb 28, 2025 (authors need to sign the CLA before a PR can be reviewed).
@danielvegamyhre marked this pull request as draft on February 28, 2025 at 23:13
@danielvegamyhre marked this pull request as ready for review on February 28, 2025 at 23:35
@@ -0,0 +1,18 @@
# Float8 training benchmarking
Contributor commented:

nit: move this to benchmarks/float8/something?


# validate recipe name
if [ -n "${FLOAT8_RECIPE}" ]; then
    if [ "$FLOAT8_RECIPE" != "rowwise" ] && [ "$FLOAT8_RECIPE" != "tensorwise" ]; then
Contributor commented:

nit: this is already checked twice (once in torchtitan, once in Float8LinearConfig), IMO we don't need to check it a third time :)

@vkuzo (Contributor) left a review comment:

looks great! Let's just move to benchmarks/float8

@danielvegamyhre (Contributor, Author) commented:
Looks like test failure is due to #1799 and unrelated to this change. Going to merge

@danielvegamyhre merged commit 7963f9c into main on Mar 1, 2025 (15 of 17 checks passed)