
Enable preshuffled mixed dtype Cutlass Gemm #3722

Open · wants to merge 2 commits into base: main

Conversation

@jwfromm (Contributor) commented Feb 21, 2025

Summary: WIP to enable new optimized preshuffled fp8xint4 gemm.

Differential Revision: D69955197
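For context on what an f8i4 gemm consumes: the int4 weights are stored two values per byte and dequantized inside the kernel against fp8 activations. A minimal pure-Python sketch of plain nibble packing (illustrative only; the actual kernel additionally preshuffles the packed layout for efficient tensor-core loads):

```python
def pack_int4(vals):
    """Pack a flat list of signed int4 values (range [-8, 7]) two per byte.

    Hypothetical illustration of the packed-weight idea; the real kernel
    uses its own preshuffled ordering.
    """
    assert len(vals) % 2 == 0
    out = []
    for lo, hi in zip(vals[0::2], vals[1::2]):
        out.append((lo & 0x0F) | ((hi & 0x0F) << 4))
    return out

def unpack_int4(packed):
    """Inverse of pack_int4, with manual sign extension of each nibble."""
    vals = []
    for byte in packed:
        for nibble in (byte & 0x0F, (byte >> 4) & 0x0F):
            vals.append(nibble - 16 if nibble > 7 else nibble)
    return vals

weights = [-8, 7, 0, -1, 3, -4, 5, 6]
assert unpack_int4(pack_int4(weights)) == weights
```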

@facebook-github-bot (Contributor)
This pull request was exported from Phabricator. Differential Revision: D69955197

netlify bot commented Feb 21, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

Latest commit: bbca782
Latest deploy log: https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/67b92525c91cb400082f474e
Deploy Preview: https://deploy-preview-3722--pytorch-fbgemm-docs.netlify.app

jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request Feb 21, 2025
Summary:

WIP to enable new optimized preshuffled fp8xint4 gemm.

Differential Revision: D69955197

jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request Feb 21, 2025
Summary:

WIP to enable new optimized preshuffled fp8xint4 gemm.

While the example compiles and runs, it runs into a variety of problems. The outputs are either completely incorrect, contain NaNs, or the kernel hits an Illegal Memory Access. I'm not yet sure why.

Differential Revision: D69955197

@jwfromm (Contributor, Author) commented Feb 21, 2025

@IwakuraRein Despite this compiling and running, I'm getting incorrect outputs and very poor performance (even slower than the legacy f8i4 without packing or shuffling). Can you take a look and see if I'm doing something obviously wrong?

Ignore files besides f8i4_shuffled.cu and mixed_dtype_utils.cu as the others just fix cutlass v3.8 compatibility.

jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request Feb 22, 2025

Differential Revision: D69890673

@IwakuraRein
@jwfromm Are there negative values in the scale factors? This might be the reason for the accuracy drop after enabling lookup table, and can be easily fixed by applying this change to external/cutlass/include/cutlass/gemm/collective/sm90_mma_tma_gmma_rs_warpspecialized_mixed_input.hpp in your fork.
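To illustrate why the sign of the scales interacts with a lookup-table path: a pure-Python sketch (hypothetical, not the Cutlass implementation) of LUT dequantization in which the scale's sign is baked into every table entry. A LUT variant that reconstructed values from magnitudes alone would break on negative scales, which is the failure mode described above:

```python
def dequant_direct(codes, scale):
    # codes: signed int4 values in [-8, 7]
    return [c * scale for c in codes]

def dequant_lut(codes, scale):
    # 16-entry table indexed by the raw 4-bit pattern; the sign of the
    # scale is carried in every table entry, so negative scales work.
    lut = [(v - 16 if v > 7 else v) * scale for v in range(16)]
    return [lut[c & 0x0F] for c in codes]

codes = [-8, -1, 0, 3, 7]
assert dequant_lut(codes, 0.5) == dequant_direct(codes, 0.5)
assert dequant_lut(codes, -0.5) == dequant_direct(codes, -0.5)
```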

@jwfromm (Contributor, Author) commented Feb 25, 2025

@IwakuraRein The scales are all positive and I'm running with the latest cutlass head commit (as of yesterday). The link you posted doesn't seem to include any changes to sm90_mma_tma_gmma_rs_warpspecialized_mixed_input.hpp; did you mean to paste a different one?

@IwakuraRein
@jwfromm Sorry, I meant the changes to include/cutlass/detail/collective/mixed_input_utils.hpp in that link. But since your scales are all positive and you're running with the latest cutlass, I guess this is not the issue.

@IwakuraRein commented Mar 3, 2025

fbgemm_gpu/experimental/gen_ai/bench/quantize_ops.py:1145:

```diff
- scales = scales.view(x.shape[0], -1)
+ scales = scales.view(x.shape[0], -1).t().contiguous()
```

fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/mixed_dtype_utils.cu:59:

```diff
- StrideB stride_B;
+ StrideB stride_B = cutlass::make_cute_packed_stride(StrideB{}, shape_B);
```

These should fix the bugs.
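The first change flips the group-scale layout from row-major [N, num_groups] to its transpose stored contiguously, so the row index (not the group index) varies fastest in memory. A small pure-Python sketch of the equivalent index transformation (sizes illustrative; the real tensors are torch tensors):

```python
N, GROUPS = 4, 3  # illustrative: N weight rows, K // group_size scale groups

# Row-major [N, GROUPS]: the group index varies fastest in memory.
scales = list(range(N * GROUPS))
def at_row_major(i, j):
    return scales[i * GROUPS + j]

# Equivalent of .t().contiguous(): a [GROUPS, N] buffer in which the
# row index varies fastest, one contiguous run of scales per group.
transposed = [at_row_major(i, j) for j in range(GROUPS) for i in range(N)]
def at_transposed(j, i):
    return transposed[j * N + i]

# Same values, reachable through swapped indices.
assert all(at_row_major(i, j) == at_transposed(j, i)
           for i in range(N) for j in range(GROUPS))
```

The second change initializes stride_B rather than leaving it default-constructed; a garbage stride fed to the kernel would plausibly account for the Illegal Memory Access reported earlier in the thread.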
