Enable preshuffled mixed dtype Cutlass Gemm #3722
Conversation
This pull request was exported from Phabricator. Differential Revision: D69955197
Summary: WIP to enable the new optimized preshuffled fp8xint4 gemm. Differential Revision: D69955197
Summary: WIP to enable the new optimized preshuffled fp8xint4 gemm. While the example compiles and runs, it runs into a variety of problems: the outputs are either completely incorrect or contain NaNs, or the kernel hits an illegal memory access. I'm not yet sure why. Differential Revision: D69955197
@IwakuraRein Despite this compiling and running, I'm getting incorrect outputs and very poor performance (even slower than the legacy f8i4 without packing or shuffling). Can you take a look and see if I'm doing something obviously wrong? Ignore files besides f8i4_shuffled.cu and mixed_dtype_utils.cu, as the others just fix cutlass v3.8 compatibility.
@jwfromm Are there negative values in the scale factors? This might be the reason for the accuracy drop after enabling the lookup table, and it can be easily fixed by applying this change to
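For reference, a quick way to rule this out is to inspect the scale tensor directly before dispatching the kernel. The sketch below is just that check, assuming the scales live in a torch tensor; the function and variable names (`check_scales`, `w_scale`) and the toy shape are placeholders, not part of the PR's API.

```python
import torch

def check_scales(w_scale: torch.Tensor) -> None:
    # Hypothetical group-wise scales for an int4-quantized weight; the actual
    # layout produced in quantize_ops.py may differ.
    n_negative = int((w_scale < 0).sum().item())
    if n_negative > 0:
        # Negative scales are the suspected cause of the accuracy drop with the
        # lookup-table dequant path discussed above.
        print(f"found {n_negative} negative scales; min = {w_scale.min().item():.6f}")
    else:
        print("all scale values are non-negative")

check_scales(torch.rand(4096, 32))  # toy positive scales -> "all scale values are non-negative"
```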
@IwakuraRein The scales are all positive and I'm running with the latest cutlass head commit (as of yesterday). The link you posted doesn't seem to include any changes to sm90_mma_tma_gmma_rs_warpspecialized_mixed_input.hpp; did you mean to paste a different one?
@jwfromm Sorry, I meant the changes in
fbgemm_gpu/experimental/gen_ai/bench/quantize_ops.py:1145:

```diff
- scales = scales.view(x.shape[0], -1)
+ scales = scales.view(x.shape[0], -1).t().contiguous()
```

fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/mixed_dtype_utils.cu:59:

```diff
- StrideB stride_B;
+ StrideB stride_B = cutlass::make_cute_packed_stride(StrideB{}, shape_B);
```

These should fix the bugs.
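To illustrate the Python half of this fix, the sketch below shows (with made-up toy shapes) how the added `.t().contiguous()` changes the memory layout of the group-wise scales; which layout the shuffled kernel actually expects is an assumption here, not something stated in the thread.

```python
import torch

num_rows, num_groups = 8, 4  # toy sizes; real shapes come from x and the quantization group size

# Flat scales as produced by the quantization step (illustrative values).
scales = torch.arange(num_rows * num_groups, dtype=torch.float32)

before = scales.view(num_rows, -1)                  # [num_rows, num_groups], groups are the inner dim
after = scales.view(num_rows, -1).t().contiguous()  # [num_groups, num_rows], groups become the outer dim

# Same values, different memory order; a kernel that indexes scales group-first
# would read the wrong elements from `before` but the intended ones from `after`.
print(before.shape, before.stride())  # torch.Size([8, 4]) (4, 1)
print(after.shape, after.stride())    # torch.Size([4, 8]) (8, 1)
```

The other half of the diff is the C++ side: stride_B was previously default-constructed and never filled in, and cutlass::make_cute_packed_stride(StrideB{}, shape_B) derives a packed stride from shape_B, which presumably accounts for the illegal memory accesses reported earlier.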