[CUTLASS] Add NDEBUG option to CUTLASS compile to speed up attention kernel #14798
Conversation
But a weird thing is, if I run the same attention workload via the cutlass example, the same kernel runs in 1.3 msec, see below (compared to our BYOC result, 2.4 msec).
I've also checked Triton and Flash Attention perf on the same workload by running /~https://github.com/openai/triton/blob/main/python/tutorials/06-fused-attention.py#L330 and saw around 1.2 - 1.3 msec. So I want to believe that ~1.3 msec is the right result for an attention kernel on this workload. Maybe there is something off in how we use this kernel from our BYOC? I compared the generated code and the cutlass example code but didn't find any difference, and there is no difference in the nvcc options that might affect performance other than this.
@spectrometerHBH has the same observation. Good catch!
I can confirm there's a performance difference; the profiler also shows a different number of instructions executed, even though it's indeed the same kernel.
I tried updating the cutlass submodule revision, but it didn't help. |
It turns out that the cutlass profiler has ...
Wow, then the Triton and Flash Attention kernels may indeed be faster than the cutlass one, given that the triton implementation is definitely not doing the causal mask optimization.
The triton kernel is also causal attention (/~https://github.com/openai/triton/blob/main/python/tutorials/06-fused-attention.py#LL49C57-L49C57); it's not doing a full Q*K.T, so it has less computation.
Interesting! Yeah, I didn't understand this loop bound.
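For reference, here is a minimal NumPy sketch of the causal loop bound being discussed (not the actual Triton or CUTLASS code; the function name and blocking scheme are made up for illustration). Each query block only loops over key blocks at or before its own position, so only about half of the Q @ K.T tiles are computed:

```python
import numpy as np

# A sketch of block-wise causal attention. The point is the inner loop bound:
# key blocks are visited only up to the current query block, so roughly half
# of the Q @ K.T tiles are ever computed compared to a non-causal kernel.
def causal_attention_blocked(q, k, v, block=128):
    seq, dim = q.shape
    out = np.zeros_like(q)
    tiles = 0
    for qs in range(0, seq, block):                      # query blocks
        qe = min(qs + block, seq)
        scores = np.full((qe - qs, seq), -np.inf)
        for ks in range(0, qe, block):                   # causal bound: ks < qe
            ke = min(ks + block, seq)
            scores[:, ks:ke] = q[qs:qe] @ k[ks:ke].T / np.sqrt(dim)
            tiles += 1
        # mask the upper-triangular corner of the diagonal block
        rows = np.arange(qs, qe)[:, None]
        cols = np.arange(seq)[None, :]
        scores[cols > rows] = -np.inf
        probs = np.exp(scores - scores.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)
        out[qs:qe] = probs @ v
    return out, tiles
```

With a sequence length of 4096 and a 128-wide block this touches 32 * 33 / 2 = 528 tiles instead of the 32 * 32 = 1024 a full Q @ K.T would need, so a causal kernel does roughly half the matmul work of a non-causal one.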
I found that adding this option brings a non-trivial perf improvement to the attention kernel (2.4 vs 2.7 msec for the heaviest workload in SD UNet, see below). This results in a few msec speedup for SD UNet e2e.
Before (nvprof output on test_attention_offload((2, (4096, 4096), 8, (40, 40), "float16"))):

After:
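For context on what the flag changes: defining NDEBUG compiles out the standard assert() calls in the CUTLASS headers, which lines up with the different instruction counts seen in the profiler. Below is a minimal sketch of comparing the two builds; it assumes the generated kernel source is compiled through tvm.contrib.nvcc.compile_cuda, and the file and include paths are illustrative, not the actual BYOC compile path this PR touches.

```python
# Sketch only: compile a generated CUTLASS attention kernel with and without
# -DNDEBUG to compare the resulting binaries. "attention_kernel.cu" and the
# include paths are placeholders, not the real BYOC output.
from tvm.contrib import nvcc

with open("attention_kernel.cu") as f:
    kernel_src = f.read()

common = ["-O3", "-std=c++17", "-Icutlass/include", "-Icutlass/tools/util/include"]

# Without NDEBUG: device-side assert() calls in the CUTLASS headers stay in the kernel.
fatbin_default = nvcc.compile_cuda(kernel_src, target_format="fatbin", options=common)

# With the option added by this PR: the asserts are compiled out.
fatbin_ndebug = nvcc.compile_cuda(
    kernel_src, target_format="fatbin", options=common + ["-DNDEBUG"]
)
```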
@cyx-6 @vinx13