📚200+ Tensor/CUDA Cores Kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2 🎉🎉).
Several optimization methods for half-precision general matrix multiplication (HGEMM) using tensor cores with the WMMA API and MMA PTX instructions.
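As an illustration of the WMMA API these repositories build on, here is a minimal sketch of a warp-level 16×16×16 half-precision tile multiply. The kernel name and the assumptions (row-major A and C, column-major B, M/N/K multiples of 16) are mine, not taken from any of the listed projects.

```cuda
// Minimal WMMA sketch: one warp computes one 16x16 tile of C = A * B.
// Assumes row-major A and C, column-major B, and M, N, K multiples of 16.
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

__global__ void wmma_hgemm_naive(const half *A, const half *B, float *C,
                                 int M, int N, int K) {
    // Each warp owns one 16x16 output tile.
    int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    int warpN = blockIdx.y * blockDim.y + threadIdx.y;
    if (warpM * 16 >= M || warpN * 16 >= N) return;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);

    // March along K in steps of 16, accumulating into the FP32 fragment.
    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(a_frag, A + warpM * 16 * K + k, K);
        wmma::load_matrix_sync(b_frag, B + warpN * 16 * K + k, K);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    }
    wmma::store_matrix_sync(C + warpM * 16 * N + warpN * 16, c_frag, N,
                            wmma::mem_row_major);
}
```

The optimized kernels in these repositories go well beyond this: shared-memory tiling, double buffering, swizzled layouts, and MMA PTX instead of WMMA are what close the gap to cuBLAS.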
Optimizing SGEMM kernels on NVIDIA GPUs to close-to-cuBLAS performance.
FP64-equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme.
A simple but fast implementation of matrix multiplication in CUDA.
Code for benchmarking GPU performance with cublasSgemm and cublasHgemm.
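Such benchmarks typically time repeated cuBLAS calls with CUDA events and report achieved TFLOPS. Below is a hedged sketch for cublasHgemm; the helper name `time_hgemm`, square matrix sizes, and pre-allocated, pre-filled device buffers are my assumptions.

```cuda
// Sketch: time cublasHgemm on n x n FP16 matrices and report TFLOPS.
// Assumes dA, dB, dC are device buffers already allocated and initialized.
#include <cstdio>
#include <cublas_v2.h>
#include <cuda_fp16.h>

float time_hgemm(cublasHandle_t handle, int n, const __half *dA,
                 const __half *dB, __half *dC, int iters = 10) {
    __half alpha = __float2half(1.0f), beta = __float2half(0.0f);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up call so the timed loop excludes one-time setup cost.
    cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, dA, n, dB, n, &beta, dC, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // GEMM does 2*n^3 floating-point operations per call.
    double tflops = 2.0 * n * n * n * iters / (ms * 1e-3) / 1e12;
    printf("n=%d: %.3f ms/iter, %.2f TFLOPS\n", n, ms / iters, tflops);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / iters;
}
```

The same structure works for cublasSgemm with float buffers and float alpha/beta.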
Some common CUDA kernel implementations (not the fastest).
Use tensor cores to compute back-to-back HGEMM (half-precision general matrix multiplication) with MMA PTX instructions.
Fast SGEMM emulation on Tensor Cores
CUDA kernel functions
My attempt at making a GEMM kernel...