
Releases: ashvardanian/less_slow.cpp

v0.10: cuBLASLt examples for `fp8_e4m3` GEMM

27 Feb 12:56

DeepSeek has just released their mixed-precision FP8 GEMM implementation, and it felt like a good time to introduce some cuBLASLt snippets as a baseline for such work. On Nvidia H200, the results for different input sizes look like this:

--------------------------------------------------------------------------------------------------
Benchmark                                        Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------
cublaslt_tops<fp8_e4m3_t, float>/256         12496 ns        12496 ns        56284 TOP=2.67999T/s
cublaslt_tops<fp8_e4m3_t, float>/512         13089 ns        13089 ns        53100 TOP=20.4883T/s
cublaslt_tops<fp8_e4m3_t, float>/1024        14882 ns        14882 ns        46918 TOP=144.23T/s
cublaslt_tops<fp8_e4m3_t, float>/2048        25802 ns        25802 ns        26869 TOP=665.679T/s
cublaslt_tops<fp8_e4m3_t, float>/4096       109316 ns       109313 ns         6021 TOP=1.25715P/s
cublaslt_tops<fp8_e4m3_t, float>/8192       821080 ns       821050 ns          629 TOP=1.33907P/s
cublaslt_tops<fp8_e4m3_t, float>/16384     7135472 ns      7135461 ns           93 TOP=1.23269P/s
cublaslt_tops<fp8_e4m3_t, float>_BigO         0.00 N^3        0.00 N^3  
cublaslt_tops<fp8_e4m3_t, float>_RMS             2 %             2 % 

The advertised FP8 throughput for the H100 and H200 in the SXM form factor is around 2 Peta-Ops, and cuBLASLt reaches roughly 67% of that in these benchmarks, so there is clearly more performance left to squeeze out.
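
For context, a cuBLASLt FP8 call boils down to describing the operation and matrix layouts before a single cublasLtMatmul invocation. Below is a minimal sketch, not the repo's exact code: it assumes square n-by-n matrices, pre-allocated device buffers d_a, d_b, d_c, d_d, and omits error handling, per-tensor scaling factors, and algorithm-heuristic selection.

#include <cublasLt.h>

void fp8_gemm_sketch(void *d_a, void *d_b, void *d_c, void *d_d, int n) {
    cublasLtHandle_t handle;
    cublasLtCreate(&handle);

    // FP8 GEMMs accumulate in FP32, and cuBLASLt expects the "TN" layout for them.
    cublasLtMatmulDesc_t op;
    cublasLtMatmulDescCreate(&op, CUBLAS_COMPUTE_32F, CUDA_R_32F);
    cublasOperation_t trans_a = CUBLAS_OP_T, trans_b = CUBLAS_OP_N;
    cublasLtMatmulDescSetAttribute(op, CUBLASLT_MATMUL_DESC_TRANSA, &trans_a, sizeof(trans_a));
    cublasLtMatmulDescSetAttribute(op, CUBLASLT_MATMUL_DESC_TRANSB, &trans_b, sizeof(trans_b));

    // Inputs are `fp8_e4m3`; outputs here are `f16`, one of the allowed combinations.
    cublasLtMatrixLayout_t a_desc, b_desc, c_desc, d_desc;
    cublasLtMatrixLayoutCreate(&a_desc, CUDA_R_8F_E4M3, n, n, n);
    cublasLtMatrixLayoutCreate(&b_desc, CUDA_R_8F_E4M3, n, n, n);
    cublasLtMatrixLayoutCreate(&c_desc, CUDA_R_16F, n, n, n);
    cublasLtMatrixLayoutCreate(&d_desc, CUDA_R_16F, n, n, n);

    float alpha = 1.0f, beta = 0.0f;
    cublasLtMatmul(handle, op, &alpha, d_a, a_desc, d_b, b_desc, &beta, d_c, c_desc, d_d, d_desc,
                   nullptr /* algo */, nullptr /* workspace */, 0, nullptr /* stream */);

    cublasLtMatrixLayoutDestroy(d_desc);
    cublasLtMatrixLayoutDestroy(c_desc);
    cublasLtMatrixLayoutDestroy(b_desc);
    cublasLtMatrixLayoutDestroy(a_desc);
    cublasLtMatmulDescDestroy(op);
    cublasLtDestroy(handle);
}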

I haven't tried implementing synthetic ALU benchmarks for different FP8-oriented PTX instructions, so if you have time and want to try something new, feel free to submit a PR 🤗

Release v0.9.2

23 Feb 13:36

Release: v0.9.2 [skip ci]

Patch

  • Docs: Counting PTX as Assembly lines (cb470dd)

Release v0.9.1

12 Feb 17:10

Release: v0.9.1 [skip ci]

Patch

  • Docs: List project structure (5f603c7)

How to count GPU Tensor operations correctly 🤯

11 Feb 12:08

Measuring Tensor-Core throughput is tricky! Many families of matrix-multiplication instructions exist. Practically every Nvidia GPU generation brings new tiles, new numeric types, mixed-precision schemes, and "structured sparsity" models. All of those together form some of the longest PTX IR instructions. To make things worse, Tensor Core scheduling and the scale of collective execution differ across generations!

  • Before Volta and Tensor Cores, each GPU thread would execute its own scalar Fused-Multiply-Add — easy-peasy, as long as you know how to choose the optimal grid size for your GPU model.
  • On Volta, with new mma.* instructions and wmma:: intrinsics, 8 threads would execute every tiled Mat-Mul together. This octet of collaborating threads was creatively called by Nvidia engineers a "quadpair", of course 🤦‍♂️
  • On Ampere, with new wmma.mma.* instructions, all of the 32 threads in a single "warp" would work together. This abstraction makes sense to people familiar with CUDA C++ and how scheduling works on the GPU. Great!
  • On Hopper, things changed again, of course, with wgmma.mma_async.sync.*, which supports basic asynchronous primitives at the hardware level. It has 128 threads across 4 consecutive "warps" forming a "warp group".
  • On Blackwell, you would be wise to expect a new change, and it came with a broader set of functionality refactored into an all-new tcgen05.* namespace of instructions 🧠 🔫

The new PR addresses this by explicitly marking the collaboration "scale" of each instruction family and counting TOPS accordingly.
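
As a rough illustration (a hypothetical helper, not the exact code in the PR), the count of useful tensor operations scales with the MMA tile volume and with the number of collaborating groups rather than with the raw thread count:

#include <cstddef>

// Hypothetical helper: counts multiply-add operations per kernel launch, assuming each
// group of `threads_per_mma` threads issues `repetitions` MMAs of shape m x n x k.
// `threads_per_mma` is 1 for scalar FMA, 8 for Volta quadpairs, 32 for warp-level MMA,
// and 128 for Hopper warp groups.
constexpr std::size_t tensor_ops_per_launch(            //
    std::size_t m, std::size_t n, std::size_t k,        // MMA tile shape
    std::size_t repetitions, std::size_t total_threads, //
    std::size_t threads_per_mma) {
    std::size_t collaborating_groups = total_threads / threads_per_mma;
    return 2 * m * n * k * repetitions * collaborating_groups; // 2 ops per multiply-add
}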


Almost equally tricky is making sure that the PTXAS assembler doesn't optimize out relevant code blocks. In the past, one approach I'd use was putting an impossible condition at the end of a CUDA C++ kernel, like this:

template <typename input_type_, typename output_type_, int m_, int n_, int k_, int repetitions_ = 128>
__device__ inline void tops_tc_cuda_kernel() {
    using namespace nvcuda;
    wmma::fragment<wmma::matrix_a, m_, n_, k_, input_type_, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, m_, n_, k_, input_type_, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, m_, n_, k_, output_type_> c_frag;
    for (int i = 0; i != repetitions_; ++i) wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    if (threadIdx.x == 2147483647) wmma::store_matrix_sync(nullptr, c_frag, 16, wmma::mem_row_major);
}

This way, the compiler will see that I'm trying to export the accumulated value and will not remove our mma_sync call, even if the target address is a NULL pointer. Another approach I'd often use in PTX is to define dummy global variables and export a few values there:

.visible .global .align 4 .s32 dummy_sink_s32[32];
.visible .global .align 4 .f32 dummy_sink_f32[32];
.visible .entry tops_f16f32_sm90tc_m64n256k16_loop128_ptx_kernel() {
    ...
loop_exit:
    // Zero argument means - wait for all committed WGMMAs to complete.
    wgmma.wait_group.sync.aligned 0;

    // Use volatile stores to force the accumulator values to be written out.
    // This dummy write (to a global variable) makes the work observable and 
    // prevents the multiplication pipeline from being optimized out.
    st.global.volatile.f32 [dummy_sink_f32],      accum0;
    st.global.volatile.f32 [dummy_sink_f32+4],    accum1;
    ret;
}

But with WGMMA, the PTXAS tool will still optimize away our multiplications if the shared-memory tile descriptors aren't valid, even if it's just for a benchmark. So this PR shows how to assemble valid descriptors 🤗
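
For reference, here is a minimal sketch of how such a 64-bit shared-memory matrix descriptor is packed, following the PTX ISA documentation; the helper name and parameter names are illustrative, not the exact code in the PR:

#include <cstdint>

__device__ inline uint64_t wgmma_smem_descriptor( //
    uint32_t smem_address,                        // from `cvta.to.shared` / `__cvta_generic_to_shared`
    uint32_t leading_byte_offset,                 //
    uint32_t stride_byte_offset,                  //
    uint32_t swizzle_mode) {                      // 0 = none, 1 = 128B, 2 = 64B, 3 = 32B
    // Each address-like field is encoded as `(x & 0x3FFFF) >> 4` into a 14-bit slot.
    auto encode = [](uint32_t x) { return (uint64_t)((x & 0x3FFFF) >> 4); };
    uint64_t desc = 0;
    desc |= encode(smem_address);                 // bits  0..13: matrix start address
    desc |= encode(leading_byte_offset) << 16;    // bits 16..29: leading dimension byte offset
    desc |= encode(stride_byte_offset) << 32;     // bits 32..45: stride dimension byte offset
    desc |= (uint64_t)(swizzle_mode & 0x3) << 62; // bits 62..63: swizzle mode
    return desc;
}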


This PR fixes those issues and adds more PTX kernels to highlight the different aspects of GPGPU development 🤗

Minor

  • Add: f16f32 WMMA variant for Ampere (28e639e)
  • Add: f16f32 MMA variant for Volta (1359ca7)
  • Add: Inline-PTX in C++ for WGMMA (6e16165)
  • Add: WGMMA synchronization (0207843)
  • Add: Inlined PTX kernels in CUDA C++ (e2a1bfc)

Patch

  • Docs: New H200 stats (b5d4610)
  • Docs: Naming temporary compilation results (da36475)
  • Improve: Drop small WGMMA for conciseness (7f63ef2)
  • Fix: Invoke f16f32 in WGMMA (4423421)
  • Fix: tf32 perf and waiting on fences (ea4a3e0)
  • Fix: Counting TOPS across TC generations (85f78c3)
  • Make: Split Hopper and Ampere PTX (733cbac)
  • Make: Target SM 9.0a over SM 9.0 (726c1e1)

Release v0.8.2

07 Feb 23:45

Release: v0.8.2 [skip ci]

Patch

  • Docs: Recommend CMake from PyPI (3ff4265)

Release v0.8.1

07 Feb 22:39

Release: v0.8.1 [skip ci]

Patch

  • Improve: Shrink PTX loops (152e59a)
  • Fix: Reopen ASIO compilation (d1909f9)
  • Improve: Naming scheme for PTX kernels (c1b21c8)
  • Fix: Illegal memory access on Volta (b7eab6d)

v0.8: Mat-Muls on Nvidia Hopper and Blackwell

07 Feb 21:08

This release answers a few questions:

  • CUTLASS vs. cuBLAS performance: which to choose?
  • How did MMA instructions change with Hopper H100?
  • How did they change again with Blackwell B200?

Minor

  • Add: Warp-Group Binary MMA (d6daf3a)
  • Add: Larger m64n256k8 WGMMA variant (3e3530e)
  • Add: Warp-Group Async kernels (6cc7e34)
  • Add: f64 MMA PTX variant (ae450e5)
  • Add: CuTe draft (fdea727)
  • Add: CUTLASS placeholders (b1ab93d)
  • Add: Hopper sm90a PTX kernels (4bcf74a)

Patch

  • Improve: CUresult error handling (d74d430)
  • Improve: Logging CUDA errors (953a696)
  • Fix: Synchronize TCs (494ba52)
  • Improve: Impossible %tid condition against NVCC (8a9c9c5)
  • Make: Temporarily block CUTLASS (df1b39c)
  • Improve: Cleaner PTX code (71dea0c)
  • Improve: Avoid NVCC-specific features (3d65c7f)
  • Fix: Re-creating a CUDA stream (e831650)
  • Make: Compile in parallel by default (8e671c6)
  • Make: Separate host-only code (f751fbf)
  • Docs: Counter-intuitive PTX facts (822fa2f)
  • Docs: H200 vs MI 300X vs GB200 specs (cc36bcd)
  • Make: CUTLASS dependency (f272c40)
  • Fix: Synchronize cuBLAS for profiling (4077f26)
  • Docs: Blackwell tensor cores (ec35b35)
  • Fix: Missing _Float16 in NVCC, use half (71cadca)
  • Improve: Same size range for GEMM (d914fce)
  • Fix: Different output size for cublasGemmEx (304c880)

v0.7: Networking in POSIX vs. io_uring 💍

07 Feb 10:57

To showcase the differences between IO approaches, this release brings a batch-asynchronous echo-server implementation on top of UDP, measuring packet-drop frequency, throughput, and latency for:

  • ASIO
  • POSIX
  • io_uring

The numbers currently look like:

Running build_release/less_slow
Run on (6 X 4000.4 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 2048 KiB (x6)
  L3 Unified 327680 KiB (x1)
Load Average: 0.93, 0.52, 0.47
----------------------------------------------------------------------------------------------------------
Benchmark                                                Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------------------
rpc_libc/loopback/min_time:2.000/manual_time          5514 us         2298 us          509 bytes_per_second=45.3389Mi/s drop,%=0 items_per_second=46.427k/s max_packet_latency,us=55 mean_batch_latency,us=5.51403k mean_packet_latency,us=21.5392
rpc_uring55/loopback/min_time:2.000/manual_time       1630 us         1591 us         1727 bytes_per_second=153.366Mi/s drop,%=0 items_per_second=157.046k/s max_packet_latency,us=1.822k mean_batch_latency,us=1.63009k mean_packet_latency,us=6.36754
rpc_asio/loopback/min_time:2.000/manual_time         89058 us          878 us           28 bytes_per_second=2.80717Mi/s drop,%=12.9325 items_per_second=2.87454k/s max_packet_latency,us=916 mean_batch_latency,us=89.0576k mean_packet_latency,us=399.553

The current example only uses the most basic io_uring features available with Linux kernel 5.5. In the next iterations (#30), we should extend it with the following functionality:

  • IORING_REGISTER_BUFFERS - since 5.1
  • IORING_RECV_MULTISHOT or io_uring_prep_recvmsg_multishot - since 6.0
  • IORING_OP_SEND_ZC or io_uring_prep_sendmsg_zc - since 6.0
  • IORING_SETUP_SQPOLL - with IORING_FEAT_SQPOLL_NONFIXED after 5.11
  • IORING_SETUP_SUBMIT_ALL - since 5.18
  • IORING_SETUP_COOP_TASKRUN - since 5.19
  • IORING_SETUP_SINGLE_ISSUER - since 6.0
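
For reference, the 5.5-level pattern in use today boils down to roughly the following sketch (illustrative, not the repo's exact code): queue a recvmsg, wait for its completion, then echo the datagram back with a sendmsg on the same ring. It assumes an already-bound UDP socket fd and omits error handling.

#include <liburing.h>
#include <sys/socket.h>

void echo_one_datagram(int fd) {
    io_uring ring;
    io_uring_queue_init(64, &ring, 0); // plain ring: no SQPOLL, no registered buffers

    char buffer[2048];
    sockaddr_storage peer {};
    iovec vec {buffer, sizeof(buffer)};
    msghdr message {};
    message.msg_name = &peer;
    message.msg_namelen = sizeof(peer);
    message.msg_iov = &vec;
    message.msg_iovlen = 1;

    // Queue a `recvmsg`, submit, and block until a datagram arrives.
    io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_recvmsg(sqe, fd, &message, 0);
    io_uring_submit(&ring);
    io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    int received = cqe->res;
    io_uring_cqe_seen(&ring, cqe);

    // Echo the payload back to the sender on the same ring.
    vec.iov_len = received > 0 ? (size_t)received : 0;
    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_sendmsg(sqe, fd, &message, 0);
    io_uring_submit(&ring);
    io_uring_wait_cqe(&ring, &cqe);
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
}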

Feel free to join the development 🤗

Minor

  • Add: io_uring variant for kernel 6.0 (ce73aa3)
  • Add: io_uring draft (ec28b57)
  • Add: External route networking (a2a8c9e)
  • Add: POSIX echo implementation (3cce3b9)
  • Add: ASIO "echo" server/client ping-pong (08d3326)

Patch

  • Fix: Depend io_uring compilation on kernel version (70c53f6)
  • Improve: IOSQE_FIXED_FILE for kernel 6.0+ (f7f7693)
  • Improve: ASIO benchmarks (6be216a)
  • Docs: Refactor spell-checks (24706a7)
  • Make: Order spell-checks (1358a69)
  • Docs: Recommend OpenBLAS (035e388)
  • Improve: Avoid std::format in io_uring (1857a82)
  • Fix: ARCH_ENABLE_TAGGED_ADDR needs Linux 6.2+ (3993b0c)
  • Fix: Missing openblas_set_num_threads (3cab87d)
  • Docs: Instal libBLAS (7629609)
  • Improve: SO_ZEROCOPY (fd4c9e2)
  • Improve: Retrofit registering buffers in 5.5 (95de751)
  • Make: RelWithDebInfo flags (2838fd5)
  • Improve: Code styling on Windows (c9238a1)
  • Fix: Avoid in-place increment (2c25b4d)
  • Make: Disable CUDA by default (94879fd)
  • Make: Matching VERSION in CMake (b4dc186)
  • Improve: Detect Linux version (f3e91fa)
  • Improve: physical_cores for Windows refactor (0eb985c)
  • Docs: Future io_uring tasks (53c4ca6)
  • Improve: io_uring optional timeouts (f933582)
  • Make: Revert to default BLAS (b3e13dd)
  • Improve: io_uring server logic (3dfe612)
  • Fix: liburing example (a2a9d6c)
  • Improve: Reuse benchmarking logic (cae4175)
  • Improve: Manual IO timing (fc60bfd)
  • Make: Switch to PkgConfig for liburing (b4e50ad)
  • Fix: Compiling asio example (b041392)
  • Make: Tag dependencies, where possible (8f2e985)
  • Improve: Batching client/server requests (955be1d)
  • Make: liburing & asio deps (7256f98)

Less Slow v0.6: Thrust → CUDA → PTX → SASS 🏋️‍♂️🏋️‍♀️

29 Jan 00:28

It's almost impossible to imagine modern High-Performance Computing without GPUs. Yet, there are surprisingly few "full stack" demos out there for folks wanting to build intuition around CUDA C++, PTX Intermediate Representations, SASS Assembly, and higher-level libraries like Thrust, CUB, or the various cuBLAS flavors. This new release of Less Slow covers all of those! 🥳

Tensor Cores

The main highlight is an in-depth look at Tensor Core designs, from their extensive type system to the complexity of tile shapes, both notoriously under-documented and confusing areas. These capabilities differ across Volta, Turing, Ampere, Ada, and Hopper GPUs, mapping to different PTX intrinsics (like wmma, binary bmma, or warp-group wgmma) and culminating in yet another shape at the SASS level, where a single wmma.mma.sync.aligned.row.col.m16n16k16.f32.f32 intrinsic on Volta expands into multiple HMMA.884.F32.F32.STEPx instructions. And if you believe that instruction name is long... be warned 😅

__global__ void tops_f16f16_sm70tc_16x16x16_1024unroll_cuda_kernel() {
    using namespace nvcuda;
    wmma::fragment<wmma::matrix_a, 16,16,16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16,16,16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16,16,16, half> c_frag;
    for (int i = 0; i < 1024; ++i)
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
}
$ cuobjdump -sass less_slow_from_cu.cubin | grep -i mma
# e.g. HMMA.884.F32.F32.STEP2 ...

This indicates the 8×8×4 shape actually used by the hardware on Volta.

PTX vs SASS

I've also hand-written PTX kernels that may look like this:

.visible .entry tops_f16f16_sm70tc_16x16x16_1024loop_ptx_kernel()
{
  // ...
  loop_start:
    // A single wmma instruction
    wmma.mma.sync.aligned.row.col.m16n16k16.f16.f16
      { %f0, %f1, %f2, %f3 }, // output accumulators
      { %f4, ... },          // A
      { %f12, ... },         // B
      { %f0, %f1, %f2, %f3 }; // input accumulators
    // ...
  bra loop_start;
}

Using the provided scripts, you can see for yourself just how different manually written vs. machine-generated PTX can be and how to invoke kernels directly from C++ in various ways — whether through the CUDA Runtime API or the CUDA Driver API — loading and JIT-compiling bits of PTX on the fly!

cuInit(0);
CUdevice dev; cuDeviceGet(&dev, 0);
CUcontext ctx; cuCtxCreate(&ctx, 0, dev);
CUmodule mod; cuModuleLoad(&mod, "less_slow.ptx");
CUfunction fun; cuModuleGetFunction(&fun, mod, "tops_f16f16_sm70tc_16x16x16_1024loop_ptx_kernel");

void* args[] = { /* kernel parameters here */ };
cuLaunchKernel(fun,
               1, 1, 1,  // gridDim
               256, 1, 1,// blockDim
               0, nullptr, args, nullptr);
cuCtxSynchronize();
cuModuleUnload(mod);
cuCtxDestroy(ctx);

cuBLAS in Practice

I've also included theoretical throughput benchmarks alongside real matrix multiplications via cuBLAS, in case you want to compare actual performance to the raw theoretical numbers. One important observation here is the limited support for low-resolution numeric types:

if constexpr (std::is_same_v<scalar_type_, float>) {
    scalar_type_ alpha = 1, beta = 0;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &alpha, a.begin(), lda, b.begin(), ldb, &beta, c.begin(), ldc);
} else if constexpr (std::is_same_v<scalar_type_, double>) {
    scalar_type_ alpha = 1, beta = 0;
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &alpha, a.begin(), lda, b.begin(), ldb, &beta, c.begin(), ldc);
} else if constexpr (std::is_same_v<scalar_type_, __half>) {
    scalar_type_ alpha = 1, beta = 0;
    cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &alpha, a.begin(), lda, b.begin(), ldb, &beta, c.begin(), ldc);
} else if constexpr (std::is_same_v<scalar_type_, int8_t>) {
    int32_t alpha_int = 1, beta_int = 0;
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &alpha_int, a.begin(), CUDA_R_8I, lda, b.begin(), CUDA_R_8I, ldb, &beta_int, c.begin(), CUDA_R_32I, ldc, CUDA_R_32I, CUBLAS_GEMM_DEFAULT);
}

Even integer kernels have a different signature, requiring alpha and beta to match the accumulator type rather than the inputs. Very few libraries have any adaptations for binary matrices or sub-byte representations.

Beyond Linear Algebra

Since GPUs obviously go beyond linear algebra, Thrust and CUB are perfect for exploring other domains in heterogeneous computing. I’ve added snippets that mostly revolve around sorting algorithms, showcasing the differences in memory management between Thrust and CUB and explaining why CUB calls often come in pairs, like:

size_t temp_size = 0;
void *d_temp = nullptr;
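// First call with a null workspace only reports the required `temp_size`.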
cub::DeviceRadixSort::SortKeys(nullptr, temp_size, d_in_keys, d_out_keys, count);
cudaMalloc(&d_temp, temp_size);
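// Second call, now with real temporary storage, performs the actual sort.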
cub::DeviceRadixSort::SortKeys(d_temp, temp_size, d_in_keys, d_out_keys, count);

This was also a good place to show how Thrust and CUB operations can be scheduled together on the same asynchronous streams and profiled with GPU time instead of CPU time to avoid unnecessary blocking ⏲️
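
A condensed sketch of that pattern follows; it uses illustrative names (d_keys, d_keys_out, d_temp, temp_size, count are assumed to be pre-allocated) rather than the repo's exact code:

#include <thrust/sort.h>
#include <thrust/execution_policy.h>
#include <cub/cub.cuh>

void sort_on_stream(unsigned *d_keys, unsigned *d_keys_out, void *d_temp, size_t temp_size, int count) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, stream);
    // Thrust picks up the stream through an execution policy, CUB through its last argument.
    thrust::sort(thrust::cuda::par.on(stream), d_keys, d_keys + count);
    cub::DeviceRadixSort::SortKeys(d_temp, temp_size, d_keys, d_keys_out, count, 0, 32, stream);
    cudaEventRecord(stop, stream);

    cudaEventSynchronize(stop); // the only blocking call; everything above is asynchronous
    float gpu_milliseconds = 0;
    cudaEventElapsedTime(&gpu_milliseconds, start, stop); // GPU time, not CPU time

    cudaEventDestroy(stop);
    cudaEventDestroy(start);
    cudaStreamDestroy(stream);
}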


Enjoy exploring, and happy GPU hacking! I’ll keep adding to this project (and other related ones) as we go along!

Changelog

  • Add: Binary BMMA kernels for GPU (6a609a0)
  • Add: Tensor Core intrinsic benchmarks (1bdb5df)
  • Add: cuBLAS benchmarks (2f791fe)
  • Add: Precompiled CUDA C++ kernels (c1a6f3e)
  • Add: Using CUDA Driver API to JIT .ptx (82cb684)
  • Add: PTX and .cuh kernels (824e473)
  • Add: Sorting with thrust and cub (df3b2c1)
  • Add: Thrust, CUB, CUDA sorting (551402d)
  • Add: Thrust, CUB, CUDA sorting (8481114)
  • Make: Drop OpenBLAS (3c92c36)
  • Fix: Use f16 MMA (141d285)
  • Fix: Lower PTX version for JIT (eff3854)
  • Fix: Working PTX kernel (514db0f)
  • Docs: Introduce Warp-Group-MMA on Hopper (400f294)
  • Make: Build CUDA for multiple platforms (3283ab0)
  • Fix: Avoid optimizing-out SASS code (986b8bc)
  • Fix: Compiling cuBLAS calls (312409a)
  • Make: Don't compile PTX (53202e6)
  • Make: Silence NVCC warnings (a6cdc74)
  • Fix: NVCC compilation issues (494e705)
  • Make: Upgrade fmt for NVCC builds (88277bf)
  • Fix: Ranges require constexpr on NVCC (c1d7b2f)
  • Make: Switch to CUDA Toolkit for GPU libs (2589a40)
  • Make: Options for CUDA & TBB in CMake (4d03c08)

v0.5.4: Supporting MSVC on Windows 🪟

26 Jan 20:40

The less_slow.cpp project now supports Microsoft Visual C++ (MSVC), thanks to the extensive list of patches suggested by @RazielXYZ 👏

Key updates include switching to OpenBLAS via FetchContent for comparable linear-algebra performance across platforms, enabling OpenMP on MSVC for parallelism in Eigen-based computations, and revising OpenMP loop indices to use int64_t, as MSVC requires signed types for parallel loops. Physical-core detection on high-core-count Windows systems has also been improved by using GetActiveProcessorCount(ALL_PROCESSOR_GROUPS) and refining the detection logic. Furthermore, the port addresses missing functionality, such as the lack of __builtin_popcountll on MSVC, with a manual fallback for is_power_of_two.
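
That fallback is a good example of how small such portability shims usually are; a minimal sketch (not the repo's exact code) could look like this:

#include <cstdint>

// MSVC has no __builtin_popcountll, but `is_power_of_two` doesn't need a population
// count at all: a power of two has exactly one bit set.
inline bool is_power_of_two(std::uint64_t x) noexcept { return x != 0 && (x & (x - 1)) == 0; }
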

Additional findings include MSVC-specific behaviors, such as how linking AVX-512 code significantly slows down builds and how assembly-based benchmarks require further investigation for proper MSVC integration. Interestingly, heavily-templated libraries, like Ranges-v3 or CRTE (Compile-time RegEx), show much worse performance on MSVC than with GCC and Clang.