Faster sorting with SIMD CUDA intrinsics (2024)

CUDA "SIMD" Intrinsics and Their Limitations

The discussion kicks off with a clarification about what "SIMD" means in the CUDA context. "ashvardanian" points out that the article's topic, warp-level synchronization, is different from typical CUDA SIMD intrinsics: "Most 'CUDA SIMD' intrinsics are designed to process a 32-bit data pack containing 2x 16-bit or 4x 8-bit values." He argues that this significantly limits their application outside of video and string processing. He also mentions his experience with DPX instructions on Hopper GPUs for StringZilla, noting that "the gains aren't huge."
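
To make the "32-bit data pack" idea concrete, here is a minimal sketch in portable C++ of a wrapping add over four 8-bit lanes packed into one 32-bit word. This is the kind of operation CUDA exposes through intrinsics such as `__vadd4`. The function name `packed_add_u8x4` is hypothetical, and the code is an emulation for illustration, not how the hardware or the article implements it.

```cpp
#include <cstdint>

// Hypothetical helper: per-byte wrapping add on four 8-bit lanes packed
// into one 32-bit word (an emulation of what a packed-byte add computes).
constexpr std::uint32_t packed_add_u8x4(std::uint32_t a, std::uint32_t b) {
    constexpr std::uint32_t low7 = 0x7F7F7F7Fu;  // low 7 bits of every lane
    constexpr std::uint32_t high = 0x80808080u;  // top bit of every lane
    // Add only the low 7 bits of each lane: a carry can reach bit 7 of its
    // own lane but can never spill into the neighbouring lane.
    const std::uint32_t sum = (a & low7) + (b & low7);
    // Fold the top bits back in with XOR, which drops each lane's carry-out.
    return sum ^ ((a ^ b) & high);
}

// Lanes, high to low: 01+01=02, FF+01=00 (wraps), 7F+01=80, 80+80=00 (wraps).
static_assert(packed_add_u8x4(0x01FF7F80u, 0x01010180u) == 0x02008000u,
              "lanes must wrap independently");
```

Each lane holds only 8 bits, which is why this style of packing suits pixel or character data better than general 32-bit sort keys.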

SWAR (SIMD Within A Register) on GPUs

The thread then turns to the potential and challenges of SWAR (SIMD Within A Register) on GPUs. "winwang" uses the term SWAR for the data-packing approach that "ashvardanian" described. "ashvardanian" says he is exploring SWAR and assembling benchmarks: "I've begun assembling a set of synthetic benchmarks to compare DP4A vs. DPX... but it feels incomplete without SWAR. My working hypothesis is that 64-bit SWAR on properly aligned data could be very useful in GPGPU". "winwang" admits he isn't too familiar with the field but recounts a thought experiment involving prefix sums. He also raises a concern about 64-bit integers: "I was of the impression that i64 is just simulated."
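
As a rough illustration of what 64-bit SWAR looks like, the sketch below counts how many of the eight bytes in a 64-bit word equal a target byte, using only ordinary integer operations. It is written in portable C++ rather than CUDA, the name `count_matching_bytes` is hypothetical, and it is not taken from the benchmarks mentioned in the thread.

```cpp
#include <cstdint>

// Hypothetical helper: count the bytes of `word` that equal `target`,
// treating the 64-bit word as eight independent 8-bit lanes.
constexpr std::uint64_t count_matching_bytes(std::uint64_t word,
                                             std::uint8_t target) {
    constexpr std::uint64_t ones = 0x0101010101010101ull;
    constexpr std::uint64_t low7 = 0x7F7F7F7F7F7F7F7Full;
    // Broadcast the target into every lane; matching lanes become 0x00.
    const std::uint64_t x = word ^ (ones * target);
    // Bit 7 of each lane of `nonzero` is set exactly when that lane is
    // nonzero. The add cannot carry across lanes (0x7F + 0x7F = 0xFE).
    const std::uint64_t nonzero = ((x & low7) + low7) | x;
    // Invert and keep only bit 7: 0x80 in every lane that matched.
    const std::uint64_t match_flags = ~(nonzero | low7);
    // Turn each flag into 0x01 and sum all lanes into the top byte.
    return ((match_flags >> 7) * ones) >> 56;
}

// Bytes 61 62 61 63 61 62 61 61 contain 0x61 five times.
static_assert(count_matching_bytes(0x6162616361626161ull, 0x61) == 5,
              "five lanes match");
```

Whether this pays off on a GPU depends on how efficiently the hardware handles 64-bit integer arithmetic, which is the concern raised at the end of the paragraph above.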

GPU vs. CPU Sorting Performance at Small Scales

The discussion touches on how GPU and CPU sorting compare, particularly for small datasets. "DennisL123" expresses interest in comparing a GPU implementation against a CPU-based Radix sort. "winwang" responds, "I don't think a GPU sort would beat a CPU sort at this scale... CPUs are simply too fast for (super-)small data, especially with AVX-512." However, "winwang" acknowledges that GPUs become more competitive with larger datasets, such as in a mergesort.
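
For reference, a CPU-side radix sort of the kind such a comparison might use can be sketched as follows. This is a plain scalar LSD (least-significant-digit) radix sort over 32-bit keys, with 8-bit digits chosen as an assumption. The name `radix_sort_u32` is hypothetical, and nothing here reflects a specific implementation or any AVX-512 code from the thread.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: LSD radix sort of 32-bit keys, one byte per pass.
void radix_sort_u32(std::vector<std::uint32_t>& keys) {
    std::vector<std::uint32_t> scratch(keys.size());
    for (int shift = 0; shift < 32; shift += 8) {
        // offsets[d + 1] first counts the keys whose current digit is d.
        std::array<std::size_t, 257> offsets{};
        for (std::uint32_t k : keys) {
            ++offsets[((k >> shift) & 0xFFu) + 1];
        }
        // Prefix sum: offsets[d] becomes the first output slot for digit d.
        for (std::size_t d = 0; d < 256; ++d) {
            offsets[d + 1] += offsets[d];
        }
        // Stable scatter into the scratch buffer, in input order.
        for (std::uint32_t k : keys) {
            scratch[offsets[(k >> shift) & 0xFFu]++] = k;
        }
        // After four passes (an even number) the result is back in `keys`.
        keys.swap(scratch);
    }
}
```

A fair benchmark would also need to account for the cost of moving data to and from the GPU, which a CPU-only sort like this never pays.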

Use Cases for GPU Sorting

The conversation briefly acknowledges the use case for GPU sorting when data already resides on the GPU. "maeln" provides an example from rendering: "It is also useful if your data already lives on the GPU memory. For example, when you need to z-sort a bunch of particles in a 3d renderer particle system."
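
To show what z-sorting means in this setting, here is a minimal CPU-side sketch that orders particles back to front by depth. The `Particle` struct and `z_sort_back_to_front` are hypothetical, and the code assumes view-space coordinates in which a larger `z` means farther from the camera. In a real renderer the same ordering would be computed on the GPU, precisely so the particle buffer never has to leave device memory.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical particle with a view-space position.
// Assumption: larger z means farther from the camera.
struct Particle {
    float x;
    float y;
    float z;
};

// Order particles farthest-first so that transparent particles can be
// blended back to front.
void z_sort_back_to_front(std::vector<Particle>& particles) {
    std::sort(particles.begin(), particles.end(),
              [](const Particle& a, const Particle& b) { return a.z > b.z; });
}
```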