A summary of the key themes discussed in the Hacker News thread, with direct quotes from commenters.
## Huge Pages and mmap Optimizations

A significant portion of the discussion revolves around improving `mmap` performance, particularly concerning the use of huge pages. Users suggest that utilizing huge pages could significantly reduce page table overhead and improve performance for large file mappings.
- "Would huge pages help with the mmap case?" (Jap2-0)
- The original author, jared_hulbert, expresses uncertainty about implementing huge pages with `mmap` for file caching and notes that "the arm64 systems with 16K or 64K native pages would have fewer faults."
- inetknght provides flags for `mmap` to enable huge pages (`MAP_HUGETLB | MAP_HUGE_1GB`) and claims personal experience of "a significant speed-up" with an 800GB file. They also suggest consulting the kernel source code (see the sketch after this list).
- inetknght further elaborates on the benefit: "Tens- or hundreds- of gigabytes of 4K page table entries take a while for the OS to navigate."
- jared_hulbert questions whether these flags create huge page cache entries and reports that this approach "doesn't work with a file on my ext4 volume."
- inetknght clarifies that `MAP_HUGETLB` might not be the correct flag and that `MAP_HUGE_1GB` alone might be the way, also emphasizing the need to align the file size and mapping length to the huge page size.
- mastax states that `MAP_HUGETLB` cannot be used for memory-mapping files on disk, only with `MAP_ANONYMOUS`, `memfd`, or files on `hugetlbfs`.
- Sesse__ clarifies that this behavior is filesystem-dependent, working with `tmpfs` but not with "normal" filesystems like `ext4` or `xfs`.
- mananaysiempre points to patches for `ext4` that were never merged, indicating a historical inability to use huge pages with such filesystems.
- jandrewrogers advises that "there are restrictions on using the huge page option with mmap() that mean it won’t do what you might intuit it will in many cases. Getting reliable huge page mappings is a bit fussy on Linux."
- bawolff notes, "My understanding is its presicely meant for this circumstance. I don't think its a fair comparison without it."
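For concreteness, here is a minimal sketch (not from the thread) of what the suggested flags look like in practice. The file path, block sizes, and error handling are placeholders; as mastax and Sesse__ point out, this is only expected to succeed for anonymous memory, `memfd`, or files on `hugetlbfs`/`tmpfs`, and typically fails on `ext4` or `xfs`.

```c
/* Sketch: attempt a 1 GiB huge-page mapping of a file. On ext4/xfs the
 * mmap() call is expected to fail, matching jared_hulbert's report. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)   /* log2(1 GiB) encoded in the flags */
#endif

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/mnt/hugetlbfs/data"; /* hypothetical path */
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* The mapping length must be a multiple of the huge page size (1 GiB here). */
    size_t huge = 1UL << 30;
    size_t len = ((size_t)st.st_size + huge - 1) & ~(huge - 1);

    void *p = mmap(NULL, len, PROT_READ,
                   MAP_SHARED | MAP_HUGETLB | MAP_HUGE_1GB, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");   /* typical outcome on a regular ext4 file */
        return 1;
    }

    /* ... scan the mapping here ... */
    munmap(p, len);
    close(fd);
    return 0;
}
```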
## Visualisation and Data Presentation
The discussion touches on the presentation of benchmark data, with a focus on making trends clearer through appropriate scaling.
- nchmy suggests replacing charts with ones on a "log scale" to better visualize exponential growth and requests that "all the lines on the same chart, or at least with the same y axis scale" for better relative comparison.
- jared_hulbert acknowledges the difficulty in expressing exponential growth on log scales and notes that putting all lines on the same chart made the "y axis impossible to understand" due to differing units.
- nchmy reiterates that "Log axis solves this, and turns meaningless hockey sticks into generally a straightish line that you can actually parse." They also suggest scaling lines relative to their initial values.
## Direct I/O, io_uring, and SPDK Comparisons

A central theme is the comparison between different I/O mechanisms, particularly `io_uring`, `mmap`, and user-space solutions like SPDK, in terms of performance and overhead.
- The original post's title, "Memory is slow, Disk is fast," is debated, with many arguing it's misleading and that the performance difference is due to algorithmic choices rather than a fundamental property of memory vs. disk.
- "The real difference is that with io_uring and O_DIRECT you manage the cache yourself... and with mmap this is managed by the OS." (nextaccountic)
- josephg is "fascinated to see a comparison with SPDK" as it bypasses the kernel.
- jared_hulbert, a long-time SPDK user, believes `io_uring` has "really closed the gap" and expects similar performance, potentially with a slight edge to `io_uring` due to Intel's Data Direct I/O (DDIO) support.
- benlwalker, an SPDK co-creator, states that "SPDK will be able to fully saturate the PCIe bandwidth from a single CPU core here... it can use a lot less CPU." However, they clarify that with `O_DIRECT`, the difference might not be as drastic (a minimal `io_uring` + `O_DIRECT` sketch follows this list).
- WAhern suggests that the performance difference could be attributed to the parallelism in `io_uring` (using a kernel thread pool) versus the more sequential nature of `mmap`'s page fault handling: "Generic readahead... you effectively get at most one thread running in parallel to fill the page cache."
- "The io_uring case may even result in additional data copies, but with less cost than the VM fiddling..." (WAhern)
- kentonv clarifies the article’s benchmark setup: "the io_uring program uses 6 threads to pull from disk... Whereas the program using mmap() uses a single thread for everything." This single thread also incurs overhead from page faults.
- kat529770 asks about benchmarking `io_uring` with "real-world workloads (e.g., DB-backed APIs or log ingestion)."
- yencabulator calls the title "Bullshit clickbait" and suggests it's more accurately "naive algorithm is slower than prefetching."
## Processor Features and Vectorization
The discussion also touches on CPU features, such as direct I/O and vector instructions, and their impact on performance.
- john-h-k inquires about CPUs' ability to "read direct to L3, skipping memory."
- jared_hulbert links to Intel's Data Direct I/O Technology (DDIO) and mentions AMD has similar capabilities, explaining that NVMe drives use DMA requests that can be directed to cache.
- jared_hulbert notes that "gcc and clang don't like to optimize" loop unrolling and vectorization for dynamically sized inputs, and that manual unrolling sometimes helps the compilers generate better code.
- mischief6 suggests testing with `ispc` (the Intel SPMD Program Compiler).
- titanomachy questions whether manual loop unrolling is necessary for vectorized code in LLVM and asks about `MAP_POPULATE`'s potential to improve the naive in-memory solution.
- jared_hulbert confirms that loop unrolling can indeed be necessary for compiler optimization and reports that `MAP_POPULATE` actually slowed the overall test down by 2.5 seconds, despite speeding up the counting loop.
- johnisgood suggests deeper optimization with AVX-512 and prefetching, noting that the blog post used AVX2 (an illustrative AVX2 counting loop is sketched after this list). They also propose that DMA to CPU caches via SPDK could in theory be faster than `io_uring` + `O_DIRECT`.
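To illustrate the kind of explicitly vectorized counting loop under discussion, here is a minimal AVX2 sketch (an assumption for illustration, not the blog post's code). It counts occurrences of one byte value 32 bytes at a time, with a scalar tail for the remainder; compile with `-mavx2`.

```c
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Count occurrences of `needle` in `buf` using 32-byte AVX2 compares. */
static size_t count_byte_avx2(const uint8_t *buf, size_t len, uint8_t needle)
{
    const __m256i pattern = _mm256_set1_epi8((char)needle);
    size_t count = 0, i = 0;

    /* Main loop: compare 32 bytes at a time and popcount the match bitmask. */
    for (; i + 32 <= len; i += 32) {
        __m256i chunk = _mm256_loadu_si256((const __m256i *)(buf + i));
        __m256i eq    = _mm256_cmpeq_epi8(chunk, pattern);
        count += (size_t)__builtin_popcount((unsigned)_mm256_movemask_epi8(eq));
    }

    /* Scalar tail for the last few bytes. */
    for (; i < len; i++)
        count += (buf[i] == needle);

    return count;
}

int main(void)
{
    const char *s = "one\ntwo\nthree\n";
    printf("%zu newlines\n", count_byte_avx2((const uint8_t *)s, strlen(s), '\n'));
    return 0;
}
```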
## Title and Article Quality Debate
A notable thread concerns the title of the original article and the perceived quality of its content.
- MaxikCZ criticizes the "clickbait" title, stating, "we want descriptive tittles to know what the article is about before we read it."
- dang, a HN moderator, agrees and changes the title, soliciting suggestions for a more accurate one.
- hsn915 initially suggests "io_uring is faster than mmap" but acknowledges the author's intent to "bait people into clicking."
- pixelpoet criticizes jared_hulbert's use of "um actually haters" and defends criticism of "clickbaity" titles and "suspect" results.
- menaerus declares the information in the article "wrong on so many levels that it isn't worth correcting or commenting," attributing it to AI without fact-checking. They cite a specific example about early x86 processors.
- yencabulator reiterates the "Bullshit clickbait title" assessment, calling the article "naive algorithm is slower than prefetching" and asserting the author "spent a lot of time benchmarking a thing completely unrelated to the premise of the article."
## PCIe vs. Memory Bandwidth
The discussion briefly explores the relative bandwidths of PCIe and memory, with some users expressing surprise at the potential for PCIe to exceed memory bandwidth.
- juancn calculates that while PCIe 5.0 x16 offers 64 GB/s, modern server CPUs with multiple memory channels can achieve over 100 GB/s, and HBM or DDR5 configurations can reach much higher. They conclude that "solid state IO is getting really close, but it's not so fast on non-sequential access patterns."
- jared_hulbert claims that "if you add up the DDR5 channel bandwidth and the PCIe lanes most systems the PCIe bandwidth is higher."
- modeless expresses surprise, asking, "Wait, PCIe bandwidth is higher than memory bandwidth now? That's bonkers, when did that happen?"
- adgjlsfhk1 notes that on server chips, "5th gen Epyc has 128 lanes of PCIEx5 for over 1TB/s of pcie bandwith" compared to ~600GB/s RAM bandwidth.
- andersa and pclmulqdq debate the exact calculations for PCIe bandwidth, considering factors like duplexing and protocol overhead (a back-of-envelope sketch follows this list).
- wmf clarifies that "PCIe is full duplex while DDR5 is half duplex so in theory PCIe is higher. It's rare to max out PCIe in both directions though."
- immibis mentions Threadripper's multiple memory channels and PCIe lanes, concluding, "You know what adds up to an even bigger number though? Using both."
- kentonv corrects the article's premise: "Obviously, no matter how you read from disk, it has to go through RAM. Disk bandwidth cannot exceed memory bandwidth." They then elaborate that the real technical comparison in the article is `mmap` vs. `io_uring` with different threading models.
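For reference, a back-of-envelope calculation under stated assumptions (PCIe 5.0 at 32 GT/s per lane with 128b/130b encoding, a hypothetical 12-channel DDR5-6000 configuration, protocol and refresh overhead ignored) roughly reproduces the figures being quoted.

```c
/* Back-of-envelope bandwidth check; assumptions noted above, not thread data. */
#include <stdio.h>

int main(void)
{
    double pcie5_lane = 32e9 * 128.0 / 130.0 / 8.0;  /* ~3.94 GB/s per lane, per direction */
    double pcie5_x16  = pcie5_lane * 16;             /* ~63 GB/s: juancn's "64 GB/s" figure */
    double pcie5_x128 = pcie5_lane * 128;            /* ~504 GB/s per direction on 128 lanes */

    double ddr5_chan  = 6000e6 * 8;                  /* DDR5-6000: ~48 GB/s per channel */
    double ddr5_12ch  = ddr5_chan * 12;              /* ~576 GB/s, roughly the "~600GB/s" cited */

    printf("PCIe 5.0 x16 : %6.1f GB/s per direction\n", pcie5_x16 / 1e9);
    printf("PCIe 5.0 x128: %6.1f GB/s per direction (~%.1f TB/s full duplex)\n",
           pcie5_x128 / 1e9, 2 * pcie5_x128 / 1e12);
    printf("DDR5-6000 x12: %6.1f GB/s\n", ddr5_12ch / 1e9);
    return 0;
}
```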
## Security of io_uring

A small segment of the discussion touches on the security implications and adoption status of `io_uring`.
- worldsavior asks about the "current status of io_uring from a security standpoint," noting its disabled status on Android and some Linux distros.
- p_l explains that `io_uring`'s lack of integration with Linux Security Modules (LSMs) like SELinux is a reason for its restricted use in certain environments.
- yencabulator describes `io_uring` as a complex shared-memory protocol and advises caution, especially for targets of state actors. They suggest that for top performance, organizations often bypass kernel stacks entirely, and that security concerns are often mitigated by running code in isolated environments like VMs. They also advocate moving away from C/C++ and applying formal methods to shared-memory protocols.