Essential insights from Hacker News discussions

Hitting Peak File IO Performance with Zig

This Hacker News discussion primarily revolves around optimizing direct I/O operations for NVMe drives, with particular attention to block size, alignment, and the role of the Zig programming language.

Zig Language and API Stability

A significant point of discussion is the stability of the Zig programming language, specifically its I/O API. Users are keenly aware that breaking changes to Zig's I/O layer can invalidate examples and benchmarks, and they stress that posts should state the compiler version they target.

  • "Zig is currently undergoing lots of breaking changes in the IO API and implementation. Any post about IO in zig should also mention the zig version used." (laserbeam)
  • "I see itโ€™s 0.15.1 in the zon file, but that should also be part of the post somewhere." (laserbeam)

This highlights a general concern about the maturity and evolving nature of Zig's I/O capabilities, making it crucial for developers to track which version of the language is being used in examples and benchmarks.

NVMe Block Size and Alignment

A core theme is choosing and aligning block sizes for NVMe drives. The discussion digs into the different "hardware block sizes" in play and their impact on performance. One user points out a hard-coded alignment constant and suggests re-formatting drives to a larger logical block size; a sketch of querying the block size at runtime follows the quote.

  • "I see you use a hard-coded constant ALIGN = 512. Many NVMe drives actually allow you to raise the logical block size to 4096 by re-formatting (nvme-format(1)) the drive." (database64128)

Another user clarifies that for direct I/O it is the hardware block size that matters, a property of the drive that cannot be changed, while acknowledging that logical block sizes can differ due to buffering or RAID configurations; in general, they argue, you want the block size to be as small as possible. A sketch of asking the kernel for the alignment direct I/O actually requires follows the quotes.

  • "Itโ€™s really the hardware block size that matters in this case (direct I/O). That value is a property of the hardware and canโ€™t be changed." (HippoBaro)
  • "In general, we want it to be as small as possible!" (HippoBaro)

This is then countered with a more detailed breakdown of the several "hardware block sizes" an NVMe drive actually has: the LBA size, the NAND flash page size, the erase block size, and the granularity of the SSD controller's Flash Translation Layer. The ability to reconfigure the LBA size is noted, as is the way vendor-specific designs shift the preferred granularities; the kernel exposes several of these sizes and hints, as sketched after the quotes.

  • "NVMe drives have at least three "hardware block sizes". There's the LBA size that determines what size IO transfers the OS must exchange with the drive, and that can be re-configured on some drives, usually 512B and 4kB are the options. There's the underlying page size of the NAND flash, which is more or less the granularity of individual read and write operations, and is usually something like 16kB or more. There's the underlying erase block size of the NAND flash that comes into play when overwriting data or doing wear leveling, and is usually several MB. There's the granularity of the SSD controller's Flash Translation Layer, which determines the smallest size write the SSD can handle without doing a read-modify-write cycle, usually 4kB regardless of the LBA format selected, but on some special-purpose drives can be 32kB or more." (wtallis)
  • "And then there's an assortment of hints the drive can provide to the OS about preferred granularity and alignment for best performance, or requirements for atomic operations." (wtallis)
  • "I've run into (specialized) flash hardware with 512 kB for that 3rd size." (loeg)

The debate over "as small as possible" continues, with one user arguing that an intermediate size, balancing per-request overhead against device granularity, is usually the sweet spot, especially for larger files. They explain that larger blocks mean fewer syscalls and requests, and that the OS can split a request that exceeds the device's native block size. The hardest case to optimize, they note, is issuing many parallel requests with asynchronous file I/O, where matching the device's native block size matters because the workload is IOPS-bound; a queue-depth sketch of that pattern follows the quotes.

  • "Why would you want the block size to be as small as possible? You will only benefit from that for very small files, hence the sweet spot is somewhere between "as small as possible" and "small multiple of the hardware block size"." (imtringued)
  • "If your native device block size is 4KiB, and you fetch 512 byte blocks, you need storage side RAM to hold smaller blocks and you have to address each block independently. Meanwhile if you are bigger than the device block size you end up with fewer requests and syscalls." (imtringued)
  • "The most difficult to optimize case is the one where you issue many parallel requests to the storage device using asynchronous file IO for latency hiding. In that case, knowing the device's exact block size is important, because you are IOPs bottlenecked and a block size that is closer to what the device supports natively will mean fewer IOPs per request." (imtringued)

Performance Benchmarking and Optimization Techniques

The discussion also touches on the performance implications of the chosen block sizes and I/O strategies, particularly in relation to benchmarks and modern I/O APIs like io_uring. One user works the numbers and argues that the reported throughput at a 512 KB block size corresponds to such a low IO/s rate that a single in-flight prefetch should suffice to overlap the memory copy and saturate the hardware.

  • "7 GB/s at 512 KB block size is only ~14,000 IO/s which is a whopping ~70 us/IO. That is a trivial rate for even synchronous IO. You should only need one inflight operation (prefetch 1) to overlap your memory copy (to avoid serializing the IO with the memory copy) to get the full IO bandwidth." (Veserv)
  • "Their referenced previous post [1] demonstrates ~240,000 IO/s when using basic settings. Even that seems pretty low, but is still more than enough to completely trivialize this benchmark and saturate the hardware IO with zero tuning." (Veserv)

A comparison with FreeBSD's AIO is also raised as an alternative implementation worth exploring; a minimal POSIX AIO sketch follows the quote.

  • "Interesting how an implementation using FreeBSDs AIO would compare." (nesarkvechnep)

Finally, io_uring's registered file descriptors are recommended as a significant optimization, though the commenter could not find where (or whether) they were set up in the example code; a liburing sketch of file registration follows the quote.

  • "I'm not very familiar with zig and was kind a struggling to follow the code and maybe that's why I couldn't find where the setting was being set up, but in case it's not, be sure to also use registered file descriptors with io_uring as they make a fairly big difference." (marginalia_nu)

Memory Allocation Strategies

A brief but relevant point is raised regarding memory allocation for I/O buffers: use a page allocator to get aligned memory instead of over-allocating; a short sketch of the C equivalent follows the quote.

  • "why not use page allocator to get aligned memory instead of overallocating?" (throwawaymaths)