Essential insights from Hacker News discussions

Speeding up PyTorch inference on Apple devices with AI-generated Metal kernels

Here's a summary of the themes from the Hacker News discussion:

AI-Generated Kernels for Performance Gains

A primary theme is the potential of using AI, specifically large language models (LLMs) such as GPT-5, to automatically generate optimized compute kernels for machine learning models, with the goal of improving inference speed over unoptimized, off-the-shelf PyTorch implementations.

  • "This is pretty cool." - pbronez
  • "This is amazing. I wouldn't have thought that AI is this good in niche topics. Very impressive experiment and write up." - earthnail
  • "The aim here is to show that you can start from prototype code and automatically produce lower-level kernels (in this case Metal) that are more usable in real deployments, without additional work from the developer." - nserrino
  • "But that’s the thing, I wouldn’t write a custom kernel before AI. I don't do that level of development or operate at that part of the stack but I’m very experienced in software development. AI significantly augments my skillsets in this area" - yieldcrv
  • "The hope is that we can automate some of that process [hand-tuned implementations]." - nserrino

Skepticism and Verification Challenges

Despite the impressive claims, there is significant skepticism about the correctness of the generated kernels and the magnitude of the reported speedups. Several users highlight how hard the results are to verify and how easily subtle errors can produce drastically wrong model behavior or inflated performance numbers.

  • "unless we get a zip file of all the kernels with how they're benchmarked results like this are almost impossible to verify" - formalsystem
  • "In practice, with slight differences the model will feel almost lobotomized." - arjvik
  • "The numerics are off as well and I suspect the atols were way too big." - formalsystem
  • "18x (and even some 100x speedups claimed near the end) are just a smell that some kernel is incorrect, the typical way you can get speedups like this is you don't warmup or you forget to synchronize." - formalsystem
  • "For a numerical kernel, this seems way too loose, but turns out those bounds come straight from KernelBench, which only tested for correctness on 5 random inputs by default in their harness, not the 100 they used here." - magicalist (referencing the paper's correctness testing)
  • "So please for everyone's sanity if you find a kernel that's 10-100x faster please share the exact code and benchmarking methodology to your smartest performance friends, you should be extremely skeptical of such results often you can discard some numbers based on a simple speed of light analysis." - formalsystem

Baselines and Benchmarking Methodology

A core point of contention is the choice of baseline and the rigor of the benchmarking methodology. Some argue that comparing against unoptimized PyTorch inference does not reflect real-world deployment, where ONNX export or torch.compile is typically used. The correctness tolerances and the use of small input shapes are also questioned.

  • "They are comparing unoptimized PyTorch inference, something you would never deploy on a device, to a model with custom kernels." - turbo_wombat
  • "This is the equivalent of comparing interpreted code to compiled code." - turbo_wombat
  • "While the original KernelBench focused on CUDA, the addition of Metal kernels has been a significant enhancement, and new LLM-based approaches like GPT-5 are showing promising results in generating these kernels." - nserrino (In response to formalsystem's criticism of not comparing to torch.compile)
  • "We didn't compare to torch.compile because as of PyTorch 2.7, torch.compile doesn't support the MPS backend, and we ran into some issues on many of the problems when using it." - nserrino
  • "In that regime you're more often measuring noise as the primary bottleneck is not compute or memory but overhead" - formalsystem (referencing small input shapes)
  • "KernelBench levels measure vastly different things and if you want to compare to PyTorch operators you want to focus on Level 1, Level 2 is fusions and so the right baseline is torch.compile and more reliable on nightlies." - formalsystem

Clarification of "Kernel"

A recurring clarification is what "kernel" means in this context: not an operating-system kernel, and not the ML/NN sense of the word (such as a convolution kernel), but a compute kernel, i.e., a low-level program designed to run in parallel on a GPU or other accelerator.

  • "The article is referring to GPU compute kernel (https://en.wikipedia.org/wiki/Compute_kernel), not the term kernel used in ML/NN/etc." - ymsodev
  • "This is gonna be a silly question but what does “kernel” mean in this context. I thought it meant like a Linux kernel module but doesn’t seem to be?" - syntaxing
  • "A kernel is low level function that is going to run in parallel on your accelerator (hopefully efficiently). You will have various matmuls, convolutions, etc." - tuckerman

Alternative Future Directions (Mojo, JAX, Julia)

Some users suggest alternative or complementary directions for high-performance computing in the ML space. Mojo is raised as a potential long-term bet, though others express strong reservations about its closed-source components and licensing; JAX and Julia are also named as preferred alternatives.

  • "Still, I can't help but think we should bet on sth like Mojo instead for the long run." - earthnail
  • "Mojo is a terrible language and its main feature (GPU acceleration through Mojo max) is closed source and requires a commercial license to be purchased." - ipsum2
  • "JAX or Julia hopefully." - moelf

Practicalities of Deployment Pipelines

The discussion touches on the typical ML deployment pipeline, which often involves exporting models to a format like ONNX and then compiling them for the target hardware, in contrast to running PyTorch inference directly.

  • "When deployed, you should export to ONNX, and then compile the ONNX to the native format of the device." - turbo_wombat
  • "I'm assuming you are talking about https://github.com/onnx/onnx-mlir? In your experience, how much faster is a 'compiled' onnx model vs. using an onnx runtime?" - airforce1
  • "Back in the day TensorFlow had tfdeploy which compiled TensorFlow terms into NumPy matrix operations. Our synthetic tests saw speedups of factor 50." - dapperdrake

Comments on the "Swarm" Approach and Testing Rigor

The methodology, particularly the "swarm" approach (sending the task to multiple models and picking the best result) and the correctness-testing parameters, is also scrutinized for its robustness and potential for misinterpretation.

  • "The swarm part seemed a bit of a stretch. They fired off requests to 8 different models to do the translation, and the 'supervisor' benchmarked the returned kernels and picked the fastest one. Technically a swarm, I guess, but feels like we're devaluing the term :)" - magicalist
  • "The correctness testing used made my eye twitch a bit: [...] For a numerical kernel, this seems way too loose, but turns out those bounds come straight from KernelBench, which only tested for correctness on 5 random inputs by default in their harness, not the 100 they used here." - magicalist
  • "The Mamba 2 example (which I didn't run) also acknowledges that the primary thing it does is fusions which assuming everything is correct would still be strange to baseline vs eager" - formalsystem (Critiquing specific examples)