This Hacker News discussion revolves around the challenges and opportunities in competing with NVIDIA's dominance in the AI acceleration market, particularly concerning hardware and software ecosystems.
The NVIDIA Software Moat and Its Strengths
A central theme is the acknowledgement of NVIDIA's strong software and infrastructure as a significant barrier to entry for competitors. While some debate the "well-designed" aspect, there's consensus that it's "established" and provides a level of functionality that is difficult to replicate.
- "Nvidia's software and infrastructure is so well designed and established that no competitor can threaten them even if they give away the hardware for free." - WithinReason
- "I don't know about well designed but it's definitely established." - saagarjha
- "For the most part 'it just works', the models are generic enough that you can actually get pretty close to the TDP on your own workloads with custom software and yet specific enough that you'll find stuff that makes your work easier most of the time." - jacquesm
- "Despite not having used it much, my impression was that Nvidia's "moat" was that they have good networking libraries, that they are pretty good (relatively) at making sure all their tools work, and they have had consistent investment on this for a decade." - saagarjha
- "Nvidia stuck to one stack and wrote all their high level libraries on it, while their competitors switched from old APIs to new ones and never made anything close to CUDA." - nromiun
Criticisms and Potential Improvements for CUDA
Despite the dominant position, some users point out specific areas where NVIDIA's CUDA ecosystem could be improved, often highlighting a desire for more helpful compiler diagnostics and better integrated tooling for performance analysis.
- "Memory indexing. It's a pain to avoid banking conflicts, and implement cooperative loading on transposed matrices." - programjames
- "Why is there no warning when shared memory is unspecified?" - programjames
- "Timing - doesn't exist. Pretty much the gold standard is to run your kernel 10_000 times in a loop and subtract the time from before and after the loop." - programjames
- "I actually really hate CUDA's programming model and feel like it's too low-level to actually get any productive work done." - saagarjha
- "Nsight (Systems), it is… ok, I guess? It's fine for games and stuff I guess but for HPC or AI it doesn't really surface the information that you would want." - saagarjha
- "Nsight Compute is the thing that tells you that but it's kind of a mediocre profiler... and to use it effectively you basically have to read a bunch of blog posts by people instead of official documentation." - saagarjha
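The "run it 10,000 times in a loop" timing idiom programjames describes can be sketched generically. The helper below is illustrative only (the function name and the stand-in workload are ours, not a CUDA API); on a real GPU you would also have to synchronize the device before reading the clock, or the loop measures only launch overhead:

```python
import time

def time_kernel(kernel, n_iters=10_000):
    """Average wall-clock seconds per call, via the loop-and-subtract idiom."""
    kernel()  # warm-up call so one-time setup cost isn't measured
    start = time.perf_counter()
    for _ in range(n_iters):
        kernel()
    return (time.perf_counter() - start) / n_iters

# Stand-in "kernel": any callable works for illustrating the pattern.
avg = time_kernel(lambda: sum(range(100)), n_iters=1_000)
```

That this crude pattern is still the de facto standard, despite CUDA shipping event-based timers, is exactly the ergonomics complaint being made.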
The Difficulty of GPU Programming and Optimization
There's a strong undercurrent that achieving peak performance on GPUs, even with CUDA, is far from trivial. Writing efficient kernels requires a deep understanding of the hardware architecture, and simple C++ code does not translate to optimal GPU performance.
- "Have you tried to write a kernel for basic matrix multiplication? Because I have and I can assure you it is very hard to get 50% of maximum FLOPs, let alone 90%. It is nothing like CPUs where you write a * b in C and get 99% of the performance by the compiler." - nromiun
- "CUDA gives you all that optimization on a plate." - nromiun
- "Well, CUDA gives you a whole programming language where you have to figure out the optimization for your particular card's cache size and bus width." - wooooo
- "A barrel processor is extremely efficient once you learn to write code for them... However, most people never learn how to write proper code for barrel processors." - jandrewrogers (Note: This user's definition of barrel processor is debated later in the thread.)
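One way to see why a naive matmul kernel lands so far below peak FLOPs is arithmetic intensity: without tiling, every multiply-add fetches its own operands from memory. The back-of-the-envelope helper below is our own illustration (the name and the simplified tile model are assumptions, not from the thread):

```python
def matmul_intensity(tile: int, dtype_bytes: int = 4) -> float:
    """FLOPs per byte moved for one block of a shared-memory-tiled matmul.

    For each pair of tile x tile blocks of A and B staged into shared
    memory, the kernel performs 2 * tile^3 flops (one multiply plus one
    add per element-pair) while reading 2 * tile^2 * dtype_bytes bytes.
    """
    flops = 2 * tile ** 3
    bytes_moved = 2 * tile * tile * dtype_bytes
    return flops / bytes_moved  # simplifies to tile / dtype_bytes

# Untiled (tile=1, fp32): 0.25 flops/byte -- hopelessly memory-bound.
# 32x32 shared-memory tiles: 8.0 flops/byte, a 32x improvement in reuse.
```

The compiler will not discover this blocking structure for you, which is why "write a * b in C" works on CPUs but hand-tiled kernels are the norm on GPUs.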
The Role of Government and Niche Accelerators
A significant portion of the discussion focuses on alternative hardware accelerators, particularly those originating from government-funded projects or niche companies. The common thread here is their potential to offer competitive performance, but often under conditions that limit their widespread adoption or market impact.
- "It's unfortunate that they don't sell them on open markets. There are few of these accelerators that could threaten NVIDIA monopoly if prices(and manufacturing costs!) were right." - numpad0
- "Governments are terrible at picking winners." - eru
- "But what governments often can do, is break local optimums clustering around the quarter economy and take moonshot chances and find paths otherwise never taken." - actionfromafar
- "If the hardware isn't available at all, we'll never find out if the software moat could be overcome." - rwmj
- "The NEC VectorEngine - they had 5 TFLOPS FP32 with 48GB of HBM2 totaling 1.5TB/s bandwidth at $10k in 2020. That was within a digit or two against NVIDIA at basically the same price. But they then didn't capitalize on it, just kept delivering to national institutes in ritualistic manners." - numpad0
- "They do sell these on the open market. You just have to be in the market for an entire cluster. The minimum order quantity for Pezy is several racks." - pclmulqdq
- "These Pezy chips are also made for large clusters. There is a whole system design around the chips that wasn't presented here." - pclmulqdq
- "Japan has a pretty long history of marching to their own drummer in computing. They either created their own architectures or adopted others after pretty much everyone had moved on." - ghaff
The AI vs. HPC Specialization Trend
A recurring point is the shift in GPU design and power allocation towards AI-specific workloads (lower precision, tensor operations) at the expense of traditional High-Performance Computing (HPC) tasks that rely on FP64 (double-precision floating-point) operations.
- "The power allocation of most GPGPUs are heavily tilted for Tensor usages. This has been the trend well before B300." - rfoo
- "The FP64 GFLOPS per watt metric in the post is almost entirely meaningless to compare between these accelerators and NVIDIA GPUs. For example, it says Hopper H200 is 47.9 gigaflops per watt at FP64 (33.5 teraflops divided by 700 watts). But then if you consider H100 PCIe instead, it's going to be 26000/350 = 74.29 GFLOPS per watt." - rfoo
- "The Blackwell B300 has FP64 severely deprecated at 1.25 teraflops and burns 1,400 watts, which is 0.89 gigaflops per watt. (The B300 is really aimed at low precision AI inference.)" - Aissen
- "do cards with intentionally handicapped FP64 actually use anywhere near their TDP when doing FP64?... my point is more that if FP64 performance is poor _on purpose_, then you're probably not using anywhere near the card's TDP to do FP64 calculations, so FLOPS/W(TDP) is misleading." - kbolino
- "In general: consumer cards with very bad FP64 performance have it fused off for product segmentation reasons, datacenter GPUs with bad FP64 performance have it removed from the chip layout to specialize for low precision." - wtallis
- "Recently, the GB300 removed almost all of them, to the point that a GB300 actually has less FP64 TFLOPS than a 9 year old P100. FP32 is the highest precision used during training so it makes sense." - niklassheth
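All of the efficiency figures quoted above come from the same simple ratio. Reproducing rfoo's and Aissen's arithmetic (the helper name is ours):

```python
def fp64_gflops_per_watt(tflops: float, watts: float) -> float:
    """Peak FP64 TFLOPS and board power -> GFLOPS per watt."""
    return tflops * 1000 / watts

print(round(fp64_gflops_per_watt(33.5, 700), 1))   # H200 SXM:  47.9
print(round(fp64_gflops_per_watt(26.0, 350), 2))   # H100 PCIe: 74.29
print(round(fp64_gflops_per_watt(1.25, 1400), 2))  # B300:      0.89
```

As kbolino notes, the ratio is only meaningful if FP64 work actually draws something close to the board's TDP; on parts where FP64 throughput is deliberately limited, dividing by TDP understates real efficiency.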
The "Shelving" of Competitive Efforts
There's speculation that NVIDIA actively works to stifle potential competitors by offering them hardware deals contingent on discontinuing their own development efforts, a concern exemplified by the Tesla Dojo project.
- "I suspect that whenever you look like you're making good progress on this front, nvidia gives you a lot of chips for free on condition you shelve the effort though!" - londons_explore
- "The latest example being Tesla, who were designing their own hardware and software stack for NN training, then suspiciously got huge numbers of H100's ahead of other clients and cancelled the dojo effort." - londons_explore
- "My suspicion is that Musk told them to just buy Nvidia instead of waiting around for years of slow iteration to get something competitive." - AlotOfReading (counterpoint regarding Tesla)
Exploring Alternative Floating-Point Formats
A brief but notable theme is the discussion around alternative floating-point formats, such as posits, as a potential area for future hardware innovation, though adoption challenges are acknowledged.
- "I wonder how much progress (if any) is being done on floating point formats other than IEEE floats; on serious adoption in hardware in particular. Stuff like posits [1] for instance look very promising." - andrepd
- "the problem with posits is that they aren't enough better to be worth a switch. switching the industry over would cost billions in software rewrites and there are benefits, but they are fairly marginal." - adgjlsfhk1