Essential insights from Hacker News discussions

Parallelizing SHA256 Calculation on FPGA

The Hacker News discussion revolves around the practical implementation and performance of SHA-256 hashing on FPGAs, with a strong emphasis on comparing these implementations to GPUs and ASICs, particularly in the context of Bitcoin mining.

FPGA Performance and Implementation Compared to GPUs

A central theme is the slow performance of the FPGA implementation in the linked article: users question its efficiency, compare it unfavorably to modern GPUs, and trace the problem to architectural choices and clock speed.

  • One user, Retr0id, attempts to calculate the hashrate: "So what's the overall hashrate with this approach? I'll try to calculate it from the information given. 12 parallel instances at a clock speed of 62.5MHz, with 68 clock cycles per hash. 62.5MHz * 12 / 68 = ~11MH/s. That seems... slow? Did I do the math right?" (The arithmetic does check out; see the sketch after this list.)
  • This sentiment is amplified by a comparison to a modern GPU: "For reference, an RTX 4090 can do 21975.5 MH/s according to hashcat benchmarks."
  • The author of the linked post, identified as "picture," acknowledges the slowness and attributes it to improper FPGA usage: "Quite slow. It's largely due to the author using FPGAs wrong. Clocking down a 7-series Artix to 62.5 MHz means the design is not pipelined correctly/enough."
  • "picture" further elaborates on how FPGAs could achieve much higher speeds: "My friend got 1 SHA256 hash per cycle at 300 MHz on 7 series, but slightly fewer of the design fit on a chip. Thruput would easily be in the GH/s range."

The FPGA vs. ASIC vs. GPU Landscape

The discussion frequently contrasts the capabilities of FPGAs with ASICs and GPUs, highlighting their respective strengths, weaknesses, and historical development, especially concerning specialized tasks like Bitcoin mining.

  • The significant architectural and technological differences between FPGAs and GPUs are noted, explaining performance disparities: "picture" states, "Keep in mind RTX4090 is 5 nm process node and has a lot more transistors and memory than XC7A100T, which is 28 nm. That's a huge difference in terms of dynamic performance. Also, the two are also released 10 years apart. If you compare RTX4090 against a similarly modern UltraScale part from Xilinx, I believe the FPGA can be notably faster than RTX4090."
  • The dominance of ASICs in Bitcoin mining is a recurring point, with users agreeing that hard silicon is ultimately far more performant for dedicated tasks: "benlivengood" suggests, "I'm assuming this space has already been heavily optimized by the Bitcoin miners on their way to ASICs."
  • "picture" confirms this, adding nuance about value and process nodes: "Yes, hard silicon will be another magnitude more performant than FPGAs and GPUs, but ASICs properly take on negative value when they're no longer profitable to mine with. (Note that efficiency won't be much better at the same process node. You can just pump more power through each ASIC die)"
  • The closed-source nature of ASIC development for competitive advantage is also mentioned: "Retr0id" posits, "Unfortunately I think most of that innovation happened behind closed doors, because everyone wanted to maintain their competitive advantages." and "sMarsIntruder" agrees, "Yes, ASICS are definitely very closed source for that specific reason."

Technical Details of SHA-256 Implementation on FPGAs vs. ASICs

There's a deep dive into the specific technical characteristics of SHA-256 that influence its implementation on different hardware, particularly focusing on how ASICs and FPGAs approach the algorithm's structure.

  • The fundamental difference in how ASICs and FPGAs handle logic and routing is a key point: "15155" explains, "Yes, but a designed-for-FPGA SHA256 implementation looks very different than an ASIC SHA256 implementation - the ASIC has far greater routing flexibility and density, and can therefore use far more combinatorial logic between register stages. (ASIC simulation on an FPGA will retain the combinatorial stages but run at dramatically lower fMax)"
  • The flip-flop (FF) budget of an optimized implementation is quantified (the round structure driving this figure is sketched after this list): "15155" states, "SHA256 is extremely FF-heavy, you need around 200k for an optimized, unrolled, pipelined implementation."
  • The potential performance of more appropriate FPGA families is highlighted: "15155" contrasts the OP's chip with superior alternatives: "UltraScale+ chips will run a proper design at 600MHz-800MHz, big chips might be able to fit 24 cores. The Artix chip OP used is extremely slow and too small to fit this style of implementation."
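
To make the register-stage picture concrete, below is a behavioral Python model of a single SHA-256 round per FIPS 180-4 (a sketch, not code from the thread or the article). In a fully unrolled, pipelined design this logic is instantiated 64 times with a register stage after each instance, and every stage must also carry the working state and message-schedule words forward, which is what drives the flip-flop count so high:

```python
MASK = 0xFFFFFFFF  # model 32-bit registers

def rotr(x: int, n: int) -> int:
    """32-bit rotate right."""
    return ((x >> n) | (x << (32 - n))) & MASK

def sha256_round(state, w_t: int, k_t: int):
    """One SHA-256 round (FIPS 180-4). In an unrolled FPGA pipeline, this
    body is the combinatorial logic between two register stages."""
    a, b, c, d, e, f, g, h = state
    s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
    ch = (e & f) ^ (~e & g)
    t1 = (h + s1 + ch + k_t + w_t) & MASK
    s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
    maj = (a & b) ^ (a & c) ^ (b & c)
    t2 = (s0 + maj) & MASK
    # The returned tuple is what each pipeline stage must register: 8 x 32
    # bits of working state, plus (not modeled here) the message-schedule
    # words the stage carries forward.
    return ((t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g)

# Data-flow demo with the standard initial state and dummy W/K words
# (a real core uses the round constants and the computed message schedule):
state = (0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
         0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19)
for t in range(64):  # 64 rounds -> 64 pipeline stages when fully unrolled
    state = sha256_round(state, w_t=0, k_t=0)
print([f"{x:08x}" for x in state])
```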

The Role of FPGAs in Cryptography and General Computing

The discussion touches on the broader utility of FPGAs in cryptographic acceleration, particularly in relation to general-purpose CPUs and common software libraries like OpenSSL.

  • A user suggests integrating FPGAs with existing software for crypto acceleration: "d00mB0t" proposes, "More posts like this please! How about a crypto accelerator on FPGA that's integrated with OpenSSL?"
  • However, the practical advantage of FPGAs for common OpenSSL tasks is questioned, since a modern CPU typically wins once offload overhead is accounted for: "15155" counters, "Unless you're talking about niche algorithms (and even then), the FPGA will get smoked by a CPU for most common tasks one would use OpenSSL for."
  • The overhead of moving data between the CPU and the FPGA is identified as the bottleneck for one-off operations, as the timing sketch after this list illustrates: "15155" further explains, "Even without the extensions, by the time you've moved the workload to the FPGA and back, the CPU has already completed whatever operation your FPGA was going to complete with OpenSSL. FPGA cryptographic acceleration is about batch task bandwidth, OpenSSL has few places where this is required."
  • Despite these limitations, the educational value of such projects is acknowledged: "d00mB0t" clarifies their intent, "Yes--obviously modern CPUs have crypto extensions that would be faster than an FPGA, this would be for educational purposes."
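
The latency argument is easy to see with a rough CPU-side measurement (a sketch; the microsecond-scale PCIe round-trip figure is an assumed ballpark, and real numbers vary by system):

```python
import hashlib
import time

data = b"x" * 64   # a small message, typical of single-shot OpenSSL calls
N = 1_000_000

start = time.perf_counter()
for _ in range(N):
    hashlib.sha256(data).digest()
elapsed = time.perf_counter() - start
print(f"{elapsed / N * 1e9:.0f} ns per SHA-256 on the CPU")

# A PCIe round trip to an accelerator card is commonly on the order of a
# microsecond (assumed figure), i.e. longer than the hash itself, so
# shipping one hash at a time to an FPGA loses outright; only large
# batches can amortize the transfer cost.
```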

Alternative Approaches and Historical Context

Briefly, some users point towards alternative or related concepts and resources.

  • The idea of pre-computed hashes is mentioned as a potential optimization, albeit an impractical one at scale: "m3kw9" jokes, "Or try hardcoding a few billion trillions of premade hashes."
  • This leads to a reference to rainbow tables, which make such precomputation workable by storing hash chains rather than every hash/input pair: "nayuki" asks, "https://en.wikipedia.org/wiki/Rainbow_table ?" (A toy version of the naive lookup-table idea is sketched after this list.)
  • A user also recommends an external resource for alternative design approaches: "qdotme" shares, "For alternative design/writeup, check out http://nsa.unaligned.org".
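
For illustration, here is a toy version of the naive precomputation m3kw9 jokes about (an assumed 4-digit keyspace, purely for demonstration): hash every candidate once, then invert any hash with a dictionary lookup. Rainbow tables refine this by storing chains instead of every pair, trading lookup time for far less storage.

```python
import hashlib

# Precompute a hash -> preimage table over a tiny toy keyspace.
candidates = [f"{n:04d}".encode() for n in range(10_000)]
table = {hashlib.sha256(c).hexdigest(): c for c in candidates}

# Inverting a hash is now a single lookup instead of a brute-force search.
target = hashlib.sha256(b"4242").hexdigest()
print(table.get(target))  # b'4242'
```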