
Research Review: The Role of Field Programmable Gate Arrays in the Acceleration of Modern High Performance Computing Workloads

  • Writer: DSS Engineering
  • 19 hours ago
  • 6 min read

Background: The "Research Review" series is a collection of articles that review scientific publications on financial modeling, foundational mathematics, and parallel computing. It is meant to enrich the community with technical knowledge and provide clarity on topics that can create distrust in the markets.


Paper under review: M. d. Castro, D. L. Vilariño, Y. Torres and D. R. Llanos, "The Role of Field-Programmable Gate Arrays in the Acceleration of Modern High-Performance Computing Workloads," in Computer, vol. 57, no. 7, pp. 66-76, July 2024, doi: 10.1109/MC.2024.3378380.


Introduction


In the world of High-Performance Computing (HPC), the landscape is dominated by familiar giants. Central Processing Units (CPUs) serve as the versatile heart of any system, while Graphics Processing Units (GPUs) have risen to become the standard accelerator, powering everything from scientific research to the artificial intelligence revolution. This CPU-GPU partnership has become the default architecture for today's supercomputers.


Yet, a third type of hardware accelerator has long been a subject of significant research interest: the Field-Programmable Gate Array (FPGA). Despite their theoretical promise, FPGAs have seen surprisingly low general adoption in the HPC mainstream. They exist as a powerful but often misunderstood alternative, capable of feats that general-purpose processors cannot match.


This raises a critical question: Given their unique capabilities, what surprising and counter-intuitive realities have kept FPGAs from becoming a mainstream force in HPC? The answers reveal crucial lessons about the trade-offs between raw power, flexibility, and developer productivity.


1. They’re Not New Technology, They’re a Comeback Story Gone Sideways


One of the most common misconceptions is that FPGAs are a new or emerging technology. In reality, they have been around since the mid-1980s, when they were introduced for prototyping small digital circuits. They served as a reconfigurable canvas for hardware engineers to test their designs before committing to expensive, permanent Application-Specific Integrated Circuits (ASICs).


By the 2000s, FPGAs had grown powerful enough to move beyond proof-of-concept prototyping and into final production. With more logic cells, memory, and specialized arithmetic units, they became effective accelerators in their own right. They were even adopted in some supercomputing clusters to accelerate specific, demanding tasks in fields like cryptography, genomics, and pattern recognition.


However, a major turning point occurred in the mid-2000s. GPUs, backed by powerful and user-friendly development platforms like NVIDIA's CUDA, surged ahead in popularity and performance for parallel computing. FPGAs were relegated to more niche, embedded applications where their extreme energy efficiency was a critical advantage, but they lost their foothold in mainstream HPC. This history is surprising because it reveals that despite decades of technological advancement, the struggle for FPGAs to reclaim a spot in the HPC limelight was caused by factors beyond just raw power.


THE DECLINE OF FPGAs WAS NOT JUST AN ISSUE OF COMPUTING PERFORMANCE OR EFFICIENCY BUT ALSO A PROBLEM OF PRODUCTIVITY.


2. The "Easy" Way to Program Them is Deceptively Difficult


Traditionally, programming an FPGA requires using a hardware description language (HDL) like VHDL or Verilog. These languages give engineers granular control over the hardware, but for HPC software developers, they are "cumbersome and error prone" and involve steep learning curves and long development times.


To solve this productivity problem, vendors introduced High-Level Synthesis (HLS) flows, such as those built around OpenCL and Xilinx's Vitis. The goal was to allow developers to program FPGAs using familiar, C-based languages, treating the reconfigurable hardware like any other computational resource. This was meant to make FPGA programming as accessible as ordinary software development.
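

To make that idea concrete, here is a minimal sketch (not taken from the paper) of what an HLS-style kernel can look like: an ordinary C function plus a vendor directive, written roughly in the style of Vitis HLS. The function name, the pragma choice, and the request for one iteration per clock cycle are illustrative assumptions.

/* Illustrative sketch only: a vector addition expressed as plain C in
 * the style accepted by HLS flows such as Vitis HLS. Instead of being
 * compiled to instructions, the loop is synthesized into a hardware
 * pipeline on the FPGA fabric. */
void vec_add(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i++) {
#pragma HLS pipeline II=1   /* request one new loop iteration per clock cycle */
        c[i] = a[i] + b[i];
    }
}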


The counter-intuitive reality, however, is that these high-level tools do not guarantee performance. Code written for a GPU, even using a supposedly portable language like OpenCL, generally performs poorly on an FPGA without significant, device-specific manual optimization.


The most impactful drawback is the compilation time. Translating high-level code into a hardware configuration is a complex, multi-step process. Compiling sophisticated HPC kernels for FPGAs can take several hours, a stark contrast to the minutes or seconds required for CPU or GPU code. This bottleneck drastically increases development time and cost, creating a major barrier to adoption. To combat this, researchers are exploring "overlay architectures," higher-level abstractions that could simplify programming and reduce these long compilation times, though this technology is still developing.
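

As a hedged illustration of the kind of device-specific restructuring involved (the pragma names follow Vitis HLS conventions, and the bank of eight partial sums is an arbitrary choice, not a figure from the paper), compare a naive dot product with a version reorganized so the tool can build parallel adders:

/* Naive form: the running sum carries a dependency from one loop
 * iteration to the next, so the synthesized pipeline stalls on each
 * addition. This is the kind of code that ports but performs poorly. */
float dot_naive(const float *a, const float *b, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* FPGA-oriented restructuring (illustrative only): accumulate into a
 * small bank of partial sums so the unrolled iterations map to
 * independent adders, then reduce the bank at the end. */
float dot_restructured(const float *a, const float *b, int n)
{
    float partial[8] = { 0.0f };
    for (int i = 0; i < n; i++) {
#pragma HLS unroll factor=8     /* replicate the multiply-add hardware */
        partial[i % 8] += a[i] * b[i];
    }
    float sum = 0.0f;
    for (int j = 0; j < 8; j++)
        sum += partial[j];
    return sum;
}

Even a change as small as this one must then go through the hours-long hardware compilation described above before its effect can be measured, which is part of what makes FPGA performance tuning so much slower than tuning GPU code.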


THE COMPILATION OF FPGA CODES IS A TIME-CONSUMING PROCESS, ESPECIALLY WHEN HIGH-LEVEL LANGUAGES ARE USED TO DESCRIBE SOPHISTICATED ALGORITHMS.


3. For an "Accelerator," Key Hardware Specs Lag Far Behind GPUs


While FPGAs are classified as hardware accelerators, a surprising look at their technical specifications reveals significant hardware limitations compared to their main competitor, the GPU.


First, FPGAs operate at considerably lower clock frequencies than both CPUs and GPUs. While their custom architecture allows them to perform more operations per clock cycle for certain tasks, their lower clock speed creates a performance ceiling that is hard to overcome for many general HPC workloads.
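

A back-of-envelope calculation makes that ceiling visible. None of the numbers below come from the paper; they are invented purely to illustrate the relationship that throughput is roughly clock frequency times operations completed per cycle.

#include <stdio.h>

/* Illustration only: how many operations per cycle an FPGA design
 * would need to sustain to match a GPU, given assumed clock rates.
 * All figures are made-up placeholders, not values from the paper. */
int main(void)
{
    double fpga_clock_hz = 300e6;   /* assumed FPGA fabric clock       */
    double gpu_clock_hz  = 1.5e9;   /* assumed GPU core clock          */
    double gpu_ops_cycle = 1000.0;  /* assumed chip-wide GPU ops/cycle */

    double needed = gpu_clock_hz * gpu_ops_cycle / fpga_clock_hz;
    printf("FPGA ops/cycle needed just to match the GPU: %.0f\n", needed);
    return 0;
}

With these assumed figures, the FPGA design would have to complete roughly five times as many operations per cycle as the GPU just to break even, which is why custom pipelines alone often cannot close the gap on general workloads.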


Second, and more importantly, FPGAs are constrained on multiple memory fronts, likely the "main limiting factor" for their performance. Most data center FPGAs use older Double Data Rate 4 (DDR4) memory technology, resulting in lower peak memory bandwidth than GPUs, which have long since moved to faster standards. This is compounded by the fact that available FPGA boards do not support the same large memory sizes found in GPUs, and the cost of moving data in and out of the card can easily wipe out any gains made in the computation. Furthermore, achieving even the theoretical peak memory bandwidth is incredibly difficult: real-world applications often struggle to get more than 70% of the peak bandwidth due to strict requirements on data access patterns, a challenge not nearly as pronounced on GPUs.
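

To put rough numbers on that gap (ballpark figures for the memory standards involved, chosen for illustration and not taken from the paper), the sketch below combines a DDR4-class FPGA card with the roughly 70% efficiency ceiling described above:

#include <stdio.h>

/* Rough illustration of the memory-bandwidth gap. The peak values are
 * ballpark figures for the respective memory standards (e.g., four
 * DDR4-2400 channels versus an HBM2-class GPU), not measurements from
 * the paper, and real boards vary. */
int main(void)
{
    double fpga_peak_gbs = 4 * 19.2;              /* assumed DDR4 channels on the card */
    double gpu_peak_gbs  = 900.0;                 /* assumed HBM2-class GPU memory     */
    double fpga_usable   = 0.70 * fpga_peak_gbs;  /* ~70% of peak is already hard to
                                                     reach in practice                */

    printf("FPGA peak bandwidth   : %6.1f GB/s\n", fpga_peak_gbs);
    printf("FPGA usable bandwidth : %6.1f GB/s\n", fpga_usable);
    printf("GPU peak bandwidth    : %6.1f GB/s\n", gpu_peak_gbs);
    return 0;
}

Under these illustrative assumptions, the FPGA is working with well under a tenth of the GPU's peak bandwidth before the cost of moving data over PCIe is even counted.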


MOST BANDWIDTH LIMITS ON FPGAs COME FROM THE USE OF DOUBLE DATA RATE 4 TECHNOLOGY, WHILE GPUs HAVE BEEN USING FASTER MEMORY TECHNOLOGY FOR SOME YEARS NOW.


4. Their True Superpower Isn’t Raw Speed, It’s Extreme Customization


Despite their limitations in clock speed and memory bandwidth, FPGAs possess a unique and powerful characteristic that sets them apart: reconfigurability. Unlike a GPU or an ASIC, which has a fixed hardware design, an FPGA can be reconfigured to implement entirely different hardware circuits tailored to different tasks. This can even be done "on the fly" using a technique called "dynamic partial reconfiguration," allowing parts of the chip to change behavior without halting the entire device.


This reconfigurability leads to their key advantages in specific scenarios:


  • Low-latency and predictable performance: By creating custom hardware paths for specific computations, FPGAs eliminate the overhead found in general-purpose processors, such as fetching and decoding instructions. This direct execution path results in extremely low latency and consistent, predictable timing, which is vital for urgent computing, real-time data analysis, or any application where timing is critical.


  • Excelling at irregular tasks: GPUs are masters of data-parallel problems where the same instruction is applied to massive datasets. FPGAs, on the other hand, are ideal for applications with irregular algorithms, such as those involving custom data widths, combinational logic problems, finite-state machines, and parallel MapReduce problems that don't fit the rigid structure of a GPU.


The true value of an FPGA isn't in trying to beat a GPU at its own game of massively parallel computation. Instead, its superpower is solving problems that general-purpose devices are fundamentally not well-suited for.
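

To make the "irregular tasks" point more concrete, here is a small, purely illustrative sketch of the kind of kernel that suits FPGA fabric: a protocol-style finite-state machine operating on 4-bit symbols. The states, symbols, and field widths are invented for the example; on an FPGA, logic like this synthesizes to a few lookup tables and registers with fixed single-cycle latency, while on a CPU or GPU the narrow data width and branchy control flow leave most of the hardware idle.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative only: a tiny finite-state machine over 4-bit symbols,
 * the kind of narrow, branchy, stateful logic that maps naturally to
 * FPGA fabric but poorly to wide SIMD hardware. */
enum state { IDLE, HEADER, PAYLOAD, DONE };

struct fsm {
    enum state s;
    unsigned int payload : 4;   /* custom 4-bit field, like a 4-bit register */
};

static void step(struct fsm *m, uint8_t sym)
{
    sym &= 0x0F;                                      /* keep only the 4-bit symbol */
    switch (m->s) {
    case IDLE:    if (sym == 0xA) m->s = HEADER;                break;
    case HEADER:  m->payload = sym; m->s = PAYLOAD;             break;
    case PAYLOAD: if (sym == 0xF) m->s = DONE;                  break;
    case DONE:    /* terminal state */                          break;
    }
}

int main(void)
{
    struct fsm m = { IDLE, 0 };
    uint8_t stream[] = { 0xA, 0x7, 0xF };             /* made-up input stream */
    for (size_t i = 0; i < sizeof stream / sizeof stream[0]; i++)
        step(&m, stream[i]);
    printf("final state: %d, payload: %u\n", (int)m.s, (unsigned)m.payload);
    return 0;
}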


FPGAs BECOME COMPETITIVE WHEN WORKING WITH APPLICATIONS WITH SPECIFIC CONSTRUCTS OR REQUIREMENTS FOR WHICH GENERAL-PURPOSE COMPUTING DEVICES ARE NOT SUITED.


5. They Were Perfectly Positioned for the AI Revolution—And Missed It


For a time, FPGAs were considered a highly promising platform for accelerating deep learning. Their ability to efficiently handle the pipelined nature of neural network models and support custom, low-precision data types made them seem like a perfect fit for the coming AI boom. They offered a way to build highly customized and power-efficient AI inference engines.
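

As a hedged illustration of the "custom, low-precision data types" point (the bit widths, scale factor, and values below are arbitrary and not from the paper), the arithmetic at the heart of a quantized inference engine can be as narrow as a model tolerates, for example 8-bit multiplies feeding a wider accumulator, and an FPGA can implement exactly that width in hardware:

#include <stdint.h>
#include <stdio.h>

/* Illustrative quantized dot product: 8-bit weights and activations,
 * 32-bit accumulation, one floating-point rescale at the end. The
 * sizes and scale are placeholders; an FPGA could implement the same
 * idea at 8, 4, or any other width the model tolerates. */
static float dot_q8(const int8_t *w, const int8_t *x, int n, float scale)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int32_t)w[i] * (int32_t)x[i];
    return scale * (float)acc;
}

int main(void)
{
    int8_t w[4] = { 12, -7, 3, 25 };    /* made-up weights     */
    int8_t x[4] = { 4, 9, -2, 1 };      /* made-up activations */
    printf("%.3f\n", dot_q8(w, x, 4, 0.05f));
    return 0;
}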


However, two major breakthroughs in the hardware industry dramatically changed the landscape and sidelined FPGAs.


  • First, GPU vendors like NVIDIA integrated dedicated AI hardware directly into their products. The introduction of Tensor Cores provided specialized processing units for the exact matrix operations that dominate deep learning workloads, giving GPUs a massive, built-in advantage.


  • Second, Google developed the Tensor Processing Unit (TPU), a highly specialized ASIC built from the ground up for one purpose: accelerating neural networks.


These innovations, which targeted the very tasks where FPGAs could have excelled, effectively captured the deep learning market. The result is that today, despite their theoretical potential, FPGAs are "not the platform of choice for accelerating large-scale deep neural networks." It stands as a significant missed opportunity for a technology that seemed perfectly positioned for the biggest computing revolution of the decade.



Conclusion: The Specialist's Tool in a Generalist's World


The story of the FPGA in high-performance computing is one of a fundamental tension. On one hand, FPGAs possess unique and powerful capabilities for customization, flexibility, low-latency processing, and energy efficiency. On the other, they are held back by significant hardware limitations, most notably memory bandwidth, and an immensely complex development process that inflates costs and timelines.


Their enduring strength lies not in competing with GPUs on mainstream HPC workloads, but in accelerating irregular tasks for which building a custom, single-purpose ASIC would not be profitable. They occupy a crucial middle ground between general-purpose flexibility and special-purpose performance.


This leaves us with a final, forward-looking question: As computing demands become more diverse and specialized, will critical progress in development tools and hardware finally unlock the mainstream potential of FPGAs, or will they remain the powerful but perpetually niche tool for the world's most highly specialized problems?

 
 
 
