Design experience across a wide range of applications—from signal processing to network packet processing to cryptography to deep learning inference—has shown that, properly used, FPGAs can provide very substantial performance and power improvements in algorithm execution. Generally, these improvements come from implementing computational kernels—the inner loops of the algorithm—in the FPGA hardware, offloading these kernels from CPU software and applying massive parallelism or deep pipelining to their execution. Given this history, it seems intuitive that as cloud and enterprise data centers take on more compute-intensive workloads—including artificial intelligence, streaming data analysis, network functions virtualization, and traditional supercomputing applications—FPGAs could offer substantial benefits to both the data center owner and the workload owner. In fact, some major public cloud providers have begun offering FPGA acceleration for some workloads. As the most recent generation of FPGAs, with millions of registers and hundreds of megabits of internal memory, moves deeper into the market, this trend will certainly accelerate (to coin a phrase).
But in practice, closer examination reveals what any experienced user of large FPGAs already knows: there is a profound mismatch between traditional FPGA development techniques and the software-dominated, high-level culture of the data center—the application developers, DevOps managers, and increasingly automated processes for managing workloads (Figure 1). This mismatch is most acute at two points: programming, and management and orchestration (MANO).
What is Programming?
The first point of mismatch arises from the fact that the word programming means something very different for a CPU than it means for an FPGA. The two meanings refer to different kinds of tasks, and imply different skills.
Programming a CPU, of course, means defining data structures and creating an ordered list of instructions for the CPU to execute. The CPU hardware is a black box, known only to correctly execute the machine-language instructions generated by the programming tool chain. The actual sequence and timing of instruction execution is invisible to the programmer except to the extent that she imposes constraints on sequencing with semaphores or similar devices.
FPGA programming also generally starts with human-readable languages and translation tools. But here the high-level tools and languages, such as C++, are used not to create instruction sequences but to describe algorithms. The programming styles and best practices for the two purposes can be quite different. From an algorithmic description, the tool chain quickly moves into a new realm: languages that may look like programming languages but are in fact specifications of hardware structures, the so-called hardware description languages (HDLs) such as Verilog. At this level data structures are defined, but in terms of bits and bytes. There are no executable instructions, just HDL code that defines blocks of logic, associates them with registers, and defines the interconnection between the elements. At this point, it may be more appropriate to call the development process configuring rather than programming, because the language tools are specifying how the FPGA logic is to be configured.
The FPGA tool chain then moves deeper, into territory that has no analogy in software development. First, the FPGA developer must specify timing constraints that will govern subsequent steps. Then automated tools map the HDL description of the design into a network of logic and storage nodes, and then onto the actual hardware macro functions, logic elements, and interconnect segments of the FPGA chip, in such a way as to meet the timing constraints. While FPGA developers may refer to this whole chain as programming, in fact it begins with something like programming but quickly becomes something very different: configuration of hardware.
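To make the idea of a timing constraint concrete, the following minimal sketch checks whether a few register-to-register paths fit within a target clock period. The path names, delays, and clock targets are invented for illustration, and the model is deliberately simplified: it ignores setup time, clock skew, and jitter, all of which a real timing analyzer accounts for.

```python
def clock_period_ns(freq_mhz):
    """Return the clock period in nanoseconds for a frequency in MHz."""
    return 1000.0 / freq_mhz


def timing_slack(path_delays_ns, freq_mhz):
    """Map each path name to its slack: clock period minus path delay.

    Negative slack means the path is too slow for the requested clock.
    Simplified model: ignores setup time, clock skew, and jitter.
    """
    period = clock_period_ns(freq_mhz)
    return {name: period - delay for name, delay in path_delays_ns.items()}


def failing_paths(path_delays_ns, freq_mhz):
    """Return the sorted names of paths that violate the constraint."""
    slack = timing_slack(path_delays_ns, freq_mhz)
    return sorted(name for name, s in slack.items() if s < 0)


# Hypothetical path delays in nanoseconds (illustrative values only).
paths = {"adder_to_acc": 2.9, "mult_to_adder": 3.6, "fifo_to_mult": 1.8}
print(failing_paths(paths, freq_mhz=300.0))
```

At 300 MHz the period is about 3.33 ns, so only the 3.6 ns path fails. Relaxing the target to 250 MHz (a 4 ns period) lets every path in this toy example pass, which is the kind of trade-off the mapping tools negotiate automatically.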
The other cultural divide separating FPGAs from data centers lies in the complex, arcane, and increasingly automated world of management and orchestration. In a cloud data center with thousands of applications moving between standby and active status, sometimes in a matter of milliseconds, it is necessary to have automation in control. Resources have to be tracked, updated, maintained, and billed out. Workloads have to be authenticated, their requests for resources serviced, and their execution monitored and gracefully terminated. All of this must be done in a way that optimizes data-center infrastructure utilization. To achieve this, MANO tools rely on the uniformity of the data-center fabric: many, many identical racks of similar servers, network connections, and storage nodes. Also vital is the ability to make any resource in the data center virtually present to any server through the network.
It is, to say the least, not obvious how FPGAs fit into this environment. They are not fixed assets: at any given time every FPGA in the data center may in theory be configured differently. Their configurations can be altered, but in milliseconds, not nanoseconds. And they may not be present at every compute node in the fabric. Where MANO desires uniformity, FPGAs offer variety, in function, space, and time.
Bridging the Mismatch
Inserting FPGAs into a data-center fabric to perform a specific function—say as a virtual layer-2 switching device or a cryptographic accelerator—can avoid the mismatch issues by keeping the chips to a fixed function and transparent to users. But making the value of FPGAs accessible to application developers and end users requires a great deal of additional work beyond just putting the chip on a card and the card in a server slot. This effort must start at the hardware level and move through layers of increasing abstraction, all the way to the levels at which software developers and data-center operators do their jobs. It is useful to borrow a concept from networking engineers and model this work as a stack. For our purposes we will choose five layers: physical, configuration, abstraction, environmental, and MANO (Figure 2).
At the level of the FPGA silicon, designers face a number of decisions. Although it is technically possible to create an FPGA chip carrier that plugs into a CPU socket on a server board, doing so is often not feasible for thermal, power, and mechanical reasons. These considerations suggest placing the FPGA on its own card, plugged into the data-center server rack so the chip’s serial ports have access to the rack’s backplane network. To couple the FPGA and its fast in-package memory more closely to a particular CPU, the card can bridge to a server CPU via PCI Express* (PCIe*). This way the FPGA can either work as a slave accelerator to the CPU, or it can stream data directly from the network. This, for instance, is the approach Intel has taken with its Intel® Stratix® 10 Programmable Accelerator Card.
Another, less obvious piece of design effort goes into this level. Unprogrammed, the FPGA can do very little beyond responding to its PCIe interface and authenticating, decrypting, and loading a configuration file. Functions that make the FPGA a useful accelerator—able to talk with its host CPU, maintain security, transfer data, and manage execution—must be configured into the chip. These functions create a hardware gasket, often called a signal bridge, into which user-defined functions can plug. Design of the gasket clearly requires the cooperation of FPGA, CPU, and server board designers.
These physical provisions make it possible to initialize and control acceleration functions in the FPGA. A further layer of support, device drivers integrated with the operating-system kernel, makes these control and supervision operations physically available to the OS, hypervisors, and user workloads. But available is not necessarily accessible. There is more to the stack.
Just as there are many levels of programming languages, ranging from highly abstract to highly hardware-dependent, there are several kinds of FPGA programming and configuration tools that provide different entry points for FPGA developers. These range from tools that can almost directly map a C++ program into an FPGA configuration to tools that let developers configure directly in an HDL. Tools below that level, which manipulate individual configuration bits within the chip, also exist but are generally used only by the FPGA vendors themselves.
For algorithm developers without hardware skills there are tools that convert software (usually a C/C++ dialect) into HDL code for use by the vendor’s native tool chain. Broadly, these tools fall into two kinds. The older variety, from the days when FPGAs were primarily used to accelerate small functions in embedded systems, allows developers to describe the behavior of a logic block in a subset of C/C++. The tool then generates HDL code for that block, which can be linked with functions not easily described in C, I/O functions, and the aforementioned gasket to produce a complete accelerator description. This HDL is then fed into the vendor tools to be synthesized, verified, timing-checked, and mapped into the FPGA hardware. Finally, the vendor tools create a configuration file that contains the values to be loaded into individual configuration registers on the chip. This process can speed production of accelerator blocks, but may still require experienced FPGA developers to take the design all the way through the chain. It also presumes experience with the kind of parallel-programming or pipelining structures that allow the FPGA to bring its massive resources to bear on the algorithm.
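Why pipelining structures matter can be estimated with simple arithmetic. The sketch below compares cycle counts for a datapath that handles one item at a time against a fully pipelined version of the same datapath. It assumes every stage takes exactly one clock cycle and that the pipeline accepts a new item each cycle, which real designs only approximate; the function names and numbers are illustrative.

```python
def sequential_cycles(n_items, stages):
    """Cycles when each item passes through every stage before the next starts."""
    return n_items * stages


def pipelined_cycles(n_items, stages):
    """Cycles for a full pipeline accepting one new item per cycle.

    The first result appears after `stages` cycles; each later result
    follows one cycle after the previous one.
    """
    if n_items == 0:
        return 0
    return stages + n_items - 1


n, depth = 1000, 8
print(sequential_cycles(n, depth), pipelined_cycles(n, depth))  # 8000 1007
```

For long input streams the pipelined cycle count approaches one cycle per item, regardless of how many stages the computation needs.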
More recently, there has been strong interest in a slightly higher-level tool that comprehends concepts of parallelism, pipelining, and streaming. Using the industry-standard OpenCL™ language, a developer separates his algorithm into two parts: a main or control block that is written in C and runs on the host CPU, and one or more kernels that are to be accelerated. The main block will move data and control the execution of the kernels via parallel-aware application programming interfaces (APIs).
The kernels are coded in a C subset with pragmas to describe how they are to be parallelized and managed. With compiler directives, the developer then compiles the code either to run entirely on the CPU for debug, or for the main body to run on the CPU and the kernels to be parallelized onto some specified amount of FPGA resources. In the latter case, the kernel code is transformed into HDL, and an automated process synthesizes it, automatically generating appropriate timing constraints, and maps the netlist to the FPGA region, linking it to the gasket and necessary I/O. The automation of the back-end tasks, in exchange for a little bit of the theoretical maximum performance, removes most need for intervention by FPGA experts and significantly reduces time to deployment.
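The split between a main block and kernels can be sketched in plain Python. This is not OpenCL and uses no OpenCL API: `saxpy_kernel`, `launch`, and `host_main` are hypothetical names, and a thread pool merely stands in for the accelerator. What the sketch does show is the structure described above: a kernel written per work-item, a host routine that moves data and launches it, and a switch between a sequential CPU run for debug and a parallel run.

```python
from concurrent.futures import ThreadPoolExecutor


def saxpy_kernel(i, a, x, y, out):
    """Kernel body: computes one output element, identified by work-item id i."""
    out[i] = a * x[i] + y[i]


def launch(kernel, global_size, args, parallel=False):
    """Host-side launcher (hypothetical): run `kernel` once per work-item.

    parallel=False runs the work-items in order on the CPU, which is
    convenient for debugging; parallel=True dispatches them to a thread
    pool as a stand-in for an accelerator.
    """
    if parallel:
        with ThreadPoolExecutor() as pool:
            # list() forces completion and re-raises any kernel exception.
            list(pool.map(lambda i: kernel(i, *args), range(global_size)))
    else:
        for i in range(global_size):
            kernel(i, *args)


def host_main(parallel):
    """Main block: prepares buffers, launches the kernel, returns the result."""
    x = [1.0, 2.0, 3.0, 4.0]
    y = [10.0, 20.0, 30.0, 40.0]
    out = [0.0] * len(x)
    launch(saxpy_kernel, len(x), (2.0, x, y, out), parallel=parallel)
    return out


print(host_main(parallel=False))  # [12.0, 24.0, 36.0, 48.0]
print(host_main(parallel=True))   # same result, different execution path
```

Because each work-item writes only its own output element, the kernel gives the same answer in either mode. That independence is what lets a tool chain spread the work-items across parallel hardware.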
The process of identifying opportunities for acceleration and implementing them in kernels is itself neither trivial nor always intuitive. So tools that assist in parallelization and exploration, such as Intel Parallel Studio, can be a great help even though they were often developed for programming the vector processors in CPUs rather than creating parallel hardware on FPGAs.
Despite the several levels of abstraction available in FPGA tools, many application developers will find even a modest facility in parallel programming and FPGA development irrelevant to their purposes. What they need is not more automated tools but to have an outside expert do the FPGA design work and simply hand them a set of libraries to initialize the accelerator and invoke FPGA acceleration of specific functions.
This is already happening, starting with widely-used libraries that cut horizontally across applications. As FPGA deployment in clouds grows, we will see more announcements of more specialized libraries as well. These will target specific fields of endeavor, such as structural or fluid analyses, physical chemistry, or genetics.
The next step beyond libraries is integration of FPGA acceleration into application frameworks, such as those for machine learning, data analytics, or video coding. This is already happening as well. At this level the user is completely removed from questions of FPGA architecture and operation. He may not even be aware of the presence of the FPGA other than as a switch setting that leads to significantly better performance.
As these trends develop we will see more turnkey applications requesting FPGA acceleration from the data center and applying it transparently if it is available. At this point a great deal of work has gone into algorithm adaptation, library development, testing, and integration, but all the application user sees is significantly shorter execution time or higher throughput.
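From the application side, transparent acceleration might look like the following minimal sketch. The `AcceleratedLibrary` class and its `accelerator` argument are hypothetical and do not represent any vendor's library. The point is only the pattern: the caller invokes one function, and the library decides whether an FPGA-backed implementation is available or whether to fall back to software.

```python
def cpu_dot(a, b):
    """Reference software implementation of a dot product."""
    return sum(x * y for x, y in zip(a, b))


class AcceleratedLibrary:
    """Hypothetical library facade that hides the accelerator from the caller."""

    def __init__(self, accelerator=None):
        # `accelerator` is any callable offered by the platform, or None
        # when no FPGA function was granted to this workload.
        self._accelerator = accelerator
        self.last_backend = None

    def dot(self, a, b):
        """Use the accelerator when present; otherwise fall back to the CPU."""
        if self._accelerator is not None:
            try:
                result = self._accelerator(a, b)
                self.last_backend = "fpga"
                return result
            except RuntimeError:
                # Accelerator failed at run time: fall through to software.
                pass
        self.last_backend = "cpu"
        return cpu_dot(a, b)


lib = AcceleratedLibrary()  # no accelerator granted in this run
print(lib.dot([1, 2, 3], [4, 5, 6]), lib.last_backend)  # 32 cpu
```

The application code is identical in both cases. Only the `last_backend` field, kept here for illustration, reveals which path was taken.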
A Development Environment
The question of testing brings us to the next level in our stack, the environmental layer. No matter how the FPGA implementation has been developed, it will require the same debug environment as any other cloud application—an environment that gives DevOps folk control over location, isolation, and execution of the package and the ability to automatically deploy it. But there are critical issues for FPGA-accelerated packages that do not exist for software-only workloads.
An FPGA-aware environment must have access to the chip’s drivers in order to load configuration files into the chip, read and write FPGA memory, initialize FPGA logic, and start and stop execution. Most of these operations require some level of support from the gasket logic in the FPGA as well as driver code on the server CPU. Even so, debugging FPGA-accelerated code cannot be quite like debugging pure software. FPGAs have no intrinsic equivalent of the trace, single-step, and breakpoint functions in the CPU, and, as one developer wryly observed, they have no equivalent of PRINTF. Providing such debug functions requires some level of support from the gasket logic and explicit inclusion of debug structures in the user’s logic design.
That is not to say debug is done by intuition and guess. There are tools that provide software simulation of the FPGA configuration at a level useful to application developers. And tools for OpenCL can generate either CPU code or HDL for kernels, allowing developers to conduct high-level debug in a software-only environment first.
At a much deeper level, there are tools such as the SignalTap logic analyzer that allow experienced FPGA developers visibility into individual FPGA registers. But most users, working with verified libraries or frameworks, would have no need for such tools and would simply debug their code at source level as if it were calling pre-compiled software functions.
There are many tasks in the cloud data center that do not directly execute user code. Management tasks keep track of what resources are present where in the fabric, ensure that updates are fully deployed, track resource use for billing, and respond to exceptions. Orchestration tasks assign compute, network, and storage resources to tasks, ideally in ways that optimize both resource utilization and task performance. Hypervisors create virtual machines, protect them from each other, and bind them to physical resources. All of these tasks must deal with the special situation of FPGA accelerators.
The issue is simply that, unlike a CPU, an FPGA has no intrinsic function at the user level. It is more like a receptacle into which computing resources may be installed. For management automation software, this means working through the FPGA’s drivers and, preferably, some open standard interface like Intel’s Open Programmable Accelerator Engine to discover the version of gasket code currently in the FPGA, what user functions are installed, and what resources are currently available for further functions. It also means interpreting exception signals from the driver and handling them appropriately.
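The kind of record a management tool might keep for each FPGA can be sketched as a small data structure. Everything here is hypothetical: the field names, the abstract "logic units," and the version strings are assumptions for illustration and do not reflect the data model of any real driver or of the Open Programmable Accelerator Engine.

```python
from dataclasses import dataclass, field


@dataclass
class FpgaRecord:
    """Hypothetical inventory entry built from what a driver reports."""
    device_id: str
    gasket_version: str
    total_logic: int  # abstract logic units on the chip
    functions: dict = field(default_factory=dict)  # function name -> logic used

    def free_logic(self):
        """Logic units not yet committed to any user function."""
        return self.total_logic - sum(self.functions.values())


def needs_gasket_update(inventory, required_version):
    """Return ids of devices whose gasket version differs from the required one."""
    return [r.device_id for r in inventory if r.gasket_version != required_version]


inventory = [
    FpgaRecord("fpga-0", "1.2", 1000, {"aes": 300, "gzip": 250}),
    FpgaRecord("fpga-1", "1.1", 1000),
]
print(inventory[0].free_logic(), needs_gasket_update(inventory, "1.2"))
# 450 ['fpga-1']
```

With records like these, questions such as "which devices still need the current gasket?" or "how much room is left on this chip?" become simple queries rather than manual inspection.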
The challenge for orchestration automation is somewhat different. It must understand what resources are already programmed into each FPGA and what resources remain uncommitted, as well as how the chip is connected into the data-center fabric: PCI Express* (PCIe*) to a server, Ethernet to the data-center network, private links to other FPGAs, or some combination thereof (Figure 3). It must comprehend allocation of the FPGA’s internal memories. And it must make decisions about when to preserve the configuration of an FPGA for later, and when to erase and reconfigure all or part of it, allowing for the not inconsiderable configuration time. The problem is similar to the challenges presented by orchestrating servers with large blocks of rather granular persistent memory, but with more constraints.
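The preserve-or-reconfigure decision can be illustrated with a toy placement heuristic. This is a sketch under stated assumptions, not a production policy: the `place` function, the dictionary layout, the abstract logic units, and the 150 ms reconfiguration time are all invented for the example, and connectivity and memory allocation are ignored entirely.

```python
def place(fpgas, function, logic_needed, reconfig_ms):
    """Pick an FPGA for `function` and return (device_id, setup_cost_ms).

    Preference order (a simple heuristic, not a production policy):
      1. a device that already holds the function: no reconfiguration cost;
      2. the device with the most free logic that can fit it: pay reconfig_ms.
    Returns None when nothing fits.
    """
    for dev_id, state in fpgas.items():
        if function in state["functions"]:
            return dev_id, 0.0

    candidates = [
        (state["free_logic"], dev_id)
        for dev_id, state in fpgas.items()
        if state["free_logic"] >= logic_needed
    ]
    if not candidates:
        return None
    _, best = max(candidates)
    return best, reconfig_ms


# Hypothetical fabric state (illustrative values only).
fpgas = {
    "fpga-0": {"functions": {"aes"}, "free_logic": 200},
    "fpga-1": {"functions": set(), "free_logic": 700},
}
print(place(fpgas, "aes", 300, 150.0))   # ('fpga-0', 0.0)
print(place(fpgas, "gzip", 300, 150.0))  # ('fpga-1', 150.0)
```

Reusing a configuration that is already resident avoids the reconfiguration delay entirely, which is why an orchestrator has reason to keep popular functions loaded even when they are momentarily idle.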
Similarly, virtualization hypervisors must know how to discover FPGA resources, and must understand the implications of shifting a virtual machine to a different physical FPGA. Like orchestration tools, hypervisors prefer a uniform fabric. The fine-grained heterogeneity that can accumulate as workloads request, configure, and release portions of FPGAs across the data center can become challenging for them.
It Takes a Stack
We have observed that for many important applications FPGA acceleration offers very substantial performance gains. But just dropping FPGAs into server racks will not make those gains accessible to most application developers. Nor will it fit FPGAs acceptably into the data center’s MANO automation framework.
Achieving those goals requires an adaptation stack. There must be an interface gasket with an open interface and drivers for the OS. There must be a range of programming and configuration tools, libraries, development frameworks, and turnkey solutions. And there must be close integration with MANO automation tools and hypervisors from major vendors. Only then does the promise transcend bragging rights and internal uses to become a reality.