We have entered the age of heterogeneous multiprocessing. In high-performance computing applications, architects are adding hardware accelerators to the multicore CPU clusters in their huge supercomputers. At the other end of the spectrum, designers of embedded and mobile systems are moving critical code loops into hardware to slash energy consumption. Everywhere in between, embedded-system designers are looking at multicore SoCs with on-chip accelerators or programmable graphics processors, or at FPGAs with integrated CPU cores, and wondering if such chips could be the best route to adequate performance at minimal energy for their application.
Such architectures offer a way forward even as the promises of increasing uniprocessor performance and of aggressive multithreading fade. But heterogeneous multiprocessing brings its own challenges. For software developers there are issues such as scheduling, synchronization, and programming models. For hardware developers, one of the key questions is how to attach the accelerator to the rest of the system. That is the question we will take up here.
There are many ways to attach a hardware accelerator to a system. They range from the simple and traditional to the astonishingly complex. At one extreme, you can treat the accelerator as a separate computer and connect to it via Ethernet, or even the Internet. The logical end of that line of reasoning is the cloud as an accelerator.
Drawing the accelerator closer, you can treat it as a peripheral and connect via PCI Express® (PCIe®), or, for a point-to-point connection, via a parallel port or a specialized serial link such as Interlaken. In some cases, the central SoC may provide a memory-coherent interface—an extension of an on-chip cache bus—such as Intel’s QuickPath Interconnect (QPI).
All of these approaches assume that the accelerator is physically separate from the main CPU SoC. If the accelerator can be on the CPU die, there are more options. An accelerator can be an on-chip peripheral attached to an on-chip bus, such as AMBA® AXI™, or to a network-on-chip (NoC). Even more intimately, an accelerator can share L2 or L3 cache with the CPU cores. And in a few cases, the accelerator may actually move inside the CPU, becoming an execution unit driven by the CPU instruction stream, as has happened with floating-point units and vector processors, both of which were once external boxes.
A closely related question for on-die accelerators is resource sharing. Will the accelerator share an AXI switch fabric with other devices? If so, will it participate in memory coherency? Will it share access to a DRAM controller, or to a cache? Each of these answers carries implications for performance and design complexity.
“One of the key issues in thinking about accelerators is granularity,” says Ting Lu, SoC architect at Altera. He measures granularity roughly by the number of lines of code the accelerator would execute before passing control back to the host CPU. This can range, Ting explains, from just a few lines for very fine-grained accelerations, to anywhere from a few hundred to a few tens of thousands of lines for a medium-granularity task such as Advanced Encryption Standard (AES) encryption, to truly huge code sets for coarse-grained tasks.
As a rule of thumb, you always want the time consumed in passing control and data to and from the accelerator to be much shorter than the execution time in the accelerator. This rule by itself dictates some decisions.
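This rule of thumb can be sketched as a simple break-even check. The structure, the function name, and the 10x margin below are all illustrative assumptions, not from any real API; the point is only that hand-off overhead must be dwarfed by accelerator execution time.

```c
#include <stdbool.h>

/* Illustrative offload cost model: overhead (control hand-off plus
 * data transfer) should be much shorter than accelerator execution. */
typedef struct {
    double control_latency_us; /* time to pass control to the accelerator */
    double transfer_time_us;   /* time to move input and output data */
    double accel_exec_us;      /* execution time inside the accelerator */
} offload_cost;

bool offload_worthwhile(const offload_cost *c)
{
    double overhead_us = c->control_latency_us + c->transfer_time_us;
    return overhead_us * 10.0 < c->accel_exec_us; /* "much shorter": 10x margin */
}
```

With this model, a job whose overhead roughly equals its execution time fails the test, while a coarse job that runs a hundred times longer than its hand-off passes easily.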
“The only way to get the control latency short enough for fine-grained acceleration is to put the hardware in a tightly coupled coprocessor, or in an expanded execution pipeline inside the CPU,” Ting asserts. “Eventually, fine-grained accelerations tend to disappear into the CPU core.”
Medium-grained tasks present a different situation. When the execution time is longer, the accelerator’s control logic doesn’t need to feed directly from a CPU’s instruction dispatch unit. There is time for the CPU to load registers via a local bus and to point to a job control block in shared memory. That observation brings up the next major topic for discussion: how to get the data in and out of the accelerator. Key questions will involve data movement, caching, and coherency.
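A job control block of this kind might look like the sketch below. The field names, layout, and doorbell mechanism are hypothetical—every accelerator defines its own—but the pattern is common: the CPU fills in a descriptor in shared memory, then writes its address to a memory-mapped register to start the job.

```c
#include <stdint.h>

/* Hypothetical job control block laid out in shared memory. */
typedef struct {
    uint64_t src_addr;        /* physical address of input data */
    uint64_t dst_addr;        /* where the accelerator writes results */
    uint32_t length;          /* bytes to process */
    uint32_t opcode;          /* which operation to perform */
    volatile uint32_t status; /* accelerator sets this on completion */
} job_control_block;

/* The CPU fills in the block, then writes its address to a
 * memory-mapped doorbell register; completion is later signaled
 * through the status field (by polling or interrupt). */
static void submit_job(volatile uint64_t *doorbell, job_control_block *jcb)
{
    jcb->status = 0;                           /* mark job as in flight */
    *doorbell = (uint64_t)(uintptr_t)jcb;      /* hand the block to the accelerator */
}
```

In a real system the doorbell pointer would map to a device register rather than ordinary memory.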
“You start by looking at the target workloads,” says Dwight Barron, fellow and chief technologist in the Hyperscale Unit of Hewlett Packard. “In modern software, the rule is to keep the data close to the processor that is using it. And don’t move the data without working on it.”
Bruce Mathewson, AMBA architect and ARM® fellow, counsels a tight focus on the data. “Understand the characteristics of the data,” Mathewson says. “Look at the memory footprint, at the accelerator’s access patterns, and at interaction with the CPU.”
Footprint is a key concept here, but there are at least two ways of looking at it. There is the total size of the data set, which can range from quite modest in some embedded applications to absolutely enormous in big-data applications like global climate modeling or consumer behavior analyses.
But there is another sense to footprint—one that is more important to this discussion. That is the set of data the accelerator will pick up, touch, and release without having to return and open it up again during the current activity. For a fast Fourier transform (FFT) processor, the footprint might be 1024 samples. For a video compression engine it might be a couple of UHD video frames. For a neural-network simulation, the active footprint might be many millions, or even billions, of simulated synapses.
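Back-of-envelope arithmetic shows the range these examples span. The assumptions below (32-bit FFT samples, 8-bit YUV 4:2:0 UHD video at 12 bits per pixel) are illustrative choices, not requirements of the algorithms.

```c
#include <stddef.h>

/* Active footprint of a 1024-point FFT with 32-bit samples: 4 KB. */
size_t fft_footprint_bytes(void)
{
    return 1024u * 4u; /* 1024 samples x 4 bytes each */
}

/* Active footprint of two UHD frames in YUV 4:2:0 (1.5 bytes/pixel):
 * roughly 24 MB -- four orders of magnitude larger than the FFT case. */
size_t two_uhd_frames_bytes(void)
{
    size_t pixels = 3840u * 2160u;    /* one 3840x2160 frame */
    size_t frame  = pixels * 3u / 2u; /* YUV 4:2:0 packing */
    return 2u * frame;
}
```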
The concept of footprint is intertwined with Mathewson’s second concept, the algorithm’s access patterns. Some algorithms slide a relatively small window around within a much larger data set, touching only the data currently inside the window. Others have much poorer locality, accessing data at random across vast data spaces—like our neural-net example. Some touch only a few items scattered through a large data set; some touch every element exactly once; still others hammer away repeatedly at a few hot spots.
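The sliding-window case can be made concrete with a tiny model (all sizes illustrative): after the first step, each slide brings in only one new element, so the active footprint stays fixed at the window size even though the pass eventually covers the whole data set.

```c
#include <stddef.h>

/* Toy model of a sliding-window access pattern: a window of w elements
 * slides one element at a time across a data set of n elements. */
typedef struct {
    size_t active_footprint; /* live data at any instant */
    size_t total_touched;    /* elements touched over the whole pass */
} window_stats;

window_stats slide_window(size_t n, size_t w)
{
    window_stats s = { 0, 0 };
    for (size_t step = 0; step + w <= n; ++step) {
        s.active_footprint = w;                   /* never exceeds the window */
        s.total_touched   += (step == 0) ? w : 1; /* one new element per slide */
    }
    return s;
}
```

For a 4096-element set and a 64-element window, the accelerator only ever needs 64 elements resident at once—which is why such algorithms tolerate small shared caches that random-access algorithms would thrash.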
Mathewson’s third point is about interaction between the CPUs and the accelerator. Does a CPU task generate the data and hand it off to the accelerator? Does the data come directly from an external source and the result flow into memory? Do multiple CPUs continuously touch the data set while the accelerator is working on it? All of these questions are relevant.
The three points come together to suggest the best way to attach medium-granularity accelerators. For example, if the active data footprint is small and the data is being generated in the CPUs, it can make sense for the accelerator to share L1 or L2 caches with the CPUs. If the data footprint is somewhat larger, or if the data is not owned by any one CPU, it may make sense for the accelerator to share on-chip L3 cache instead.
Mathewson cautions that even when it makes sense in principle for the accelerator to share access to the on-chip caches, further analysis of the sequence of events is important. “If the CPU is generating the data and the accelerator is picking it up a few cycles later, you don’t have to worry about the cache getting flushed,” he says. “But in general, if an accelerator shares L2 and isn’t tightly coupled to what is happening in the CPUs, you can thrash the L2.”
When you are planning to share caches, another question comes up: coherency. One possibility is that the accelerator becomes a peer on a coherent cache bus, as shown in Figure 1. But an accelerator can also benefit from cache coherency as a peripheral. Ting points out that AMBA now offers a range of options, from the AXI Coherency Extensions (ACE) when you want the CPU caches and the memory inside the accelerator fully coherent, to ACE-Lite when the accelerator has no internal memory accessible to the outside, to standard AXI when you don’t need coherency protocol at all.
In principle, if the controlling task on a CPU can ensure a deterministic sequence of events, you never need hardware coherency. You can be certain that the accelerator can pick up its block of current data from one location, process it, and return its results without any other task interfering. But few systems achieve this ideal. Often the accelerator won’t be the only processor working on the data, or asynchronous events can move the data around in the cache hierarchy, so the accelerator by itself may not know where to find the valid copy of each byte of data.
In these situations a hardware-based coherency protocol can be a huge time saver. For example, if the task that creates the data is spread across multiple CPU cores, some of which must service interrupts, then without coherency all the CPUs might have to flush their caches so the accelerator can pick up current values of all the data elements from DRAM. With hardware coherency, the accelerator can simply issue reads to, say, one shared L2 cache, and the coherency hardware will ensure that the system delivers the valid copy of each word, wherever it happens to reside at that instant.
We have been quietly assuming here that there is just one L2 cache on the chip. But today’s SoCs may contain many CPU clusters, each cluster with its own L2 cache, as shown in Figure 2. “When you have multiple clusters, you have to associate the accelerator with just one L2 cache,” Mathewson warns. “If some of the accelerator’s data may be on other L2s, let the coherency protocol resolve that. The complexity of attaching to multiple L2s just isn’t worth it.”
If the accelerator must be accessible to several different CPU clusters, or if the active footprint is too large for the L2 caches, or the access pattern is likely to lead to thrashing, then shared access to L3 cache is the next alternative to consider. But of course not every SoC has an L3, or a large enough one. At this point we begin to think of the accelerator less as a peer in a heterogeneous multicore architecture and more as an attachment. So the question becomes: how do we attach it?
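The guidelines of the last few paragraphs can be distilled into a rough decision sketch. The thresholds, the enum, and the function below are illustrative only—real choices depend on measured footprints and access patterns, not fixed cutoffs.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative on-chip attachment choices discussed in the text. */
typedef enum { SHARE_L1_L2, SHARE_L3, DIRECT_DRAM_PORT } attach_choice;

attach_choice choose_attachment(size_t active_footprint,
                                size_t l2_size, size_t l3_size,
                                bool data_owned_by_one_cpu,
                                bool likely_to_thrash)
{
    /* Small footprint, data generated in one cluster, well-behaved
     * access pattern: share that cluster's L1/L2. */
    if (active_footprint < l2_size / 2 && data_owned_by_one_cpu && !likely_to_thrash)
        return SHARE_L1_L2;

    /* Larger footprint, or data not owned by one CPU, but still
     * cache-resident: share the on-chip L3 if there is one. */
    if (l3_size > 0 && active_footprint < l3_size && !likely_to_thrash)
        return SHARE_L3;

    /* Too big, or too disruptive, for the caches: bypass them. */
    return DIRECT_DRAM_PORT;
}
```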
Now we are talking about accelerators only loosely coupled to CPU tasks. Under command from a CPU the accelerator can load, process, and store a block of data, bypassing the cache architecture altogether. Such an accelerator may or may not reside on the CPU die.
If the accelerator is still on the SoC, it will probably have command and status registers on the SoC’s bus. But it may also, if the data footprint is large, connect directly to a port on the chip’s DRAM controller. This strategy avoids the issues that arise when an accelerator shares cache with the CPUs, but it introduces new issues.
One such issue is contention for the DRAM channel. If the memory controller is not sophisticated enough to prioritize and schedule access requests from multiple clients, a second or third client can trigger a catastrophic rise in page misses, and a similarly dispiriting drop in DRAM throughput.
Another key point is that designs that bypass the cache will usually lie outside the coherency sphere. That simplifies hardware design, but it puts new responsibilities on the software—responsibilities that complicate coding and can harm performance. Without coherency hardware, the software must ensure that the accelerator and CPUs consistently get valid data when they make an access.
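The software's new responsibilities amount to explicit cache maintenance around each hand-off. The primitives below are hypothetical stand-ins—every platform names its clean/invalidate and accelerator-control operations differently—and are stubbed here purely to show the required ordering.

```c
#include <stddef.h>

/* Stubs recording call order, standing in for real platform
 * cache-maintenance and accelerator-control primitives. */
static int step_counter;
static int clean_step, invalidate_step, start_step, wait_step;

static void cache_clean_range(void *p, size_t n)      { (void)p; (void)n; clean_step      = ++step_counter; }
static void cache_invalidate_range(void *p, size_t n) { (void)p; (void)n; invalidate_step = ++step_counter; }
static void accel_start(void *in, void *out)          { (void)in; (void)out; start_step   = ++step_counter; }
static void accel_wait(void)                          { wait_step = ++step_counter; }

/* Outside the coherency sphere, software must write dirty input lines
 * back to DRAM before the hand-off, and discard any stale cached
 * copies of the output region before reading results. */
void run_noncoherent_job(void *in, size_t in_len, void *out, size_t out_len)
{
    cache_clean_range(in, in_len);        /* make input visible in DRAM */
    cache_invalidate_range(out, out_len); /* avoid reading stale output lines */
    accel_start(in, out);
    accel_wait();                         /* results are valid only after this */
}
```

Getting this sequence wrong—or omitting a step on one of many code paths—is exactly the kind of bug that hardware coherency eliminates.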
But what about accelerators that are not on the SoC die? If the system design employs a coherent inter-chip link such as QPI or advanced forms of HyperTransport™ Protocol, an external accelerator can still reside within the coherency sphere and share cache with the SoC, albeit at some penalty in latency, transfer speed, and energy consumption. In fact some server designers are using this approach today to create coherent accelerators in FPGAs that sit next to multicore CPU chips on the board. In the next few years, we are likely to see 2.5D IC modules that implement a very wide coherent bus—essentially an extension of the cache bus between dice.
But without QPI or an equivalent, we are in the realm of conventional external buses. “The basic principle stays the same,” Mathewson explains. “The further the accelerator is away from the CPUs, the bigger the offload needs to be, and the greater the speed-up. You always have to ask if the hand-off is really worth it.”
Candidates for the interconnect link now include chip-to-chip connections like Interlaken, multimaster buses like PCIe, and high-speed network connections like 10 Gigabit Ethernet. These links are generally implemented using high-speed serial transceivers, so they can achieve remarkably low transfer energy. But they exhibit higher latency than more intimate connections.
At this point, the astute reader might have observed that the criteria for selecting an accelerator attachment scheme are application- and even algorithm-dependent. This would be no surprise to embedded-system designers, who are quite used to algorithm-specific hardware. But the idea presents a bit of a quandary in the data-center world. How do you provision a cloud if the connections within the servers are application-specific?
Hewlett Packard’s Moonshot architecture offers one answer: a purpose-built system chassis. As HP’s Barron describes it, the Moonshot chassis accepts third-party computing cartridges. These cartridges, which may be defined by either silicon vendors or system developers, contain one or more CPU SoCs and/or accelerator chips.
“The architecture provides two levels of interconnect for accelerators,” Barron explains. “On-die heterogeneous multiprocessing has a lot of prior practice in the ARM world, in particular, so naturally we support it. There can also be multiple processing chips within a cartridge. Normally the boundary of coherency stays inside the SoC.
“At the second level, we support cartridge clustering. The chassis provides a 2D torus of nearest-neighbor interconnect. These are passive traces, not dedicated to a specific fabric, so their use varies from customer to customer. But most, in one way or another, use the torus to tunnel Ethernet packets between cartridges. We separately provision a lot of standard Ethernet bandwidth to each cartridge.”
We have seen that there are many approaches, from the exceedingly intimate to the arm’s-length, for attaching an accelerator to the CPUs it serves. The criteria for choosing one approach over another include control granularity, data footprint, access techniques, and access interactions. All of these criteria are application-determined, and some may even change with algorithm choices within a given application.
It is reasonable to surmise, then, that the job of embedded-system chip developers, FPGA users, and even board-level designers in the near future will involve these analyses of their application and some challenging decisions about how to handle accelerations. For developers of “general-purpose” computing systems, which appear to be growing more and more application-specific, the longer-term challenge may be to develop a virtualized computing fabric in which the accelerators, their characteristics, and the way they connect into the fabric are all alterable at run-time. That is no small challenge to throw down before either group of system developers.