The annual Hot Chips conference in Silicon Valley offers a reliable window into the architectural thinking of both CPU giants and exciting start-ups. This year proved to be no exception, as architects squared off against the limitations of physics and the demands of workloads, with special attention going to the trending task of the year, deep learning. Taken together, the papers could almost be read as a celebration of heterogeneous computing with hardware accelerators (Figure 1).
Naturally, much attention focused on the headline chips: server-class CPUs. AMD, IBM, and Intel each presented their current offering. Interestingly, AMD’s Epyc and Intel® Xeon® Scalable processors were based on cores—Zen and Skylake, respectively—presented at last year’s conference.
On the surface, the two cores are similar. Both offer four integer and two floating-point pipelines, along with address calculation and load-store units. Both load a wide-enough word from instruction cache to take in several of the shorter x86 instructions per cycle. The cores crack these complex, variable-length instructions into micro-operations, which they mark to indicate data dependencies and load into huge buffers. Out-of-order dispatch units then select micro-ops that are ready to execute and stuff them into available execution pipes.
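The dispatch step described above can be sketched in a few lines of Python. This is a toy model, not either vendor's design: the `MicroOp` class, the `schedule` function, the single-cycle latency, and the six-operation `PROGRAM` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MicroOp:
    name: str
    deps: tuple = ()  # names of micro-ops whose results this one needs


def schedule(micro_ops, issue_width):
    """Return one list of issued micro-op names per cycle.

    Toy assumptions: every micro-op completes in one cycle, and any
    execution pipe can run any micro-op.
    """
    done = set()
    waiting = list(micro_ops)  # kept in program order
    cycles = []
    while waiting:
        # A micro-op is ready once all of its producers have completed.
        ready = [op for op in waiting if all(d in done for d in op.deps)]
        if not ready:
            raise ValueError("dependency cycle or missing producer")
        issued = ready[:issue_width]  # oldest ready micro-ops go first
        issued_names = {op.name for op in issued}
        cycles.append([op.name for op in issued])
        done |= issued_names
        waiting = [op for op in waiting if op.name not in issued_names]
    return cycles


# Hypothetical micro-op stream for: store((a + b) * c)
PROGRAM = [
    MicroOp("load_a"),
    MicroOp("load_b"),
    MicroOp("add", ("load_a", "load_b")),
    MicroOp("load_c"),
    MicroOp("mul", ("add", "load_c")),
    MicroOp("store", ("mul",)),
]

if __name__ == "__main__":
    for width in (1, 2, 4, 8):
        print(width, schedule(PROGRAM, width))
```

With this particular stream the schedule takes six cycles at an issue width of 1 and four cycles at widths of 2, 4, and 8. The chain of dependencies from the loads through the add and multiply to the store sets a floor that extra pipelines cannot lower.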
The point of all this trouble and hardware is to extract every last drop of instruction-level parallelism from single-thread code. As we move to new process generations, transistors may get somewhat faster, but interconnect takes back much of the improvement, causing block-level maximum clock frequencies to level off. But each new node does provide a significant increase in transistor density. So processor core designers are lavishing transistors on circuits that can increase the effective number of instructions per clock on benchmarks. AMD, for instance, claims they are getting 50 percent more IPC on some codes with Zen compared to the IPC of their previous-generation core.
But this pursuit doesn’t end at more pipelines and wider dispatch units. Architects must also avoid stalls, or all is for naught. That requires even more transistors—for ever more elaborate branch prediction, pipeline bypasses, predictive prefetches and other ingenious devices to keep the pipelines from draining. And it means ever larger and more sophisticated cache hierarchies to minimize the number of clock cycles lost when, all else having failed, the CPU core must wait for memory.
All of this was clear from last year’s papers describing Zen and Skylake. This year’s papers took up the tale again as architects packed the cores onto dice, and the dice onto substrates, while trying to provision these systems with enough bus, cache, and memory bandwidth to keep the large numbers of cores running simultaneously.
In the Server Socket
AMD’s Epyc mounts four dice, each carrying two four-core CPU clusters, on an organic substrate. Each core has a private L2, and each cluster has a shared L3. Each die has a pair of 8 x 72-bit DRAM channels, and the dice are interconnected across the substrate by a proprietary point-to-point fabric.
The Intel Xeon Scalable processor, in contrast, is a single die carrying 28 Skylake cores and six DRAM channels in a switchbox interconnect mesh (Figure 2). Each core has an L2, and there is a shared L3. In both designs a lot of die area goes into increasing cache hit rates and bandwidth between the caches and DRAM. And increasingly, with high-speed links like AMD’s undedicated SERDES connections and Intel Ultra Path Interconnect, transistors are going into providing bandwidth between sockets.
But why spend transistors on more cores instead of on bigger cores? The main answer is diminishing returns. In a world where, it is estimated, the available instruction-level parallelism in most codes is only a bit more than three, six- or eight-issue cores may be doing all that can be done, even with the help of powerful compiler optimizations. Once you are running the clock as fast as it can go, and launching as many instructions per clock as the software has to offer, you must look for parallelism elsewhere.
The low-hanging fruit in cloud data centers is parallel threads. With a rich mix of workloads, many of them multithreaded, and with huge local memory, you can keep many cores busy by giving them many independent threads to execute.
The obvious next question then is what about individual applications with many independent threads—map-reduce searches, for example, or algorithms that can partition data into little chunks and launch a thread to deal with each chunk? Yes, you can certainly speed up such thread-rich applications by throwing lots of cores at them—assuming you don’t run out of memory bandwidth. But in many cases you can do even better than just mapping these many threads onto independent processor cores.
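The chunk-and-thread pattern mentioned above might look like the following minimal sketch. The function names, the chunk size, and the worker count are hypothetical, and the task (counting matches in a list) stands in for a real search.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(data, chunk_size):
    """Partition data into consecutive chunks of at most chunk_size items."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def count_matches(chunk, target):
    """Map step: each thread scans only its own chunk."""
    return sum(1 for item in chunk if item == target)


def parallel_count(data, target, chunk_size=1000, workers=4):
    """Launch one task per chunk, then reduce the partial counts."""
    chunks = chunked(data, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(count_matches, chunks, [target] * len(chunks))
        return sum(partials)  # reduce step


if __name__ == "__main__":
    data = [i % 7 for i in range(10_000)]
    print(parallel_count(data, 3))
```

In standard CPython the global interpreter lock keeps these threads from executing Python bytecode on several cores at once, so the sketch shows the structure of the decomposition, not the speed-up a many-core server would deliver.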
Multicore Has a Place
As we saw earlier, server-chip architects have gone to great lengths to be sure each socket has access to enough DRAM bandwidth and capacity to support all those cores. But suppose we focused on a thread-rich, compute-intensive workload that could use even more than 28 or 32 cores with the bandwidth available to a single socket. Then we would have an interesting tradeoff. If we simplify, and hence shrink, the individual cores, we will reduce single-thread performance of each core, but we might increase the number of cores per socket substantially. On compute-intensive codes, when the inner loops fit within the reduced L1 instruction caches, this can be a big net win. That is the theory behind Intel’s Knights Mill many-core chip and a number of more radical presentations.
Knights Mill starts out with an Intel Xeon processor-compatible core, then simplifies it to a 2-issue—but still out-of-order executing—engine. It adds some specific capabilities for machine learning. It eliminates on-die L3 cache, replacing it with 16 GB of high-speed DRAM on the device’s multi-die module. These changes allow the architects to pack 72 cores, with their L1 and L2 caches and vector processing units, all onto the main die.
Knights Mill is mid-way between a server CPU and a hardware accelerator. It can execute standard Intel Xeon processor operating software and workloads. But it is optimized to accelerate numerically-intensive workloads particularly friendly to many-core processors. As such, it is a step down the road toward pure accelerators that require rewriting of at least some code and must be used in conjunction with CPUs.
A paper from Baidu took the idea a step deeper into specialization. Instead of streamlined x86 cores, Baidu created a genuinely tiny core, small enough to fit nearly a thousand cores into a large FPGA. At this size, it is hard to make the cores general-purpose, so instead Baidu gave them small, domain-specific instruction sets, an alternative that is attractive in an FPGA but not so viable in an ASIC. While admitting there was still much to be done, the paper claimed that its implementation of 256 cores in an FPGA accelerated some tasks up to 64x compared to a single eight-core Xeon chip.
Another variation on this theme came from a team of chip designers spread across four universities: Cornell, UCLA, UC San Diego, and the University of Michigan. The team designed a hierarchy of processors using two variants of the open-source RISC-V core: a general-purpose tier of five high-end RISC-V Rocket Cores, and an accelerator array of 496 tiny RISC-V Vanilla-5 cores, plus a tier of specialized neural-network acceleration hardware.
Milking Data Parallelism
If you have achieved many threads by dividing data up among many identical copies of a single instruction sequence, there are further ways to eliminate redundant circuitry and free more die area and power for more execution units. For example, if you know all the cores are going to be executing the same instruction sequence, but not necessarily in lock step, you can have a single instruction cache, decode unit, and reorder buffer, with each execution unit maintaining its own dispatch buffer and instruction counter, an arrangement that some call single-thread, multiple data. This approach is approximated in advanced graphics processing units (GPUs), two of which were profiled in Hot Chips papers.
But if you can ensure that each execution unit will execute exactly the same sequence of instructions, you can go even further, having a single fetch, decode, and dispatch pipeline shared by all the execution units: single-instruction, multiple data, or SIMD. The most familiar SIMD implementations are traditional GPU shading-engine arrays, which live on in the inner workings of even today’s advanced GPUs. Along with specialized graphics processing hardware, GPUs include massive arrays of little floating-point processors, often organized as groups of SIMD clusters. These little processors were originally designed only to run shading algorithms for the polygons in 3D image renderings, but they have become general-purpose enough to handle compute-intensive codes of other sorts. Accordingly, blessed with massive numbers of little floating-point engines and enormous memory bandwidth, GPUs have become the most widely used accelerators in high-performance computing and in data centers that support compute-intensive workloads—the latter category swelling rapidly with the growing interest in machine learning, which is a distinctly compute-intensive and data-parallel task.
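As a rough illustration of the SIMD idea, the sketch below decodes each instruction once and applies it to every data lane in lock step. The three-opcode instruction set and the `simd_run` function are invented for this example and do not correspond to any real GPU.

```python
import operator

# Hypothetical three-opcode instruction set.
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def simd_run(program, lanes):
    """Run one instruction stream over all lanes in lock step.

    program: list of (opcode, constant) pairs.
    lanes:   one data value per execution unit.
    Returns the final lane values and the number of decode operations.
    """
    decodes = 0
    for opcode, constant in program:
        fn = OPS[opcode]  # fetched and decoded once ...
        decodes += 1
        lanes = [fn(x, constant) for x in lanes]  # ... executed on every lane
    return lanes, decodes


if __name__ == "__main__":
    print(simd_run([("mul", 2), ("add", 1)], [1, 2, 3, 4]))
```

The decode count depends only on the length of the program, not on the number of lanes, which is the saving in fetch and decode hardware that the shared pipeline buys.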
Two Hot Chips papers, one from AMD on its Vega 10 and one by Nvidia on its Tesla V100, showed that GPUs are adapting eagerly to the peculiarities of the machine-learning workloads. One major change is in data-path width. Recent research has shown that the inference side of deep learning—using the network to classify inputs after it has been trained—doesn’t really need the full 32-bit floating-point precision of the GPU execution units. So both AMD and Nvidia have added packed 16-bit data types to their little engines. Both have also adopted in-package high-bandwidth memory (HBM) to get data in faster—a requirement both for their day jobs in 3D rendering and for the insatiable hunger of deep-learning network computations.
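A packed 16-bit data type simply places two half-precision values in the space of one 32-bit word, so one 32-bit data path can carry two operands. The sketch below shows the idea using the IEEE 754 half-precision format from Python's standard `struct` module. The helper names are hypothetical, and the little-endian layout is an assumption, not a statement about either vendor's register format.

```python
import struct


def pack_half2(lo, hi):
    """Pack two values, rounded to half precision, into one 32-bit word."""
    return struct.unpack("<I", struct.pack("<2e", lo, hi))[0]


def unpack_half2(word):
    """Recover the two half-precision values held in a 32-bit word."""
    return struct.unpack("<2e", struct.pack("<I", word))


def packed_add(word_a, word_b):
    """Add both 16-bit halves of two packed words in a single call."""
    a_lo, a_hi = unpack_half2(word_a)
    b_lo, b_hi = unpack_half2(word_b)
    return pack_half2(a_lo + b_lo, a_hi + b_hi)


if __name__ == "__main__":
    word = packed_add(pack_half2(1.0, 2.0), pack_half2(0.5, 0.25))
    print(hex(word), unpack_half2(word))
    # Half precision keeps about three decimal digits of precision.
    print(unpack_half2(pack_half2(0.1, 0.0))[0])
```

The last line shows the cost of the narrower format: 0.1 is stored with an error of a few parts in 100,000, which the research cited above suggests is tolerable for inference.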
Nvidia has gone a step further, perhaps recognizing that its vast arrays of little cores are not ideally organized for the matrix arithmetic that forms the heart of deep-learning calculations. The V100 has added 640 so-called Tensor Cores—perhaps with a nod to Google’s Tensor Processing Unit ASIC—to accelerate matrix multiply-accumulate operations. Each of these cores provides a hard-wired multiply-accumulate data path for four-by-four matrices, using 16-bit floating-point operands and producing 32-bit floating-point sums. The Tensor cores can gang up into warps to do, for example, multiply-accumulates on 16-by-16 matrices. The result, according to Nvidia, is a nine-times speed-up for 16-bit floating-point matrix operations compared to the previous-generation P100 GPU.
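The arithmetic described above, a four-by-four multiply-accumulate with 16-bit operands and 32-bit sums, can be modeled numerically as follows. This is a sketch of the numerics only: the function names are hypothetical, rounding after every addition is an assumption about the hardware, and the tiling loop shows one plausible way to build a 16-by-16 product from 4-by-4 steps, not Nvidia's actual warp scheduling.

```python
import struct


def to_fp16(x):
    """Round a Python float to IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round a Python float to IEEE 754 single precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def mma_4x4(a, b, c):
    """Return D = A*B + C for 4x4 tiles: fp16 operands, fp32 accumulation."""
    d = [[0.0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            acc = to_fp32(c[i][j])
            for k in range(4):
                product = to_fp16(a[i][k]) * to_fp16(b[k][j])
                acc = to_fp32(acc + product)  # assumed: round after each add
            d[i][j] = acc
    return d


def tile(m, row, col):
    """Extract the 4x4 tile whose top-left corner is (row, col)."""
    return [r[col:col + 4] for r in m[row:row + 4]]


def matmul_tiled(a, b, n=16):
    """Multiply two n x n matrices (n a multiple of 4) from 4x4 MMA steps."""
    out = [[0.0] * n for _ in range(n)]
    for r in range(0, n, 4):
        for c in range(0, n, 4):
            acc = [[0.0] * 4 for _ in range(4)]
            for k in range(0, n, 4):
                acc = mma_4x4(tile(a, r, k), tile(b, k, c), acc)
            for i in range(4):
                out[r + i][c:c + 4] = acc[i]
    return out


if __name__ == "__main__":
    ones = [[1.0] * 16 for _ in range(16)]
    halves = [[0.5] * 16 for _ in range(16)]
    print(matmul_tiled(ones, halves)[0][:4])
```

Keeping the accumulator in 32-bit precision is what lets a long chain of 16-bit products be summed without the running total losing its low-order bits.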
By including application-specific logic in its GPU, Nvidia is taking a cue from an entirely different kind of accelerator: Google’s Tensor Processing Unit (TPU), which also was the subject of a paper this year. The TPU is essentially a hardware matrix multiplier/accumulator with an elaborate CPU interface. The serious computing is done in a 256-by-256 systolic array of 8-bit multipliers.
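For readers unfamiliar with systolic arrays, the sketch below models a small one in software. It uses a textbook output-stationary arrangement in which each cell holds one running sum while operands march past it. That arrangement is an assumption chosen for clarity and is not claimed to be the TPU's actual dataflow. The function name and the 8-bit range check are likewise illustrative.

```python
def systolic_matmul(a, b):
    """Model an output-stationary systolic array computing A (n x k) * B (k x m).

    Rows of A enter from the left and columns of B from the top, each
    skewed by one cycle per row or column, so a[i][s] and b[s][j] meet
    at cell (i, j) on cycle s + i + j. Returns (result, cycle count).
    """
    n, k, m = len(a), len(b), len(b[0])
    for matrix in (a, b):
        for row in matrix:
            if any(not -128 <= v <= 127 for v in row):
                raise ValueError("operands must fit in signed 8 bits")
    acc = [[0] * m for _ in range(n)]  # one wide accumulator per cell
    cycles = n + m + k - 2
    for t in range(cycles):
        for i in range(n):
            for j in range(m):
                s = t - i - j  # index of the operand pair arriving now
                if 0 <= s < k:
                    acc[i][j] += a[i][s] * b[s][j]
    return acc, cycles


if __name__ == "__main__":
    print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Each operand is fetched from memory once and then handed from cell to neighboring cell, which is why such an array can keep tens of thousands of multipliers busy with modest memory bandwidth.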
Beyond Matrix Arithmetic
Recognizing that the heart of deep learning is the artificial neuron, which is essentially a vector dot product, makes it very reasonable to think of neural-network acceleration as an instance of linear algebra, to be accelerated by powerful matrix multipliers. But a paper from ARM Research, Harvard, and Princeton argued that this view is a misunderstanding of what actually goes on in an inference network. And if your target application is deep-learning inference at the edge of the network, in space- and power-constrained systems, it is a crucial misunderstanding.
In such environments, the paper claimed, one has to exploit the sparsity of all those matrices. When a deep-learning network completes its training, many of the weights—the coefficients that get multiplied by the inputs, called activations, at each neuron—will be zero or near zero. So when you form the dot product that will become the output of the neuron, many of the individual products will be too small to matter. The ARM paper argued for an approach that could prune out these unnecessary operations. In their design, a static scheduler caches the weights that are important in local memory and routes necessary operations through the chip’s many execution unit pipelines, leaving out the unnecessary ones.
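The pruning idea can be shown with a single neuron. In this minimal sketch the weights are filtered once, ahead of time, and the dot product then visits only the surviving entries. The threshold value, the function names, and the sample numbers are assumptions for illustration and are not taken from the paper.

```python
def prune(weights, threshold):
    """Done once after training: keep (index, weight) pairs that matter."""
    return [(i, w) for i, w in enumerate(weights) if abs(w) >= threshold]


def sparse_neuron(pruned, activations):
    """Dot product over the surviving weights only."""
    return sum(w * activations[i] for i, w in pruned)


def dense_neuron(weights, activations):
    """Reference dot product that multiplies every weight."""
    return sum(w * x for w, x in zip(weights, activations))


if __name__ == "__main__":
    weights = [0.5, 0.0, -0.25, 0.0, 0.0, 1.0]
    activations = [2.0, 3.0, 4.0, 5.0, 6.0, 0.5]
    kept = prune(weights, threshold=0.01)
    print(len(kept), "of", len(weights), "multiplies kept")
    print(sparse_neuron(kept, activations), dense_neuron(weights, activations))
```

Here half of the multiplies disappear with no change in the result. With a threshold that also drops small nonzero weights, the output becomes an approximation, and the acceptable error depends on the network.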
But what about training, where you have no knowledge before run-time of the pruned network? One technique that could potentially eliminate many instructions and load/store operations would be to directly map the code’s data flow graph onto hardware. Such data flow graphs are often used as intermediate formats in compilers, and are the bases of many machine-learning frameworks, so they are often readily available. And there are open-source utilities for optimizing and pruning them.
At Hot Chips, two papers presented quite different approaches for mapping data flow graphs directly onto silicon. One paper was from Thinci—pronounced “think-eye”, as the company was quick to explain. Their chip is essentially a pool of small processors governed by a thread scheduler, in much the way a server CPU is a pool of execution units governed by an instruction scheduler (Figure 3). The chip decomposes a data flow graph into a set of threads—each thread representing the computation at one node in the graph—and maps these threads onto its processors in a way that allows the processors to stream data directly to each other, rather than continually having to do loads and stores to intermediate RAM.
A perhaps more radical approach came from Wave Computing. Their paper described an array of up to 16K processing elements, interconnected by a programmable switch matrix. A compiler maps a data flow graph directly onto this array, assigning graph nodes to processing elements and interconnecting the nodes through the switches. Thus in a sense, graphs are the native language of the chip. Wave says their automated tools and IC achieve about the same acceleration and die area as you would get by manually mapping the graph onto an advanced FPGA.
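To make the notion of mapping a data flow graph onto processing elements concrete, here is a toy sketch. It orders the nodes so that producers come before consumers, assigns each node to a processing element, and passes each result straight to the nodes that consume it. The round-robin placement, the function name, and the example graph are all assumptions. Neither company's compiler is described in enough detail here to reproduce.

```python
import operator
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dataflow(graph, ops, inputs, num_pes):
    """Evaluate a data flow graph and report a toy node placement.

    graph:   node -> tuple of producer nodes, in operand order
    ops:     node -> callable that computes the node's value
    inputs:  source node -> value
    num_pes: number of processing elements available
    """
    order = list(TopologicalSorter(graph).static_order())
    # Assumed placement policy: round-robin in topological order.
    placement = {node: idx % num_pes for idx, node in enumerate(order)}
    values = dict(inputs)
    for node in order:
        if node in values:
            continue  # source node, value supplied from outside
        operands = [values[p] for p in graph[node]]  # streamed from producers
        values[node] = ops[node](*operands)
    return values, placement


# Hypothetical graph for: (a * b) + (a - c)
GRAPH = {"mul": ("a", "b"), "sub": ("a", "c"), "add": ("mul", "sub")}
OPS = {"mul": operator.mul, "sub": operator.sub, "add": operator.add}

if __name__ == "__main__":
    values, placement = run_dataflow(GRAPH, OPS, {"a": 3, "b": 4, "c": 1}, 4)
    print(values["add"], placement)
```

A real mapper would also weigh the distance between communicating elements and the capacity of the interconnect, which this sketch ignores.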
With their ability to reduce or eliminate instruction fetches and intermediate data storage in compute-intensive kernels, to exploit data- and thread-parallelism, and to pack large numbers of computing elements onto a die, hardware accelerators (be they many-core CPUs, GPUs, arithmetic arrays, data flow engines, or FPGAs) are gaining a permanent role in computing. In fact, at the recent Linley Processor Conference, analyst Linley Gwennap claimed that “Over time, accelerators will pick up the bulk of workloads.”
But that doesn’t mean CPUs will stand still: not with process technology offering up a wellspring of new transistors every couple of years. CPU architects will find new ways to milk what instruction and thread parallelism remains in benchmarks and important customer codes. But they won’t ignore the promise of accelerators.
In addition to increasingly sophisticated vector-processing units—floating-point SIMD engines that have been available on server-CPU chips for some time—some server CPUs are sprouting crypto engines, and more modest processors are getting integral GPUs with general-purpose and deep-learning capabilities. And some experts believe there are more kinds of integrated hardware acceleration to come in the future. The one thing that seems certain is that wherever they turn, from mobile devices to the cloud, application developers will find richly heterogeneous platforms at their disposal.