The arrival of FinFETs, starting in the 20 nm CMOS logic process node, has been justly credited with saving Moore’s Law. Just as our ability to continue scaling planar MOSFETs began to come apart, the FinFET’s vastly superior channel control came to the rescue, taming leakage currents and opening the way to continued voltage scaling. The timely intervention allowed Vivek Singh, keynoting at the Design Automation Conference this month, to claim that through the 14 nm node Intel has maintained a steady improvement in transistor speed-power curves.
With leakage under the control of a gate that drapes over the fin-shaped transistor body on three sides—hence Intel’s preferred term, Tri-Gate—device designers are free to continue reducing operating voltage and critical dimensions. This liberation has continued through 14 nm, and promises to persevere through 10 nm and beyond, albeit supported by increasingly radical device engineering and by new materials. At the recent TSMC Technology Symposium, for example, TSMC vice president of R&D YJ Mii said that beyond 10 nm, his organization is still looking at FinFETs and their near relatives, gate-all-around FETs (Figure 1), but perhaps with fins of germanium or indium gallium arsenide.
While transistor size and performance have resumed scaling, the situation for on-chip interconnect is less promising. At the lowest level, FinFETs have inherently high parasitic capacitance, taking back some of the theoretical improvement in circuit speed-power. And as contacts, local interconnect, and lower metal layers shrink to keep pace with the incredible shrinking transistors, they are running into problems of their own. For example, as the contact pitch shrinks to match transistor spacing, the contact diameter must go down. But the liner that separates the contact fill material from the surrounding dielectric does not get that much thinner. And the edge roughness of the hole does not magically improve. So the area left for contact metal decreases more rapidly than the hole diameter, and the series resistance increases sharply. The problem is even more serious for low-level copper interconnect lines, which are gradually being pinched off by non-shrinking seed and barrier layers (Figure 2).
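The contact-scaling arithmetic above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hole is treated as a perfect circle, edge roughness is ignored, resistance is taken as inversely proportional to the metal cross-section, and the 3 nm liner thickness and the hole diameters are hypothetical numbers, not process data.

```python
import math

def contact_metal_area(hole_diameter_nm, liner_thickness_nm):
    """Cross-sectional area (nm^2) left for fill metal in a round contact hole."""
    metal_radius = hole_diameter_nm / 2.0 - liner_thickness_nm
    if metal_radius <= 0:
        return 0.0  # the liner has pinched off the contact entirely
    return math.pi * metal_radius ** 2

def relative_resistance(hole_diameter_nm, liner_thickness_nm, ref_diameter_nm):
    """Series resistance relative to a reference hole, assuming R ~ 1 / metal area."""
    ref_area = contact_metal_area(ref_diameter_nm, liner_thickness_nm)
    area = contact_metal_area(hole_diameter_nm, liner_thickness_nm)
    return math.inf if area == 0 else ref_area / area

LINER_NM = 3.0  # hypothetical liner thickness, held constant as the hole shrinks

for diameter in (40.0, 30.0, 20.0):  # hypothetical hole diameters
    hole_only = (40.0 / diameter) ** 2  # what ideal scaling of the whole hole would give
    with_liner = relative_resistance(diameter, LINER_NM, 40.0)
    print(f"{diameter:4.0f} nm hole: ideal x{hole_only:.2f}, with fixed liner x{with_liner:.2f}")
```

With these made-up numbers, halving the hole diameter would raise resistance fourfold if everything scaled together, but the fixed liner pushes the increase close to sixfold.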
In critical designs interconnect impedance had already started to matter by 40 nm, according to SRAM IP vendor Surecore. The company has reduced SRAM measured dynamic power by more than a factor of two in TSMC 40 LP by—among other things—eliminating long metal runs. Below 28 nm, resistance joins parasitic capacitance to form a real RC issue. “Beyond 28, resistance really begins to bite,” warns Surecore CTO Duncan Bremner. “You have to start matching RCs on different paths to avoid serious skew issues in digital circuits.”
Increasing process variations are also an issue. Actual transistor geometry, channel-doping levels, and interconnect geometry will vary not only between lots and between wafers, but across the surface of one die. Add to that potentially large operating-temperature differences and resultant differences in aging rates across the die, and you have, shall we say, an interesting situation. Different portions of the same chip may require quite different voltages to operate at the same clock frequency—or quite different clock frequencies to operate at minimum power.
So on the positive side, the next-generation FinFET processes are offering very real gains in transistor density and speed-power product. On the negative side, interconnect capacitance and resistance are becoming serious issues. And process variations—even on-die—have reached the point where multi-corner analysis and guard-banding just give up too much. These factors have intruded into the system architect’s field of view.
The obvious architectural response to this set of benefits and challenges is to keep the die very small. That being a non-starter in many SoC applications, the next-best alternative is modularization. Instead of visualizing the SoC as one giant synchronous circuit that happens to be composed of many functional blocks, think of it as a community of small, potentially very fast blocks, isolated from each other in frequency and voltage, and each small enough to negate the effects of on-die variations.
At first glance, this idea would appear to be a perfect fit for the trend toward multicore processing. Instead of one sprawlingly complex CPU trying to run at process fMAX, use four, or eight, or maybe ten smaller processor cores sharing a big L2 cache. The individual cores are more easily designed for high clock frequencies because of their shorter interconnect lines and lower internal variations. And signals that leave the cores to travel the longer distance to the cache controller are running at relaxed frequencies compared to the CPU core clocks.
Additionally, by setting frequency and voltage for each core independently you gain new control over the impact of variations. And you gain a wonderful tool for system energy management.
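As a rough sketch of what per-core voltage selection buys, the Python below picks the lowest voltage at which each core meets a target clock and compares the result with running every core at the worst-case voltage. The core names and the voltage/frequency tables are hypothetical characterization data invented for illustration, and the comparison assumes dynamic power proportional to V² at equal frequency and switched capacitance.

```python
def min_voltage_for(target_mhz, vf_table):
    """Lowest voltage in a block's (volts, fmax_mhz) table that meets the target clock."""
    feasible = [volts for volts, fmax_mhz in vf_table if fmax_mhz >= target_mhz]
    if not feasible:
        raise ValueError("block cannot reach the target frequency at any listed voltage")
    return min(feasible)

# Hypothetical per-core characterization: same design, different spots on the die.
BLOCKS = {
    "core0": [(0.70, 900), (0.80, 1200), (0.90, 1500)],
    "core1": [(0.70, 800), (0.80, 1050), (0.90, 1350)],   # slow corner of the die
    "core2": [(0.70, 1000), (0.80, 1300), (0.90, 1600)],  # fast corner of the die
}

def plan_voltages(target_mhz, blocks):
    """Choose an independent supply voltage for every block."""
    return {name: min_voltage_for(target_mhz, table) for name, table in blocks.items()}

def relative_dynamic_power(voltages):
    """Sum of V^2 over blocks, assuming equal frequency and switched capacitance."""
    return sum(v * v for v in voltages)

per_block = plan_voltages(1100, BLOCKS)
shared_v = max(per_block.values())  # a single rail must satisfy the slowest core
saving = 1.0 - (relative_dynamic_power(per_block.values())
                / relative_dynamic_power([shared_v] * len(per_block)))
print(per_block, f"saving vs. shared rail: {saving:.1%}")
```

In this toy case only the slow core needs the highest voltage, so independent rails trim dynamic power by roughly a seventh compared with a shared worst-case rail.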
So, problem solved, right? Well, there are a few limitations with the multicore approach. The first is that multiple cores only offer acceleration if the task mix is rich in independent threads, each with a modest cache footprint. If most of the cores are idle or the cache is thrashing, the system is not going to run much faster than with a single core.
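One common way to put numbers on this limitation is Amdahl's law, which the article does not name but which captures the same idea: the serial share of the work caps the speedup no matter how many cores are added. The sketch below is a minimal illustration; the parallel fractions are hypothetical, and the model ignores cache thrashing entirely.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Ideal speedup when only `parallel_fraction` of the work spreads across `cores`."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

# Hypothetical workloads: a mostly single-threaded embedded task versus a
# thread-rich server workload.
for label, fraction in (("embedded, 50% parallel", 0.50), ("server, 95% parallel", 0.95)):
    for cores in (4, 8):
        print(f"{label}, {cores} cores: x{amdahl_speedup(fraction, cores):.2f}")
```

For the half-serial workload, going from four to eight cores barely moves the result, which is the pattern the embedded and smart-phone examples below describe.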
This limitation is most apparent in embedded systems, where often there are only one or two heavy computing tasks, and these were coded, in a time long forgotten, as single threads, documented in what now appears to be early Sumerian. But even smart-phone applications SoCs—those notorious consumers of CPU cores—seem to have trouble finding enough threads for eight or ten CPUs. At least that is the drift of recent debate over the introduction of a ten-core SoC from MediaTek. Many architects feel that multicore design above eight cores is most properly used in the data center, where there are many tasks and huge caches.
Another issue with multicore concerns those big caches, at the chip-design level. As the shared caches grow, the aggregate bandwidth required of them goes up. But also, their physical size begins to require the long wires that we are specifically trying to avoid, and begins to invite problems with process variations. This problem is leading some designers to implement large caches as arrays of small SRAM blocks, independently timed and power-managed. But this is not yet mainstream thinking. And the idea of an asynchronous, message-passing interface between the CPU cores and the L2—let alone between blocks within the L2 cache—is still radical. The only exception is in designs where the coherent cache bus must cross die boundaries, such as the Intel QuickPath Interconnect (QPI) or IBM’s Coherent Accelerator Processor Interface (CAPI).
Once we have uncoupled the CPU cores—limiting the impact of interconnect issues and process variations—we can go back to the question of how best to use all those transistors. The answer will be application-dependent. But increasingly, architects are turning to heterogeneous multicore processing: including different kinds of processing units in the cluster around the shared cache.
The most obvious example might be ARM’s big.LITTLE concept, in which a mix of high-performance and low-power cores gathers around the cache. The two types of cores—for example, a big Cortex®-A15 and a little Cortex-A7—have the same instruction set and similar state registers, so tasks can move easily between them, as either acceleration or energy saving is the priority.
But the concept works for different kinds of cores as well as for different sizes. At last year’s Hot Chips conference, ARM® CTO Mike Muller made a cryptic reference to work within ARM on a heterogeneous multicore cluster that included both CPUs and graphics processing units (GPUs), sharing an instruction set. More obviously, a number of companies, including AMD, IBM, Intel, and Microsoft, have been working on architectures that closely couple GPUs or FPGAs with CPUs.
Such heterogeneous strategies work not by executing more threads in parallel, but by exploiting opportunities within a single thread. A GPU provides a massive single-instruction/multiple-data (SIMD) engine for exploiting highly parallel data, such as, well, graphics. An FPGA can implement a parallel or pipelined datapath, or simply a state-machine-driven loop that eliminates instruction fetches. A many-core architecture like Intel’s Xeon Phi can exploit either data parallelism or the ability to shatter a task into a very large number of lightweight threads. In any case, the result is significant acceleration, and in some cases energy reduction, for threads that have been recoded for the novel hardware.
Some signals still have to leave the multicore cluster and make their way across the die. The traditional ways to close timing on these paths have been either to load them up with buffers and route them on higher-layer—and hence lower-impedance—traces, or to give up and declare them multicycle paths. Both options have disadvantages.
But with a plethora of gates available at only a modest cost in static power, other choices present themselves. One option is to pipeline the long paths, bringing the signals down to a register—or better, a clock-crossing register to manage metastability—often enough to keep the delay on each segment within a single clock cycle. This may also require level-shifting as the signals cross between voltage domains. Register insertion also allows retiming, which can be useful even if it means duplicating some logic in a new domain.
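The pipelining arithmetic can be sketched as follows. The Python below estimates how many intermediate registers a long path needs so that every segment fits in one clock cycle. It assumes the path delay divides evenly among segments (roughly true for a buffered wire), lumps setup time and clock uncertainty into a single margin, and does not model the synchronizers needed when crossing clock domains. The delay and clock figures are hypothetical.

```python
import math

def pipeline_registers(path_delay_ps, clock_period_ps, margin_ps=0.0):
    """Intermediate registers needed so each segment of a long path fits in one cycle.

    margin_ps lumps together setup time and clock uncertainty (an assumption).
    """
    budget_ps = clock_period_ps - margin_ps
    if budget_ps <= 0:
        raise ValueError("margin leaves no time for the signal to travel")
    segments = math.ceil(path_delay_ps / budget_ps)
    return max(segments - 1, 0)

# Hypothetical cross-die path: 2.3 ns of wire and buffer delay, 1 GHz clock,
# 100 ps reserved per stage.
registers = pipeline_registers(path_delay_ps=2300, clock_period_ps=1000, margin_ps=100)
print(f"insert {registers} registers; the path gains {registers} cycles of latency")
```

Each inserted register costs a cycle of latency, so the same calculation also shows what the surrounding logic must tolerate.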
A quite different approach to longer connections—especially if they are wide paths—is to use a globally-asynchronous/locally-synchronous (GALS) network-on-chip (NoC). Such networks are readily available as packages that include network synthesis tools, libraries, and verification intellectual property (IP). A NoC renders your design all but invulnerable to on-die variations simply by decoupling the timing of all the blocks from each other (Figure 3). But of course you have to be sure that the latencies induced by all that asynchronous message-passing still leave your design able to meet its overall performance requirements. Message-based networks may not be the best way to handle some situations, such as fast streaming data flows.
One more point about deploying transistors to deal with process, thermal, and aging variations is becoming important. With significant variability, an ambitious SoC design decomposes into a large number of independent blocks, asynchronous to their neighbors. Each of these blocks has distinct system-management needs. Each may have a number of independent voltage domains that require periodic adjustment based on temperature or delay measurements and power-management strategies. Similarly, each block may have precision analog circuits or transceivers that require local recalibration. If the block has a large amount of memory, that memory may need periodic scrubbing to control soft errors. And some blocks may have security needs—authentication and encryption—not shared with, or not trusted to, the rest of the chip.
As a result, some architects are implementing local, block-level system-management processors. These embedded microcontroller units (MCUs) can handle the above issues, and special needs such as sequencing of power rails during power-up and shutdown.
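To make the rail-sequencing job concrete, here is a minimal Python sketch of how such a controller might derive a power-up order from rail dependencies and reverse it for shutdown. The rail names and dependencies are hypothetical, and a real MCU would also wait for each rail to settle and check power-good signals, which this sketch omits. It uses the standard-library `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical block-level rails: each rail maps to the rails that must
# already be up before it may be enabled.
RAIL_DEPENDENCIES = {
    "vdd_always_on": set(),
    "vdd_io": {"vdd_always_on"},
    "vdd_core": {"vdd_always_on"},
    "vdd_sram": {"vdd_core"},
}

def power_up_order(dependencies):
    """Order in which rails can be enabled so every prerequisite comes first."""
    return list(TopologicalSorter(dependencies).static_order())

def shutdown_order(dependencies):
    """Shut down in the reverse of power-up so no rail outlives its prerequisites."""
    return list(reversed(power_up_order(dependencies)))

up = power_up_order(RAIL_DEPENDENCIES)
down = shutdown_order(RAIL_DEPENDENCIES)
print("power-up: ", " -> ".join(up))
print("shutdown: ", " -> ".join(down))
```

A dependency graph is only one way to express sequencing rules; a fixed table of steps and delays would be an equally plausible choice for a small controller.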
Thus three factors—a rich transistor budget, increasingly problematic interconnect, and growing on-die process variations—are all working to divide the large SoC into autonomous functional blocks. These blocks will have their own timing, and may make their own decisions about operating frequency and voltage choices, either initially at power-up or during operation. And increasingly, these blocks will have their own internal system management and perhaps voltage regulation capabilities. This growing autonomy for functional blocks not only solves problems that are arising in today’s designs, but it prepares architects for a future in which the distinction between monolithic circuits and multidie circuits will begin to blur.