Is the System on Chip Coming Apart?

Arguably it is, in some circumstances. This disintegration can take any of several paths—some of which wind deep into the promising yet problematic technology of 2.5D packaging, while others lead back to the seemingly archaic landscape of separate chips on a board, albeit with very unarchaic interconnect technology. The only accurate map for these varied routes will spring from the architect’s skill at system partitioning.

At least that is the story told by a number of the plenary talks and technical paper sessions at this year’s International Solid-State Circuits Conference (ISSCC). This conference, which has for years charted the inexorable Moore’s-Law advance of integration, now appears, ironically, to be signaling an inflection point that could lead system design back toward discrete multi-chip implementations.

No Surprises

The driving forces behind disintegration of the SoC are familiar. Many have been discussed for years, either in the endless ASIC versus FPGA debates or in the burst of excitement and ensuing flood of hyperbole over 3D integration.

First, there are the economic arguments. Not only does each new process node get more expensive, but the cost is now growing faster than the density—so each new node is more expensive on a per-transistor basis. Bigger, faster, and cheaper ended at about 28 nm. As Marvell Technology Group chairman and CEO Sehat Sutardja put it in his ISSCC plenary talk, “It is time to stop ‘just because we can’ integration. Just a mask set is costing us $10 million now.” Given such costs, it is very tempting to increase the value of your next-generation system by something other than moving the whole SoC to a new process node and then shoe-horning in more features, driving up the cost while narrowing the market.

There are strong technical arguments as well. In their race to integrate, today’s SoCs have absorbed many structures that are not well-suited to the restrictive design world of 16 or 10 nm digital FinFET processes. Analog and RF functions are easy examples. It is not that you can’t do analog design with tiny, fixed-size, fixed-orientation FinFETs. In fact a fascinating plenary talk by revered emeritus professor Willy Sansen of Katholieke Universiteit Leuven showed that the key circuit techniques for analog design at 10 nm—maybe even at 5 nm—have existed for years. Essentially, Sansen explained, you start with the old familiar circuits, and then add structures to cancel the new problems—increased resistance and capacitance, new poles, growing noise, increasing device mismatches, and so on—brought about by the new process node. The problem isn’t that the old concept doesn’t work: it is that at each new node, every analog circuit becomes a new design.

Memory presents a rather different problem. Except for IBM’s uniquely successful use of embedded SOI DRAM as on-chip cache, no one seriously contemplates putting arrays of DRAM or NAND flash onto a leading-edge SoC, even in relatively small quantities. The processes are hopelessly incompatible.

Another issue is rate of change. Some parts of a system create new competitive opportunities at each new process node: CPU clusters, for example. Other portions, such as peripheral and bus controllers, may change little over a period of years. “Look at the PC,” Sutardja urged. “Intel changes the CPU design rapidly, but leaves the south bridge and many peripherals alone in older, low-cost processes.” Such partitioning could lower the cost of the system while giving the most critical blocks access to leading-edge speed-power points.

All of these arguments have been used in favor of multi-die modules—and especially to support stacked-die 2.5D and 3D modules employing through-silicon vias (TSVs). But these module technologies have been slow to take off.

3D Promises

In principle, multi-die modules offer the perfect solution to the growing challenges of SoC integration. The module is in a single package, and with stacked-die approaches can have a smaller footprint than an equivalent SoC. TSVs, silicon bridges (as proposed by Intel), or silicon interposers allow interconnect densities approaching those of the upper metal layers on-die, permitting quite wide paths between dice: for example, the 512-bit Wide I/O 2 interface specification for stacks of DRAM dice. And the short distances involved promise interconnect delay and energy figures approaching those of on-die interconnect.

At the same time, on-die interconnect is becoming less attractive. Vanishingly small wire cross sections, often taken up as much by barrier and seed material as by copper and further obstructed by grain boundaries, present discouragingly high resistivity. Equally tiny and constricted vias are at least as serious a problem. In response, timing tools pump buffers into critical paths, driving up area and power. Some experts estimate that moving a wide signal across a chip is now no more energy-efficient than taking it off the chip.

The most resource-rich companies serving the largest markets are pushing ahead with multi-die modules. In his lead-off plenary talk, Samsung Electronics president Kinam Kim said that his organization is in production on a stack of four DDR4 DRAM dice with TSV interconnect, and is evaluating a four-die design using the TSV-based High Bandwidth Memory (HBM) standard. Eschewing TSVs, Intel is qualifying a quite different 2.5D interconnect approach for 14 nm using small silicon bridges embedded face-up in the substrate of an otherwise-conventional flip-chip package.

But if anyone had been expecting widespread production of 3D, or even advanced 2.5D, modules by now, they would have been left gazing in disappointment across a field of smartphones still using package-on-package technology for their SoCs. Well-publicized issues, including the difficulty of getting known-good dice, the challenges of TSV design and fabrication, yield and reliability questions, and just simple cost have all stood in the way.

And then there is something else. While the path delays of 2.5D interconnect are in general significantly shorter than inter-package delays, they are not comparable to the delays in short on-die nets. Nor can you speed them up by sprinkling them with buffers. Consequently, these inter-die nets do not fit well into the framework of existing timing tools.

This in turn means that, rather than treating all the nets in the multi-die module as one big, flat design, designers often must explicitly partition the design among the dice, and give some careful thought to the inter-die interfaces. Increasingly, when designers and architects do this analysis, a surprising additional possibility appears.

Back to Board Level

“In going from the previous generation System z microprocessor to the new z13, we moved from multi-die modules to individual chips linked by high-speed serial lanes,” said IBM distinguished engineer James Warnock in his ISSCC technical paper presentation. The paper claims increased flexibility and modularity for the design using individually packaged chips, while continuing the System z’s pursuit of higher performance.

The interconnect scheme is actually a hierarchy of connections. A processor node comprises four chips—three multicore processor chips and one L4 cache/controller chip—interconnected by a mesh of 5 Gbps lanes called an X-Bus. There are two nodes in a drawer, with a bundle of 5 Gbps lanes—the S-Bus—linking the two cache/controller chips. A third mesh, using 6.4 Gbps differential lanes, interconnects the cache/controller chips in up to four drawers (Figure 1).

Figure 1. IBM’s System z13 uses one set of serial lanes to link the compute nodes in a drawer, and another set to link between drawers. The nodes themselves comprise four packages interconnected with yet a third type of serial link.


The System z designers’ decision might seem an exception, but Marvell’s Sutardja is thinking along almost exactly the same lines. He proposed, and his company is developing, an 8 Gtransfers/s multi-lane serial connection to daisy-chain the chips in his multi-package SoCs. Sutardja described the link as essentially a serial implementation of the AXI™ bus, using very compact short-range transceiver designs to minimize die overhead.

The theme was echoed in chip-level presentations from two other processor vendors: Oracle and Intel. Neither company is splitting its multicore SoC into separate chips—yet. But both are clearly depending upon fast serial lanes to link multiple CPU chips and other kinds of processors into the on-chip bus structure.

Oracle physical design director Penny Li described the next-generation SPARC server chip, the M7. Mammoth even in 20 nm, the chip packs in 32 processor cores, 64 MB of distributed L3 cache, and eight data-analytics accelerators; interestingly, the accelerators are grouped with the memory controllers rather than with the CPUs or caches (Figure 2).

Figure 2. Oracle’s SPARC M7 die includes three sets of multi-lane serial links: one for memory, one for coherent links to other processors, and one for general I/O.


For inter-chip communications, the M7 team provided three types of SERDES-based serial lanes, based on two different physical designs. There are short-range lanes for connection to memory, running at 12.8 Gbps and providing an aggregate 358 GBps bandwidth. And there are long-range lanes, using a more complex 10-tap decision-feedback equalizer receiver. These transceivers operate at up to 18.13 Gbps and are used both for general I/O and for a coherent inter-chip connection, linking the on-chip networks of different M7 chips.

One of the early users of serial inter-chip coherent links was Intel, with the QuickPath Interconnect (QPI) port on Xeon server processors. At ISSCC Intel reaffirmed its commitment to QPI and to off-chip accelerators, stating that the Xeon E5-2600 v3 CPU will include a sped-up implementation of QPI. The new port will employ 60 lanes at up to 9.6 Gtransfers/s.

Speed and Power

The 9.6 Gtransfers/s transceivers in the new 22 nm Xeon chip are nowhere near the end of the road for Intel. In a separate paper, designers described a 40 Gbps transmitter implemented in Intel’s 14 nm Tri-Gate technology. Anticipating a shift in modulation schemes that is already taking root in Ethernet circles, the transmitter can operate in either NRZ mode or in the increasingly attractive PAM4 mode.

Looking at the energy consumption of these transceiver links reveals another important issue. Typically, the papers report figures on the order of 10 pJ/bit. That becomes particularly interesting when compared with estimates for on-chip interconnect, which typically run around 0.2 to 0.5 pJ/bit-mm. Getting halfway across a 20 mm die, then, would consume up to perhaps 5 pJ/bit, and that is for an unbuffered path. Add in rapidly increasing wire resistivity and the need for lots of buffers, and the estimates get far more pessimistic. The bottom line is that on-chip busses and chip-to-chip serial links can have nearly comparable energy efficiency. And the trend in advanced process nodes favors the off-chip connections.
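
To make the arithmetic concrete, here is a minimal sketch in Python of the comparison above. The constants are the rough estimates quoted in this article (0.2 to 0.5 pJ/bit-mm on-die, about 10 pJ/bit for a transceiver link), not measured values, and the names are invented for illustration.

```python
# Back-of-the-envelope comparison of on-die vs. chip-to-chip
# interconnect energy, using the rough figures quoted above.

ON_DIE_PJ_PER_BIT_MM = (0.2, 0.5)  # estimated range for on-die wires
SERDES_PJ_PER_BIT = 10.0           # typical figure reported in the papers

def on_die_energy(distance_mm):
    """Energy range (pJ/bit) to move one bit a given distance on-die."""
    lo, hi = ON_DIE_PJ_PER_BIT_MM
    return lo * distance_mm, hi * distance_mm

# Halfway across a 20 mm die:
lo, hi = on_die_energy(10.0)
print(f"on-die, 10 mm path: {lo:.1f} to {hi:.1f} pJ/bit")  # 2.0 to 5.0
print(f"SERDES link:        {SERDES_PJ_PER_BIT:.1f} pJ/bit")
```

And the on-die figure assumes an unbuffered wire; once the timing tools start inserting repeaters, the gap narrows further.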

There is an obvious objection. An on-chip bus will dissipate only static buffer power when it is idle. But SERDES links are normally kept active to maintain the phase lock between transmitters and receivers. Except when the links are in continuous use—such as when streaming network traffic or high-definition video—SERDES-based links would appear to have a built-in energy disadvantage.

This has occurred to designers. A paper from the University of Illinois at Urbana-Champaign and Intel described a 7 Gbps serial transceiver designed for burst-mode operation. Burst-mode use dictated a near-zero-power standby mode, plus the ability to start and stabilize the transmitter and to lock in the receiver very quickly and with very little overhead energy.

The designers reported a 20 ns power-on latency, an operating energy consumption of 9 pJ/bit—typical of transceivers in this speed range—and a power-up energy cost roughly equivalent to the energy required to transfer 100 data bits. This should make the design quite energy-efficient for message-passing or cache-line-oriented transactions between chips, even at low duty cycles.
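
Taking the reported numbers at face value, the power-up cost amortizes quickly. The sketch below assumes the 9 pJ/bit operating figure and a start-up cost equal to 100 bit-times of energy (about 900 pJ), both taken from the paper as reported here; the burst sizes are arbitrary examples.

```python
# Amortized energy per bit for a burst-mode link that pays a fixed
# power-up cost before each burst, using the reported figures.

OPERATING_PJ_PER_BIT = 9.0               # reported operating energy
POWERUP_PJ = 100 * OPERATING_PJ_PER_BIT  # ~100 bit-times, about 900 pJ

def effective_pj_per_bit(burst_bits):
    """Total energy per bit once the power-up cost is spread over a burst."""
    return OPERATING_PJ_PER_BIT + POWERUP_PJ / burst_bits

for bits in (100, 512, 4096, 65536):
    print(f"{bits:6d}-bit burst: {effective_pj_per_bit(bits):5.2f} pJ/bit")
# 100 -> 18.00, 512 -> 10.76, 4096 -> 9.22, 65536 -> 9.01
```

Even a single 512-bit cache-line transfer pays only about 20 percent overhead, which is why the design suits message-passing traffic despite its bursty nature.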

Partitioning Is Key

The idea of implementing a system in multiple chips connected by high-speed serial lanes is establishing itself as an alternative to both single-die integration and multi-die modules. This new range of choices places physical partitioning at the center of architects’ concerns. At least three variables stand out.

First, but perhaps least, is energy. At 5-10 pJ/bit, inter-chip transceiver links consume more energy than short, unbuffered on-chip interconnect. But realistic figures for moving data across a large SoC are already approaching this level. A second key variable is duty cycle. Traditionally, serial lanes have been used for streaming connections, but the development of fast-locking burst-mode transceivers now makes message-based and bursty links viable on energy grounds as well.

Bandwidth is also a question. Big-system links like those in IBM’s System z13 claim 1 Tbps or more of aggregate bandwidth, but they devote many pins to such connections. In smaller systems, connection bandwidths may be more like 10-20 GBps. And then there is latency. An inter-chip serial link will impose, at the very least, packet-formation and SERDES latencies. Architects must either partition on boundaries that are not latency-sensitive, or use measures such as multithreading to reduce the impact of link latency on system performance: the same considerations that apply, to a lesser extent, even to on-chip cache design.
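
These variables lend themselves to a first-pass screening model. The sketch below is purely illustrative: the classes, thresholds, and candidate numbers are all invented for this example, loosely echoing the figures in the article, and the duty-cycle question is handled by the burst-mode arithmetic shown earlier. A real partitioning study would of course go much deeper.

```python
from dataclasses import dataclass

@dataclass
class Boundary:
    """Traffic requirements at a proposed partition boundary."""
    bandwidth_gbytes_s: float   # sustained bandwidth demand
    latency_budget_ns: float    # added latency the architecture can hide

@dataclass
class Link:
    """A candidate interconnect style (all numbers illustrative)."""
    name: str
    bandwidth_gbytes_s: float
    latency_ns: float           # packet-formation plus SERDES, if any
    pj_per_bit: float

def screen(boundary, links):
    """Flag which link styles meet the bandwidth and latency budgets."""
    for link in links:
        ok = (link.bandwidth_gbytes_s >= boundary.bandwidth_gbytes_s
              and link.latency_ns <= boundary.latency_budget_ns)
        print(f"{link.name:22s} {'fits' if ok else 'fails'} "
              f"({link.pj_per_bit:.0f} pJ/bit)")

# A hypothetical boundary between a CPU cluster and an accelerator:
boundary = Boundary(bandwidth_gbytes_s=15.0, latency_budget_ns=40.0)
screen(boundary, [
    Link("on-die bus",          100.0,  2.0, 3.0),
    Link("2.5D module",          50.0,  5.0, 1.0),
    Link("serial chip-to-chip",  20.0, 30.0, 9.0),
])
```

Energy per bit then becomes the tiebreaker among the options that survive the bandwidth and latency screens.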

The system architect thus has a wider range of partitioning choices than perhaps ever before, and lower costs for moving a function onto a separate die or into a separate package. Once the energy budgets, latency, and bandwidth requirements have been established at the partitioning boundaries, the architect can plan the best blend of on-die integration, with potentially massive parallelism and the greatest speed for nearby connections; on-module integration, with potentially huge numbers of relatively fast inter-die connections; and separate chips, linked by high-speed serial lanes.

Finally, a speculation is in order. The arguments in favor of inter-chip serial links apply just as well inside multi-die modules, where interconnect may be abundant but is far from free. And as shrinking geometries drive up the impedance of on-die interconnect at the lower metal layers, forcing more signal paths up onto the wider but less numerous routes of the upper layers, serial links may become attractive even for long intra-chip connections. The die area for the transceivers may be easier to find than the parallel routing resources. The old ways are indeed changing.


AUTHOR: Ron Wilson

4 comments to “Is the System on Chip Coming Apart?”

  1. What no one is facing is that part of our problem is bad software design. “C” uses a stack and a half, whereas there are significant gains to be made from two-stack software. At the bleeding edge a gain of 10X is nothing to be sniffed at, especially given that you get that gain by just changing the way you think (and some very minor hardware changes).

  2. Please elaborate on “stack and a half” and “two stack software”. These terms are not familiar to me. Thank you.

  3. This two-stack software “C”??? and starting/stopping burst mode?? sounds like colorForth on the GA144 developed by Chuck Moore and friends…

  4. Both HW and SW need to get away from making things out of such small pieces. SW uses RISC ISAs, which cause heavy memory traffic, while HW builds registers from individual flip-flops, with at least one LUT/gate per data bit, and then runs clocks and wires all over the chip.

    Design needs to focus on the use of memory blocks, which are much denser than registers, for data storage and stacks.

    Memory blocks can also be used for microcode to reduce random-logic wiring, as IBM has done for 50 years.

    Another thing is that SW source code can be parsed and microcode generated to load into the memory. Variables and execution stack can share a memory and the call/return stack can be in a separate memory. (see MSimons comment.)

    There is no need to impose C limitations or CPU ISAs on the HW. After all a program is just another way to define an FSM. HW synthesis is very efficient for FSM implementation.
