System on chip means putting everything you can on one die. Only a lack of technology, major process incompatibility, or physically running out of real estate has seemed a valid excuse for taking a multi-die approach to integration. But that thinking is changing.
Today new options, including lower-cost multi-die packaging, novel uses of high-speed serial transceivers, and even non-electrical interconnect are opening new possibilities for partitioning system cores across multiple dice. Architects can contemplate ideas that bandwidth limitations or power budgets would have precluded before. This means new combinations of performance, efficiency, and compactness well beyond today’s state of the market.
Begin with Partitioning
Any discussion of dispersing a system across multiple dice must begin with partitioning. The bandwidth and latency demands of links between subsystems will determine the alternatives at your disposal.
It is important to distinguish between those two parameters. In wide, synchronous buses, bandwidth and latency tend to be directly linked: the faster the bus, the lower the latency and the higher the bandwidth. But streaming interfaces or high-speed serial links may accept more latency to get higher bandwidth. So it is important to understand what is going on in each link between subsystems (Figure 1). Is the connection latency-critical, bandwidth-constrained, or both? In general, you will want to partition your system so as to place the fewest possible constraints on the connections between blocks. That will give you the most freedom in floorplanning an SoC or FPGA design, or in subdividing the system among multiple dice.
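One way to make this concrete is to tag each inter-block link with its bandwidth demand and latency sensitivity, then score candidate partitions by what they cut. The sketch below does exactly that; the block names, link figures, and the two candidate partitions are illustrative assumptions, not from the article.

```python
# Sketch: score a candidate two-die partition by the links it cuts.
# Block names and link figures are hypothetical.

# (src, dst, bandwidth_gbps, latency_critical)
links = [
    ("cpu", "dram_ctl", 100.0, True),
    ("cpu", "dsp",       10.0, False),
    ("dsp", "dram_ctl",  40.0, False),
    ("cpu", "io",         1.0, False),
]

def score_partition(die_a, links):
    """Return (total cut bandwidth in Gbps, number of latency-critical cut links)."""
    cut_bw = 0.0
    critical_cuts = 0
    for src, dst, bw, critical in links:
        if (src in die_a) != (dst in die_a):   # link crosses the die boundary
            cut_bw += bw
            critical_cuts += critical
    return cut_bw, critical_cuts

# Keeping the latency-critical CPU-to-memory path on one die...
print(score_partition({"cpu", "dram_ctl"}, links))   # -> (51.0, 0)
# ...versus a partition that cuts it.
print(score_partition({"cpu", "dsp"}, links))        # -> (141.0, 1)
```

A real flow would weigh the two scores differently: a cut latency-critical link may rule a partition out entirely, while cut bandwidth merely raises the packaging cost.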
Since some of the most efficient interconnect techniques—both on-chip and between chips—have long initial latencies but high bandwidths, it is very valuable to make blocks as latency-tolerant as possible. Clearly there are applications that simply cannot accept added delays. A little more latency inside a control loop, for example, can shift a system from critically damped to unstable. In such cases you may have no choice but to integrate all the blocks in the loop, or to spend the money and power for wide parallel inter-die connections.
But there are also systems in which latency is less an issue, yet throughput is important. Some such systems process long streams of data: signal processing and some image processing are examples. Often this sort of computing can be implemented in a pipelined architecture that is relatively immune to predictable delays. In most pipelines, latency for interconnect only impacts the delay between input and output, not the bandwidth of the pipeline.
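The arithmetic behind that claim is simple: in an ideal pipeline, each added stage of delay shifts the first result later but does not slow the steady-state rate. A minimal sketch, with illustrative figures:

```python
# Sketch: for an ideal D-stage pipeline, added latency (more stages) delays
# the first output but leaves steady-state throughput essentially unchanged.

def pipeline_cycles(n_items, depth):
    """Cycles to push n_items through an ideal pipeline of the given depth."""
    return depth + n_items - 1

for depth in (4, 16):   # deeper pipeline stands in for longer interconnect latency
    cycles = pipeline_cycles(10_000, depth)
    print(depth, cycles, 10_000 / cycles)   # throughput stays near 1 item/cycle
```

For 10,000 items, quadrupling the pipeline depth from 4 to 16 costs only 12 extra cycles out of roughly 10,000, which is why streaming workloads tolerate long interconnect paths so well.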
In some other situations the algorithm cannot be easily pipelined, but it can be decomposed into a large number of threads. If you have enough threads ready to execute, you can cover impressively long and even unpredictable latencies by simply picking up another thread as soon as the current one stalls. Hardware support for multi-threading, which is often available to a limited degree on modern CPU cores and is astonishingly deep on GPUs, limits the overhead of thread switching. So while the delay between any given input and its resulting change in output may be longer and even unpredictable, the overall throughput of the system will be high and nearly independent of the internal latencies.
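The back-of-envelope version of this latency-hiding argument: if each thread computes for some number of cycles and then stalls waiting on a remote access, you need roughly one more thread than the ratio of stall time to compute time to keep the execution unit busy. The numbers below are illustrative.

```python
import math

# Sketch: threads needed to cover a stall of `stall_cycles` when each thread
# does `work_cycles` of useful computation before stalling again.
# Assumes ideal, zero-overhead thread switching; figures are hypothetical.

def threads_to_hide(stall_cycles, work_cycles):
    return 1 + math.ceil(stall_cycles / work_cycles)

print(threads_to_hide(stall_cycles=400, work_cycles=50))   # -> 9
```

This is why GPUs keep so many threads resident per execution unit: the deeper the pool of ready threads, the longer and more unpredictable the latencies they can absorb without idling.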
However you go about it, tolerating added delay opens the door to repartitioning the system to compensate for long interconnect paths, to employing a globally asynchronous, locally synchronous (GALS) network-on-chip, or to moving some blocks onto a separate die. For the rest of this article, we will focus on options for linking to a separate die in a multi-die system.
Intuitively, the best way to maximize bandwidth and minimize latency between dice is to keep them close together. Hence the enthusiasm for 2.5D and 3D packaging. Traditionally associated with high costs and increased reliability issues, these approaches have gone through a renaissance, extending their reach from high-end military systems to mainstream and even low-cost uses.
Perhaps the most discussed 2.5D/3D approach, through-silicon vias (TSVs), remains firmly anchored at the high end. TSVs, as the name implies, are connections that pass all the way through the die, taking signals or power from the interconnect stack on the top to microbumps on the backside. You form a TSV by etching a very deep, narrow hole into the wafer, depositing a liner material onto the walls of the hole, filling the rest of the hole with a conductive via-fill material such as tungsten, and then grinding away the back of the wafer until you have exposed the bottom of the via fill. Each of these steps, not to mention handling of the resulting ultra-thin wafer, has proved quite challenging in practice.
There are two high-profile uses of TSVs in production today, neither of which is in particularly high volumes. One is TSMC’s chip-on-wafer-on-substrate (CoWoS) process. CoWoS doesn’t actually put TSVs through active IC wafers. Rather, it mounts the active dice face-down on a silicon interposer. The interposer uses TSVs to get connections from its top face, where the dice are, to its bottom face, where the package bumps are.
The other, much more ambitious use is in DRAM stacks. Both the Hybrid Memory Cube (HMC) and the High-Bandwidth Memory (HBM) designs use TSVs in the DRAM dice to pass signals vertically from die to die in the stack. As you can imagine, putting TSVs through a dense, active die, with all of the associated layout, lithography, and strain-engineering issues, is far from a trivial undertaking.
But the rewards are great. TSVs allow huge numbers of connections between stacked dice, at much lower inductance than bonding wires would have. For example, HBM claims to support over 100 gigabytes per second (GBps) data rates between the host die and the stacked DRAMs. That is about four times the bandwidth proposed for the conventional single-die GDDR5 DRAM. By greatly increasing the density of connections between dice, and by substantially reducing the connection inductance compared to bonding wires, TSVs are able to provide both high bandwidth and relatively low latency between dice, at the cost of significantly more complex design and manufacturing.
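The roughly four-fold figure follows directly from nominal interface widths and per-pin rates. Taking first-generation HBM as a 1024-bit interface at about 1 Gbps per pin, and GDDR5 as a 32-bit device at up to 8 Gbps per pin (nominal published figures, used here only for the arithmetic):

```python
# Sketch of the bandwidth arithmetic behind the "about four times" claim.
# Interface widths and per-pin rates are nominal published figures.

def bandwidth_gbyte_s(width_bits, gbps_per_pin):
    """Peak bandwidth in gigabytes per second for a parallel interface."""
    return width_bits * gbps_per_pin / 8

hbm   = bandwidth_gbyte_s(1024, 1.0)   # first-generation HBM stack
gddr5 = bandwidth_gbyte_s(32, 8.0)     # one GDDR5 device
print(hbm, gddr5, hbm / gddr5)         # -> 128.0 32.0 4.0
```

The point of the comparison is that HBM gets its bandwidth from width rather than per-pin speed, and that width is exactly what TSVs make affordable.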
Other designers are working to achieve the high density and low impedance of TSVs without the process complexity and yield issues. One such effort is the embedded multi-die interconnect bridge (EMIB) at Intel Custom Foundry. EMIB is, like CoWoS, a 2.5D assembly, with dice mounted face-down, flip-chip fashion, on a substrate. But instead of a silicon interposer with TSVs, the substrate for EMIB is a normal packaging substrate with bumps on the top to connect to the dice, and package balls on the bottom.
The standard substrate has significant interconnect capability, but its metal traces are set at a wide pitch to match the I/O pads on a die. They are not dense enough for the kind of high-density inter-die connections we seek. So Intel takes another step—the bridge part. EMIB embeds little rectangles of silicon into the top face of the substrate. The rectangles are positioned so that when the dice are placed, the ends of each bridge lie beneath the edges of adjacent dice. The bridges carry an array of microbumps on each end, connected by ordinary on-silicon interconnect, so they provide a very dense, short electrical path between adjacent dice.
If you can get by with the wider-pitch interconnect of a normal substrate, there are potentially cheaper multi-die approaches. One is TSMC’s integrated fan-out (InFO) packaging. InFO was originally designed for wafer-scale production of conventional fan-out packages, and TSMC plans to extend it to multi-die wafer-level packaging. In this process, you fasten your various dice face-up in their intended positions on a synthetic wafer. Then you build up a multi-layer substrate over the wafer, forming interconnections between the dice, using multiple layers if necessary. In the top layer of the build-up you place the package solder balls. Then you separate the individual multi-die assemblies from each other and from the synthetic-wafer backing, passivate the whole thing, and you have a 2.5D module with the efficiencies of wafer-scale packaging (Figure 2).
One important consideration for any of these technologies is the design flow. The inter-die connections become part of the system, and so you can’t just design the dice in isolation. Accurate delay and power modeling—and in some cases thermal, mechanical, and electromagnetic modeling—are vital, and should be done in concert with chip design. For example, at its recent Ecosystem Forum TSMC strongly endorsed chip-package co-design even for the comparatively simple InFO packages.
The Serial Alternative
Comparing CoWoS to InFO, it is obvious that the fewer connections you need between dice, the less you are going to have to spend on packaging and analysis. So the ability of high-speed serial transceivers to achieve very high data rates over a very few wires—albeit with some added latency—becomes a very important tool. Using serial transceivers you can get data rates of 28 gigabits per second (Gbps) today over a single pair of conductors. And far greater speeds are on the way.
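To see how much wiring serial links save, consider the lane count needed to match a wide parallel interface. The target bandwidth and the 64b/66b line coding below are illustrative assumptions, chosen only to show the shape of the calculation:

```python
import math

# Sketch: serial lanes (differential pairs) needed for a given payload
# bandwidth. The 100 GB/s target and 64b/66b coding are assumptions.

def lanes_needed(payload_gbyte_s, lane_gbps, coding_efficiency=64/66):
    payload_gbps = payload_gbyte_s * 8
    return math.ceil(payload_gbps / (lane_gbps * coding_efficiency))

print(lanes_needed(100, 28))   # 100 GB/s over 28 Gbps lanes -> 30
print(lanes_needed(100, 56))   # roughly halved at 56 Gbps   -> 15
```

Thirty differential pairs (sixty conductors) versus the thousand-plus wires of a wide parallel interface is the trade on offer: far cheaper packaging, paid for with serialization latency and transceiver power.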
“By next year I think we will see a production system using 56 Gbps,” said Lee Ritchey, founder of board-design house Speeding Edge, at a recent Semico Impact panel discussion. “By then 28 Gbps will be common.”
Scott McMorrow, R&D consultant at Teraspeed, said even 56 Gbps may not be the practical limit. “Modeling says we can go about ten inches with 56 Gbps before it falls apart,” McMorrow told the panel. “But on paper we can get 110 Gbps out of a conventional IC package.”
Distance and electrical complexity make a huge difference for serial links. “New speeds show up in the easiest places first,” explained Mentor Graphics principal engineer Daniel de Araujo to the panel. “We’ll see 56 Gbps first in chip-to-module connections, then later on a board. Crossing connectors and backplanes will take longer.”
Modest-speed serial links have already been in use inside 2.5D modules for some time. But given the potential for very clean and short interconnect runs in a 2.5D module, the speeds have probably been limited by the transceivers and by power considerations rather than by the channel.
Once you leave the sanctuary of the multi-die module, high data rates become increasingly difficult. “The problem is not so much loss as it is skew and crosstalk,” Ritchey said. McMorrow quickly agreed. “Variations in the circuit-board material and anisotropy keep you from getting skew below 4 ps,” he explained. “Then if you let the skew get just a little larger, things stop working.”
Some researchers are attacking this limitation by working on new board materials that are highly uniform and isotropic, such as polytetrafluoroethylene (PTFE). But the industry is neither tooled nor experienced for working with such materials today.
Another possibility is to, as it were, transcend the circuit board and its limitations. McMorrow pointed out that existing build-up technology can make a winged package with a layer of electrical or optical connectors up above the level of the circuit board. This would allow high-speed serial channels to live in their own controlled environment, off the board.
Wings also expedite McMorrow’s favorite solution: twin-ax copper cable. He claims that existing cable offers three to four times the performance of circuit-board traces. And cabling of critical signals is, he says, actually less expensive than switching to special high-cost circuit-boards.
In principle, optical interconnect is an even better option. But electro-optical transducer modules are too big for intra-board use, and there is no manufacturing infrastructure to support boards with optical links. The application needs optical components integrated into ICs: silicon photonics. In fact, we already know how to fabricate optical modulators, add-drop multiplexers, splitters, waveguides, and detectors on a silicon chip. The problem has been the light source, which has had to be off-die.
But work by Yasuhiko Arakawa and associates at the University of Tokyo suggests a solution may be on the way. These researchers use semiconductor quantum dots on the surface of a die. In effect, each dot confines a single electron so that it cannot move. Without freedom of motion, the electron has essentially no thermal energy and can only give or take energy by changing its quantum state.
A 2D array of such dots forms a solid-state laser with excellent spectral purity and very low temperature dependence—perfect for optical communications. Such lasers would make possible a complete multi-channel optical transceiver on a die for use in those winged packages.
There is still the mechanical problem of aligning and attaching the optical fibers to the die. Today’s solutions take space and tricky assembly techniques. And true to form, this essentially mechanical problem is proving less tractable than the electronic and optical issues. But a solution may be in the air.
Keyssa Inc., a well-funded start-up, is using 60 GHz radio transceivers over near-contact distances to transfer data. The company claims 6 Gbps data rate with negligible EMI. While Keyssa’s current focus is on replacing board-edge connectors and inter-device cables, the technology suggests another possibility. Imagine a circuit board above which hovers a network of short, point-to-point microwave links. The links emanate from molded antennas in the tops of winged packages.
Whatever the best solution turns out to be, it is clear that inter-die connections and multi-die packaging are opening new options for SoC partitioning. Careful partitioning—a good idea anyway for SoC floorplanning and interconnect planning—is becoming a vital step in future-proofing your designs, not just for ultimate performance but for cost and power as well.