Under the enormous pressures and temperatures far beneath the Earth’s surface, minerals take on forms impossible to replicate in our familiar environment. Thus the same carbon we know as graphite may become diamond.
Similarly, in the intense, dynamic environment of the data center, familiar Ethernet is taking on configurations unimaginable to its inventors at Xerox PARC. As with carbon, the results will take many forms that may bear little resemblance to one another. But unlike minerals, which take a form and remain in it, each Ethernet adapter in the crucible of the data center will have to change continually to support the shifting application environment. It is no longer enough to just get a bigger pipe (Figure 1).
Let’s start with the pressures—and temperatures. A number of external forces are warping the data center far beyond the simple shape of a big room full of servers. Some of these forces come from applications, some from business models, and one from physics.
Among the applications, the most widely publicized has been big-data analysis. The idea of passing massive data sets through statistical, classification, and even neural-network analyses has grown from a tool for analyzing survey data into the panacea of the year. Increasingly, big-data tools are moving from background jobs to real-time filters—in finance, security, and fault prediction, for example—and they are making huge demands on the bandwidth of data-center networks, pushing for 400G and more in some places.
Virtualization in its many forms—centralized radio access networks (C-RAN), network functions virtualization, flash trading, even some concepts of the Internet of Things—pulls in a different direction. These applications press for zero network latency, service-level agreements, and special features like precision timing across the network.
But data center operators are moving in yet another direction, in accord with their business imperatives. They want to offer cloud centers where all kinds of applications mix freely, relocating to best use whichever servers and storage channels are available. Cloud operators need strictly uniform hardware, software-configured infrastructure, and—whether they admit it or not—powerful hardware-based security.
And then there is physics. To increase throughput, data centers scale out: they add a few thousand more racks of servers. But scaling out quickly creates a power problem—getting power into the racks, getting heat out, even getting enough power from the grid. There is an economic side to power as well. Over the short life of a data center the power cost will outweigh the capital investment. So the heat is on, so to speak, to minimize power consumption everywhere, including in the Ethernet interfaces.
Bandwidth, latency, features, security, uniformity, configurability, efficiency—these are all reasonable requirements. The problem, as we are about to see, is that they are mostly mutually exclusive at a hardware level.
To explore that statement we have to look inside a 10G (or faster) Ethernet interface. So let’s do a block-by-block tour (Figure 2).
We can start at the point where a message from an application gets separated into Ethernet length-compliant blocks of data. These, along with the destination address and perhaps some control information, get passed to a function in the Ethernet interface called the media access controller (MAC). Note that this transfer is after Transmission Control Protocol and Internet Protocol (TCP/IP) or User Datagram Protocol (UDP) have shaped the message to their own requirements, and well after features like Precision Time Protocol have created their own special messages. The MAC just picks up the data blocks, whatever they might contain, and moves them along.
Specifically, the MAC’s job is to attach a preamble to each block, to add a header including the address of another MAC somewhere in the Ether—a unique hardware identifier, not the same as an Internet Protocol address—and a Type code. Then the MAC calculates a cyclic redundancy check (CRC) code and appends it as well. Obviously we are talking about the transmit direction. The description is reversed for the receive side.
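In code, the transmit path’s framing step looks roughly like this minimal Python sketch (`build_frame` is an illustrative name, not anything from a real MAC; the CRC-32 polynomial in Python’s `zlib` is the same one Ethernet specifies for the frame check sequence):

```python
import zlib

def build_frame(dst: bytes, src: bytes, ethertype: int, payload: bytes) -> bytes:
    """Assemble an Ethernet frame roughly as a MAC's transmit path would."""
    # Pad the payload to the 46-byte minimum so the frame reaches 64 bytes.
    payload = payload.ljust(46, b"\x00")
    header = dst + src + ethertype.to_bytes(2, "big")
    # The FCS is an IEEE CRC-32 over everything except the preamble/SFD,
    # appended least-significant byte first.
    fcs = zlib.crc32(header + payload).to_bytes(4, "little")
    preamble = b"\x55" * 7 + b"\xd5"   # 7 preamble bytes + start-frame delimiter
    return preamble + header + payload + fcs

frame = build_frame(b"\xff" * 6, b"\x02\x00\x00\x00\x00\x01", 0x0800, b"hello")
print(len(frame))  # 8 preamble + 14 header + 46 payload + 4 FCS = 72 bytes
```

The hardware does the same bookkeeping, but at line rate and with none of the luxury of buffering implied by this sequential sketch.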
This sounds simple enough, until we start discussing speed. In fact you can do all the MAC functions in software if you aren’t in a hurry. But at 10 Gbps, short packets can arrive every 100 ns or so. Either you process them at this rate, or you incur additional latency and power to buffer them. At 25G or 50G per lane, the time goes down proportionately. Faster Ethernet links are implemented as multiple lanes: ten 10G lanes for 100GE, for example. So the load on the MAC keeps going up.
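That 100 ns figure is simple arithmetic: a minimum-size frame plus its preamble and minimum inter-frame gap occupies 672 bit times on the wire. A back-of-envelope sketch in Python:

```python
LINK_GBPS = 10
MIN_FRAME = 64   # bytes: the smallest legal Ethernet frame
PREAMBLE = 8     # bytes: preamble plus start-frame delimiter
IFG = 12         # bytes: minimum inter-frame gap

bits_on_wire = (MIN_FRAME + PREAMBLE + IFG) * 8   # 672 bits per minimum packet
ns_per_packet = bits_on_wire / LINK_GBPS          # bits / Gbps gives nanoseconds
print(ns_per_packet)                  # 67.2 ns between minimum-size packets
print(round(1e9 / ns_per_packet))     # ~14.88 million packets per second
```

At 25G per lane the budget shrinks to about 27 ns, which is why software-only MAC processing stops being an option.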
One aspect of security further complicates the MAC layer. MACsec (IEEE 802.1AE) is a standard that encrypts the frame payload and authenticates the entire frame, including its headers. It is particularly useful in situations—like data centers—where man-in-the-middle attacks are a major threat. The algorithm requires a number of multiply and look-up operations that make it quite computation-intensive, mandating hardware acceleration and adding to latency.
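To see why, consider GHASH, the authentication half of the AES-GCM cipher that MACsec uses: it multiplies 128-bit blocks in a Galois field. A bit-serial Python sketch, following the algorithm described in NIST SP 800-38D (`ghash_mul` is an illustrative name; hardware evaluates these 128 shift-and-reduce steps in parallel):

```python
R = 0xE1 << 120  # reduction constant for GF(2^128), x^128 + x^7 + x^2 + x + 1

def ghash_mul(x: int, y: int) -> int:
    """Multiply two 128-bit blocks in GF(2^128), GHASH bit ordering."""
    z, v = 0, y
    for i in range(127, -1, -1):          # walk x from its most significant bit
        if (x >> i) & 1:
            z ^= v                        # conditionally accumulate
        v = (v >> 1) ^ R if v & 1 else v >> 1   # shift and reduce modulo the polynomial
    return z

# One such multiply per 16-byte block: trivial for dedicated gates,
# painful in a software loop at tens of gigabits per second.
print(ghash_mul(1 << 127, 0x1234) == 0x1234)  # True: 1<<127 is the field's identity
```

The point of the sketch is the operation count, not the cryptography: every 16 bytes of traffic costs a full field multiplication, which is exactly the kind of work that belongs in hardware.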
From the MAC, fully-formed Ethernet packets move on to the Physical Layer, or PHY. At multi-Gigabit speeds, the PHY is divided into two blocks: the physical coding sublayer (PCS) and the physical medium attachment (PMA).
The PCS prepares the frames for efficient transmission. “A 64b/66b encoder and scrambler ensure the bit stream will likely have lots of transitions, good DC balance, and space for some special characters,” explains Intel Programmable Solutions Group Ethernet Protocol Lead Nigel Gulstone. “Finally, if the physical interface uses multiple lanes, the PCS splits the blocks of encoded, scrambled bits up among the lanes.”
These functions are followed by a gearbox that reshapes the still-parallel blocks between the PCS’s 66b width and whatever data width is used in the PMA. As with the MAC circuitry, the receive side is essentially the reverse of the transmit side.
There are no complex or mysterious algorithms in the PCS as we have described it. The 64b/66b encoder is straightforward, and the scrambler uses a polynomial such as 1 + x³⁹ + x⁵⁸, a relatively easy, although latency-inducing, bit of hardware. Once again the challenge is speed. At 100G, 64-bit words pass through the PCS about once every 640 ps.
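The scrambler’s hardware simplicity is visible in a few lines. Here is a bit-serial Python sketch of the self-synchronizing scrambler defined by 1 + x³⁹ + x⁵⁸ (illustrative only; real PCS hardware computes 64 of these bits per clock in parallel):

```python
MASK = (1 << 58) - 1   # 58-bit shift register, set by the polynomial's highest tap

def scramble(bits, state):
    """Self-synchronizing scrambler: output = input XOR taps at delays 39 and 58."""
    out = []
    for b in bits:
        s = b ^ ((state >> 38) & 1) ^ ((state >> 57) & 1)
        out.append(s)
        state = ((state << 1) | s) & MASK   # the scrambled bit feeds the register
    return out, state

def descramble(bits, state):
    """Mirror image: the received bit feeds the register, so it self-synchronizes."""
    out = []
    for s in bits:
        out.append(s ^ ((state >> 38) & 1) ^ ((state >> 57) & 1))
        state = ((state << 1) | s) & MASK
    return out, state

data = [1, 0, 1, 1, 0, 0, 1, 0]
seed = 0x155555555555555              # arbitrary nonzero register state
scrambled, _ = scramble(data, seed)
recovered, _ = descramble(scrambled, seed)
print(recovered == data)              # True: descrambling inverts scrambling
```

Because the descrambler is fed by the received bits rather than its own output, it locks onto the transmitter’s state automatically, which is what makes the scheme practical on a link with no side channel.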
But now things start to get difficult. When the speed per lane moves from 10G to 25G, the next stage in the PHY, the PMA, gradually falls behind in its ability to recover correct data bits from the incoming waveform. Achieving an acceptable bit error rate—10⁻¹⁵ or better in many applications—requires forward error correction (FEC). And FEC gets implemented either as part of the PCS, at 400G, or between the PCS and PMA in lower-speed standards.
“Fire code and two different Reed-Solomon codes are possible alternatives for the BASE-R standards used in the data center,” Gulstone says. The FEC encoder on the transmit side gathers up a large number of 66-bit blocks and encodes the entire bunch, embedding error-correcting bits. On the receive side, the FEC decoder gathers up a block of incoming bits and decodes it, in the process correcting any isolated errors that might have crept in. While these codes can correct up to 70 bit errors in a 5,280-bit block, bursts of incorrect bits, such as might come from some circuits in the PMA, can eventually cause loss of data.
“FEC requires a fair amount of calculation and fast buffers big enough to hold an entire block,” Gulstone explains. Thus it adds to the energy consumption and latency of the Ethernet interface. But it is the only way, currently, to achieve adequate bit error rates at these multi-gigabit speeds over the interconnect channels found in data centers.
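The 70-bit and 5,280-bit figures fall straight out of the code parameters. Assuming the RS(528, 514) Reed-Solomon code over 10-bit symbols (the “Clause 91” FEC used by several BASE-R variants), a few lines of arithmetic reproduce them; this is a sketch of the parameters, not an FEC implementation:

```python
# Reed-Solomon RS(528, 514) over 10-bit symbols, as in IEEE 802.3 Clause 91.
n, k, m = 528, 514, 10      # codeword symbols, data symbols, bits per symbol

parity = n - k              # 14 parity symbols per codeword
t = parity // 2             # a Reed-Solomon code corrects (n - k) / 2 symbols
print(t)                    # 7 correctable symbols
print(n * m)                # 5280 bits on the wire per codeword
print(t * m)                # up to 70 bit errors, when they all land in 7 symbols
```

Note the caveat in that last line: the code corrects seven *symbols*, so 70 corrupted bits are recoverable only if they cluster into seven 10-bit symbols. Widely scattered errors exhaust the budget much sooner, which is the burst-versus-random distinction the text raises.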
With FEC integrated in, the PCS takes in Ethernet frames from the MAC, and sends long blocks of error-protected, scrambled, encoded bits through the gearbox to the PMA.
The PMA transmit side has basically two jobs. The first is to convert the parallel data stream—or streams, in a multilane connection—into serial data with an embedded clock. The second job is to convert that serial data into an analog signal suited to whatever medium will be used to carry it, whether PC board trace, coaxial cable, copper backplane, or optical fiber. The receive side of the PMA must detect the analog signal coming out of the channel, recover a serial bit stream from each lane, convert the serial data to parallel form, and pass it on to the PCS.
At 10G, the transmit circuitry would look familiar to designers of lower-speed interfaces. In each lane a frequency source drives an NRZ encoder. The encoder drives a feed-forward equalizer (FFE) that pre-distorts the pulse stream to compensate for losses in the channel. An amplifier then drives the physical medium, or the laser feeding an optical fiber.
In the receive direction, things are somewhat more complicated. A receive amplifier drives a continuous-time filter, a clock-data recovery circuit (CDR), and a decision feedback equalizer (DFE). The division of labor between the two filters is, roughly, that the linear filter corrects for the frequency response of the channel, while the DFE reduces inter-symbol interference and reflection noise. When a connection is established, the transmitter and receiver negotiate the tap settings for the FFE and the DFE respectively, to tune the electronics to the channel.
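The DFE’s job is easy to illustrate. In this toy Python sketch, a hypothetical channel smears each pulse into the two symbols that follow it (inter-symbol interference), and a two-tap DFE subtracts that smear using its own past decisions; the tap values here are invented for the example, whereas a real link negotiates them during training:

```python
def channel(symbols, isi=(0.6, 0.5)):
    """Toy channel: each pulse drags along fractions of the previous two symbols."""
    past, out = [0.0, 0.0], []
    for s in symbols:
        out.append(s + isi[0] * past[0] + isi[1] * past[1])
        past = [s, past[0]]
    return out

def dfe(samples, fb_taps=(0.6, 0.5)):
    """Decision-feedback equalizer: cancel ISI estimated from past decisions."""
    past, decisions = [0.0, 0.0], []
    for x in samples:
        x -= fb_taps[0] * past[0] + fb_taps[1] * past[1]
        d = 1.0 if x >= 0 else -1.0       # NRZ slicer
        decisions.append(d)
        past = [d, past[0]]
    return decisions

bits = [1.0, 1.0, -1.0, 1.0, -1.0, -1.0, 1.0]
raw = [1.0 if x >= 0 else -1.0 for x in channel(bits)]
print(raw == bits)                   # False: ISI flips some symbols outright
print(dfe(channel(bits)) == bits)    # True: feedback cancels the trailing ISI
```

The catch, visible even in the sketch, is the feedback loop: each decision must complete before the next sample can be corrected, which is why DFEs are a critical timing path at tens of gigabaud.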
But under the dual pressures of increasing speed and the demand for lower power, the familiar is giving way to the new. “As the frequency has increased, we are moving from a conventional CDR to a very fast analog-to-digital converter, followed by a substantial digital signal processor,” Gulstone says. “But the digital signal processing (DSP) algorithms are not included in the industry standards: they have become an opportunity for differentiation.” In this new model, the DSP block passes its best estimate of the correct data on to the PCS.
In the move to 50G, another technique is being investigated. Losses in the channel go up with frequency, so sending multiple bits per clock at a lower frequency should, in principle, work better than trying to force ever-higher frequencies through the physical connections. NRZ passes essentially one bit per clock. But if you were to drive the transmit amplifier with, in effect, a 2-bit digital-to-analog converter, you could pack two bits into each clock period—hence, 50G with no increase in fundamental frequency over 25G. This is pulse-amplitude modulation: in this case PAM-4, named for the four discrete voltage levels possible in a clock period.
On the transmit side, the linearity requirement on the amplifier goes up. On the receive side, the receive amplifier also gets more stringent linearity requirements, and the conventional CDR gets replaced with a fast 2-bit analog-to-digital converter, with its attendant DSP hardware. Whether all this additional work is a net gain compared to just increasing the frequency is still a matter of heated debate. The answer may be channel-dependent.
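The two-bits-per-symbol mapping itself is easy to sketch. A minimal Python illustration, assuming the Gray-coded level assignment commonly used so that adjacent voltage levels differ in only one bit (levels shown as ±1, ±3 for readability; the function names are illustrative):

```python
# Gray-coded PAM-4: each pair of bits selects one of four voltage levels.
GRAY_PAM4 = {(0, 0): -3, (0, 1): -1, (1, 1): +1, (1, 0): +3}
LEVELS = {v: k for k, v in GRAY_PAM4.items()}   # inverse map for the receiver

def pam4_encode(bits):
    """Pair up the bit stream and map each pair to a voltage level."""
    return [GRAY_PAM4[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]

def pam4_decode(symbols):
    """Slice each level back into its two bits."""
    return [b for s in symbols for b in LEVELS[s]]

bits = [0, 0, 1, 0, 1, 1, 0, 1]
syms = pam4_encode(bits)
print(syms)                        # [-3, 3, 1, -1]: half the symbol rate of NRZ
print(pam4_decode(syms) == bits)   # True
```

Gray coding matters because the receiver’s most likely mistake is confusing adjacent levels; with this mapping, that mistake costs one bit error instead of two.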
Working together, novel modulation, many-tap equalizers, DSP, and FEC can demonstrably achieve the data rates that data-center architects seek, at least on some well-behaved channels. MACsec hardware, working in concert with active security hardware in layers above the MAC, can approach the level of security cloud users are assuming.
But each of these functions increases power consumption—which data center operators hate—and adds significantly to latency—which some applications cannot tolerate. Yet these are hardware functions. So the need for a homogeneous, software-defined data center fabric implies that these functions must be deployed on every network interface.
The solution to this quandary would appear to be reconfigurability: either ASIC SoCs that implement all the functions and can selectively bypass each one and power-gate it on short notice, or FPGAs that can be reconfigured in place. Either way, the data-center management system must have the ability to relocate jobs at will, yet give each job only the latency it can tolerate and only the Ethernet bandwidth it needs. That is a significant challenge, reaching all the way from the PMA hardware to the data center management code.
See an implementation of a 100 G Ethernet core.