Two strong currents of technology change are surging across the data center, sweeping away conventional thinking and leaving behind profoundly changed software, memory, and storage architectures. One current, rising from the largely unexplored region of neural networks, is altering the way applications access data. The other, springing from the depths of semiconductor physics, is dissolving the boundary between memory and storage. Together they will fundamentally change data centers. And that change will flow straight on into embedded systems.
Infinite and Flat
Since the beginning of computing, applications have had to recognize some fundamental distinctions: between working data and files, between memory and mass storage. Even relatively recent codes like the big-data platform Hadoop studiously respect this boundary, working hard to manage DRAM and disk space for each server. But that is not really how modern programming languages view the universe.
To a Java or Python program, the world is a limitless pool of objects, all implicitly resident in main memory (Figure 1). Physical realities like limited DRAM, slow disk drives, and legacy interconnect schemes are issues to be made transparent, not features to be used. In place of Hadoop map/reduce, now we have Spark, with its essentially infinite pool of object-storing DRAM.
The change was driven by a shift in workloads. When the dominant data-center app was Web search—in which a task makes one query against a giant data set and then retires—map/reduce worked well. Pages of data flowed from disk into memory, and rankings flowed back to disk. But emerging applications like deep learning, which can iteratively revisit the same data set many times in succession, or graph analysis, which can create bursts of widely scattered accesses, can spend 90 percent of their time waiting for disk. Switching to a memory-resident object store yields huge improvements in execution time.
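A rough way to see how large those improvements can be is Amdahl's law. The sketch below takes the 90 percent wait figure from the text; the factors by which a memory-resident store shortens the wait are purely hypothetical values chosen for illustration, and the function name is an assumption.

```python
def amdahl_speedup(wait_fraction: float, wait_speedup: float) -> float:
    """Overall speedup when only the storage-wait share of run time gets faster."""
    if not 0.0 <= wait_fraction <= 1.0:
        raise ValueError("wait_fraction must be between 0 and 1")
    if wait_speedup <= 0.0:
        raise ValueError("wait_speedup must be positive")
    # Compute time is unchanged; only the waiting time shrinks.
    return 1.0 / ((1.0 - wait_fraction) + wait_fraction / wait_speedup)


WAIT_FRACTION = 0.90  # "90 percent of their time waiting for disk" (from the text)

# Hypothetical factors by which memory residency shortens the wait.
for factor in (10, 100, 1000):
    overall = amdahl_speedup(WAIT_FRACTION, factor)
    print(f"wait {factor:>4}x faster -> job about {overall:.1f}x faster")
```

Under these assumptions the overall gain can approach, but never exceed, a factor of ten: once the wait is gone, the remaining 10 percent of compute time sets the limit.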
Another aspect that is changing is data mutability. Traditional IT databases are assumed to be always changing, even though in reality they are often almost entirely static. Web crawlers in principle continuously update page data, but changes from day to day are rather small as a portion of the entire data set. Graphs of social media connections may change only a tiny amount from day to day. Neural network models don’t change at all once they are trained. The massive data sets built up by Internet-of-Things networks may grow continuously, but by appending new data without modifying existing data. Many of the most important new data sets are completely, or nearly, immutable.
An Opening for NVM
These two shifts, the wish for massive memory capacity and the growing immutability of data sets, have flowed in parallel with a current of unprecedented change in the very physical world of semiconductor memory. New storage mechanisms like phase-change memory and resistive memory are finally approaching production, bringing access times and densities that change the game. And 3D NAND flash, using towering vertical structures resembling a science-fiction cityscape, is upending the speed and density projections of Moore’s-Law pessimists. On the surface, these changes sound like just what Spark wanted: massive amounts of fast, local memory.
Gains are certainly impressive at chip level. And they come at a time when conventional scaling is flattening out. “In 15 nm, a planar NAND floating gate can only store about 15 electrons,” observed Western Digital executive VP Siva Sivaram at this year’s Flash Memory Summit. A charge so small you can count the electrons leaves no margin for error, especially if you are trying to discriminate four charge levels on the memory cell, as multi-level flash must do. Flash designers either had to give up on further scaling, or find a way to pack more floating-gate volume into a given amount of chip real estate.
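The margin problem can be put in numbers with a back-of-envelope sketch. It assumes, as a simplification not stated in the source, that the charge levels are spaced evenly between an empty gate and the roughly 15 electrons quoted above; the function name and the two- and eight-level cases are included only for comparison.

```python
def electrons_between_levels(total_electrons: int, levels: int) -> float:
    """Average electrons separating adjacent charge levels (even spacing assumed)."""
    if levels < 2:
        raise ValueError("need at least two levels to tell apart")
    # N levels spread over the full charge range leave N - 1 gaps.
    return total_electrons / (levels - 1)


GATE_ELECTRONS = 15  # figure quoted for a 15 nm planar floating gate

# Four levels is the multi-level case in the text; 2 and 8 are for comparison.
for levels in (2, 4, 8):
    gap = electrons_between_levels(GATE_ELECTRONS, levels)
    print(f"{levels} levels: about {gap:.1f} electrons between adjacent levels")
```

With four levels, this simplified model leaves only about five electrons between neighboring states, which is why the loss or gain of a handful of electrons can corrupt a cell.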
To escape this limitation, NAND designers have done exactly what real-estate developers have: they’ve gone vertical. Instead of using planar floating-gate transistors, 3D NAND chips go up. The floating-gate structures are wrapped around a cylindrical tower, stacked up like bagels on a peg, and enmeshed in a dense multi-layer network of control and signal lines. In announcements this year, some vendors have built these towers to a height of 64 floating gates—allowing Samsung at the conference to introduce a 512 Gb die with a peak transfer rate of 800 Mbps.
Nor have we seen the end of vertical growth. KW Jin of SK Hynix projected that the technology would continue advancing, in 18-month strides, until it surpasses 100 cells high. “But a 200-cell stack will require a breakthrough,” he said.
Gathered into a solid-state drive, 3D NAND chips produce formidable figures: a 1 TB single BGA package with an NVMe interface is in the works, Samsung executive VP Jaeheon Jeong said. He went on to announce a 32 TB SSD with an SAS interface. And he revealed that Facebook is working on an alternative to today’s M.2 card that will allow 32 NAND cards to squeeze into a 1U rack slot and transfer 12 GBps over PCI Express® (PCIe®) Gen4.
Into the Gap
Even with the latest advances in speed, 3D NAND still falls well short of the speed and latency of DRAM. This leaves a gap, into which Intel and Micron pushed last year with their announcement of 3D XPoint memory. With density currently about a quarter of 64-high NAND but claims of 1000 times lower latency at die level, the as-yet-undisclosed transistorless-cell technology fits neatly between DRAM and 3D NAND in density and performance.
At the Flash Summit, Micron director of advanced architecture Stefanie Woodbury showed how such figures translate into system-level performance. “At the SSD level,” she said, “we are getting ten times the I/O operations per second (IOPS) and a tenth the response time of conventional NAND SSDs.”
Not to be left behind, Samsung dropped broad hints about a rival product it calls Q-NAND. “It is faster than PRAM [phase-change memory, the technology many believe is behind 3D XPoint] with lower energy consumption,” Jeong claimed. “We will have 1 TB devices this year.” Other than the obvious point of being called NAND, Jeong did not discuss the technology involved.
The Bad News
These advances sound like just the thing to meet software’s demand for an enormous, flat physical memory. But in reality there are many complicating issues.
The first is speed—both access latency and bandwidth. NAND is much faster than disk drives, especially in average latency. But even the latest 3D NAND chips are no match for DRAM. CAS latency—the delay between receiving a read command and appearance of the first valid data at the DRAM pins—is around 15 ns. NAND flash latency can be a thousand times longer: much longer still at module level, where a controller must intervene in transactions. There is a similar disparity in transfer rates. So purely based on read performance, just using NAND flash as main memory would torpedo application performance—unless you had either wonderful cache hit rates or an enormous depth of threads to run while you were waiting.
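To make "wonderful cache hit rates" and "an enormous depth of threads" concrete, here is a minimal sketch. It uses the 15 ns DRAM figure and the thousand-fold chip-level ratio from the text; the target latency, the 100 ns of useful work per thread, and all function names are hypothetical values chosen for illustration.

```python
import math

DRAM_NS = 15.0             # CAS latency quoted in the text
NAND_NS = DRAM_NS * 1000   # "a thousand times longer", chip level only


def effective_latency_ns(hit_rate: float,
                         fast_ns: float = DRAM_NS,
                         slow_ns: float = NAND_NS) -> float:
    """Average access time when a DRAM cache fronts NAND main memory."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * fast_ns + (1.0 - hit_rate) * slow_ns


def required_hit_rate(target_ns: float,
                      fast_ns: float = DRAM_NS,
                      slow_ns: float = NAND_NS) -> float:
    """Hit rate needed for the average access time to reach target_ns."""
    return (slow_ns - target_ns) / (slow_ns - fast_ns)


def threads_to_hide(stall_ns: float, work_ns: float) -> int:
    """Threads needed so one always has work while the others wait on memory."""
    return math.ceil(stall_ns / work_ns) + 1


for rate in (0.90, 0.99, 0.999):
    print(f"hit rate {rate:.3f}: {effective_latency_ns(rate):8.1f} ns average")

# Hypothetical goal: average latency within 2x of DRAM.
print(f"hit rate for 30 ns average: {required_hit_rate(30.0):.4f}")

# Hypothetical workload: 100 ns of useful work between NAND accesses.
print(f"threads to cover one NAND stall: {threads_to_hide(NAND_NS, 100.0)}")
```

Under these assumptions, even a 99 percent hit rate leaves the average access roughly eleven times slower than DRAM, and hiding a single chip-level NAND stall takes on the order of 150 runnable threads.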
But read speed isn’t the real problem. NAND flash block erases and writes (the NAND chip has to do these operations a block at a time) are much slower than reads, and their timing can vary over a wide range. Further, the number of times you can successfully erase and write a block is tiny compared to DRAM: from around a thousand cycles to perhaps 20,000 for high-density devices. And if you want endurance on the high end of that range, you give up other things, such as speed or error rate.
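A back-of-envelope estimate shows what that endurance range means in service. The cycle counts come from the text; the 1 TB capacity, the 1 TB of host writes per day, the write-amplification factor of 2, and the assumption of perfectly even wear are all hypothetical inputs for illustration.

```python
TB = 10**12  # decimal terabyte, in bytes


def lifetime_years(capacity_bytes: float,
                   erase_cycles: int,
                   host_bytes_per_day: float,
                   write_amplification: float = 2.0) -> float:
    """Years until every block reaches its erase limit, assuming even wear."""
    # Each host byte costs write_amplification bytes of physical flash writes.
    total_host_bytes = capacity_bytes * erase_cycles / write_amplification
    return total_host_bytes / host_bytes_per_day / 365.0


# Endurance range from the text: about 1,000 to perhaps 20,000 cycles.
for cycles in (1_000, 20_000):
    years = lifetime_years(1 * TB, cycles, 1 * TB)
    print(f"{cycles:>6} cycles: about {years:.1f} years at 1 TB of writes per day")
```

With these inputs the low end of the range wears out in well under two years, which is tolerable for storage that is written occasionally but not for a device rewritten at main-memory rates.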
Looking at these realities, it would appear that NAND flash is not memory at all; rather, it is a storage medium, like an expensive kind of high-performance disk. As such, it must have a controller and driver software to make it useful. That is, of course, exactly how most SSD vendors use NAND: they put a big array of flash chips behind a controller. The controller may cache reads to reduce latency. It will post writes to hide the huge write latency and minimize the number of times a physical block of flash has to be erased and rewritten. It will use strong error-correcting codes to minimize the impact of failed cells and random errors. And it will remap addresses to avoid blocks with too many failed bits and to even out wear.
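The address remapping and wear evening described above can be illustrated with a toy model. This is a minimal sketch, not how any real SSD controller works: it tracks whole blocks only, ignores caching, error correction, and write posting, and the class name `ToyFlashTranslationLayer` and its parameters are invented for illustration.

```python
class ToyFlashTranslationLayer:
    """Toy logical-to-physical block remapper with simple wear leveling."""

    def __init__(self, physical_blocks: int, max_erases: int) -> None:
        self.max_erases = max_erases
        self.erase_count = [0] * physical_blocks
        self.data = [None] * physical_blocks
        self.free = set(range(physical_blocks))
        self.logical_to_physical = {}

    def write(self, logical: int, payload) -> None:
        # Skip worn-out blocks, then pick the least-erased free block.
        usable = [b for b in self.free if self.erase_count[b] < self.max_erases]
        if not usable:
            raise RuntimeError("no usable free blocks left")
        target = min(usable, key=lambda b: (self.erase_count[b], b))
        self.free.discard(target)
        self.data[target] = payload
        # Remap the logical address, then recycle the block it used before.
        previous = self.logical_to_physical.get(logical)
        self.logical_to_physical[logical] = target
        if previous is not None:
            self._erase(previous)

    def read(self, logical: int):
        return self.data[self.logical_to_physical[logical]]

    def _erase(self, block: int) -> None:
        self.data[block] = None
        self.erase_count[block] += 1
        self.free.add(block)


# Rewrite one logical block many times; the wear spreads over all four blocks.
ftl = ToyFlashTranslationLayer(physical_blocks=4, max_erases=100)
for version in range(30):
    ftl.write(0, f"version {version}")
print(ftl.read(0))
print(ftl.erase_count)
```

Even though the host keeps rewriting the same logical address, no single physical block absorbs all the erases, which is the effect the controller's remapping is meant to achieve.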
Apparently even switching to a fundamentally different non-volatile technology won’t free us from these realities. Neither Intel nor Micron has disclosed the device-level characteristics of their 3D XPoint memory in detail. But in her Flash Summit talk Micron’s Woodbury did say that the controller and firmware were critical to 3D XPoint SSD performance.
The obvious way to use SSDs, then, is as emulations of disk drives. A module including NAND flash dice and a controller die can drop into the disk footprint on a server card and connect to the card’s SATA port. Or a tray of flash chips and controllers can emulate an array of disks, RAID-fashion, serving an entire cabinet.
But there is growing acceptance of dropping the emulation and talking to the SSD in a protocol optimized for non-volatile memory—that would be NVMe—over the fastest available bus—which would be PCIe on most cards. This approach can give three to five times the performance of SATA or even SAS connections in a data-center environment, according to published figures from testing lab Calypso Systems.
And there is another possibility. With a proper controller, an SSD could sit directly on the server card’s memory channel, giving it access to much higher bandwidth and lower latency. With careful design, rather a lot of flash and a controller die could fit in a conventional DIMM format and plug into the DDR memory channel. That is the idea behind NVDIMM, of which there are several flavors. NVDIMMs’ small size means strictly limited capacity: say tens of gigabytes, compared to up to tens of terabytes for a tray-sized SSD. But with a DRAM between the flash chips and the DDR bus, NVDIMM-N cards, which are in effect DRAM cards that use the flash chips only as backup in case of power failure, can have ten times the overall performance, in IOPS, of enterprise-class NVMe SSDs. NVDIMM-F cards, which attach an SSD to the memory channel without the DRAM buffer, can be similarly high in bandwidth, but with longer latencies.
NVDIMM-N cards have another important characteristic: they can be addressed as memory rather than sent commands in NVMe protocol or treated as disks. And this raises an interesting possibility: what if instead of having to interact with non-volatile memory as storage, we could do remote direct memory access (RDMA) among the non-volatile memory devices across a data center? The many separate pools of DRAM and NV memory could be linked together by the data center’s Ethernet and the RDMA over Converged Ethernet (RoCE) protocol.
Now we are approaching that software dream. To the caches on CPUs and accelerators this enormous collection of memory and memory-like devices can appear as a single, flat memory space. Even huge data sets can be completely memory-resident, just with varying latency, depending on where in the data center the physical page is located. Even worst-case, the RoCE latency is going to be far shorter than cold-storage disk latency.
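One way to picture a flat space with location-dependent latency is a simple address map. The sketch below is an illustration only: the three regions, their sizes, and their latencies are hypothetical, and a real system would do this mapping in page tables and NIC hardware rather than in application code.

```python
import bisect
from typing import NamedTuple, Tuple

GIB = 2**30


class Region(NamedTuple):
    name: str
    size_bytes: int
    latency_ns: float  # hypothetical figures, for illustration only


# Hypothetical layout of one flat address space, nearest memory first.
REGIONS = [
    Region("local DRAM", 64 * GIB, 100.0),
    Region("local NVDIMM", 32 * GIB, 500.0),
    Region("remote memory over RoCE", 1024 * GIB, 5000.0),
]

# Exclusive end address of each region within the flat space.
REGION_ENDS = []
_running_total = 0
for _region in REGIONS:
    _running_total += _region.size_bytes
    REGION_ENDS.append(_running_total)


def resolve(address: int) -> Tuple[Region, int]:
    """Map a flat address to the region holding it and the offset within it."""
    if address < 0 or address >= REGION_ENDS[-1]:
        raise ValueError("address outside the flat memory space")
    index = bisect.bisect_right(REGION_ENDS, address)
    start = REGION_ENDS[index - 1] if index > 0 else 0
    return REGIONS[index], address - start


for address in (0, 64 * GIB, 200 * GIB):
    region, offset = resolve(address)
    print(f"address {address}: {region.name}, offset {offset}, "
          f"about {region.latency_ns:.0f} ns")
```

The application sees one contiguous range of addresses; only the latency changes as an access crosses from local DRAM to local non-volatile memory to memory elsewhere in the data center.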
This brings us to three new ideas, each of which has the potential to further disrupt the data center. One is a logical outgrowth of the distributed memory architecture we have just described. And the other two are related concepts raised at the Flash Summit.
Let’s start with the logical outgrowth. The pools of memory in our distributed architecture are stitched together into one giant virtual memory by RDMA transactions over the data center’s converged Ethernet. Some data centers—notably, Microsoft’s Azure centers—are exploring the use of hardware accelerators in the network interface cards (NICs) at the edge of this network. This places programmable hardware—in Azure’s case, FPGAs—effectively inside the huge flat memory.
For conventional network or storage transactions, the uses of these so-called Smart NICs are obvious. They can implement compression/decompression or encryption/decryption algorithms as data moves through the network interface on its way to or from caches. But as general-purpose processors embedded in memory, the implications of the Smart NICs are less explored. Is it feasible, for example, to organize an application so that as data moves toward the CPUs, from persistent memory into cache, it passes through hardware preprocessors in the Smart NICs? Such an organization could offload much work from the CPUs, saving time, energy, DRAM, and cache space.
At the Summit one company, NGD Systems, formerly known as NxGn Data, was discussing a different take on this idea. The company described a 6 TB, 150k IOPS SSD with embedded processing capabilities. “The idea is to execute search, regular-expression processing, map-reduce, or Spark tasks inside the SSD,” explained CEO Nadar Salessi. The programming model is of course an important issue. Salessi said the basic idea is to code the task conventionally, and then containerize it for the SSD.
Both the Smart NIC and NGD’s in-situ processing focus on embedding processing into the network or the storage system. A third idea from the Summit heads in the opposite direction: taking even low-level controller processing away from persistent memory.
“Raw NAND chips are capable of 5 million IOPS,” observed EMC fellow Daniel Cobb. But, he continued, by the time we pack the chips behind a controller and interface, we only see a fraction of that. Perhaps we need to lose the controller and the peripheral bus connection.
In principle, the overhead tasks an SSD controller performs could be virtualized, tucked into a software driver and activated by a hypervisor. And the chips themselves could be designed to directly support RDMA. This would eliminate the bottlenecks between DRAM and NAND, exposing the full potential bandwidth of the chips. The impact on system performance, particularly with respect to write latency, where system software would have to intervene in the write process, is unclear. Given the growing importance of immutable data structures, however, this might not be an issue in many applications.
Currents of Change
The rapid shifts in non-volatile memory technology have presented data-center architects with perplexing choices (Figure 2). The obvious response, and the one that most nearly preserves current architectures, is to use the devices packaged up as disk replacements. But this is severe underutilization.
That concern has pushed architects toward other choices, such as NVMe protocol over PCIe or over any of a variety of high-performance fabrics. And it opens the door to deploying smaller amounts of flash in NVDIMM modules. But these solutions place the non-volatile memory in an intermediate position: not disk, but not real memory either. They promise to help performance, but only by introducing a third category of devices with its own unique protocols. Emerging technologies like 3D XPoint follow this pattern by introducing yet another intermediate category, this time between DRAM and SSDs.
That is not the direction applications want to go. They just want an infinite, uniform memory space. The architecture that seems to best fit this worldview is a data-center-wide memory pool linked by RoCE. All the DRAM and non-volatile devices in the data center become a single, flat space with differences in latency but not in logical behavior. But that is a big step away from today’s ad-hoc blends of DRAM, NVDIMM modules, and SSDs. Not least among the differences is that RoCE explicitly puts what applications believe to be memory traffic on the data center’s Ethernet networks: not only the fast in-rack networks that today are moving to 40 Gbps and beyond, but also the center-wide network that links together the top-of-rack switches. This move should create an enormous thirst for Ethernet bandwidth: 40 Gbps will be just the beginning.
Does the bandwidth pressure then bring about even more radical change, such as in-network or in-memory computing, requiring software adjustments clear up to application level? Does the increased latency of traversing the networks focus architects’ attention on CPUs and accelerators with higher levels of multithreading? Do these pressures encourage NAND flash vendors to seriously evaluate radical ideas like chip-level RDMA?
Whichever way the data center moves, embedded-system designers would be well-advised to watch. In the past, enterprise-level architectural ideas from cache to virtual memory to Ethernet to multicore CPUs have migrated into embedded designs. Revolution in non-volatile memory is not likely to be an exception. So, what could your next embedded design do with 1 TB of physical memory?