Perhaps you could call it a tipping point. For years the increasing density of flash memory chips has meant more room for media on smart phones, cameras, and media players. And far more powerful flash-based solid-state drives (SSDs) have enabled new tablet and notebook computers, and have begun quietly appearing in data centers.
But suddenly the pace of technology change has accelerated. Vertical-cell NAND flash is delivering on its promise to move beyond the limits of planar flash and provide enormous storage capacity per chip. New non-volatile technology, in the form of the Intel-Micron 3D XPoint memory, is headed into volume production. And rotating disks are striking back with new technologies of their own. One result will be unprecedented change in data centers, beginning in hyperscale clouds and extending down into the enterprise.
The Technology Delivers
Papers at the 2015 Flash Memory Summit documented both the new storage technology and its impact on the data center world. Samsung corporate vice president of marketing Jim Elliott announced shipping for a 256 gigabit (Gb) single-die flash chip, using a 48-high vertical bit-cell architecture and up to three bits per cell. “The chips have twice the sequential read speed and 40 percent less power consumption compared to our 128 Gb devices,” Elliott said. He added that work is underway to reach a 100-layer stack and 1 terabit (Tb) chip capacity. Toshiba, which had announced its own 48-layer, 256 Gb device a week earlier, described briefly a stacked-die arrangement in which the flash dice link the parallel data from their memory arrays directly to a controller die using through-silicon vias, potentially opening a chip-crossing bottleneck and allowing greater bandwidth.
But the advances in vertical flash were nearly eclipsed by Intel’s and Micron’s announcement, just before the conference, of 3D XPoint memory. Starting at 128 Gb, but with a thousandth the latency of flash, XPoint appears to represent a new category of persistent memory: fast enough to operate on a server DRAM bus, but dense enough to be a mass-storage medium.
Meanwhile the incumbent mass storage—rotating disk—is not giving up. “There is still a long future for disks,” said Seagate cloud systems president Phil Brace. “We are shipping drives with 1 Tb/in2 density. Coming technology, including shingled recording, 2D recording, and heat-assisted recording, will take us to 5 Tb/in2 and less than one cent per gigabit. The cost advantage of 5-10 times for disk over solid state will endure.”
One might expect such optimism from a drive vendor, but system vendors supported Brace’s view. “Flash has a huge latency advantage over disk,” observed Oracle flash storage group senior vice president Michael Workman. “But in a cloud array it has little advantage in bandwidth or power. The huge cost difference stays. In five years you will still see both performance and capacity disks in cloud data centers.”
Just Dropping In
The obvious way for data-center architects to respond to an increasingly dense non-volatile technology with a fraction of disk’s read latency would be to simply put a cluster of flash chips together under a chip that emulates a disk controller. Then use this SSD to replace real disks at critical locations (Figure 1). In fact, that has been happening in the data center for several years.
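This disk-emulation approach is essentially a flash translation layer (FTL): because flash pages cannot be overwritten in place, the controller redirects each logical write to a fresh physical page and keeps a logical-to-physical map. A minimal sketch of the idea, with all names and sizes illustrative rather than any vendor's design (a real controller adds wear leveling, garbage collection, and error correction):

```python
# Illustrative flash-translation-layer (FTL) sketch, not a real controller.
# Writes never overwrite in place: each goes to a fresh page, and the
# logical->physical map is updated so reads always find the newest copy.

class TinyFTL:
    def __init__(self, num_pages):
        self.flash = [None] * num_pages   # physical flash pages
        self.l2p = {}                     # logical block -> physical page
        self.next_free = 0                # naive append-only allocator

    def write(self, lba, data):
        if self.next_free >= len(self.flash):
            raise RuntimeError("out of free pages (GC not modeled)")
        self.flash[self.next_free] = data
        self.l2p[lba] = self.next_free    # old physical page becomes stale
        self.next_free += 1

    def read(self, lba):
        phys = self.l2p.get(lba)
        return self.flash[phys] if phys is not None else None

ftl = TinyFTL(num_pages=8)
ftl.write(0, b"v1")
ftl.write(0, b"v2")          # same logical block, new physical page
assert ftl.read(0) == b"v2"  # the remap hides the out-of-place write
```

Behind this mapping, the drive presents ordinary block reads and writes, which is exactly what lets an SSD drop into a slot designed for a disk.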
On the server boards, SSDs have become an alternative to small, low-latency board-mounted SAS drives. Elsewhere in the rack, larger pools of flash have moved in as alternatives to, or front ends for, big, high-capacity disk arrays. There is even talk of massive arrays of flash replacing the so-called cold-storage disk arrays: the banks of slower, high-capacity drives used to hold infrequently-accessed data.
But experts argue that directly replacing disks is the wrong way to go. One reason is that the existing interfaces, which evolved in a disk world, are poorly suited to getting the best out of the increasingly capable flash chips.
“Flash arrays have a considerable bandwidth advantage over disks,” explained Workman, “but the SAS interface, and to some extent even PCI Express® (PCIe®) connections, hide the advantage.” NVMe, which defines a flash-specific protocol on top of PCIe, is a better approach than trying to treat flash arrays as disk clones, Workman granted. But even NVMe misses the reliability, availability, and serviceability (RAS) needs of large storage arrays. So both architects and system vendors are looking for new options. “SAS, SATA, and PCIe are yesterday’s game,” asserted Kevin Conley, SanDisk CTO.
Beyond the physical reach of PCIe, architects are using a variety of techniques to attach to large storage arrays, Workman said. Ten-gigabit Ethernet transporting the SCSI storage protocol over Internet Protocol (iSCSI) is an increasingly popular option, exploiting the growing presence of 10 GE on server cards. But Workman cited Fibre Channel and, increasingly, InfiniBand as important alternatives.
The weakest point in the picture appears to be on the server card, where SAS just may not be the right answer. At the same time as interface questions are arising, a shift in the data center’s software environment is raising different questions about how to deploy flash in the cloud.
When data centers all belonged to enterprise IT departments, the software environment was fairly predictable. Applications issued SQL queries to disk-resident relational data bases. Great effort went into strategies for identifying hot regions of a data base and shifting them to high-performance disks or caching them in DRAM.
With the advent of search engines and big-data analyses, the picture changed. In the new hyperscale data centers the mass of data remained disk-resident, but instead of a relational data base and structured queries, the data were often unstructured: key-value stores or simply piles of unstructured document data following no schema at all. Map-reduce environments such as Hadoop managed the parallel flow of masses of data from the disks into servers’ DRAM arrays for analysis. But now, the environment has changed again.
“In one retail product page on Amazon you see the work of about 30 microservices,” said NetApp cloud czar Val Bercovici. “You may have Redis or Riak key-value stores, a Neo4j graph data base to identify related products, and a MongoDB forms-based document store, all cooperating to build the page.”
One huge difference between these new applications and older codes like Hadoop is their treatment of storage. They more or less assume that their entire data set is in memory. As a consequence, “Beginning with the spread of Memcached, continuing with codes like Spark and Redis, these new applications eat memory,” warned Riccardo Badalone, CEO of Diablo Technologies. “We have to find an alternative to DRAM.”
New Architectures for New Code
Looking at the cloud data center not from the hardware-centric perspective of gigabits, controllers, and busses, but from the viewpoint of new applications, we see a very different picture. The objective is not to improve the performance of today’s disk-centric architectures. The goal is to get all the data into main memory at once. The focus shifts from SCSI over 10 GE or InfiniBand to the common factor of all main memory: the DRAM bus.
“We will see all-flash storage putting masses of flash behind DDR DRAM DIMMs,” said Oracle’s Workman. Other speakers agreed: Badalone said “It’s time to think of flash as memory, not as storage. By using all-flash DIMMs we can put four to ten times more active data on a server memory bus.”
The picture we see emerging is one of radical change. On the server memory bus a thin layer of DRAM fronts a massive array of fast flash. Deeper in the data center, high-capacity, high-reliability SSDs of genuinely massive capacity—tens or hundreds of terabytes (TBs)—support the server DIMMs through RDMA transactions that bypass the operating system and hypervisor altogether to minimize latency. Some architects call this disaggregation—pulling the mass storage out of deep centralized pools and spreading it across the data center, as near as possible to the servers.
In effect, the DRAM, DIMM flash, RDMA-connected in-rack flash, and cold storage form concentric layers of cache, extending the virtual memory hierarchy seamlessly all the way from L1 caches on the server CPU cores to deep, persistent storage on the other end of a network link (Figure 2). From the application’s perspective, everything is in physical DRAM. From the data center operator’s perspective, big variations in task-latency—poison to high utilization and therefore to profits—go away.
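The concentric-cache idea can be sketched as a chain of tiers searched fastest-first, with a hit promoted into the faster tiers so that, to the application, every access eventually looks like a DRAM read. The tier names and the promotion policy below are illustrative assumptions, not any vendor's implementation:

```python
# Illustrative tiered-storage lookup: DRAM fronts flash DIMMs, which front
# RDMA-attached rack flash, which fronts cold storage. Tier names and the
# promote-on-hit policy are assumptions for the sketch, not a real design.

TIERS = ["dram", "flash_dimm", "rack_flash", "cold_storage"]

class TieredStore:
    def __init__(self):
        self.tier = {name: {} for name in TIERS}

    def get(self, key):
        """Search tiers fastest-first; on a hit, copy the value into every
        faster tier so the next access hits in DRAM."""
        for i, name in enumerate(TIERS):
            if key in self.tier[name]:
                value = self.tier[name][key]
                for faster in TIERS[:i]:   # promote toward DRAM
                    self.tier[faster][key] = value
                return value
        return None                        # true miss

    def put_cold(self, key, value):
        self.tier["cold_storage"][key] = value  # data born cold

store = TieredStore()
store.put_cold("page42", b"product data")
store.get("page42")                     # misses DRAM, hits cold storage
assert "page42" in store.tier["dram"]   # now resident in the fastest tier
```

A real hierarchy would also need eviction from the small fast tiers, but the promotion path is the essence of making the whole stack look like one big memory.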
There is, however, a big fly in the ointment. Those predictable latencies assume symmetric read and write timing. Even at a hardware level, “The DDR4 timing is deterministic,” warned one speaker. The DDR4 protocol makes no provision for a memory that has deterministic read behavior but behaves unpredictably on writes.
If write activity is not frequent, a controller with enough fast buffer memory can resolve this problem. Just post the writes to a coherent buffer, and commit them as the hardware allows. Fortunately, to a great extent, modern applications meet this criterion. Programs like Spark and Redis rarely write to memory. Even in older SQL applications, writing tends to be much rarer than even data center managers think. Shachar Fienblit, CTO at Kaminario, observed that 97 percent of users write their entire data set less than once per day. With a good controller, he said, you can keep the writing load to an average of 15 percent of the data set per day. Write buffers can handle that.
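The posted-write scheme can be sketched as a small write-back buffer: writes are acknowledged immediately into fast buffer memory, reads check the buffer first for coherence, and a drain step commits data to the slow medium whenever it is ready. This is a toy model under stated assumptions; a real controller must also handle power loss, ordering, and buffer overflow:

```python
# Toy write-back buffer: writes complete at buffer speed; commits to the
# slow medium happen later, as the hardware allows. Illustrative only.

class WriteBuffer:
    def __init__(self, backing):
        self.backing = backing   # dict standing in for slow flash media
        self.pending = {}        # coherent buffer of posted writes

    def write(self, addr, data):
        self.pending[addr] = data          # acknowledged immediately

    def read(self, addr):
        # Coherence: newest posted data wins over the backing store.
        if addr in self.pending:
            return self.pending[addr]
        return self.backing.get(addr)

    def drain(self, max_commits):
        """Commit up to max_commits posted writes when the media is ready."""
        for addr in list(self.pending)[:max_commits]:
            self.backing[addr] = self.pending.pop(addr)

flash = {}
buf = WriteBuffer(flash)
buf.write(0x10, b"new")
assert buf.read(0x10) == b"new"   # visible before the commit
buf.drain(max_commits=8)
assert flash[0x10] == b"new"      # now on the persistent medium
```

With write traffic averaging only a fraction of the data set per day, as Fienblit suggests, the drain step rarely falls behind the posted writes.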
Two New Waves
Two technology trends may accelerate the move to these new architectures. One is new non-volatile memory, as exemplified by the Intel-Micron 3D XPoint announcement. This technology, which analyst Dave Eggleston claimed is based on a phase-change memory element stacked with an Ovonic switch, claims ten times the density of DRAM and 1,000 times the speed of flash. “This creates a new layer in the hierarchy,” said analyst Chuck Sobey.
Intel evidently agrees. At the recent Intel Developer Forum the company said it will provide the new memory on DIMMs, together with a controller chip that will implement proprietary extensions to the DDR4 memory protocol to support the particular features and timing of XPoint. These DIMMs will form exactly the layer of dense persistent memory software developers have been designing for.
A second technology received surprisingly frequent mention at the Flash Memory Summit: in-memory computing. Several speakers observed that the transition to software like Spark that keeps entire data sets memory-resident creates a natural opportunity to do some operations in the DIMMs themselves, rather than in CPUs. Coming down on the aggressive side, “In the future, we will do 90-95 percent of computing inside persistent cache,” claimed Tegile Systems CEO Rohit Kshetrapal. Certainly Micron, which has placed big bets on its Hybrid Memory Cube—a technology that includes a processor/interface die in the memory stack—seems to agree.
With these tailwinds gathering, data center architects are likely to be swept right past the era of SSDs replacing disks, past all concerns about how to use legacy disk software drivers and interconnect schemes with the new memory, and into a totally unfamiliar world. In this world, to software everything looks like main memory. There are no disk drivers, mass-storage application programming interfaces (APIs), or virtualization layers visible, these functions being subsumed into software-defined cache controllers and switches. The server CPU chip’s SRAM caches, the DRAM, persistent high-speed memory, and networked flash SSDs are all linked, controller-to-controller, as concentric layers of cache, in a topology that can shift with shifting applications. Far more of the total storage capacity of the data center has been disaggregated and pushed close to the CPUs—terabytes of it right onto the DDR memory bus. And disks, with their low cost and massive capacity, sit in the less demanding, high-latency seams of the fabric, holding the coldest data. It is a very different vision.