For data-center architects it seems like a no-brainer. For a wide variety of applications, from the databases behind e-commerce platforms to the big-data tools in search engines to suddenly-fashionable data analytics to scientific codes, the dominant limitation on application response time is storage latency. But DRAM keeps getting denser, and solid-state drives (SSDs) cheaper. And a new class of memory devices—storage-class memory (SCM)—promises to put enormous amounts of memory on server cards. So why not just make all the data for these problem applications memory-resident, and eliminate disk and even SSD latency altogether?
The notion fits well into the shifting needs of data-center workloads. Many are becoming more sensitive to user-level response time, as users show increasing willingness to abandon a search, an online purchase, or a content view after only a few seconds’ delay. And the emergence of real-time constraints, as machine-learning or data-analytic functions are included in control systems—notably for autonomous vehicles—puts extra urgency on latency questions.
At the same time, really huge data sets are getting pulled into online roles. “Cold data is getting rewarmed by big-data analysis,” observes Intel® senior principal engineer David Cohen. New analytical approaches are delving into huge sets of historical data—transaction logs, ledgers, telemetry, or the unending streams from Internet of Things (IoT) networks—that used to be nothing but deep archives. And developers want the analyses in seconds, not days.
So put all the data in main memory. Great idea—good enough to create a whole generation of new applications and platforms, and a new category name: in-memory computing. (Note that the term refers to applications with entirely memory-resident data, not to processing elements embedded in memory subsystems.) But as usual, good ointment attracts flies. And getting the insects out requires rethinking memory organization and data-center network architecture (Figure 1). That makes in-memory computing not just a programming decision but an engineering challenge.
The story begins with evolution.
Closer to the Core
As so often happens with a great idea, the initial response to in-memory computing quickly overwhelmed the initial implementation. The data sets that scrambled and elbowed to get into the expanded server DRAM soon outgrew the few hundred GB of DRAM capacity on a rack-mount server card.
That forced architects to look closely at data access patterns in the workloads. For map-reduce workloads, in which each server is given its own chunk of data and there is little call for one server to access data not in its local DRAM, there is no big issue. In-memory computing in this context just means that you divide up the data set so that each chunk fits in one server card’s DRAM, and stays there—it is persistent. The same is true of other workloads that make the vast majority of their accesses to working sets that fit into their DRAM. The challenge comes when the working set just doesn’t fit.
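As a rough illustration, the partitioning step can be sketched as below. The function name, record size, and per-card DRAM budget are all hypothetical, not taken from any real framework:

```python
# Hypothetical sketch: divide a data set into shards, each small enough
# to stay resident in one server card's local DRAM.

def partition_for_dram(records, dram_bytes_per_card, record_size):
    """Greedily pack fixed-size records into DRAM-sized shards."""
    per_card = dram_bytes_per_card // record_size  # records that fit on one card
    return [records[i:i + per_card]
            for i in range(0, len(records), per_card)]

# Example: 1 KB records, cards with 256 GB of DRAM reserved for data.
shards = partition_for_dram(list(range(1_000_000)), 256 * 2**30, 1024)
```

Each shard is then pinned to one server for the life of the job, so every access it makes is local.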
The next evolutionary step is to bring SSDs into play. An SSD on a server card, connected to the CPUs via PCIe, can add from one to several TB of local storage to a card (Figure 2). The SSDs typically have DRAM caches to conceal their inherent latencies, and so can crank out hundreds of thousands of random read operations per second, and maybe a quarter as many write operations. Normally they would use the NVMe command protocol over PCIe, which works with read and write commands rather than emulating memory.
So when the application needs something that is in the SSD, it has to issue a system call that sends NVMe commands to transfer a block of data from the SSD over PCIe to the main-memory DRAM. A hypervisor can hide this process and make these devices, with their perhaps 30-40 µs latencies from their internal caches, appear to the application to be just annoyingly slow memory. Relying on the local SSDs isn’t really in-memory computing, but it can look that way to the application.
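In user-space terms, that block-oriented path might be sketched like this. The device path in the usage comment is an assumption, and in practice the kernel’s block layer, not raw Python, turns the read into NVMe commands:

```python
import os

BLOCK = 4096  # a typical NVMe logical block size

def read_block(fd, lba):
    """Read one logical block at the given logical block address (LBA).
    The kernel translates this into NVMe read commands over PCIe; to the
    caller it just looks like very slow memory."""
    return os.pread(fd, BLOCK, lba * BLOCK)

# Hypothetical usage (needs privileges and a real NVMe device):
# fd = os.open("/dev/nvme0n1", os.O_RDONLY)
# data = read_block(fd, 12345)
```

The point is the shape of the access: a system call per block, not a load instruction per cache line.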
This reality—especially the part about 40 µs—created demand for the next evolutionary step: SCM. This category of memory, nearly as dense as NAND flash but nearly as fast as DRAM, can either make a really fast SSD, or it can be made in DIMM format and can plug directly into the server-card DRAM bus. Current candidates for the title of SCM are also non-volatile. This sounds great, but the problem is that none of these technologies has actually appeared in a DIMM format yet so for now they are just another way to build an SSD.
If and when we do get SCM DIMMs, the boost to in-memory computing will be significant, turning that 40 µs into something more like 4 µs. They would also raise the amount of main memory on the server card’s DRAM bus from a few hundred GB to, eventually, 24 TB. Now we have a hardware platform that can offer real in-memory computing to real-world data sets.
But massive memory capacity on each server card doesn’t solve every problem. Cohen points out that many applications must keep transaction logs and checkpoints, even if they are running in persistent memory. Non-volatility does not protect a data set against bugs or malicious attacks. These needs can result in a background of high-frequency traffic in short messages from the server cards to storage pools, creating a serious challenge to the top-of-rack (ToR) network.
At the same time, those 10 or 25 Gbps Ethernet (GbE) networks face a second challenge from in-memory computing. Some data-center architects want to give the server CPUs access to more memory than resides on the card. Perhaps they don’t want to wait for SCM DIMMs to appear. Or perhaps they expect the working sets of their applications to grow beyond that 24 TB promise that SCM DIMMs hold out. In any case, they are pushing for remote direct memory access (RDMA) to the SSDs and DRAM busses of other server cards in the rack. In effect, they want all the DIMMs in a rack, and maybe all the fast SSDs too, to form a single unified virtual memory.
The medium for such RDMA transactions would be a memory-area network.
It’s the Network
This desire to reach and virtualize memory beyond the server card has big implications for the network inside the rack. For memory references, you want a cache miss to trigger a read that directly accesses a page of DRAM or SCM on another server card. For storage accesses, you want the NVMe commands to go to an SCM DIMM or SSD on a different card, or to a giant pool of flash (just a bunch of flash, or JBoF) on a storage card.
You can do that with software and the existing ToR network. Using hypervisor code to find the block you want and the 10 or 25 GbE driver to transport the data will work, but the latency could be in the 50 µs range. For many applications that is not viable. Analysts figure that if the latency is longer than about 10 µs, the CPU should switch to another thread rather than wait for the request. Unless the application is very rich in threads, anything much longer than 10 µs is going to be a problem for performance. In addition to the latency problem, there is a bandwidth issue: a single high-performance SSD could saturate a 25 GbE network.
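The bandwidth claim is easy to check on the back of an envelope. The SSD figure below is an assumed value for a high-end PCIe NVMe drive, not a measurement:

```python
# 25 GbE wire rate, converted to bytes per second (ignoring framing overhead).
link_bytes_per_s = 25e9 / 8   # ~3.1 GB/s

# Assumed sequential read rate for one high-end PCIe NVMe SSD.
ssd_bytes_per_s = 3.5e9       # ~3.5 GB/s

# One drive alone can more than fill the link.
print(ssd_bytes_per_s > link_bytes_per_s)  # → True
```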
The solution to these problems requires several layers of changes within the rack.
First, you need hardware-based RDMA. Instead of each Ethernet packet being assembled by a software driver, sent across the network, unpacked by software into a buffer, and then moved by software into another location for the application, the network interfaces on both ends need hardware DMA. Thus, data can move from memory or SSD on one server card into memory on another card without passing through a CPU along the way.
Next you need to make sure these latency-critical RDMA transfers don’t get stuck behind a farm wagon. We have to either separate the RDMA traffic onto its own private network, or we have to create a priority scheme for the ToR Ethernet.
The ideal would be a dedicated, point-to-point RDMA network connecting the server cards’ PCIe busses. This would open the full transceiver bandwidth of the private network to RDMA transfers and could get latency down into the 2 µs range—nearly as good as putting all the DRAM and SCM in the rack on the same DRAM bus.
But for a number of reasons—already crowded racks, added cost, and dependence on a single source, for example—most data-center operators don’t like dedicated networks, even inside the rack. So point-to-point RDMA networks will probably be limited to high-performance computers and a few private data centers with particularly demanding workloads. That leaves the ToR network to carry the traffic in the majority of cloud and data-center configurations.
The ToR Ethernet has all the connectivity that our RDMA needs. At 25 Gbps it has adequate bandwidth to handle a modest amount of RDMA activity. And with a minimum latency around 5 µs for some of the best network silicon available now, it is close to being quick enough. What vanilla 25 GbE doesn’t have is a good upper bound on latency.
That is where Converged Ethernet comes in. CE adds prioritization to the Ethernet traffic, allowing the ToR switch to give RDMA traffic priority over everything else, including those short, high-frequency messages from in-memory computations and the unending streams of traffic about cute cats. So now we have RDMA over CE, or RoCE—affectionately pronounced like the name of the famous movie boxer. RoCE offers reasonable latency for memory-to-memory transfers and for NVMe transactions between memory and remote SSDs (Figure 3).
But RoCE isn’t free. To achieve the kind of latencies we are talking about will require RoCE network adapters that support RDMA and prioritization in hardware, as well as ToR switches capable of low-latency CE switching—probably including some really big fast buffers as queues get very large. Dropping packets because of queue overflow is not really an option here.
So what does the roadmap look like, realistically? Today we are still in the transition from legacy 10 GbE ToR networks to 25 GbE, with 40 GbE on the horizon. RoCE is still uncommon, but there are implementations of RoCE network adapters both in dedicated chips and in high-end FPGAs.
Tomorrow, with wide availability of SCM DIMMs, we can envision another evolutionary step in the ToR network to go beyond just packet prioritization to fully software-defined provisioning. In such a world giant data sets would be relatively static, spread across the persistent DIMMs and in some cases, still lapping over onto the SSDs inside the rack. Applications and virtual connections would come and go based on their need for access to the data. The partitioning boundary for dividing up data sets moves from the server card, with its few TB of storage, to the rack, with perhaps 720 TB of memory capacity organized as a single virtual memory pool.
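The rack-level capacity figure follows from simple arithmetic; the count of roughly 30 server cards per rack is an assumption for illustration:

```python
cards_per_rack = 30   # assumed card count for a dense rack
tb_per_card = 24      # the per-card SCM DIMM capacity cited above

print(cards_per_rack * tb_per_card)  # → 720 (TB of pooled memory per rack)
```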
Then, as data sets push into the PB range, attention will shift to the data-center spine network. And we can start the latency discussion all over again.