In the dark world of J.R.R. Tolkien’s Lord of the Rings, the Dark Lord commanded the creation of a set of golden rings that embodied and projected his power. One in particular (Figure 1) held power over the others, with the ability to find them and bring them together.
If you follow reports in the press, you might see an analogy in the growing power of deep learning algorithms. Convolutional Neural Networks (CNNs) in particular seem to have completely triumphed in their realm. They have crushed all other algorithms in the ImageNet visual recognition challenge, so thoroughly that it appears the only remaining issues are implementation details such as the number of layers in the network.
And CNNs have won a seemingly permanent place in a range of more worldly systems, from self-driving cars to Go-playing computers. CNNs are the answer of the hour—the algorithm which seems to be summoning together and dominating all other work in artificial intelligence.
But not all power is foreordained in this particular world. Critics of deep learning point out that many issues about CNNs remain unresolved. There is debate about just what the better-than-human accuracy reports on object recognition really mean. There are unsolved problems about how to architect a CNN and—once you’ve settled on an architecture—how to predict its accuracy and performance. There are questions about how to keep CNN computational demands within reason, especially as the number of layers and nodes grows. And there are profound questions about how accuracy should be measured in the first place.
These questions hung over a recent IEEE Computer Society event at Stanford University on cognitive computing. Far from assuming the victory of CNNs, many of the papers studied quite different techniques.
One of the strongest claims for CNNs is their accuracy on the ImageNet suite. The Microsoft implementation that carried away nearly all categories of object recognition awards indeed scored better than human subjects. But that score might be a bit misleading until you understand the structure of the challenge. The object classification test requires an algorithm to detect objects in the presented images, put bounding boxes around them, and attach five category labels—from a provided list of 1000—that best identify the object. Scoring is based on getting the box close and the labels accurate.
Critics have pointed out that while the competition may give a useful comparison between algorithms trained on the same ImageNet sample images, comparison to differently trained algorithms or to humans is less informative. A human might have no difficulty locating a furry monkey in an image. But how many humans would—as a well-trained network would—identify the monkey in question as a guenon, and not a Nasalis larvatus? How many who know N. larvatus—a proboscis monkey, by the way—would be marked down for identifying the guenon—probably correctly—as a Cercopithecus dryas?
This is a quibble for the intended purpose of ImageNet. But in real-world applications, where CNNs will be expected to behave in place of, and in ways explainable to, humans, there is a real issue. Perhaps the way to express the problem is to say it is not a matter of error rate, but of error magnitude. A human might not recognize a monkey in the road as a guenon, but she would not misclassify it as a puddle, or as “no high-likelihood response available” and drive over it.
The inability to be certain a CNN will not make catastrophic errors in a novel situation is rather inherent in the networks’ structure. Once a CNN is trained it becomes nearly impossible to predict, either analytically or qualitatively, how it will respond to a new input. But we can talk in conceptual terms about what goes on in the CNN, to form some idea of the networks’ range of responses (Figure 2).
Roughly speaking, each node in a layer of a trained CNN holds an estimate of the truth of a specific proposition about the data presented at the network input. As you move from the layers nearest the input to those nearest the output, the propositions become increasingly abstract.
For example, one node near the input might represent the presence of a vertical blue line at a particular location in an image. Another might represent the presence of a red dot. Several layers deeper into the network, a node drawing upon many earlier nodes might represent the presence of a blue trapezoid surrounding a red dot. Toward the output side of the network, a node value might represent the likelihood that there is a blue truck with a red logo moving across the intersection in front of you.
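As a toy illustration of what one such low-level node computes (this is not any particular production network; the hand-built kernel stands in for one a CNN would learn), consider a single convolution filter scoring "vertical line here" at every image position:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2-D cross-correlation: each output cell is one 'node'
    scoring how strongly its input patch matches the kernel's pattern."""
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy 6x6 image with a bright vertical line in column 3.
image = np.zeros((6, 6))
image[:, 3] = 1.0

# A hand-built vertical-edge kernel; a trained CNN learns such kernels.
vertical = np.array([[-1.0, 2.0, -1.0]] * 3)

# ReLU keeps only positive evidence for the proposition "vertical line".
response = np.maximum(conv2d(image, vertical), 0.0)
```

The strongest responses land exactly where the vertical line sits; deeper layers combine many such evidence maps into more abstract propositions.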
At each layer of abstraction the number of different atomic propositions you can entertain appears to be limited by the number of nodes at that layer. In particular, the number of different tags from which a CNN can choose for an object cannot be larger than the number of outputs from the network’s final stage. That would be 1000 in the case of the carefully constrained ImageNet challenge. Developers have found that CNNs of 20 to 50 stages, with perhaps five million nodes, are appropriate for the challenge, according to IBM Almaden Research lab director Jeffrey Welser. Such scale is unsupportable today outside a data center.
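The final-stage limit is easy to see in code. Below is a sketch of a conventional 1000-way output layer; the logits are random stand-ins for a real network's scores, but the structural point holds: the network cannot name any tag it lacks an output node for.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=1000)  # one score per output node, 1000 as in ImageNet

# Softmax turns the final layer's scores into a distribution over tags.
probs = np.exp(logits - logits.max())  # subtract max for numerical stability
probs /= probs.sum()

# The five best-scoring labels are what the ImageNet challenge grades.
top5 = np.argsort(probs)[-5:][::-1]
```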
But object-tagging CNNs, with their arbitrarily restricted universe of possible tags, could be tiny compared to a network responsible for operating a car on a real-world city street. How many tags would be necessary to ensure important conclusions get made about visually ambiguous or entirely novel objects? How many propositions have to be evaluated to ensure the network selects the best trajectory in a sub-optimal, real-world driving situation? And where in the car do you put the data center?
One way out of these issues is to accelerate the training and execution of CNNs with specialized hardware. Since both tasks are mainly matrix arithmetic, chips with high memory bandwidth and lots of multipliers work nicely. Examples include GPUs and FPGAs, both of which IBM has used, according to Welser.
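Why multiplier-rich chips fit so well becomes clear when you see that convolution itself can be recast as one big matrix multiply, via the standard "im2col" unrolling. A minimal sketch, with random data standing in for a real image and learned filters:

```python
import numpy as np

def im2col(image, k):
    """Unroll every k-by-k patch of a 2-D image into a row of a matrix."""
    h, w = image.shape
    rows = []
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            rows.append(image[i:i + k, j:j + k].ravel())
    return np.array(rows)

rng = np.random.default_rng(1)
image = rng.normal(size=(8, 8))
kernels = rng.normal(size=(4, 3 * 3))  # four 3x3 filters, each flattened

# Applying all four filters everywhere becomes one matrix multiply --
# exactly the workload that high-bandwidth, multiplier-rich chips accelerate.
patches = im2col(image, 3)         # (36, 9): one row per patch
feature_maps = patches @ kernels.T  # (36, 4): one column per filter
```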
The mathematics of CNNs is well understood and stable—so long as you are using supervised training with a gradient-descent algorithm. So you could also design an ASIC to handle the computations, with the expectation that the triumph of CNNs will create a mass market to justify the development cost of the chip. At least two teams have done exactly that.
The higher-profile of the two is at Google. At its recent I/O Conference the search giant acknowledged that it had designed a deep-learning ASIC, the Tensor Processing Unit (TPU), and had been using the chips as accelerators for a year. The company claimed the AlphaGo Go-playing system was one application. Google has not discussed the architecture of the TPU, but one presumes it is organized to support the company’s TensorFlow machine-intelligence libraries.
Google is not alone. At the Computer Society event, start-up Nervana described another such chip, also called a TPU. Company CTO Amir Khosrowshahi explained that today the company provides its machine learning platform as a service, either cloud-based or on your site, using CPUs with graphics processing unit (GPU) accelerators. Soon, they will replace the GPUs with their new TPUs.
Because it is basically an array of multipliers in a sea of RAM, Khosrowshahi said, the chip can have 100 times more multipliers than a GPU. He cited the Anton molecular-dynamics computer as architectural inspiration, which the Nervana designers adapted to the memory-intensive CNN world with a greater focus on memory capacity. To that end, the chip will use high-bandwidth memory (HBM) stacked DRAM to get 1 terabyte per second (TBps) of DRAM bandwidth, and an integral NVMe port for intimate access to massive non-volatile memory.
Like the Google TPU, the Nervana chip exploits recent research showing that, while training usually requires floating-point precision, CNN execution can be done using small integers—as small as 8 bits or even less—without harming accuracy. So both chips support reduced-precision integer arithmetic.
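A rough sketch of the reduced-precision idea follows. This is a simple symmetric 8-bit scheme for illustration, not the specific quantization either chip uses: floats are mapped to signed 8-bit integers with a shared scale, the multiply-accumulate runs entirely in integers, and one float rescale recovers the result.

```python
import numpy as np

def quantize_int8(x):
    """Map float values onto signed 8-bit integers with one shared scale."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(2)
weights = rng.normal(scale=0.05, size=(16, 16)).astype(np.float32)
activations = rng.normal(size=(16,)).astype(np.float32)

qw, sw = quantize_int8(weights)
qa, sa = quantize_int8(activations)

# Integer multiply-accumulate, widened to int32 to avoid overflow,
# then rescaled once at the end.
int_result = qw.astype(np.int32) @ qa.astype(np.int32)
approx = int_result * (sw * sa)

exact = weights @ activations  # the full-precision reference
```

For a layer this size the integer result tracks the float result closely, which is the research finding both TPUs exploit.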
IBM has gone a step farther, creating a chip with almost no arithmetic at all. The TrueNorth chip is instead an emulation of biological neurons. While Nervana’s chip can accelerate all 13 of UC Berkeley’s high-performance computing motifs and can aid both training and execution, the IBM TrueNorth is not arithmetically oriented, Welser explained. Instead, it emulates the sum, difference, min/max, pulse-code, hysteresis, and many other strange behaviors observed in living nerve cells. The chips are not used in training, Welser said. That is done in the data center, often with a great deal of human intervention, and the results are then coded into the TrueNorth chips. Researchers are finding that, in addition to having a far wider repertoire than CNNs, arrays of TrueNorth chips can have execution-time behaviors equivalent to those of large CNNs, often with dramatic advantages in energy and performance.
Accelerators can speed the training and execution of CNNs, allowing them to grow almost unimaginably large. But accelerators don’t by themselves address a growing question about the limitations of the networks. “The sum of a set of labels attached to an object is not a meaning”, warned Stanford associate professor of computer science Silvio Savarese. And as applications begin to require cognitive systems that can attach meaning to a data set and reach conclusions, there is concern that CNNs may only be a piece—or not even a piece—of the whole system.
An entirely different set of approaches has grown up in the field of natural-language processing. For example, Sayandev Mukherjee, senior research engineer at DOCOMO Innovations, described a system for taking in unstructured streams of voice or text—social media streams, for example—and decomposing them into constituent tagged objects, such as proper nouns, common nouns, attributes, and sentiments. The tool builds a knowledge graph underlying the data, in which the nodes of the graph are concepts. In this environment, graph analysis tools become reasoning tools, able to find relationships or analogies and to perform induction and deduction. If there is a role for CNNs at all, it might be in the front-end speech or text recognition blocks.
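A minimal sketch of the idea, with invented concepts and relations (this is not DOCOMO's actual tool or data): a knowledge graph stored as subject-relation-object triples, plus one small "reasoning" rule that composes relations to infer facts never stated directly.

```python
# A toy knowledge graph: nodes are concepts, edges are tagged relations.
# All names here are hypothetical, for illustration only.
triples = {
    ("Shinjuku", "is_a", "district"),
    ("Shinjuku", "part_of", "Tokyo"),
    ("Tokyo", "is_a", "city"),
    ("ramen_shop_42", "located_in", "Shinjuku"),
}

def related(subject, relation):
    """Find every object linked to `subject` by `relation`."""
    return {o for s, r, o in triples if s == subject and r == relation}

def located_in_transitive(subject):
    """A one-rule reasoning tool: located_in composes with part_of,
    so a shop in Shinjuku is deducibly also in Tokyo."""
    places = related(subject, "located_in")
    frontier = set(places)
    while frontier:
        step = set()
        for place in frontier:
            step |= related(place, "part_of") - places
        places |= step
        frontier = step
    return places
```

Graph traversals like this one are what turns analysis tools into (simple) deduction tools over the concept graph.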
This sort of approach can work spatially, as well as in a knowledge graph. Savarese described a system—apparently rule-based—that locates objects and tracks them in three-dimensional space. By including location and orientation in an object’s attributes, Savarese said, the system can infer relationships among the objects that would not be apparent from a two-dimensional image. He gave the example of Jackrabbit, a delivery robot that navigates the pedestrian-cluttered Stanford campus in part by inferring the relationships between the people in its visual field. These inferences allow the bot to avoid faux pas such as motoring through the middle of a group deep in commiseration over a midterm, or barging between two students meeting for a date.
One of the challenges with inferring relationships, whether between concepts in a knowledge graph or between freshmen on the Stanford Quad, is ambiguity. Humans generate ambiguity so casually that we even rely on it for humor, irony, and other vital message types. In an evening keynote Stanford assistant professor of psychology Noah Goodman discussed the role of common sense in interpreting ambiguous natural language, and how that very non-deterministic faculty could be emulated by incorporating conditional probabilities into language parsing.
Goodman’s choice of language—perhaps not surprisingly given his background—is Lisp, the first language intended for list processing and the grand old man of λ calculus. To the base language he adds functions that implement probability distributions and conditional distributions. Goodman demonstrated how the resulting language—called Church in honor of the developer of λ calculus—could exploit composition of probability functions to reach rich and useful inferences about natural-language statements, often with very small amounts of code. In particular, Church seems adapted to dealing with natural-language constructs like puns and metaphors that tend to defeat deterministic analyses.
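Church itself is Lisp-based, but the core trick, sampling possible worlds from a prior and then conditioning on what was observed, can be sketched in a few lines of Python. The word senses and context weights below are invented purely for illustration: a listener resolving the ambiguous word "bat" by conditioning on the scene it appeared in.

```python
import random

random.seed(0)

def sample_world():
    """Prior over what 'bat' refers to, plus the scene it appears in.
    The weights are made-up stand-ins for a learned context model."""
    sense = random.choice(["animal", "baseball"])
    scene = random.choices(
        ["cave", "stadium"],
        weights=[0.9, 0.1] if sense == "animal" else [0.1, 0.9],
    )[0]
    return sense, scene

def condition(observed_scene, n=10000):
    """Church-style query by rejection sampling: keep only sampled worlds
    consistent with the observation, then read off the posterior."""
    kept = [sense for sense, scene in (sample_world() for _ in range(n))
            if scene == observed_scene]
    return kept.count("animal") / len(kept)

# Posterior probability that 'bat' means the animal, given a stadium scene.
p = condition("stadium")
```

Hearing "stadium" drives the animal reading down toward 10 percent, which is the kind of soft, composable inference Goodman builds from probability functions rather than deterministic rules.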
Is there a clear direction from all this work? IBM’s Welser says there is. But it won’t be a migration to a giant CNN, or to any one universal algorithm. “Systems will include many separate components, each with a specific capability,” Welser predicted. This happens, coincidentally, to be the structure of IBM’s Watson cognitive computing systems.
At a macro level, Welser said, cognitive systems would include both model-based simulations of the real world around them and big-data analyses of the flood of unstructured data collected directly from that world. The system would combine the two sources to refine its models, develop a distribution of probable situations, and select conclusions that move it toward its goals. This model could describe a self-driving car or a debate-winning robot with equal plausibility.
Such systems likely will employ CNNs for vision processing and similar object-detection and tagging tasks. But they will wield many other algorithms and data structures as well. Among machine-intelligence algorithms, there is, as yet, no one ring to rule them all.