What is the future of brick-and-mortar retail? Imagine, if you will, two scenarios.
In one scenario, shopping malls, main-street stores, and even big-box stores slip into a downward spiral. Customers shop online, starving brick-and-mortar stores of revenue. The stores cut staff, inventories, and maintenance to preserve profits, alienating even loyal customers. In the end, whole malls and shopping districts fall derelict, eventually to be repurposed as light-industry sites, office parks, or affordable housing. The residential neighborhoods around them wither.
In the other scenario, retail stores begin a metamorphosis. From impersonal, utilitarian, self-service stock displays and warehouses they become destinations: rich in sensory pleasures, locations for events, selfie-worthy, and above all personal. Online shopping becomes a niche for the poor or immobile as retail clusters thrive, winning back shoppers’ presence, time, and money.
The difference between these two scenarios, some analysts say, is the disciplined application of machine learning, supported by a new balance between cloud and edge computing. To understand this claim, step for a moment into a retail store of a decade from now—let’s call it Future Store.
It’s the Experience
The first thing you notice on entering Future Store is the space—something like a neighborhood coffee bar, or even a parlor. It is small, quiet, and comfortable: no sign of giant, noisy open warehouses teeming with shoppers in mazes of display racks. The next thing you notice is that you are greeted—by name, if you so choose—by either a virtual assistant or a trained human—again, your choice. You may be offered your favorite beverage, and perhaps a choice of music.
From here your shopping experience would be familiar to anyone who has worked with a personal shopper at a high-end retailer. You discuss what you are trying to achieve. Your shopper brings you samples to see, feel, smell, hear, or taste. The unassailable advantage of the brick-and-mortar (B&M) store is the physical presence of the merchandise. You may use augmented reality (AR) (Figure 1) to assist you—for instance, posing in front of an AR virtual mirror to see how a new item of clothing would look on you in different contexts, without having to try it on. Or maybe the AR experience walks you through how you would use that giant gas grill, or gives you the experience of actually flying that racing drone.
All the time, your shopper will be gauging your emotions, offering alternatives and accessories, up-selling you when it is appropriate, even unobtrusively estimating your measurements to ensure clothing will fit properly. When you have made your selections, the correct items will be wrapped for you or scheduled for shipment, and your account charged. No waiting in checkout lines or fishing for credit cards. There is even time for another celebratory beverage.
Future Store may be a viable alternative to the death of B&M stores, at least for goods for which physical presence conveys important information to the customer. But there are three major obstacles to be overcome before the virtual personal shopper can save Main Street. First, the business case must make sense to retailers. Retailers, after all, live on some of the slimmest gross margins in the entire economy, and must be able to see a near-term payback for any capital investment.
Second, the technology must be adequate to provide an interactive user experience that never distracts, becomes implausible, or gets creepy. Third, customers must enjoy and embrace this novel blend of humans, physical objects, machine intelligence, and AR.
Today, the business-model question is controlling the pace of adoption. Prabhat Gupta, CEO at machine-learning specialist Megh Computing, observes that some retailers are looking at the whole range of possibilities, from facial recognition to motion tracking to reading emotions and gestures to AR. But many want to start with a small investment that will yield an immediate return without having to redesign retail floors or upset customers.
It turns out there are such low-hanging fruits—and they are not just in high-end retail ateliers. They even dangle in supermarkets. One of the most attractive to retailers is fraud prevention. At a supermarket self-checkout stand, for instance, it is not hard with a little sleight of hand to trick the scanner into missing an item. Human monitors can usually catch such ploys, and so, it turns out, can inexpensive security cameras backed by well-trained convolutional neural networks (CNNs) running deep-learning video analytics. For retailers, the investment is not that high, and the returns can begin the moment the system is switched on—and can amount to several percent of sales.
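At its core, the checkout-fraud check is a reconciliation step: compare what the barcode scanner registered against what the vision model believes passed over the scan bed. A minimal sketch of that step follows; the function name, item labels, and confidence threshold are purely illustrative assumptions, not any vendor’s API.

```python
from collections import Counter

def flag_discrepancies(scanned, detected, min_confidence=0.8):
    """Return items the vision system saw but the scanner did not register.

    scanned:  list of item labels reported by the barcode scanner.
    detected: list of (label, confidence) pairs from the CNN classifier.
    """
    # Keep only detections the model is reasonably confident about.
    seen = Counter(label for label, conf in detected if conf >= min_confidence)
    # Subtract what was actually scanned; anything left over is suspicious.
    seen.subtract(Counter(scanned))
    return {label: count for label, count in seen.items() if count > 0}

scanned = ["milk", "bread"]
detected = [("milk", 0.95), ("bread", 0.91), ("steak", 0.88)]
print(flag_discrepancies(scanned, detected))  # {'steak': 1}
```

In a real deployment the discrepancy would trigger a discreet alert to a human attendant rather than an automatic accusation, since the classifier itself can be wrong.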
The concept extends beyond big-box stores’ self-checkout lines. More cameras, aimed at shopping aisles rather than at checkout counters, coupled with a differently trained machine-learning system—but probably still just a CNN—could detect other undesired activities, such as shoplifting or tampering. Such systems could also monitor inventory levels on shelves and detect misstocking. Again, the cost is modest and the returns immediate.
Once the cameras are in place, it is a reasonable step to facial recognition and—with the customer’s permission—tracking. Even sentiment analysis based on posture, gestures, and expressions should be feasible, although these measures may require different kinds of machine-learning models.
These latter steps mark a fundamental change in the business case. Fraud and theft prevention add directly to the bottom line by preventing what the retail business politely calls shrinkage. But collecting data about customer behavior and sentiment will only help earnings if analysis of the data leads to increased sales or margins. This effort will require yet more machine-learning algorithms to convert customer behavior and attitude data into predictions of future buying behavior. It will require smart signage and eventually AR to translate the predictions into experiences that invite the customer to buy. The investment will be more substantial, the returns less predictable and more delayed. But all the groundwork is in place for the first steps toward Future Store.
Making it Work
The second barrier to Future Store is technology. Here, the news is generally good. Systems emerging now appear ready to meet the requirements of the early stages of deployment—that low-hanging fruit. CNN algorithms show promise for at least the fraud-detection, shelf-monitoring, and checkout-free sales use cases. They may even be adequate for the less clever forms of shoplifting. Facial recognition and some level of gesture detection are more mature, even though they have yet to be widely deployed in a retail context.
One wrinkle is that many of these algorithms, especially CNN inference, have evolved in data centers, executing on server CPUs, often supported by graphics processing unit (GPU) accelerators. Such configurations tend to process large batches of images at once in order to get high utilization of the GPU cores and to avoid memory thrashing. But in this application, latency is paramount. You may have only seconds to detect fraud at a self-checkout kiosk before it is too late to confront the perpetrator. This constraint argues for local inference processing, either in very small batches or in a data-streaming mode. Not only is it important to avoid the latency uncertainties and availability issues of a remote data center, but increasingly stringent privacy and data-locality laws may prevent use of a data center outside the store’s metro area. These issues also argue for moving inference to the edge.
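A back-of-envelope model makes the batching problem concrete: before a batch can be processed, it must first fill with frames arriving at the camera’s rate, and that queueing delay dominates the latency budget. The frame rate and compute times below are illustrative assumptions, not measured figures for any particular accelerator.

```python
def worst_case_latency(batch_size, fps, batch_compute_s):
    """Worst-case end-to-end latency for the first frame in a batch:
    time waiting for the batch to fill, plus compute time for the batch."""
    fill_delay = (batch_size - 1) / fps  # frames arrive one per 1/fps seconds
    return fill_delay + batch_compute_s

# A data-center GPU needing 0.25 s per 32-frame batch, fed by a 30 fps camera:
print(worst_case_latency(32, 30.0, 0.25))  # ~1.28 s — likely too slow to intervene

# An edge accelerator streaming single frames at 0.03 s each:
print(worst_case_latency(1, 30.0, 0.03))   # 0.03 s — well within budget
```

The arithmetic shows why throughput-optimized batching and interactive loss prevention pull in opposite directions, and why streaming inference at batch size 1 fits the retail use case.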
Accordingly, a number of startups are adapting edge-computing systems with hardware vision processing and inference acceleration to the needs of B&M retailers. Architectural approaches range from smart cameras to quite centralized edge server clusters—essentially tiny data centers in a box.
Megh Computing has chosen an approach using dumb cameras, streaming the resulting video to an edge server cluster accelerated by FPGAs (Figure 2). Not only can the FPGAs be optimized for stream processing of video analytics and deep-learning inference, but they can be reconfigured as the application evolves, from CNNs for inventory management, fraud control, or facial recognition to more demanding models like long short-term memory (LSTM) networks for tracking, natural-language interaction, or sentiment detection.
Another startup, PointR.ai, is taking a different architectural approach, using a network of compact, low-power computing modules rather than a rack of server cards. Each module supports up to dozens of cameras. PointR focuses on store-wide coverage and uses machine learning to identify when a customer puts an item in their cart—allowing checkout-free purchasing. The system can also identify when an item is out of stock or when an item is shelved incorrectly, and it may have potential for detecting shoplifting from the shelves as well.
Extrapolating from these examples, a rather radical partitioning scheme suggests itself (Figure 3). The early layers of a CNN—the layers concerned with feature mapping and object extraction—in which there is a great deal of data but relatively low connectivity, can be pushed closer to the cameras, minimizing the need to transport and manage massive amounts of video data centrally. The later layers, which are often fully connected and which work on scene classification and interpretation, can be centralized in a cluster of servers and accelerators. This concentrates computing resources on the most connection-intensive layers of the neural network and potentially allows inference from data across multiple cameras.
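A rough data-volume calculation shows why this split is attractive: after the early convolution and pooling stages, the feature map shipped to the central cluster is far smaller than the raw frame. The layer shapes below are illustrative, not taken from any particular network.

```python
def tensor_bytes(channels, height, width, bytes_per_elem=1):
    """Size of an activation tensor, assuming 8-bit quantized elements."""
    return channels * height * width * bytes_per_elem

# A raw 1080p RGB frame leaving the camera:
raw_frame = tensor_bytes(3, 1080, 1920)      # 6,220,800 bytes

# An illustrative feature map after the early conv/pool stages
# (256 channels over a heavily downsampled spatial grid):
feature_map = tensor_bytes(256, 17, 30)      # 130,560 bytes

print(raw_frame // feature_map)  # ~47x less data to move per frame
```

Even with illustrative numbers, the order-of-magnitude reduction explains why pushing the early layers to the camera side eases the transport problem, at the cost of the partitioning complications discussed next.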
But partitioning a CNN in this way between quite different kinds of computing systems is relatively unexplored territory, at least compared to partitioning the network between a CPU and a GPU, FPGA, or ASIC. And when the application’s needs move beyond CNNs to the various types of recurrent networks, the need to feed state back into the network—possibly across the partitioning boundary—could further complicate things.
The AR features of our Future Store present a different set of challenges. For a clothing store, for example, the AR system must accept live video streams, probably from several cameras aimed at the customer. It must quickly reduce that data to a dynamic model of the customer’s partially clothed body—probably a finite-element surface model, perhaps augmented by a dynamic model of the underlying tissues and bones. Then the system must use a model of the new garment to drape the body model accurately. This image is then merged with live camera data to produce a moving, real-time image for the virtual mirror.
All of this requires a great deal of computation on strict real-time deadlines. The image in the mirror can’t lag noticeably behind the customer’s movements, or it will be disorienting. It is intriguing to note that the mathematical operations involved—lots of matrix arithmetic—are quite similar to the operations used in inference. There may be an opportunity to share acceleration hardware. And it may be possible to shortcut some of the computations, especially for draping, by using a machine-learning model to predict how the garment will hang and move.
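The hardware-sharing observation can be made concrete: a fully connected inference layer and a garment-mesh transform both reduce to the same matrix-multiply primitive. The sketch below uses illustrative shapes; the layer sizes and mesh resolution are assumptions, not taken from any real system.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fully connected inference layer: activations (batch x in) @ weights (in x out).
activations = rng.standard_normal((1, 512))
weights = rng.standard_normal((512, 128))
logits = activations @ weights                    # shape (1, 128)

# Garment-mesh transform: vertices (n x 3) @ rotation (3 x 3), plus translation.
vertices = rng.standard_normal((10_000, 3))       # draped-garment surface mesh
rotation = np.eye(3)                              # identity rotation for brevity
translation = np.array([0.0, 0.1, 0.0])           # shift mesh up 10 cm
moved = vertices @ rotation + translation         # shape (10000, 3)

print(logits.shape, moved.shape)
```

Because both workloads are dominated by dense matrix arithmetic, an accelerator built for one can plausibly time-share with the other—one reason the shared-hardware opportunity mentioned above is credible.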
There is also a third computation site to consider: the cloud. For the foreseeable future, edge-based inference systems will rely on cloud data centers for periodic retraining—when new patterns of checkout fraud appear, for instance, or when a retailer adds new products. This could mean regular traffic in image data going up to the cloud and newly trained network models coming back. But there may also be opportunities to apply the enormous resources of the cloud to demanding network models—deep fully connected networks or large neuromorphic networks, for example—when these models can produce valuable results and strict real-time response is not necessary.
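The edge-to-cloud loop just described can be sketched in a few lines: the edge uploads hard examples (misclassified or low-confidence frames), and the cloud retrains and returns a new model version once enough have accumulated. Every name and the retraining trigger below are hypothetical placeholders, not a real service’s API.

```python
class RetrainingLoop:
    """Toy model of the periodic edge-to-cloud retraining cycle."""

    def __init__(self):
        self.model_version = 1
        self.pending_examples = []

    def upload_hard_example(self, frame, label):
        """Edge side: queue frames the deployed model handled poorly."""
        self.pending_examples.append((frame, label))

    def maybe_retrain(self, threshold=100):
        """Cloud side: retrain once enough new examples accumulate,
        then ship a new model version back to the edge."""
        if len(self.pending_examples) >= threshold:
            # ... actual training on pending_examples would happen here ...
            self.pending_examples.clear()
            self.model_version += 1
        return self.model_version

loop = RetrainingLoop()
for frame in range(250):
    loop.upload_hard_example(frame, label="fraud")
    loop.maybe_retrain()
print(loop.model_version)  # 3: two retraining cycles over 250 uploaded examples
```

In practice the threshold would more likely be a schedule or a drift-detection signal than a raw example count, but the shape of the loop—data up, models down—is the same.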
The Third Barrier
Technology available or in development today appears capable of taking us a long way down the road to Future Store. But for the application to be commercially viable, there is one more barrier to be overcome: customer acceptance. In the early stages, such as fraud detection and stock control, this is not such an issue. These functions can be invisible to customers—at least the honest ones—save for a courtesy notice that they are in use.
But as the services become more personal, they become more intrusive. Facial recognition and tracking in particular are at risk of developing a very bad reputation due to their use for surveillance in some authoritarian countries. These applications may require explanation and an active customer opt-in on every visit. In some regions this may be a legal requirement. Sentiment estimation will be even more sensitive—across the line into creepy, for some people. And the end goal of smooth interaction among the customer, a human personal shopper, AR displays, and the machine-learning environment will have to be executed flawlessly to be effective.
Such a degree of integration will require maximum latencies in the tens of milliseconds on critical human-machine interactions. It will require the machine-learning system to track the cognitive state and emotions of two interacting humans in order to avoid inappropriate or merely obtuse interjections. Even the best deep-learning network must carefully choose its response to “Does this make me look fat?”
So there is much work to be done, some of it probably beyond the range of just making existing machine-learning models bigger and faster. But there are steps that can be taken now, with today’s technology. Individually they have solid business cases. And they can be composed into a quite compelling Future Store without huge hardware replacement as the technology evolves. It’s time to get started.