Amazon’s Elastic Compute Cloud (EC2) offers businesses the opportunity to rent scalable servers and host applications and services remotely, rather than buying and managing that infrastructure on their own. The service, which first entered beta a little more than ten years ago, has historically focused on CPUs, but that’s changing now, courtesy of a newly unveiled partnership with Nvidia.
According to joint blog posts from both companies, Amazon will now offer P2 instances that include Nvidia’s K80 accelerators, which are based on the older Kepler architecture. Those of you who follow the graphics market may be surprised, given that Maxwell has been available since 2014, but Maxwell was explicitly designed as a consumer and workstation product, not a big-iron HPC part. The K80 is based on GK210, not the top-end GK110 parts that formed the basis for the early Titan GPUs and the GTX 780 and GTX 780 Ti. GK210 offers a larger register file and much more shared memory per multiprocessor block, as shown below.
The new P2 instances unveiled by Amazon will offer up to 8 K80 GPUs, each with 12GB of RAM and 2,496 CUDA cores. All K80s support ECC memory protection and offer up to 240GB/s of memory bandwidth per GPU. One reason Amazon gave for its decision to offer GPU compute, as opposed to focusing on scaling out with additional CPU cores, is the so-called von Neumann bottleneck. Amazon states: “The well-known von Neumann Bottleneck imposes limits on the value of additional CPU power.”
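Taking the quoted figures at face value, the aggregate resources of a fully loaded eight-GPU P2 instance work out as simple arithmetic. A back-of-the-envelope sketch, not AWS’s official spec sheet:

```python
# Back-of-the-envelope totals for a P2 instance with 8 K80 GPUs,
# using the per-GPU figures quoted above (assumed, not official AWS specs).
GPUS = 8
CUDA_CORES_PER_GPU = 2496
RAM_GB_PER_GPU = 12
BANDWIDTH_GBS_PER_GPU = 240

total_cores = GPUS * CUDA_CORES_PER_GPU          # 19,968 CUDA cores
total_ram = GPUS * RAM_GB_PER_GPU                # 96 GB of GPU memory
total_bandwidth = GPUS * BANDWIDTH_GBS_PER_GPU   # 1,920 GB/s aggregate

print(total_cores, total_ram, total_bandwidth)
```

That aggregate memory bandwidth, nearly 2TB/s across the instance, is the point of Amazon’s von Neumann argument below: it is far more than any CPU-based instance can offer.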
This is a significant oversimplification of the problem. When John von Neumann wrote “First Draft of a Report on the EDVAC” in 1945, he described a computer in which program instructions and data were stored in the same pool of memory and accessed by the same bus, as shown below.
In systems that use this model, the CPU can access either program instructions or data over the shared bus, but not both at once. Nor can it transfer data to or from main memory nearly as quickly as it can perform work on that data once the information has been loaded. Because CPU clock speeds increased far faster than memory performance in the early decades of computing, the CPU spent an increasingly large amount of time waiting on data to be retrieved. This wait-state became known as the von Neumann bottleneck, and it had become a serious problem by the 1970s.
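The bottleneck is easy to observe from userspace: when a computation streams a working set far larger than the CPU caches, throughput is set by the memory bus, not by arithmetic speed. A minimal sketch, assuming NumPy is available (the absolute numbers will vary by machine):

```python
import time
import numpy as np

# Illustrating the von Neumann bottleneck: one multiply per element is
# trivial work for the CPU, but every element must still cross the
# memory bus, so the loop runs at memory speed, not ALU speed.
N = 20_000_000            # ~160 MB of float64, far larger than CPU caches
a = np.ones(N)

start = time.perf_counter()
b = a * 2.0               # streams the whole array through memory
elapsed = time.perf_counter() - start

bytes_moved = a.nbytes + b.nbytes   # read a, write b
print(f"effective bandwidth: {bytes_moved / elapsed / 1e9:.1f} GB/s")
```

On typical desktop hardware the figure printed lands in the tens of GB/s, an order of magnitude below a single K80 GPU’s memory bandwidth.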
An alternative architecture, known as the Harvard architecture, offers a solution to this problem. In a Harvard architecture chip, instructions and data have their own separate buses and physical storage. But most chips today, including CPUs built by Intel and AMD, can’t be cleanly described as either Harvard or von Neumann. Like CISC and RISC, terms that began as two distinct approaches to CPU design and have since been muddled by decades of convergence and shared design principles, the two memory models have blurred together: CPUs today are best described as modified Harvard architectures.
Modern chips from ARM, AMD, and Intel all implement a split L1 cache, with instructions and data stored in separate physical locations. They use branch prediction to determine which code paths are most likely to be executed, and they can cache both instructions and data in case that information is needed again. The seminal paper on the von Neumann bottleneck was delivered in 1977, before many of the defining features of today’s CPU cores had even been invented. GPUs have far more memory bandwidth than CPUs do, but they also operate on far more threads at the same time and have much, much smaller caches relative to the number of threads they keep in-flight. They use a very different architecture than CPUs do, but it’s subject to its own bottlenecks and choke points as well. I wouldn’t call the von Neumann bottleneck solved — when John Backus described it in 1977, he railed against programming standards that enforced it, saying: