Last night, Michael Shebanow took members of the local IEEE Computer Society deep into the world of graphics processing and graphics processing units (GPUs). Over an hour and a half, he revealed some interesting and surprising facets of the topic. Shebanow is a Principal Research Scientist and has been with NVIDIA since 2003. He’s also worked on nearly every major RISC and CISC 32-bit microprocessor architecture (with one obvious exception) including 32- and 64-bit x86 processors, SPARC v9, the Motorola 68000, and the Motorola 88000—clearly someone who knows the ins and outs of processor architecture.
Shebanow started his talk, which took place in the Cadence Building 10 auditorium in San Jose, by running through a history of GPU development starting in 1995 with non-programmable graphics accelerators. The first programmable graphics chip from NVIDIA was the NV4, which had programmable “register combiners.” This chip had register-level programmability—you could turn certain features on and off—but it wasn’t a GPU. The first fully programmable GPU from NVIDIA, said Shebanow, was the NV30 introduced in 2003. By 2006, NVIDIA had developed the concept of the streaming multiprocessor (SM), a multithreaded compute element that was adept at the kinds of calculations required of a GPU.
Jumping to the latest GPU generation, we get to Fermi. At this point, NVIDIA is building GPUs that can more easily be programmed with familiar high-level languages (HLLs) such as C and C++. The Fermi architecture consists of 16 SMs.
Here’s a block diagram of one SM core in the Fermi architecture:
Each SM contains two fetch/decode/issue units, called the “Warp Scheduler” and “Dispatch Unit” in the above diagram. There’s a large shared register file in the SM that’s programmably distributed for use by the 32 CUDA (Compute Unified Device Architecture) cores in the SM. Each of these cores contains a floating-point and an integer unit. The SM also contains 16 load/store units and four “Special Function Units” (SFUs) used largely for graphics. There’s also 64 Kbytes of on-chip memory per SM, split between shared memory and L1 cache.
Shebanow said that half of the design effort for Fermi was spent on making the architecture a better HLL target by including features such as exception processing, debug support, an L1 cache for the SMs, larger shared memory, and better double-precision math.
The Fermi architecture contains 16 SMs, for a total of 512 CUDA cores and 256 load/store units. In the initial Fermi implementation, a chip called the GF100, a maximum of 15 of the 16 SMs are enabled. That’s enough, said Shebanow, to run 23,000 threads. Now there aren’t enough hardware resources to run all of those threads simultaneously on a Fermi chip, but the chip can keep track of that many threads at one time and schedule them appropriately.
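The 23,000-thread figure is easy to sanity-check with a little arithmetic. As a hedged sketch: assuming Fermi’s published per-SM limit of 48 resident warps of 32 threads each (a figure from NVIDIA’s documentation, not from the talk itself), the numbers line up almost exactly:

```python
# Back-of-envelope check of the ~23,000-thread figure.
# Assumption (not stated in the talk): Fermi's documented per-SM limit
# of 48 resident warps x 32 threads per warp = 1,536 threads per SM.
THREADS_PER_WARP = 32
WARPS_PER_SM = 48
ENABLED_SMS = 15  # the GF100 ships with 15 of its 16 SMs enabled

threads_per_sm = THREADS_PER_WARP * WARPS_PER_SM  # 1,536 threads
total_threads = threads_per_sm * ENABLED_SMS      # 23,040 threads

print(total_threads)  # 23040, i.e. roughly the 23,000 Shebanow quoted
```

With all 16 SMs enabled, the same arithmetic would give 24,576 resident threads.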
Shebanow said that the spare SM core was used to enhance yields. That’s easy to understand given the chip’s specs. It consists of 3 billion transistors spread out over 644 mm². The chip runs at 1.5GHz and consumes 225W. You can get more GF100 details, plus details of the second-generation GF110 chip, from this article on the Tom’s Hardware site. Both the GF100 and GF110 are made using TSMC 40nm process technology.
The development of either the GF100 or the GF110 is indeed a mighty feat of Silicon Realization, but that isn’t what this particular blog post is about. It’s what Shebanow said next that served as the trigger for this post. When you throw 3 billion transistors at a problem and produce a machine that can handle 23,000 execution threads, you’ve got to be really careful about running into bottlenecks. Some bottlenecks can be avoided by properly structuring the code run on the GPU. Nevertheless, the Fermi architecture clearly has a voracious appetite for data.
Shebanow used an apt analogy. He said that current multi-GHz general-purpose CPUs were like sharks. They can “do a lot of damage”—meaning execute a lot of single-threaded code—very quickly. GPUs, on the other hand, are like a school of piranha. Each CUDA core can execute a “lightweight thread.” All 512 CUDA cores working simultaneously can also do a lot of damage, which is to say execute a lot of lightweight threads quickly.
Because the Fermi architecture is still a graphics chip, it must exist in a pre-existing system design. The GF100 and GF110 chips are soldered to graphics cards that plug into graphics slots in PCs. So you get the resulting system architecture:
Data in the CPU memory needs to move to the GPU memory before being processed by the Fermi GPU. As you can see from the diagram, the link between the GPU and its memory has an extremely high bandwidth—178 Gbytes/sec—as a result of using the GDDR5 memory interface protocol with a memory array that’s 384 bits wide. The link between the CPU and its memory will be narrower and will probably be an order of magnitude slower, but it’s the PCIe motherboard link between the CPU and GPU that’s the real bottleneck in this design. With poorly structured code that requires as many or more data-movement operations as math operations, a GPU like Fermi that is capable of Teraflop performance can easily be throttled to a few Gigaops because of this bottleneck.
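A quick model makes the throttling concrete. As a hedged sketch, assume a PCIe 2.0 x16 slot delivering roughly 8 Gbytes/sec per direction (a nominal figure for that era’s motherboards, not a number from the talk), and a worst-case kernel that performs one single-precision operation per 4-byte word shipped across the bus:

```python
# Rough model of the PCIe bottleneck described above.
# Assumptions (not from the talk): PCIe 2.0 x16 moves roughly
# 8 GB/s per direction, and the worst-case kernel does one
# single-precision math operation per 4-byte operand moved.
PCIE_BW = 8e9        # bytes/s over the motherboard PCIe link (assumed)
GDDR5_BW = 178e9     # bytes/s over the on-card GDDR5 link (from the diagram)
BYTES_PER_OP = 4     # one float fetched per math operation

# Peak ops/s when every operand must cross the PCIe link:
pcie_limited_ops = PCIE_BW / BYTES_PER_OP    # 2e9 -> ~2 Gigaops
# Same worst-case kernel fed from on-card GDDR5 instead:
gddr5_limited_ops = GDDR5_BW / BYTES_PER_OP  # ~44.5e9 -> ~44.5 Gigaops

print(pcie_limited_ops / 1e9, gddr5_limited_ops / 1e9)
```

Under these assumptions, a Teraflop-class GPU fed entirely over PCIe sustains only about 2 Gigaops—exactly the “few Gigaops” ceiling described above—while the same data-bound kernel fed from on-card GDDR5 runs more than 20 times faster.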
Now NVIDIA is stuck with this pre-existing PC system design, but we are not constrained by legacy or reality in this blog post. We are free to think of ways to break this bottleneck. One of the first things we might try is to create a direct, high-bandwidth path between the CPU memory and the GPU memory. That system architecture would look like this:
But why stop there? What we really want is for both processing units to be able to access both memories. So we might develop an architecture that looks like this:
Here we’re using a dual-ported memory controller that handles requests from the GPU and CPU to make the GPU and CPU memory arrays equally accessible to either processing unit. We have two different memory arrays because the GPU memory must be very wide and must be built from fast GDDR5 memory chips, which are somewhat more expensive per bit than standard DDR2 or DDR3 memory modules. So we might prefer to minimize the amount of GDDR5 memory using this approach to control system costs.
But perhaps we care more about memory performance than memory cost. In that case, we might adopt a unified-memory approach as shown in the following diagram:
This approach might work well because the required bandwidth for CPU memory accesses might represent only 10% of the overall bandwidth requirement for the unified memory, and running the GDDR5 memory chips 10% faster (if possible) might be all that’s needed.
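The 10% estimate above works out to a modest absolute number. A minimal sketch, using the post’s own figures:

```python
# Sketch of the unified-memory bandwidth argument above.
# Assumption (the post's own estimate): CPU traffic adds about
# 10% on top of the GPU's bandwidth demand.
GDDR5_BW = 178e9    # bytes/s, the GPU's requirement from the diagram
CPU_FRACTION = 0.10 # CPU share of unified-memory traffic (estimated)

unified_bw = GDDR5_BW * (1 + CPU_FRACTION)  # total demand on unified memory

print(round(unified_bw / 1e9, 1))  # ~195.8 GB/s
```

So the unified array would need to sustain roughly 196 Gbytes/sec—about 18 Gbytes/sec more than the GPU-only figure, which is the increment a 10% GDDR5 clock bump would have to cover.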
Shedding even more system constraints, another alternative is to place the CPU and GPU on the same piece of silicon with a dual-ported memory controller, and then connect that chip to separate memory arrays as shown in this next diagram:
In fact, that’s sort of the architecture AMD discussed last month at the Hot Chips Conference, albeit with somewhat lower performance objectives. Take a look at the AMD Llano APU architecture and see if you also note the similarities:
So which of these many architectures is “the best”? Ah, that’s a question worth some effort to solve, is it not? Well, that is the realm of System Realization. (Warning: Obvious EDA360 tie-in.) You need some application code and good system-level models of the CPU, GPU, memory arrays, and interconnect to help you decide which architecture to pick for a specific application. If you have good System Realization tools, like the Cadence System Development Suite, then you will be able to find the answer to these performance questions more objectively.
Otherwise, you’ll just be guessing.