I regularly write about Deepak Sekar’s blogs and his latest entry about a talk given by NVIDIA’s Chief Scientist Bill Dally is causing me to quote from Sekar’s blog once again. Sekar is MonolithIC 3D’s Chief Scientist, so I suppose I should also sport those credentials to wade in here with Sekar and Dally, but as they say, fools rush in where angels fear to tread. So here goes.
Sekar’s latest blog entry is titled “The Dally-nVIDIA-Stanford Prescription for Exascale Computing”. It discusses changes to the electronics landscape that are in turn changing the way we can and must build large computing systems. Dally’s focus is on an exascale computing platform that “only” requires 20MW of power (as in MEGA WATTS). This theme also ran through “Watt’s Next,” the talk Chris Malachowsky gave at the recent ICCAD conference in San Jose, which I wrote about earlier. (See “‘Watt’s Next?’ asks Chris Malachowsky, co-founder, NVIDIA Fellow, and Senior VP of Research”)
Dally starts by listing the energy costs for some critical computing operations:
- 1 pJ for an integer operation
- 20 pJ for a floating-point operation
- 26 pJ to move an operand over 1mm of wire to local memory
- 1 nJ to read an operand from on-chip memory located at the far end of a chip
- 16 nJ to read an operand from off-chip DRAM
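To make those numbers concrete, here’s a quick back-of-envelope calculation in Python. The single multiply-add workload is my own illustrative example, not something from the talk:

```python
# Energy figures quoted from Dally's talk (via Sekar's blog).
PJ = 1e-12  # picojoule, in joules
NJ = 1e-9   # nanojoule, in joules

ENERGY = {
    "int_op":        1 * PJ,   # integer operation
    "fp_op":        20 * PJ,   # floating-point operation
    "wire_1mm":     26 * PJ,   # move an operand 1mm to local memory
    "far_on_chip":   1 * NJ,   # read an operand from the far end of the chip
    "off_chip_dram": 16 * NJ,  # read an operand from off-chip DRAM
}

# Hypothetical workload: one floating-point multiply-add on two
# operands, each fetched from off-chip DRAM.
compute  = 2 * ENERGY["fp_op"]           # 40 pJ of actual math
movement = 2 * ENERGY["off_chip_dram"]   # 32,000 pJ of data movement
print(f"movement / compute = {movement / compute:.0f}x")  # → 800x
```

Even this crude sketch shows the data movement dwarfing the arithmetic by nearly three orders of magnitude.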
The fast conclusion you can make here is that in the upside-down world of nanometer SoC design, computation costs much less energy than moving operands to and from the computation units. So what does that mean in terms of system design?
It means we need to stop designing systems the way we did when transistors were expensive and energy was “free.” By that, I mean that using off-chip SDRAM as a main-memory buffer shared by several on-chip processors needs some deep consideration and some well-deserved optimization.
Why? Because power is the name of the game today. Just ask anyone. Power is the limiting factor in System and SoC Realization today. That’s because:
- There’s not enough power available from today’s battery systems and “battery’s law” is a slow, linear law in contrast to the world-eating exponential law known as Moore’s Law.
- All nanometer SoCs are power limited because we can no longer get the heat out of them. That’s why we’re talking about dark silicon like scientists talk about dark matter and dark energy.
- Heat sinks are costly. Fans make noise and degrade reliability. Fans are also relatively large (lots bigger than an SoC) and industrial designers are not big fans of fans because fans impose unsightly bulges on swoopy cases that make consumers pant with geek lust.
So if we want to cut power and energy consumption, we need to stop moving so much data around. That’s the message from the preceding paragraphs.
We also need to stop multitasking. Think about it for a moment. Why do we burden an individual processor with many tasks? It’s because we “think” processors are expensive (it’s a belief four decades in the making), so we feel the need to impress as many tasks on each processor as possible to amortize the costs.
But are processors expensive? Still? The transistors surely aren’t. That’s the point of nursing Moore’s Law along with immersion lithography, high-K metal-gate construction, and a collision course for X-ray litho. Clearly, processors are no longer so expensive. Even inexpensive microcontroller chips and application processors are solidly marching into the dual-core camp. Four-processor chips at the low end are not far off. (Witness the recent ARM Cortex-A7 launch with its 4-core configuration just ready to be adopted.)
What about saving energy? Doesn’t multitasking give you that? Well, it turns out that the answer to that question is “No, it doesn’t.” Not really. Here’s Dally’s number from Sekar’s blog: “the overhead associated with branch prediction, register renaming, reorder buffers, scoreboards and other functionality common in today’s superscalar, out-of-order cores” can burden a 20pJ operation with as much as 2000pJ of extra energy, all spent on making one processor go faster. Why add all that superscalar, out-of-order hardware? So that one processor can go faster. Why? So it can run more tasks, of course.
Looked at another way, you could add 100 extra processor cores and break even on the energy costs. You would also extricate your software team from the unpleasant morass of coding 101 applications to run on the same processor core with all of the consequential inter-task interference that could result. It will just cost you more transistors. Oh, yeah—so what?
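The break-even arithmetic is worth spelling out. Here is a sketch of the reasoning using the 20pJ/2000pJ figures above (my own arithmetic, not Dally’s exact accounting):

```python
# Energy per operation on a big superscalar, out-of-order core,
# per Dally's figures as quoted in Sekar's blog.
useful_work = 20    # pJ: the operation itself
overhead    = 2000  # pJ: branch prediction, renaming, reorder buffers...

big_core_op = useful_work + overhead  # 2020 pJ per useful operation

# How many stripped-down 20 pJ/op cores could run for the same energy?
simple_cores = big_core_op // useful_work
print(simple_cores)  # → 101
```

So roughly 100 extra simple cores come energy-free relative to one big core’s per-operation overhead, which is where the break-even claim comes from.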
We’re not all designing exascale supercomputers that require Niagara Falls for power and cooling. However, we are all on a clear collision course with the era of multicore design. It is time—past time—to think about what is scarce (power and energy) and what is abundant (transistors) in this world and to design accordingly.
Current NVIDIA GPUs are delivering around 6GFLOPS/W and increasing at only 1.4x per annum. Against this background, TI has already thrown its hat into the HPC ring with its 16GFLOPS/W C66x DSPs (http://newscenter.ti.com/Blogs/newsroom/archive/2011/11/14/new-quot-lows-quot-in-high-performance-computing-ti-s-tms320c66x-multicore-dsps-combine-ultra-low-power-with-unmatched-performance-offering-hpc-developers-the-industry-s-most-power-efficient-solutions-862402.aspx), with Movidius at 50GFLOPS/W also in the frame. Dally’s work on ELM (http://cva.stanford.edu/projects/elm/architecture.htm), with a 20x improvement in energy per operation, is now dictating the strategy at NVIDIA.
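Those GFLOPS/W figures map directly onto the 20MW exascale target. A quick derivation (the 1.4x-per-year extrapolation is my own arithmetic from the numbers above):

```python
import math

# Exascale at a 20 MW budget: required efficiency in GFLOPS/W.
target_flops = 1e18   # one exaFLOPS
power_w      = 20e6   # 20 MW
required = target_flops / power_w / 1e9
print(f"required: {required:.0f} GFLOPS/W")  # → 50 GFLOPS/W

# Years for today's ~6 GFLOPS/W GPUs to get there at 1.4x per annum.
years = math.log(required / 6) / math.log(1.4)
print(f"at 1.4x/yr: ~{years:.1f} years")  # → ~6.3 years
```

In other words, the 20MW budget demands 50GFLOPS/W, which is why a 16GFLOPS/W DSP or a 20x energy-per-operation improvement is so interesting.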
BTW, the 20MW limit appears to be based on the German national supercomputer centre at Julich (http://www.gauss-centre.eu/about-gcs/). This centre has a 30km hardline from a nuclear power plant with an option to upgrade to 20MW. Thus the real limit for HPC today is two or more times lower than the 20MW figure would suggest.