3D Thursday: Let’s end 2011 with a high-performance DRAM memory stack design. How would you improve it?

For the last 3D Thursday blog post of 2011 in the EDA360 Insider, I thought I’d take a flight of fancy and try to put as many of this year’s 3D IC concepts as possible together to see what we might get. I started thinking about the year’s major announcements and here’s my short list:

- The Micron/IBM Hybrid Memory Cube (HMC)
- The JEDEC Wide I/O SDRAM standard for mobile memory
- The Wide I/O memory controller that CEA-Leti and ST-Ericsson discussed at the RTI conference on 3D design
- The Xilinx Virtex-7 2000T FPGA and its 2.5D silicon-interposer assembly

I then started to think about designing my own version of the Micron/IBM HMC using standard Wide I/O memory chips instead of the specialized memory chips Micron has developed for the HMC. Each of those Micron memory die incorporates 16 memory arrays, each with a separate I/O channel. The HMC can therefore deliver a peak throughput of approximately 160 Gbytes/sec. It’s designed for high-performance computing applications, which is why IBM is interested in the technology.

In contrast, a Wide I/O SDRAM is designed for low-power and mobile applications with somewhat less performance. One Wide I/O SDRAM delivers about 17 Gbytes/sec of bandwidth through four memory arrays and four 128-bit interface ports. Not shabby compared to DDR memory DIMMs, but not HMC-class performance either.
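To make those numbers concrete, here’s a quick back-of-the-envelope calculation in Python. The 266 MHz single-data-rate clock for the Wide I/O interface is my assumption (the published figures are consistent with it); the HMC per-channel figure simply divides the quoted aggregate by its 16 channels.

```python
# Back-of-the-envelope bandwidth arithmetic for the two memory types.
# Assumption: Wide I/O runs single-data-rate at 266 MHz.

WIDE_IO_CHANNELS = 4      # four independent ports per die
WIDE_IO_PORT_BITS = 128   # width of each port
WIDE_IO_CLOCK_HZ = 266e6  # assumed SDR clock

wide_io_bw = WIDE_IO_CHANNELS * WIDE_IO_PORT_BITS * WIDE_IO_CLOCK_HZ / 8  # bytes/sec
print(f"Wide I/O per die: {wide_io_bw / 1e9:.1f} Gbytes/sec")  # ~17.0

HMC_AGGREGATE_BW = 160e9  # Micron/IBM quoted peak, bytes/sec
HMC_CHANNELS = 16         # one channel per memory array
print(f"HMC per channel:  {HMC_AGGREGATE_BW / HMC_CHANNELS / 1e9:.1f} Gbytes/sec")  # 10.0
```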

Because I expect Wide I/O memory parts to go into high-volume production over the next couple of years, driven by demand from mobile designs, I decided a thought experiment using these parts was in order to round out the year. So let me take you on my quick flight of fancy through a top-level conceptual design to see where this technology takes us.

First, I plan to use stock Wide I/O memories, which gives me 17 Gbytes/sec of peak bandwidth from each Wide I/O memory chip or memory stack. (I can stack as many as four Wide I/O die using 3D assembly techniques to get additional memory capacity, but stacking adds no additional memory bandwidth because the stacked die share the same four ports.) Four such memories or memory stacks will deliver about 68 Gbytes/sec of memory bandwidth: somewhat less than the HMC, but close enough for a thought experiment and plenty of bandwidth to make the experiment interesting.
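Here’s a short sketch of the stack arithmetic. The 1-Gbyte die density is purely an assumption for illustration; the bandwidth figures come from the numbers above.

```python
# Bandwidth and capacity of the four-stack arrangement.
# Assumption: 1-Gbyte Wide I/O die; actual densities will vary.

GB = 2**30
PER_DIE_BW = 17e9          # bytes/sec, from the calculation above
PER_DIE_CAPACITY = 1 * GB  # assumed die density
DIES_PER_STACK = 4         # stacking adds capacity, not bandwidth
STACKS = 4

total_bw = STACKS * PER_DIE_BW                                # 68 Gbytes/sec
total_capacity = STACKS * DIES_PER_STACK * PER_DIE_CAPACITY   # 16 Gbytes
print(f"Peak bandwidth: {total_bw / 1e9:.0f} Gbytes/sec")
print(f"Capacity:       {total_capacity / GB:.0f} Gbytes")
```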

To control each Wide I/O stack, I need four Wide I/O memory controllers (one for each of the four memory ports on the Wide I/O die). Conveniently, CEA-Leti and ST-Ericsson have just discussed such a controller design at the recent RTI conference on 3D design. Here’s a layout of one such controller developed by the partners. They employed the Cadence Wide I/O memory controller IP block.

If we put four of these controllers together and arrange them properly, we get a Wide I/O stack controller that meets the JEDEC-specified layout. Then we take four such stack controllers and arrange them on a chip to define a logic die that can control four independent Wide I/O memory stacks using 16 Wide I/O memory controllers with a resulting peak bandwidth of 68 Gbytes/sec.
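To give a flavor of what that 16-controller logic die has to do, here’s a hypothetical address decode that interleaves a flat physical address space across the four stacks and four channels so that sequential traffic spreads over all 16 controllers. The 256-byte interleave granularity and the field layout are my own invention for illustration, not anything from the JEDEC spec or the Leti/ST-Ericsson design.

```python
# Hypothetical address decode for the 16-controller logic die.
# Interleaving on 256-byte boundaries rotates sequential traffic across
# all four channels of one stack, then across the four stacks.

INTERLEAVE = 256  # bytes per slice (assumed granularity)

def decode(addr: int) -> tuple[int, int, int]:
    """Map a flat physical address to (stack, channel, offset-within-channel)."""
    slice_idx = addr // INTERLEAVE
    channel = slice_idx % 4           # rotate across the 4 ports of one die
    stack = (slice_idx // 4) % 4      # then rotate across the 4 stacks
    offset = (slice_idx // 16) * INTERLEAVE + addr % INTERLEAVE
    return stack, channel, offset

# Sequential 256-byte blocks land on successive channels, then stacks:
for a in range(0, 5 * INTERLEAVE, INTERLEAVE):
    print(hex(a), decode(a))
```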

To get that sort of bandwidth off chip, I’ve elected to use four PCIe Gen 3 x32 I/O ports, which gives me a peak theoretical bandwidth of 128 Gbytes/sec (32 Gbytes/sec per port). I could use four PCIe Gen 3 x16 ports instead, for an aggregate bandwidth of 64 Gbytes/sec, but that seems a bit undersized to me. I don’t want the memory subsystem’s I/O bandwidth to be the limiting factor here, so I’ve used x32 ports.
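For the record, here’s where the 32-Gbytes/sec-per-port figure comes from: PCIe Gen 3 signals at 8 GT/s per lane with 128b/130b encoding, so an x32 port delivers just under 32 Gbytes/sec in each direction.

```python
# PCIe Gen 3 link arithmetic. 8 GT/s per lane with 128b/130b encoding
# gives just under 1 Gbyte/sec per lane, per direction.

GT_PER_SEC = 8e9      # Gen 3 signaling rate per lane
ENCODING = 128 / 130  # 128b/130b line-code overhead

lane_bw = GT_PER_SEC * ENCODING / 8  # bytes/sec per lane, one direction
port_bw = 32 * lane_bw               # x32 port
print(f"x32 port: {port_bw / 1e9:.1f} Gbytes/sec")      # ~31.5, call it 32
print(f"4 ports:  {4 * port_bw / 1e9:.0f} Gbytes/sec")  # ~126, call it 128
```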

Now I know we’re going to need some significant data-moving and data-handling bandwidth on this chip, so I’ve added a 4-processor array of ARM Cortex-A7 cores to the design with a shared on-chip L2 SRAM cache. I think the SRAM cache will need to be fairly large to act as a buffer between the Wide I/O SDRAM stacks and the PCIe ports.
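How large is “fairly large”? One classic way to size such a buffer is the bandwidth-delay product: cover the worst-case round-trip latency at full line rate. This is a sketch only, and the 1-microsecond latency figure is purely my assumption; a real design would use measured numbers.

```python
# Rough sizing for the shared on-chip SRAM, treated as a rate-matching
# buffer between the PCIe ports and the SDRAM stacks, using a
# bandwidth-delay-product argument.

LINK_BW = 128e9      # bytes/sec, aggregate PCIe bandwidth from above
ROUND_TRIP_S = 1e-6  # assumed worst-case request-to-data latency

buffer_bytes = LINK_BW * ROUND_TRIP_S
print(f"Minimum buffer: {buffer_bytes / 1024:.0f} Kbytes")  # 125 Kbytes
```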

The resulting chip layout (please remember, it’s a thought experiment) looks like this:

I have no idea if the aspect ratios of the on-chip blocks are correct, but this layout will do for a quick exercise. It might also be smart to add several DMA controllers, but that’s more than I plan to do in this initial thought experiment.

Note that I’ve placed the Wide I/O memory controllers and the PCIe ports on the chip periphery because I am not planning to place the Wide I/O memories on top of this controller chip using 3D assembly techniques. Instead, I plan to use the silicon-interposer technology and the 2.5D assembly techniques pioneered by Xilinx with the Virtex-7 2000T FPGA, and to minimize the assembly’s footprint by mounting memory stacks on both sides of the silicon interposer. The logic chip will not need TSVs; the silicon interposer will; the Wide I/O memory die already have them.

Here’s what the assembly might look like:

It looks like there are three Wide I/O SDRAM stacks, but there are four. The second stack beneath the interposer is located behind the one that’s visible.

The full assembly, using standard memory die and a custom controller, delivers a high-performance, high-capacity SDRAM subsystem built from many of the technologies introduced just this year. A year ago, this thought experiment would have looked like a far-off dream; after 12 months of 3D developments, it now seems quite possible.

How would you improve this design?

Happy New Year!

–Steve Leibson
