Realizing the ARM Cortex-A15: What does the road to 2.5GHz look like?

ARM and Cadence jointly presented a paper at today’s ARM Technology Conference on the steps needed to get the new ARM Cortex-A15 multi-processor core running at or above 2.5GHz in a 32/28nm G/HP process technology and at or above 1.5GHz in a 32/28nm LP process technology. The two companies have been working on this project since March of this year. The ARM Cortex-A15 is clearly designed to be fast out of the gate. Its configurable number of processor cores suits it to a wide variety of mobile and tethered applications and end products with a wide range of performance needs. Getting high clock speed and low power consumption at the same time is a good trick. Andrew Lambert from ARM and Rob Lipsey and Gopi Kudva from Cadence provided many details about the tool flow that delivers multi-GHz speed with low power.

First, Lambert gave a few of the ARM Cortex-A15 architectural details without revealing too much about the processor internals. The ARM Cortex-A15 is ARM’s first processor to employ the AMBA 4 system coherency bus, which is necessary because the ARM Cortex-A15 can be configured with as many as four processor cores. The ARM Cortex-A15 is also the first ARM processor core with a 1Tbyte address space. For reduced power consumption, the ARM Cortex-A15’s L2 cache is divided four ways and each of the four quadrants can be powered independently. The processor pipeline has a fine-grained power-shutdown feature to reduce dynamic power consumption, and the processor’s register-save and -restore abilities are accelerated to reduce the time and power needed to transition from sleep to full operation and back again. The design supports the Common Power Format (CPF) so that the design intent for power consumption can permeate the tool chain.
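To put the 1Tbyte figure in perspective, here is a small sketch of the arithmetic, assuming "Tbyte" is meant in the binary sense (2^40 bytes). The function name is made up for this illustration and does not come from the presentation.

```python
def address_bits(space_bytes: int) -> int:
    """Smallest number of address bits able to index space_bytes bytes."""
    if space_bytes < 1:
        raise ValueError("address space must contain at least one byte")
    # The highest byte address is space_bytes - 1; its bit length is the width needed.
    return (space_bytes - 1).bit_length()


ONE_TBYTE = 2 ** 40   # assumption: binary terabyte
FOUR_GBYTE = 2 ** 32  # the classic 32-bit address space, for comparison

print(address_bits(ONE_TBYTE))     # address width for a 1Tbyte space
print(address_bits(FOUR_GBYTE))    # address width for a 4Gbyte space
print(ONE_TBYTE // FOUR_GBYTE)     # how many 4Gbyte spaces fit in 1Tbyte
```

Under that assumption, a 1Tbyte space needs 40 address bits, eight more than a 32-bit address space, which makes it 256 times larger.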

ARM and Cadence assembled an implementation team in March. The target implementation had two processor cores each with 32Kbyte L1 caches, a 1Mbyte L2 cache, ECC and parity protection on both the L1 and L2 caches, and the power domains that permit multiple operating/power modes. The design also included the ARM NEON SIMD vector coprocessor and the FPU. In an interesting twist, the ARM Cortex-A15 processor consists of two blocks: a processor block (for ease of replication, two were used in this design) and a non-processor block (used for providing an external interface and other support circuits such as an interrupt controller to the processor blocks).

Both block types were developed flat, from the bottom up. The targeted process for this exercise was a 32LP process with six base metal layers and two additional power-distribution layers (x8 metal width).

The goal of this exercise was to develop a script that will allow anyone using the ARM Cortex-A15 processor core to easily insert it into an SoC design. The result was an optimized synthesis script for the Cadence Encounter RTL Compiler and associated tools. “Use it with the prescribed methodology and you’re done,” quipped Kudva.

The implementation team used a nearly standard synthesis flow to develop the ARM Cortex-A15 design, paying attention to early timing results and signal flow to guide the floorplan, critical node routing after floorplanning, and clock-tree synthesis—among other factors. “The clock-tree structure is critical to getting performance,” said Lipsey. The approach used here was to bring the clock to the center of the core and then branch out from there, using no more than three branching levels. The target for the clock tree was less than 1.2nsec of latency and 50psec of global skew.
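To see how those clock-tree targets compare with the clock period, here is a rough sketch of the arithmetic. It treats the 1.2nsec latency and 50psec skew figures as upper bounds and applies them to both frequency targets quoted earlier, which is an assumption: the presentation tied the targets to the 32LP exercise, and the names below are invented for this illustration.

```python
# Frequency targets quoted in the article, in GHz.
TARGETS_GHZ = {"32/28nm G/HP": 2.5, "32/28nm LP": 1.5}

# Clock-tree targets quoted in the article, treated here as upper bounds.
MAX_LATENCY_PS = 1200.0  # 1.2nsec
MAX_SKEW_PS = 50.0       # 50psec of global skew


def period_ps(freq_ghz: float) -> float:
    """Clock period in picoseconds for a frequency given in GHz."""
    return 1000.0 / freq_ghz


def clock_budget(freq_ghz: float) -> dict:
    """Express the skew and latency targets relative to one clock period."""
    period = period_ps(freq_ghz)
    return {
        "period_ps": period,
        "skew_fraction": MAX_SKEW_PS / period,
        "latency_cycles": MAX_LATENCY_PS / period,
    }


for process, freq in TARGETS_GHZ.items():
    b = clock_budget(freq)
    print(f"{process}: period {b['period_ps']:.0f}ps, "
          f"skew {b['skew_fraction']:.1%} of period, "
          f"latency {b['latency_cycles']:.1f} cycles")
```

At 2.5GHz the period is only 400psec, so 50psec of skew would eat one eighth of every cycle. That gives a feel for why Lipsey called the clock-tree structure critical.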

Finally, it’s worth noting the number of switching transistors required to build in the functional power gating. The CPU block required approximately 9000 power-switching transistors and the L1 memories required an additional 12,000 power-switching transistors. The non-CPU block employs about 2000 power-switching transistors for the logic and another 5000 power-switching transistors for the associated memories. In all, the power-switching transistors added only 2% to the area.
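For a quick tally of those figures, the sketch below adds them up. It assumes the quoted counts apply to each block instance, so that a dual-core design carries two sets of CPU-block switches. The article does not say whether that is the case, so treat the dual-core total as illustrative only.

```python
# Approximate power-switch counts quoted in the article.
SWITCHES = {
    "cpu_logic": 9000,
    "cpu_l1_memories": 12000,
    "noncpu_logic": 2000,
    "noncpu_memories": 5000,
}


def total_switches(cpu_blocks: int = 2) -> int:
    """Total power switches, assuming the CPU counts are per block instance."""
    per_cpu = SWITCHES["cpu_logic"] + SWITCHES["cpu_l1_memories"]
    non_cpu = SWITCHES["noncpu_logic"] + SWITCHES["noncpu_memories"]
    return cpu_blocks * per_cpu + non_cpu


print(total_switches(cpu_blocks=1))  # one CPU block plus the non-CPU block
print(total_switches(cpu_blocks=2))  # the dual-core configuration, under the assumption above
```

Read either way, the count is in the tens of thousands of switches for a reported area cost of only 2%.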

About sleibson2

EDA360 Evangelist and Marketing Director at Cadence Design Systems (blog at http://eda360insider.wordpress.com/)