Today, STMicroelectronics introduced a new family of STM32 F4 microcontrollers based on the ARM Cortex-M4 microprocessor IP core. The Cortex-M4 core incorporates several DSP-specific extensions to the ARM Cortex-M architecture including single-cycle 16/32-bit MAC and dual-16-bit MAC instructions, 8- and 16-bit SIMD arithmetic, and a hardware-assisted divide instruction that takes 2 to 12 cycles. The Cortex-M4 processor core also includes an IEEE-754 single-precision floating-point unit, so it’s a high-end 32-bit processor as RISC microcontroller processor cores go. The DSP features of the ARM Cortex-M4 processor core clearly telegraph that this family is aimed at digital-signal-control (DSC) applications.
However, this particular blog entry is about just two of the innovative architectural features designed into the STM32 F4 microcontroller family that allow these microcontrollers to deliver more performance than you might expect. These two features are the Adaptive Real-Time (ART) Accelerator and the unique on-chip interconnect based on a high-performance crossbar implementation of the ARM AHB on-chip bus.
But first, here’s a block diagram of the STM32 F4 microcontroller:
There are far too many components in this design to discuss in one blog entry, but suffice it to say there is a rich mix of peripherals including major standard interfaces (10/100 Ethernet, USB OTG (On the Go), CAN, asynchronous/synchronous serial ports), ADCs, DACs, timers, a PWM, a random-number generator, an optional crypto/hash processor, and on and on. There’s also a big chunk of Flash EEPROM, another fair-sized chunk of SRAM, and a small chunk of battery-backed SRAM. That’s a lot for a microcontroller that sells for less than $6 (at the low end of the family) in quantities of 1000.
Now microcontrollers (as opposed to application processors and embedded processors) tend to run directly from their on-chip Flash memories and Flash is S-L-O-W. Consequently, fast processors in microcontrollers often need to insert wait states when executing directly out of Flash. A fast clock rate is great, but the instructions/clock will fall as a result of the wait states. One solution to this problem is to move the instruction stream from Flash to RAM and then execute from RAM. That’s not a good plan, because on-chip RAM is less dense than Flash, and copying code from Flash to RAM means the code lives in two places on the chip, which wastes even more silicon area.
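To put rough numbers on that wait-state penalty, here is a minimal sketch in Python. It assumes a deliberately naive model in which every instruction needs its own Flash fetch and every fetch stalls for a fixed number of wait states. The function name and the wait-state counts are illustrative assumptions, not figures published by ST.

```python
def effective_mips(clock_mhz, wait_states):
    """Millions of instructions per second under a naive model:
    one instruction per Flash fetch, each fetch costing
    1 + wait_states clock cycles (an assumption for illustration)."""
    if wait_states < 0:
        raise ValueError("wait_states must be non-negative")
    return clock_mhz / (1 + wait_states)


# Hypothetical wait-state counts at a 168 MHz core clock.
for ws in (0, 1, 3, 5):
    print(f"{ws} wait states: {effective_mips(168, ws):.0f} M instructions/s")
```

Under this model the clock rate stays at 168 MHz while the delivered instruction rate drops in proportion to the stall cycles, which is the problem any Flash accelerator has to solve.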
The STM32 F4 family employs a more ingenious approach called the ART Accelerator to circumvent the speed mismatch between the processor and Flash memory. Here’s a block diagram of the STM32 F4 microcontroller’s ART Accelerator:
In this diagram, the ARM Cortex-M4 processor core appears on the left. The microcontroller’s Flash memory is on the far right. Note that the Flash memory is organized as 128-bit-wide memory. Now ARM Cortex-M4 instructions are either 16 or 32 bits wide, so each 128-bit Flash memory slice holds four to eight instructions, with a typical instruction mix averaging five to six instructions per 128-bit chunk. Simply organizing the Flash in 128-bit words allows the processor core to execute sequential code at its full 168 MHz speed without wait states, because one fetch from Flash keeps the processor running for four to eight clock cycles.
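The four-to-eight range follows directly from the instruction widths. As a small worked example in Python (the function name and the 50/50 mix are assumptions chosen for illustration, not measured data):

```python
LINE_BITS = 128  # width of one Flash read, as described above


def instructions_per_line(frac_16bit):
    """Average number of instructions held in one 128-bit Flash line,
    given the fraction of instructions that are 16 bits wide
    (the remainder are 32 bits wide)."""
    if not 0.0 <= frac_16bit <= 1.0:
        raise ValueError("frac_16bit must be between 0 and 1")
    avg_bits = 16 * frac_16bit + 32 * (1.0 - frac_16bit)
    return LINE_BITS / avg_bits


print(instructions_per_line(0.0))  # all 32-bit instructions
print(instructions_per_line(1.0))  # all 16-bit instructions
print(round(instructions_per_line(0.5), 2))  # hypothetical even mix
```

An even mix of 16- and 32-bit instructions gives a little over five instructions per line, which sits inside the five-to-six average quoted above.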
Now that works fine for sequential code execution, but the program will inevitably come to a branch. All programs do, and as Yogi Berra famously said, “When you come to a fork in the road, take it.” Eventually, the processor must take a branch and break from sequential code execution. When it does, there will be wait states as the first 128-bit slice of code from the branch target is fetched from Flash. At the same time, that 128-bit slice goes into one of the 64 storage locations in the ART Accelerator so that the next time that branch is taken, the first four to eight instructions are immediately available. The ART Accelerator has room to store 64 of these 128-bit branch slices plus another eight locations to store 128-bit data slices to speed data access from Flash.
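The branch-slice behavior described above can be modeled in a few lines of Python. This is only a sketch under stated assumptions: the class name, the five-cycle miss penalty, and the least-recently-used replacement policy are all hypothetical, since the description here does not say how the real ART Accelerator chooses which slice to evict.

```python
from collections import OrderedDict

LINE_BYTES = 16  # one 128-bit Flash slice


class BranchCache:
    """Toy model of a cache of branch-target slices.
    LRU replacement and the miss penalty are assumptions."""

    def __init__(self, entries=64, miss_penalty=5):
        self.entries = entries
        self.miss_penalty = miss_penalty
        self.lines = OrderedDict()  # slice index -> present

    def branch_to(self, target_addr):
        """Return the stall cycles for a taken branch to target_addr."""
        line = target_addr // LINE_BYTES
        if line in self.lines:
            self.lines.move_to_end(line)  # mark as most recently used
            return 0
        if len(self.lines) >= self.entries:
            self.lines.popitem(last=False)  # evict least recently used
        self.lines[line] = True
        return self.miss_penalty


cache = BranchCache()
loop_top = 0x0800_0100  # hypothetical address of a loop's first instruction
stalls = [cache.branch_to(loop_top) for _ in range(4)]
print(stalls)  # [5, 0, 0, 0]
```

Only the first pass through the loop pays the Flash penalty; every later branch back to the same target is served from the stored slice.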
The second architectural innovation I want to discuss is the on-chip interconnect, which ST calls the “Multi-AHB Matrix” (shown below).
This is an implementation of the ARM AHB bus based on a sparse crossbar switch. There are seven AHB masters (three for the processor, two DMA controllers, the Ethernet controller, and the USB controller) along the top of the diagram and seven slaves shown along the right side of the diagram: instruction and data ports for the ART Accelerator, RAM, the FSMC (a “Flexible Static Memory Controller” used for controlling external RAM or a graphical LCD), and two banks of peripherals. The key architectural innovation in the Multi-AHB Matrix is that several masters can independently access different slaves over the crossbar switch without blocking one another. Implementing the on-chip “bus” as a crossbar helps prevent the interconnect from becoming a bottleneck.
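A toy comparison in Python shows why this matters. The sketch below assumes a simple fixed-priority rule (earlier requests win when two masters want the same slave) and treats every master as able to reach every slave. Both are simplifications, and the master and slave names are made up for illustration rather than taken from ST documentation.

```python
def crossbar_grants(requests):
    """One cycle of a crossbar: grant at most one master per slave.
    requests maps master name -> slave name; earlier entries win ties
    (a fixed-priority rule assumed for this sketch)."""
    granted = {}
    busy_slaves = set()
    for master, slave in requests.items():
        if slave not in busy_slaves:
            busy_slaves.add(slave)
            granted[master] = slave
    return granted


def shared_bus_grants(requests):
    """One cycle of a single shared bus: only one master proceeds."""
    for master, slave in requests.items():
        return {master: slave}
    return {}


# Hypothetical requests issued in the same clock cycle.
requests = {
    "cpu_ibus": "flash_icode",
    "dma1": "sram",
    "ethernet": "sram",  # conflicts with dma1, so it must wait
    "usb": "fsmc",
}
print(len(crossbar_grants(requests)))    # 3
print(len(shared_bus_grants(requests)))  # 1
```

With the crossbar, three of the four transfers proceed in the same cycle and only the master contending for the already-claimed RAM has to wait; on a single shared bus, three of the four masters would stall.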
The ART Accelerator and the Multi-AHB Matrix are examples of thoughtful architectural additions that help to get the maximum performance out of a relatively low-cost piece of 90 nm silicon. A 168 MHz processor may not be fast in smartphone, tablet, or PC territory, but it’s plenty fast in the microcontroller world, and architectural features like the ART Accelerator and the Multi-AHB Matrix help ST make 168 MHz feel even faster than the clock rate alone suggests. All SoC and Silicon Realization teams should be looking for similar ways to get the maximum performance from their architectural designs.