STMicroelectronics – A smart architecture which boosts MCU performance while preserving determinism


Fig. 2: the bus matrix of the STM32F7 microcontroller

Microcontrollers which are based on the ARM® Cortex®-M7 core mostly share similar processor configuration options. These typically include:

  • a 64-bit AXI system bus interface
  • an instruction and data cache
  • 64-bit Instruction Tightly Coupled Memory (ITCM)
  • dual 32-bit Data Tightly Coupled Memories (DTCM)

This Design Note describes several features that differentiate the STM32F7, a family of MCUs from STMicroelectronics based on the ARM Cortex-M7 core, from other MCUs that use the same core.

The first important difference is that the STM32F7 devices have both an ITCM interface and an AXI interface to their embedded Flash memory, as shown in Figure 1. This offers greater flexibility when executing code. In addition, the STM32F7 MCU has a built-in Flash accelerator, the Adaptive Real-Time (ART) Accelerator™, which enables zero-wait-state execution from Flash. Using the TCM interface with the ART Accelerator gives performance similar to that of the cached AXI interface, but without the penalty of cache misses and without the need for cache maintenance operations in user code.


Fig. 1: block diagram of a system-on-chip based on the ARM Cortex-M7 core


Taking advantage of the ART Accelerator as well as an L1 cache of up to 16kbytes, STM32F7 devices provide the maximum performance of the ARM Cortex-M7 core whether code is executed from embedded Flash or external memory: 1082 CoreMark/462 DMIPS at an operating frequency of 216MHz.
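The benchmark figures above can be normalised by clock frequency to give the per-MHz values that are often used to compare cores. The following is a minimal sketch of that arithmetic; the macro and function names are illustrative assumptions, and only the three numbers come from the text.

```c
#include <assert.h>

/* Figures quoted in the text above. */
#define COREMARK_SCORE 1082.0
#define DMIPS_SCORE     462.0
#define CPU_FREQ_MHZ    216.0

/* Normalise a benchmark score by the clock frequency in MHz.
 * (Hypothetical helper name, for illustration only.) */
double score_per_mhz(double score, double freq_mhz)
{
    assert(freq_mhz > 0.0);
    return score / freq_mhz;
}
```

Calling score_per_mhz(COREMARK_SCORE, CPU_FREQ_MHZ) gives roughly 5.0 CoreMark/MHz, and score_per_mhz(DMIPS_SCORE, CPU_FREQ_MHZ) gives roughly 2.14 DMIPS/MHz.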

The second big differentiator is that internal SRAM is distributed in several blocks to reduce dynamic power consumption, and to optimise bandwidth and latency by allowing concurrent access to different SRAM blocks from various bus masters.

One use case for this architecture is in human-machine interfaces, in which audio and graphical data must be transferred concurrently from or to the system RAM.

Superior floating-point unit performance
Devices in the STM32F7 family feature a high-performance single- or double-precision Floating-Point Unit (FPU), which supports all ARM single- or double-precision data-processing instructions and data types. The FPU offers benefits in many applications which require floating-point mathematical precision, including loop control, audio processing, audio decoding and digital filtering.
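To illustrate why the choice between single and double precision can matter, the sketch below adds 1 to 2^24 in each format. It is a generic property of IEEE 754 arithmetic rather than anything specific to the STM32F7, and the function names are hypothetical.

```c
#include <assert.h>

/* Returns 1 if adding 1.0 to 2^24 changes the value in single precision.
 * A float has a 24-bit significand, so 2^24 + 1 is not representable and
 * the sum rounds back to 2^24. `volatile` forces each result to be stored
 * at its declared precision. */
int float_sees_increment(void)
{
    volatile float base = 16777216.0f;   /* 2^24, exactly representable */
    volatile float next = base + 1.0f;
    assert(base == 16777216.0f);
    return next != base;
}

/* Same experiment in double precision (53-bit significand). */
int double_sees_increment(void)
{
    volatile double base = 16777216.0;
    volatile double next = base + 1.0;
    return next != base;
}
```

In single precision the increment is lost, whereas in double precision it is preserved. Effects of this kind are one reason why algorithms developed with double-precision tools on a PC may behave differently when moved to single-precision hardware.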

An additional benefit is that certain functions may be offloaded from the CPU to the FPU, leaving the CPU available for other tasks. Support for double precision also makes it easier to use PC-based mathematical software which uses double-precision floating-point instructions.

One of the most distinctive features of the STM32F7 MCUs is their smart system architecture, which uses two sub-systems, as shown in Figure 2 (see above):

  • an AXI-to-multi-AHB bridge converts the AXI4 protocol to the AHB-Lite protocol
  • a multi-AHB bus matrix manages the access arbitration between masters

This arbitration uses a round-robin algorithm. The bus matrix routes each access from a master to a slave, enabling concurrent accesses and efficient operation even when several high-speed peripherals work simultaneously.
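The idea behind round-robin arbitration can be sketched in software as follows. This is only a generic model of the algorithm, not ST's hardware implementation: the number of masters, the request bitmask and the function name rr_arbitrate are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_MASTERS 4u   /* hypothetical number of bus masters */

/*
 * Pick the next master to be granted access to a slave.
 * `requests` is a bitmask (bit i set means master i is requesting);
 * `last_grant` is the index of the master granted previously.
 * The search starts just after the last granted master and wraps
 * around, so every requesting master is served in turn.
 * Returns the granted index, or -1 if no master is requesting.
 */
int rr_arbitrate(uint32_t requests, int last_grant)
{
    assert(last_grant >= 0 && last_grant < (int)NUM_MASTERS);
    for (unsigned step = 1; step <= NUM_MASTERS; ++step) {
        unsigned candidate = ((unsigned)last_grant + step) % NUM_MASTERS;
        if (requests & (1u << candidate)) {
            return (int)candidate;
        }
    }
    return -1;
}
```

Because the search always begins after the most recently served master, no single master can monopolise a slave while others are waiting, which is what keeps worst-case access latency bounded.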

Cache maintenance operations
Finally, it is worth pointing out the purpose of cache maintenance when implementing critical code on an ARM Cortex-M7 device. The STM32F7 embeds an instruction and data cache to compensate for the wait states inserted when fetching code and data from on-chip or off-chip memories, thus boosting performance. However, going through those caches does not preserve determinism when cache misses and cache line fills occur.

This is why TCM memories are strongly recommended for the execution of critical code and for the storage of critical data. This is often useful in applications such as home appliances and motor control, in which safe operation must be guaranteed.

Software maintenance operations are needed because cached memories can be accessed not only by the CPU but also by other master devices including the Direct Memory Access (DMA) controller. These master devices might read out-of-date data when accessing physical memories while new updates are already available in a CPU cache.

To avoid this problem, developers should adopt the following practices when writing user code:

  • When a master other than the CPU is performing an access to a cached memory location, a cache clean is recommended prior to that operation. This is to ensure that the CPU’s most recent updates are written back to physical memory.
  • When a master other than the CPU has made an update to a memory location, the CPU should invalidate the cache prior to any read operation from that location. This is to ensure a direct read from the physical memory.
  • Cache-less operations could also be considered. When a cached memory location is frequently accessed by other masters, marking it with a non-cacheable memory attribute, configured through the Memory Protection Unit (MPU), could prevent data incoherency.
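The first two practices above can be illustrated with a small software model of a single write-back cache line shared between the CPU and a DMA master. Everything in this sketch (the cache_model_t type and the cpu_, dma_ and cache_ functions) is hypothetical and runs on any host; it is not STM32F7 driver code. On a real device, the clean and invalidate steps would typically be performed with the CMSIS-Core functions SCB_CleanDCache_by_Addr and SCB_InvalidateDCache_by_Addr, on buffers aligned to the 32-byte cache line.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 32u   /* Cortex-M7 data cache line size in bytes */

/* Hypothetical model: one cache line in front of one line of RAM. */
typedef struct {
    uint8_t ram[LINE_SIZE];    /* physical memory, as seen by the DMA */
    uint8_t line[LINE_SIZE];   /* the CPU's cached copy of that memory */
    int valid;                 /* line currently holds a copy of ram */
    int dirty;                 /* line has updates not yet written to ram */
} cache_model_t;

/* On a miss, load the line from physical memory. */
static void fill_if_needed(cache_model_t *m)
{
    if (!m->valid) {
        memcpy(m->line, m->ram, LINE_SIZE);
        m->valid = 1;
        m->dirty = 0;
    }
}

/* CPU accesses go through the cache (write-back, write-allocate). */
void cpu_write(cache_model_t *m, unsigned off, uint8_t value)
{
    assert(off < LINE_SIZE);
    fill_if_needed(m);
    m->line[off] = value;
    m->dirty = 1;
}

uint8_t cpu_read(cache_model_t *m, unsigned off)
{
    assert(off < LINE_SIZE);
    fill_if_needed(m);
    return m->line[off];
}

/* DMA accesses bypass the cache and touch physical memory directly. */
void dma_write(cache_model_t *m, unsigned off, uint8_t value)
{
    assert(off < LINE_SIZE);
    m->ram[off] = value;
}

uint8_t dma_read(const cache_model_t *m, unsigned off)
{
    assert(off < LINE_SIZE);
    return m->ram[off];
}

/* Clean: write the CPU's pending updates back to physical memory. */
void cache_clean(cache_model_t *m)
{
    if (m->valid && m->dirty) {
        memcpy(m->ram, m->line, LINE_SIZE);
        m->dirty = 0;
    }
}

/* Invalidate: discard the cached copy so the next read refills it. */
void cache_invalidate(cache_model_t *m)
{
    m->valid = 0;
    m->dirty = 0;
}
```

In this model, a value written with cpu_write is not visible to dma_read until cache_clean has been called, and a value written with dma_write is not visible to cpu_read on an already cached line until cache_invalidate has been called. Note that invalidating a dirty line discards the CPU's pending updates, which is why the order of the two operations matters in real code.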

Orderable Part Number: STM32F746G-DISCO