Vectorization

Your Solution to Vectorize Your Application

The vector units of upcoming microcontrollers promise to speed up the execution of data-parallel applications based on linear algebra by factors greater than 10. However, programming such accelerators manually is challenging because it requires deep knowledge of their instruction set and microarchitecture.

emmtrix Parallel Studio is your solution to simplify this task significantly:

  • Easy exploitation of parallel vector hardware without writing vectorized code
  • Correct-by-construction vectorized code reduces testing effort
  • Instant feedback on performance impact enables short development cycles

Introduction to Vectorization

In the next generation of their embedded, safety-certified MCUs, vendors like Infineon will incorporate so-called “vector units”: accelerators that can perform several similar operations on different data at the same time. This concept is known as single instruction, multiple data (SIMD). SIMD has long been used in desktop and server systems and is now finding its way into safety-critical embedded systems. With this kind of hardware, applications that rely heavily on linear algebra, like sensor fusion or inference in AI systems, can be sped up by a factor of more than ten.

From a programming point of view, these vector architectures share some of the same principles as GPUs in that they work on many data elements simultaneously and that data needs to be kept in local memory to keep the processing units busy. However, programming is still mostly done manually, which is time-consuming and error-prone, or by using pre-defined library functions, which limits the applications that can be accelerated.

For more information, we recommend our short three-part vectorization series.

emmtrix Vectorization Flow

emmtrix provides a user-guided, automated vectorization flow that supports the same input languages as the general parallelization flow. Vectorization, like parallelization, is an automated, interactive process: emmtrix Parallel Studio (ePS) performs an initial, automated vectorization for the selected target architecture. The generated code can be run on the target system, e.g. to verify the functionality and timing of the implementation. In addition, ePS provides an interface to connect a cycle-approximate or cycle-accurate simulator for the respective target platform, which returns information on the runtime behaviour of the vectorized code. This simplifies work in early project stages or in continuous integration, where target hardware might not be readily available.

The runtime data is presented to the user as feedback on the success of the vectorization. If necessary, the user can trigger transformations to improve the result, and ePS applies the corresponding code changes automatically. This cycle is repeated until the developer is satisfied with the results. The developer does not write any vectorized code in the process.

Code transformations that the developer can apply include typical loop transformations like fission, fusion, splitting and unrolling, but also memory layout optimizations like padding.
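As a rough illustration of one of these transformations, the following sketch shows loop fission on a hand-written example. It is not output of emmtrix Parallel Studio; the function names and the example computation are assumptions made for illustration, and the arrays are assumed not to overlap. In the original loop, an independent scaling shares its body with a running sum that depends on the previous iteration. After fission, the scaling sits in its own loop with independent iterations, which makes it a candidate for vectorization.

```c
#include <stddef.h>

/* Original loop: an independent scaling and a running sum share one body.
 * The running sum carries a dependency from one iteration to the next. */
void scale_and_sum(const float *x, float *scaled, float *running,
                   size_t n, float gain)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        scaled[i] = gain * x[i];
        acc += x[i];
        running[i] = acc;
    }
}

/* After loop fission: same results, but split into two loops. */
void scale_and_sum_fissioned(const float *x, float *scaled, float *running,
                             size_t n, float gain)
{
    /* Iterations are independent of each other: candidate for vectorization. */
    for (size_t i = 0; i < n; ++i)
        scaled[i] = gain * x[i];

    /* Loop-carried dependency on acc: left as a sequential loop here. */
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        acc += x[i];
        running[i] = acc;
    }
}
```

Both functions compute the same values; only the loop structure differs.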

Automatic vectorization requires detailed information about the target hardware. This includes the number of operations executable in parallel, the width of the vector registers, the supported data types and instructions with their latency and throughput, and the memory hierarchy with its bandwidths and latencies. All this information is captured in a comprehensive hardware model. The model also defines the syntax of the generated code, which is highly dependent on the hardware and software environment of the target. This could, for example, be (inline) assembly, intrinsic functions translated to machine code by the compiler, or an extension to C that allows developers to specify functionality and memory allocation in a more developer-friendly manner.
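To make the idea of a hardware model more concrete, here is a minimal, heavily simplified sketch. The struct, its field names and the helper functions are hypothetical and do not reflect the actual hardware model format used by emmtrix Parallel Studio; any values filled in are purely illustrative and not those of a specific device.

```c
#include <stddef.h>

/* Hypothetical, heavily simplified hardware description. */
struct vector_hw_model {
    unsigned register_bits; /* width of one vector register in bits */
    unsigned load_latency;  /* cycles for a contiguous vector load */
    unsigned mac_latency;   /* cycles for a multiply-accumulate */
};

/* Number of elements of the given bit width that fit into one register. */
size_t lanes_for(const struct vector_hw_model *hw, unsigned element_bits)
{
    return hw->register_bits / element_bits;
}

/* Round a row length up to the next multiple of the lane count,
 * as needed when padding a memory layout. */
size_t padded_length(size_t n, size_t lanes)
{
    return (n + lanes - 1) / lanes * lanes;
}
```

Assuming, for example, a 512-bit register and 32-bit elements, `lanes_for` yields 16 lanes, and a row of 15 elements would be padded to 16.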

In the following we will demonstrate the workflow with a simple, but not trivial example, a multiplication of two 15×15 matrices: the algorithm itself is simple, while the vectorization requires some code transformations to yield good performance.

To calculate each element of the result matrix, the scalar product of the corresponding row vector of matrix A and column vector of matrix B must be computed. In a parallel implementation of this scheme, up to 16 elements of the row vector are multiplied element by element with up to 16 elements of the column vector in parallel, and the products are summed up. Before this, however, the vectors must be loaded into the vector registers. The row vector of matrix A is stored contiguously in memory, so its elements can be loaded with a single operation. The elements of the column vector of matrix B, however, are scattered throughout the memory of matrix B. Gathering them therefore requires up to 16 individual memory accesses, which nullifies any acceleration that could be achieved through vectorization.

This can be remedied by applying a transpose operation to matrix B, which exchanges the rows and columns of the matrix. The result is a memory layout in which the column vectors are stored contiguously, so that each can be read with one memory access. Finally, a transformation is applied to unroll the inner loop by the number of parallel operations supported by the vector accelerator. This in turn allows an automatic, efficient mapping of these computations to instructions of the accelerator, which yields a significant speed-up compared to sequential execution.
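The effect of the transposition on the memory access pattern can be sketched in plain, scalar C. This is a hand-written illustration, not code generated by emmtrix Parallel Studio: the function names, the row-major layout, the lane count of 16 (`VLEN`) and the zero padding of each row from 15 to 16 elements are assumptions. In the transposed variant, the inner loop has a fixed trip count of 16 and reads both operands contiguously, which is the shape that maps naturally to one vector load per operand.

```c
#include <stddef.h>

#define N    15 /* matrix dimension from the example */
#define VLEN 16 /* assumed lane count; each row is padded up to this */

/* Baseline: the inner loop reads B column-wise, i.e. with a stride of N. */
void matmul_naive(const float *a, const float *b, float *c)
{
    for (size_t i = 0; i < N; ++i) {
        for (size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (size_t k = 0; k < N; ++k)
                acc += a[i * N + k] * b[k * N + j];
            c[i * N + j] = acc;
        }
    }
}

/* Transposed and padded: rows of A and rows of B^T are both contiguous. */
void matmul_transposed(const float *a, const float *b, float *c)
{
    float ap[N][VLEN] = {{0.0f}}; /* A, rows zero-padded to VLEN */
    float bt[N][VLEN] = {{0.0f}}; /* B transposed, rows zero-padded */

    for (size_t i = 0; i < N; ++i) {
        for (size_t k = 0; k < N; ++k) {
            ap[i][k] = a[i * N + k];
            bt[i][k] = b[k * N + i]; /* row i of bt is column i of B */
        }
    }

    for (size_t i = 0; i < N; ++i) {
        for (size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            /* Fixed trip count of VLEN; the padding contributes zeros. */
            for (size_t k = 0; k < VLEN; ++k)
                acc += ap[i][k] * bt[j][k];
            c[i * N + j] = acc;
        }
    }
}
```

In this sketch the transposition and padding are done on the fly for clarity; in practice such a layout change would typically be applied once to the stored data rather than on every call.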

On the Infineon 32-bit TriCore™ AURIX™ TC4x MCU, the vectorized implementation of the 15×15 matrix multiplication generated by emmtrix Parallel Studio yields a speedup of 17.5x.

Vectorization of Simulink Models

Projects utilizing Simulink models benefit particularly from emmtrix’s vectorization solution: As the blocks and interconnects given in the Simulink model describe functionality and data flow rather than an actual implementation, emmtrix Code Generator is free to choose the best possible implementation available for vectorization on the selected target architecture. In addition, it supports a feature called “Code Fusion”, which, unlike library-based approaches that only optimize individual blocks of a Simulink model, allows for cross-block optimization and vectorization. It combines calculations that are performed by different blocks into a single, inlined implementation. This improves data locality and the ratio between computation and data transfer, which raises the utilization of the vector processor and thus increases the achievable speedup.

The pictured Simulink model implements a calculation that multiplies a matrix with a second, transposed matrix. The elements of the resulting matrix are then multiplied by a constant factor (gain). Optimized vector code can be generated for each of these blocks using library-based approaches. In that case, however, each step reads in the entire matrices, processes them and writes back the complete intermediate result, which the next block must then read in again to generate the next intermediate result. In the absence of data locality, many load/store operations are required to transfer the data between the registers and the local vector data store. To avoid this, the functionality of the Transpose, Product and Gain blocks can be combined into one calculation, in which the transpose and gain operations are performed during the matrix multiplication. Since no intermediate results need to be kept, the implementation uses less local memory. In addition, the runtime in this example improves by about 10% on an Infineon 32-bit TriCore™ AURIX™ TC4x MCU.
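The difference between the block-by-block and the fused implementation can be sketched in scalar C. This is a hand-written illustration of the idea, not code produced by emmtrix Code Generator; the function names, the matrix size of 15×15 and the row-major layout are assumptions. The unfused version needs two intermediate buffers, one per block boundary, while the fused version folds the transposition into the indexing of B and applies the gain as each result element is produced.

```c
#include <stddef.h>

#define N 15 /* assumed matrix dimension */

/* Block-by-block: Transpose, Product and Gain each run over whole matrices
 * and pass complete intermediate results through memory. */
void gain_product_unfused(const float *a, const float *b, float gain, float *c)
{
    float bt[N * N];   /* output of the Transpose block */
    float prod[N * N]; /* output of the Product block */

    for (size_t i = 0; i < N; ++i)
        for (size_t j = 0; j < N; ++j)
            bt[j * N + i] = b[i * N + j];

    for (size_t i = 0; i < N; ++i) {
        for (size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (size_t k = 0; k < N; ++k)
                acc += a[i * N + k] * bt[k * N + j];
            prod[i * N + j] = acc;
        }
    }

    for (size_t i = 0; i < N * N; ++i)
        c[i] = gain * prod[i];
}

/* Fused: C = gain * (A x B^T) in one loop nest, no intermediate buffers.
 * Element (k, j) of B^T is element (j, k) of B, so B is read row-wise. */
void gain_product_fused(const float *a, const float *b, float gain, float *c)
{
    for (size_t i = 0; i < N; ++i) {
        for (size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (size_t k = 0; k < N; ++k)
                acc += a[i * N + k] * b[j * N + k];
            c[i * N + j] = gain * acc;
        }
    }
}
```

Note that in the fused version both operands of the inner loop are read contiguously, so the explicit transposition disappears entirely.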

Features

  • User-guided, automated vectorization
  • Correct-by-construction vectorized code
  • Functional testing of vector code independent of target platform
  • Code transformations improving data-level parallelism and optimizing code for vectorization
  • Integration of target platform simulators for performance estimation
  • Instant feedback on performance impact of transformations
  • Vectorization-aware code generation from Simulink models
  • Code Fusion: block-crossing vectorization of Simulink models
  • Support for target-specific language extensions: intrinsics, vector C

Your Benefits

  • Easy exploitation of parallel vector hardware
  • No need to write vectorized code manually
  • Limited hardware knowledge necessary
  • Correct-by-construction vectorized code reduces testing effort
  • Functional testing without hardware
  • Instant feedback on performance impact enables short development cycles
  • Workflow enables fast path to acceleration of models

Supported Platforms

emmtrix Parallel Studio supports vectorization for the next generation of Infineon 32-bit TriCore™ AURIX™ TC4x MCUs, as well as microprocessors implementing the RISC-V “V” Vector Extension (RVV). Our generic solution allows us to support additional architectures with ease.

Please call us if you require support for a specific hardware architecture.

Join Our Webinar

“Automated Vectorization in emmtrix Parallel Studio for Next-Generation Infineon 32-bit TriCore™ AURIX™ MCU”

For more information on emmtrix vectorization, or to request a demo, a chat or a meeting, use our contact form or get in touch directly. We’re looking forward to hearing from you.

Rainer Heim