The Tool to Vectorize Your Application
The vector units of upcoming microcontrollers promise to speed up the execution of data-parallel applications based on linear algebra by factors greater than 10. Programming such accelerators manually is challenging because it requires deep knowledge of their instruction set and microarchitecture. emmtrix Code Vectorizer (eCV) is your solution to simplify this task significantly.
See our tool in action. Contact us to request a demo or to set up a chat or meeting.
Introduction to Vectorization
In their next generation of embedded, safety-certified MCUs, vendors like Infineon will incorporate so-called “vector units”: accelerators that can perform several similar operations on different data at the same time. This concept is known as single instruction, multiple data (SIMD). SIMD has been used for a long time in the desktop and server area and is now finding its way into safety-critical embedded systems. With this kind of hardware, applications that rely heavily on linear algebra, like sensor fusion or inference in AI systems, can be sped up by a factor larger than ten.
From a programming point of view, these vector architectures share some of the same principles as GPUs in that they work on many data elements simultaneously and that data needs to be kept in local memory to keep the processing units busy. However, programming is still mostly done manually, which is time-consuming and error-prone, or by using pre-defined library functions, which limits the applications that can be accelerated.
For more information, we recommend our short, three-part vectorization series.
emmtrix Vectorization Flow
emmtrix provides a user-guided, automated vectorization flow that supports the same input languages as the general parallelization flow. Vectorization, like parallelization, is an automated, interactive process: emmtrix Parallel Studio (ePS) performs an initial, automated vectorization for the selected target architecture. The generated code can be run on the target system, e.g. to verify the functionality and timing of the implementation. In addition, ePS provides an interface for connecting a cycle-approximate or cycle-accurate simulator for the respective target platform, which returns information on the runtime behaviour of the vectorized code. This simplifies work in early project stages or in continuous integration, where target hardware might not be readily available. The runtime data is presented to the user as feedback on the success of the vectorization. If necessary, the user can trigger transformations to improve the result, and ePS applies the resulting changes to the code automatically. This cycle is repeated until the developer is satisfied with the result. The developer does not write any vectorized code in the process.
Code transformations that the developer can apply include typical loop transformations like fission, fusion, splitting and unrolling, but also memory layout optimizations like padding.
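To make two of these transformations concrete, here is a minimal sketch in plain C. It is not output of the tool: the function names, the constants `N` and `N_PAD`, and the assumed vector width of 16 elements are illustrative assumptions. Loop fission separates an element-wise statement, which has no dependency between iterations, from a running sum, which does. Padding extends each 15-element row to 16 elements so that a row fills a whole vector.

```c
#include <stddef.h>

#define N 15     /* logical row length (illustrative assumption) */
#define N_PAD 16 /* assumed vector width in elements (illustrative assumption) */

/* Before fission: one loop mixes an independent element-wise operation
 * with a running sum that carries a dependency between iterations. */
void scale_and_accumulate(const float *in, float *scaled, float *running,
                          size_t n, float k)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        scaled[i] = k * in[i];
        acc += in[i];
        running[i] = acc;
    }
}

/* After fission: the first loop has no cross-iteration dependency and is
 * a candidate for vectorization; the second loop keeps the serial part. */
void scale_and_accumulate_fissioned(const float *in, float *scaled,
                                    float *running, size_t n, float k)
{
    for (size_t i = 0; i < n; ++i) {
        scaled[i] = k * in[i];
    }

    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        acc += in[i];
        running[i] = acc;
    }
}

/* Padding: copy a row-major N x N matrix into an N x N_PAD buffer and
 * zero-fill the extra column so that every row spans a full vector. */
void pad_rows(const float *src, float *dst)
{
    for (size_t i = 0; i < N; ++i) {
        for (size_t j = 0; j < N_PAD; ++j) {
            dst[i * N_PAD + j] = (j < N) ? src[i * N + j] : 0.0f;
        }
    }
}
```

Both variants of the scaling function produce the same results as long as the input and output buffers do not overlap; only the loop structure differs.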
To perform automatic vectorization, a lot of information about the target hardware is required. This includes the number of operations executable in parallel, the width of the vector register, the supported data types and instructions, including their latency and throughput, as well as information on the memory hierarchy, including bandwidth and latencies. All this information is captured in a comprehensive hardware model. It also defines the syntax of the generated code, which is highly dependent on the hardware and software environment of the target. This could, for example, be (inline) assembly, intrinsic functions that the compiler translates to machine code, or a C extension that allows functionality and memory allocation to be specified in a more developer-friendly manner.
In the following we will demonstrate the workflow with a simple, but not trivial example, a multiplication of two 15×15 matrices: the algorithm itself is simple, while the vectorization requires some code transformations to yield good performance.
For the calculation of each individual matrix element, the scalar product of the corresponding row vector of matrix A with the column vector of matrix B must be calculated. In a parallel implementation of this scheme, up to 16 elements of the row vector are multiplied with 16 elements of the column vector element by element in parallel and the products are summed up. Before this, however, the vectors must be loaded into the vector registers. The row vector of matrix A lies linearly in memory, so that its elements can be loaded with a single operation. However, the column vector of matrix B is scattered throughout the memory of matrix B. Therefore, 16 individual memory accesses are required to gather all the elements, which nullifies any acceleration that could be achieved through vectorization.
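The access pattern described above can be seen in a scalar reference implementation. The following is a minimal sketch, assuming row-major storage in flat arrays; the function name `matmul_naive` and the constant `N` are illustrative and not part of the tool.

```c
#include <stddef.h>

#define N 15 /* matrix dimension from the example */

/* Scalar reference: C = A * B for row-major N x N matrices. */
void matmul_naive(const float *a, const float *b, float *c)
{
    for (size_t i = 0; i < N; ++i) {
        for (size_t j = 0; j < N; ++j) {
            float sum = 0.0f;
            for (size_t k = 0; k < N; ++k) {
                /* a[i * N + k]: consecutive addresses (stride 1).
                 * b[k * N + j]: jumps N elements per step (stride N). */
                sum += a[i * N + k] * b[k * N + j];
            }
            c[i * N + j] = sum;
        }
    }
}
```

The stride-N access to `b` in the innermost loop is the part that prevents the column vector from being loaded with a single contiguous vector load.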
This can be remedied by applying a transpose operation to matrix B, which exchanges the rows and columns of the matrix. The result is a memory layout in which the column vectors are stored linearly in memory, so that each can be read with one memory access. Finally, a transformation is applied to unroll the inner loop by the number of parallel operations supported by the vector accelerator. This in turn allows an automatic, efficient mapping of these computations to instructions of the accelerator, which yields a significant speed-up compared to sequential execution.
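The effect of these two transformations can be sketched in plain C. This is an illustration under stated assumptions, not the code the tool generates: the function names, the lane count `VLEN` of 16, and the zero-padding of each row to `VLEN` elements are all hypothetical. The inner loop with a fixed trip count of `VLEN` stands in for a single vector multiply; real generated code would use assembly, intrinsics, or C extensions for the selected target.

```c
#include <stddef.h>

#define N 15    /* matrix dimension from the example */
#define VLEN 16 /* assumed number of vector lanes (illustrative assumption) */

/* Copy row-major N x N matrix A into an N x VLEN buffer, zero-padding rows. */
void pad_rows(const float *a, float *a_pad)
{
    for (size_t i = 0; i < N; ++i) {
        for (size_t k = 0; k < VLEN; ++k) {
            a_pad[i * VLEN + k] = (k < N) ? a[i * N + k] : 0.0f;
        }
    }
}

/* Store B transposed and zero-padded: row j of bt_pad holds column j of B. */
void transpose_pad(const float *b, float *bt_pad)
{
    for (size_t j = 0; j < N; ++j) {
        for (size_t k = 0; k < VLEN; ++k) {
            bt_pad[j * VLEN + k] = (k < N) ? b[k * N + j] : 0.0f;
        }
    }
}

/* C = A * B, computed from the padded A and the transposed, padded B. */
void matmul_transposed(const float *a_pad, const float *bt_pad, float *c)
{
    for (size_t i = 0; i < N; ++i) {
        for (size_t j = 0; j < N; ++j) {
            float prod[VLEN];

            /* Fixed trip count of VLEN with stride-1 access on both
             * operands: the shape a vector multiply can be mapped to. */
            for (size_t l = 0; l < VLEN; ++l) {
                prod[l] = a_pad[i * VLEN + l] * bt_pad[j * VLEN + l];
            }

            /* Horizontal sum of the lanes; the padded lane contributes 0. */
            float sum = 0.0f;
            for (size_t l = 0; l < VLEN; ++l) {
                sum += prod[l];
            }
            c[i * N + j] = sum;
        }
    }
}
```

A caller would run `pad_rows` and `transpose_pad` once and then call `matmul_transposed`. Whether the one-time cost of the transpose pays off depends on how often the transposed matrix is reused, which this sketch does not model.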
Vectorization of Simulink Models
Projects utilizing Simulink models benefit particularly from emmtrix’s vectorization solution: As the blocks and interconnects given in the Simulink model describe functionality and data flow rather than an actual implementation, emmtrix Code Generator is free to choose the best possible implementation available for vectorization on the selected target architecture. In addition, it supports a feature called “Code Fusion”, which, unlike library-based approaches that only optimize individual blocks of a Simulink model, allows for cross-block optimization and vectorization. It combines calculations that are performed by different blocks into a single, inlined implementation. This improves data locality and the ratio between computation and data transfer, which raises the utilization of the vector processor and thus increases the achievable speedup.
The pictured Simulink model implements a calculation that multiplies a matrix with a second, transposed matrix. The elements of the resulting matrix are then multiplied by a constant factor (gain). Optimized vector code can be generated for each of these blocks using library-based approaches. In that case, however, each step reads in the entire matrices, processes them, and writes back the complete intermediate result, which must then be read in again by the next block to generate the next intermediate result. In the absence of data locality, many load/store operations are required to transfer the data between the registers and the local vector data store. To avoid this, the functionality of the Transpose, Product and Gain blocks can be combined into one calculation, where the transpose and gain operations are performed during the matrix multiplication. Since no intermediate results need to be kept, the implementation uses less local memory. In addition, the runtime in this example improves by about 10% on an Infineon 32-bit TriCore™ AURIX™ TC4x MCU.
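The difference between the block-by-block and the fused computation can be sketched in scalar C. This is a hedged illustration of the idea only: the function names, the dimension `N`, and the flat row-major layout are assumptions, and the code is neither generated by emmtrix Code Generator nor intended to reproduce the measured runtime improvement.

```c
#include <stddef.h>

#define N 15 /* illustrative matrix dimension (assumption) */

/* Block-by-block version: each step writes a full intermediate matrix
 * (bt, then prod) that the next step has to read in again. */
void transpose_product_gain_unfused(const float *a, const float *b,
                                    float gain, float *bt, float *prod,
                                    float *out)
{
    /* Transpose block: bt = B^T */
    for (size_t i = 0; i < N; ++i) {
        for (size_t j = 0; j < N; ++j) {
            bt[j * N + i] = b[i * N + j];
        }
    }

    /* Product block: prod = A * bt */
    for (size_t i = 0; i < N; ++i) {
        for (size_t j = 0; j < N; ++j) {
            float sum = 0.0f;
            for (size_t k = 0; k < N; ++k) {
                sum += a[i * N + k] * bt[k * N + j];
            }
            prod[i * N + j] = sum;
        }
    }

    /* Gain block: out = gain * prod */
    for (size_t i = 0; i < N * N; ++i) {
        out[i] = gain * prod[i];
    }
}

/* Fused version: element (i, j) of A * B^T is the dot product of row i of
 * A and row j of B, so neither a transpose buffer nor an intermediate
 * product matrix is needed, and both operands are read with stride 1. */
void transpose_product_gain_fused(const float *a, const float *b,
                                  float gain, float *out)
{
    for (size_t i = 0; i < N; ++i) {
        for (size_t j = 0; j < N; ++j) {
            float sum = 0.0f;
            for (size_t k = 0; k < N; ++k) {
                sum += a[i * N + k] * b[j * N + k];
            }
            out[i * N + j] = gain * sum;
        }
    }
}
```

The fused function needs no `bt` and `prod` buffers, which corresponds to the reduced local memory use described above.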
- Functional testing of vector code independent of target platform
- Code transformations improving data level parallelism and optimizing code for vectorization
- Integration of target platform simulators for performance estimation
- Vectorization-aware code generation from Simulink® models
- Code Fusion: block-crossing vectorization of Simulink® models
- Generation of C code with vector extensions using generic libraries or target specific intrinsics
- Easy exploitation of parallel vector hardware
- No need to write vectorized code manually
- Limited hardware knowledge required
- Reduced testing effort
- Functional testing without hardware
- Short development cycles
The tool supports vectorization for the next generation of Infineon 32-bit TriCore™ AURIX™ TC4x MCUs, as well as microprocessors implementing the RISC-V “V” Vector Extension (RVV). Our generic solution allows us to support additional architectures with ease.
Please call us if you require support for a specific hardware architecture.