Part III: Vectorization – Automated, User-Guided Vectorization Workflow with emmtrix Parallel Studio
March 09, 2022
Harness the power of vector units in next-generation safety-certified MCUs with emmtrix Parallel Studio.
In the first post of our three-part series on vectorization, we introduced vector processing as a way to speed up applications that rely on linear algebra many times over. In our second post, we explained how the absence of a universal and portable method for programming vector units impedes their effective use. In this post, we look at how emmtrix Parallel Studio (ePS) remedies this and simplifies programming with an automated, user-guided vectorization workflow.
Performing vectorization on existing code consists of two steps. The first step is to rewrite the sequential code so that its concurrency becomes visible. This includes loop transformations such as loop unrolling. The goal is to map similar operations that are executed in multiple loop iterations to single vector instructions so that they are executed simultaneously. To get data to the vector units fast enough, the data must be arranged in memory so that it can be accessed linearly. These optimizations are still largely hardware-independent, although the vector length of the target hardware already determines the required degree of loop unrolling. Performing these transformations by hand is tedious and error-prone. ePS, in contrast, offers an interactive workflow: the user selects the transformations, and the tool carries them out.
In the second step, ePS automatically generates correct-by-construction vectorized code that does not contain any manually written parts, which reduces testing effort significantly. The vectorized code can be run directly on hardware or on a simulator. This gives the developer instant feedback on the performance impact of vectorization, which can serve as the basis for the next round of optimization. The result is a shorter development cycle.
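To make the first step more concrete, the following is a minimal sketch in plain C of unrolling a loop by the vector length. It is not output of ePS. The function names, the multiply-add kernel, and the vector length of 4 (`VLEN`) are assumptions chosen for illustration only.

```c
#include <stddef.h>

/* Hypothetical vector length: 4 lanes, e.g. four 32-bit floats per register. */
#define VLEN 4

/* Original sequential loop: one multiply-add per iteration. */
void saxpy_scalar(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

/* Same computation, unrolled by VLEN. The four statements in the loop body
 * are independent and touch consecutive memory locations, so together they
 * correspond to one vector load/multiply-add/store sequence. */
void saxpy_unrolled(size_t n, float a, const float *x, float *y)
{
    size_t i = 0;

    /* Main loop: processes VLEN contiguous elements per iteration. */
    for (; i + VLEN <= n; i += VLEN) {
        y[i + 0] = a * x[i + 0] + y[i + 0];
        y[i + 1] = a * x[i + 1] + y[i + 1];
        y[i + 2] = a * x[i + 2] + y[i + 2];
        y[i + 3] = a * x[i + 3] + y[i + 3];
    }

    /* Remainder loop: handles the last n % VLEN elements sequentially. */
    for (; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Both functions compute the same result for any `n`. The unrolled variant only exposes the structure that a vector unit needs: groups of `VLEN` identical operations on linearly arranged data. A different vector length would require a different unrolling factor, which is where the hardware dependence enters.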
When using Simulink, the emmtrix vectorization flow offers a fast path to accelerating models: once emmtrix Code Generator (eCG) is made aware that a vector architecture is to be targeted, it selects implementations of the Simulink blocks that perform well on the vector hardware. Here, eCG supports a feature called “Code Fusion”, which can combine the implementations of multiple blocks into a single one. This improves data locality and therefore yields an optimized vectorized implementation.
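The idea of fusing block implementations can be illustrated with a small hand-written sketch in C. It does not show code generated by eCG. The two blocks (a gain followed by a sum), all function names, and the buffer `tmp` are hypothetical.

```c
#include <stddef.h>

/* Hypothetical "Gain" block: out = k * in, element by element. */
void gain_block(size_t n, float k, const float *in, float *out)
{
    for (size_t i = 0; i < n; ++i) {
        out[i] = k * in[i];
    }
}

/* Hypothetical "Sum" block: out = a + b, element by element. */
void sum_block(size_t n, const float *a, const float *b, float *out)
{
    for (size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}

/* Unfused: each block runs its own loop, and the intermediate signal is
 * written to and read back from the buffer tmp. */
void model_unfused(size_t n, float k, const float *u, const float *v,
                   float *tmp, float *out)
{
    gain_block(n, k, u, tmp);
    sum_block(n, tmp, v, out);
}

/* Fused: both blocks share one loop, so the intermediate value stays in a
 * register and the buffer tmp is no longer needed. */
void model_fused(size_t n, float k, const float *u, const float *v,
                 float *out)
{
    for (size_t i = 0; i < n; ++i) {
        out[i] = k * u[i] + v[i];
    }
}
```

In the fused version each input element is loaded once and each output element is stored once, without a round trip through memory for the intermediate signal. This is the data-locality benefit described above, and the single loop body remains a regular candidate for vectorization.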
Visit the “vectorization” page on our website for more information, or register for our webinar “Vectorization for Infineon AURIX™ TC4x”, which we are scheduling around embedded world 2022.