Hi, and welcome to this class, where we will see how to optimize the implementation of our kernel in order to efficiently use the available resources on our target FPGA. In particular, we will discuss the loop unrolling optimization.

Let's come back to our vector sum example, which was introduced in the interface optimization classes. The code version that you see here already has some interface optimizations applied to it. In particular, the code already exploits burst data transfers, and we leverage local memories to read the operands and to store the results of our floating-point additions. The core of our kernel resides in the loop labeled sum_loop, where we iterate over the n elements of our vectors and perform the additions one at a time.

Looking again at our synthesis report, we can see that each iteration of sum_loop takes 10 cycles. Since the loop needs 1,024 iterations, the overall latency for computing the loop is 10,240 cycles. Note that the number of loop iterations is referred to as the trip count within the Vivado HLS performance reports. To understand why we need 10 cycles for each iteration, we can look at the analysis report. Here, we can see that two cycles are needed to load the operands from arrays local_A and local_B in parallel, seven cycles are required to perform the floating-point addition, and finally one cycle is needed to store the result back into array local_res.

Is there any way to reduce the overall latency of the loop and achieve higher performance? Luckily, the answer is yes. We will now look into two different optimization directives, namely loop unrolling and loop pipelining. If we take a closer look at our original code, we can clearly see that all the iterations of the loop are independent of each other. Indeed, each addition operates on different elements of the input arrays, and its result is stored in a different element of the output array.
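To fix ideas, here is a minimal sketch of the kernel structure described above. The function and array names (vector_sum, local_A, local_B, local_res) and the buffer sizes are assumptions based on the lecture; the actual lab code may differ in its interface pragmas and details.

```c
#include <assert.h>

#define N 1024  /* trip count of sum_loop, as in the lecture */

/* Sketch of the vector sum kernel: operands are first copied into
 * local on-chip buffers, then added element by element in sum_loop,
 * and the results are written back. */
void vector_sum(const float *A, const float *B, float *res) {
    float local_A[N], local_B[N], local_res[N];

    /* Burst-read the operands into local memories. */
    for (int i = 0; i < N; i++) {
        local_A[i] = A[i];
        local_B[i] = B[i];
    }

    /* Each iteration: 2 cycles load + 7 cycles fadd + 1 cycle store
     * = 10 cycles; 1,024 iterations give 10,240 cycles overall. */
sum_loop:
    for (int i = 0; i < N; i++) {
        local_res[i] = local_A[i] + local_B[i];
    }

    /* Burst-write the results back. */
    for (int i = 0; i < N; i++) {
        res[i] = local_res[i];
    }
}
```

Note that every iteration touches distinct array elements, which is exactly the independence property that the unrolling optimization will exploit.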
Hence, would it be possible to perform multiple additions in parallel on different elements? The answer is again yes, and the way to achieve it is by unrolling the loop. Loop unrolling effectively means replicating the loop body so that the number of loop iterations is reduced while each iteration performs extra computation. This technique exposes additional instruction-level parallelism that Vivado HLS can exploit to implement the final hardware design.

In this example, we have manually unrolled our sum_loop by a factor of two. As you can see, the variable i increments in steps of two, effectively reducing the number of loop iterations from 1,024 to 512. On the other hand, each loop iteration performs two additions instead of one.

The same optimization can also be expressed in a much more convenient way by using the HLS UNROLL pragma. The pragma must be placed directly within the loop that we wish to unroll. The pragma also allows us to specify the factor by which we want to unroll our loop. Notice that the unrolling factor can be any number from two up to the number of iterations of the loop. If the factor parameter is not specified, Vivado HLS will try to completely unroll the entire loop. However, this can be achieved only if the number of iterations is constant, and not dependent on a dynamic value computed within the function.

All right, let us now see the effect of our optimization. If we run Vivado HLS and look at the synthesis report, we can see that the latency of sum_loop has halved. The reduction comes from the fact that the loop now iterates 512 times, while still performing each loop iteration in 10 cycles, as in the previous case. To understand how Vivado HLS achieved this, we can look at the analysis report. Here, we can clearly see that Vivado HLS was able to schedule the execution of the two floating-point additions, as well as the load and store operations, completely in parallel.
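The two equivalent forms of the factor-two unrolling described above can be sketched as follows. Array and function names are assumptions; a standard C compiler ignores the unknown HLS pragma, so both versions compute the same result.

```c
#include <assert.h>

#define N 1024

/* (a) Manual unrolling: i steps by two, and the body performs two
 *     independent additions per iteration (512 iterations total). */
void sum_manual(const float *local_A, const float *local_B,
                float *local_res) {
sum_loop:
    for (int i = 0; i < N; i += 2) {
        local_res[i]     = local_A[i]     + local_B[i];
        local_res[i + 1] = local_A[i + 1] + local_B[i + 1];
    }
}

/* (b) The same optimization expressed with the HLS UNROLL pragma,
 *     placed directly within the loop to be unrolled; Vivado HLS
 *     replicates the body for us. */
void sum_pragma(const float *local_A, const float *local_B,
                float *local_res) {
sum_loop:
    for (int i = 0; i < N; i++) {
#pragma HLS UNROLL factor=2
        local_res[i] = local_A[i] + local_B[i];
    }
}
```

Omitting `factor=2` asks Vivado HLS to fully unroll the loop, which is only possible here because the trip count N is a compile-time constant.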
Nevertheless, this optimization comes at a cost. In order to perform the two floating-point additions fully in parallel, we need two floating-point adders in our hardware design, which increases the overall resource consumption of our kernel. Indeed, if we look at the resource estimation report, we can actually see the two floating-point adder instances and their corresponding resource consumption. In our design, we are far from using all the available FPGA resources, but in more complex designs it is very important to consider the impact on resource consumption when applying optimizations to our kernel.

In this example, unrolling by a factor of two provided a straight 2x reduction in the latency of the loop at the cost of 2x extra resources for its implementation. Nevertheless, in some cases it might not be possible to achieve such an ideal latency improvement. When performing loop optimizations, there are two potential issues that need to be considered: first, constraints on the number of available memory ports and available hardware resources; second, loop-carried dependencies. I know you are interested in knowing more. Don't worry, more information will be provided in the following lesson.
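As a small preview of the second issue, here is a hypothetical example (not from the lecture) of a loop-carried dependency: each iteration reads the accumulator value written by the previous one, so unrolling alone cannot make the additions run fully in parallel the way it did for the vector sum.

```c
#include <assert.h>

/* Running (prefix) sum: iteration i needs the value of acc produced
 * by iteration i-1, so the floating-point additions form a serial
 * chain. Unrolling replicates the body, but the dependency still
 * limits how much of the work can execute in parallel. */
float prefix_sum(const float *in, float *out, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; i++) {
        acc += in[i];   /* loop-carried dependency on acc */
        out[i] = acc;
    }
    return acc;
}
```

Contrast this with sum_loop, where every iteration wrote a distinct output element and carried no state between iterations.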