Saturday, March 4, 2017

SPO600 - Lab 6 - Vectorization Lab

In this lab we were tasked with writing some code that the compiler could auto-vectorize, and then analyzing the disassembled machine code. The following code may seem inefficient, but I wrote it this way to ensure that only one section of the code would get vectorized. For reference, this code was compiled using: "gcc -O3 -g -o lab6 lab6.c"


#include <stdlib.h>
#include <stdio.h>

#define SIZE 1000
#define ARR_SIZE (sizeof(int) * SIZE)
int main(){
        int* arr1 = malloc(ARR_SIZE);
        int* arr2 = malloc(ARR_SIZE);
        int* sum  = malloc(ARR_SIZE);
        long long finalSum = 0;
        size_t i;

        for(i = 0; i < SIZE; i++) {
                arr1[i] = rand();
        }

        for(i = 0; i < SIZE; i++) {
                arr2[i] = rand();
        }

        for(i = 0; i < SIZE; i++){
                sum[i] = arr1[i] + arr2[i];
        }

        for(i = 0; i < SIZE; i++) {
                finalSum += sum[i];
        }

        printf("Final sum: %lld\n", finalSum); /* %lld matches long long */

        free(arr1);
        free(arr2);
        free(sum);
}


In the past I've broken the code down into parts and then explained each part. However, the disassembled code here is much longer than it was in previous labs, and much of it isn't vectorized. So we're just going to look at the code in the 3rd loop: it's relatively simple, and we only have to analyze a small section of it to understand how the vector operations work!
Here is the main loop for calculating the sum array:

  40062c:       d2800001        mov     x1, #0x0
  400630:       4cdf7861        ld1     {v1.4s}, [x3], #16
  400634:       4cdf7880        ld1     {v0.4s}, [x4], #16
  400638:       4ea08420        add     v0.4s, v1.4s, v0.4s
  40063c:       91000421        add     x1, x1, #0x1
  400640:       4c9f7840        st1     {v0.4s}, [x2], #16

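Without any context it can be rather difficult to tell what this code does. The first line sets x1, the loop counter, to zero. The second and third lines load values from arr1 and arr2 into the vector registers v1 and v0. Each load fetches 16 bytes, and 16 / sizeof(int) = 4, so four integers get loaded at a time (that is what the ".4s" means: four 32-bit lanes). The "#16" at the end of each load then advances the pointer by 16 bytes so it points at the next four integers. Next, the two vectors are added lane by lane and the result is stored in v0. We then increment the loop counter by one and store the four results of our addition into memory (x2 being the pointer into the sum array), again advancing the pointer by 16 bytes.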
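To make the walkthrough more concrete, here is a rough C sketch of what the vectorized loop is doing. This is not what the compiler generated and it does not use real vector instructions; add_arrays_by_fours and LANES are names I made up for illustration. It just mimics the idea: handle four integers per pass, then clean up any leftovers one at a time.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4 /* assumption: a 128-bit register holds four 32-bit ints */

/* Hypothetical helper: out[i] = a[i] + b[i], four elements per pass,
 * mirroring the ld1 / ld1 / add / st1 sequence in the disassembly. */
void add_arrays_by_fours(const int *a, const int *b, int *out, size_t n)
{
        size_t i = 0;
        size_t lane;

        assert(a != NULL && b != NULL && out != NULL);

        /* Main loop: each pass consumes LANES elements from a and b */
        for (; i + LANES <= n; i += LANES) {
                int va[LANES], vb[LANES];

                for (lane = 0; lane < LANES; lane++) {
                        va[lane] = a[i + lane]; /* like: ld1 {v1.4s}, [x3], #16 */
                        vb[lane] = b[i + lane]; /* like: ld1 {v0.4s}, [x4], #16 */
                }
                for (lane = 0; lane < LANES; lane++) {
                        /* like: add v0.4s, v1.4s, v0.4s then st1 {v0.4s}, [x2], #16 */
                        out[i + lane] = va[lane] + vb[lane];
                }
        }

        /* Scalar tail for lengths that are not a multiple of LANES */
        for (; i < n; i++) {
                out[i] = a[i] + b[i];
        }
}
```

With SIZE set to 1000 the tail loop would never run, since 1000 divides evenly by 4. I only included it so the sketch works for any length.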

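As you can see, using vector operations generates more complicated code, but it is also much faster than doing plain addition on each iteration: every pass through the loop handles four integers, so the 1000-element arrays need roughly 250 iterations instead of 1000.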


If you would like to see the disassembled and annotated code for the main function:
Click here
Please note that I only documented up to the first bit of vectorization code; by that point I was able to understand the rest of the code's body rather well.