An Arrogant Programmer's Adventures in Assembly (and C): SPO600 - Lab 6

In this lab we were tasked with writing some code that the compiler could auto-vectorize, and then analyzing the disassembled machine code. The following code may seem inefficient, but I wrote it this way to ensure that only one section of the code would get vectorized. For reference, this code was compiled using: "gcc -O3 -g -o lab6 lab6.c"

#include <stdlib.h>
#include <stdio.h>

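#define SIZE 1000
#define ARR_SIZE (sizeof(int) * SIZE)

int main(void) {
    int* arr1 = malloc(ARR_SIZE);
    int* arr2 = malloc(ARR_SIZE);
    int* sum = malloc(ARR_SIZE);
    long long finalSum = 0;
    size_t i;

    /* Fill both input arrays with random values. */
    for(i = 0; i < SIZE; i++) {
        arr1[i] = rand();
    }

    for(i = 0; i < SIZE; i++) {
        arr2[i] = rand();
    }

    /* Element-wise addition: the loop examined in the disassembly. */
    for(i = 0; i < SIZE; i++) {
        sum[i] = arr1[i] + arr2[i];
    }

    /* Add up every element of the sum array. */
    for(i = 0; i < SIZE; i++) {
        finalSum += sum[i];
    }

    /* finalSum is a long long, so it needs %lld rather than %d. */
    printf("Final sum:%lld\n", finalSum);

    free(arr1);
    free(arr2);
    free(sum);
    return 0;
}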

In the past I've broken the code down into parts and then explained each part. This time the disassembled code is much longer, and most of it isn't vectorized, so we're just going to look at the code in the 3rd loop. It's relatively simple, and we only have to analyze a small section of it to understand how the vector operations work!
Here is the main loop for calculating the sum array:

40062c: d2800001 mov x1, #0x0
400630: 4cdf7861 ld1 {v1.4s}, [x3], #16
400634: 4cdf7880 ld1 {v0.4s}, [x4], #16
400638: 4ea08420 add v0.4s, v1.4s, v0.4s
40063c: 91000421 add x1, x1, #0x1
400640: 4c9f7840 st1 {v0.4s}, [x2], #16

Without any context it can be rather difficult to tell what this code does. The first line sets x1 to zero. The second and third lines load values from arr1 and arr2 into vector registers. Each vector register is 16 bytes wide and an int is 4 bytes, so 16 / sizeof(int) = 4 integers get loaded at a time, and the "#16" at the end of each load moves the pointer forward by 16 bytes afterwards. The two vectors are then added lane by lane, and the result is stored in v0.4s. We then increment the loop counter by one, and store the results of our addition into memory (x2 pointing into the sum array).
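
To make that a bit more concrete, here is a rough sketch in plain C of what those instructions are doing. This is not what the compiler generated, and add_four_lanes is a hypothetical name I made up for illustration. The small v1 and v0 arrays are only stand-ins for the vector registers, and the sketch assumes the element count is a multiple of 4 (which 1000 is).

```c
#include <stddef.h>

/* Hypothetical helper, not part of the lab code: adds two int arrays
   four elements at a time, mirroring the ld1 / ld1 / add / st1 sequence.
   Assumes n is a multiple of 4, as SIZE (1000) is. */
void add_four_lanes(const int *a, const int *b, int *out, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        int v1[4], v0[4];                    /* stand-ins for v1.4s and v0.4s */

        for (int lane = 0; lane < 4; lane++) {
            v1[lane] = a[i + lane];          /* ld1 {v1.4s}, [x3], #16 */
            v0[lane] = b[i + lane];          /* ld1 {v0.4s}, [x4], #16 */
        }
        for (int lane = 0; lane < 4; lane++) {
            v0[lane] = v1[lane] + v0[lane];  /* add v0.4s, v1.4s, v0.4s */
        }
        for (int lane = 0; lane < 4; lane++) {
            out[i + lane] = v0[lane];        /* st1 {v0.4s}, [x2], #16 */
        }
    }
}
```

On the real hardware each of those inner lane loops is a single instruction, which is the whole point of vectorizing.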

As you can see, using vector operations generates more complicated code, but it can also be much faster than doing plain addition on each iteration, since every pass through the loop adds four integers instead of one.
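
As a back-of-the-envelope check on where the savings come from, the sketch below counts how many passes through the loop are needed for a given number of elements per pass. loop_trips is a hypothetical helper of my own, not something from the lab. Fewer passes does not translate directly into the same speedup, since loading and storing memory still takes time.

```c
#include <stddef.h>

/* Hypothetical helper: number of loop passes needed to cover n elements
   when each pass handles `lanes` elements. Rounds up so that a partial
   final group still counts as one pass. */
size_t loop_trips(size_t n, size_t lanes)
{
    return (n + lanes - 1) / lanes;
}
```

For the 1000-element arrays in this lab, loop_trips(1000, 1) gives 1000 passes for the plain loop and loop_trips(1000, 4) gives 250 for the 4-wide version.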

If you would like to see the disassembled and annotated code for the main function:
Click here
Please note I only documented up to the first bit of vectorization code; by that point I was able to understand the rest of the code's body rather well.

Saturday, March 4, 2017

SPO600 - Lab 6 - Vectorization Lab
