Sunday, April 16, 2017

Benchmarking My Changes

In my last blog post I explained how I was able to optimize the strfry() function from glibc, and now I have to test for regressions. My last results were measured on an x86 virtual machine on my laptop. The issue with that is that we're supposed to be targeting aarch64, so now I need to test there!

For reference, the code I am using to test both the current glibc's strfry and my own:

#define _GNU_SOURCE /* strfry is a GNU extension */
#include <string.h>
#include <stdio.h>

#define STR_LEN 1000000

#define LOOP_COUNT 10000

#define CHARSET_SIZE 26

int main()
{
        char charset[CHARSET_SIZE];
        char text[STR_LEN];
        for(int i = 0; i < CHARSET_SIZE; i++){
                charset[i] = (char)('a' + i);
        }
        int i;
        /* stop before a copy would run past the end of text */
        for(i = 0; i + CHARSET_SIZE <= STR_LEN; i += CHARSET_SIZE){
                memcpy(text + i, charset, CHARSET_SIZE);
        }

        text[i - 1] = '\0'; /* terminate within the filled region */

        for(i = 0; i < LOOP_COUNT; i++){
                strfry(text);
        }
}

To keep my benchmarks consistent with my peers', I will first re-run my tests on Xerxes, our x86_64 server.

Results of running the current glibc strfry on Xerxes:
real 4m15.630s
user 4m15.525s
sys 0m0.003s

And the results for running my implementation on Xerxes:
real 2m54.665s
user 2m54.609s
sys 0m0.005s

This is roughly a 30% runtime improvement, better even than the 5-10% I was getting on my virtual machine! However, when tested on Betty (our aarch64 server), the results are less impressive.

The current implementation:
real 3m21.161s
user 3m21.160s
sys 0m0.000s

Mine:
real 3m18.495s
user 3m18.500s
sys 0m0.000s

This is only roughly a 1% improvement, and I can only assume that is because aarch64 is a highly optimized architecture that is better able to handle larger sets of data.


You can view my exact code in my SPO600 GitHub pull request here.

Sunday, April 9, 2017

strfry is totally optimizable

A note: I've always been partial to the idea of "let the code do the talking", and if my last blog post didn't reinforce that, I don't know what would! It was a bit of a mess, and rather rushed.

So I previously remarked that I couldn't optimize strfry. I was wrong. For reference, here is the current implementation of strfry:

char* strfry (char *string)
{
  static int init;
  static struct random_data rdata;

  if (!init)
    {
      static char state[32];
      rdata.state = NULL;
      __initstate_r (time ((time_t *) NULL) ^ getpid (),
                     state, sizeof (state), &rdata);
      init = 1;
    }

  size_t len = strlen (string);
  if (len > 0)
    for (size_t i = 0; i < len - 1; ++i)
      {
        int32_t j;
        __random_r (&rdata, &j);
        j = j % (len - i) + i;

        char c = string[i];
        string[i] = string[j];
        string[j] = c;
      }

  return string;
}

As a simple test case, the following code was run:

#define _GNU_SOURCE /* strfry is a GNU extension */
#include <string.h>

#define STR_LEN 1000000
#define LOOP_COUNT 10000

#define CHARSET_SIZE 26

int main(){
        char charset[CHARSET_SIZE];
        char text[STR_LEN];
        int i;
        //fill the charset
        for(i = 0; i < CHARSET_SIZE; i++){
                charset[i] = 'a' + i;
        }
        //fill the text, stopping before a copy would run past the end
        for(i = 0; i + CHARSET_SIZE <= STR_LEN; i += CHARSET_SIZE){
                memcpy(text + i, charset, CHARSET_SIZE);
        }
        text[i - 1] = '\0';

        for(i = 0; i < LOOP_COUNT; i++){
                strfry(text);
        }
}

And the results when timed (compiled with gcc, with the binary statically linked for faster code execution):

real 3m3.960s
user 3m3.676s
sys 0m0.020s


Can you see where the function can be optimized? The strlen call is actually unneeded! It iterates through the full string just to get the length, and then the function shuffles the contents of the string. Only you don't actually need the length to shuffle the string; we can just iterate until we hit the null character. Our changed loop would look like:

  if (string[0])
    for (size_t i = 1; string[i]; ++i)
      {
        int32_t j;
        __random_r (&rdata, &j);
        j = j % i;

        char c = string[i];
        string[i] = string[j];
        string[j] = c;
      }

As you can see, we only removed the function call, changed both the loop condition and the initial if statement, and then changed how we compute the index to swap with. These are all relatively small changes, but they have a rather significant impact when executed. The results when timed with my modification:

real 2m47.227s
user 2m47.024s
sys 0m0.020s

This is an approximate improvement of 5-10%! Now I can go and try to get this reviewed by my fellow classmates, and see what they say!

Monday, April 3, 2017

Optimizing glibc

So for our course's final assignment, we are tasked with finding a function in glibc and optimizing it. At first I assumed this project would be relatively easy (spoiler: it's not), and decided to leave picking my function to the last minute (literally, I picked it in class while others were presenting on their selected functions). I ended up stumbling upon the function "strstr", a function used to locate the first occurrence of a string inside of another string. After reviewing the contents of the function, however, I was forced to conclude that I (with my current knowledge) am unable to further optimize it.

Instead I decided to look at the function "strfry", which shuffles the contents of a string. At first glance I thought it would also be optimizable, but now I'm not so sure. I need to discuss this with our professor.

Saturday, March 4, 2017

SPO600 - Lab 6 - Vectorization Lab

Within this lab we were tasked to write some code that would be auto-vectorized, and then to analyze the disassembled machine code. The following code may seem inefficient, but I had to write it this way to ensure that only one section of the code would get vectorized. For reference, this code was compiled using: "gcc -O3 -g -o lab6 lab6.c"


#include <stdlib.h>
#include <stdio.h>

#define SIZE 1000
#define ARR_SIZE sizeof(int) * SIZE
int main(){
        int* arr1 = malloc(ARR_SIZE);
        int* arr2 = malloc(ARR_SIZE);
        int* sum  = malloc(ARR_SIZE);
        long long finalSum = 0;
        size_t i;

        for(i = 0; i < SIZE; i++) {
                arr1[i] = rand();
        }

        for(i = 0; i < SIZE; i++) {
                arr2[i] = rand();
        }

        for(i = 0; i < SIZE; i++){
                sum[i] = arr1[i] + arr2[i];
        }

        for(i = 0; i < SIZE; i++) {
                finalSum += sum[i];
        }

        printf("Final sum:%lld\n", finalSum);

        free(arr1);
        free(arr2);
        free(sum);
}


In the past I've broken the code down into parts and explained each part. However, the disassembled code here is much longer than it was before, and much of it isn't vectorized, so we're just going to look at the code in the 3rd loop. It's relatively simple, and we only have to analyze a small section of it to understand how the vector operations work!
Here is the main loop for calculating the sum array:

  40062c:       d2800001        mov     x1, #0x0
  400630:       4cdf7861        ld1     {v1.4s}, [x3], #16
  400634:       4cdf7880        ld1     {v0.4s}, [x4], #16
  400638:       4ea08420        add     v0.4s, v1.4s, v0.4s
  40063c:       91000421        add     x1, x1, #0x1
  400640:       4c9f7840        st1     {v0.4s}, [x2], #16

Without any context it can be rather difficult to tell what this code does. The first line sets x1 to zero. The second and third lines load values from arr1 and arr2 into vector registers: each register is 128 bits wide and each lane is a 32-bit integer, so 4 integers get loaded per register. Then the two vectors are added lane by lane and the result stored into v0.4s. We then increment the loop counter by one, and store the results of our addition into memory (x2 being the sum array).

As you can see, using vector operations generates more complicated code, but it is also much faster than doing plain addition on each iteration.


If you would like to see the disassembled and annotated code for the main function: Click here
Please note I only documented up to the first bit of vectorized code; by that point I was able to understand the rest of the code's body rather well.

Saturday, February 18, 2017

SPO600 - Lab 5 - Algorithm Selection Lab

For this lab we were required to write a program to multiply a large set of 16-bit signed integers by one floating-point value between 0.0 and 1.0. This simulates how volume scaling works, and it was used to show how the code we write is optimized by the compiler, and also how you can format your code for the compiler to better optimize it.

The lab itself asked us to write two separate implementations: a naive implementation, and a potentially more optimized solution, for which we chose to use a lookup table. To assist us with our benchmarking, we created a struct called "Result", which looks something like this:

struct Result{
        double elapsed, elapsed_middle;
        struct timeval time_start,time_end;
        double sum;
};


Our naive implementation was about as efficient as expected, taking 2768ms to process a set of 500,000,000 pseudo-random samples (from /dev/urandom, using gcc's -O3 flag), whereas our custom implementation takes 1852ms. That's a difference of 916ms, or roughly a 33% reduction in runtime. When compiling without any optimizations, our naive function takes 18080ms, whereas our custom one takes 11756ms: an improvement of 6324ms, or roughly a 35% reduction in runtime.

For reference, here is our naive and custom functions:

struct Result sum_custom() {
        int i;
        struct Result res;
        static int16_t table[0x10000]; /* one entry per possible int16_t value */
        int idx;

        gettimeofday(&res.time_start, NULL);
        res.sum = 0;
        for(i = -32768; i <= 32767; i++){
                table[i & 0xFFFF] = i * VOL;
        }

        int sz = SIZE;
        for(i = 0; i < sz; i++){
                idx = (unsigned short) data[i];
                res.sum += output[i] = table[idx];
        }
        gettimeofday(&res.time_end, NULL);
        return res;
}

struct Result sum_naive(){
        size_t i;
        struct Result res;
        int16_t val;

        gettimeofday(&res.time_start, NULL);
        res.sum = 0;
        for(i = 0; i < SIZE;i++){
                val = data[i] * VOL;
                output[i] = val;
                res.sum += val; // we're adding here to prevent the compiler
                                // from optimizing our whole loop out.
        }
        gettimeofday(&res.time_end, NULL);
        return res;
}

Tuesday, January 31, 2017

SPO600 - Lab 4 - Compiled C Code

Within this lab, we looked at how various flags provided to the GNU C Compiler can affect the compiled binary, along with how changes in our code can affect the compiled binary.

The -static Flag

The -static flag tells GCC not to link against shared libraries; instead, the required library code is embedded into the compiled binary itself. In turn, that makes the binary much larger, since any library whose functions are used (directly or indirectly) has to be copied into the binary as well.

The -fno-builtin Flag

As the title implies, this tells GCC not to optimize calls to the built-in functions. In class this was shown using printf. Without the flag, the compiler replaces calls to printf (that only have one parameter) with puts. puts doesn't scan the string for any format patterns, so the code should run faster than it would if printf were called.

The -g Flag

The -g flag tells GCC to insert debugging sections into the binary, and in turn that makes the binary much larger. It adds a few more sections, namely .debug_info, .debug_types, and .debug_str. The binary also contains a line number table, allowing debuggers to map the compiled operations to their C code counterparts, along with other info covering variable names and their types, effectively allowing debuggers to reconstruct source code from the assembly.

printf(), With Many Arguments

In this section, we didn't actually change any flag provided to GCC; instead we simply kept adding more arguments to printf to see what happens. It turns out that after 7 arguments, the compiler starts pushing the rest onto the stack instead. This is because the calling convention only reserves so many registers for arguments, and functions with that many arguments tend to be rare, so it would be wasteful to reserve more.

Calling A Simple Function, With The -O0 Flag

Here we had to compare a binary where printf() was called directly against one where printf() was wrapped in another function, output(), which main() called. When the code was compiled, main() contained no printf() call; instead it called output(), and output() contained the printf() call.

Calling A Simple Function, With The -O3 Flag

Here we did the exact same thing as above, but instead of compiling with no optimizations, we compiled with some. This time main() doesn't call output() at all; instead it contains the printf() call directly, because the compiler inlined output(). Interestingly, the compiled binary still contains the original output() function. I assume that is in case your code is a library of some sort, and some external binary needs the function.

SPO600 - Lab 3 - Assembly Lab

Within this lab we were tasked with writing code that prints the numbers 0 to 99 in both x86_64 and aarch64 (arm64) assembly. However, in this write-up I will only focus on the aarch64 assembly, as the two topics overlap heavily.

In C a task like this is simple, and can be done in 8 lines of code, including formatting.

#include <stdio.h>

int main() {
        int i;
        for(i = 0; i < 100; i++) {
                printf("Loop: %d\n", i);
        }
        return 0;
}

In both x86_64 and AArch64 assembly it's much more complex. We do not have direct access to the C standard library (where printf is located), so we instead need to invoke something known as a "syscall" to display our text. A syscall is effectively a function provided by your operating system's kernel, and we need to use one to write to the console. In C, the equivalent code would look like:

#include <unistd.h>
#define STDOUT 1
#define ZERO_ASCII 48

int main() {
        int i;
        char digit;
        for(i = 0; i < 100; i++) {
                write(STDOUT, "Loop: ", 6);
                if(i > 9){
                        digit = (i / 10) + ZERO_ASCII;
                        write(STDOUT, &digit, 1);
                }
                digit = (i % 10) + ZERO_ASCII;
                write(STDOUT, &digit, 1);
                write(STDOUT, "\n", 1);
        }
        return 0;
}

While this code is more complex, as it has to format the number itself, it is still rather short. In AArch64 assembly, a program that does the same thing would look like:

.text
.globl _start

start = 0
max = 100

_start:
/*setup initial loop counter */
mov x19, start

loop:
/* Start loop here */

        /* Print the Loop string */
        mov     x0, 1        /* file descriptor: 1 is stdout */
        adr     x1, loop_msg /* message location (memory address) */
        mov     x2, loop_msg_len /* message length (bytes) */

        mov     x8, 64       /* write is syscall #64 */
        svc     0            /* invoke syscall */

        mov x20, num_msg_len
        udiv x21, x19, x20

        cmp x19, x20
        b.lt skip

        mov     x0, 1        /* file descriptor: 1 is stdout */
        adr     x1, num_msg  /* message location (memory address) */
        add     x1, x1, x21  /* add the loop count */
        mov     x2, 1        /* message length (bytes) */

        mov     x8, 64
        svc     0
skip:
        msub x21, x20, x21, x19

        mov     x0, 1
        adr     x1, num_msg
        add     x1, x1, x21
        mov     x2, 1

        mov     x8, 64
        svc     0


        /* Print newline */

        mov     x0, 1
        adr     x1, nl_msg
        mov     x2, nl_msg_len

        mov     x8, 64
        svc     0

        /* Increment loop */
        add x19, x19, 1
        /* compare the loop counter (x19) to the max value */
        cmp x19, max
        /* branch if less then the max */
        b.lt loop

        mov     x0, 0           /* status -> 0 */
        mov     x8, 93          /* exit is syscall #93 */
        svc     0               /* invoke syscall */

.data

loop_msg:       .ascii  "Loop: "
loop_msg_len = . - loop_msg

num_msg:        .ascii "0123456789"
num_msg_len = . - num_msg

nl_msg:         .ascii "\n"
nl_msg_len = . - nl_msg


As you can see, the amount of code required to complete the same task is far greater, so I'll break down each section of the code.

_start:
mov x19, start

This section of the code is relatively easy to understand, as it simply stores the loop's starting value (0) into register x19. You can think of x19 like the 'i' variable from the C program. The next section of code is the body of the loop, but we're only going to look at half of it for now.

loop:
        mov     x0, 1        /* file descriptor: 1 is stdout */
        adr     x1, loop_msg /* message location (memory address) */
        mov     x2, loop_msg_len /* message length (bytes) */

        mov     x8, 64       /* write is syscall #64 */
        svc     0            /* invoke syscall */

        mov x20, num_msg_len
        udiv x21, x19, x20

        cmp x19, x20
        b.lt skip

        mov     x0, 1
        adr     x1, num_msg
        add     x1, x1, x21
        mov     x2, 1

        mov     x8, 64
        svc     0

We can then break this section down further, into the first part of the code.

        mov     x0, 1        /* file descriptor: 1 is stdout */
        adr     x1, loop_msg /* message location (memory address) */
        mov     x2, loop_msg_len /* message length (bytes) */

        mov     x8, 64       /* write is syscall #64 */

        svc     0            /* invoke syscall */

This section of the code is equivalent to the first call to write() from the C code, where it prints "Loop: ". x0 contains the first parameter (1 is STDOUT), x1 contains the second parameter (the pointer to the block of memory we are printing), and x2 contains the third parameter (the number of bytes we are writing). We then store the number 64 in x8, as it is the ID for the write syscall, which is invoked by svc on the next line.

At this point I would like to note that it is possible to print the whole line with one syscall by buffering it ahead of time. Instead we chose to do each print separately, as it was easier to model it after our C example, and as an added benefit we could easily extend our code to write values in bases other than base-10: all you would have to do is add more characters to the "num_msg" buffer.


This next section of the loop body is where we print the first digit of the number, if it exists.

        mov x20, num_msg_len
        udiv x21, x19, x20

        cmp x19, x20
        b.lt skip

        mov     x0, 1
        adr     x1, num_msg
        add     x1, x1, x21
        mov     x2, 1

        mov     x8, 64
        svc     0


We first load the number of characters in our charset into x20, then divide x19 (our current loop count) by x20 and store the quotient in x21. We then check whether our current loop count is less than the charset size, and if it is we skip printing the first digit (as it would be 0); otherwise we print the number in much the same way as before, only using pointer arithmetic instead. By adding the digit's value to the num_msg pointer, we are able to single out an individual character to be printed from our set.


Now we're going to look at the section of the code where we print the last digit. This code will be executed every iteration, no matter what.

skip:
        msub x21, x20, x21, x19

        mov     x0, 1
        adr     x1, num_msg
        add     x1, x1, x21
        mov     x2, 1

        mov     x8, 64
        svc     0

        mov     x0, 1
        adr     x1, nl_msg

        mov     x2, nl_msg_len

This section is almost the same as the previous one, with only one difference: the msub instruction. It computes the remainder of dividing the loop counter by the charset size, which is then printed out. After that, our code prints the newline.

The last section of code we need to look at is where we increment the loop counter, and actually check to see if we need to loop again.

        add x19, x19, 1
        cmp x19, max
        b.lt loop

This is relatively easy to understand: we just increment the loop counter (x19) by one, then compare it with the max value. If it's less than the max, we jump to the beginning of the loop and run again.


To conclude this blog post, it has been an interesting introduction to assembly, and I look forward to writing more.