An Arrogant Programmer's Adventures in Assembly (and C): Benchmarking My Changes

In my last blog post I explained how I was able to optimize the strfry() function from glibc, and now I have to test for regressions. My earlier results came from an x86 virtual machine on my laptop. The problem is that our code is supposed to target aarch64, so now I need to test there too!

For reference, here is the program I am using to test both glibc's current strfry() and my own version:

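#define _GNU_SOURCE  /* strfry() is a GNU extension; this exposes its declaration */
#include <string.h>

#define STR_LEN 1000000

#define LOOP_COUNT 10000

#define CHARSET_SIZE 26

int main(void)
{
    char charset[CHARSET_SIZE];
    char text[STR_LEN];

    /* Build the alphabet "a".."z" (no terminator needed; it is memcpy'd). */
    for (int i = 0; i < CHARSET_SIZE; i++) {
        charset[i] = (char)('a' + i);
    }

    /* Fill text with repeated copies of the alphabet, stopping before a
       copy would run past the end of the buffer. */
    int i;
    for (i = 0; i + CHARSET_SIZE < STR_LEN; i += CHARSET_SIZE) {
        memcpy(text + i, charset, CHARSET_SIZE);
    }

    /* Terminate the string right after the last full copy. */
    text[i] = '\0';

    /* Shuffle the same string repeatedly; this loop dominates the runtime. */
    for (i = 0; i < LOOP_COUNT; i++) {
        strfry(text);
    }

    return 0;
}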
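Timing alone won't catch a functional regression, though. Since strfry() only swaps characters around, the shuffled string should still contain exactly the same characters as the original. Below is a minimal sketch of such a check. The name same_char_counts() is a hypothetical helper, not part of glibc. It assumes you keep an unshuffled copy of the input to compare against.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical regression check (not part of glibc): returns true if
 * `after` is a rearrangement of `before`, i.e. both strings have the
 * same length and every byte value appears the same number of times. */
bool same_char_counts(const char *before, const char *after)
{
    size_t counts[256] = {0};
    size_t len = strlen(before);

    if (strlen(after) != len)
        return false;

    /* Count each byte in the original string... */
    for (size_t i = 0; i < len; i++)
        counts[(unsigned char)before[i]]++;

    /* ...and cancel each one out against the shuffled string. */
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)after[i];
        if (counts[c] == 0)
            return false;   /* `after` has more of this byte than `before` */
        counts[c]--;
    }

    /* Equal lengths and no count went below zero, so all counts are zero. */
    return true;
}
```

For example, you could fill a second buffer with `strcpy(copy, text)` before the benchmark loop. Then `same_char_counts(copy, text)` should remain true after every strfry() call, for glibc's version and for mine.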

To keep my benchmarks consistent with my peers', I will first re-run my tests on Xerxes, the x86_64 server we have available to use.

Results of running the current glibc strfry on Xerxes:
real 4m15.630s
user 4m15.525s
sys 0m0.003s

And the results for running my implementation on Xerxes:
real 2m54.665s
user 2m54.609s
sys 0m0.005s

This is about a 32% reduction in runtime, even better than the 5-10% I was getting on my virtual machine! However, when tested on Betty (our aarch64 server), the results are less impressive.

The current implementation:
real 3m21.161s
user 3m21.160s
sys 0m0.000s

Mine:
real 3m18.495s
user 3m18.500s
sys 0m0.000s

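This is only about a 1.3% improvement. Notably, the unmodified strfry already runs almost a minute faster on Betty than on Xerxes, so there is less room left to gain. I can only assume that is because AArch64 is a highly optimized architecture that is better able to handle large sets of data.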
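The percentages above compare the `real` times, computing improvement as (old - new) / old. Below is a small sketch that does this arithmetic directly from the `time` output. The names parse_time_seconds() and percent_improvement() are hypothetical helpers. The parser assumes the `<minutes>m<seconds>s` format shown above.

```c
#include <stdio.h>

/* Hypothetical helper: convert a `time` field such as "4m15.630s" into
 * seconds. Returns -1.0 if the text doesn't match "<min>m<sec>s". */
double parse_time_seconds(const char *field)
{
    int minutes;
    double seconds;

    if (sscanf(field, "%dm%lfs", &minutes, &seconds) != 2)
        return -1.0;
    return minutes * 60.0 + seconds;
}

/* Percentage of the old runtime saved by the new version. */
double percent_improvement(double old_secs, double new_secs)
{
    return (old_secs - new_secs) / old_secs * 100.0;
}
```

Take `percent_improvement(parse_time_seconds("4m15.630s"), parse_time_seconds("2m54.665s"))` for Xerxes: it comes out to roughly 31.7. The same calculation with Betty's `real` times gives roughly 1.3.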

You can view my exact code in my SPO600 GitHub pull request.

Sunday, April 16, 2017
