Thursday 12 November 2009

Benchmarking function call overhead in C++


I've caught a cold and have been staying at home. Just for fun, I decided to compare the cost of calls to:
  • a regular function from the same non-position-independent-code (non-PIC) executable object
  • a regular function from a PIC shared library object
  • a virtual function from the same non-PIC executable object
  • a virtual function from a PIC shared library object
Here is the source code of the benchmark:
[max@truth test]$ cat a.h
#pragma once

#include <memory>

namespace max {

void funA(long*); // same non-pic executable, different .o
void funB(long*); // another pic shared library

struct X
{
    virtual ~X() {}
    virtual void fun(long*) = 0;
};

std::auto_ptr<X> createA(); // same non-pic executable, different .o
std::auto_ptr<X> createB(); // another pic shared library

}

[max@truth test]$ cat a.cc
#include "a.h"

void max::funA(long* x) { ++*x; }

namespace {

struct ImpOfX : max::X
{
    void fun(long* x) { ++*x; }
};

}

std::auto_ptr<max::X> max::createA()
{
    return std::auto_ptr<max::X>(new ImpOfX);
}

[max@truth test]$ cat b.cc
#include "a.h"

void max::funB(long* x) { ++*x; }

namespace {

struct ImpOfX2 : max::X
{
    void fun(long* x) { ++*x; }
};

}

std::auto_ptr<max::X> max::createB()
{
    return std::auto_ptr<max::X>(new ImpOfX2);
}

[max@truth test]$ cat main.cc
#include "a.h"
#include <stdio.h>
#include <time.h>

using namespace max;

namespace {

typedef unsigned long long nsec_t;

nsec_t now()
{
    timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * nsec_t(1000000000) + ts.tv_nsec;
}

long benchmark(long n, X& a, X& b, double times[4])
{
    long sum[4] = {};

    nsec_t start, stop;
    start = now();
    for(int i = n; i--;) {
        funA(sum + 0);
    }
    stop = now();
    times[0] = stop - start;
    start = stop;

    for(int i = n; i--;) {
        funB(sum + 1);
    }
    stop = now();
    times[1] = stop - start;
    start = stop;

    for(int i = n; i--;) {
        a.fun(sum + 2);
    }
    stop = now();
    times[2] = stop - start;
    start = stop;

    for(int i = n; i--;) {
        b.fun(sum + 3);
    }
    stop = now();
    times[3] = stop - start;
    start = stop;

    return sum[0] + sum[1] + sum[2] + sum[3];
}

}

int main()
{
    std::auto_ptr<X> a(createA());
    std::auto_ptr<X> b(createB());

    double times[4];
    // warm-up CPU caches and branch predictors
    long sum = benchmark(100, *a, *b, times);
    // benchmark
    long const N = 1e9;
    sum += benchmark(N, *a, *b, times);

    printf(
        "%ld calls to: \n"
        " a regular function from the same executable: %.3lfnsec/call\n"
        " a regular function from a shared library: %.3lfnsec/call\n"
        " a virtual function from the same executable: %.3lfnsec/call\n"
        " a virtual function from a shared library: %.3lfnsec/call\n"
        , N
        , times[0] / N
        , times[1] / N
        , times[2] / N
        , times[3] / N
        );
}
To get nanosecond-resolution timestamps, the clock_gettime() function is used.
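
As a sanity check, the actual resolution of CLOCK_MONOTONIC can be queried with clock_getres(). A minimal sketch, separate from the benchmark above (link with -lrt on older glibc):

#include <stdio.h>
#include <time.h>

// Print the resolution reported for CLOCK_MONOTONIC; with high-resolution
// timers enabled this is typically 1 nanosecond.
int main()
{
    timespec res;
    clock_getres(CLOCK_MONOTONIC, &res);
    printf("CLOCK_MONOTONIC resolution: %ld sec %ld nsec\n",
           (long)res.tv_sec, (long)res.tv_nsec);
}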

This is how it is compiled:
[max@truth test]$ g++ -m64 -Wall -Wextra -march=native -O3 -c main.cc
[max@truth test]$ g++ -m64 -Wall -Wextra -march=native -O3 -c a.cc
[max@truth test]$ g++ -m64 -Wall -Wextra -march=native -O3 -fpic -c b.cc
[max@truth test]$ g++ -m64 -Wl,-rpath,'$ORIGIN' -Wl,-z,now -shared -o libb.so b.o
[max@truth test]$ g++ -m64 -Wl,-rpath,'$ORIGIN' -Wl,-z,now -o test main.o a.o libb.so -lrt
When linking, the -Wl,-z,now option disables lazy symbol binding. With lazy binding, a call to a function in a PIC object is resolved by the dynamic linker on the first call. Disabling it makes the dynamic linker resolve all symbols before any application code runs, rather than on first access.
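
To see what lazy binding would otherwise cost, one can relink the test executable without -Wl,-z,now and time the very first call into libb.so separately from the second one. A minimal self-contained sketch, reusing a.h from above (the exact numbers will of course vary):

#include "a.h"
#include <stdio.h>
#include <time.h>

static unsigned long long now_ns()
{
    timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

int main()
{
    long x = 0;
    unsigned long long t0 = now_ns();
    max::funB(&x); // first call: may go through the dynamic linker's resolver
    unsigned long long t1 = now_ns();
    max::funB(&x); // second call: the PLT entry is already resolved
    unsigned long long t2 = now_ns();
    printf("first call: %llu nsec, second call: %llu nsec\n", t1 - t0, t2 - t1);
}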

The other linker option, -Wl,-rpath,'$ORIGIN', tells the dynamic linker to search for shared libraries first in the directory containing the executable, so that there is no need to fiddle with the LD_LIBRARY_PATH variable or /etc/ld.so.conf.

And here are the results of the benchmark:
[max@truth test]$ sudo chrt -f 99 /usr/bin/time -f "%E elapsed, %c context switches" ./test
1000000000 calls to:
a regular function from the same executable: 1.999nsec/call
a regular function from a shared library: 2.665nsec/call
a virtual function from the same executable: 2.332nsec/call
a virtual function from a shared library: 2.332nsec/call
0:09.32 elapsed, 0 context switches
The test is invoked as a FIFO realtime-class process with priority 99 (the highest on my system). FIFO realtime-class processes run until they voluntarily block; time quanta do not apply to them.
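
The same scheduling policy can also be requested from within the program instead of wrapping it in chrt. A sketch, assuming Linux and root privileges (or CAP_SYS_NICE):

#include <sched.h>
#include <stdio.h>

int main()
{
    sched_param sp = {};
    sp.sched_priority = 99;                 // highest SCHED_FIFO priority here
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
        perror("sched_setscheduler");       // fails without sufficient privileges
    // ... run the benchmark loops here ...
}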

In the above invocation the GNU /usr/bin/time utility reports the number of context switches during the run, to confirm that the process executed uninterrupted and that scheduling noise is completely eliminated from the timings.
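
The context-switch counts can also be read from inside the process itself via getrusage(), which is roughly where /usr/bin/time gets its numbers. A small sketch of that idea:

#include <sys/resource.h>
#include <stdio.h>

int main()
{
    // ... run the benchmark loops here ...
    rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    printf("%ld voluntary, %ld involuntary context switches\n",
           ru.ru_nvcsw, ru.ru_nivcsw);
}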

The test was run on an Intel E8400 CPU at its full 3GHz clock speed.

Interpretation of the results:
  • Calls to regular functions in non-PIC object files are the cheapest.
  • Calls to virtual member functions are more expensive. An interesting fact here is that calls to virtual functions are always dispatched through the virtual table, which means that the procedure linkage table (PLT) used for regular functions in PIC objects is bypassed entirely. It makes no difference whether a virtual function resides in a PIC or a non-PIC object; it is called the same way, through the virtual table (a hand-rolled sketch of this dispatch follows the list).
  • Calls to regular functions in PIC objects are the most expensive. This is because such a call actually goes through a stub in the PLT which (once the symbol has been resolved) jumps to the real function.
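
To make the virtual-call point concrete, here is a hand-rolled illustration (not the compiler's actual output) of what such a call reduces to: load the object's hidden vtable pointer, load the target function pointer from the table, and make an indirect call:

// Rough equivalent of what the compiler generates for obj->fun(x) when fun()
// is virtual: two dependent loads followed by an indirect call.
struct FakeVtable
{
    void (*fun)(void* self, long* x);
};

struct FakeObject
{
    const FakeVtable* vptr;   // hidden first member of a polymorphic object
};

inline void fake_virtual_call(FakeObject* obj, long* x)
{
    obj->vptr->fun(obj, x);   // no PLT involved, regardless of PIC vs non-PIC
}
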
Hm..., interesting!

3 comments:

  1. The code does an increment in the function calls, right? Does this mean that the 2 nanoseconds includes the time for the increment? I wonder how much of the time is for the function call and how much for the increment.

    Also, I wonder about the difference compared to doing the same code without a function call, because I would have thought -O3 would inline this tiny function?

  2. I tried with no increment - same numbers.

    It is interesting that the difference between the results is about 0.3 nanoseconds, which is roughly the time of one CPU cycle on the 3GHz CPU I used (1 / 3GHz ≈ 0.33ns).
