compare and describe the costs of calling functions and provide guidelines when to use what:

  1. inline call
  2. non-inline/virtual function
  3. call across shared library boundary
  4. execute via an executor in same thread
  5. execute via an executor in another thread
  6. execute on same machine in another process
  7. execute on another machine

This article is in response to a discussion at http://stackoverflow.com/questions/18292619/c-string-array-sorting.

The comparison done by Mattingly does two things:

1. It compares apples to oranges: if the object is to create an array of sorted strings, this isn’t equivalent to an array of sorted pointers to C-strings.
2. It measures the time needed to construct `std::string` objects from string literals when using `std::sort()` but directly uses string literals when using `qsort()`. Although there is certainly a cost involved in create `std::string` objects, a roughly similar amount of time is spent in a real example to hold the corresponding C-strings.

Below I posted some code which uses the same tiny data set to compare `std::sort()` and `qsort()` removing the second of these issues in all cases (i.e., the construction of the data set is always removed from the measurements taken). In addition, I have added a way to deal with a bigger data set by reading in the words from a file `sort.in` (I used source of today’s featured Wikipedia entry (http://en.wikipedia.org/wiki/Banksia_violacea) but I’d think any reasonably natural text exhibits similar behavior).

The code also adds a third version of the comparison which uses `std::sort()` but operates on an array of C-strings partly to create something which doesn’t compare apples to oranges but also because it came as a surprise that sorting a large array of `std::string` objects is slower than I expected. I haven’t quite dug to the bottom of the issue but I suspect that the issue the relatively large footprint of `std::string` objects introduced by the now common small string optimization. My measurements indicate that the version shuffling pointers to strings using `std::sort()` is fastest, followed by `qsort()`, with the version directly sorting a larger array of `std::string` objects is slowest, although definitely not by a factor of four. The time measurements I observed using a recent snapshort of gcc are along the lines of those (those for clang seem to do better when using pointers with `std::sort()`):

    sorting sort content of sort.in
    c-sort:   396765
    cpp-ptrs: 375525
    c++-sort: 558646

Especially the amount time taken by `std::sort()` on `std::string` objects is, indeed a surprise. Similar tests in the past indicated that sorting `std::string` using `std::sort()` was faster than sorting C-strings using `qsort()`. I’d blame that on the small string optimization. Still when comparing the same things, `std::sort()` is faster than `qsort()`.

The code I used is below:

    #include <iostream>
    #include <fstream>
    #include <iterator>
    #include <string>
    #include <algorithm>
    #include <mach/mach_time.h>
    #include <string.h>
    #include <vector>
    
    void cpp_sort(std::vector<std::string> values)
    {
        uint64_t ts = mach_absolute_time();
        std::sort(values.begin(), values.end());
        std::cout << “c++-sort: ” << ( mach_absolute_time() – ts ) << ‘\n’;
    }
    
    struct cmp
    {
        bool operator()(char const* a, char const* b) const
        {
            return strcmp(a, b) < 0;
        }
    };
    
    void cpp_ptrs(std::vector<std::string> values)
    {
        std::vector<char const*> temp(values.size());
        std::transform(values.begin(), values.end(), temp.begin(),
                       std::mem_fn(&std::string::c_str));
    
        if (!temp.empty()) {
            uint64_t ts = mach_absolute_time();
            std::sort(temp.begin(), temp.end(), cmp());
            std::cout << “cpp-ptrs: ” << (mach_absolute_time() – ts) << ‘\n’;
        }
    }
    
    static int _sort( const void* a, const void *b )
    {
        return strcmp( *( const char** )a, *( const char** )b );
    }
    
    void c_sort(std::vector<std::string> values)
    {
        std::vector<char const*> temp(values.size());
        std::transform(values.begin(), values.end(), temp.begin(),
                       std::mem_fn(&std::string::c_str));
    
        if (!temp.empty()) {
            uint64_t ts = mach_absolute_time();
            qsort(&temp[0], values.size(), sizeof( const char* ), _sort );
            std::cout << “c-sort:   ” << (mach_absolute_time() – ts) << ‘\n’;
        }
    }
    
    void compare1(std::vector<std::string> const& array)
    {
        c_sort(array);
        cpp_ptrs(array);
        cpp_sort(array);
    }
    
    void compare2(std::vector<std::string> const& array)
    {
        cpp_sort(array);
        cpp_ptrs(array);
        c_sort(array);
    }
    
    
    template <typename T, std::size_t Size>
    std::size_t size(T(&)[Size])
    {
        return Size;
    }
    
    int main(int ac, char *av[])
    {
        switch (ac == 1? 0: atoi(av[1]))
        {
        case 0:
        case 1:
            {
                std::cout << “simple test of fixed array\n”;
                std::string name[] = {
                    “john”, “bobby”, “dear”, “test1”, “catherine”, “nomi”,
                    “shinta”, “martin”, “abe”, “may”, “zeno”, “zack”, “angeal”,
                    “gabby”
                };
                compare1(std::vector<std::string>(name, name + size(name)));
            }
            break;
        case 2:
            {
                std::cout << “simple test of fixed array (just reversed calls)\n”;
                std::string name[] = {
                    “john”, “bobby”, “dear”, “test1”, “catherine”, “nomi”,
                    “shinta”, “martin”, “abe”, “may”, “zeno”, “zack”, “angeal”,
                    “gabby”
                };
                compare2(std::vector<std::string>(name, name + size(name)));
            }
            break;
        case 3:
            {
                std::cout << “sorting sort content of sort.in\n”;
                std::ifstream in(“sort.in”);
                std::istream_iterator<std::string> begin(in), end;
                std::vector<std::string> values(begin, end);
                compare1(values);
            }
            break;
        case 4:
            {
                std::cout << “sorting sort content of sort.in (just reversed calls\n”;
                std::ifstream in(“sort.in”);
                std::istream_iterator<std::string> begin(in), end;
                std::vector<std::string> values(begin, end);
                compare2(values);
            }
            break;
        }
    }

The basic code organization in C++ is fairly straight forward: declarations and type definitions go into the header, definition of objects and functions go into the source. This way a library’s declarations are decoupled nicely from their implementation and source dependencies are kept low. Of course, nothing is as simple as this in C++. That is, some classes are used only locally and thus defined only in in the source file rather than in the header. In the other direction, some “header-only” libraries are quite popular which implement all of the functions inline and thus don’t require any library to be compiled and linked to the program separately. What goes where can become a bit of an art and sometimes people employ techniques specifically to reduce the number of dependencies.

When it comes to templates things are clear-cut on the other hand: there is no option than putting the template into the header as otherwise neither the compiler nor the linker will see the definition of a function template when it gets instantiated. Or is there? In C++2003 templates could be exported from a source and thus made visible somehow to the C++ system. Only the EDG C++ front-end implemented this and some vendors using this C++ front-end either disabled this feature or, at least, didn’t support it in any way. Exported templates were removed, not just deprecated, from the C++2011 standard although allowing vendors to be conforming if they implement exported templates (essentially export is still a keyword although one which isn’t necessarily used). What else can be done?

As it turns out, some templates can only be instantiated with a limited set of template arguments or are normally only used with a known and reasonably small set of template arguments. For example, the I/O stream classes are typically only instantiated for the character types char and wchar_t although even C++2003 supports additional character types (i.e. signed char and unsigned char which are both distinct types from char although the values representable using char are identical to one of the other two types; whether char is signed or unsigned is up to the implementation and there implementations for both; C++2011 adds the character types char16_t and char32_t). In fact, it is relatively hard to use the I/O streams with any other type because it requires setting up certain facets in the used std::locale objects and the implementation of a number of functions which need to be present. The standard class template std::basic_string is similarly generally only instantiated for the character types char and wchar_t. Instantiating it for other character types requires implementation of a suitable character traits class or a specialization of std::char_traits<cT> for the corresponding character type cT. For std::complex<T> the standard explicitly states that its use with other types than float, double, and long double is unspecified.

Generally I found that there are quite a number of class and function templates which only work with a limited set of types. Implementing them as templates is still reasonable because it allows treating the different instantiations the same in code using them. For example, the numerical algorithms used with complex numbers don’t really differ with the size of the type used to represent the values. They may even stay the same when using SSE vectors to represent the values, applying the same algorithm to multiple values at once. Thus, it is a bit sad that there is no explicit support for these types. In any case, if the template is only instantiated with a limited number of template arguments it makes sense to separate the interface from the implementation as is conventionally done for non-templatized code. For this purpose I found it helpful to implement templates using a header file with the declarations, a header file with the template definitions, and multiple files containing instantiations of the templates with the supported types. For example, for std::complex<T> this could look something like this (omitting the usual include guards):

// header <complex>
namespace std {
    template <typename T> class complex {
    public:
        complex();
    ...
    };
    template <typename T>
    complex<T> operator+ (complex<T> const&, complex<T> const&);
    ...
}

This header would only contain the definition of the std::complex<T> class template but no definitions (except, possibly, inline definitions of function which can’t afford the function call overhead). If the implementation indeed only supports a limited set of template arguments it is probably nice to create a compile-time error using e.g. static_assert() for unsupported types (actually, the standard doesn’t really define the primary template for std::complex<T> but only the specializations for float, double, and long double, i.e. it mandates an error for unsupported instantiations). The next file would be another header file including the definitions of the [member] function templates:

// header <complex.tpp>
#include <complex>

template <typename T>
std::complex<T>::complex(): real_(), imaginary_() {}
...
template <typename T>
std::complex<T> std::operator+ (std::complex<T> const& c0, std::complex<T> const& c1) {
    return std::complex<T>(c0.real_ + c1.real_, c0.imaginary_ + c1.imaginary);
}
...

This second header defines all the [member] function templates after including the corresponding header. Users of std::complex<T> don’t include this header. It is only included by a suite of source files, each explicitly instantiating the templates with one supported argument type. For example, the file instantiating the functions for the type float could look something like this:

// source file complex-float.cpp
#include <complex.tpp>

template class std::complex<float>;

template std::complex<float> std::operator+ (std::complex<float> const&, std::complex<float> const&);
...

Explicitly instantiating a class template instantiates all member functions whose definition is currently known. To explicitly instantiate non-member function templates it is necessary to list each one individually. To avoid having to list all of the functions for each type it is possible to create an auxiliary function or class template which explicitly references each of them: explicitly instantiating this template implicitly instantiates all of the non-member functions. Although this is conceptually using a different mechanism the net effect is the same, i.e. there will be an object file which defines all of the instantiations of the function templates.

For a templates only supporting a known set of arguments this works quite nicely. However, things get a bit more interesting when conceptually an infinite set of template arguments can be used. It may still be very desirable to instantiate the template with the commonly used template arguments. For example, the effort for instantiating the I/O stream library in every translation unit which uses it is quite substantial. Here is a subset of the classes getting instantiated when you just use an std::istringstream to format an int if nothing is preinstantiated:

  • std::basic_istringstream<char>
  • std::basic_istream<char>
  • std::basic_ios<char>
  • std::ostreambuf_iterator<char>
  • std::basic_stringbuf<char>
  • std::basic_streambuf<char>
  • std::num_get<char>
  • std::numpunct<char>

For the last four class template most of the functionality is in their respective virtual functions and all of these virtual functions will get instantiated, even if they are not used because they are needed to initialize the “virtual function table” (or whatever technique is actually used to implement virtual functions). In case you ever wondered why C++ code compiles fairly slow, this is definitely part of the reason and these are just the class template which I could immediately think of.

One way to deal with this is to still do a similar separation as was done for std::complex<T> and just make life a bit harder for those few people who already took up the burden of setting up streams for another character type. That is, if the definitions are moved into a separate header the documentation would somewhere state something along the lines of “oh, BTW, if you wonder why the I/O streams don’t work with your character type – you need to also do this:”, obviously followed by some explanation nobody is bound to ever read explaining how to include the proper headers and instantiate all those classes.

An alternative approach is to use C++2011’s extern templates which allow a hybrid approach. That is, you can provide a definition of the function templates and then declare that they shouldn’t be implicitly instantiated because they are explicitly instantiated somewhere else. Unfortunately, this prevents the nice trick of implicitly instantiating the various non-member functions by being referenced from an explicitly instantiated function or class template. Unless, of course, the translation unit causing the implicit instantiation of the function templates doesn’t see the extern template declarations. Here is how this could look like (using std::complex<T> to stick with the example):

// header <complex>
namespace std {
    template <typename T> class complex {
    public:
        complex();
    ...
    };
    template <typename T>
    complex<T> operator+ (complex<T> const&, complex<T> const&);
    ...
}
#include <complex.tpp>

The main change is that the <complex> header now also includes the header for the definitions. Obviously, there isn’t strictly a need to separate the two parts into two files when using extern template declarations. For the time being I find it useful to still use this as not all compilers support extern templates, yet. The definition would conditionally contain the various extern template declarations (also showing a draft version of the auxiliary instantiation function):

// header <complex.tpp>
#include <complex>

template <typename T>
std::complex<T>::complex(): real_(), imaginary_() {}
...
template <typename T>
std::complex<T> std::operator+ (std::complex<T> const& c0, std::complex<T> const& c1) {
    return std::complex<T>(c0.real_ + c1.real_, c0.imaginary_ + c1.imaginary);
}
...
#if defined(INSTANTIATING_COMPLEX)
template <typename T>
void use_all_complex_functions() {
    std::complex<T>() + std::complex<T>();
    ...
}
#else
extern template class complex<float>;
extern template class complex<double>;
extern template class complex<long double>;

extern template
std::complex<float> 
    std::operator+<float> (std::complex<float> const&,
                           std::complex<float> const&);
extern template
std::complex<double>
    std::operator+<double> (std::complex<double> const&,
                            std::complex<double> const&);
extern template
std::complex<long double>
    std::operator+<long double> (std::complex<long double> const&,
                                 std::complex<long double> const&);
#endif

The file explicitly instantiating the various class and function templates looks essentially the same as before except that it would define the macro INSTANTIATING_COMPLEX and then include the header <complex>. Overall this is quite a bit of effort. Of course, all the effort is done by the one guy who is implementing the library and everybody who is using the library is benefiting from his efforts!

Being a German living in the United Kingdom yields the constant threat of being run over because the cars tend to come from the unexpected direction. Trying to talk sense with the British is entirely futile: “the cars drive on the right side of the street which is the left side!” is a typical response to pointing out that we should globally standardize on just one side. It can reasonably be argued that the majority of the world’s population is actually used to traffic driving on the left side of the street (although I know from experience that this doesn’t necessarily mean a lot in places like India). However, I think it would be reasonable to think about “cars driving on the right side of the street which is the right side of the street”.

What does this have to do with C++? Well, for some unknown reason the predominant placement of placing the const keyword is UK-style: the right placement for the const is assumed to be on the left. Note, that if T is a type the declarations const T (let’s call this UK-style) and T const (let’s call this continental-style) are identical. However, UK-style const placement yields strange inconsistencies. For example, to indicate that a member function isn’t allowed to change the object pointed to by this you have to place the const on the right. Also, to deduce the type of a declaration you are best off to read it Arabic-style, i.e. right to left. This makes it easy to determine what is qualified by the const qualifier: the type to the left of the const, assuming, of course, that const is used continental-style. That is, continental-style makes the code not only more consistent but also easier to understand.

If you think that things are easy to understand as they are let’s consider a very simple declaration somewhere in a project:

const int* iterator;

Now, someone get’s the bright idea that it may be reasonable to actually give the type of iterators a name indicating what they are and replacing the above line with the following two lines:

typedef int* int_iterator;
const int_iterator iterator;

There was a int* for which we have a typedef now so the code got cleaned-up. Makes sense, doesn’t it? Well, maybe not so much: a constant int_iterator isn’t going to iterate a lot being const: the semantics of the code changed! That is, the new declaration of iterator without using a typedef would look like this:

int* const iterator;

Note that the const in the above declaration has to be continental-style unless an alias (e.g. a typedef) for int* is used. Now, if the original declaration had been continental-style in the first place hopefully nobody would have had the idea of trying to use the typedef:

int const* iterator;

Somehow, the const between the int and the * should have been a clear sign that the type doesn’t want to be replaced – both to the human reader and the global find-and-replace using something akin to the vi-command %s/int *\*/int_iterator/g. Well, I know that the vi-command would get this one right but I’m admittedly a bit worried about some human readers, unfortunately especially about those who wouldn’t even dream of using an editor command like the one above.

After having made a hopefully strong case for continental const placement there is one caveat I shall not leave unmentioned: C has made an attempt to take on some features introduced into C++. One of the features taken up by C as well is const. From what I can tell, it is an absolute failure. I will back the claim that const doesn’t quite fit into C up with just one example signature from the standard C library:

char *strchr(const char *s, int c);

Clearly, this signature has got const-correctness wrong: searching in a string of char const for a character shouldn’t yield a pointer to a modifiable char. Please note that I didn’t pick the one error which is considered to be editorial and which should be corrected to become:

const char *strchr(const char *s, int c); // NOT in C

Since functions can’t be overloaded in C and you would want be able to both pass in a constant string but also get out a pointer to a modifiable char if you passed in a non-const string in the first place, the signature from the standard C library is intentional and as good as it gets (when include <cstring> you’ll get properly overloaded, const-correct version of the function). This is, unfortunately, not the only place where C didn’t get it quite right: aside from the fact you can only declare one level of constness in C (i.e. you can’t have constant pointers, at least not without a typedef) there is no option for proper const placement in C at all: the const qualifier has to be applied UK-style. Ideally, this shouldn’t matter to C++ but in the odd case where you actually have a header file shared between C and C++ this means that you can’t use the correct continental-style const placement. Fortunately, these headers tend to be written by C programmers and C++ programmers can happily put the const on the correct side.

When using IOStreams it is common to use manipulators to somehow affect the stream, often just setting up formatting flags
or skipping over characters. For some uses it would be nice to have additional manipulators for specific use cases. Something I frequently
find useful is to skip over an expected character, e.g. when reading a table with a fixed number of comma separated columns. For this it
would be great to just use the familiar idiom for reading values without having to bother much about the separator being correct while
still getting it checked. Here is how I want to read a table with three columns from a stream in:

for (int x, y, z; in >> x >> comma >> y >> comma >> z; ) {
    std::cout << "x=" << x << " y=" << y << " z=" << z << "\n";
}

The approach to deal with something like this is to write a manipulator which is just a function taking a stream as argument and returning
it. The input and output streams have special formatting functions implemented which take a pointer to a function with this signature
and call the corresponding function. Thus, the comma manipulator mentioned above can be implemented like this:

template <typename cT, typename Traits>
std::basic_istream<cT, Traits>& comma(std::basic_istream<cT, Traits>& in)
{
    if ((in >> std::ws).peek() != in.widen(',')) {
        in.setstate(std::ios_base::failbit);
    }
    else {
        in.ignore();
    }
    return in;
}

When called with an input stream this function starts with skipping possibly leading whitespace using the standard maninpulator
std::ws. Normally this manipulator isn’t needed because the formatted input functions automatically skip leading whitespace.
However, in the code above the next character is inspected using peek() which just looks at the current character without
removing it and it doesn’t skip leading whitespace. If the character happens to be a comma it is extracted by ignoring it. Otherwise the
std::ios_base::failbit is set for the stream indicating that something went wrong.

The code doesn’t directly compare the comma character to the result of peek() but instead obtains a value from widen().
The use of the widen() function is necessary to cope with the potential use of wide characters: this function uses the
stream’s std::locale to convert the character into a wide character version. Note that this has nothing to do with any
sort of encoding: the characters internal to a C++ program are already considered to be converted from any encoding into the
execution character set.

Of course, having a comma manipulator immediately raises the question for a semicolon manipulator. This
should be easy to write and would have roughly identical code: there is exactly one character difference. That is, a quick copy-and-paste
job and we are down the road of [manual] code duplication. This is a path I prefer to avoid as much as possible. We could
try to turn the character to be skipped into a template argument but this fails: it would be necessary to name the comma,
semicolon, etc. manipulators somehow and this isn’t really possible without fixing all template arguments. To have the
manipulators be usable with all instantiations of input stream it is necessary to use a separate type for the manipulator and implement
an actual input operator, e.g.:

template <char S> struct separator {};
template <typename cT, typename Traits, char S>
std::basic_istream<cT, Traits>&
operator>> (std::basic_istream<cT, Traits>& in, separator<S>)
{
    return (in >> std::ws).peek() != in.widen(S)
        ? in.setstate(std::ios_base::failbit), in: in.ignore();
}

extern separator<','> const comma;
extern separator<';'> const semicolon;
// ...

The implementation of the operator>>() is just a more compact version of the code used earlier. This manipulator
can be used nearly identical as the original manipulator with one minor change: implementing the manipulator as a function allows the
use with a temporary stream:

if (std::istringstream(",1") >> comma >> x) { ... }

This works because it is allowed to call non-const member functions on temporary objects. Using the explicitly written
input operator doesn’t work because it would require binding the temporary stream to a non-const reference which isn’t allowed.
If this is really needed the obvious work-around is to use some “true” manipulator, e.g. std::ws, first: the result of the
operator applying a manipulator is always a non-const reference, i.e. once one of the member input functions was called
there is a suitable reference available for other input operators to bind the stream to.

Many C++ text books advertise the use of std::endl to terminate lines when writing to an std::ostream. Unfortunately, it seems that few people understand what std::endl does or if they know they underestimate the impact. What most people don’t notice is that std::endl does two things: it inserts a newline character i.e. '\n' and then it flushes the std::ostream. It is this extra flush which makes the operation quite expensive. Here is a very simple example program demonstrating the difference:

#include <fstream>
#include <string>

template <typename cT, typename Traits>
std::basic_ostream<cT, Traits>&
nl(std::basic_ostream<cT, Traits>& out)
{
    return out << out.widen('\n');
}

int main(int ac, char*[])
{
    std::string   text("0123456789012345678901234567890123456789");
    std::ostream& (*end)(std::ostream&) = std::endl;
    if (ac == 1)
        end = ::nl;

    std::ofstream out("file.txt", std::ios_base::binary);
    for (size_t i(0); i != 1000000; ++i)
    {
        out << text << text << end;
    }
}

All this program does is to write a million lines of 80 characters plus a newline. Depending on whether the program received a command line parameter it either uses std::endl or the custom manipulator ::nl to terminate lines. To make a fair comparison ::nl is implement like std::endl i.e. as a function template. Since the character type isn’t known, it needs to widen the newline character '\n'. Running this program on my machine shows a factor of about 6 between the time the programs take! The version using endl take a bit more than 3 second while the version using ::nl only take about 0.5 seconds.

This extra flushing the stream causes the file to write each block multiple times instead of just once when the buffer has become full. This easily slows down I/O intensive operations and my cause a bottleneck. Thus, don’t use std::endl unless the flush is actually intentional. Also note that std::cerr is already setup to flush the stream after every individual output operation. This is done using the flag std::ios_bse::unitbuf which is initially set for std::cerr. To turn this off you can use the formatter std::nounitbuf and to turn it on you would use std::unitbuf.