Update: I’ve read a lot of comments on HN about the fragility of timestamps for a distributed database. It seems this post doesn’t stress this point enough: we don’t use raw timestamps but a combination of the timestamp, the node id and other specific information. The purpose of the timestamp is to order requests. Should you wish to know more, you can read examples and explanations in Principles of Distributed Database Systems .

I was very happy with the audience in Aspen, Colorado for my Multiplatform C++ talk. I received a lot of great feedback and this gave me tons of ideas on how to improve the talk for next time. It also gave me food for thought regarding quasardb and the way we handle errors.

At some point during the presentation we had short exchanges about the relative precision of time functions and how difficult it can be to use the right function. In this blog post, I would like to dig into that topic further.

The context

To function properly, quasardb needs a high resolution timestamp to uniquely identify every request made. This timestamp needs to have four very important properties:

  1. The function must be scalable;
  2. The function must be as fast as possible;
  3. The values need to be consistent between remote servers;
  4. It must be monotonic, that is, newer timestamps must have a value greater than older timestamps.

To provide a more concrete example, it means that in the following case:

1
2
std::uint64t a = qdb::timestamp();
std::uint64
t b = qdb::timestamp();
b must be greater than a, even (especially) on optimized code.

Two different servers or threads calling the function at "the same time" must return different values and calling the function must not involve locks or costly operations as the function needs to be fast and scalable.

You could "just" use an atomic counter (or something clever based on the TLS), add a thread id and a server id... problem solved! The only problem is how do you synchronize the value between servers without introducing a lock?

I know what you're thinking. You think you have a good idea and you could "make it work", but I'll share a secret with you: you won't.

So how do you share information without sharing information?

Easy: use something already shared. Something like, I don't know, time. Assuming servers are kept synchronized enough (the enough depending on your application), you may just solve the problem by acquiring time precisely enough. (In our case, we have some additional homework to make it work on several nodes that is well beyond the scope of this post.)

The beauty of this is that with a precise enough timer, you also solve the multithreading issue because nothing ever happens at exactly the same time.

The goal

Which brings us to the heart of this post: how do you query the time at a very high resolution, in C++, and on multiple platforms?

In this post we will write a function that returns the number of 10/th microseconds (or 100-nanoseconds intervals if you will) since epoch (UTC) for FreeBSD and Windows. This function will be sub-microsecond-precise which is enough to prevent collisions.

We could try to reach higher precisions, but this would come at the cost of precious bits in our 64-bit timestamps. With 100-nanoseconds interval, we can represent time for ten thousands of years.

The C++ 11 STL way

No problem right? Because with C++11 we have everything we need:

1
2
3
4
5
6
using std::chrono;
highresolutionclock::timepoint epoch;
auto now = high
resolutionclock::now();
auto elapsed = now - epoch;
// you need to be explicit about your timestamp type
std::uint64
t timestamp = duraction_cast<nanoseconds>(elapsed).count()/ 100u;
You wish.

On FreeBSD, this will work pretty well, because most implementations use gettimeofday, and although the man page says "The resolution of the system clock is hardware dependent" it is generally very precise. The real problem is that the function is affected by discontinuous jumps (for example: a manual time adjustment) and may be updated in ticks. Oh and it's not actually very precise.

Did I say it will work pretty well on FreeBSD? Well in fact it doesn't.

As for the Windows implementation, it will most likely use GetSystemTimeAsFileTime which has a resolution of a couple of milliseconds on Windows XP which is just 10,000 times lower than what we need. No problem, right? And even on Windows 7 (or Windows 8 for that matter), the function being updated in ticks, two successive calls tend to return the same value.

It looks like we've reached the Niebler's point where we have to rewrite everything ourselves.

Shall we use the rdtsc instruction?

There is a user-mode accessible instruction, rdtsc (read timestamp counter), which looks like what we need. On Intel, the behavior is reliable as stated in the documentation:

The RDTSC instruction reads the time-stamp counter and is guaranteed to return a monotonically increasing unique value whenever executed, except for a 64-bit counter wraparound. Intel guarantees that the time-stamp counter will not wraparound within 10 years after being reset.

Unfortunately on AMD we do not always have this guarantee. And if your software need to run on something else than Intel-compatible chip, or an Intel compatible chip that doesn't support the rdtsc instruction, you're out of luck.

That's why it's much better to use the operating system functions rather than calling rdtsc directly: the operating system will have worked all these little details for us.

The abstraction layer

I will show you first the abstraction layer, that is, the code common to all architectures. In the design of a multiplatform system function, you usually don't start with the abstraction layer. You first start with a working implementation on one platform, then on another one, then see if you can find common points.

Nevertheless, for the purpose of this blog post, I will show you the arithmetic code that is common to all platforms:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
template<typename T>
struct hiresclock
{
    hires
clock(void): zerotime(T::zerotime()),
                        _offset(T::timestamp()),
                        _frequency(T::update
frequency())
    {
        assert(offset > 0u);
        assert(
frequency > 0u);
        assert(zerotime > 0u);
    }

    std::uint64t now(void)const
    {
        assert(
offset > 0u);
        assert(frequency > 0u);
        const std::uint64
t delta = T::timestamp()- offset;
        // because the frequency is in update per seconds, we have to multiply the delta by 10,000,000
        const std::uint64
t deltainus = delta * 10000000u / frequency;
        return delta
inus + _zerotime;
    }

private:
    std::uint64t _zerotime;
    std::uint64t _offset;
    std::uint64
t _frequency;
};

At initialization we get a "time zero" which we will use to return the number of 100-nanoseconds interval since epoch. Right after, we acquire a timestamp offset. That means that future timestamp acquired will give be reduced to a delta to this offset, offset very close to our "time zero". There is an imprecision within the range of 100 microseconds relative to the "true" epoch time, but we care more about the relative precision than the absolute precision.

What we mean by that is that it's more important to have an accurate number of microseconds between two timestamps than the accurate number of microseconds between a timestamp and epoch.

Two remote servers exchanging information will anyway hardly be time synchronized at a microsecond precision, we can therefore work very well with a 100-microseconds imprecision relative to epoch for inter-server reconciliations.

The platform dependent code

For each platform we can see that we need three functions:

  1. A function returning the number of 100-nanoseconds intervals since epoch with relatively good precision;
  2. A high-precision (sub-microsecond) function returning a counter updated with a sub-microsecond precision;
  3. A function returning the update frequency of said counter.

The Windows implementation

We'll start with the Windows implementation because this is my blog and I'm in charge, thank you very much.

Let's spice things up.

We want our Windows implementation to work on Windows XP SP3. Without that constraint, you could be tempted by GetTickCount64 but know that it isn't precise enough. GetSystemTimePreciseAsFileTime is the function you are looking for, but it requires Windows 8 or later.

The function you are looking for is QueryPerformanceCounter (which calls rdtsc, just so you know).

However the function returns a value unrelated to the number of microseconds since epoch. We will have to do some very advanced mathematics such as subtractions and divisions to solve that problem.

Getting the zero time

When your application starts up, you will query the counter and the time in order to "anchor" your counter to a time reference. Successive calls will compute the offset to that point to return the number of microseconds elapsed since epoch.

That means you will need to query the counter frequency (with QueryPerformanceFrequency) and use a precise enough time query function for the anchor. Because you (rightfully) don't trust the precision of std::chrono (or boost::chrono), you will write your own function:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
std::uint64t filetimeoffset(void)
{
    FILETIME ft;
    std::uint64t tmpres =0;
    // 100-nanosecond intervals since January 1, 1601 (UTC)
    // which means 0.1 us
    GetSystemTimeAsFileTime(&ft);
    tmpres |= ft.dwHighDateTime;
    tmpres <<=32;
    tmpres |= ft.dwLowDateTime;
    // January 1st, 1970 - January 1st, 1601 UTC ~ 369 years
    // or 116444736000000000 us
    staticconst std::uint64
t deltaepoch =116444736000000000;
    tmpres -= delta
epoch;
    return tmpres;
}

// offset in microseconds
static std::uint64t zerotime(void)
{
    return filetime_offset();
}

### The timestamp and update frequency

The timestamp is a request to QueryPerformanceCounter:

1
2
3
4
5
6
7
8
static std::uint64t timestamp(void)
{
    LARGE
INTEGER li;
    QueryPerformanceCounter(&li);
    // there is an imprecision with the initial value,
    // but what matters is that timestamps are monotonic and consistent
    returnstaticcast<std::uint64t>(li.QuadPart);
}
The problem with QueryPerformanceCounter is that on certain systems the call may have poor performances because of the underlying synchronizations and adjustments required by the operating system (in other words, it won't be a simple call to rdtsc). There is not much you can do about it, the only relief is that those systems tend to be uncommon.

Getting the frequency is very straightforward:

1
2
3
4
5
6
7
8
9
10
static std::uint64t updatefrequency(void)
{
    LARGEINTEGER li;
    if(!QueryPerformanceFrequency(&li)||!li.QuadPart)
    {
        // log something
        std::terminate();
    }
    returnstatic
cast<std::uint64_t>(li.QuadPart);
}
If you search the Internet for information related to high-resolution timestamps on Windows, you will find that the update frequency may change with the processor frequency (or for other reasons). This is incorrect and is a confusion with the behavior of rdtsc on AMD K8 architectures. Microsoft guarantees that the frequency update returned by QueryPerformanceFrequency is fixed at boot time and is consistent across all processors.

Getting the frequency once is therefore not only correct, but much more efficient.

Another invalid piece of information you can find about this function is that you can use it to get the processor frequency.

As the Intel documentation clearly states:

The time-stamp counter increments at a constant rate. That rate may be set by the maximum core-clock to bus-clock ratio of the processor or may be set by the maximum resolved frequency at which the processor is booted.

Notwithstanding the fact that the implementation of the performance counter is absolutely not guaranteed.

Assembling all the pieces together

By the power of C++ templates:

1
2
3
4
5
staticconst hiresclock<platformclockimpl> _clockstate;
std::uint64
t timestamp(void)
{
    return _clockstate.now();
}
You now have a working high-resolution Windows timestamp. Shall we do the FreeBSD implementation?

The FreeBSD implementation

The clockgettime() function is what we need. If you're in a hurry, you can use the CLOCKREALTIME_PRECISE id which will return the local time. If you want something with stronger guarantees, you need to "translate" the Windows code.

First things first, let's do the "zero" time function with the help of clockgettime, with the CLOCKREALTIME_PRECISE parameter:

1
2
3
4
5
6
7
8
9
static std::uint64t zerotime(void)
{
    timespec ts;

    clockgettime(CLOCKREALTIME_PRECISE, &ts);

    // in 100-ns intervals
    returnstaticcast<std::uint64t>(ts.tvsec)* 10000000u +staticcast<std::uint64t>(ts.tvnsec)/ 100u;
}

And this is how you get a precise, monotonic timestamp with an arbitrary reference in time:

1
2
3
4
5
6
7
8
9
static std::uint64_t timestamp(void)
{
    timespec ts;

    // cannot fail, parameters are correct and we checked earlier we can access the clock
    clockgettime(CLOCKMONOTONICPRECISE, &ts);
    // in ns
    returnstatic
cast<std::uint64t>(ts.tvsec)* 1000000000u +staticcast<std::uint64t>(ts.tv_nsec);
}

Once again, our good friend rdtsc is taking care of everything for us (on IA32 and AMD64, that is).

To know the update frequency, you need to call clockgetres(). Even if you use CLOCKREALTIMEPRECISE, it's a good idea to call clockgetres() to make sure you have a satisfactory precision:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
static std::uint64t updatefrequency(void)
{
    timespec ts;

    if(clockgetres(clockid, &ts)<0)
    {
        // log error
        std::terminate();
    }

    assert(!ts.tv_sec);

    // this is the precision of the clock, we want the number of updates per second, which is
    // 1 / ts.tvnsec * 1,000,000,000
    staticconst std::uint64
t billion =1000000000;

    return billion /staticcast<std::uint64t>(ts.tv_nsec);

}

In C++ 14 the above code will be much easier to read thanks to digits separators.

As of now, the difference between CLOCKREALTIMEPRECISE and CLOCKMONOTONICPRECISE is just that the function will add the boot time offset for us with CLOCKREALTIMEPRECISE. There isn't even a difference between CLOCKMONOTONICFAST and CLOCKMONOTONICPRECISE. But implementation changes and assuming anything about what happens behind a system function call is a recipe for disaster.

Now you're going to say, "But Edouard, in your timestamp() function you always assume a nanosecond update frequency!" In all rigor we should adjust the timestamp to the update frequency because clock_getres doesn't return an arbitrary value like its Windows counterpart, but it happens that the update frequency isn't really a nanosecond, it's just a convention.

Did I contradict myself in just two paragraphs?

Bonus: a Linux implementation

The FreeBSD implementation is POSIX compatible. But as I said in my Multiplatform C++ talk, you often have surprises when attempting to support multiple UNIXes, and this is no exception.

To make the FreeBSD implementation work on Linux, you have to replace CLOCKMONOTONICPRECISE with CLOCK_MONOTONIC.

You're Heisenberg!

We can consider we have a sort of uncertainty principle that prevents us to be very precise and very fast. Fortunately for us, with the right feature set from the processor, we can get both excellent precision and decent performance. This is not a hard task. Nothing we've done above requires more than reading the documentation carefully.

Attention to details like this is what makes the difference between working and rock-solid software. #frencharrogance