On the importance of automated benchmarks

A couple of years ago, Antirez wrote a note on his blog about "The Toilet test". He made a couple of great points, criticizing the test and the method.

That test was mainly about a busy loop querying a remote server and measuring how many queries per second you could do. This is not very representative of how engines like Redis (or even quasardb) are used.

That’s why using this test to compare engines doesn’t yield very interesting results.

Kill all humans

Every time quasardb is built, an automated benchmark is run on the resulting build, provided that all tests pass.

Whether we get a notable performance regression or a performance improvement, we don’t stop until we can explain it or fix it (in the case of a regression). In other words: performance changes are considered show-stopper bugs.
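To make that rule concrete, here is a minimal sketch of what such a gate could look like. It is not our actual tooling: the names (Verdict, classify, build_needs_investigation) and the tolerance value are hypothetical. The point it illustrates is that an unexplained improvement is flagged just like a regression.

```cpp
#include <cmath>

// Hypothetical outcome of comparing one benchmark run against a baseline.
enum class Verdict { WithinNoise, Regression, Improvement };

// Classifies the relative change in throughput (queries per second).
// `tolerance` is the error margin as a fraction, e.g. 0.05 for 5%.
// Assumes baseline_qps is strictly positive.
Verdict classify(double baseline_qps, double current_qps, double tolerance) {
    const double change = (current_qps - baseline_qps) / baseline_qps;
    if (std::fabs(change) <= tolerance) {
        return Verdict::WithinNoise;
    }
    return change < 0.0 ? Verdict::Regression : Verdict::Improvement;
}

// Both directions require an explanation, not only the slowdowns.
bool build_needs_investigation(double baseline_qps, double current_qps,
                               double tolerance) {
    return classify(baseline_qps, current_qps, tolerance) != Verdict::WithinNoise;
}
```

A build script could call build_needs_investigation after the test suite passes and refuse to publish the build when it returns true.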

This benchmark includes complex, multi-threaded queries, but also a variant of the toilet test with a simple put and multiple 1-byte gets in a busy loop.

This busy loop has been key to finding performance regressions as we developed the software, which might sound contradictory with what I just stated above, but this is my blog and I can write whatever the Hell I want.

To be brutally honest, this test is one of the most meaningful performance tests we have, along with the “maximum bandwidth test” which measures the maximum throughput of quasardb (also known as “Your 100 GbE network is now dead”).

Let me explain: we run the busy loop test on the loopback interface, which eliminates network latency. This leaves the TCP stack, kqueue (in the case of FreeBSD), the memory allocator and quasardb. Since the query is only 1 byte large, the impact of the memory allocator is reduced to the minimum.

The higher the value of this test, the lower the quasardb overhead. When working on the engine, this test therefore quickly shows whether you added overhead in the critical path.
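For the curious, the shape of such a busy loop is roughly the following. This is only a sketch: FakeClient is a hypothetical in-process stand-in and not the quasardb API, so the number it produces measures a hash map rather than a server. In the real test the client talks to the daemon over loopback, but the loop and the measurement look the same.

```cpp
#include <chrono>
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a client connection (NOT the quasardb API).
class FakeClient {
public:
    void put(const std::string& key, const std::string& value) {
        store_[key] = value;
    }

    // Returns true and fills `out` when the key exists.
    bool get(const std::string& key, std::string& out) const {
        const auto it = store_.find(key);
        if (it == store_.end()) {
            return false;
        }
        out = it->second;
        return true;
    }

private:
    std::unordered_map<std::string, std::string> store_;
};

struct BusyLoopResult {
    std::size_t completed;      // number of successful gets
    double seconds;             // wall-clock duration of the loop
    double queries_per_second;  // the figure the test reports
};

// One put of a 1-byte value, then `iterations` gets in a tight loop.
template <typename Client>
BusyLoopResult run_busy_loop(Client& client, std::size_t iterations) {
    const std::string key = "k";
    client.put(key, "x");

    std::string value;
    std::size_t completed = 0;
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        if (client.get(key, value)) {
            ++completed;
        }
    }
    const auto stop = std::chrono::steady_clock::now();

    const double seconds = std::chrono::duration<double>(stop - start).count();
    const double qps =
        seconds > 0.0 ? static_cast<double>(completed) / seconds : 0.0;
    return BusyLoopResult{completed, seconds, qps};
}
```

Because the payload is a single byte, almost everything the loop measures is per-query overhead, which is exactly why the number moves when something lands in the critical path.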

When your software reaches a certain size, it’s impossible to accurately guess how your changes are going to impact performance, especially because in a multithreaded environment, a lot of optimizations are counter-intuitive.

Thus, the only way to avoid ending up with an engine slow as hell as you add features is to benchmark the engine all the time, automatically, without any human intervention.

Story from the field

As we were working toward our new major release, quasardb 1.1.0, we had a net performance regression that managed to slip in. This regression appeared gradually with, at every step, a regression within the error margin. Ah, the mirth!
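This failure mode is easy to reproduce on paper. The sketch below uses hypothetical names (DriftReport, check_drift) and made-up throughput figures: it compares each build both with its predecessor and with a fixed reference, namely the first entry in the history. With a 5% error margin and builds that each lose about 3%, no single step is ever flagged, yet the last build sits 15% below the reference.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct DriftReport {
    bool any_step_flagged;      // some build moved beyond tolerance vs. its predecessor
    bool flagged_vs_reference;  // latest build moved beyond tolerance vs. history[0]
    double cumulative_change;   // relative change of latest build vs. history[0]
};

// `qps_history[0]` is the reference (e.g. the last release); later entries
// are successive builds. All values are assumed to be strictly positive.
DriftReport check_drift(const std::vector<double>& qps_history, double tolerance) {
    DriftReport report{false, false, 0.0};
    if (qps_history.size() < 2) {
        return report;
    }

    // Step-by-step comparison: this is the check that drift slips through.
    for (std::size_t i = 1; i < qps_history.size(); ++i) {
        const double step =
            (qps_history[i] - qps_history[i - 1]) / qps_history[i - 1];
        if (std::fabs(step) > tolerance) {
            report.any_step_flagged = true;
        }
    }

    // Comparison against the fixed reference: this one catches the drift.
    report.cumulative_change =
        (qps_history.back() - qps_history.front()) / qps_history.front();
    report.flagged_vs_reference = std::fabs(report.cumulative_change) > tolerance;
    return report;
}
```

Calling check_drift with the illustrative history {100000, 97000, 94000, 91000, 88000, 85000} and a tolerance of 0.05 leaves any_step_flagged false while flagged_vs_reference is true.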

As 1.1.0 approached closure, we realized we had a 15% performance drop compared to 1.0.1 in the toilet test, which is a lot. When we ran the benchmark in a profiler, it clearly showed that more than 95% of the time was spent waiting for the OS, which is good.

When your software only impacts 5% of the benchmark time and you have a 15% performance regression, it means there’s something rotten in the kingdom of Denmark.
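A quick back-of-the-envelope check shows how rotten. The sketch below rests on two simplifying assumptions of mine: the 95/5 split describes the build before the regression, and the time spent waiting for the OS does not change. Under those assumptions, losing 15% of throughput means each query takes about 17.6% longer, and the software's own share would have to become roughly 4.5 times slower to account for it. The function name is hypothetical.

```cpp
// Factor by which the software's own share of the time per query must grow
// to explain a given throughput drop, assuming OS wait time is unchanged.
//   software_share:  fraction of time spent in our code before (e.g. 0.05)
//   throughput_drop: fraction of throughput lost (e.g. 0.15)
double required_slowdown(double software_share, double throughput_drop) {
    // Time per query after the regression, relative to 1.0 before it.
    const double new_total = 1.0 / (1.0 - throughput_drop);
    const double os_share = 1.0 - software_share;
    // Whatever the OS does not account for has to be our own code.
    const double new_software = new_total - os_share;
    return new_software / software_share;
}
```

A slowdown of that size in our own code is not a subtle micro-inefficiency, which is why the numbers called for a real explanation.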

Through profiling and git log analysis we quickly pinpointed the problem to a pass-by-value issue, a problem that took two days to solve.

I’m a huge proponent of the “pass by value” approach in C++. Unfortunately, the compiler cannot always elide the copies, and using explicit perfect forwarding may not be possible because multiple threads need simultaneous read access.
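Here is a toy illustration of the trap. It is not the actual quasardb code: Payload, g_copies and the two functions are hypothetical, and the copy counter exists only to make the cost visible. When the caller hands over a temporary, passing by value costs no copy. When the caller must keep the object alive because other readers still use it, the same signature silently copies on every call, and the fix in that spot is to pass by const reference.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Global copy counter, for demonstration purposes only.
std::size_t g_copies = 0;

// Hypothetical payload type that records every copy construction.
struct Payload {
    std::string data;

    explicit Payload(std::string d) : data(std::move(d)) {}
    Payload(const Payload& other) : data(other.data) { ++g_copies; }
    Payload(Payload&& other) noexcept : data(std::move(other.data)) {}
};

// Pass by value: free for temporaries, but an lvalue the caller has to
// keep (because others still read it) is copied on every call.
std::size_t size_by_value(Payload p) { return p.data.size(); }

// Pass by const reference: read-only access, never copies.
std::size_t size_by_ref(const Payload& p) { return p.data.size(); }
```

With a 1 KiB Payload named shared that has to stay valid, size_by_value(shared) bumps g_copies by one on each call, while size_by_ref(shared) and size_by_value(Payload(std::string("tmp"))) leave it untouched.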

I think it would have taken an insane amount of time if we weren’t able to quickly pinpoint the regression to the appropriate git commit, because even with that it took a while to understand what was going on!

Closing words

Performance is extremely expensive, but those costs can be greatly reduced if you automate benchmarks and have the courage to thoroughly investigate all performance changes.