Key Takeaways
- For better performance, modern processors predict branches and execute the following instructions speculatively: a powerful optimization.
- Programmers can be misled into underestimating the cost of branch mispredictions when benchmarking code with synthetic data that is either too short or too predictable. The effect of branch mispredictions can be large. Thankfully, it is often possible to avoid branches entirely.
- Rewriting your code with fewer branches leads to code that has a more consistent speed.
- Proper handling of mispredicted branches may be required to get optimal performance.
Most software code contains conditional branches. In code, they appear in if-then-else clauses, loops, and switch-case constructs. When encountering a conditional branch, the processor checks a condition and may jump to a new code path, if the branch is taken, or continue with the following instructions. Unfortunately, the processor may only know whether a jump is necessary after executing all instructions before the jump. For better performance, modern processors predict the branch and execute the following instructions speculatively. It is a powerful optimization.
There are some limitations to speculative execution, however. For example, the processor must discard the work done after the misprediction and start anew when the branch is mispredicted. Thankfully, processors are good at recognizing patterns and avoiding mispredictions, when possible. Nevertheless, some branches are intrinsically hard to predict and they may cause a performance bottleneck.
1. Branch predictors are not easily fooled
Programmers can be misled into underestimating the cost of branch mispredictions when benchmarking code with synthetic data that is either too short or too predictable. Modern processors can learn to predict thousands of branches, so a short test can fail to expose the true cost of branch mispredictions, even if the data looks random.
Let us consider an example. Suppose we want to decode hexadecimal digits made of either the decimal digits (0-9) or the letters (A-F, a-f). We want to get the corresponding integer value in the range 0 to 15. A reasonable C function for this task might look as follows:
int decode_hex(char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  return -1;
}
A programmer might benchmark such a function by first generating a random string. For example, in C, we might repeatedly call the rand function and select one of the 22 possible hexadecimal characters.
char hex_table[] = { '0', '1', '2', '3', '4',
'5', '6', '7', '8', '9',
'a', 'b', 'c', 'd', 'e', 'f',
'A', 'B', 'C', 'D', 'E', 'F' };
void build_random_string(size_t length,
                         char *answer) {
  for (size_t i = 0; i < length; i++) {
    answer[i] = hex_table[rand()
                          % sizeof(hex_table)];
  }
}
Then, one could record the time it takes to decode all hexadecimal digits in the newly generated string. For good measure, a programmer might run this test several times and compute the average. Let us plot the number of mispredicted branches per hexadecimal digit for different string lengths (1000, 2000, ...) over many trials. Importantly, we reuse the same input string in all trials.
Using a standard server processor (AMD Rome), we find that for a string containing 3000 digits, the number of mispredicted branches per value falls to 0.015 (or 1.5%) after fewer than 20 trials. For a shorter string (1000 digits), the number of mispredicted branches falls even more quickly, down to 0.005 after 8 trials. In other words, a standard server processor can learn to predict all the branches generated by a 1kB hexadecimal random string after fewer than 10 decoding passes.
In programming languages supported by a just-in-time compiler (such as Java or JavaScript), it is routine to repeat the benchmarks until the compiler has optimized the code: If the test is too short, even if the input is random, we are at risk of underestimating the cost of the branch mispredictions.
Similarly, running tests with synthetic data inputs, even when the data volume is large, can make it possible for the processor to predict the branches with high accuracy. Suppose instead of using a random string of digits, we use the digits generated by counting from 0 to 65536: 0x00, 0x01, ...
The branch predictor must essentially predict where the decimal digits and the alphabetical digits appear. The resulting string (0000100021003100421052106310731084219421a521b521c631d631e731f7310842 ... 8ce18ce29ce39ce4ade5ade6bde7bde8cef9cef ...) is not random, but it might take effort to predict the patterns followed by decimal digits and by alphabetical digits. Yet, using the same AMD Rome processor, we find that the number of mispredicted branches on such a string of length 131,072 is virtually zero on the first trial: 0.002 branch misprediction per byte.
2. Branch mispredictions can be expensive
The effect of branch mispredictions can be large. Let us take the data we collected over random hexadecimal strings of various sizes, and plot the number of CPU cycles per input digit against the number of mispredicted branches. We find that the number of CPU cycles varies between slightly over 5 cycles per hexadecimal digit all the way to over 20 cycles per digit, depending on the number of mispredicted branches.
Yet we should not conclude that a large number of mispredicted branches is necessarily a concern: there might be more pressing bottlenecks. For example, consider a binary search in a large array. The limiting factor is the latency of accesses to main memory, which is far more expensive than a branch misprediction.
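To make the binary-search point concrete, here is a classic implementation (a sketch). For random queries, the comparison branch is taken about half the time, so it is hard to predict; yet on arrays much larger than the cache, the misses on array[m] dominate the cost anyway.

```c
#include <stddef.h>

// Classic binary search over a sorted array; returns the index of key,
// or n if the key is absent. The comparison branch inside the loop is
// essentially a coin flip for random queries.
size_t binary_search(const int *array, size_t n, int key) {
  size_t lo = 0, hi = n;
  while (lo < hi) {
    size_t m = lo + (hi - lo) / 2;
    if (array[m] < key) {
      lo = m + 1;
    } else {
      hi = m;
    }
  }
  return (lo < n && array[lo] == key) ? lo : n;
}
```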
3. Avoiding branches
Performance analysis tools, like perf (Linux), xperf (Windows), and Instruments (macOS), can measure the number of mispredicted branches. What if you have determined that for the problem at hand, branches are a problem?
Thankfully, it is often possible to avoid branches entirely. For example, to decode hexadecimal digits, we might use a single lookup in an array:
int digittoval[256] = {
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, 0, 1,
2, 3, 4, 5, 6, 7, 8, 9, -1, -1,
-1, -1, -1, -1, -1, 10, 11, 12, 13, 14,
15, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, 10, 11, 12,
13, 14, 15, -1,...};
int hex(unsigned char c) {
return digittoval[c];
}
This function may get compiled to a single instruction, with no branching. It can still be used to detect improper inputs by checking for a negative result.
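As a sketch, the same table can also be built programmatically at startup rather than written out by hand (init_table and hex_lookup are hypothetical names; the semantics match the digittoval table above):

```c
// 256-entry lookup table: -1 for bytes that are not hexadecimal
// digits, the digit's integer value otherwise.
static int table[256];

static void init_table(void) {
  for (int i = 0; i < 256; i++) table[i] = -1;
  for (int i = 0; i < 10; i++) table['0' + i] = i;
  for (int i = 0; i < 6; i++) {
    table['a' + i] = 10 + i;
    table['A' + i] = 10 + i;
  }
}

int hex_lookup(unsigned char c) { return table[c]; }
```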
Thus, we can often replace branching by memoization: in effect, we precompute the branches and store them in a simple data structure such as an array.
Another strategy to avoid branches is to do speculative work and to later throw it away. For example, suppose that we want to erase all negative integers from an input array of integers. In C, we might proceed with a loop and a branch:
size_t pos = 0;
for (size_t i = 0; i < length; i++) {
  if (values[i] >= 0) {
    values[pos++] = values[i];
  }
}
Let us call this function "branchy." This code checks the sign of each input value and conditionally copies it to a new location. If it is difficult to predict which of the integers are negative, this code may be inefficient. Instead, we could optimistically write out every input value, but only increment our index when the integer is non-negative.
Thankfully, we can determine the sign of an integer without branching. Most compilers will compile the following C function into a negation followed by a logical right shift:
size_t sign(int v) {
  return (v >= 0 ? 1 : 0);
}
Thus, we have the following function to omit negative values from an array. With most compilers, this function will contain a single branch, having to do with the loop, but it is a predictable branch for large arrays.
size_t pos = 0;
for (size_t i = 0; i < length; i++) {
  values[pos] = values[i];
  pos += sign(values[i]);
}
Let us call this function "branchless." Observe that the two functions are not strictly equivalent. The branchy version may never write to the input array, whereas the branchless version always writes each input integer once. Thus, it may be difficult for a compiler to substitute one for the other. Yet the functions may be logically equivalent for practical purposes, an assessment best made by the programmer.
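Putting the fragments above together, the two variants might look like this (a sketch; the function names are mine, and each returns the number of values kept):

```c
#include <stddef.h>

// Branchy version: copies only non-negative values; the copy is
// guarded by a data-dependent branch.
size_t remove_negatives_branchy(int *values, size_t length) {
  size_t pos = 0;
  for (size_t i = 0; i < length; i++) {
    if (values[i] >= 0) {
      values[pos++] = values[i];
    }
  }
  return pos;
}

// Returns 1 for non-negative values, 0 otherwise, without branching
// in the compiled code.
static size_t sign(int v) { return (v >= 0 ? 1 : 0); }

// Branchless version: always writes the current value, but advances
// the output index only for non-negative values.
size_t remove_negatives_branchless(int *values, size_t length) {
  size_t pos = 0;
  for (size_t i = 0; i < length; i++) {
    values[pos] = values[i];
    pos += sign(values[i]);
  }
  return pos;
}
```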
Even when it is impossible to remove all branches, reducing the number of branches "almost always taken" or "almost never taken" may help the processor better predict the remaining branches. For example, if we are decoding the digits one-by-one using a loop, then we have one predictable branch resulting from the loop per digit.
for (i = 0; i < length; i++) {
  int code = decode_hex(input[i]);
  ...
For long input strings, we can reduce by half this number of predictable branches by "unrolling" the loop: We process two input characters with each iteration of the loop.
for (; i + 1 < length; i += 2) {
  int code1 = decode_hex(input[i]);
  int code2 = decode_hex(input[i + 1]);
  ...
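In a complete program, the unrolled loop needs a scalar tail for odd-length inputs. A sketch, with the decoder from earlier and summation as a stand-in for real processing (sum_decoded is a hypothetical name):

```c
#include <stddef.h>

// The hexadecimal decoder from earlier in the article.
int decode_hex(char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  return -1;
}

// Decode the string two digits per iteration, with a scalar tail for
// odd lengths, and return the sum of the decoded values.
long long sum_decoded(const char *input, size_t length) {
  long long sum = 0;
  size_t i = 0;
  for (; i + 1 < length; i += 2) {
    sum += decode_hex(input[i]);
    sum += decode_hex(input[i + 1]);
  }
  for (; i < length; i++) { // at most one leftover digit
    sum += decode_hex(input[i]);
  }
  return sum;
}
```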
If we count the number of mispredicted branches per input digit, we find that the unrolled version may have fewer mispredicted branches. On an AMD Rome processor with input strings made of 5000 digits, we find that after many trials the number of mispredicted branches per input digit is 0.230 for the simple loop and only 0.165 for the unrolled loop, a nearly 30% reduction.
A possible simplified explanation for this phenomenon is that processors use the history of recent branches to predict future branches. Uninformative branches may reduce the ability of the processors to make good predictions.
When it is not possible to avoid hard-to-predict branches, we can often reduce their number. For example, when comparing two dates, it is convenient to put them in the YYYYMMDD format, and to compare the resulting 8-byte strings as if they were 64-bit integers, using a single instruction.
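A sketch of the date comparison follows. The helper names (date_as_u64, date_before) are mine; on little-endian processors the bytes must be swapped so that integer order matches lexicographic (and hence chronological) order, and __builtin_bswap64 is a GCC/Clang builtin.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

// Load an 8-character YYYYMMDD date as a 64-bit integer, byte-swapped
// on little-endian machines so that the integer order matches the
// lexicographic order of the string.
static uint64_t date_as_u64(const char date[8]) {
  uint64_t x;
  memcpy(&x, date, sizeof(x));
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
  x = __builtin_bswap64(x); // GCC/Clang builtin
#endif
  return x;
}

// Returns true if d1 is strictly earlier than d2: a single integer
// comparison replaces a character-by-character loop full of branches.
bool date_before(const char d1[8], const char d2[8]) {
  return date_as_u64(d1) < date_as_u64(d2);
}
```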
Conclusion
Benchmarking code containing hard-to-predict branches is difficult and programmers can sometimes underestimate the price of branches. Rewriting your code with fewer branches leads to code that has a more consistent speed. Proper handling of mispredicted branches may be required to get optimal performance.
About the Author
Daniel Lemire is a computer science professor at the Université du Québec (TELUQ). He has written over 70 peer-reviewed publications, including more than 40 journal articles. He serves on the program committees of leading computer science conferences. During the 2016-2017 NSERC Discovery Grant competition, he received a rating of outstanding for the excellence of the researcher.