A while ago I watched a really interesting talk about the history of character encodings (https://www.youtube.com/watch?v=gd5uJ7Nlvvo). I was intrigued by the fact that ASCII characters aren't positioned randomly in the table: they're placed so that they have convenient bit patterns for the kinds of operations you'd typically do on them (go to 15:00 in the talk for the details).
Today I tried to implement some of the functions in ctype.h
to see if they would be faster than the ones that I see most often. So for example, instead of writing this:
int to_digit(char c) { return (c - '0'); }
I wrote this:
int to_digit(char c) { return (c & 0x0F); }
And instead of this:
bool is_digit(char c) { return (c >= '0') && (c <= '9'); }
I wrote this:
bool is_digit(char c) { return (c ^ 0x30) < 0xA; }
I benchmarked these and found out that most are actually faster, while some of them are slower. This is the complete code for the functions:
/* Most obvious implementation */
char to_upper_v1(char c) { return (c - 32); }
char to_lower_v1(char c) { return (c + 32); }
int  to_digit_v1(char c) { return (c - '0'); }
char to_char_v1(int i)   { return (char) (i + '0'); }
bool is_digit_v1(char c) { return (c >= '0') && (c <= '9'); }
bool is_alpha_v1(char c) { return ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')); }
bool is_lower_v1(char c) { return (c >= 'a' && c <= 'z'); }
bool is_upper_v1(char c) { return (c >= 'A' && c <= 'Z'); }

/* Less obvious implementation */
char to_upper_v2(char c) { return (c & 0xDF); }
char to_lower_v2(char c) { return (c | 0x20); }
int  to_digit_v2(char c) { return (c & 0x0F); }
char to_char_v2(int i)   { return (char) (i | 0x30); }
bool is_digit_v2(char c) { return (c ^ 0x30) < 0xA; }
bool is_alpha_v2(char c) { return ((c ^ 0x40) - 1) < 0x5B; }
bool is_lower_v2(char c) { return ((c ^ 0x60) - 1U) < 0x1A; }
bool is_upper_v2(char c) { return ((c ^ 0x40) - 1U) < 0x1A; }
These are the results (a billion calls, compiled with MSVC on -O2):
to_upper_v1 =  59.19 sec    to_upper_v2 = 56.17 sec  (+ 5.08 %)
is_digit_v1 =  71.47 sec    is_digit_v2 = 56.91 sec  (+20.37 %)
is_alpha_v1 = 122.46 sec    is_alpha_v2 = 83.24 sec  (+32.02 %)
is_lower_v1 =  65.07 sec    is_lower_v2 = 83.24 sec  (-27.9  %)
is_upper_v1 =  65.88 sec    is_upper_v2 = 82.04 sec  (-27.9  %)
(Note I didn't include the results for to_lower, to_digit and to_char because they're identical to the ones for to_upper.)
I'm still kind of a noob programmer and this is my first time benchmarking code, so I might have made a mistake somewhere, but looking at the instruction counts these results seem to make sense: the faster functions are the ones with the fewest instructions.
I know this isn't anything groundbreaking, but I thought it could be interesting and something that isn't looked at very often.
EDIT: I realized as soon as I turned off my PC that I only checked that they work on printable ASCII characters, so maybe they're not as useful as I thought. But hey, that doesn't mean that there can't be a better way to write those functions that works on all characters - and if there can't be, well, it has been a good exercise.
Thanks for sharing.
One thing you can do is write a function that verifies that the two versions of each function produce the same result.
For functions like to_digit_v1/to_digit_v2 or to_upper_v1/to_upper_v2, where each version is exactly one simple arithmetic operation, I would expect performance to be exactly equal. There should be no difference.
See the benchmark for to_upper_v1/v2 here: https://quick-bench.com/q/XU9dv6u33iRM1VnXuSY4nhW42Dc
For the others, with multiple operations, I'd expect the two comparisons to be a tiny bit faster. Since they are independent, they can execute in "parallel", effectively costing only two operations (the compiler most likely optimizes it down to two operations anyway). Compare that with the arithmetic sequence in the v2 variants, where all three operations must run in order because each depends on the result of the previous one.
See the benchmark for is_lower_v1/v2: https://quick-bench.com/q/Itp5CZt2ZStEwy5OwY9qPHxitfA
And you can see that for v1 it compiles down to two operations:
add $0x9f,%cl
cmp $0x1a,%cl
but for v2 it requires three:
xor $0x60,%ecx
dec %ecx
cmp $0x1a,%ecx
So all v1 functions (except is_alpha) should win in these kind of benchmarks.
@mrmixer that's exactly what I did, but I only called them with printable chars - I was so focused on the binary ops that I totally forgot about the existence of \n and such.
@mmozeiko that's really interesting, I didn't think about the fact that the two comparisons are independent! I thought that one comparison would be faster because of branch prediction (this mystical thing I don't understand very well yet, even though I've been reading about it) - if it only has to "predict" one branch rather than two, it has a better chance of "getting it right", right? Maybe, but it doesn't matter, since it has to wait for the other op to complete anyway... this stuff gets more fascinating every day.