The 2024 Wheel Reinvention Jam is in 9 days. September 23-29, 2024. More info

SSE2 implementations of tan, cot, atan, atan2

SSE2 implementations of tan, cot, atan, atan2

I recently implemented tan, cot, atan, atan2 using SSE2 intrinsics, which serve as extensions to the SSE2 implementations of sin, cos, exp, log by Julien Pommier.
WIP because the original library also implements SSE1 + MMX, which mine don't. I may or may not add them in later, I couldn't get MSVC to compile MMX intrinsics.

I have written the library as an extension to sse_mathfun.h (the original library) instead of modifying it, so that if that library changes, you only need to change one file. I would like to have these functions get integrated into sse_mathfun.h on http://gruntthepeon.free.fr/ssemath/, but I have no idea how to contact the author. I looked at the site itself and his blog, but there don't seem to be any contact information. If you know how to contact the original author (Julien Pommier), please let me know, so that I can ask him to integrate these functions into the original library.

The gains of using sse2 instead of cmath functions on Visual Studio are about a 2x speed up, with atan and atan2 having the biggest gains, but you lose precision (see benchmark).
You can find the library on my github: GitHub

The inspiration for this project came from a recommendation of Mārtiņš Možeiko (mmozeiko) in the thread Guide - How to avoid C/C++ runtime on Windows to use the sse optimized math functions found at http://gruntthepeon.free.fr/ssemath/.

Here are the benchmarks from my machine:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
Results on a 3.30 GHz AMD FX-6100 Six-Core, compiled with Visual C++ Enterprise 2015 Update 3 (x64)
command line: cl.exe /W4 /DUSE_SSE2 /EHsc- /MD /GS- /Gy /fp:fast /Ox /Oy- /GL /Oi /O2 sse_mathfun_test.c

checking sines on [0*Pi, 1*Pi]
max deviation from sinf(x): 5.96046e-08 at 0.193304238313*Pi, max deviation from cephes_sin(x): 0
max deviation from cosf(x): 5.96046e-08 at 0.303994872157*Pi, max deviation from cephes_cos(x): 0
deviation of sin(x)^2+cos(x)^2-1: 1.78814e-07 (ref deviation is 1.19209e-07)
   ->> precision OK for the sin_ps / cos_ps / sincos_ps <<-

checking sines on [-1000*Pi, 1000*Pi]
max deviation from sinf(x): 5.96046e-08 at  338.694424873*Pi, max deviation from cephes_sin(x): 0
max deviation from cosf(x): 5.96046e-08 at  338.694424873*Pi, max deviation from cephes_cos(x): 0
deviation of sin(x)^2+cos(x)^2-1: 1.78814e-07 (ref deviation is 1.19209e-07)
   ->> precision OK for the sin_ps / cos_ps / sincos_ps <<-

checking exp/log [-60, 60]
max (relative) deviation from expf(x): 1.18944e-07 at -56.8358421326, max deviation from cephes_expf(x): 0
max (absolute) deviation from logf(x): 1.19209e-07 at -1.67546617985, max deviation from cephes_logf(x): 0
deviation of x - log(exp(x)): 1.19209e-07 (ref deviation is 5.96046e-08)
   ->> precision OK for the exp_ps / log_ps <<-

checking tan on [-0.25*Pi, 0.25*Pi]
max deviation from tanf(x): 1.19209e-07 at 0.250000006957*Pi, max deviation from cephes_tan(x): 5.96046e-08
   ->> precision OK for the tan_ps <<-

checking tan on [-0.49*Pi, 0.49*Pi]
max deviation from tanf(x): 3.8147e-06 at -0.490000009841*Pi, max deviation from cephes_tan(x): 9.53674e-07
   ->> precision OK for the tan_ps <<-

checking cot on [0.2*Pi, 0.7*Pi]
max deviation from cotf(x): 1.19209e-07 at 0.204303119606*Pi, max deviation from cephes_cot(x): 1.19209e-07
   ->> precision OK for the cot_ps <<-

checking cot on [0.01*Pi, 0.99*Pi]
max deviation from cotf(x): 3.8147e-06 at 0.987876517942*Pi, max deviation from cephes_cot(x): 9.53674e-07
   ->> precision OK for the cot_ps <<-

checking atan on [-10*Pi, 10*Pi]
max deviation from atanf(x): 1.19209e-07 at -9.39207109497*Pi, max deviation from cephes_atan(x): 1.19209e-07
   ->> precision OK for the atan_ps <<-

checking atan on [-10000*Pi, 10000*Pi]
max deviation from atanf(x): 1.19209e-07 at  -7350.3826719*Pi, max deviation from cephes_atan(x): 1.19209e-07
   ->> precision OK for the atan_ps <<-

checking atan2 on [-1*Pi, 1*Pi]
max deviation from atan2f(x): 2.38419e-07 at (0.797784384786*Pi, -0.913876806545*Pi), max deviation from cephes_atan2(x): 2.38419e-07
   ->> precision OK for the atan2_ps <<-

checking atan2 on [-10000*Pi, 10000*Pi]
max deviation from atan2f(x): 2.38419e-07 at ( 658.284195009*Pi, -2685.93394561*Pi), max deviation from cephes_atan2(x): 2.38419e-07
   ->> precision OK for the atan2_ps <<-

exp([        -1000,          -100,           100,          1000]) = [            0,             0, 2.4061436e+38, 2.4061436e+38]
exp([    -nan(ind),           inf,          -inf,           nan]) = [2.4061436e+38, 2.4061436e+38,             0, 2.4061436e+38]
log([            0,           -10,         1e+30, 1.0005271e-42]) = [         -nan,          -nan,     69.077553,    -87.336548]
log([    -nan(ind),           inf,          -inf,           nan]) = [   -87.336548,     88.722839,          -nan,    -87.336548]
sin([    -nan(ind),           inf,          -inf,           nan]) = [    -nan(ind),     -nan(ind),           nan,           nan]
cos([    -nan(ind),           inf,          -inf,           nan]) = [          nan,     -nan(ind),     -nan(ind),           nan]
sin([       -1e+30,       -100000,         1e+30,        100000]) = [          inf,  -0.035749275,          -inf,   0.035749275]
cos([       -1e+30,       -100000,         1e+30,        100000]) = [    -nan(ind),    -0.9993608,     -nan(ind),    -0.9993608]
benching                 sinf .. ->   16.3 millions of vector evaluations/second ->  40 cycles/value on a 2600MHz computer
benching                 cosf .. ->   15.8 millions of vector evaluations/second ->  41 cycles/value on a 2600MHz computer
benching                 expf .. ->   18.8 millions of vector evaluations/second ->  35 cycles/value on a 2600MHz computer
benching                 logf .. ->   17.8 millions of vector evaluations/second ->  36 cycles/value on a 2600MHz computer
benching                 tanf .. ->   13.8 millions of vector evaluations/second ->  47 cycles/value on a 2600MHz computer
benching                 cotf .. ->   12.2 millions of vector evaluations/second ->  53 cycles/value on a 2600MHz computer
benching                atanf .. ->   10.4 millions of vector evaluations/second ->  62 cycles/value on a 2600MHz computer
benching               atan2f .. ->    5.3 millions of vector evaluations/second -> 121 cycles/value on a 2600MHz computer
benching            atan2_ref .. ->   11.7 millions of vector evaluations/second ->  56 cycles/value on a 2600MHz computer
benching                sqrtf .. ->   69.5 millions of vector evaluations/second ->   9 cycles/value on a 2600MHz computer
benching               rsqrtf .. ->   70.2 millions of vector evaluations/second ->   9 cycles/value on a 2600MHz computer
benching          cephes_sinf .. ->   15.2 millions of vector evaluations/second ->  43 cycles/value on a 2600MHz computer
benching          cephes_cosf .. ->   16.6 millions of vector evaluations/second ->  39 cycles/value on a 2600MHz computer
benching          cephes_expf .. ->    2.9 millions of vector evaluations/second -> 220 cycles/value on a 2600MHz computer
benching          cephes_logf .. ->    3.4 millions of vector evaluations/second -> 186 cycles/value on a 2600MHz computer
benching               sin_ps .. ->   30.6 millions of vector evaluations/second ->  21 cycles/value on a 2600MHz computer
benching               cos_ps .. ->   31.1 millions of vector evaluations/second ->  21 cycles/value on a 2600MHz computer
benching            sincos_ps .. ->   30.9 millions of vector evaluations/second ->  21 cycles/value on a 2600MHz computer
benching               exp_ps .. ->   27.3 millions of vector evaluations/second ->  24 cycles/value on a 2600MHz computer
benching               log_ps .. ->   23.5 millions of vector evaluations/second ->  28 cycles/value on a 2600MHz computer
benching               tan_ps .. ->   22.2 millions of vector evaluations/second ->  29 cycles/value on a 2600MHz computer
benching               cot_ps .. ->   22.0 millions of vector evaluations/second ->  29 cycles/value on a 2600MHz computer
benching              atan_ps .. ->   31.1 millions of vector evaluations/second ->  21 cycles/value on a 2600MHz computer
benching             atan2_ps .. ->   24.1 millions of vector evaluations/second ->  27 cycles/value on a 2600MHz computer
benching              sqrt_ps .. ->   63.9 millions of vector evaluations/second ->  10 cycles/value on a 2600MHz computer
benching             rsqrt_ps .. ->   64.1 millions of vector evaluations/second ->  10 cycles/value on a 2600MHz computer
You can toss together some macros pretty quick to make the code more readable, and then *sometimes* you can do separate compilation from SSE to AVX2 by just switching a flag to use a different macro set. (though you may need some ifdefs for some parts)

For example:
https://github.com/jackmott/FastN...ter/FastNoise/headers/FastNoise.h


Or you can get fancy with templates and have it runtime detect and fall back through AVX2 -> SSE4 -> SSE2 and so on. This guy has a neat system you could copy if interested:
https://github.com/Auburns/FastNoiseSIMD


Nice!

But don't do this:
1
return ( (float *)&sse_value )[0];

Do this instead:
1
return _mm_cvtss_f32(sse_value);

Edited by Mārtiņš Možeiko on
Hey, just wanted to praise the atan2 function in here! I'm using it for some real-time pitch shifting code. Under profiling the scalar version atan2f was a pretty big bottleneck. Used this function and it flew off the profiler. Very nice.

Here's a link to the repo making use of the atan2 function: https://github.com/RandyGaul/tinyheaders

One note, I actually had some problems with cos precision. I didn't look into too deeply, but if we look here I had to cut the _mm_cos_ps function and instead manually use four cosf calls. This turned out to be significantly slower, but not a bottleneck by any means.

I'm guessing some large float values were passed into the SSE2 cos function, and it just didn't have appropriate precision to handle such cases. The audio in question would pop and become completely silent (suggesting a fairly nasty precision problem).

In the end your atan2 function worked perfectly. It's unfortunate the old cos didn't quite pull through in this particular case (too bad the original author seems hard to contact).

Edited by Randy Gaul on
Randy Gaul
I'm guessing some large float values were passed into the SSE2 cos function, and it just didn't have appropriate precision to handle such cases.

Have you tried reducing argument to 0..2*pi (or similar) range? That should take just a few sse instructions.

That SSE implementation wanted -8192..8192 interval for high precision output, if I remember correctly (maybe a bit different numbers).

Edited by Mārtiņš Možeiko on
Good idea. I'll give that a shot. Should still be faster than four scalar cos calls.
Here's a useful floating point trick to know:

1
2
3
4
5
6
7
8
9
inline double
radian_reduce_2pi(double x)
{
    static const double round_to_integer = 1.5 * (1ull << 52);
    static const double one_over_period = 1.591549430918953456e-01;
    static const double period = 6.2831853071693331e+00;
    double n = x * one_over_period + round_to_integer - round_to_integer;
    return x - n * period;
}


Hope this helps.

Edited by Andrew Bromage on
Have yet to implement an angle modulo yet in SSE. Does anyone happen to have this kind of function laying around? I imagine this is a pretty commonly needed function, and was hoping someone had a good implementation I could use. If not I'll eventually get around to implementing it myself :)

Similarly, to use _mm_atan2_ps the inputs would need to undergo a floating point modulus. For atan2 I'm assuming the value fed into _mm_atan_ps could be modulo'd in the range of -pi to pi just like _mm_cos_ps.

Edited by Randy Gaul on
The function Pseudonym posted does that - it calculates x % (2*pi).

atan function is not periodic. You cannot simply do mod pi. You can easily see that from its graph: https://www.google.com/search?q=y%3Datan(x) If you need more precision for it then you need to split function into two or multiple parts that approximates independent segment and then choose one of the values. This is what cephes does for sin and cos - two separate polynomials (0..pi/4 and pi/4..pi/2)