Cranky
TM
13 posts

#8073
SSE2 implementations of tan, cot, atan, atan2 3 years ago
I recently implemented tan, cot, atan, and atan2 using SSE2 intrinsics, as extensions to Julien Pommier's SSE2 implementations of sin, cos, exp, and log. It is a work in progress because the original library also implements SSE1 + MMX, which mine doesn't. I may or may not add those later; I couldn't get MSVC to compile MMX intrinsics.
I have written the library as an extension to sse_mathfun.h (the original library) instead of modifying it, so that if that library changes, you only need to change one file.
I would like to have these functions integrated into sse_mathfun.h on http://gruntthepeon.free.fr/ssemath/, but I have no idea how to contact the author. I looked at the site itself and his blog, but there doesn't seem to be any contact information. If you know how to contact the original author (Julien Pommier), please let me know, so that I can ask him to integrate these functions into the original library.
Using the SSE2 functions instead of the cmath functions on Visual Studio gives about a 2x speedup, with atan and atan2 seeing the biggest gains, but you lose precision (see benchmark).
You can find the library on my GitHub.
The inspiration for this project came from a recommendation by Mārtiņš Možeiko (mmozeiko), in the thread "Guide - How to avoid C/C++ runtime on Windows", to use the SSE-optimized math functions found at http://gruntthepeon.free.fr/ssemath/.
Here are the benchmarks from my machine:

MandleBro
Jack Mott
112 posts
/ 1 project
Web Developer by day, game hobbyist by night. 
#8078
SSE2 implementations of tan, cot, atan, atan2 3 years ago
You can toss together some macros pretty quickly to make the code more readable, and then *sometimes* you can compile the same source separately for SSE and AVX2 by just switching a flag to select a different macro set (though you may need some ifdefs for some parts).
For example: https://github.com/jackmott/FastN...ter/FastNoise/headers/FastNoise.h
Or you can get fancy with templates and have it detect the CPU at runtime and fall back through AVX2 > SSE4 > SSE2 and so on. This guy has a neat system you could copy if interested: https://github.com/Auburns/FastNoiseSIMD
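A minimal sketch of the macro approach, in C. Every name here (`SIMD_*`, `simd_f32`, `scale_add`, the `USE_AVX` flag) is hypothetical and not taken from FastNoise. The 256-bit float operations shown only need AVX; AVX2 mostly adds the integer ones.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical macro layer: pick one instruction set at compile time. */
#if defined(USE_AVX)                 /* e.g. build with -mavx -DUSE_AVX */
  #include <immintrin.h>
  typedef __m256 simd_f32;
  #define SIMD_WIDTH 8
  #define SIMD_LOAD(p)     _mm256_loadu_ps(p)
  #define SIMD_STORE(p, v) _mm256_storeu_ps((p), (v))
  #define SIMD_SET1(x)     _mm256_set1_ps(x)
  #define SIMD_MUL(a, b)   _mm256_mul_ps((a), (b))
  #define SIMD_ADD(a, b)   _mm256_add_ps((a), (b))
#else                                /* SSE baseline */
  #include <emmintrin.h>
  typedef __m128 simd_f32;
  #define SIMD_WIDTH 4
  #define SIMD_LOAD(p)     _mm_loadu_ps(p)
  #define SIMD_STORE(p, v) _mm_storeu_ps((p), (v))
  #define SIMD_SET1(x)     _mm_set1_ps(x)
  #define SIMD_MUL(a, b)   _mm_mul_ps((a), (b))
  #define SIMD_ADD(a, b)   _mm_add_ps((a), (b))
#endif

/* out[i] = in[i] * scale + offset, written once against the macro layer.
   n must be a multiple of SIMD_WIDTH (no scalar tail in this sketch). */
static void scale_add(float *out, const float *in, size_t n,
                      float scale, float offset)
{
    assert(n % SIMD_WIDTH == 0);
    simd_f32 s = SIMD_SET1(scale);
    simd_f32 o = SIMD_SET1(offset);
    for (size_t i = 0; i < n; i += SIMD_WIDTH) {
        simd_f32 v = SIMD_LOAD(in + i);
        SIMD_STORE(out + i, SIMD_ADD(SIMD_MUL(v, s), o));
    }
}
```

The function body never names an intrinsic directly, so the same source builds for either width. Operations that exist in only one instruction set are where the extra ifdefs come in.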
mmozeiko
Mārtiņš Možeiko
1963 posts
/ 1 project

#8082
SSE2 implementations of tan, cot, atan, atan2 3 years ago Edited by Mārtiņš Možeiko on Aug. 16, 2016, 12:24 a.m.
Nice!
But don't do this:
Do this instead:

Randy Gaul
54 posts

#12227
SSE2 implementations of tan, cot, atan, atan2 2 years, 2 months ago Edited by Randy Gaul on June 19, 2017, 10:27 p.m.
Hey, just wanted to praise the atan2 function in here! I'm using it for some realtime pitch shifting code. Under profiling the scalar version atan2f was a pretty big bottleneck. Used this function and it flew off the profiler. Very nice.
Here's a link to the repo making use of the atan2 function: https://github.com/RandyGaul/tinyheaders
One note: I actually had some problems with cos precision. I didn't look into it too deeply, but if we look here I had to cut the _mm_cos_ps function and instead manually use four cosf calls. This turned out to be significantly slower, but not a bottleneck by any means. I'm guessing some large float values were passed into the SSE2 cos function, and it just didn't have appropriate precision to handle such cases. The audio in question would pop and become completely silent (suggesting a fairly nasty precision problem).
In the end your atan2 function worked perfectly. It's unfortunate the old cos didn't quite pull through in this particular case (too bad the original author seems hard to contact).
mmozeiko
Mārtiņš Možeiko
1963 posts
/ 1 project

#12232
SSE2 implementations of tan, cot, atan, atan2 2 years, 1 month ago Edited by Mārtiņš Možeiko on June 20, 2017, 7:06 p.m.
Randy Gaul: Have you tried reducing the argument to the 0..2*pi (or a similar) range? That should take just a few SSE instructions. If I remember correctly, that SSE implementation wanted the -8192..8192 interval for high-precision output (the exact numbers may be a bit different).
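A minimal sketch of such a reduction using only SSE2, assuming the default round-to-nearest mode and inputs small enough that x / (2*pi) fits in a 32-bit integer. The names `wrap_pi_ps` and `wrap_pi_array` are made up for illustration, and it wraps to roughly -pi..pi rather than 0..2*pi.

```c
#include <assert.h>
#include <stddef.h>
#include <emmintrin.h>

/* Wrap four angles (radians) into roughly [-pi, pi].
   SSE2 has no packed round instruction, so the nearest integer k is
   obtained by converting to int32 and back (default rounding mode). */
static __m128 wrap_pi_ps(__m128 x)
{
    const __m128 inv_two_pi = _mm_set1_ps(0.15915494f); /* 1 / (2*pi) */
    const __m128 two_pi     = _mm_set1_ps(6.2831853f);
    __m128i k_int = _mm_cvtps_epi32(_mm_mul_ps(x, inv_two_pi));
    __m128  k     = _mm_cvtepi32_ps(k_int);
    return _mm_sub_ps(x, _mm_mul_ps(k, two_pi));        /* x - k * 2*pi */
}

/* Wrap a whole buffer in place; n must be a multiple of 4. */
static void wrap_pi_array(float *data, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        _mm_storeu_ps(data + i, wrap_pi_ps(_mm_loadu_ps(data + i)));
    }
}
```

Subtracting a single float 2*pi constant loses accuracy as |x| grows, because the rounding error of the constant is multiplied by k. sse_mathfun.h itself splits its reduction constant into several parts for that reason, so this sketch is only meant for moderately sized angles.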
Randy Gaul
54 posts

#12239
SSE2 implementations of tan, cot, atan, atan2 2 years, 1 month ago
Good idea. I'll give that a shot. Should still be faster than four scalar cos calls.

Pseudonym
Andrew Bromage
184 posts
/ 1 project
Research engineer, resident maths nerd (Erdős number 3). 
#12243
SSE2 implementations of tan, cot, atan, atan2 2 years, 1 month ago Edited by Andrew Bromage on June 21, 2017, 4:30 a.m.
Here's a useful floating point trick to know:
Hope this helps.
sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
Randy Gaul
54 posts

#13039
SSE2 implementations of tan, cot, atan, atan2 1 year, 11 months ago Edited by Randy Gaul on Sept. 6, 2017, 10:32 p.m.
I have yet to implement an angle modulo in SSE. Does anyone happen to have this kind of function lying around? I imagine this is a pretty commonly needed function, and was hoping someone had a good implementation I could use. If not I'll eventually get around to implementing it myself :)
Similarly, to use _mm_atan2_ps the inputs would need to undergo a floating point modulus. For atan2 I'm assuming the value fed into _mm_atan_ps could be reduced into the range -pi to pi, just like for _mm_cos_ps.
mmozeiko
Mārtiņš Možeiko
1963 posts
/ 1 project

#13040
SSE2 implementations of tan, cot, atan, atan2 1 year, 11 months ago
The function Pseudonym posted does that: it calculates x % (2*pi).
The atan function is not periodic, so you cannot simply do mod pi. You can easily see that from its graph: https://www.google.com/search?q=y%3Datan(x)
If you need more precision for it, then you need to split the function into two or more parts, each approximating an independent segment, and then choose one of the values. This is what cephes does for sin and cos: two separate polynomials (0..pi/4 and pi/4..pi/2).
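A rough sketch of that "split into segments, then choose" idea for atan in SSE2. This is not the code from the library in this thread. The thresholds and polynomial coefficients are the single-precision cephes atanf values as I remember them, so treat them as an assumption to verify, and the names `select_ps`, `atan_sketch_ps` and `atan_sketch_array` are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <emmintrin.h>

/* Per-lane select: mask ? a : b (mask lanes are all-ones or all-zeros). */
static __m128 select_ps(__m128 mask, __m128 a, __m128 b)
{
    return _mm_or_ps(_mm_and_ps(mask, a), _mm_andnot_ps(mask, b));
}

static __m128 atan_sketch_ps(__m128 x)
{
    const __m128 sign_bit = _mm_set1_ps(-0.0f);
    const __m128 one      = _mm_set1_ps(1.0f);
    __m128 sign = _mm_and_ps(x, sign_bit);       /* remember the sign */
    __m128 ax   = _mm_andnot_ps(sign_bit, x);    /* |x| */

    /* Three segments of |x|: above tan(3*pi/8), above tan(pi/8), below. */
    __m128 big = _mm_cmpgt_ps(ax, _mm_set1_ps(2.414213562f));
    __m128 mid = _mm_andnot_ps(big, _mm_cmpgt_ps(ax, _mm_set1_ps(0.414213562f)));

    /* Compute every candidate argument, then choose one per lane. */
    __m128 r_big = _mm_sub_ps(_mm_setzero_ps(), _mm_div_ps(one, ax)); /* -1/|x| */
    __m128 r_mid = _mm_div_ps(_mm_sub_ps(ax, one), _mm_add_ps(ax, one));
    __m128 r     = select_ps(big, r_big, select_ps(mid, r_mid, ax));
    /* Offset added back: pi/2, pi/4, or 0 depending on the segment. */
    __m128 offset = _mm_or_ps(_mm_and_ps(big, _mm_set1_ps(1.5707963f)),
                              _mm_and_ps(mid, _mm_set1_ps(0.7853982f)));

    /* Polynomial for atan(r) on the reduced range |r| <= tan(pi/8). */
    __m128 z = _mm_mul_ps(r, r);
    __m128 p = _mm_set1_ps(8.05374449538e-2f);
    p = _mm_sub_ps(_mm_mul_ps(p, z), _mm_set1_ps(1.38776856032e-1f));
    p = _mm_add_ps(_mm_mul_ps(p, z), _mm_set1_ps(1.99777106478e-1f));
    p = _mm_sub_ps(_mm_mul_ps(p, z), _mm_set1_ps(3.33329491539e-1f));
    __m128 y = _mm_add_ps(_mm_mul_ps(_mm_mul_ps(p, z), r), r);

    return _mm_xor_ps(_mm_add_ps(y, offset), sign); /* restore the sign */
}

/* Apply to a buffer; n must be a multiple of 4. */
static void atan_sketch_array(float *out, const float *in, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        _mm_storeu_ps(out + i, atan_sketch_ps(_mm_loadu_ps(in + i)));
    }
}
```

The reduction relies on the identities atan(x) = pi/2 - atan(1/x) and atan(x) = pi/4 + atan((x - 1)/(x + 1)) for positive x, so a single polynomial only ever sees a small argument. Unselected lanes may compute a division by zero, which is harmless here because the mask discards that result.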