Cranky
TM 9 posts
#8073
SSE2 implementations of tan, cot, atan, atan2 10 months, 2 weeks ago
I recently implemented tan, cot, atan, and atan2 using SSE2 intrinsics, as extensions to the SSE2 implementations of sin, cos, exp, and log by Julien Pommier. It is a work in progress because the original library also implements SSE1 + MMX versions, which mine doesn't. I may or may not add them later; I couldn't get MSVC to compile MMX intrinsics. I wrote the library as an extension to sse_mathfun.h (the original library) instead of modifying it, so that if that library changes, you only need to change one file.
I would like to get these functions integrated into sse_mathfun.h on http://gruntthepeon.free.fr/ssemath/, but I have no idea how to contact the author. I looked at the site itself and his blog, but there doesn't seem to be any contact information. If you know how to contact the original author (Julien Pommier), please let me know, so that I can ask him to integrate these functions into the original library.
Using SSE2 instead of the cmath functions on Visual Studio gives about a 2x speedup, with atan and atan2 gaining the most, but you lose precision (see benchmark).
You can find the library on my GitHub.
The inspiration for this project came from a recommendation by Mārtiņš Možeiko (mmozeiko), in the thread "Guide - How to avoid C/C++ runtime on Windows", to use the SSE-optimized math functions found at http://gruntthepeon.free.fr/ssemath/.
Here are the benchmarks from my machine:
MandleBro
Jack Mott 96 posts
1 project
Web Developer by day, game hobbyist by night. Fond of C and F# 
#8078
SSE2 implementations of tan, cot, atan, atan2 10 months, 1 week ago
You can toss together some macros pretty quickly to make the code more readable, and then *sometimes* you can compile the same code separately for SSE and AVX2 just by switching a flag to select a different macro set (though you may need some ifdefs for some parts).
For example: https://github.com/jackmott/FastN...ter/FastNoise/headers/FastNoise.h
Or you can get fancy with templates and have it detect the CPU at runtime and fall back through AVX2 > SSE4 > SSE2 and so on. This guy has a neat system you could copy if interested: https://github.com/Auburns/FastNoiseSIMD
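To make the macro idea concrete, here is a minimal sketch of what such a macro layer could look like. It is not taken from FastNoise or FastNoiseSIMD; the names `simd_f`, `SIMD_WIDTH`, `simd_*` and `scale_bias` are made up for illustration, and it assumes the compiler defines `__AVX2__` when AVX2 code generation is enabled (GCC/Clang with -mavx2, MSVC with /arch:AVX2).

```c
#include <stddef.h>
#include <immintrin.h>  /* SSE2 and AVX intrinsics */

/* Hypothetical macro layer: the vector width is chosen at compile time. */
#if defined(__AVX2__)
  #define SIMD_WIDTH 8
  typedef __m256 simd_f;
  #define simd_loadu(p)     _mm256_loadu_ps(p)
  #define simd_storeu(p, v) _mm256_storeu_ps((p), (v))
  #define simd_set1(x)      _mm256_set1_ps(x)
  #define simd_add(a, b)    _mm256_add_ps((a), (b))
  #define simd_mul(a, b)    _mm256_mul_ps((a), (b))
#else
  #define SIMD_WIDTH 4
  typedef __m128 simd_f;
  #define simd_loadu(p)     _mm_loadu_ps(p)
  #define simd_storeu(p, v) _mm_storeu_ps((p), (v))
  #define simd_set1(x)      _mm_set1_ps(x)
  #define simd_add(a, b)    _mm_add_ps((a), (b))
  #define simd_mul(a, b)    _mm_mul_ps((a), (b))
#endif

/* out[i] = a * x[i] + b, written once against the macro layer. */
static void scale_bias(float *out, const float *x, size_t n, float a, float b)
{
    size_t i = 0;
    simd_f va = simd_set1(a);
    simd_f vb = simd_set1(b);

    /* Full vectors, 4 or 8 floats at a time depending on the build. */
    for (; i + SIMD_WIDTH <= n; i += SIMD_WIDTH)
        simd_storeu(out + i, simd_add(simd_mul(simd_loadu(x + i), va), vb));

    /* Scalar tail for the leftover elements. */
    for (; i < n; ++i)
        out[i] = a * x[i] + b;
}
```

The function body never mentions a vector width, so the same source builds as SSE2 or AVX2 code depending only on the compiler flag. Functions that need integer or shuffle tricks tend to be where the extra ifdefs come in.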
mmozeiko
Mārtiņš Možeiko 1349 posts
1 project

#8082
SSE2 implementations of tan, cot, atan, atan2 10 months, 1 week ago Edited by Mārtiņš Možeiko on Aug. 16, 2016, 12:24 a.m.
Nice!
But don't do this:
Do this instead:

Randy Gaul
24 posts

#12227
SSE2 implementations of tan, cot, atan, atan2 6 days, 20 hours ago Edited by Randy Gaul on June 19, 2017, 10:27 p.m.
Hey, just wanted to praise the atan2 function in here! I'm using it for some realtime pitch shifting code. Under profiling the scalar version atan2f was a pretty big bottleneck. Used this function and it flew off the profiler. Very nice.
Here's a link to the repo making use of the atan2 function: https://github.com/RandyGaul/tinyheaders
One note: I actually had some problems with cos precision. I didn't look into it too deeply, but if we look here I had to cut the _mm_cos_ps function and instead manually use four cosf calls. This turned out to be significantly slower, but not a bottleneck by any means. I'm guessing some large float values were passed into the SSE2 cos function, and it just didn't have enough precision to handle such cases. The audio in question would pop and become completely silent (suggesting a fairly nasty precision problem).
In the end your atan2 function worked perfectly. It's unfortunate the old cos didn't quite pull through in this particular case (too bad the original author seems hard to contact).
mmozeiko
Mārtiņš Možeiko 1349 posts
1 project

#12232
SSE2 implementations of tan, cot, atan, atan2 6 days, 17 hours ago Edited by Mārtiņš Možeiko on June 20, 2017, 7:06 p.m.
Randy Gaul: Have you tried reducing the argument to the 0..2*pi range (or similar)? That should take just a few SSE instructions. That SSE implementation wanted the -8192..8192 interval for high-precision output, if I remember correctly (the numbers may be a bit different).
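A minimal sketch of that kind of range reduction, assuming plain SSE2 and the default round-to-nearest MXCSR mode. The helper name `reduce_to_pi_range` and the two-constant split of 2*pi are illustrative choices, not part of sse_mathfun.h. It folds the argument into roughly -pi..pi (the "or similar" range) by subtracting the nearest multiple of 2*pi:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: x - k * 2*pi, where k is x / (2*pi) rounded to nearest. */
static __m128 reduce_to_pi_range(__m128 x)
{
    const __m128 inv_two_pi = _mm_set1_ps(0.15915494309189535f);
    /* 2*pi split in two: a high part with few mantissa bits, so k * hi
       stays exact for moderate k, plus a small low-order correction. */
    const __m128 two_pi_hi = _mm_set1_ps(6.28125f);
    const __m128 two_pi_lo = _mm_set1_ps(0.0019353071795864769f);

    /* Round x / (2*pi) to the nearest integer; only valid while it fits in int32. */
    __m128i k  = _mm_cvtps_epi32(_mm_mul_ps(x, inv_two_pi));
    __m128  kf = _mm_cvtepi32_ps(k);

    /* Subtract k * 2*pi in two steps to limit rounding error. */
    x = _mm_sub_ps(x, _mm_mul_ps(kf, two_pi_hi));
    x = _mm_sub_ps(x, _mm_mul_ps(kf, two_pi_lo));
    return x;
}
```

Since cosine is periodic, passing the reduced value to the SSE cosine should give the same result as the original argument, as long as the input itself still carries enough precision. For very large floats the spacing between representable values becomes comparable to 2*pi, and no reduction can recover what was already lost.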
Randy Gaul
24 posts

#12239
SSE2 implementations of tan, cot, atan, atan2 6 days ago
Good idea. I'll give that a shot. Should still be faster than four scalar cos calls.

Pseudonym
Andrew Bromage 171 posts
1 project
(tbd)

#12243
SSE2 implementations of tan, cot, atan, atan2 5 days, 14 hours ago Edited by Andrew Bromage on June 21, 2017, 4:30 a.m.
Here's a useful floating point trick to know:
Hope this helps.
sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});