anael
anaël seghezzi 20 posts
1 project

#11109
fast sse convolution for convolutional neural network 1 year ago Edited by anaël seghezzi on March 1, 2017, 5:07 p.m.
Hi,
I'm trying to optimize a multilayer convolution function for a machine learning project I'm working on. The computation is very heavy, so I'm trying to reduce the cost of the convolution using SSE intrinsics (initially I wanted to limit myself to SSE2, but I guess SSE3 is reasonable). For efficiency, each kernel has dimensions 4*4*4. The obvious approach is to multiply and add 4 values at a time. It's a speedup, but I wonder if there is a more efficient approach?
Here is the code:
Any ideas?
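The listing from this post is not preserved in the thread. As a rough illustration of the "multiply and add 4 values at a time" approach it describes, here is a minimal sketch, assuming the 4*4*4 kernel and the matching input block are each stored as 64 contiguous floats. The function names are hypothetical and this is not the poster's code. It keeps a vertical accumulator and performs the horizontal sum only once, at the end.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h> /* SSE */

/* Horizontal sum of the 4 lanes of v, using only SSE shuffles. */
static float hsum_ps(__m128 v)
{
    __m128 hi    = _mm_movehl_ps(v, v);   /* (v2, v3, v2, v3) */
    __m128 sum2  = _mm_add_ps(v, hi);     /* lane0 = v0+v2, lane1 = v1+v3 */
    __m128 lane1 = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
    return _mm_cvtss_f32(_mm_add_ss(sum2, lane1));
}

/* Dot product of n floats, 4 at a time; n must be a multiple of 4. */
static float dot_sse(const float *a, const float *b, size_t n)
{
    assert(n % 4 == 0);
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);  /* unaligned loads: no alignment assumed */
        __m128 vb = _mm_loadu_ps(b + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
    }
    return hsum_ps(acc);                  /* one horizontal sum per output value */
}

/* One output value: a 4*4*4 kernel applied to a 64-float input block
   (assumed layout: both stored as 64 contiguous floats). */
static float conv_block_4x4x4(const float *block, const float *kernel)
{
    return dot_sse(block, kernel, 64);
}
```

For example, `conv_block_4x4x4(block, kernel)` with every input set to 1 and every weight set to 2 returns 128.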

mmozeiko
Mārtiņš Možeiko 1614 posts
1 project

#11110
fast sse convolution for convolutional neural network 1 year ago
If w and h are large enough, then going to the GPU might help. There are many options: OpenCL, DirectCompute, or OpenGL compute shaders.

MandleBro
Jack Mott 103 posts
1 project
Web Developer by day, game hobbyist by night. 
#11126
fast sse convolution for convolutional neural network 1 year ago
That it happens to be 4x4x4 is pretty convenient. If you want to take advantage of AVX and/or AVX512, you might look at rearranging the data so you can do 8 or 16 at a time instead of just 4. If that is possible, it might also make the 4-at-a-time code more efficient.
I'm not familiar enough with your code to know if that can be done, but I think these blog posts may be relevant or give you some ideas:
https://fgiesen.wordpress.com/2013/07/09/simdtransposes1/
https://fgiesen.wordpress.com/2013/08/29/simdtransposes2/
https://fgiesen.wordpress.com/201...imdmatrixvectormultiplication/
MandleBro
Jack Mott 103 posts
1 project
Web Developer by day, game hobbyist by night. 
#11127
fast sse convolution for convolutional neural network 1 year ago
Also, if you want a quick way to compile the same code for different SIMD instruction sets, you can use some simple macros like
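The macros from this post are not preserved in the thread. One possible shape for such a wrapper is sketched below, assuming the compiler defines `__AVX__` when AVX code generation is enabled (as GCC and Clang do with `-mavx`). All `simd_*` names and `SIMD_WIDTH` are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <immintrin.h>

/* Hypothetical wrapper names: select the widest instruction set the
   compiler was told to target, falling back to SSE. */
#if defined(__AVX__)
  #define SIMD_WIDTH 8
  typedef __m256 simd_f32;
  #define simd_loadu(p)     _mm256_loadu_ps(p)
  #define simd_storeu(p, v) _mm256_storeu_ps((p), (v))
  #define simd_add(a, b)    _mm256_add_ps((a), (b))
  #define simd_mul(a, b)    _mm256_mul_ps((a), (b))
#else
  #define SIMD_WIDTH 4
  typedef __m128 simd_f32;
  #define simd_loadu(p)     _mm_loadu_ps(p)
  #define simd_storeu(p, v) _mm_storeu_ps((p), (v))
  #define simd_add(a, b)    _mm_add_ps((a), (b))
  #define simd_mul(a, b)    _mm_mul_ps((a), (b))
#endif

/* out[i] += a[i] * b[i]; n must be a multiple of SIMD_WIDTH. */
static void mul_accumulate(float *out, const float *a, const float *b, size_t n)
{
    assert(n % SIMD_WIDTH == 0);
    for (size_t i = 0; i < n; i += SIMD_WIDTH) {
        simd_f32 prod = simd_mul(simd_loadu(a + i), simd_loadu(b + i));
        simd_storeu(out + i, simd_add(simd_loadu(out + i), prod));
    }
}
```

The loop body is written once against the wrapper names, so the same source builds as 4-wide SSE by default and as 8-wide AVX when that target is enabled.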

anael
anaël seghezzi 20 posts
1 project

#11138
fast sse convolution for convolutional neural network 1 year ago Edited by anaël seghezzi on March 3, 2017, 8:43 a.m.
Thank you.
mmozeiko > I'm bound to the CPU for now, but it's something to keep in mind if I can manage the memory load.
MandleBro > Thank you for the suggestion and the code, I'll try AVX. And assuming a fairly modern x86-64, what about cache? If 'c' is sufficiently high, should I copy the current 4x4x4 input block into a local buffer?
vict85
Vittorio Patriarca 4 posts
I'm an italian "junior" C++ software developer located in Berlin. I graduated in Math and Political Science. 
#11271
fast sse convolution for convolutional neural network 1 year ago
Are you sure that your kernels are not separable? When a kernel is separable, you can write the convolution as a composition of convolutions with kernels of lower dimension (in the best case, 1-dimensional).
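To illustrate what separability buys, here is a small scalar sketch in 2D (for brevity, not the 4*4*4 case from the thread). It assumes a K*K kernel that factors as `k[j*K + i] = col[j] * row[i]`, and all function names are hypothetical. The direct version costs K*K multiply-adds per output, while the two 1-dimensional passes cost 2*K.

```c
#include <assert.h>
#include <stddef.h>

enum { K = 4 };

/* Direct 2D "valid" correlation with a K*K kernel stored row-major in k. */
static void conv2d_direct(const float *in, int w, int h,
                          const float *k, float *out)
{
    int ow = w - K + 1, oh = h - K + 1;
    for (int y = 0; y < oh; ++y)
        for (int x = 0; x < ow; ++x) {
            float s = 0.0f;
            for (int j = 0; j < K; ++j)
                for (int i = 0; i < K; ++i)
                    s += k[j * K + i] * in[(y + j) * w + (x + i)];
            out[y * ow + x] = s;
        }
}

/* Same result when k[j*K + i] == col[j] * row[i]:
   a horizontal 1D pass into tmp (h rows of ow floats), then a vertical 1D pass. */
static void conv2d_separable(const float *in, int w, int h,
                             const float *col, const float *row,
                             float *tmp, float *out)
{
    assert(w >= K && h >= K);
    int ow = w - K + 1, oh = h - K + 1;
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < ow; ++x) {
            float s = 0.0f;
            for (int i = 0; i < K; ++i)
                s += row[i] * in[y * w + (x + i)];
            tmp[y * ow + x] = s;
        }
    for (int y = 0; y < oh; ++y)
        for (int x = 0; x < ow; ++x) {
            float s = 0.0f;
            for (int j = 0; j < K; ++j)
                s += col[j] * tmp[(y + j) * ow + x];
            out[y * ow + x] = s;
        }
}
```

This only helps when the kernel really factors this way; an arbitrary learned kernel generally does not.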

anael
anaël seghezzi 20 posts
1 project

#11272
fast sse convolution for convolutional neural network 1 year ago
vict85 > The kernels are not symmetric; they are generated randomly by a genetic algorithm or by back propagation. Basically, I don't know in advance what they will look like.
I tried AVX, but the speedup was not worth it, mainly because the final horizontal sum needs more instructions in AVX than in SSE3. The best improvement I made was aligning the memory properly so that I can use '_mm_load_ps' everywhere instead of '_mm_loadu_ps'.
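The two points in this post can be sketched as follows, under some assumptions: buffers are allocated with `_mm_malloc` at 16-byte alignment (one option among several), and the horizontal sum uses plain SSE shuffles rather than the SSE3 `_mm_hadd_ps` the poster likely used, so it builds without extra compiler flags. The helper names are hypothetical. The AVX variant, compiled only when `__AVX__` is defined, shows the extra step of folding the upper 128 bits into the lower half before the 128-bit sum can be reused.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>

/* Allocate n floats aligned to 16 bytes; release with _mm_free. */
static float *alloc_floats_aligned(size_t n)
{
    return (float *)_mm_malloc(n * sizeof(float), 16);
}

/* 128-bit horizontal sum with SSE shuffles (swapped in for _mm_hadd_ps). */
static float hsum128(__m128 v)
{
    __m128 hi    = _mm_movehl_ps(v, v);   /* (v2, v3, v2, v3) */
    __m128 sum2  = _mm_add_ps(v, hi);     /* lane0 = v0+v2, lane1 = v1+v3 */
    __m128 lane1 = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
    return _mm_cvtss_f32(_mm_add_ss(sum2, lane1));
}

#if defined(__AVX__)
/* 256-bit horizontal sum: one extract and one add more than the SSE version. */
static float hsum256(__m256 v)
{
    __m128 lo = _mm256_castps256_ps128(v);
    __m128 hi = _mm256_extractf128_ps(v, 1);
    return hsum128(_mm_add_ps(lo, hi));
}
#endif

/* Dot product using aligned loads; a and b must be 16-byte aligned
   and n must be a multiple of 4. */
static float dot_aligned(const float *a, const float *b, size_t n)
{
    assert((uintptr_t)a % 16 == 0 && (uintptr_t)b % 16 == 0);
    assert(n % 4 == 0);
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_load_ps(a + i), _mm_load_ps(b + i)));
    return hsum128(acc);
}
```

Note that `_mm_load_ps` faults on a misaligned address, which is why the sketch asserts alignment before the loop.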