anaël seghezzi
23 posts / 1 project
fast sse convolution for convolutional neural network
Hi,

I'm trying to optimize a multi-layer convolution function for a machine learning project I'm working on.
The computation is very heavy, so I'm trying to reduce the cost of the convolution using SSE intrinsics (initially I wanted to limit myself to SSE2, but I guess SSE3 is reasonable).

For efficiency, each kernel has dimensions 4*4*4. The obvious approach is to multiply and add 4 values at a time.
It's a speedup, but I wonder if there is a more efficient approach?

Here is the code:
Any ideas?

// 4*4*4*c kernel convolution of a w*h*4 input layer
// into a (w-3)*(h-3)*c output layer
void convolve4x4xC(float *out, float *in, float *kernel, float *bias, int w, int h, int c)
{
   int x, y, i;
   int w3 = w - 3;
   int h3 = h - 3;
   int w4 = w * 4;

   for (y = 0; y < h3; y++) {

      float *iny = in + (y * w4);

      for (x = 0; x < w3; x++) {

         for (i = 0; i < c; i++) {

            __m128 out4;
            float *inp, *kp;
            int u, v;

            inp = iny;
            kp = kernel + (i * 64);

            // bias in the low lane only: the final horizontal sum adds
            // all four lanes, so a broadcast would count the bias 4 times
            out4 = _mm_set_ss(bias[i]);

            // convolution
            for (v = 0; v < 4; v++) {
            
               for (u = 0; u < 16; u+=4) {
                  __m128 p4 = _mm_loadu_ps(inp + u);
                  __m128 k4 = _mm_loadu_ps(kp);
                  p4 = _mm_mul_ps(p4, k4);
                  out4 = _mm_add_ps(out4, p4);
                  kp+=4;
               }

               inp+=w4;
            }

            // horizontal sum of the four lanes (SSE3)
            out4 = _mm_hadd_ps(out4, out4);
            out4 = _mm_hadd_ps(out4, out4);
            _mm_store_ss(out + i, out4);
         }

         out+=c;
         iny+=4;
      }
   }
}
Mārtiņš Možeiko
2559 posts / 2 projects
If w and h are large enough, then going GPU might help. OpenCL, DirectCompute or OpenGL ComputeShader, many options.
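For illustration, a minimal OpenCL C sketch of that direction might look like the following (hypothetical, not from this thread): one work-item per (x, y, output channel), keeping the same 4*4*4 kernel layout as the CPU code, enqueued with a global size of (w-3, h-3, c).

// hypothetical OpenCL C sketch: one work-item per output value
// ('kern' instead of 'kernel', which is a reserved word in OpenCL C)
__kernel void convolve4x4xC(__global float *out, __global const float *in,
                            __global const float *kern,
                            __global const float *bias, int w, int c)
{
   int x = get_global_id(0);
   int y = get_global_id(1);
   int i = get_global_id(2);
   int w4 = w * 4;

   float sum = bias[i];
   for (int v = 0; v < 4; v++)
      for (int u = 0; u < 16; u++)
         sum += in[(y + v) * w4 + x * 4 + u] * kern[i * 64 + v * 16 + u];

   out[(y * (w - 3) + x) * c + i] = sum;
}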
Jack Mott
110 posts
Web Developer by day, game hobbyist by night.
That it happens to be 4x4x4 is pretty convenient. If you wanted to take advantage of AVX and/or AVX-512, you might look at rearranging the data so you can do 8 or 16 at a time instead of just 4. If you can, this might also make the 4-at-a-time code more efficient (see the sketch after the links below).

I'm not familiar enough with your code to know if that can be done, but I think these blog posts may be relevant/have some ideas:

https://fgiesen.wordpress.com/2013/07/09/simd-transposes-1/

https://fgiesen.wordpress.com/2013/08/29/simd-transposes-2/

https://fgiesen.wordpress.com/201...imd-matrix-vector-multiplication/
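
As a rough sketch of the 8-at-a-time idea (a hypothetical helper, assuming the 16-float kernel rows and data layout from the original code, with unaligned loads for simplicity):

#include <immintrin.h>

// hypothetical AVX sketch: each 4-pixel kernel row is 16 floats,
// so two 256-bit loads cover it; note the wider horizontal sum at the end
static inline float dot4x4x4_avx(const float *inp, const float *kp, int w4)
{
   __m256 acc = _mm256_setzero_ps();
   int u, v;
   for (v = 0; v < 4; v++) {
      for (u = 0; u < 16; u += 8) {
         __m256 p8 = _mm256_loadu_ps(inp + u);
         __m256 k8 = _mm256_loadu_ps(kp);
         acc = _mm256_add_ps(acc, _mm256_mul_ps(p8, k8));
         kp += 8;
      }
      inp += w4;
   }
   // horizontal sum of all 8 lanes
   __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc),
                         _mm256_extractf128_ps(acc, 1));
   s = _mm_hadd_ps(s, s);
   s = _mm_hadd_ps(s, s);
   return _mm_cvtss_f32(s);
}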
Jack Mott
110 posts
Web Developer by day, game hobbyist by night.
Also, if you want a quick way to compile code for different SIMD instruction sets, you can use some simple macros like:

#ifndef AVX2

typedef __m128 SIMD;   //for floats
typedef __m128i SIMDi; //for integers

//process 4 at a time
#define VECTOR_SIZE 4
#define MEMORY_ALIGNMENT 16


#define Store(x,y) _mm_store_ps(x,y)
#define Load(x) _mm_load_ps(x)
//etc

#endif
#ifdef AVX2


typedef __m256 SIMD;  //for floats
typedef __m256i SIMDi; //for integers

//process 8 at a time
#define VECTOR_SIZE 8
#define MEMORY_ALIGNMENT 32

//intrinsic functions
#define Store(x,y) _mm256_store_ps(x,y)
#define Load(x) _mm256_load_ps(x)
//etc

#endif
anaël seghezzi
23 posts / 1 project
Thank you.
mmozeiko > I'm bound to the CPU for now, but I'll keep it in mind if I can manage the memory load.
MandleBro > Thank you for the suggestion and the code; I'll try AVX.

And assuming a fairly modern x86-64, what about the cache?
If 'c' is sufficiently high, should I copy the current 4x4x4 input block into a local buffer?
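
For concreteness, something like this inside the x loop of the code above, gathering the block once and reusing it for all c kernels ('block' being a hypothetical scratch array; memcpy from <string.h>):

// sketch: copy the current 4x4x4 input block once per (x, y),
// so all c kernels read from one contiguous, cache-hot buffer
float block[64];
for (int v = 0; v < 4; v++)
   memcpy(block + v * 16, iny + v * w4, 16 * sizeof(float));
// the i loop then reads 'block + v * 16 + u' instead of 'inp + u'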
Vittorio Patriarca
4 posts
I'm an Italian "junior" C++ software developer located in Berlin. I graduated in Math and Political Science.
Are you sure that your kernels are not separable? When the kernel is separable, you can write the convolution as a composition of convolutions with kernels of lower dimension (in the best case 1-dimensional).
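
To illustrate (a single-channel sketch with a hypothetical function, not the code above): a separable 4x4 kernel K[v][u] = ky[v] * kx[u] can be applied as two 1D passes, turning 16 multiply-adds per output into 4 + 4:

// sketch: separable 4x4 convolution as two 1D passes over one channel;
// 'tmp' is a (w-3)*h scratch buffer
void convolve_separable4(float *out, float *tmp, const float *in,
                         const float *kx, const float *ky, int w, int h)
{
   int x, y, u, v;
   int w3 = w - 3;

   // horizontal pass: rows convolved with kx
   for (y = 0; y < h; y++)
      for (x = 0; x < w3; x++) {
         float s = 0.0f;
         for (u = 0; u < 4; u++)
            s += in[y * w + x + u] * kx[u];
         tmp[y * w3 + x] = s;
      }

   // vertical pass: columns convolved with ky
   for (y = 0; y < h - 3; y++)
      for (x = 0; x < w3; x++) {
         float s = 0.0f;
         for (v = 0; v < 4; v++)
            s += tmp[(y + v) * w3 + x] * ky[v];
         out[y * w3 + x] = s;
      }
}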
anaël seghezzi
23 posts / 1 project
vict85 > The kernels are not symmetric; they are generated randomly by a genetic algorithm or by backpropagation, so I don't know in advance what they will look like.

I tried AVX, but the speedup was not worth it, mainly because of the final horizontal sum, which needs more instructions in AVX than in SSE3.

The best improvement was aligning the memory properly so I can use '_mm_load_ps' everywhere instead of '_mm_loadu_ps'.
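
For reference, one way to guarantee that alignment (a sketch; any 16-byte-aligned allocator works):

#include <xmmintrin.h>

// sketch: 16-byte-aligned buffer so _mm_load_ps is safe
float *buf = (float *)_mm_malloc((size_t)w * h * 4 * sizeof(float), 16);
/* ... fill and convolve ... */
_mm_free(buf);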