fast sse convolution for convolutional neural network

Hi,

I'm trying to optimize a multi-layer convolution function for some machine learning project I'm working on.
The computation is very heavy so I'm trying to reduce the cost of convolution using sse intrinsics (initially I wanted to limit myself to sse2, but I guess sse3 is reasonable).

For efficiency each kernel is of 4*4*4 dimension. The obvious approach is to multiply and add 4 values at a time.
It's a speedup, but I wonder if there is a more efficient approach ?

Here is the code :
Any ideas ?

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
// 4*4*4*c kernel convolution of a w*h*4 input layer
// into a (w-3)*(h-3)*c output layer
void convolve4x4xC(float *out, float *in, float *kernel, float *bias, int w, int h, int c)
{
   int x, y, i;
   int w3 = w - 3;
   int h3 = h - 3;
   int w4 = w * 4;

   for (y = 0; y < h3; y++) {

      float *iny = in + (y * w4);

      for (x = 0; x < w3; x++) {

         for (i = 0; i < c; i++) {

            __m128 out4;
            float *inp, *kp;
            int u, v;

            inp = iny;
            kp = kernel + (i * 64);

            // bias
            out4 = _mm_set1_ps(bias[i]);

            // convolution
            for (v = 0; v < 4; v++) {
            
               for (u = 0; u < 16; u+=4) {
                  __m128 p4 = _mm_loadu_ps(inp + u);
                  __m128 k4 = _mm_loadu_ps(kp);
                  p4 = _mm_mul_ps(p4, k4);
                  out4 = _mm_add_ps(out4, p4);
                  kp+=4;
               }

               inp+=w4;
            }

            // add the remaining 4 values (sse3)
            out4 = _mm_hadd_ps(out4, out4);
            out4 = _mm_hadd_ps(out4, out4);
            _mm_store_ss(out + i, out4);
         }

         out+=c;
         iny+=4;
      }
   }
}

Edited by anaël seghezzi on
If w and h is large enough, then going GPU might help. OpenCL, DirectCompute or OpenGL ComputeShader, many options.
That is happens to be 4x4x4 is pretty convenient. If you wanted to take advantage of AVX and/or AVX512 you might look at rearranging the data so you can do 8 or 16 at a time instead of just 4, if you can, this might also make the 4 at a time code more efficient.

I'm not familiar enough with your code to know if that can be done, but I think these blog posts may be relevant/have some ideas:

https://fgiesen.wordpress.com/2013/07/09/simd-transposes-1/

https://fgiesen.wordpress.com/2013/08/29/simd-transposes-2/

https://fgiesen.wordpress.com/201...imd-matrix-vector-multiplication/
also if you want a quick way to be able to compile code for different SIMD instruction sets, you can use some simple macros like

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
#ifndef AVX2

typedef __m128 SIMD;   //for floats
typedef __m128i SIMDi; //for integers

//process 4 at a time
#define VECTOR_SIZE 4
#define MEMORY_ALIGNMENT 16


#define Store(x,y) _mm_store_ps(x,y)
#define Load(x) _mm_load_ps(x)
//etc

#endif
#ifdef AVX2


typedef __m256 SIMD;  //for floats
typedef __m256i SIMDi; //for integers

//process 8 at a time
#define VECTOR_SIZE 8
#define MEMORY_ALIGNMENT 32

//intrinsic functions
#define Store(x,y) _mm256_store_ps(x,y)
#define Load(x) _mm256_load_ps(x)
Thank you.
mmozeiko > I'm bound to CPU for now, but to keep in mind if I can manage the memory load.
MandleBro > Thank you for the suggestion and the code, I'll try AVX.

And assuming a fairly modern x86-64, what about cache ?
If 'c' is sufficiently high, should I copy the current 4x4x4 input block in a local buffer ?

Edited by anaël seghezzi on
Are you sure that your kernels are not separable? When the kernel is separable, you can write the convolution as the composition of convolutions with kernel of lower dimension (in the best cases 1-dimensional).
vict85 > The kernels are not symmetric, they are generated randomly by a genetic algorithm or by back propagation. Mainly I don't know what they will look like.

I tried AVX, but the speedup was not worth it, mainly because of the last horizontal sum which needs more instructions in AVX than SSE3.

The best I added was to align the memory properly to use '_mm_load_ps' everywhere instead of '_mm_loadu_ps'.