handmade.network » Forums » fast sse convolution for convolutional neural network
anael
anaël seghezzi
20 posts
1 project
#11109 fast sse convolution for convolutional neural network
8 months, 4 weeks ago Edited by anaël seghezzi on March 1, 2017, 5:07 p.m.

Hi,

I'm trying to optimize a multi-layer convolution function for a machine learning project I'm working on.
The computation is very heavy, so I'm trying to reduce the cost of the convolution using SSE intrinsics (initially I wanted to limit myself to SSE2, but I guess SSE3 is reasonable).

For efficiency, each kernel has dimensions 4*4*4. The obvious approach is to multiply and add 4 values at a time.
It's a speedup, but I wonder if there is a more efficient approach?

Here is the code. Any ideas?

#include <pmmintrin.h> // SSE3, for _mm_hadd_ps (also pulls in SSE/SSE2)

// 4*4*4*c kernel convolution of a w*h*4 input layer
// into a (w-3)*(h-3)*c output layer
void convolve4x4xC(float *out, float *in, float *kernel, float *bias, int w, int h, int c)
{
   int x, y, i;
   int w3 = w - 3;
   int h3 = h - 3;
   int w4 = w * 4;

   for (y = 0; y < h3; y++) {

      float *iny = in + (y * w4);

      for (x = 0; x < w3; x++) {

         for (i = 0; i < c; i++) {

            __m128 out4;
            float *inp, *kp;
            int u, v;

            inp = iny;
            kp = kernel + (i * 64);

            // bias
            out4 = _mm_set1_ps(bias[i]);

            // convolution
            for (v = 0; v < 4; v++) {
            
               for (u = 0; u < 16; u+=4) {
                  __m128 p4 = _mm_loadu_ps(inp + u);
                  __m128 k4 = _mm_loadu_ps(kp);
                  p4 = _mm_mul_ps(p4, k4);
                  out4 = _mm_add_ps(out4, p4);
                  kp+=4;
               }

               inp+=w4;
            }

            // add the remaining 4 values (sse3)
            out4 = _mm_hadd_ps(out4, out4);
            out4 = _mm_hadd_ps(out4, out4);
            _mm_store_ss(out + i, out4);
         }

         out+=c;
         iny+=4;
      }
   }
}
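(For reference, since I initially wanted to stay on SSE2: the final horizontal sum can also be done without SSE3's _mm_hadd_ps, using a standard SSE1/SSE2 shuffle idiom — this is a well-known pattern, not code from my function above:)

```c
#include <xmmintrin.h> // SSE

// SSE1/SSE2-only horizontal sum of the four floats in v,
// avoiding the SSE3 _mm_hadd_ps instruction.
static float hsum_ps_sse2(__m128 v)
{
   __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)); // [b, a, d, c]
   __m128 sums = _mm_add_ps(v, shuf);                           // [a+b, a+b, c+d, c+d]
   shuf = _mm_movehl_ps(shuf, sums);                            // [c+d, c+d, ...]
   sums = _mm_add_ss(sums, shuf);                               // a+b+c+d in lane 0
   return _mm_cvtss_f32(sums);
}
```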
mmozeiko
Mārtiņš Možeiko
1506 posts
1 project
#11110 fast sse convolution for convolutional neural network
8 months, 4 weeks ago

If w and h are large enough, then going GPU might help. OpenCL, DirectCompute or OpenGL compute shaders; many options.
MandleBro
Jack Mott
98 posts
1 project

Web Developer by day, game hobbyist by night. Fond of C and F#

#11126 fast sse convolution for convolutional neural network
8 months, 3 weeks ago

That it happens to be 4x4x4 is pretty convenient. If you wanted to take advantage of AVX and/or AVX-512, you might look at rearranging the data so you can do 8 or 16 at a time instead of just 4; if you can, this might also make the 4-at-a-time code more efficient.

I'm not familiar enough with your code to know if that can be done, but I think these blog posts may be relevant/have some ideas:

https://fgiesen.wordpress.com/2013/07/09/simd-transposes-1/

https://fgiesen.wordpress.com/2013/08/29/simd-transposes-2/

https://fgiesen.wordpress.com/201...imd-matrix-vector-multiplication/
MandleBro
Jack Mott
98 posts
1 project

Web Developer by day, game hobbyist by night. Fond of C and F#

#11127 fast sse convolution for convolutional neural network
8 months, 3 weeks ago

Also, if you want a quick way to compile the same code for different SIMD instruction sets, you can use some simple macros like:

#ifndef AVX2

typedef __m128 SIMD;   //for floats
typedef __m128i SIMDi; //for integers

//process 4 at a time
#define VECTOR_SIZE 4
#define MEMORY_ALIGNMENT 16


#define Store(x,y) _mm_store_ps(x,y)
#define Load(x) _mm_load_ps(x)
//etc

#endif
#ifdef AVX2


typedef __m256 SIMD;  //for floats
typedef __m256i SIMDi; //for integers

//process 8 at a time
#define VECTOR_SIZE 8
#define MEMORY_ALIGNMENT 32

//intrinsic functions
#define Store(x,y) _mm256_store_ps(x,y)
#define Load(x) _mm256_load_ps(x)
//etc

#endif
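A minimal sketch of how you might use such macros — I've repeated the SSE definitions here so it stands alone, and the Add macro is my own addition, not part of the set above:

```c
#include <xmmintrin.h> // SSE

typedef __m128 SIMD;
#define VECTOR_SIZE 4
#define Load(x)    _mm_load_ps(x)
#define Store(x,y) _mm_store_ps(x,y)
#define Add(x,y)   _mm_add_ps(x,y) /* hypothetical extra macro */

// Sums two float arrays element-wise. n must be a multiple of
// VECTOR_SIZE and the pointers must be 16-byte aligned. Swapping in
// the AVX2 macro set would process 8 floats per iteration with no
// changes to this source.
static void add_arrays(float *out, const float *a, const float *b, int n)
{
   int i;
   for (i = 0; i < n; i += VECTOR_SIZE)
      Store(out + i, Add(Load(a + i), Load(b + i)));
}
```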
anael
anaël seghezzi
20 posts
1 project
#11138 fast sse convolution for convolutional neural network
8 months, 3 weeks ago Edited by anaël seghezzi on March 3, 2017, 8:43 a.m.

Thank you.
mmozeiko > I'm bound to the CPU for now, but I'll keep it in mind if I can manage the memory transfers.
MandleBro > Thank you for the suggestion and the code, I'll try AVX.

And assuming a fairly modern x86-64, what about cache?
If 'c' is sufficiently high, should I copy the current 4x4x4 input block into a local buffer?
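(For concreteness, a sketch of what I mean — hypothetical names, not tested:)

```c
#include <string.h>

// Sketch: copy the current 4x4x4 input window (4 rows of 16 floats)
// into one contiguous 64-float buffer once per (x, y), so all c
// kernels then read the same small, cache-resident block.
// w4 is the row stride in floats (w * 4).
static void gather_block4x4x4(float *block, const float *inp, int w4)
{
   int v;
   for (v = 0; v < 4; v++)
      memcpy(block + v * 16, inp + v * w4, 16 * sizeof(float));
}
```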
vict85
Vittorio Patriarca
4 posts

I'm an italian "junior" C++ software developer located in Berlin. I graduated in Math and Political Science.

#11271 fast sse convolution for convolutional neural network
8 months, 2 weeks ago

Are you sure that your kernels are not separable? When a kernel is separable, you can write the convolution as a composition of convolutions with kernels of lower dimension (in the best case, 1-dimensional).
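As a toy illustration (hypothetical code, not from your project): a separable n*n kernel is the outer product of two 1-D kernels, k[v][u] = kc[v]*kr[u], so one 2-D pass can be replaced by a horizontal 1-D pass followed by a vertical one, costing O(2n) instead of O(n*n) multiplies per output:

```c
// Direct 2-D convolution at one output position (x, y).
static float conv2d_at(const float *img, int w, int x, int y,
                       const float *k, int n)
{
   float s = 0.0f;
   int u, v;
   for (v = 0; v < n; v++)
      for (u = 0; u < n; u++)
         s += img[(y + v) * w + (x + u)] * k[v * n + u];
   return s;
}

// Same result via two 1-D passes when k is the outer product kc*kr.
static float conv_sep_at(const float *img, int w, int x, int y,
                         const float *kc, const float *kr, int n)
{
   float s = 0.0f;
   int u, v;
   for (v = 0; v < n; v++) {
      float row = 0.0f;
      for (u = 0; u < n; u++)
         row += img[(y + v) * w + (x + u)] * kr[u];
      s += kc[v] * row;
   }
   return s;
}
```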
anael
anaël seghezzi
20 posts
1 project
#11272 fast sse convolution for convolutional neural network
8 months, 2 weeks ago

vict85 > The kernels are not symmetric; they are generated randomly by a genetic algorithm or by backpropagation. Mainly, I don't know in advance what they will look like.

I tried AVX, but the speedup was not worth it, mainly because of the final horizontal sum, which needs more instructions in AVX than in SSE3.

The best improvement was aligning the memory properly so I can use '_mm_load_ps' everywhere instead of '_mm_loadu_ps'.
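(For reference, the standard way to get aligned buffers with intrinsics — this is the stock _mm_malloc API, not my actual allocation code:)

```c
#include <xmmintrin.h>

// Allocate a 16-byte-aligned float buffer so the aligned
// _mm_load_ps/_mm_store_ps can be used instead of _mm_loadu_ps.
// Must be released with _mm_free, not free.
static float *alloc_aligned_floats(int n)
{
   return (float *)_mm_malloc(n * sizeof(float), 16);
}
```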