MandleBro Jack Mott 112 posts / 1 project Web Developer by day, game hobbyist by night. Making the obvious code fast 2 years, 9 months ago This blog post summarizes some ideas I've been kicking around in my head, partly inspired by Jonathan Blow, who often talks about just typing the obvious code, and some performance work I've been doing in higher level languages. https://jackmott.github.io/progra...16/07/22/making-obvious-fast.html What has your experience been with the languages you work with, where the idiomatic, or obvious approach has led you to performance pitfalls? What languages are good about guiding you down the performant path?
Making the obvious code fast
2 years, 9 months ago Edited by Mārtiņš Možeiko on Aug. 5, 2016, 1:24 a.m. Reason: Ooops, typo. Thx ratchetfreak

You should try summing in multiple counters in parallel:
```
double s1, s2, s3, s4;
s1 = s2 = s3 = s4 = 0.0;
for (int i = 0; i < COUNT; i += 4)
{
    s1 += values[i+0] * values[i+0];
    s2 += values[i+1] * values[i+1];
    s3 += values[i+2] * values[i+2];
    s4 += values[i+3] * values[i+3];
}
double sum = s1 + s2 + s3 + s4;
```

This will give slightly different results due to floating-point rounding, but it will probably be faster. The same trick applies to the SIMD version - you should utilize more SIMD registers and do the summing in parallel.
Making the obvious code fast
2 years, 9 months ago

mmozeiko
You should try summing in multiple counters in parallel:
```
double s1, s2, s3, s4;
s1 = s2 = s3 = s4 = 0.0;
for (int i = 0; i < COUNT/4; i+=4)
```

Don't divide the count by 4 if you are already adding 4 to i per iteration.

Also, Visual C++ already does 2 SIMD adds in parallel (using ymm4 and ymm5 as the sets of accumulators).
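For reference, here is a minimal sketch of the multi-accumulator idea applied to SIMD, using SSE2 intrinsics with two accumulator registers (assuming x86-64 and a count divisible by 4; the function name `sum_squares` is made up for this sketch):

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Sum of squares with two independent SIMD accumulators, so the adds
// can overlap in the pipeline instead of forming one dependency chain.
double sum_squares(const double* values, int count)
{
    __m128d acc0 = _mm_setzero_pd();
    __m128d acc1 = _mm_setzero_pd();
    for (int i = 0; i < count; i += 4)
    {
        __m128d a = _mm_loadu_pd(values + i);      // values[i],   values[i+1]
        __m128d b = _mm_loadu_pd(values + i + 2);  // values[i+2], values[i+3]
        acc0 = _mm_add_pd(acc0, _mm_mul_pd(a, a));
        acc1 = _mm_add_pd(acc1, _mm_mul_pd(b, b));
    }
    // Combine the accumulators and reduce the two lanes.
    __m128d acc = _mm_add_pd(acc0, acc1);
    double lanes[2];
    _mm_storeu_pd(lanes, acc);
    return lanes[0] + lanes[1];
}
```

The AVX version is the same shape with `__m256d`/`_mm256_` and a stride of 8.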
Making the obvious code fast
2 years, 9 months ago Edited by Laurie on Aug. 4, 2016, 12:29 p.m.

Really interesting blog post. I definitely find that the "most obvious thing" approach starts to quickly break down in higher level languages. You do a really good job in the article of highlighting how languages can conceal both the algorithmic complexity (particularly bad in functional languages as shown in your F# examples) and implementation level efficiency (as demonstrated with your SIMD examples).

At the risk of being pedantic, I think it's worth considering the difference between "obvious" and "idiomatic" code. For example, in C++, the idiomatic way of passing an object into a function (by const ref) is completely reasonable from an efficiency standpoint.
```
void my_function(std::string const &s)
{
    // ...
}
```

However this could not be considered to be the most obvious code, which would probably be something like this instead.
```
void my_function(std::string s)
{
    // ...
}
```

This of course introduces an unnecessary deep copy. In this sense, the idiomatic C++ code has become idiomatic as a way of counteracting the inherent inefficiency of the obvious code.
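The hidden cost is easy to make visible with a small copy-counting probe type (a sketch; the `CopyProbe` name and helper functions are invented for illustration):

```cpp
// A tiny probe type that counts copy constructions, making the hidden
// cost of pass-by-value visible.
struct CopyProbe {
    static int copies;
    CopyProbe() {}
    CopyProbe(const CopyProbe&) { ++copies; }
};
int CopyProbe::copies = 0;

void by_value(CopyProbe p) { (void)p; }             // obvious: copies the argument
void by_const_ref(const CopyProbe& p) { (void)p; }  // idiomatic: no copy
```

Calling `by_value` bumps the counter once per call; `by_const_ref` never does. With `std::string` that copy is a heap allocation plus a memcpy.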

A similar problem manifests in Rust. As an example, imagine we've implemented a Mat4x4 struct. To multiply two Mat4x4s (m1 and m2), the most obvious code to write would probably be something like this.
```
let m: Mat4x4 = m1 * m2;
```

This looks like it passes m1 and m2 by reference to the multiplication function, but it actually moves them, transferring ownership, meaning that m1 and m2 can no longer be used. If this isn't the desired behaviour, we could make our Mat4x4 copyable, which would solve that problem but would make the above code copy 32 floats into our multiplication function, which is almost certainly less efficient than passing by reference. To solve this, we could let the multiplication function borrow m1 and m2, but that would require code like this.
```
let m: Mat4x4 = &m1 * &m2;
```

To me, this code is less obvious than the previous version, but of course is more efficient.

To return to C/C++, the same problem of excess copies is manifest in the obvious implementation of various mathematical operators. For example, consider the typical implementation of operator+(), which would return a new instance of an object (idiomatic C++ is to "do as the ints do", and operator+ on an int returns a new instance). This means that in long and complex equations, lots of temporary objects are being created, copied, and destroyed when the memory could be reused. For example this code
```
std::string s = s1 + s2 + s3;
```

is the equivalent of
```
std::string temp1 = s1 + s2;
std::string s = temp1 + s3;
```
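For strings specifically, one workaround is to reserve once and append in place, which avoids both the intermediate temporary and its extra allocation (a sketch; `concat3` is a made-up helper name):

```cpp
#include <string>

// Concatenate three strings with a single allocation and no temporaries.
std::string concat3(const std::string& s1,
                    const std::string& s2,
                    const std::string& s3)
{
    std::string s;
    s.reserve(s1.size() + s2.size() + s3.size());  // one allocation up front
    s += s1;  // operator+= appends in place
    s += s2;
    s += s3;
    return s;
}
```

Of course, this trades the obvious one-liner for four lines of bookkeeping, which is exactly the trade-off under discussion.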

This problem has led to the unspeakable horror that is C++ expression templates. Whilst these do indeed "solve" this efficiency problem, they are about as far as it's possible to be from obvious, and are so complex that they are not even considered idiomatic within the C++ community, which has become fairly tolerant of high-complexity solutions.
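For the curious, here is a deliberately tiny sketch of the expression-template idea, supporting only addition (all names here are invented; real libraries like Eigen are vastly more involved). `operator+` returns a lazy node instead of a materialized result, and the whole expression tree is evaluated element-by-element in a single loop on assignment, with no temporaries:

```cpp
#include <cstddef>
#include <vector>

struct Vec;

// Lazy node representing lhs + rhs; holds references, computes on demand.
template <typename L, typename R>
struct AddExpr {
    const L& lhs;
    const R& rhs;
    double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
    std::size_t size() const { return lhs.size(); }
};

struct Vec {
    std::vector<double> data;
    explicit Vec(std::vector<double> d) : data(std::move(d)) {}

    // Construction from any expression evaluates it in one pass.
    template <typename E>
    Vec(const E& expr) : data(expr.size()) {
        for (std::size_t i = 0; i < expr.size(); ++i)
            data[i] = expr[i];  // no intermediate Vec is ever built
    }

    double operator[](std::size_t i) const { return data[i]; }
    std::size_t size() const { return data.size(); }
};

AddExpr<Vec, Vec> operator+(const Vec& a, const Vec& b) { return {a, b}; }

template <typename L, typename R>
AddExpr<AddExpr<L, R>, Vec> operator+(const AddExpr<L, R>& a, const Vec& b) {
    return {a, b};
}
```

With this, `Vec s = a + b + c;` builds only two stack-allocated expression nodes and runs a single loop, whereas the naive `operator+` would allocate and copy a temporary Vec for `a + b`. The cost is exactly the complexity on display above.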

Of course, an added complexity in all of this is the fact that when we're discussing higher-level languages, it's very hard to separate issues of inherent inefficiency from specific implementations. As you highlighted in your section on SIMD, you're rather at the mercy of your (JIT) compiler. As you made clear, there is no reason C# couldn't apply automatic vectorisation, and there is nothing requiring C++ to do so. This leads to the interesting situation where instead of application programmers being able to find the most efficient solution, you have implementers examining idiomatic code and ensuring their implementations optimize for that usage. A classic example of this would be tail call optimizations in functional languages.

As for examples of where languages can lure you down the wrong path, I had a fun experience with Eiffel once. In Eiffel, the preferred way of managing concurrency is through a methodology/technology called SCOOP. Without getting into too much detail, the basic idea is that each object explicitly marked as "separate" runs in its own virtual thread and communicates via messages, much like processes in Erlang. However unlike in Erlang, it turns out that the EiffelStudio implementation spins up an OS level thread for every object you mark as separate. You can imagine what happened when I instantiated a few hundred of these things without realising... I think that's an example of somewhere where the idiomatic way of programming was, if not inherently inefficient, then inefficient for a large proportion of cases, and in fact required a good understanding of the implementation to use. Leaky abstractions strike again!
 MandleBro Jack Mott 112 posts / 1 project Web Developer by day, game hobbyist by night. Making the obvious code fast 2 years, 9 months ago I think it actually gets memory throughput bound with the 'obvious' SIMD solution, so we may need to go multithreaded, so we get more L1 caches!
MandleBro Jack Mott 112 posts / 1 project Web Developer by day, game hobbyist by night. Making the obvious code fast 2 years, 9 months ago The string example is a good one. In most managed languages, adding strings with the + operator is obvious, but expensive. I wonder if C# couldn't automatically identify cases where StringBuilder could be swapped in for a bunch of string concatenations.
ratchetfreak 446 posts Making the obvious code fast 2 years, 9 months ago Edited by ratchetfreak on Aug. 4, 2016, 1:13 p.m.

MandleBro
I think it actually gets memory throughput bound with the 'obvious' SIMD solution, so we may need to go multithreaded, so we get more L1 caches!

That's not going to work if the actual memory bus is saturated.

MandleBro
The string example is a good one. Most managed languages, adding strings with the + operator is obvious, but expensive. I wonder if C# couldn't automatically identify cases where StringBuilder could be swapped in for a bunch of string concatenations.

Here's somewhere where I would say to make the slow code non-obvious. By which I mean: get rid of the + operator overload for concatenation altogether and make the use of a string builder class idiomatic.
 MandleBro Jack Mott 112 posts / 1 project Web Developer by day, game hobbyist by night. Making the obvious code fast 2 years, 9 months ago ConcatAndCrushTheGC("First Half","Second Half"); That should scare people off
 jon Jonathan Blow 20 posts Making the obvious code fast 2 years, 9 months ago I would like to clarify that when I say "type the obvious code" I am not talking about things on the level of const & or not. Getting that stuff right is part of what I consider "obvious" for the purposes of those discussions. This is not to invalidate anything about the point being made, which is that it's better if the shorter/clearer thing is also the best thing. I agree with that too, it's just a completely different idea than what I was talking about as referenced in the OP. (And it's almost never true in C++).