Making the obvious code fast

This blog post summarizes some ideas I've been kicking around in my head, partly inspired by Jonathan Blow, who often talks about just typing the obvious code, and partly by some performance work I've been doing in higher-level languages.

https://jackmott.github.io/progra...16/07/22/making-obvious-fast.html


What has your experience been with the languages you work with, where the idiomatic or obvious approach has led you to performance pitfalls? What languages are good about guiding you down the performant path?
You should try summing into multiple counters in parallel:
double s1, s2, s3, s4;
s1 = s2 = s3 = s4 = 0.0;
for (int i = 0; i < COUNT; i+=4) 
{
    s1 += values[i+0] * values[i+0];
    s2 += values[i+1] * values[i+1];
    s3 += values[i+2] * values[i+2];
    s4 += values[i+3] * values[i+3];
}
double sum = s1 + s2 + s3 + s4;

This will give slightly different results due to floating-point rounding, but it will probably be faster. The same trick applies to the SIMD version - you should utilize more SIMD registers and do the summing in parallel.
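
For example, here's a rough sketch of the same four-accumulator idea with AVX intrinsics (my own illustration, assuming AVX is available and COUNT is a multiple of 16):

#include <immintrin.h>

// Four independent vector accumulators, so the adds in successive
// iterations don't serialize on a single register.
__m256d s0 = _mm256_setzero_pd();
__m256d s1 = _mm256_setzero_pd();
__m256d s2 = _mm256_setzero_pd();
__m256d s3 = _mm256_setzero_pd();
for (int i = 0; i < COUNT; i += 16)
{
    __m256d v0 = _mm256_loadu_pd(values + i + 0);
    __m256d v1 = _mm256_loadu_pd(values + i + 4);
    __m256d v2 = _mm256_loadu_pd(values + i + 8);
    __m256d v3 = _mm256_loadu_pd(values + i + 12);
    s0 = _mm256_add_pd(s0, _mm256_mul_pd(v0, v0));
    s1 = _mm256_add_pd(s1, _mm256_mul_pd(v1, v1));
    s2 = _mm256_add_pd(s2, _mm256_mul_pd(v2, v2));
    s3 = _mm256_add_pd(s3, _mm256_mul_pd(v3, v3));
}
// Combine the four vector accumulators, then the four lanes.
__m256d t = _mm256_add_pd(_mm256_add_pd(s0, s1), _mm256_add_pd(s2, s3));
double lanes[4];
_mm256_storeu_pd(lanes, t);
double sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];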

Edited by Mārtiņš Možeiko on Reason: Ooops, typo. Thx ratchetfreak
mmozeiko
You should try summing into multiple counters in parallel:
double s1, s2, s3, s4;
s1 = s2 = s3 = s4 = 0.0;
for (int i = 0; i < COUNT/4; i+=4) 



Don't divide COUNT by 4 if you are already adding 4 to i per iteration.

Also, Visual C++ already does 2 SIMD adds in parallel (using ymm4 and ymm5 as the sets of accumulators).
Really interesting blog post. I definitely find that the "most obvious thing" approach starts to break down quickly in higher-level languages. You do a really good job in the article of highlighting how languages can conceal both algorithmic complexity (particularly bad in functional languages, as shown in your F# examples) and implementation-level inefficiency (as demonstrated with your SIMD examples).

At the risk of being pedantic, I think it's worth considering the difference between "obvious" and "idiomatic" code. For example, in C++, the idiomatic way of passing an object into a function (by const ref) is completely reasonable from an efficiency standpoint.
void my_function(std::string const &s) {
	// ...
}

However, this could not be considered the most obvious code, which would probably be something like this instead.
void my_function(std::string s) {
	// ...
}

This of course introduces an unnecessary deep copy. In this sense, the idiomatic C++ code has become idiomatic as a way of counteracting the inherent inefficiency of the obvious code.

A similar problem manifests in Rust. As an example, imagine we've implemented a Mat4x4 struct. To multiply two Mat4x4s (m1 and m2), the most obvious code to write would probably be something like this.
let m: Mat4x4 = m1 * m2;

This looks as though it passes m1 and m2 by reference to the multiplication function, but it actually moves them, transferring ownership and meaning that m1 and m2 can no longer be used. If this isn't the desired behaviour, we could make our Mat4x4 copyable, which would solve that problem but would make our above code copy 32 floats into our multiplication function, which is almost certainly less efficient than passing by reference. To solve this, we could let the multiplication function borrow m1 and m2, but that would require code like this.
let m: Mat4x4 = &m1 * &m2;

To me, this code is less obvious than the previous version, but of course is more efficient.

To return to C/C++, the same problem of excess copies is manifest in the obvious implementation of various mathematical operators. For example, consider the typical implementation of operator+(), which would return a new instance of an object (idiomatic C++ is to "do as the ints do", and operator+ on an int returns a new instance). This means that in long and complex expressions, lots of temporary objects are created, copied, and destroyed when the memory could be reused. For example, this code
std::string s = s1 + s2 + s3;

is the equivalent of
std::string temp1 = s1 + s2;
std::string s = temp1 + s3;

This problem has led to the unspeakable horror that is C++ expression templates. Whilst these do indeed "solve" this efficiency problem, they are about as far as it's possible to be from obvious, and are so complex that they are not even considered idiomatic within the C++ community, which has become fairly tolerant of high-complexity solutions.
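
To give a flavour of the technique (a deliberately minimal sketch of the idea, not real library code): operator+ returns a lightweight node that just remembers its operands, and nothing is concatenated until the expression is assigned to a string, at which point one buffer of exactly the right size is filled.

#include <cstddef>
#include <string>

// Leaf: a cheap wrapper holding a reference to one string.
struct StrRef {
    const std::string &s;
    std::size_t size() const { return s.size(); }
    void append_to(std::string &out) const { out += s; }
};

// Interior node: remembers its operands but does no work yet.
template <typename L, typename R>
struct Cat {
    L l;
    R r;
    std::size_t size() const { return l.size() + r.size(); }
    void append_to(std::string &out) const { l.append_to(out); r.append_to(out); }
    operator std::string() const {
        std::string out;
        out.reserve(size());  // one allocation for the whole expression
        append_to(out);
        return out;
    }
};

inline StrRef expr(const std::string &s) { return StrRef{s}; }
inline Cat<StrRef, StrRef> operator+(StrRef a, StrRef b) { return {a, b}; }
template <typename L, typename R>
Cat<Cat<L, R>, StrRef> operator+(Cat<L, R> a, StrRef b) { return {a, b}; }

// std::string s = expr(s1) + expr(s2) + expr(s3);  // no intermediate strings

Even this toy version hints at why nobody calls it obvious; a real implementation also has to worry about dangling references, mixed operand types, and so on.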

Of course, an added complexity in all of this is the fact that when we're discussing higher-level languages, it's very hard to separate issues of inherent inefficiency from specific implementations. As you highlighted in your section on SIMD, you're rather at the mercy of your (JIT) compiler. As you made clear, there is no reason C# couldn't apply automatic vectorisation, and there is nothing requiring C++ to do so. This leads to the interesting situation where instead of application programmers being able to find the most efficient solution, you have implementers examining idiomatic code and ensuring their implementations optimize for that usage. A classic example of this would be tail call optimizations in functional languages.
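
To make that last example concrete (my own illustration, not from the article): functional languages make the accumulator-passing style below idiomatic precisely because their implementations guarantee that a call in tail position reuses the current stack frame. Rendered in C++, GCC and Clang will typically compile it to a loop at -O2, but nothing in the language requires them to.

// Sum 1..n in accumulator-passing style; the recursive call is in tail
// position, so a compiler performing tail-call optimization can turn the
// recursion into a plain loop with no stack growth.
long sum_to(long n, long acc) {
    if (n == 0)
        return acc;
    return sum_to(n - 1, acc + n);
}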

As for examples of where languages can lure you down the wrong path, I had a fun experience with Eiffel once. In Eiffel, the preferred way of managing concurrency is through a methodology/technology called SCOOP. Without getting into too much detail, the basic idea is that each object explicitly marked as "separate" runs in its own virtual thread and communicates via messages, much like processes in Erlang. However unlike in Erlang, it turns out that the EiffelStudio implementation spins up an OS level thread for every object you mark as separate. You can imagine what happened when I instantiated a few hundred of these things without realising... I think that's an example of somewhere where the idiomatic way of programming was, if not inherently inefficient, then inefficient for a large proportion of cases, and in fact required a good understanding of the implementation to use. Leaky abstractions strike again!

Edited by Laurie on
I think it actually becomes memory-throughput bound with the 'obvious' SIMD solution, so we may need to go multithreaded to get more L1 caches!
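
Just to sketch what that might look like (my own illustration with plain std::thread; each thread sums its own slice into its own accumulator):

#include <numeric>
#include <thread>
#include <vector>

double sum_squares_mt(const double *values, int count, int nthreads) {
    std::vector<double> partial(nthreads, 0.0);
    std::vector<std::thread> threads;
    int chunk = count / nthreads;
    for (int t = 0; t < nthreads; t++) {
        // Each thread reads its own slice and writes only its own slot.
        threads.emplace_back([&, t] {
            int begin = t * chunk;
            int end = (t == nthreads - 1) ? count : begin + chunk;
            double s = 0.0;
            for (int i = begin; i < end; i++)
                s += values[i] * values[i];
            partial[t] = s;  // padding the slots would avoid false sharing
        });
    }
    for (std::thread &th : threads)
        th.join();
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}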

The string example is a good one. In most managed languages, adding strings with the + operator is obvious but expensive. I wonder if C# couldn't automatically identify cases where StringBuilder could be swapped in for a bunch of string concatenations.
MandleBro
I think it actually becomes memory-throughput bound with the 'obvious' SIMD solution, so we may need to go multithreaded to get more L1 caches!


That's not going to work if the actual memory bus is saturated.

MandleBro

The string example is a good one. In most managed languages, adding strings with the + operator is obvious but expensive. I wonder if C# couldn't automatically identify cases where StringBuilder could be swapped in for a bunch of string concatenations.


Here's somewhere where I would say to make the slow code non-obvious. By which I mean: get rid of the + operator overload for concatenation altogether and make the use of a string builder class idiomatic.
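
Something along the lines of this hypothetical helper (C++17; the name is invented for illustration) would make the cost explicit - the caller sees every piece, and the function sizes the result exactly once:

#include <cstddef>
#include <initializer_list>
#include <string>
#include <string_view>

// An explicit builder-style concat: no operator+, one sized allocation.
inline std::string concat(std::initializer_list<std::string_view> parts) {
    std::size_t total = 0;
    for (std::string_view p : parts)
        total += p.size();
    std::string out;
    out.reserve(total);
    for (std::string_view p : parts)
        out.append(p);
    return out;
}

// std::string s = concat({s1, s2, s3});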

Edited by ratchetfreak on
ConcatAndCrushTheGC("First Half","Second Half");

That should scare people off.
I would like to clarify that when I say "type the obvious code" I am not talking about things on the level of const & or not. Getting that stuff right is part of what I consider "obvious" for the purposes of those discussions.

This is not to invalidate anything about the point being made, which is that it's better if the shorter/clearer thing is also the best thing. I agree with that too, it's just a completely different idea than what I was talking about as referenced in the OP. (And it's almost never true in C++).
What about, rather than 'write the obvious code', which will be different in every language, making it 'make the computer do the obvious thing'? That is language-independent and problem/domain-dependent (and machine-dependent, which may not be a good thing).

It's a subtle change in view, but I think it highlights the problem better: with 'idiomatic code' you kind of sit at the level of lines of code and syntactic sugar, and your benchmarks are on simple operations over a lot of homogeneous data. But the real problem is how to write large, complex programs that still work in obvious ways without teetering over from technical debt. And one of the ways you do that is to structure your program so it can express what you are doing as simple operations over a lot of homogeneous data - exactly what these benchmarks assume you already have! This is not a knock on the OP by any means but, stepping back a little, you're in a good place if your performance problems are solved by swapping out one fold function for another. If anything, writing idiomatic F# has led you precisely where you want to be, even if it appears to be 10x slower than it could be for no real reason, because rectifying it involved no changes to the underlying inputs and outputs. Whereas the string concatenation problem is more serious and trickier to solve (despite being so dirt simple), and we end up with that Chrome navigation bar nonsense.

And it also gives you an idea of why C is still so good - it makes it hard to do things that aren't obvious at the level of the machine. Take a look at the last section of this page. Is that a good idea? I'm inclined to say no, but I might not be if it were presented in a language that made it easy.

Anyway, I just spent an inordinate amount of time trying to find a quote from one of the OOP luminaries - I want to say Alan Kay, but I'm not sure - bemoaning that it would be easy to design hardware that made OOP more natural at the hardware level, but instead we got x86 and whatnot. When I read it, it struck me as specious, but I can't find it now!

Edited by graeme on
You are right that toy examples don't get at the real challenges of software development. It would be good, though, if we could count on our languages, compilers, and core libraries to do something very close to the most efficient thing when we type the clearest code.

That doesn't do much to solve the hard problem of "how do we build huge software projects correctly and in less time", but it would do a lot to make less software slow. It would help with the large-scale problems a little bit. For instance, if you had a large F# program that processed arrays of numbers a lot, you wouldn't need to import or create your own library to get good performance. That would simplify your project, speed up your builds, etc.

Addressing the problem from the hardware end is an interesting idea; I don't know enough about EE to know how feasible it would be. Hardware support for GC? Cache schemes that know about virtual functions?



Edited by Jack Mott on