A discussion of many aspects of parallelism and concurrency in programming, and the pros and cons of different programming methodologies.
This is a fishbowl: a panel conversation held on the Handmade Network Discord where a select few participants discuss a topic with depth and nuance. We host them every two months, so if you want to catch the next one, join the Discord!
I'm having trouble understanding that. The way I remember it is that one is about authoring code as independent "trains of thought", the other is actually running code on multiple physical cores. But not quite sure about that
WaywardNov 19, 2020 04:24 AM
responsiveness versus throughput? @jfs
do we have anyone in the fishbowl who's been playing around with parallel programming at all, or are we just a big group of beginners?
bumbreadNov 19, 2020 04:26 AM
concurrency == breaking the program into tasks that are independent from each other (threads, basically)
parallelism == physically running a few tasks on different CPU cores
If you look at "the three columns of concurrency" that I linked above, I'm mostly interested in the responsiveness pillar right now. But to make that not a pain to develop (e.g. by creating a separate OS thread for each task), shared-data locks (mutexes) are definitely needed
So, assuming that we're talking about both concurrency and parallelism here. Because one isn't very meaningful without the other, right?
WaywardNov 19, 2020 04:32 AM
i thought parallelism was more of the umbrella term, given how computer hardware has developed during the last 25 years
A realization that led me into being interested in this topic recently is that even a single machine is a concurrent and multithreaded thing. Not only do machines have multiple CPU cores, they also have a separate graphics processing unit, and all sorts of other things that (for example) communicate through interrupts. It gets more pronounced of course when adding networking...
I can't really do a plain fgets() or fread() anymore, it seems so wrong to me when I'm doing e.g. GUI on the same thread(edited)
The human interacting with the GUI is like a completely separate thread as well!
WaywardNov 19, 2020 04:37 AM
i'm more curious about the massively parallel part of the discussion, maximizing throughput and thereby optimizing for speed (running a single big task on multiple cpu threads to optimize for time)
so @bumbread is right; concurrency is basically the go/node idea where you can have lots of "things" (threads, green threads, coroutines, goroutines, ...) in-flight at the same time, but you can still do it all on a single core if that's all you have. and parallelism is more when things run truly in parallel (on different cores).
bumbreadNov 19, 2020 04:53 AM
well maybe let's not worry about definitions for now
@jfs I think you meant @NWDD but I can answer too. First of all, I think thread safety can be many different things. You mentioned preemption, that's one part of it -- if you only have 1 CPU and you fully control when the next thread executes, then you can be a lot more relaxed about a lot of things, since you know no other CPU will be able to access your variables concurrently. in other words, you don't need locking at all.
Now that the definitions are done and over with, let's get down to the dirty stuff
NWDDNov 19, 2020 05:01 AM
so for example: imagine you're developing a network layer which is something that is highly asynchronous by definition.
there are a lot of little details you have to handle: has the connection timed out? what data have you received? was there a network error? did the user put a nintendo switch in the microwave so the network connection broke?
with concurrency you can actually ditch all means of synchronization and still have everything appear, from a perceived point of view, to happen "in parallel"(edited)
Unless I'm misunderstanding what you mean, I disagree. You often need synchronization with concurrency. Many webapps have synchronization bugs even though they are entirely single-threaded.
NWDDNov 19, 2020 05:07 AM
ah, yeah, i meant specifically cooperative concurrency (for example fibers or lua coroutines running in a single-thread or cooperative threads pinned to one core)
WaywardNov 19, 2020 05:10 AM
@AsafGartner that sounds like a problem with the single-threaded event-driven process, rather than with a simple form of Monte Carlo where you have one thread that creates jobs and takes back the results afterwards, with no interaction in between each job thread
Synchronization is for shared state. If you don't have shared state then it's no problem, but if you do you still need to enforce synchronization even in non-parallel systems.
VegardNov 19, 2020 05:11 AM
@AsafGartner web apps are sometimes spawned in parallel by the web server, so even though the program itself is single-threaded, there can still be multiple instances of it running in parallel. not sure if this is what you're talking about
Yes. I didn't really understand what you meant @NWDD . And what is it anyway, that is "acting like a separate process" (like @gingerBill says)?
In my mind, coroutines are something that the OS is unaware of
basically I'm thinking of fibers
Fibers btw. are seen as a failed experiment(edited)
POSIX removed their implementation, and Microsoft deprecated them long ago
because with them, we not only have process-global and thread-global storage, but also fiber-global storage(edited)
and all the related confusion that comes with it
also when scheduling a single fiber on multiple OS threads
and so on
VegardNov 19, 2020 05:58 AM
glibc still has setcontext/swapcontext/etc. I think the reason it was deprecated in POSIX was because it had a bunch of requirements WRT signals, which meant that you need 2+ syscalls per fiber switch, which is slow.
so it's still fully possible to define your own semantics and get good performance out of them. just not with an implementation following the POSIX specification
@NWDD People rarely know what they want before trying stuff out.
VegardNov 19, 2020 06:10 AM
aren't fibers and green threads synonyms?
NWDDNov 19, 2020 06:11 AM
fibers are always cooperative and something that is controlled by the programmer (so the programmer has to decide when the fiber starts, where it executes, at which point it is stopped, if/how it is scheduled...)(edited)
VegardNov 19, 2020 06:14 AM
well, okay. so you can use fibers to implement green threads
that only falls flat in tight loops where all the allocation has been optimized out
I don't think there is another way of doing scheduling in userland, or is there? How would that UMS work, anyway?
VegardNov 19, 2020 06:20 AM
on Linux at least you can register a timer that will send a signal, then you can context switch from the signal handler. but this does go through the kernel as well, so it's not that great for performance
NWDDNov 19, 2020 06:20 AM
well, in the end it's all semantics; it depends on what you consider "doing scheduling"
I have noticed that a lot of beginners parallelize things (and I am by no means an expert either) because they think it will make things faster, when in reality their problem could be made extremely fast on a single thread, and multiple threads would slow things down.
WaywardNov 19, 2020 07:10 AM
your point being that multithreading primarily has benefits assuming the code is already optimized?
@gingerBill but even if you have efficient scheduling you still need to lay things out in a way that doesn't thrash, then use algorithms that are thread-safe and slower, or pay the extra cost of locking (and either annoy the scheduler and get kicked earlier, or pay for the context switch)... And in the end you end up with something that makes worse use of the system if that system is shared with other programs
yes, I mean theoretically you could have a game fit in a single cache line(edited)
but not be parallel-friendly
VegardNov 19, 2020 07:25 AM
a 64-byte game?
NWDDNov 19, 2020 07:25 AM
it's a theoretical exercise xD
not daring anyone to do so
VegardNov 19, 2020 07:26 AM
it's not a bad dare.
NWDDNov 19, 2020 07:26 AM
just saying that not being cache friendly (when doing multithreaded stuff) isn't related to using a huge amount of memory or optimizing(edited)
WaywardNov 19, 2020 07:27 AM
@NWDD can we just jump straight to the long version?
NWDDNov 19, 2020 07:27 AM
what long version?
oh that's right, the game
regarding that game I was talking about,
we're now changing it so that instead of running separate systems on separate threads
everything will end up running in the same thread in order
this allows us to free cores so that we can actually run a separate game simulation on each core
which solves a harder problem regarding online multiplayer
point being: using multithreading for things that do not require it might actually prevent the program from reaching its full potential
WaywardNov 19, 2020 07:33 AM
single threaded applications are recommended until no longer viable, what about the latter case?
NWDDNov 19, 2020 07:34 AM
I can imagine a lot of cases where you actually want multi-threading eh? I mean for example: a compiler, a baking process, a simulation that can go faster than realtime...
It just feels a bit sad to have a thread switched out because someone decided that browsers need to have an insane level of parallelism and the gpu driver decided to spawn 30 threads in your game.(edited)
In a way, these two families can be seen as duals of one another
The actor model works very well in object-oriented languages because of the "object" nature of the "actors". Each actor is opaque to every other actor.
When these "actors" send messages, it is completely asynchronous, which means there is no blocking on the receiver's end.
So for the "actors", the "mailbox" (the thing holding the "messages", i.e. a queue) might need to hold potentially loads of messages, so it's a lot harder to reason about.
CSP (a form of Process Calculus, a slight variant of which Go uses) is fully synchronous in nature.
The message is flipped from being the unit of communication to the channel.
A channel is effectively a message queue with a monitor applied to it.
The channel is the thing that is referenced; messages can be sent to it or received from it.
Most channels will block, and are thus synchronous, until a receiver reads.
This has the advantage that, with the blocking mechanism, the channel only needs to "hold" one message.
Go extends this to allow for buffered channels (of fixed capacity), so that you can have stuff similar to the Actor model, but the channel is the focus, rather than the sender/receiver (i.e. the actor).
So in the Process Calculus family of models, the processes communicate (and synchronize) through channels, whilst the Actors are themselves the processes.
One interesting thing is that the actor model does have the advantage of being "modelable" across multiple machines, e.g. each machine is an actor which then communicates over a network.
So because the models are effectively duals of each other, they can be applied at different levels for different problems.
But because these are generalized models, they will not be good for all problems (obviously).
But one thing which seems to be a given across all forms of concurrency is the concept of a queue.
NWDDNov 19, 2020 08:07 AM
well, concurrency through shared memory may disagree
My point above was probably that there is a tradeoff - we can replace explicit synchronization on shared memory by mediating accesses through a data-structure-managing actor (process)
at the cost of more strict serialization
NWDDNov 19, 2020 08:35 AM
this is what I understood:
1 requester to 1 source == trivial -> (straight-forward single-thread code)
M requesters to 1 source == shared memory -> (for example: multiple threads accessing a hash-table)
1 requester to N sources == message -> (for example: a thread that receives network packets from multiple sockets and puts information in a hashtable)
M requesters to N sources == hybrid -> (for example: multiple threads that receive network packets from multiple sockets and put information in a hashtable)
For the audience (me), could you define the difference?
NWDDNov 19, 2020 10:01 AM
What's the worst concurrency bug you've all encountered?
A race condition that happened in a non-parallel situation, in a so-called "safe" programming language (C#), in a game that was more or less ECS-based, where a system removed the "rigidbody" component in a very specific (user-triggered) situation.
There was a separate system which worked like a finite-state-machine that was being updated every few frames.
If during a transition between two states the rigidbody got removed, the thread that was running the logic crashed, and the C# runtime for the target platform would silently swallow the exception, freezing the game (without providing any crash dump or useful information)(edited)
NWDDNov 19, 2020 10:12 AM
Here's a question, is there actually ever a need for a semaphore instead of a condition variable?
Have been thinking for a while, but I don't think I've ever felt the need or seen an advantage to using a semaphore
I do recall at some point using them because it was provided in a library example?
because the most recommended scenario for a semaphore is a pool of resources
however when you have a pool of resources, you then also want a handle to the specific resource you got
which needs an additional thread-safe way to handle it
and the thing that doles out the resource handles can handle blocking the thread just as well as the semaphore
NWDDNov 19, 2020 10:29 AM
I can see them being useful for example to request a system to run once (for example run a physics update)?
But because more often than not you need to know when it has finished and wait for it, there are other options?
Plus it feels like semaphores aren't really well known, which means they may confuse other people working on the project
In the end I feel that, because we're used to doing so, it's easier to reason about critical regions of code that should not execute at the same time, and other options are just not as popular?(edited)
they are still taught as a synchronization primitive though
and for the record:
a semaphore is a counter where, if you try to decrement below 0, it blocks until another thread increments it above 0
a condition variable is an object that lets you wait on a signal from another thread while temporarily releasing a mutex
NWDDNov 19, 2020 10:32 AM
yes, people are taught about synchronization in general but I don't think that knowledge persists after the tests
For those who have worked with parallel programming a fair amount, doing parallel programming properly such that close-to-optimal code is produced given any parallel problem seems extremely difficult to master, and extremely difficult to do in general. How much of this do you attribute to tools? Are there tools you wish you had when learning---or doing---this kind of programming?
What's the worst concurrency bug you've all encountered?
@bvisness I've run into a weird situation that involved some sort of false sharing. I had a writer queue that would normally transfer around 20MB / sec, but in that situation the throughput dropped to about 400KB / sec
forgot to mention explicitly that the bottleneck (400KB/s) here was a memcpy() ;-), whereas the original 20MB/s was simply the hardware source
@ryanfleury I'm not sure I'd attribute anything to tools. What we've missed out on in the discussion so far is focusing more on what actually happens in the hardware, HM style.(edited)
It is for sure important to understand the basics, like caches, MESI, and synchronization primitives
the Linux kernel goes quite fancy with synchronization on shared data structures, so maybe that is a recommendation. Though on the other hand, this approach is error-prone and not beautiful from an architectural perspective.
I'm sure that static analyzers, fuzzers, and similar tools can help a lot with debugging shared data structures. So that would be an answer to Ryan's question. I can't say anything about the quality of the existing tools, not really having used any
VegardNov 19, 2020 11:17 AM
I feel like I ought to have some tools to recommend, but the truth is that I don't really. usually the challenge when doing something fancy (i.e. more than the straightforward spinlock/mutex) is getting things to work correctly at all, and then the best thing you can do is to stress-test on as many CPUs as you can. so in a way maybe a machine with lots of CPUs is the best way to find bugs..?
I'm not really getting the distinction either. Also not all of CSP is channel-focused or communication-focused. If I recall, the first half of the book doesn't even mention channels. I recall CSP more as an algebra over sequences of events
but as for the channel part, I'm not sure why that is relevant or insightful. Would very much like to know
I gather that CSP channels are understood as sending messages synchronously (i.e. the sender waits until recipient "dequeues" it), as opposed to asynchronous sending with Actor model, but I'm also not sure that this is super relevant, nor would I actually want to waste the time waiting
There are in any case, different variants of all the theories - for example I believe CSP started out as a static graph of processes connected by channels, but it might be that this constraint was lifted in later publications (don't quote me, I haven't really checked back)
I know that the actor model is definitely always understood as sending completely asynchronously @bvisness . At least these days
but if I'm thinking of modelling my program after the actor model, I'll probably consider allocating messages statically in the sender, so allocation isn't a performance bottleneck, and the number of existing messages is naturally bounded
that would also mean each sender has to wait until it gets some of its messages returned back before it can send more, which would again lead to some form of "blocking". Not really while sending but still
that makes sense, and would lead to a model like I'm used to from Go
(what you describe is basically the behavior of a buffered channel)
I actually very rarely see any of Go's concurrency features used heavily outside the standard library, and I'm not sure why. They certainly feel well-integrated into the language, but nobody seems to use them.
Maybe it's just laziness or people not knowing how to use them.
I certainly do agree that the actor model / CSP is a good conceptual fit for an object-oriented language (I continue to think they are essentially the same)
I guess I'm curious, though, if it's an essentially object-oriented concept. Concurrent execution seems to me to imply that you have concurrent "processes", which will certainly have their own state. Whether that's an "object" vs. a coroutine, goroutine, thread, etc. seems unrelated.
After all, you can apply actor model or CSP designs inside a program, or across network boundaries, and the concepts feel the same to me
Pasting some messages from #fishbowl-audience about the differences between CSP and the actor model...
Some definitions from the internet:
An actor is a computational entity that, in response to a message it receives, can concurrently:
send a finite number of messages to other actors;
create a finite number of new actors;
designate the behavior to be used for the next message it receives.
Recipients of messages are identified by address, sometimes called "mailing address". Thus an actor can only communicate with actors whose addresses it has. It can obtain those from a message it receives, or if the address is for an actor it has itself created.
The actor model is characterized by inherent concurrency of computation within and among actors, dynamic creation of actors, inclusion of actor addresses in messages, and interaction only through direct asynchronous message passing with no restriction on message arrival order.
But what is the essential difference between sending a message "via a channel" or "to an address"?
My overall take
I guess it's true that in a CSP system you don't directly reference the other actors. But I'm still not convinced that meaningfully changes the model
you could conceive of channels as "actors" that hand messages back and forth, and now you've achieved basically the same thing?
it doesn't feel to me like a fundamental difference
and channels really are a certain standard implementation with a fixed comm interface(edited)
I also happen to think that it is extremely important to use standard interfaces for concurrently modelled programs
from my experience setting up all these graphs and actors and channels and what not, and then using the comm ifaces right, is extremely painful, so one wants to maintain and use as few primitives as possible(edited)