I have some single-threaded code that uses C# reflection to load some global types and do some processing.
foreach (Assembly assembly in assemblies)
{
    if (CheckAssembly(assembly))
    {
        Type[] types = assembly.GetTypes(flags);
        foreach (Type type in types)
            if (CheckType(type))
                DoWork(type);
    }
}
The program was quite simple, and after I got it working, I tried to multithread it. I'm quite new to multithreading in general and to C#, so I just used the standard Task library. There are a couple of ways I can do this: I can run a new task for each of the assemblies, or for each of the types, or both.
foreach (Assembly assembly in assemblies)
{
    if (CheckAssembly(assembly))
    {
        // Assembly
        taskList.Add(Task.Run(() =>
        {
            Type[] types = ...;
            foreach (Type type in types) ...
        }));

        // or Type
        Type[] types = assembly.GetTypes(flags);
        foreach (Type type in types)
            if (CheckType(type))
                taskList.Add(Task.Run(() => DoWork(type)));
    }
}
For some reason, having each task handle an assembly is faster than having each task handle a type, or both. This leads me to believe there's probably some lock contention happening for types in the same assembly. I couldn't find any article about how the C# reflection library handles threading or locking. The only thing I know is that it's thread-safe. Does anyone have any information about this?
I don't think this is an issue with reflection itself. It's just what happens when you dispatch too many tiny pieces of work.
In general, threading at a higher level of granularity (working on bigger chunks of work) will do better than threading at a lower level of granularity (working on many smaller pieces of work), because in the latter case the task/thread system has to spend too much time on synchronization while dispatching all those small pieces. So handing out work at the assembly level is better, because those are bigger chunks of "work" to do.
But don't guess - run a profiler and see where the time is spent.
I'll probably run it through a profiler soon, but before that, here are some of my questions:
Edit: After profiling it inside Unity, here's the result (ignore the pink part, focus on the blue ones):
The first column is per assembly, the second one is per type, and the third one is both (add a task for each assembly, then each task later adds a new task for each type in the assembly). All the tasks in 1 and 2 are added on the main thread, which I don't show here. The main thread's job is to add tasks and wait for them to finish.
Perf: assembly > both > type. I can see a lot of empty spaces in column 2. Is this where all the synchronization and dispatching happens? How can I improve it? Is this because there's something wrong with the Task library?
My guess is no, there is not. But I don't know for sure. Those things provide static information, so there should be no reason to lock anything there.
That should not matter. Regardless of whether tasks work via locking or are lockless, you will have the same problems if the amount of work you hand out is too small.
I assume the gaps in the second column are there because the main thread is parsing the assembly and building the whole Type array (GetTypes) before dispatching anything to the threads.
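A minimal sketch of avoiding that stall, reusing the question's CheckAssembly/CheckType/DoWork and flags placeholders - the GetTypes call moves inside the task, so the main thread only queues work:

foreach (Assembly assembly in assemblies)
{
    if (!CheckAssembly(assembly)) continue;

    taskList.Add(Task.Run(() =>
    {
        // GetTypes now runs on a pool thread, so the main thread
        // never blocks on assembly parsing between dispatches.
        Type[] types = assembly.GetTypes(flags);
        foreach (Type type in types)
            if (CheckType(type))
                DoWork(type);
    }));
}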
There is nothing wrong with the Task library or with how it dispatches and synchronizes things. But you should not guess what gets synchronized and what does not - the profiler should show that. If Unity cannot show system C# functions, use a different profiler, or just run outside Unity - normal profilers do not care whether a function is yours or comes from System (or wherever else); they all show up in the profile views.
Dispatching lots of small work items is often solved by chunking them: instead of giving each item its own task, you hand out chunks. Estimate how much work is reasonable and chunk your array:
int chunkSize = 1024;
Type[] types = assembly.GetTypes(flags);
for (int i = 0; i < types.Length; i += chunkSize)
{
    int chunkStart = i;
    int chunkEnd = Math.Min(i + chunkSize, types.Length);
    taskList.Add(Task.Run(() =>
    {
        for (int c = chunkStart; c < chunkEnd; c++)
            if (CheckType(types[c]))
                DoWork(types[c]);
    }));
}
With newer C# versions there are built-in helpers for this:
using System.Collections.Concurrent;
using System.Threading.Tasks;

var part = Partitioner.Create(0, types.Length);
Parallel.ForEach(part, (range, state) =>
{
    for (int i = range.Item1; i < range.Item2; i++)
        if (CheckType(types[i]))
            DoWork(types[i]);
});
You might want to explicitly pass the third argument to Partitioner.Create to get bigger chunks (I'm not sure what it uses by default).
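For example, a sketch where 1024 is an arbitrary range size you would tune for your workload:

using System.Collections.Concurrent;
using System.Threading.Tasks;

// The third argument is the range size: each partition handed to a
// worker covers at most this many consecutive indices.
var part = Partitioner.Create(0, types.Length, 1024);
Parallel.ForEach(part, range =>
{
    for (int i = range.Item1; i < range.Item2; i++)
        if (CheckType(types[i]))
            DoWork(types[i]);
});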
In my experience, whatever reflection caches internally is not enough. It does only very minimal work up front, so any locking there is not the issue. That's also why people say reflection is slow - most of the cost comes afterwards, on top of it. So you often want to cache reflection results yourself, especially when you'll be reusing them multiple times.
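A minimal sketch of that kind of caching (the cache shape and BindingFlags are assumptions, not anything from the question; ConcurrentDictionary makes it safe to populate from multiple tasks):

using System;
using System.Collections.Concurrent;
using System.Reflection;

static class ReflectionCache
{
    // Computed once per type, then reused on every later lookup.
    static readonly ConcurrentDictionary<Type, MemberInfo[]> members =
        new ConcurrentDictionary<Type, MemberInfo[]>();

    public static MemberInfo[] GetMembersCached(Type type) =>
        members.GetOrAdd(type,
            t => t.GetMembers(BindingFlags.Public | BindingFlags.Instance));
}

Calls like ReflectionCache.GetMembersCached(type) inside DoWork then pay the reflection cost only once per type.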
I only know the task system is implemented with a thread pool. Does each thread pick a task, finish it, then pick another task?
I assume it is like that. That's why they give you helpers to batch the work yourself: that way it works well for both kinds of workloads, small and large. If the Task system always processed tasks in bulk on its own, large work items would fare really badly.
C# has many good automatic profilers, so there is no need for manual instrumenting. Manual instrumenting is fine when you want to profile your own code, but when you want to know what happens inside system functions, you cannot instrument those by hand.
Even the VS built-in profiler is quite useful nowadays: https://learn.microsoft.com/en-us/visualstudio/profiling/profiling-feature-tour?view=vs-2022#instrumentation - you will see all the functions in the call stack, not just yours, but System and others too. And here https://learn.microsoft.com/en-us/visualstudio/profiling/profiling-feature-tour?view=vs-2022#analyze-cpu-usage (3rd screenshot) you'll see which path in which call stack contributes how much, and the most (Hot Path).
There's a Parallel.For function that takes in a range and a Parallel.ForEach function that takes in an IEnumerable. As for the default chunk size of the Partitioner.Create function, it seems that the subrange size changes depending on the range you pass in.

If you want to fully iterate first before doing another parallel For, then yeah - that doesn't work. You can do a parallel For from within another parallel For, if that helps; otherwise you need to issue the tasks yourself.
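A sketch of that nested form, with the same placeholders as before (nested Parallel loops share one thread pool, and the outer worker thread participates in the inner loop, so nesting does not deadlock):

Parallel.ForEach(assemblies, assembly =>
{
    if (!CheckAssembly(assembly)) return;

    Type[] types = assembly.GetTypes(flags);

    // Inner parallel loop over this assembly's types, chunked so
    // each delegate invocation covers a range rather than one item.
    Parallel.ForEach(Partitioner.Create(0, types.Length), range =>
    {
        for (int i = range.Item1; i < range.Item2; i++)
            if (CheckType(types[i]))
                DoWork(types[i]);
    });
});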
I used Partitioner for the reasons I explained above - one iteration per item is a lot of overhead when a single item's workload is small.
See the link I posted: https://learn.microsoft.com/en-us/dotnet/standard/parallel-programming/how-to-speed-up-small-loop-bodies
When a Parallel.For loop has a small body, it might perform more slowly than the equivalent sequential loop, such as the for loop in C# and the For loop in Visual Basic. Slower performance is caused by the overhead involved in partitioning the data and the cost of invoking a delegate on each loop iteration. To address such scenarios, the Partitioner class provides the Partitioner.Create method, which enables you to provide a sequential loop for the delegate body, so that the delegate is invoked only once per partition, instead of once per iteration.