I have some single-threaded code that uses C# reflection to load some global types and do some processing.
foreach (Assembly assembly in assemblies)
{
    if (CheckAssembly(assembly))
    {
        Type[] types = assembly.GetTypes(flags);
        foreach (Type type in types)
            if (CheckType(type))
                DoWork(type);
    }
}
The program was quite simple, and after I got it working, I tried to multithread it. I'm quite new to multithreading in general and to C#, so I just used the standard Task library. There are a couple of ways I can do this: I can run a new task for each of the assemblies, or for each of the types, or both.
foreach (Assembly assembly in assemblies)
{
    if (CheckAssembly(assembly))
    {
        // Assembly
        taskList.Add(Task.Run(() =>
        {
            Type[] types = ...;
            foreach (Type type in types) ...
        }));

        // or Type
        Type[] types = assembly.GetTypes(flags);
        foreach (Type type in types)
            if (CheckType(type))
                taskList.Add(Task.Run(() => DoWork(type)));
    }
}
For some reason, having each task handle an assembly is faster than having each task handle a type, or both. This leads me to believe there's probably some lock contention happening for types in the same assembly. I couldn't find any article about how the C# reflection library handles threading or locking. The only thing I know is that it's thread-safe. Does anyone have any information about this?
I don't think this is an issue with reflection itself. It's just what happens when you dispatch too many tiny pieces of work.
In general, threading at a higher level of granularity (working on bigger chunks of work) will do better than threading at a lower level of granularity (working on many smaller pieces of work), because in the latter case the task/thread system has to spend too much time on synchronization while dispatching all those small pieces. So handing out work at the assembly level is better, because those are bigger chunks of "work" to do.
But don't guess - run a profiler and see where the time is spent.
I'll probably run it through a profiler soon, but before that, here are some of my questions:
Edit: After profiling it inside Unity, here's the result (ignore the pink part, focus on the blue ones):
The first column is per assembly, the second one is per type, and the third one is both (add a task for each assembly, then each task later adds a new task for each type in the assembly). All the tasks in 1 and 2 are added on the main thread, which I don't show here. The main thread's job is to add tasks and wait for them to finish.
Perf: assembly > both > type. I can see a lot of empty spaces in column 2. Is this where all the synchronization and dispatching happens? How can I improve it? Is this because there's something wrong with the Task library?
My guess is no, there is not. But I don't know for sure. Those things provide static information, so there should be no reason to lock anything there.
That should not matter. Regardless of whether tasks work via locking or are lockless, you will have the same problems if the amount of work you hand out is too small.
I assume the gaps in the second column are there because the main thread is parsing the assembly and building the whole Type array (GetTypes) before dispatching anything to the threads.
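A minimal sketch of avoiding that stall, reusing the question's CheckAssembly/CheckType/DoWork and flags placeholders - the GetTypes call moves inside the task, so the main thread only queues work:

foreach (Assembly assembly in assemblies)
{
    if (!CheckAssembly(assembly)) continue;

    taskList.Add(Task.Run(() =>
    {
        // GetTypes now runs on a pool thread, so the main thread
        // never blocks on assembly parsing between dispatches.
        Type[] types = assembly.GetTypes(flags);
        foreach (Type type in types)
            if (CheckType(type))
                DoWork(type);
    }));
}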
There is nothing wrong with the Task library or with how it dispatches and synchronizes things. But you should not guess what gets synchronized and what does not - the profiler should show that. If Unity cannot show system C# functions, use a different profiler, or just run outside Unity - normal profilers do not care whether a function is yours or comes from System (or wherever else); they all show up in the profile views.
Dispatching lots of small work items is often solved by chunking them: instead of giving each item its own task, you hand out chunks. Estimate how much work is reasonable and chunk your array:
int chunkSize = 1024;
Type[] types = assembly.GetTypes(flags);
for (int i = 0; i < types.Length; i += chunkSize)
{
    int chunkStart = i;
    int chunkEnd = Math.Min(i + chunkSize, types.Length);
    taskList.Add(Task.Run(() =>
    {
        for (int c = chunkStart; c < chunkEnd; c++)
            if (CheckType(types[c]))
                DoWork(types[c]);
    }));
}
With newer C# versions there are built-in helpers for this:
using System.Collections.Concurrent;
using System.Threading.Tasks;

var part = Partitioner.Create(0, types.Length);
Parallel.ForEach(part, (range, state) =>
{
    for (int i = range.Item1; i < range.Item2; i++)
        if (CheckType(types[i]))
            DoWork(types[i]);
});
You might want to explicitly pass the third argument to Partitioner.Create to get bigger chunks (I'm not sure what it uses by default).
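For example, a sketch where 1024 is an arbitrary range size you would tune for your workload:

using System.Collections.Concurrent;
using System.Threading.Tasks;

// The third argument is the range size: each partition handed to a
// worker covers at most this many consecutive indices.
var part = Partitioner.Create(0, types.Length, 1024);
Parallel.ForEach(part, range =>
{
    for (int i = range.Item1; i < range.Item2; i++)
        if (CheckType(types[i]))
            DoWork(types[i]);
});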
In my experience, whatever reflection caches internally is not enough. It does only very minimal work up front, so any locking there is not the issue. That's also why people say reflection is slow - most of the cost comes afterwards, on top of it. So you often want to cache reflection results yourself, especially when you'll be reusing them multiple times.
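A minimal sketch of that kind of caching (the cache shape and BindingFlags are assumptions, not anything from the question; ConcurrentDictionary makes it safe to populate from multiple tasks):

using System;
using System.Collections.Concurrent;
using System.Reflection;

static class ReflectionCache
{
    // Computed once per type, then reused on every later lookup.
    static readonly ConcurrentDictionary<Type, MemberInfo[]> members =
        new ConcurrentDictionary<Type, MemberInfo[]>();

    public static MemberInfo[] GetMembersCached(Type type) =>
        members.GetOrAdd(type,
            t => t.GetMembers(BindingFlags.Public | BindingFlags.Instance));
}

Calls like ReflectionCache.GetMembersCached(type) inside DoWork then pay the reflection cost only once per type.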
I only know the task system is implemented with a thread pool. Does each thread pick a task, finish it, then pick another task?
I assume it is like that. That's why they give you helpers to batch the work yourself: that way it works well for both kinds of workloads, small and large. If the Task system always processed tasks in bulk on its own, large work items would fare really badly.
C# has many good automatic profilers, so there is no need for manual instrumenting. Manual instrumenting is fine when you want to profile your own code, but when you want to know what happens inside system functions, you cannot instrument those by hand.
Even the VS built-in profiler is quite useful nowadays: https://learn.microsoft.com/en-us/visualstudio/profiling/profiling-feature-tour?view=vs-2022#instrumentation - you will see all the functions in the call stack, not just yours, but System and others too. And here https://learn.microsoft.com/en-us/visualstudio/profiling/profiling-feature-tour?view=vs-2022#analyze-cpu-usage (3rd screenshot) you'll see which path in which call stack contributes how much, and the most (Hot Path).
There's a Parallel.For function that takes in a range and a Parallel.ForEach function that takes in an IEnumerable. As for the default chunk size of the Partitioner.Create function, it seems that the subrange size changes depending on the range you pass in.

If you want to fully iterate first before doing another parallel For, then yeah - that doesn't work. You can do a parallel For from within another parallel For, if that helps; otherwise you need to issue the tasks yourself.
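A sketch of that nested form, with the same placeholders as before (nested Parallel loops share one thread pool, and the outer worker thread participates in the inner loop, so nesting does not deadlock):

Parallel.ForEach(assemblies, assembly =>
{
    if (!CheckAssembly(assembly)) return;

    Type[] types = assembly.GetTypes(flags);

    // Inner parallel loop over this assembly's types, chunked so
    // each delegate invocation covers a range rather than one item.
    Parallel.ForEach(Partitioner.Create(0, types.Length), range =>
    {
        for (int i = range.Item1; i < range.Item2; i++)
            if (CheckType(types[i]))
                DoWork(types[i]);
    });
});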
I used Partitioner for the reasons I explained above - one iteration per item is a lot of overhead when a single item's workload is small.
See the link I posted: https://learn.microsoft.com/en-us/dotnet/standard/parallel-programming/how-to-speed-up-small-loop-bodies
When a Parallel.For loop has a small body, it might perform more slowly than the equivalent sequential loop, such as the for loop in C# and the For loop in Visual Basic. Slower performance is caused by the overhead involved in partitioning the data and the cost of invoking a delegate on each loop iteration. To address such scenarios, the Partitioner class provides the Partitioner.Create method, which enables you to provide a sequential loop for the delegate body, so that the delegate is invoked only once per partition, instead of once per iteration.