Writing fast C++ to CLR interop

I have a pet project with support for DLL based plugins. I'm currently adding support for managed plugins through .NET and I'm sanity checking the performance of the interop for these plugins. I'm trying to understand two things: 1) why I'm seeing a significant performance difference between x86 and x64, and 2) what the quantitative performance ceiling for interop is in general.

For issue 1, when I call an empty managed function from C++ 100,000,000 times in a loop in an x86 release build, it averages out to 8ns per call. That appears reasonable on the surface. On my 3.5 GHz machine that's about 28 cycles. Microsoft's P/Invoke documentation says it takes about 30 x86 instructions for managed to unmanaged interop. C# to C++ interop in x86 Unity3D builds takes 8.2ns. And I've seen Hans Passant mention 7ns as the usual interop cost on Stack Overflow. So I seem to be in the right ballpark.

However, it gets weird when I build for x64 instead. That number jumps to 17ns. I haven't been able to find any information about interop differences between x86 and x64. Naively testing x64 interop in Unity3D shows about 5.9ns. I'm testing an empty function with no parameters or return value, so presumably there's no marshaling or slowdown from processing wider data. Playing with calling convention and optimization settings isn't making a significant difference.

I notice that simply calling from native C++ into the native side of a C++/CLI DLL takes 1.3ns in both x86 and x64. But calling directly into managed code, or calling into unmanaged and then into managed, both show the doubling in cost.
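The measurement loop itself is roughly this shape (a simplified sketch, not my exact test code; the declaration is a placeholder for the function exported by the plugin DLL):

#include <chrono>
#include <cstdio>

// Placeholder import of the empty managed function exported by the
// C++/CLI plugin DLL (the native-only test uses the same loop).
__declspec(dllimport) void ManagedUpdate();

int main()
{
    const int iterations = 100000000;

    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iterations; ++i)
        ManagedUpdate();
    auto end = std::chrono::high_resolution_clock::now();

    double totalNs = std::chrono::duration<double, std::nano>(end - start).count();
    std::printf("ManagedUpdate: %.2f ns/call\n", totalNs / iterations);
    return 0;
}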

I've asked this on Stack Overflow and included the code I'm testing with, though I'm not expecting a concrete answer there.

Any ideas why interop takes twice as long on x64 compared to x86?

For question 2, is 8ns about right for interop? Or are there fairly straightforward things I can be doing to get that lower?
I've stepped through the disassembly (not something I'm very familiar with at this point).

In x86 I see something along the lines of
01081125  call        ManagedUpdate (0108109Dh)
0108109D  jmp         dword ptr [__mep@?ManagedUpdate@@$$FYAXXZ (0108A000h)]
02A00017  jmp         _IJWNOADThunkJumpTarget@0 (733CE9A7h)


In x64 I see
00007FF747F81104  call        ManagedUpdate (07FF747F83000h)
00007FF747F8300A  jmp         qword ptr [__mep@?ManagedUpdate@@$$FYAXXZ (07FF747F8A008h)]
//Some jumping around that quickly leads to IJWNOADThunk::MakeCall:
00007FFDF9FE24E9  call        IJWNOADThunk::FindThunkTarget (07FFDFA52DBC0h)
//MakeCall uses the result from FindThunkTarget to jump into UMThunkStub:


FindThunkTarget is pretty heavy and it looks like most of the time is being spent there. So my working theory is that in x86 the thunk target is known and execution can more or less jump straight to it. But in x64 the thunk target is not known and a search process takes place to find it before being able to jump to it. I wonder why that is?
Comparing MS .NET documentation and Unity C# doesn't make much sense, because Unity uses a very different .NET VM - Mono.

Talking about MS .NET, I remember reading that they have a very different JIT for 64-bit code (not sure how accurate that is nowadays). So it may be that this JIT simply has very different performance characteristics and there's not much you can do.

Usually interop shouldn't be a big issue for performance unless you are doing it in an inner loop, which you obviously shouldn't do.

If you haven't seen the blog of the author of SharpDX (a managed DirectX wrapper library), check it out. It has a lot of useful information. One of his benchmarks shows that C++/CLI interop is slower than a custom-generated one: http://xoofx.com/blog/2011/03/14/...marking-cnet-direct3d-11-apis-vs/
Here's an overview of how he does it (the calli IL instruction): http://xoofx.com/blog/2010/10/19/...d-netc-direct3d-11-api-generated/

Edited by Mārtiņš Možeiko on
mmozeiko
Comparing MS .NET documentation and Unity C# doesn't make much sense, because Unity uses a very different .NET VM - Mono.

Sorry, I didn't provide much in the way of detail there. That's not from the documentation. I actually profiled the interop cost in the engine myself. I know it's apples to oranges, but I wanted to test a few different interop scenarios to see what the range of performance looks like. It's a sanity check. It's somewhat safe to assume the engine is on the faster side of things in terms of interop because it's a fundamental part of how the engine is used. If the engine is orders of magnitude different it might be because they're doing crazy things to optimize for it. But if it's about the same as what I'm seeing in my own code it's fairly safe to assume I'm in the right ballpark.


mmozeiko
So it may be that this JIT simply has very different performance characteristics and there's not much you can do.

Hmm. Would the interop code be JITed on the managed side or compiled AOT on the native side? I'll put some real code in the managed function and see what it looks like when stepping through the disassembly. I suppose I could also NGEN the plugin to see how things change.


mmozeiko
Usually interop shouldn't be a big issue for performance unless you are doing it in an inner loop, which you obviously shouldn't do.

Yea, I'm not calling it in a tight loop. But obviously it's important that I'm aware it exists, have a reasonable understanding of how much it costs, and know if it's performing correctly.


I'll check out that blog. Thanks for the link.

Edited by Adam Byrd on Reason: words are hard
Have you tried the [SuppressUnmanagedCodeSecurity] attribute on your calls to C++?

Usually I try to get chunks of data at a time when I call down to C++. So instead of something like float x = GetNoise(x,y), it's float[] x = GetNoise(x,y,stepsize,count).
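On the C++ side, that kind of batched call ends up looking roughly like this (GetNoiseRange and its signature are made up for illustration):

// One interop transition fills a whole buffer instead of paying the
// transition cost per sample. Illustrative names only.
float GetNoise(float x, float y);  // existing per-sample function (assumed)

extern "C" __declspec(dllexport)
void GetNoiseRange(float* out, float x, float y, float stepsize, int count)
{
    for (int i = 0; i < count; ++i)
        out[i] = GetNoise(x + i * stepsize, y);  // the loop stays on the native side
}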

But of course that isn't always workable.


Edited by Jack Mott on
I'm going the other way 'round. C++ to a .NET assembly. I'm not sure the code security would account for the x86/x64 difference.
Thanks to a helpful tip on SO, I tested interop using a delegate instead of calling a function directly. Long story short, it takes almost half as long and the hit for going to x64 is much smaller (no more searching). Sweet.
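In case it's useful to anyone else, one way to wire up the delegate version is a small C++/CLI shim roughly like this (a minimal sketch with placeholder names, not my exact code):

#include <vcclr.h>

using namespace System::Runtime::InteropServices;

// Stand-in for the real managed plugin entry point.
ref class Plugin
{
public:
    static void ManagedUpdate() { /* plugin work */ }
};

public delegate void UpdateDelegate();
typedef void (__stdcall *UpdateFn)();

// Keeps the delegate alive so the GC doesn't collect it while the native
// host still holds the function pointer.
static gcroot<UpdateDelegate^> s_keepAlive;

extern "C" __declspec(dllexport) UpdateFn GetManagedUpdatePointer()
{
    UpdateDelegate^ d = gcnew UpdateDelegate(&Plugin::ManagedUpdate);
    s_keepAlive = d;
    return reinterpret_cast<UpdateFn>(
        Marshal::GetFunctionPointerForDelegate(d).ToPointer());
}

The host fetches the pointer once when the plugin loads and then calls through it like any other function pointer.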