|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Announcements
Chapters
Services
Feature Zones
|
IntroductionThe .NET Just-In-Time Compiler (JIT) is considered by many to be one of the primary performance advantages of the CLR in comparison to the JVM and other managed environments that use just-in-time-compiled byte-code. It is this advantage that makes the JIT internals so secret and the JIT optimizations so mysterious. It is also the reason why the SSCLI (Rotor) contains only a naive fast-JIT implementation, which performs only the minimal optimizations while translating each IL instruction to its corresponding machine language instruction sequence. During my research for the .NET Performance course I wrote for Sela, I found a couple of sources regarding the known JIT optimizations. The following is a non-exhaustive list of those optimizations, followed by a detailed discussion of interface method dispatching and the JIT in-lining techniques employed thereof. Before we start, there are a couple of points worth mentioning. First of all, it's pretty obvious that in order to see JIT optimizations, you must look at the assembly code emitted by the JIT in run-time for the release build of your application. However, there is a minor caveat: if you try to look at the assembly code from within a Visual Studio debug session, you will not see optimized code. This is due to the fact that JIT optimizations are disabled by default when the process is launched under the debugger (for convenience reasons). Therefore, in order to see JIT-optimized code, you must attach Visual Studio to a running process, or run the process from within CorDbg with the Another important point to mention is that most of this article is based on the x86 JIT that comes with the .NET 2.0 CLR, although we will also take a look at the x64 JIT's behavior. Range-check eliminationWhen accessing an array in a loop, and the loop's termination condition relies upon the length of the array, the array bounds access check can be eliminated. Consider the following code. Do you see anything wrong with it? private static int[] _array = new int[10];
private static volatile int _i;
static void Main(string[] args)
{
for (int i = 0; i < _array.Length; ++i)
_i += _array[i];
}
Here is the generated 32-bit code: 00BE00A5 xor edx,edx ; edx = 0 (i)
00BE00A7 mov eax,dword ptr ds:[02491EC4h] ; eax = numbers
00BE00AC cmp edx,dword ptr [eax+4] ; edx >= Length?
00BE00AF jae 00BE00C2 ; exception!
00BE00B1 mov eax,dword ptr [eax+edx*4+8] ; eax = numbers[i]
00BE00B5 add dword ptr ds:[0A22FC8h],eax ; _i += eax
00BE00BB inc edx ; ++edx (i)
00BE00BC cmp edx,0Ah ; edx < 100?
00BE00BF jl 00BE00A7
The first line is the preamble, the lines in the middle are the loop body, and the three lines at the end are the code to increment the counter and test for the loop termination condition. The third line in the above listing (00BE00AC) performs a range check – it ensures that the Where is the range check elimination, then? The reason for this behavior is the simple fact that the array reference itself is static in this scenario. Working with a static reference causes the above code to be generated (also note how in 00BE00A7 we actually fetch the array reference into the register with each iteration of the loop). This behavior can be eliminated by making a very simple modification to the above program: private static int[] _array = new int[10];
private static volatile int _i;
static void Main(string[] args)
{
int[] localRef = _array;
for (int i = 0; i < localRef.Length; ++i)
{
_i += localRef[i];
}
}
Here is the generated 32-bit code: 009D00D2 mov ecx,dword ptr ds:[1C21EC4h] ; ecx = numbers
009D00D8 xor edx,edx ; edx = 0 (i)
009D00DA mov esi,dword ptr [ecx+4] ; esi = numbers.Length
009D00DD test esi,esi ; esi == 0?
009D00DF jle 009D00F0
009D00E1 mov eax,dword ptr [ecx+edx*4+8] ; loop proper
009D00E5 add dword ptr ds:[922FC8h],eax ; _i += numbers[i]
009D00EB inc edx ; ++edx (i)
009D00EC cmp esi,edx ; edx > numbers.Length?
009D00EE jg 009D00E1
Note how the range check has been eliminated, and now the loop termination condition is the only test performed in this code. Note: this discussion mirrors Greg Young's post on the subject. It's also worth noting that it is very easy to break this optimization by using something other than for (int i = 0; i < localRef.Length * 2; ++i)
{
_i += localRef[i / 2];
}
Method In-liningA short and simple method that is frequently used can be in-lined into the calling code. Currently, the JIT is documented (perhaps "blogged about" would be more precise here) to inline methods that are less than 32 bytes in length, do not contain any complex branching logic, and do not contain any exception handling-related mechanisms. See David Notario's blog for some additional information on this topic (note that it is not entirely relevant for CLR 2.0). Given the following code fragment, let's examine what happens when the JIT doesn't optimize and inline the method call, and what happens when it does. public class Util
{
public static int And(int first, int second)
{
return first & second;
}
[DllImport("kernel32")]
public static extern void DebugBreak();
}
class Program
{
private static volatile int _i;
static void Main(string[] args)
{
Util.DebugBreak();
_i = Util.And(5, 4);
}
}
Note the use of the kernel32.dll When stepping into the disassembly with JIT optimizations disabled (for example, when the process is started from within a Visual Studio debugging session), the following code is emitted for the method call site: 00000013 mov edx,4
00000018 mov ecx,5
0000001d call dword ptr ds:[00A2967Ch] ; this is Util.And
00000023 mov esi,eax
00000025 mov dword ptr ds:[00A28A6Ch],esi ; this is _i
If we keep stepping into the code at 00A2967C, we find the method itself: 00000000 push edi
00000001 push esi
00000002 mov edi,ecx
00000004 mov esi,edx
00000006 cmp dword ptr ds:[00A28864h],0
0000000d je 00000014
0000000f call 794F1116
00000014 mov eax,edi
00000016 and eax,esi
00000018 pop esi
00000019 pop edi
0000001a ret
Note that there is no optimization or in-lining here: the parameters are passed to the method in the Now, let's look at the in-lined call, generated when I attach to the process after the debug breakpoint is issued inside it. This is what the method call site looks like, this time: 00B90075 mov dword ptr ds:[0A22FD0h],4 ; _i = 4
00B9007F ret
The result of However, it is seemingly impossible to inline a virtual method call. This, of course, is due to the fact that the actual method to be called is unknown at compile-time and can change between method invocations. For example, consider the following code: class A
{
public virtual void Foo() { }
}
class B : A
{
public override void Foo() { }
}
class Program
{
static void Method(A a)
{
a.Foo();
}
static void Main(string[] args)
{
for (int i = 0; i < 10; ++i)
{
A a = (i % 2 == 0) ? new A() : new B();
Method(a);
}
}
}
When the JIT compiles the Therefore, a virtual call must go through the actual object's method table. (If you need some refreshment on this, have a brief skim through this MSDN Magazine article.) Going through the method table involves two levels of indirection: using the object header to reach into the method table, and using a compile-time known offset into the method table to determine the method to call. In the preceding example, the following code will be emitted at the call site (in the 007E0076 xor esi,esi ; esi = 0 (i)
007E0078 jmp 007E00AA ; jump to compare condition
007E007A mov eax,esi ; i % 2 == 0?
007E007C and eax,80000001h
007E0081 jns 007E0088
007E0083 dec eax
007E0084 or eax,0FFFFFFFEh
007E0087 inc eax
007E0088 test eax,eax
007E008A je 007E0098
007E008C mov ecx,2B3180h
007E0091 call 002A201C
007E0096 jmp 007E00A2
007E0098 mov ecx,2B3100h
007E009D call 002A201C
007E00A2 mov ecx,eax ; ecx = a
007E00A4 mov eax,dword ptr [ecx] ; eax = a's method table
007E00A6 call dword ptr [eax+38h] ; call through offset 0x38
007E00A9 inc esi
007E00AA cmp esi,0Ah
007E00AD jl 007E007A
007E00AF pop esi
007E00B0 ret
As noted above, the virtual call is dispatched in two steps: first, reaching into the object's method table ( Note that theoretically, the class A
{
public virtual void Foo() { }
}
class B : A
{
public override sealed void Foo() { }
}
class C : B
{
}
class Program
{
static void Method(B b)
{
b.Foo();
}
static void Main(string[] args)
{
for (int i = 0; i < 10; ++i)
{
B b = (i % 2 == 0) ? new B() : new C();
Method(b);
}
}
}
It is obvious that the call target 005000A2 mov ecx,eax
005000A4 mov eax,dword ptr [ecx]
005000A6 call dword ptr [eax+38h]
Note that using the To fully understand what's going on in the background, it's worth your time to learn that IL has two instructions used for dispatching method calls: class A
{
public void Foo() { } // Foo doesn't use "this"
}
class Program
{
static void Main(string[] args)
{
for (int i = 0; i < 10; ++i)
{
A a = (i % 2 == 0) ? new A() : null;
a.Foo();
}
}
}
Here's the IL that is emitted for this scenario (note the .method private hidebysig static void Main(string[] args) cil managed
{
.entrypoint
.maxstack 2
.locals init (
[0] int32 num1,
[1] CallAndCallVirt.A a1)
L_0000: ldc.i4.0
L_0001: stloc.0
L_0002: br.s L_001c
L_0004: ldloc.0
L_0005: ldc.i4.2
L_0006: rem
L_0007: brfalse.s L_000c
L_0009: ldnull
L_000a: br.s L_0011
L_000c: newobj instance void CallAndCallVirt.A::.ctor()
L_0011: stloc.1
L_0012: ldloc.1
L_0013: callvirt instance void CallAndCallVirt.A::Foo()
L_0018: ldloc.0
L_0019: ldc.i4.1
L_001a: add
L_001b: stloc.0
L_001c: ldloc.0
L_001d: ldc.i4.s 10
L_001f: blt.s L_0004
L_0021: ret
}
And, here's the assembly that is emitted for this scenario: 00AD0076 xor esi,esi
00AD0078 jmp 00AD009F
00AD007A xor edx,edx ; represents the "a" local variable
00AD007C mov eax,esi
00AD007E and eax,80000001h
00AD0083 jns 00AD008A
00AD0085 dec eax
00AD0086 or eax,0FFFFFFFEh
00AD0089 inc eax
00AD008A test eax,eax
00AD008C jne 00AD009A
00AD008E mov ecx,0A230E0h
00AD0093 call 00A1201C ; constructor call (in the branch)
00AD0098 mov edx,eax
00AD009A mov eax,edx
00AD009C cmp dword ptr [eax],eax ; null reference check
00AD009E inc esi ; move on, the method isn't really called
00AD009F cmp esi,0Ah
00AD00A2 jl 00AD007A
00AD00A4 pop esi
00AD00A5 ret
So, is the JIT capable of in-lining such a method? Yes! Note how the JIT completely eliminates the call itself (the method is empty and, therefore, in-lining it means simply optimizing it away), but it can't eliminate the null reference check, because that was the intent of the If the IL in L_0003 were to use the For the sake of completeness, it is worth noting that sometimes calling a virtual method involves the class Employee
{
public override string ToString()
{
return base.ToString();
}
}
The .method public hidebysig virtual instance string ToString() cil managed
{
.maxstack 8
L_0000: ldarg.0
L_0001: call instance string object::ToString()
L_0006: ret
}
If the instruction emitted in this case were the Interface method dispatch is not very different in essence from virtual method dispatch. It is not very different in the sense that a level of indirection must be added and therefore the actual implementation to call cannot be determined at the call site. For example, consider the following code: interface IA
{
void Foo();
}
class A : IA
{
public void Foo()
{
}
}
class B : IA
{
void IA.Foo()
{
}
}
class Program
{
[MethodImpl(MethodImplOptions.NoInlining)]
static void Method(IA a)
{
a.Foo();
}
static void Main(string[] args)
{
for (int i = 0; i < 10; ++i)
{
IA a = (i % 2 == 0) ? (IA)new A() : (IA)new B();
Method(a);
}
}
}
The trivial method dispatch technique, in this case, would be as follows:
Therefore, the approximate assembly code that should be generated for the actual interface method call site (inside the mov ecx, edi ; ecx holds "this"
mov eax, dword ptr [ecx] ; eax holds method table
mov eax, dword ptr [eax+0Ch] ; eax holds interface method table
mov eax, dword ptr [eax+30h] ; eax holds method pointer
call dword ptr [eax]
This is described in this MSDN Magazine article and other sources, but is not the case in real life. Even the debug (non-optimized) version of the interface dispatch doesn't look like this. I will examine what the code does look like, but for now, it will suffice to say that the naïve approach probably had dire performance characteristics and was therefore replaced with something different. By the way, note that interface methods are always implicitly marked as .class private auto ansi beforefieldinit A extends object implements NaiveMethodDispatch.IA
.method public hidebysig newslot virtual final instance void Foo() cil managed
.class private auto ansi beforefieldinit B extends object implements NaiveMethodDispatch.IA
.method private hidebysig newslot virtual final instance void
NaiveMethodDispatch.IA.Foo() cil managed
.override NaiveMethodDispatch.IA::Foo
If the interface implementation is explicit, or if you didn't specify the However, I would not even consider writing this article if everything were as simple as presented so far. The trigger for this article has actually been Ayende's post on a vaguely related topic, where he mentioned that the JIT is capable of in-lining interface method calls. A mild theoretical discussion over e-mail has resulted in some preliminary research, which I present below. Before diving in, the conclusion of the following discussion can be summarized as follows: it is theoretically impossible to perfectly inline virtual method calls (interface method calls fall in this category as well); the JIT does not inline interface method calls; instead, the JIT performs an optimization that does not use the naïve interface method dispatch outlined above. Before we follow what the actual CLR 2.0 JIT does at interface method call sites, it is probably wise to consider any optimization that could theoretically be performed on such calls. The most obvious are the following two: Flow AnalysisFlow analysis can determine that a particular static type is used for the interface method call and therefore can be directly dispatched through the type instead of using the technique described above. IA a = new A();
a.Foo();
It is fairly clear that Frequency AnalysisOne of the interface implementations is called considerably more frequently than others, at a given call site. This can be determined by dynamically profiling and patching the code, or by some kind of hint from the programmer. (See this and Overview of the IBM Java Just-in-Time Compiler for more information on the topic, in general, and its JVM implementation, in particular.) In this case, the interface method dispatch for that specific interface can be modified to direct dispatch or even in-lined, as follows: if (obj->MethodTable == expectedMethodTable) {
// Inlined code of "hot" method
}
else {
// Normal interface method dispatch code
}
The JIT's ApproachThe JIT is not documented (or "blogged about") to perform either of those optimizations. However, poking through the JIT-emitted code reveals that the case is not as simple as it appears to be. Let's look at the call site for the following code and attempt to analyze what happens in the method dispatching process. I've given the interface and the methods meaningful names and content so that we could look into the example more easily. interface IOperation
{
int Do(int a, int b);
}
class And : IOperation
{
public int Do(int a, int b) { return a & b; }
}
class Or : IOperation
{
public int Do(int a, int b) { return a | b; }
}
class Program
{
static volatile int _i;
static void Main()
{
for (int i = 0; i < 10; ++i)
{
IOperation op = (i % 2 == 0) ?
(IOperation)new And() : (IOperation)new Or();
_i = op.Do(5, 3);
}
}
}
In this case, we have a single call site for Let's first have a look at the call site itself, which is the 007E00A4 push 3
007E00A6 mov edx,5 ; setup parameters
007E00AB call dword ptr ds:[2C0010h] ; the actual call
007E00B1 mov dword ptr ds:[002B2FE8h],eax ; save return value
Note that this is not the interface method dispatch pattern we were supposed to see here. In its stead, we get an indirect call through the address 002C0010. You should remember this address because we will mention it in the following discussion. Stepping further inside, we see: 002C6012 push eax
002C6013 push 30000h
002C6018 jmp 79EE9E4F
The actual implementation has not been compiled yet, and therefore, we find ourselves tracing through the code of the JIT-compiler. Eventually, we are redirected (through a complex series of jumps) to the actual interface method's code. Issuing the However, after the loop has run three times, the code that 002C0010 points to (recall this is where our original call site is going through) is back-patched to an optimized version. This optimized version follows: 002D7012 cmp dword ptr [ecx],2C3210h
002D7018 jne 002DA011
002D701E jmp 007E00D0
Recall that a .NET object header starts with the object's method table, and there we are at the trivial profiling optimization: if the method table is the expected one (i.e., the "common" or "hot" implementation), we can perform a direct jump to that code. (Note that its actual address is embedded within the instruction, because the JIT has back-patched this code with the complete and intimate knowledge of the method's location. This means no extra memory access is required to dispatch this call.) Otherwise, we must go through the usual dispatching layer, which we will see in a moment. For completeness, here is the code at 007E00D0 (which is the "expected" result of the 007E00D0 and edx,dword ptr [esp+4]
007E00D4 mov eax,edx
007E00D6 ret 4
This is simply the 002DA011 sub dword ptr ds:[14D3D0h],1
002DA018 jl 002DA056
002DA01A push eax
002DA01B mov eax,dword ptr [ecx]
002DA01D push edx
002DA01E mov edx,eax
002DA020 shr eax,0Ch
002DA023 add eax,edx
002DA025 xor eax,3984h
002DA02A and eax,3FFCh
002DA02F mov eax,dword ptr [eax+151A6Ch]
002DA035 cmp edx,dword ptr [eax]
002DA037 jne 002DA04B
002DA039 cmp dword ptr [eax+4],30000h
002DA040 jne 002DA04B
002DA042 mov eax,dword ptr [eax+8]
002DA045 pop edx
002DA046 add esp,4
002DA049 jmp eax
002DA04B pop edx
002DA04C push 30000h
002DA051 jmp 79EED9A8
002DA056 call 79F02065
002DA05B jmp 002DA01A
I've marked the most significant sections so we can focus on them. First, 1 is decremented from a global variable. Its initial value is 0x64 (i.e., 100). We will see its purpose in a moment. If the resulting value is less than 0, we jump to a call to one of the JIT back-patching functions, and continue the flow. What's in that flow? The normal interface method dispatching as performed by the JIT. Note that, eventually, there's a 007E00F0 or edx,dword ptr [esp+4]
007E00F4 mov eax,edx
007E00F6 ret 4
This is clearly the 0046A01A push eax
0046A01B mov eax,dword ptr [ecx]
0046A01D push edx
0046A01E mov edx,eax
0046A020 shr eax,0Ch
0046A023 add eax,edx
0046A025 xor eax,3984h
0046A02A and eax,3FFCh
0046A02F mov eax,dword ptr [eax+151A74h]
0046A035 cmp edx,dword ptr [eax]
0046A037 jne 0046A04B
0046A039 cmp dword ptr [eax+4],30000h
0046A040 jne 0046A04B
0046A042 mov eax,dword ptr [eax+8]
0046A045 pop edx
0046A046 add esp,4
0046A049 jmp eax
0046A04B pop edx
0046A04C push 30000h
0046A051 jmp 79EED9A8
0046A056 call 79F02065
0046A05B jmp 0046A01A
Again, I've emphasized the important parts. Without getting into much detail, the purpose of this code is to check whether the current object's type matches the last object's type (note that unlike the previous snippet, the check here is not against a literal constant address – instead, it is calculated here using the object pointer itself). If there's a match, the location to jump to is calculated, and the Note that this code is less efficient than the state we had before the counter decremented itself below 0. Back then, we had a direct jump to a literal constant address. Now, we have a calculation of the jump address based on the object pointer itself. Also, note that, this time there is no counter – every time there is a miss, the meaning of this code will change. Summarized in pseudo-code, this looks somewhat like the following: start: if (obj->Type == expectedType) {
// Jump to the expected implementation
}
else {
expectedType = obj->Type;
goto start;
}
This is the final behavior that we get for the entire course of the program. It means that per call site, we get a one-shot counter (initialized to 100) which counts the times the "hot" implementation was missed. After the counter decays below 0, JIT back-patching is triggered, and the code is replaced with the version we just saw, that alternates the "hot" implementation with each miss. Note that the 00C20010 stub is generated for each call site that attempts to dispatch an interface call. This means that the counter data and related code are per-call-site optimization, which can be highly valuable in certain scenarios. Testing the above hypothesis is fairly trivial by writing a test program that performs interface method dispatching. In one mode, the program will call the first implementation in a loop and then call the second implementation in a loop; in another mode, the program will interleave every call to the first implementation with a call to the second implementation. Granted the behavior of the final dispatching code that we just saw, it is reasonable to expect that the first test case will perform better than the second one. This was tested using the following program: interface IOperation
{
int Do(int a, int b);
}
class And : IOperation
{
public int Do(int a, int b) { return a & b; }
}
class Or : IOperation
{
public int Do(int a, int b) { return a | b; }
}
class Program
{
[MethodImpl(MethodImplOptions.NoInlining)]
static void Method(IOperation op)
{
_i = op.Do(5, 3);
}
static readonly int NUM_ITERS = 100000000;
static readonly int HALF_ITERS = NUM_ITERS / 2;
static volatile int _i;
static void Main()
{
IOperation and = new And();
IOperation or = new Or();
Stopwatch sw = Stopwatch.StartNew();
for (int i = 0; i < HALF_ITERS; ++i)
{
Method(and);
Method(and);
}
for (int i = 0; i < HALF_ITERS; ++i)
{
Method(or);
Method(or);
}
Console.WriteLine("Sequential: {0} ms", sw.ElapsedMilliseconds);
sw.Reset();
sw.Start();
for (int i = 0; i < HALF_ITERS; ++i)
{
Method(and);
Method(or);
}
for (int i = 0; i < HALF_ITERS; ++i)
{
Method(and);
Method(or);
}
Console.WriteLine("Interleaved: {0} ms", sw.ElapsedMilliseconds);
}
}
The test results on my laptop averaged to 2775ms for the sequential case, and 2960ms for the round-robin (interleaved) case. The difference was consistent, but it's obviously not very noticeable. Therefore, I conclude for now that the two usage patterns have little (if any) effect on the program's performance, especially if the methods are more "chunky" than just a single x86 instruction. Other TidbitsIn the interests of completeness, the 64-bit JIT produces the same code for the sealed dispatch scenario as the 32-bit one. This seems clearly like a design decision. If you're very interested in what a 64-bit virtual method dispatch looks like, here it is: 00000642`8015047d 488b03 mov rax, qword ptr [rbx]
00000642`8015048b 488bcb mov rcx, rbx
00000642`8015048e ff5060 call qword ptr [rax+60h]
So again, we're calling through a method table even though the static and dynamic types are known in advance. ( History
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||