Why does the performance differ so greatly with seemingly the same code?

Why does performance differ when the only difference is in code paths that are not executed?

  • Test1 below is consistently 10% faster than Test2, although I'm always calling the method with a 0 argument, so the stuff that's inside the switch case - the only difference - is never being executed. Btw after copy-and-pasting the code into a brand new project, only changing the test function's name to Main, the results are reversed. Every time I run that project, Test2 is faster. So what are the factors involved in making this slower and faster? And: Can I influence performance deliberately in .net to my benefit? These methods of course are doing almost nothing, so a 10% performance difference for a test that involves mainly identical virtual method calls (which, afaik the CLR is not capable of inlining - chapeau to Sun btw for implementing that in the JVM) seems huge. N.B. This is in fact a minimal version of a real program, where multiple nested switch statements result in a huge performance difference of not just 10%, but 100% and more, apparently by the mere existence of code inside the nested switch branches which the tests never enter. (So maybe this minimal version omits some other aspects of the real program that might be involved, but it does replicate a significant and consistent performance difference) Edit In the real program, what may matter more than the phenomenon in this question, is whether the actual case statements in the switch statement lend themselves to implementation via https://en.wikipedia.org/wiki/Branch_table or not - (which depends on the gaps in the case values - something I could verify this by looking at the generated IL code) Test run on .net 4.5.1 Release build, run via Ctrl-F5 Intel i7 CPU The code: using System; using System.Diagnostics; class Test1 : ITest { public int Test(int a) { switch (a) { case 1: return a + a + a == 1234 ? 1 : 2; case 2: return 2; } return 0; } } class Test2 : ITest { public int Test(int a) { switch (a) { case 1: return 1; case 2: return 2; } return 0; } } class Program { static void Main(string[] args) { const long iterations = 200000000; var test1 = new Test1(); var test2 = new Test2(); while (true) { var sw1 = Stopwatch.StartNew(); for (long i = 0; i < iterations; i++) test1.Test(0); sw1.Stop(); var sw2 = Stopwatch.StartNew(); for (long i = 0; i < iterations; i++) test2.Test(0); sw2.Stop(); var iterPerUsec1 = iterations / sw1.Elapsed.TotalMilliseconds / 1000; var iterPerUsec2 = iterations / sw2.Elapsed.TotalMilliseconds / 1000; Console.WriteLine("iterations per usec: " + (int) iterPerUsec1 + " / " + (int) iterPerUsec2 + " ratio: " + iterPerUsec1/iterPerUsec2); } } } interface ITest { int Test(int a); } Here's the output of a typical run where Test1 is consistently in fact more than 12% faster: iterations per usec: 369 / 342 ratio: 1.07656329512607 iterations per usec: 367 / 314 ratio: 1.16820632522335 iterations per usec: 372 / 337 ratio: 1.10255744679504 iterations per usec: 374 / 342 ratio: 1.09248387354978 iterations per usec: 367 / 329 ratio: 1.11451205881061 iterations per usec: 375 / 340 ratio: 1.10041698470293 iterations per usec: 373 / 314 ratio: 1.19033461920118 iterations per usec: 366 / 334 ratio: 1.09808424282708 iterations per usec: 372 / 314 ratio: 1.18497411681768 iterations per usec: 377 / 342 ratio: 1.10482425370152 iterations per usec: 380 / 346 ratio: 1.09794853154766 iterations per usec: 385 / 342 ratio: 1.12737583603649 iterations per usec: 376 / 327 ratio: 1.15024393718844 iterations per usec: 374 / 332 ratio: 1.12400483908544 iterations per usec: 383 / 341 ratio: 1.12106159857722 iterations per usec: 380 / 345 ratio: 1.10267634674555 iterations per usec: 375 / 344 ratio: 1.09211401775982 iterations per usec: 384 / 334 ratio: 1.14958454236246 iterations per usec: 368 / 321 ratio: 1.14575850263002 iterations per usec: 378 / 335 ratio: 1.12732301818235 iterations per usec: 380 / 338 ratio: 1.12375853123099 iterations per usec: 386 / 344 ratio: 1.12213818994067 iterations per usec: 385 / 336 ratio: 1.14346447712043 iterations per usec: 374 / 345 ratio: 1.08448615249764 ...

  • Answer:

    Benchmarking is a fine art, it is very difficult to reliably measure very fast code like this. In general, differences of 15% or less are not statistically significant results. I can only comment on the flaw in the posted code, it is a pretty common one. Typical Heisenburgian, it is the test itself that affects the outcome. The second for() loop is not as optimized as the first for() loop. A problem caused by the optimizer's choices on which local variables to store in a CPU register. In particular a problem when you use long in a 32-bit program, it burns up two CPU registers. In all likelihood, the commenters that mentioned not being able to reproduce it ran the test with the x64 jitter. You take the pressure off the CPU register allocator by moving the tests into a separate method: static Stopwatch runTest1(Test1 test1, long iterations) { var sw1 = Stopwatch.StartNew(); for (long i = 0; i < iterations; i++) test1.Test(0); sw1.Stop(); return sw1; } static Stopwatch runTest2(Test2 test2, long iterations) { var sw2 = Stopwatch.StartNew(); for (long i = 0; i < iterations; i++) test2.Test(0); sw2.Stop(); return sw2; } static void Main(string[] args) { const long iterations = 200000000; var test1 = new Test1(); var test2 = new Test2(); while (true) { var sw1 = runTest1(test1, iterations); var sw2 = runTest2(test2, iterations); // etc.. } } And you'll now get what you expect. This change was cued by looking at the generated machine code. Tools > Options > Debugging > General > untick the Suppress JIT Optimization option. Then you can look at the optimized machine code with Debug > Windows > Disassembly. Which shows the first for() loop: for (long i = 0; i < iterations; i++) 00000089 xor ebx,ebx 0000008b xor esi,esi 0000008d mov ecx,dword ptr [esp+4] 00000091 xor edx,edx 00000093 call dword ptr ds:[04732844h] 00000099 add ebx,1 0000009c adc esi,0 0000009f test esi,esi 000000a1 jg 000000AD 000000a3 jl 0000008D 000000a5 cmp ebx,0BEBC200h 000000ab jb 0000008D And the second loop: for (long i = 0; i < iterations; i++) 000000f5 mov dword ptr [esp+2Ch],0 000000fd xor ebx,ebx 000000ff mov ecx,dword ptr [esp] 00000102 xor edx,edx 00000104 call dword ptr ds:[056E28B8h] 0000010a mov eax,ebx 0000010c mov edx,dword ptr [esp+2Ch] ; <== here 00000110 add eax,1 00000113 adc edx,0 00000116 mov dword ptr [esp+2Ch],edx ; <== here 0000011a mov ebx,eax 0000011c cmp dword ptr [esp+2Ch],0 00000121 jg 0000012D 00000123 jl 000000FF 00000125 cmp ebx,0BEBC200h 0000012b jb 000000FF I marked the instructions that made it slower. The first for() loop could store the loop variable in the esi:ebx registers and keep them there. That didn't work in the second for() loop, it ran out of available CPU registers and the top 32-bits of the loop variable had to be stored in the stack frame. That's slow. Making a change in a program like this is not generic advice and not a universal solution. It just happened to be effective in this specific case. Only looking at the machine code can give you a cue that such a change might be useful. Which is about what it takes to hand-tune code. There's probably more to what you actually try to optimize, this benchmark is very unlikely to be representative of your actual code.

Eugene Beresovsky at Stack Overflow Visit the source

Was this solution helpful to you?

Other answers

To investigate I took your code and modified it slightly. My main loop iterates 20 times and I have removed the part where you print the speed of the two test. Instead I store the difference in ticks (sw2.ElapsedTicks - sw1.ElapsedTicks) in a preallocated array. This is just to keep the main loop as simple as possible. At the end of the main loop I average the tick differences. A positive number means that the second test executed slower using more ticks than the first test. I did this test with all permutations of possible test sequences: Test1 followed by Test2, Test2 followed by Test1, Test1 followed by Test1 and Test2 followed by Test2. These are my results with optimizations turned on: Tests | Average -------+--------- 1 2 | 248,747 2 1 | 313,372 1 1 | 234.812 2 2 | 210.533 So no matter how I run the tests the second test in the loop is always slower than the first. My conclusion is that for some reason the first test in the main loop will execute slightly faster. I do not know why and I do not always get consistent results but at least this does not seem to be related to the code in Test1 and Test2.

Martin Liversage

Related Q & A:

Just Added Q & A:

Find solution

For every problem there is a solution! Proved by Solucija.

  • Got an issue and looking for advice?

  • Ask Solucija to search every corner of the Web for help.

  • Get workable solutions and helpful tips in a moment.

Just ask Solucija about an issue you face and immediately get a list of ready solutions, answers and tips from other Internet users. We always provide the most suitable and complete answer to your question at the top, along with a few good alternatives below.