Alois Kraus

blog

  Home  |   Contact  |   Syndication    |   Login
  106 Posts | 8 Stories | 293 Comments | 162 Trackbacks

News



Article Categories

Archives

Post Categories

Image Galleries

Programming

On my home machine VS2013 starts up in 2s from my SSD. But on another Windows 7 machine it takes over 13s. I never bothered really since I thought that some plugins and connecting to TFS simply takes its time. It stayed and itch until I installed Resharper 8 were things noticeably got slower. No big deal. Suspend Resharper and it will play dead. It works except that the startup time of VS still looked bad. Time for another profiling session with WPT. The approach is as simple as starting up WPRUI

image

  1. Press Start.
  2. Start Visual Studio.
  3. Wait until it is has started.
  4. Press "Save" button to stop profiling.
  5. Enter a file name to which the data should be saved
  6. Be patient while data is saved.
  7. Press "Open in WPA" button to analyze the data.

When I look at the CPU consumption until the VS 2013 is usable it takes 13,1s to start up while consuming 9,6s of CPU time. The big gap in the middle looks like it needs to connect to the network. Lets ignore that for now.

image

To see what the CPU was doing I need sample profiling (CPU Usage Sampled). Many people are confused by the fact that WPT shows you two CPU usage graphs with nearly the same content. The graphs (should) look the same but the underlying data to get the graphs is fundamentally different.

CPU Usage Sampled

This graph comes to existence by parsing CPU sampling ETW events. When sample profiling is enabled the Windows Kernel will log the call stacks of all running threads every 1ms. The sample rate is configurable from seconds between a sample and up to 8kHz (every 0,125ms see xperf -SetProfInt for more infos). Sample profiling gives you an accurate view what your application was doing with the CPU. When you look into the call stacks you should have at least 10K of CPU sample events in your view selected. Then you can check the call stacks in your application down to stacks with at least 100 Samples which represents an error of 1% which is usually enough. If you want to optimize a method which shows up in the sample profiling with only 10 samples you are chasing ghosts.

CPU Usage Precise

Although the graphs look identical you have a lot more columns (54!) to choose from. This time you are not looking at sampled call stacks but at a parsed view of Context Switch ETW events. A context switch occurs when your thread stops or starts running on a CPU. A thread is switched in (NewThreadxxxx columns) and another thread stopped running (Oldxxxxxx columns) while a third thread has enabled your thread to run by releasing a lock, signaling an event, timer expiration and a few other reasons. The thread that allowed your thread to run is called Readying Thread/Process/Stack columns. This very fine granular view allows you to find out e.g. why your thread was sitting in ReadFile for 5s to read a 1KB file.

A more thorough explanation of context switch handling can be found in the WPA documentation for CPU Analysis. To make sense of the many columns you should memorize this picture which shows all important times of a context switch. The picture below is shamelessly copied from the WPA documentation.

Figure 10 CPU Usage Precise Diagram

The most important thing to remember is that with CPU Usage Precise you can find out

Now back to the original question. We are after a CPU consumption issue. What do we see in CPU Usage (Precise)? Only context switches. A context switch occurs when your thread stops running because it is waiting for something (WaitForSingleObject) or the thread quantum was used up. It is 15ms by default but it may be more. On my Win 8.1 x64 box it is actually 32ms. The reason why a context switch event is logged that your thread is waiting (not CPU bound) for something or your thread has already executed code for 15-32ms and has used up its allotted time slice. In effect you will see in the stacks of CPU Usage Precise all of your calls that did block for some reason but you will almost never see a high CPU consuming operation.

How do I know that? Because I have tested it with a small CPU consuming application:

class Program
{
    [DllImport("kernel32.dll")]
    extern static int GetCurrentThreadId();

    static void Main(string[] args)
    {
        Action acc = () =>
        {
            Process proc = Process.GetCurrentProcess();
            var  currentThread = proc.Threads.Cast<ProcessThread>().Where(t => t.Id == GetCurrentThreadId()).Single();
            currentThread.ProcessorAffinity = (IntPtr)0x02; // Force thread affinity on core 2

            while (true)
            {
            }
        };

        Parallel.Invoke(Enumerable.Repeat(acc, 2).ToArray());  // Execute two concurrent threads and force execution on one CPU
    }
}

This little app does nothing but spawning two threads which run on the same CPU to give the thread scheduler some work. Before that it was a simple while(true); loop in the Main method which did consume CPU but after it was running it never caused and context switches for seconds. The first observation was that a context switch is only performed if another thread must run on this CPU. If not there is no reason to bring it to a halt it can run happily forever. Internally the thread dispatcher still checks if the thread quantum did expire but since no other thread is scheduled for this CPU its quantum is renewed without a context switch. It turns out that you need to cause actual contention on a CPU to make the thread dispatcher do some actual work. The very busy while loops show up in the profiler like this:

image

There we see that on my Win 8.1 machine when two threads compete for the same CPU

  • Every  thread is run again after 32ms (TimeSinceLast).
  • It runs for 32ms (NewInSwitchTime).
  • While it did wait for the CPU for 32ms (Ready).
  • It did spend no time in some WaitForSingleObject synchronization primitive (Wait).
  • The previous thread is put to sleep because it has used up its thread quantum (OldWaitReason with value WrQuantumEnd).
  • During the selected time of 1,62s 54 context switch events did happen which is nearly optimal 1,62/0,032=50 context switches with 32ms time slices (Count).
  • The CPU was maxed out. 12,5% is one core on my 8 core machine (% CPU Usage).

With context switch events you can find out pretty much anything you ever wanted to know why your thread was not running. Thread Scheduling is explained in some free EBooks by a Windows Kernel Developer named David Probert. He was written quite a lot about the internal mechanics of a context switch. If you want to know more about the possible values of OldWaitReason you will find some kernel defines but no explanation. The best explanation for some values I have found in the .NET documentation for the ThreadWaitReason enum. In reality you will find for OldWaitReason almost always the value Executive which means that there thread was waiting inside a WaitForSingleObject and has given up its current quantum.

Below is a picture of a normal Visual Studio startup which prints the time gaps in us of the main thread when it had to wait wait for something. With wait I mean waiting for the CPU to become ready to run the thread (high Ready times) or it was waiting on something to complete or simply the next window message. As you can see we have a pretty busy message pump which has 3402 context switches where the total time gap was 450ms while no code did run on the main UI thread. Since there are so many of these small gaps it makes not much sense to search for useful work on the UI thread there. Instead you should check out single event where e.g. for 145ms nothing had happened. Perhaps there is room for improvement there. I did not see TimeSinceLast used often in in performance analysis sessions so far but it is definitely useful to check out where the largest execution gaps of a supposed to be busy thread are lurking because it is basically the sum of wait times and ready times which catches both thread scheduling issues and wait times for the completion of async operations.

image

After showing off so many details of CPU Usage (Precise) I need to come back (again) to our CPU consumption issue.  With CPU Usage Precise we would see at most one call stack every 15ms-30ms because the thread was switched out because the thread quantum expired. This is usually a lousy sample rate which is polluted by the wait call stacks. In theory it should be possible to filter for WrQuantumEnd switch out reasons to get the same data over a longer period of time like with sampling. In practice this does not work since the the VS Studio startup example not a single thread ever is switched out because its quantum has expired. The table below shows a typical thread switch out distribution. Executive means that the thread did enter a WaitForSingleObject which is nearly the only reason why a thread is switched out.

OldWaitReason Count
Executive 4377
WrQueue 100
WrDispatchInt 26
UserRequest 10
WrYieldExecution 8
WrLpcReceive 8
WrUserRequest 2
WrLpcReply 1
WrResource 1

After all of the cool context switch stuff we must downgrade to the second CPU view named CPU Usage (Sampled) since this is still the only view where we can reliable track down the call stacks which consume most CPU.

image

I ignore threads for now to view an accumulated call stack of CPU usage across all threads in Visual Studio. There it is immediately obvious that Resharper is using a lot of CPU while VS is starting up although it is suspended. I am not sure why in version 8 of their product they still think that startup time is not important and they seem to fully initialize their stuff on the main UI thread. To get a more complete picture I can select an expensive method and group it by function to see who is calling it and how much CPU is consumed down the call chain.

image

Interesting: Resharper has its own brand of MEF embedded which does the initialization. They also seem to use concepts of TPL DataFlow to trigger their components from multiple threads. That certainly looks sophisticated but the CPU graph shows that very little multi threading is happening during startup. On my 8 core machine very little action goes above single core usage. I do not know why they need to initialize so much of their stuff on the UI thread only to add some menus to Visual Studio while being suspended.

Looks like we have a suspect. In case of doubt lets uninstall Resharper 8 and check again. 

image

The CPU consumption has dropped from 9,6s down to 2,8s. Startup time has dropped to 8,2s.

Time in s With Resharper 8 Without Resharper Difference
CPU Consumption 9,6 2,8 6,8
Startup Time 13,1 8,2 4,9

We have huge savings in terms of CPU consumption here which almost 1:1 translates to reduced startup time. The Dataflow stuff of Resharper is able during the startup of VS to move 14% of the CPU work onto background threads which helps a little bit but still the amount of consumed CPU is enormous for a disabled plugin. To be fair on my Windows 8.1 box with an SSD the startup time of VS 2013 with Resharper is only 3,1s vs 1,9s without it. On good dev machines this is not a big deal but on slower machines a disabled Resharper is quite noticeable. That's it for the CPU intensive stuff. Now why lets check out the flat lines in CPU consumption.

What is causing such long delays during the startup of VS? It looks like the main thread is blocked for a long time. After reading so far your natural reaction must be: Open CPU Usage Precise and check out the Wait and Ready times! In this case the UI thread was blocked by something not very complex:

image

While the visual studio is starting up it tries to initialize Direct3D as part of a WPF application. Then Nvidia kicks in (nvd3dum.dll) which calls into some nvSCPAPI.dll which seems to poll for something over 5,4s. I have no idea why this is happening but this feels very wrong. The machine in question has some NVidia Quadro card installed and only NVidia knows why on some machines (not all machines exhibit this behavior) support for stereoscopic gaming is tried to enable. Since NVidia does not release pdbs to their drivers I cannot dig any deeper here but this is a major issue which hits all WPF enabled applications running on NVidia boards under still unknown circumstances. To prove that that the nvSCPAPI.dll was to blame I simply renamed the folder where the dll is located

C:\Program Files (x86)\NVIDIA Corporation\3D Vision

to something different. Now the startup time is down to 2,6s which feels really good now. I kept the renamed folder since I do not have the need for stereoscopic gaming.

image

It is nice to see that WPT can help in these simple cases to find the root issues within minutes. I hope that the Jetbrains can make their good product still better by reducing the startup time more when it is suspended and that NVidia fix their drivers to prevent searching in vain for not installed hardware on some machines. When Nvidia fixes this issue it could help to convince some people that it is not WPF and .NET which is slow but that after their graphics card driver update some apps start much faster than before.

posted on Wednesday, April 30, 2014 11:29 AM