A colleague of mine asked me how he can find out why his laptop does consume 30% CPU in the SYSTEM process after it did wake up from hibernate. That is easy I replied: Open Process Explorer select the SYSTEM process, double click it, select the thread tab, sort by CPU and double click on the first thread which does consume so much CPU to get the call stack.
A few days later I got a mail back that it did not work. Ok could be some elevation problem. I did try it on my machine and sure enough access denied also here. Ups it is not so easy after all. A little googling around did lead to a very good article from Mark Russinovich dealing with the exact same issue. Mark did use Kernrate which seems to work still on Windows 7 but it does look rather unsupported. A few other articles indicated that xperf should also be able to do it. So I did look where to get the latest edition of XPerf. It turns out that it is part of the Windows 8 SDK and the Windows Assessment and Deployment Kit (ADK) für Windows® 8. You do not need to download GBs of stuff. Only select the Windows Performance Toolkit and uncheck all other stuff. That gives you a nice ~50MB download which will install it to
%PROGRAMFILES(X86)%\Windows Kits\8.0\Windows Performance Toolkit
Besides the well known Xperf tool I have written several years ago some new stuff was added. Recording performance with the latest version of the Windows Performance Toolkit has never been so easy. There is a tool called WPRUI.exe which allows you to record a trace session with a nice GUI. An interesting side note is that this tool is missing in the Windows 8 SDK delivered with Visual Studio 2012. If you download the latest version again you get the GUI tool as well. It does seem to have full support from MS and they did even localize the UI which is rather unusual for such low level tools.
When you select CPU, Disc and File IO you should be able to troubleshoot pretty much anything. The UI is easy to use and is even so nice to set some arcane registry key if you are using Windows 7 x64 to enable full stack walks in kernel drivers which seems to be broken otherwise. After you have saved the trace etl file you should cancel recording or you will be recording GB of additional data.
Now you are ready to load the trace file into the viewer which is called wpa.exe. After a little rearranging the stuff you get a pretty good picture what VS is doing during the first startup which is called cold startup because no files are in the OS file system cache yet. That means much more disc accesses during the first startup compared to a second warm startup of VS. How much more? We will shortly see. Here comes the picture:
Holy crap. What is all this? Well I have cut out most columns. This is already a simplified view ;-). On the left side you can select from the Graph Explorer many diagram types via drag and drop or context menu and add it to the Analysis pane. First I want to see what my CPU was doing by selecting “CPU Usage (precise) Utilization by Process Thread”. Next I do filter only for devenv.exe in each graph. But wait: Did you say filter for process? Yes this tool gives you view of the complete system with all processes at once. It goes so far that you can even watch the OS scheduler doing its work. The call stacks with the official MS symbols can be resolved into the kernel and to all user mode processes. Lets say you call in one thread BeginInvoke and wait for some other event you can see in the CPU graph with the columns Old/New Thread Stacks to which other thread/process the OS scheduler has moved control.
Besides CPU consumption disc IO is the most important metric to closely have a look at. For this I did select the graph “Utilization by Process, IO Type” to see at which times the disc is maxed out. I can see the file names which were access and I can see how much time the OS did spend with the hard disc reading this file. The most important column there is Disc Service Time which gives you the timing at ns granularity. Here we do e.g. see that during the cold startup we did spend 0,355s loading the file System.Xml.ni.dll. When you click at the file you see in the graph at which points of time it was read. The bigger blue bars indicate the first and last read event. In the CPU Graph I did also get a lot of selected rows. Why is this? First I thought it was an error but it was of course a feature.
Since we have a complete view of the system at our hands the tooling is able to follow all call stacks. This means that when I select the accessed file the tool does select all processes and threads that were blocked during one of the accesses to this file! Now I can find out why my mouse cursor does jump when I do start a very disc intensive application. Some years ago this stuff was unmanaged land only. But the latest version of it is also able to fully traverse managed call stacks now as you can see in the above call stack of the CPU graph. There you do see also a blue selected method called GetProfile which will obviously later lead to a blocking call in the call stack when trying to deserialize its data with XmlSerializer leading to reads to System.Xml.ni.dll.
The managed call stacks for native images are retrieved by a matching pdb for each native image. When you do google for “ngen create pdb” you will find more info about this feature. It does look like a Windows 8 and .NET 4.5 feature but it is not. This does work on Windows 7 with .NET 4.0 perfectly well. The first trace collection will take a quite long time because for each native image is ngen createpdb called to store a synthetic pdb at C:\ProgramData\WindowsPerformanceRecorder. This little thingy allows it to fully walk managed and unmanaged call stacks. But what about assemblies that are not NGened?
Good question. The short and good answer is: It does work. Below is a screen shot of a normal application I did create to read from the event log. You do see the Main method, the wrapped delegate call…. All is there. There is some stuff in the NGen images stored which is needed because I do get no such stack when I did not load the unmanaged and synthetic Ngen symbols before.
My symbol path does therefore look like this:
It does not include the MS symbol server because I have the most important symbols stored at C:\Windows\symbols already. This makes symbol loading a matter of seconds and not many minutes. PerfMonitor was a tool that was enabled mixed mode stack walking with ETW events first. You can read more at Naveens blog. For the NGen pdb support there seems to be pretty much magic involved which includes some CLR Rundown ETW traces you can read more about at the official docs which makes this look like a Windows 8 feature again. Anyway. It does work on Windows 7 perfectly well and I do not care about the exact details.
Now we can start VS 2012 a second time and compare CPU and disc utilization.
The second startup is much faster. It is only about 3s compared to 6s the first time. The warm vs cold startup performance is a mystery you cannot solve with most profilers which are commercially available. Now you can see all with a free tool. The disc utilization is practically zero this time. We do spend only 13ms compared to 2163ms during the second startup in accessing the disc. All other data comes from the OS cache which boils down to a memcpy when the application does call ReadFile. As you can see this view gives you not a API level view of the disc where we do read from “disc” when we call CreateFile and ReadFile to open and read from it. This gives you direct insights when the hard disc is shuffling data around.
This tool has the potential to supersede many other profilers because you get a holistic view of the complete system with full managed call stack support now.
There is much more to say about this fantastic tool. But I need some sleep now.