Alois Kraus


  Home  |   Contact  |   Syndication    |   Login
  141 Posts | 8 Stories | 357 Comments | 162 Trackbacks



Post Categories


Last time I have shown how to have a look GC issues with my simplified WPA profile. But there is much more in this profile which will make it a lot easier to analyze standard issues. In general when you are dealing with performance you need to search for

  • CPU
  • Disk
  • Wait
  • Memory

issues. A special case is network which falls into the wait issue category where we wait for some external resource.

When you download my simplified profile and apply it to the the provided sample ETL file you get can analyze any of the above issues within much less time. Here is a screen shot of the default tab you will see when you open the ETL file.

Stack Tags are Important


The first and most important graph is CPU Usage Sampled with "Utilization by Process And StackTags" which is a customized view. It is usable for C++ as well as for .NET applications. If you ever did wonder what stack tags are good for you can see it for yourself. I have added a stack tag named Regular Expression which is set for all all calls to the .NET Regex class like this:

<Tag Name=".NET">
    <Tag Name="Regular Expression">
        <Entrypoint Module="" Method="*System.Text.RegularExpressions.Regex*"/>
        <Entrypoint Module="System.dll" Method="*System.Text.RegularExpressions.Regex*"/>

If more than one tag can match the deepest method with the first stack tag is used. This is the reason why the default stack tag file is pretty much useless. If you add tags for your application they will never match because the low level tags will match long before the tags in your own code could ever match. You have to uncomment all predefined stuff in it. If you use my stack tag file you need to remove WPA provided default.stacktags under Trace - Trace Properties - Stack Tags Definitions. In practice I overwrite the default file to get rid of it. If you leave it you will get e.g. very cheap CPU times for a GC heavy application because all call stacks of your application which will trigger GC stuff are added to the GC stack tag. This makes your application specific stack tags look much slimmer than they really are because your application tags come later in the call stack which are not used to tag your stacks.  

Why would I need a custom stack tag file? It makes it much easier to tag high level expensive operations of your application so you can see how much time you were spending with e.g. loading game data, decompressing things, rendering, … This makes it easy to detect patterns what your application was doing at a high level. Besides if you find a CPU bottleneck in your application you can add it under a e.g. a "Problems" node so you can document already known issues which are now easy to spot.

For our problematic application PerformanceIssuesGenerator.exe we see that it was doing for 17,5s CPU Regex stuff (Weight in View is ms in time units). To see how long the actual running time was we need to add the Thread ID column since currently we sum the CPU time of all threads which is not the actual clock time we spent waiting for completion.



The context menu is actually customized. It is much shorter and contains the most relevant columns I find useful. If you want more of the old columns then you can simply drag and drop columns from the View Editor menu which is part of all WPA tables. If you want to remove additional columns you can also drag and drop columns back to the left again. This way you can streamline all of your column selection context menus which is especially useful for the CPU Usage Precise context menu which is huge.

Select A Time Range to Analyze

Now we see that we have two large Regex CPU consumers with a large time gap in between. But what was the application actually doing? This is where marker events from your own application come in handy so you know what high level operation e.g. the user did trigger and how long it did take. This can be achieved with a custom ETW event provider or the special ETW marker events which WPA can display in the Marks graph if any of them are present in your ETL file. To be able to use them to navigate in the ETL file your application must of write them at interesting high level time points which indicate for example the start and stop of a user initiated action. For .NET applications the EventSource class is perfectly suited for this task. Marker events can be written with a custom PInvoke call to

Here is the code to write a marker event in C# which shows up in all kernel sessions. The TraceSession.GetKernelSessionHandles is adapted from the TraceEvent library. If you have "NT Kernel Logger" sessions only (e.g. if you use xperf) then you can use 0 as session handle to write to it. 

    /// <summary> Write to Windows Event Tracing a Marker string which does show up in the Marker tab when 
    /// you view the trace. This makes navigation in ETL files much easier. 
    /// </summary>
    public static class EtwMarker
        #region Fields
        const int FixedBufferSize = 512;

        [StructLayout(LayoutKind.Sequential, Pack = 1, CharSet = CharSet.Ansi)]
        unsafe struct ETWString
            public Int32 Prefix;
            public unsafe fixed byte String[FixedBufferSize];

        /// <summary>Write a marker string to ETW to enable easier navigation in ETL files in the viewer.</summary>
        /// <param name="str">String to write</param>
        /// <exception cref="Win32Exception">When string could not be written.</exception>
        public static unsafe void WriteMark(string fmt, params object[] args)
            String str = String.Format(fmt, args);
            ETWString etwStr = new ETWString();
            int len = CreateFromString(str, ref etwStr);
            var sessionHandles = TraceSession.GetKernelSessionHandles();

            // Write to all enabled kernel sessions so this marker shows up everywhere.
            foreach (var handle in sessionHandles)
                int ret = EtwSetMark(handle, &etwStr, len);
                if (ret != 0)
                    throw new Win32Exception(ret);

        /// <summary>
        /// Create an ETWString from a .NET string. The struct is fixed size.
        /// Only the first 511 bytes are copied from the ASCII string.
        /// </summary>
        /// <param name="str">String to convert</param>
        /// <param name="data">Reference to ETWString struct</param>
        /// <returns>Length of ETWString</returns>
        static unsafe int CreateFromString(string str, ref ETWString data)
            var bytes = Encoding.ASCII.GetBytes(str);
            int copyLen = bytes.Length < FixedBufferSize ? bytes.Length : FixedBufferSize - 1;

            fixed (void* p = &data)
                IntPtr dest = IntPtr.Add(new IntPtr(p), 4); // pointer to String buffer

                // copy at most Length-1 bytes of string to keep the terminating null char
                Marshal.Copy(bytes, 0, dest, copyLen);
                int length = 4 + copyLen + 1; // Prefix DWORD + string length + closing \0 char
                return length;

        /// <summary>
        /// Undocumented method of the kernel to write an ETW marker which does show
        /// up in the WPA viewer under the Marks tab. This makes navigation in trace files
        /// much easier to find important time marks.
        /// This is the same as xperf -m "string" which writes only to the "NT Kernel Logger" session.
        /// </summary>
        /// <param name="sessionHandle">if 0 the NT Kernel Logger Session is written to. Otherwise you need to use supply a handle to a kernel ETW session.</param>
        /// <param name="pStr">Ascii encoded string with a DWORD prefix</param>
        /// <param name="length">Length of complete string including the prefix and terminating \0 char</param>
        /// <returns>0 on success, otherwise a WIN32 error code</returns>
        extern unsafe static int EtwSetMark(ulong sessionHandle, ETWString* pStr, int length);

Now that we have our marks we can use them to navigate to key time points in our trace session:


We see that the first CPU spike comes comes from RegexProcessing which did take 3,1s. The second regex block was active between the Hang_Start/Stop events which did take 2,7s. This looks like we have some real problems in our PerformanceIssueGenerator code. Since we have according to our ETW marks Regex processing, many small objects, many large objects and a hang with simultaneous regex processing we need to select one problem after the other so we can look at each issue in isolation. That is the power which custom ETW providers or ETW Marks can give you. Normally you are lost if you know that you have several issues to follow up. But with application specific context marker events you can navigate to the first regex processing issue. To do that select the first event by and then hold down the Ctrl key while clicking at the stop event to multi select events. Then you can right click on the graph to zoom into the region defined by the first and last event.


Analyze CPU Issue

When you now look at the CPU consumption of the busiest threads we find 2,7s of CPU time. At the bottom WPA displays the selected duration which is 3,158s which matches quite well the reported timing of 3,178s. But the reported thread time of 2,7s is not quite the observed duration. In the graph you see some drops of the CPU graph which indicates that for some short time the thread was not active possibly waiting for something else.


Wait Chain Analysis

That calls for a Wait Chain Analysis. If you scroll down you will find a second CPU graph with the name CPU Usage (Precise) Waits. This customized graph is perfectly suited to find not only how much CPU was consumed but also how long any thread was waiting for something. Please note that this graph does not replace the CPU Usage Sampled graph. I have explained the difference between both CPU graphs already earlier. The column selection context menu of this graph has been massively thinned out to keep only the most relevant columns. Otherwise you would have to choose from over 50 items and the context menu even has scroll bars! Now we have only process, thread id, thread stack as groupings. Next comes a list of fixed columns which are always visible because they are so important. There we see how long each thread did wait for something (WaitForSingleObject …) as total and maximum wait time and the CPU usage in milliseconds. If we sum up for the most expensive thread the wait time of 0,385s and 2,726s of CPU time we get 3,111s which is within a small error margin exactly the time we did get by measuring the start and stop of our regex processing operation.


Is this conclusion correct? Not so fast since a thread can be only running or waiting (it can also wait in the ready queue which is only relevant if you have more threads running than cores) the sum of CPU and Wait times for each thread will always add up to 3,1s because this is the time range you did zoom into. To actually prove that this time is really waited while we were regex processing we have to sort by the wait column and then drill down for the thread 10764.


When we do this we see that all waiting did occur while the DoRegexProcessing delegate was called. It was waiting on the regular expression JIT compilation to complete. Now we have proven that that wait time is really spent while executing the regex processing stuff. If we would like to optimize the total running time we have two options: Either we could use more threads to parallelize even further or we need to tune our Regular expression or replace it by something else. Before going down that route you should always check if this string processing is necessary at all. Perhaps strings are not the best data structure and you should think about your data structures. If you need to sue strings you still should verify that the regex processing was really necessary at this point of time. Eventually you do not need the results of this regex processing stuff right now.

A Garbage Collection Issue?

In the list of marker events we see that the first regex issue overlaps with a GenerateManySmallObjects operation. Lets zoom into that one and check out the what we see under CPU usage. There we see that we are consuming a significant amount of CPU in the Other stack tag which categorizes unnamed stacks into its own node. If we drill into it we find out allocating method PerformanceIssueGenerator.exe!PerformanceIssueGenerator.IssueGenerator::GenerateManySmallObjects. That is interesting. Was it called on more than one thread?


To answer that question it is beneficial to select the method in question and open from the context menu View Callers- By Function


This will let you start to drill up to all callers of that method which is useful to find the total cost of a commonly used method (e.g. garbage_collect …). This is still the total sum for all threads. Now we need to bring back out Thread Ids to see on which threads this methods was called.


If the Thread ID column has more than one node beneath it it will be expandable like I have shown above. This proves that only one thread was calling this method and it did use 1,3s of CPU time. But there are still 0,4s of wait time missing. Where was the rest spent? We could use the Wait graph again to see where the wait operation was hanging around but since we know that it was allocating things we can open the Region for Garbage collections


And we just have found that during that time the garbage collector did block our application of 0,4s which explains the missing time in our CPU graph. We also know that we had 75 GCs during that time which is a rather high number. We can even see how many GCs of which generation we did have during that time by using my (also hand rolled) Generic Events GC Generations graph:


That's all about the small object allocation issue. Now lets go back by doing "Undo Zoom" in the graph to get our previous time range back where we can do the same analysis for the Generate Many Large objects issue. This is not terribly interesting which I leave to you as exercise.

UI Hang (2,727s) Analysis

Now lets investigate why our UI did hang. The hang operation has nice markers which we can use to zoom into the region of interest.


We are having a lot of CPU intensive threads here. If the UI was hung it must either using much CPU on the UI thread or the UI thread was blocked for some reason. If you know that the UI thread was started on your Main method you can search for CorExeMain which is the CLR method which calls into your main method. Or you search for a typical window method like user32.dll!DispatchMessageWorker. With that we quickly find thread 11972 as the thread which was hung in:


Obviously it was waiting for a task to complete. If you call Task.Wait on your UI thread you will block further UI message processing and your UI will not be able to redraw anymore. At least that was the visible observation. The Hang did take 2,727s which exactly matches up the total summation of all wait operations with 2,724s. If that is true we should see as maximum wait time the same the same time but there we have only 1,383s. When using WPA you can look deep into the inner workings of Windows and .NET. Lets be a little curious and check out why there were not one Wait operation but 30 waits while we were blocked in Taks.Wait. We observere that Task.Wait calls into the current Dispatcher to delegate the wait to its SynchronizationContext. This in turn calls into the current CLR thread to DoAppropriateWait which in turn calls on the UI thread MsgWaitForMultipleObjectEx. This method can block a UI thread but leave e.g. mouse events or COM messages through. Depending on which which parameters you call it it can even pump your UI thread with message which is sometimes necessary. This can lead to unintended side effects that you can execute code while you were waiting for a blocking operation. I have found such issues in Can You Execute Code While Waiting For A Lock? and Unexpected Window Message Pumping Part 2.

Digging Deeper

We know that MsgWaitForMultipleObjects can pump messages with the right flags. So what was happening in these 30 wait operations? If you open the column context menu you can add Ready Thread Stack which gives you all call stacks which did cause MsgWaitForMultpleObjects or its child methods to wake up.


Poor other tools. They will never be able to tell you that MsgWaitForMultipleObjects was woken up 28 times to by dwm.exe which is the Desktop Window Manager to tell our wait method: Here you have 28 mouse events which you can process or not depending on how your wait flags were configured. The last event which did finally end our Task.Wait operation was the last thread that did finish processing. This is the call where our Wait operation did wake up our blocked message loop.



You can ask: How do you know that? Well the Switch-In Time column of our event in the table which is to the right of the yellow bar with 14,425 tells me this. This is were our wait operation did end and our thread was woken up again. This is also the time where the CPU consumption has a sharp drop which is no coincidence. If you have sharp eyes you will notice in the graph at the right a blue line. This is when our thread was woken up. Every thread wake up is displayed in the graph which makes it much easier to spot regular patterns of activity. It is this wealth of information which makes WPA so powerful. But the default profile is mainly aimed at kernel developers which is a pity because it is an excellent system wide profiling tool which gives you deep insights into how Windows really works and sometimes how Windows or your application breaks.

If you have counted you will still miss one ready event. This one was Visual Studio which did send us a window message from vslog.dll!VSResponsiveness::Detours::DetourPeekMessageW. I am not sure why Visual Studio does this but if you profile your system more often you will notice that the Visual Studio responsiveness stuff gets from time to time into a loop and all Visual Studio instances will send each other messages (still the case with VS2015). This does not show up as a surge of CPU consumption. But it will increase your ETL file size considerably because you will get millions of context switches. This is not a big problem except for battery powered devices which have a higher power consumption than they could have and for people profiling their machines getting huge files.


.NET Exception Analysis

The previous analysis was a bit complex. Now for something nice and simple. You can easily check if your .NET application throws exceptions if you enable the .NET provider. With WPRUI you need to check ".NET Activity" to get these events and many others. For a short list of interesting .NET events you can check out the ETW manifest of .NET for yourself or you read Which Keywords Has An ETW Provider Enabled? which covers the most interesting .NET events as well. If you zoom into the Do Failing Things region which is visible as ETW Marker event in the Marks graph. Then you need to add a Generic Events Graph and change the graph type from "Activty by Provider, Task, Opcode" to .NET Exceptions which is only part of my WPA profile.


This gives you a nice view of all thrown exceptions in all .NET applications. You can play with the groupings and group e.g. by exception type and message to see how many different ones you have. To do that you only need to Drag the "Exception Message (Field 2)" and drop it to the right of ExceptionType. Now you have Process - Exception Type - ExceptionMessage in a nice cohesive view. You just have to remember that all columns to the left of the yellow bar are the columns which are grouped together by their values.

Starting with .NET 4.6 you can also use the .NET Exceptions Caught  view which shows you all thrown exceptions and in which method these exceptions were caught.


This can be interesting for automated regression tests if something breaks because an exception which was always happening is now caught in a different method which might be too late in the call chain. Having an exception is nice but the most interesting part of it is its call stack. From this graph we know the Thread on which it was happening. We do not need the process id because each thread Id is unique across all processes. Then you can zoom to a specific exception an investigate in CPU Usage (Precise or Sampled) the call stacks if something related to exception handling shows up. Searching for the method KernelBase.dll!RaiseException is usually a good way to find it. If that did not help you need to enable stack walking for the .NET provider. I blogged about how to do that a long time ago for a custom provider with xperf in Semantic Tracing for .NET 4.0. If you did record the provider "Microsoft-Windows-DotNETRuntime" with stack walking enabled you can add in the View Editor from the column context menu


the Stack column to display the call stack of all thrown exceptions. An alternative is to enable the StackKeyword+ExceptionKeyword  of the "Microsoft-Windows-DotNETRuntime" ETW provider. This will cause .NET to write an additional ETW event for every .NET event with a managed stackwalk event. This event is not human readable but PerfView can decode it. This is one of the things where PerfView is still the best tool to investigate managed application issues.


Although this is already a rather long post you still have only seen a small fraction of what you can do with ETW profiling and the WPA tool. I hope that I have convinced a few more people out there to try it out because it can help you to find the problems you never knew that they existed. But these problems might be ones your customers are always complaining about.

posted on Saturday, September 26, 2015 10:53 AM


# re: Making WPA Simple - CPU/Disk/Memory and Wait Analysis 10/15/2015 3:22 AM Glen
I love this series of articles and I'm surprised you have not gotten more comments. Thanks for the profile download and going to the time and trouble to write this article! You should consider submitting these entries as articles on

Post A Comment