Alois Kraus

blog

  Home  |   Contact  |   Syndication    |   Login
  131 Posts | 8 Stories | 345 Comments | 162 Trackbacks

News



Archives

Post Categories

Programming

Thursday, April 14, 2016 #

If you have ever tried to create a WPF application which has a larger memory footprint (>500MB) you will notice some random hangs in the UI which become several second hangs for no apparent reason. On a test machine with a decent Xeon CPU I have seen 35s UI hangs because WPF frequently (up to every 850ms) calls GC.Collect(2). The root cause of the problem is that WPF was designed with a business application developer in mind who never gets resource management right. For that reason WPF Bitmaps and other things do not even care to implement the IDisposable interface to clean up resources deterministically. Instead the cleanup is left as an exercise for the Garbage Collector with the Finalizer thread working hand in hand.

That can lead to problems. Suppose a 32 bit application where the user is scrolling through a virtual ListView with many bitmaps inside it. This operation will cause the allocation of many temporary Bitmaps which will quickly become garbage. Because the Bitmaps are small objects on the managed heap but the actual Bitmap data is stored in unmanaged memory the Garbage Collector sees no need to clean things up for a long time. In effect it did happen that your application ran of unmanaged memory long before the Garbage Collector was able to release the bitmaps in the Finalizer. That lead to one of the worst hacks in WPF. It is called MemoryPressure. Lets have a look how it is implemented:

http://referencesource.microsoft.com/#PresentationCore/Core/CSharp/MS/Internal/MemoryPressure.cs
//
// About the thresholds:  
// For the inter-allocation threshold 850ms is the longest time between allocations on a high-end
// machine for an image application loading many large (several M pixel) images continuously.
// This falls well below user-interaction time (which is on the order of several seconds) so it
// differentiates nicely between the two
//
// The initial threshold of 1MB is so we don't force GCs when the total amount of unmanaged memory
// isn't a big deal.  The point of this code is to stop unmanaged memory from spiraling out of control
// at that point it's typically in the 10s of MBs.  This threshold thus could potentially be increased
// but current testing shows it is adequate.
//
// The max time between collections was set to 30 sec because that is a 'long time' - this is
// for the case where allocations (and frees) of images are happening continously without 
// pause - we haven't seen scenarios that do this yet so it's possible this threshold could also 
// be increased
// 
private const int INITIAL_THRESHOLD = 0x100000;          // 1 MB initial threshold
private const int INTER_ALLOCATION_THRESHOLD = 850;     // ms allowed between allocations
private const int MAX_TIME_BETWEEN_COLLECTIONS = 30000; // ms between collections
/// <summary>
/// Check the timers and decide if enough time has elapsed to
/// force a collection
/// </summary>
private static void ProcessAdd()
{
    bool shouldCollect = false;

    if (_totalMemory >= INITIAL_THRESHOLD)
    {
        //
        // need to synchronize access to the timers, both for the integrity
        // of the elapsed time and to ensure they are reset and started
        // properly
        //
        lock (lockObj)
        {
            //
            // if it's been long enough since the last allocation
            // or too long since the last forced collection, collect
            //
            if (_allocationTimer.ElapsedMilliseconds >= INTER_ALLOCATION_THRESHOLD
                || (_collectionTimer.ElapsedMilliseconds > MAX_TIME_BETWEEN_COLLECTIONS))
            {
                _collectionTimer.Reset();
                _collectionTimer.Start();
            
                shouldCollect = true;
            }
            _allocationTimer.Reset();
            _allocationTimer.Start();
        }

        //
        // now that we're out of the lock do the collection
        //
        if (shouldCollect)
        {
            Collect();
        }
    }

    return;
}

/// <summary>
/// Forces a collection.
/// </summary>
private static void Collect()
{
    //
    // for now only force Gen 2 GCs to ensure we clean up memory
    // These will be forced infrequently and the memory we're tracking
    // is very long lived so it's ok
    //
    GC.Collect(2);
}

This beauty is calling GC.Collect(2) every 850ms if in between no Bitmap was allocated or every 30s regardless of how many Bitmaps were allocated. With .NET 4.5 we have got concurrent garbage collection which dramatically reduces blocking all application threads while a garbage collection is happening. For common application workloads a "normal" .NET application gets 10-15% faster by doing no change to the code. These improvements are all nullified by calling a forceful full blocking garbage collection.

To demonstrate the effect I have created a simple test application which allocates on a background thread 1 GB of small objects while on the UI thread we allocate one WPF bitmap every 850ms and compare that to allocating an "old" WinForms Bitmap object also every 850ms.

// WPF allocation                 
BitmapFrame img = null;
while (true)
{
    myBitmapStream2.Position = 0;
    img = BitmapFrame.Create(myBitmapStream2); // WPF Bitmaps have no Dispose method!
    Thread.Sleep(850);
    if (backgroundAllocation.IsCompleted)
    {
        break;
    }
}

// Winforms Bitmap allocation 
Bitmap img = null;
while (true)
{
    myBitmapStream2.Position = 0;
    img = new Bitmap(myBitmapStream2); 
    Thread.Sleep(850);
    if (backgroundAllocation.IsCompleted)
    {
        break;
    }
    img.Dispose();
}
 

If you measure for some heap sizes you quickly see that your application will become dramatically slower the more memory it consumes due to the forced blocking garbage collection caused by WPF. The x-axis shows the managed heap size in MB and the y-axis shows the time needed to allocate 1 GB of small objects in a background thread.

image

You have a multi GB WPF application and the user experience is just awful and slow? You can google for good answers on Stackoverflow

which tell you that you need to use Reflection to set private fields in the internal MemoryPressure class of WPF. Not exactly a production grade "fix" to the issue.

But there is hope. The new public beta of .NET Framework 4.6.2 contains a fix for it. The MemoryPressure class is gone and your Stackoverflow "fix" will cause exceptions if you did not prepare for the impossible that Microsoft did dare to remove internal classes. WPF now adheres to the long time recommended GC.AddMemoryPressure call to tell the Garbage Collector that some managed objects also consume significant unmanaged memory.

// System.Windows.Media.SafeMILHandleMemoryPressure
internal SafeMILHandleMemoryPressure(long gcPressure)
{
    this._gcPressure = gcPressure;
    this._refCount = 0;
    if (this._gcPressure > 8192L)
    {
        MemoryPressure.Add(this._gcPressure);
        return;
    }
    GC.AddMemoryPressure(this._gcPressure);
}

With .NET 4.6.2 you finally get the possibility back to create snappy managed applications without long forced garbage collection pauses. You can measure the GC pause times with my custom WPA profile in no time:

image

That is nice but you can see with my custom WPA profile and the streamlined default.stacktags file even more:

image

There you can clearly see that while the managed heap grows the Induced GC times get bigger just as you would expect from the GC regions. To get the same view you need to download my simplified WPA profile which I have updated with the latest stacktags I found useful during past analysis. To make that active you need to open from the menu Trace - Trace Properties and remove current file and add the downloaded stacktags file. Or you can also simply overwrite the default.stacktags file that comes with WPA.

image

The new improved stacktags file gives you fast insights into your application and your system which is not really possible with other tools. With a nice stacktags file you can create your very own view of the system. The updated stacktags file contains tags for common serializers, exception processing, and many more things which are useful during analysis of performance issues or application failures.


Tuesday, March 22, 2016 #

When dealing with custom data structures funny things can happen. This is especially true if multiple threads can mutate state and you have several locks playing in the mix. Doing a code review is quite challenging with such code but still some bugs slip through. After quite a lot of debugging the final bug turned out to have nothing to do with multithreading but a problematic IComparable implementation. What will the following small program print as output?

using System;
using System.Collections.Generic;

namespace SortedSetTest
{
    class Data : IComparable<Data>
    {
        public byte[] Payload
        {
            get;
            set;
        }
        public int CompareTo(Data other)
        {
           return this.Payload == other.Payload ? 0 : 1;
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            var set = new SortedSet<Data>();

            var payload = new byte[10];
            var payload2 = new byte[20];
            var payload3 = new byte[30];

            var d1 = new Data { Payload = payload };
            var d2 = new Data { Payload = payload2 };
            var d3 = new Data { Payload = payload3 };

            set.Add(d1);
            set.Add(d2);
            set.Add(d3);

            set.Remove(d1);
            set.Remove(d3);
            set.Remove(d2);

            Console.WriteLine($"Final Count: {set.Count}");
        }
    }
}

You can choose from

  • 0
  • 1
  • 2
  • 3
  • Other
  • Exception

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

And the winner is:

 

Final Count: 1

This was unexpected. If you break the rules of the IComparable interface you can get from never terminating sorts up to silent data structure corruption everything. The generic interface description is not particularly helpful but the non generic version spells it out explicitly:

1) A.CompareTo(A) must return zero.

2) If A.CompareTo(B) returns zero, then B.CompareTo(A) must return zero.

3) If A.CompareTo(B) returns zero and B.CompareTo(C) returns zero, then A.CompareTo(C) must return zero.

4) If A.CompareTo(B) returns a value other than zero, then B.CompareTo(A) must return a value of the opposite sign.

5) If A.CompareTo(B) returns a value x not equal to zero, and B.CompareTo(C) returns a value y of the same sign as x, then A.CompareTo(C) must return a value of the same sign as x and y.

6) By definition, any object compares greater than (or follows) null, and two null references compare equal to each other.

If you break any of the rules you will get undefined behavior which depends entirely on the collection class you use. You must remember to never ever take shortcuts while implementing the IComparable interface. In my case the rule number 4 and 6 were violated.

Proof 4: A,B are non null arrays

    A.CompareTo(B) = 1

    B.CompareTo(A) = 1 Cannot be according to rule 4!

The spirit of this rule is that in order to sort things you need to support the < operator which requires that property. The problem with SortedSet<T> is that it uses a red black tree for data storage. Since the < comparison is broken the tree operations which rely on a working comparison operator can break it subtle ways like sometimes to forget to remove an existing item from the collection.

So how can this be fixed? That all depends on what you treat as equal. If you care only about array sizes and not their contents then your CompareTo method becomes

        public int CompareTo(Data other)
        {
            int lret = 0;
            if( Object.ReferenceEquals(Payload, other.Payload))
            {
                lret = 0;
            }
            else if (other.Payload == null)
            {
                lret = 1;
            }
            else if (Payload == null)
            {
                lret= -1;
            }
            else
            {
                lret = Payload.Length.CompareTo(other.Payload.Length);
            }

            return lret;
        }

That fixed version will also follow rule 6 for null values which is also good practice. For some time I have thought that SortedSet<T> was broken but as usual the BCL classes are ok but the fault was in our code. How hard can it be to write a comparison method? It turns out there are 6 rules to be followed which are quite a lot for a seemingly simple function.


Monday, March 14, 2016 #

Finding handle leaks in all processes at once for all handle types without a debugger is no longer impossible. Since Windows 8.1 (0?) each handle creation and close call is instrumented with an ETW event. You only need to turn it on, execute your use case for some minutes or hours if you really need to and then stop the recording.

To start full handle tracing you need to install the Windows Performance Toolkit from the Windows 10 SDK or WDK. Then enter in an Administrator shell

  • wpr -start Handle
  • Execute your use case
  • wpr -stop c:\temp\Handle.etl

Then you can open the resulting .ETL file with WPA and add the graph Handles - Outstanding Count by Process to your analysis view.

image

Now you can filter for your process (e.g. in my case I did start Visual Studio). The original view gives me a system wide view of all processes which did allocate handles.

image

That is a nice view but if you are after a handle leak you need to Create Stack. No problem. Right click on the table header and add the Create Stack to the column list. Then you should load the symbol from MS and add your local symbol paths

image

With the call stacks you can drill into the allocation stack of any handle and search for your leak:

image

The graph nicely shows the not yet freed handles but the table shows all allocations which can be a bit confusing when you search for the not yet released handles. For big handle leaks the existing view is already enough but if you need in the table to drill down only into call stacks of not yet released handles you need to add a filter to exclude all lines in the table which have released a handle before the trace was stopped.

More Details

To add that filter click on the open the gear icon or press Ctrl+E:

image

Because we are doing advanced things we click on the Advanced icon

image

and there we can finally add the trace end time which is visible at the bottom of the WPA main window

image

Now the graph and the table is updated which now only shows the handles which have not been released since the start of Visual Studio in our example which should match the number of allocated handles shown by Task Manager.

image

You can also get more fancy. Normally I have some test which shows after some time a handle leak in a specific process. I start leak tracing and then the test and later I stop it. Since I do not want to treat first time initialization effects as leaks I can exclude the e.g. first 5 minutes of the test to get rid of first time init effects. I also want to make sure that I do not get handles as leaks which are allocated at the end because the test was still running at the end of the trace. To do that I need to look for recurring patterns in the trace and exclude all allocated handles which were created at some later time when the test run was just complete. The final result is a filter which hides all entries which match

[Close Time]:<"20,861s" OR [Create Time]:<"5s" OR [Create Time]:>"15s"

After all noise is removed any handle leak, even small ones are only a matter of drilling into the allocation call stacks and fixing the code. If you have a handle leak on a Windows 8.1 (0?) or later machine this approach is much easier and faster than to use Windbg and the !htrace command which is nicely explained at https://blogs.technet.microsoft.com/yongrhee/2011/12/19/how-to-troubleshoot-a-handle-leak/.

Why So Late?

I have no idea why this very useful capability of WPA was never documented anywhere. It showed up in the Windows 8 SDK years ago but Handle leak tracing did never work because I was at that time still with Windows 7.

Which Handle Type did I Leak?

The easiest way is to use another tool. Process Hacker is a Process Explorer clone which can show for any process a nice summary. Double click on a process and select the Statistics tab:

image

When you click on Details you can sort by Handle Count and you immediately know for which handle type you are searching a leak:

image

PerfView for Advanced Recording

The only other tool I know of which can enable handle leak tracing is PerfView v1.9 from 2/19/2016 or later

image

PerfView has the unique capability to stop tracing based on a performance counter threshold. This is extremely useful to find e.g. a sudden handle spike which occurs during a stress test over night at 5 a.m. in the morning but when you arrive at 6 a.m. (you are already too late Zwinkerndes Smiley) at the office the handle spike will long be overwritten by newer handle allocations of the 500MB ring buffer. Now you can get your breakfast and arrive relaxed at 9 a.m where you can start analyzing the random handle spike which your colleagues were missing while they were sitting in front of Windbg over night and present the results at 10 a.m in the morning to your manager.

The only issue I have with PerfView is that its performance counter query is locale sensitive which makes it not trivial to specify it on e.g. a Hungarian machine. For the record: On my German machine I can start Handle leak tracing which stops when the performance counter for the the first devenv instance has a value greater than 2000 handles with

  • perfview collect c:\temp\HandleLeak.etl /kernelEvents=Handle /StopOnPerfCounter:"Prozess:Handleanzahl:devenv>2000"

The feature finally seems to have been set free with the Windows 10 SDK but handle leak tracing exists also since Windows 8.1 (0?) in the kernel but no tool was capable to enable it until now. Before that ETW feature Handle leaks have been quite hard to track down but with such advanced and pretty easy to use tooling it is just a matter of two command line calls to get all allocated handles from all processes in one go.

If you leak User (Windows, Menus, Cursors, …) or GDI objects (Device Contexts, Brushes, Fonts, …) you still need to resort to intercepting the corresponding OS methods in your target process like I have shown in Generic Resource Leak Detection with ETW and EasyHook but as usual you need to use the right tool for the job at hand to nail all bugs of your application.

Conclusions

With the addition of ETW tracing to handle allocations it has never been so easy to solve handle leaks. Previously it was a pretty complex undertaking but now you can follow the steps above and you will have a nearly 100% fix rate if you analyze the gathered data correctly. If this has helped you to solve a long searched leak or you have other useful information you want to share sound off in the comments.


Saturday, February 27, 2016 #

I had an interesting case where a new WPF control was added to a legacy WinForms application. The WPF control worked perfectly in a test application but for some strange reason it was very slow in final WinForms application where it was hosted with the usual System.Windows.Forms.Integration.ElementHost. The UI did hang and one core was always maxed out. Eventually it built up after some minutes but even simple button presses did cause 100% CPU on one core for 20s. If you have high CPU consumption the vegetative reaction of a developer is to attach a debugger and break into the methods to see where the issue is. If you use a real debugger like Windbg you can use the !runaway command to find the threads with highest CPU usage

0:006> !runaway
User Mode Time
  Thread       Time
   0:368       0 days 0:00:11.625
   4:13a0      0 days 0:00:06.218
   6:301c      0 days 0:00:00.000
   5:21c8      0 days 0:00:00.000
   3:3320      0 days 0:00:00.000
   2:20e4      0 days 0:00:00.000
   1:39a0      0 days 0:00:00.000

but when I tried to break into the I was always just waiting for window messages:

# ChildEBP RetAddr  Args to Child             
00 0018f0d0 5cb1e13a dcd9ca4a 72f7a424 0018f370 USER32!NtUserWaitMessage+0xc
01 0018f154 5cb1db39 00000000 ffffffff 00000000 System_Windows_Forms_ni+0x1ae13a
02 0018f1a8 5cb1d9b0 024d09e4 1bda0002 00000000 System_Windows_Forms_ni+0x1adb39
03 0018f1d4 5cb06129 024d09e4 0018f29c 00000000 System_Windows_Forms_ni+0x1ad9b0
04 0018f1ec 00bc048b 02444410 0018f204 72f71396 System_Windows_Forms_ni+0x196129

Eventually I would find some non waiting stacks but it was not clear if these were the most expensive ones and why. The problem here is that most people are not aware that the actual drawing happens not in user mode but in an extended kernel space thread. Every time you wait in NtUserWaitMessage the thread on the kernel side can continue its execution but you cannot see what's happening as long as you are only looking at the user space side.

If debugging fails you can still use a profiler. It is about time to tell you some well hidden secret of the newest Windows Performance Toolkit. If you record profiling data with WPR/UI and enable the profile Desktop composition activity new views under Video will become visible when you open the trace file with WPA. Most views seem to be for kernel developers but one view named Dwm Frame Details Rectangle By Type is different. It shows all rectangles drawn by Dwm (the Desktop Window Manager). WPA shows not only the flat list of updated rectangles and its coordinates but it draws it in the graph for the selected time region. You can use this view as poor mans screenshot tool to visually correlate the displayed message boxes and other windows with the performed user actions. This way you can visually navigate through your ETL and see what windows were drawn at specific points in your trace!

image

That is a  powerful capability of WPA which I was totally unaware until I needed to analyze this WPF performance problem. If you are more an xperf fan you need to add to your user mode providers list

    • Microsoft-Windows-Dwm-Core:0x1ffff:0x6

and you are ready to record pretty much any screen rectangle update. This works only on Windows 8 machines or later. Windows 7 knows the DWM-Core provider but it does not emit the necessary events to draw the dwm rectangles in WPA. The rectangle drawing feature of WPA was added with the Win10 SDK Release of December 2016. Ok so we see more. Now back to our perf problem. I could see that only two threads are involved consuming large amounts of CPU in the UI thread and on the WPF render thread for a seemingly simple screen update. A little clicking around in the UI would cause excessive CPU usage. Most CPU is used in the WPF rendering thread

ntdll.dll!_RtlUserThreadStart
ntdll.dll!__RtlUserThreadStart
kernel32.dll!BaseThreadInitThunk
wpfgfx_v0400.dll!CPartitionThread::ThreadMain
wpfgfx_v0400.dll!CPartitionThread::Run
wpfgfx_v0400.dll!CPartitionThread::PresentPartition
wpfgfx_v0400.dll!CComposition::Present
wpfgfx_v0400.dll!CSlaveHWndRenderTarget::Present
wpfgfx_v0400.dll!CDesktopHWNDRenderTarget::Present
wpfgfx_v0400.dll!CDesktopRenderTarget::Present
wpfgfx_v0400.dll!CSwRenderTargetHWND::Present
wpfgfx_v0400.dll!CSwPresenter32bppGDI::Present
wpfgfx_v0400.dll!CMILDeviceContext::BeginRendering
user32.dll!NtUserGetDC
ntdll.dll!LdrInitializeThunk
ntdll.dll!_LdrpInitialize
wow64.dll!Wow64LdrpInitialize
wow64.dll!RunCpuSimulation
wow64cpu.dll!Thunk0Arg
wow64cpu.dll!CpupSyscallStub
ntoskrnl.exe!KiSystemServiceCopyEnd
win32kbase.sys!NtUserGetDC
ntoskrnl.exe!ExEnterPriorityRegionAndAcquireResourceShared
win32kbase.sys!_GetDCEx
wpfgfx_v0400.dll!CMILDeviceContext::EndRendering
user32.dll!ReleaseDC
user32.dll!NtUserCallOneParam
ntdll.dll!LdrInitializeThunk
ntdll.dll!_LdrpInitialize
wow64.dll!Wow64LdrpInitialize
wow64.dll!RunCpuSimulation
wow64cpu.dll!ServiceNoTurbo
wow64.dll!Wow64SystemServiceEx
wow64win.dll!whNtUserCallOneParam
wow64win.dll!ZwUserCallOneParam
ntoskrnl.exe!KiSystemServiceCopyEnd
win32kfull.sys!NtUserCallOneParam
ntoskrnl.exe!ExReleaseResourceAndLeavePriorityRegion
ntoskrnl.exe!KiCheckForKernelApcDelivery
ntoskrnl.exe!KiDeliverApc
win32kfull.sys!NormalAPCInvalidateCOMPOSITEDWnd
win32kbase.sys!EnterCrit …

If that does not make much sense to you, you are in good company. The WPF rendering thread is rendering a composite window (see CComposition::Present) which seems to use a feature of Windows which also knows about composite Windows. After looking with Spy on the actual window creation parameters of the hosting WinForms application

image

it turned out that the Windows Forms window had the WS_EX_COMPOSITED flag set. I write this here as if this is flat obvious. It is certainly not. Solving such problems always involves asking more people about their opinion what could be the issue. The final hint that the WinForms application had this extended style set was discovered by a colleague of me. Nobody can know everything but as a team you can tackle pretty much any issue.

A little googling reveals that many people before me had also problems with composite windows. This flag does basically inverse the z-rendering order. The visual effect is that the bottom window is rendered first. That allows you to create translucent windows where the windows below your window shine through as background. WPF uses such things for certain visual effects.

That is enough information to create a minimal reproducer of the issue. All I needed was a default Windows Forms application which hosts a WPF user control. The complete zipped sample can be found on Onedrive.

    public partial class Form1 : Form
    {
        protected override CreateParams CreateParams
        {
            get
            {
                CreateParams cp = base.CreateParams;
                cp.ExStyle |= 0x02000000;  // Turn on WS_EX_COMPOSITED
                return cp;
            }
        }

        public Form1()
        {
            InitializeComponent();
            cWPFHost.Child = new UserControl2();
        }
    }

The WPF user control is also very simple

    public partial class UserControl2 : UserControl
    {
        public UserControl2()
        {
            InitializeComponent();
            this.Loaded += UserControl2_Loaded;
        }

        void UserControl2_Loaded(object sender, RoutedEventArgs e)
        {
            HwndSource hwnd = System.Windows.PresentationSource.FromVisual(this) as HwndSource;
            HwndTarget target = hwnd.CompositionTarget;
            target.RenderMode = RenderMode.SoftwareOnly;
        }
    }

To get the high CPU issue three things need to come together

  1. Hosting window must have set the WS_EX_COMPOSITED window style.
  2. WPF child window must use composition.
  3. WPF child window must use software rendering.

When these three conditions are met then you have a massive WPF redraw problem. It seems that two composite windows cause some loops while rendering inside the OS deep in the kernel threads where the actual rendering takes place. If you let WPF use HW acceleration it seems to be ok but I have not measured how much GPU power is then wasted. Below is a screenshot of the sample Winforms application:

 

image

After was found the solution was to remove the WS_EX_COMPOSITED window style from the WinForms hosting window which did not need it anyway.

Media Experience Analyzer

The problem was solved but it is interesting to see the thread interactions happening while the high CPU issue is happening. For that you can use a new tool of MS named Media Experience Analyzer (XA) which was released in Dec 2015 but the latest version is from Feb 2016. If you have thought that WPA is complex then you have not yet seen how else you can visualize the rich ETW data. This tool is very good at visualizing thread interactions in a per core view like you can see below. When you hover over the threads the corresponding context switch and ready thread stacks are updated on the fly. If you zoom out it looks like a star field in Star Trek just with more colors.

image

If you want to get most out of XA you can watch the videos a Channel 9 which give you a pretty good understand how Media Experience Analyzer (XA) can be used

When should you use WPA and when Media Experience Analyzer?

So far the main goal of XA seems to be to find hangs and glitches in audio and video playback. That requires a thorough understanding of how the whole rendering pipeline in Windows works which is huge field on its own. But it can also be used to get a different view on the data which is not so easy to obtain in WPA. If threads are ping ponging each other this tool makes it flat obvious. XA is already powerful but I am not following entirely its UI philosophy where you must visually see the issue in the rendered data. Most often tabular data like in WPA is more powerful because you can sort by columns and filter away specific call stacks which seems not to be possible with XA. What I miss most in XA is a simple process summary timeline like in the first screenshot. XA renders some nice line graphs but that is not very helpful to get a fast overview of the total CPU consumption. If you look at the complete trace with the scheduler events and the per process CPU consumption in

XA

image

WPA

image

I am having a much easier time in WPA to identify my process with the table and color encoding. In XA you always need to hover over the data to see it actual value. A killer feature in XA would be a thread interaction view for a specific process. Ideally I would like to see all threads as bars and the bar length is either the CPU or wait time. Currently I can only see one thread color encoded on which core it is running. This is certainly the best view for device driver devs but normally I am not interested in a per core view but a per thread timeline view. Each thread should have a specific y-value and the horizontal bar length should show either its running or waiting time (or both) and with a line the readying thread as it is already done today.

That would be the perfect thread interaction view and I hope that will be added to XA. The current version is still a 1.0 so expect some crashes and bugs but it has a lot of potential. The issues I encountered so far are

  • If you press Turn Symbol Off while is still loading it crashes.
  • The ETL file loading time is very high because it seem to include some private MS symbol servers where the UI hangs for several minutes (zero CPU but a bit network IO).
  • UI Redraws for bigger (>200MB) ETL files are very slow. Most time seems to be spent in the GPU driver.
  • Spelling error in Scheduler view: Drivers, Processes, Threads per Core with Reaady threads.

XA certainly has many more features I have not yet found. The main problem with these tools is that the written documentation only touches the surface. Most things I have learned by playing around with the tools. If you want share your experiences with WPA or XA please sound off in the comments. Now stop reading and start playing with the next cool tool!


Tuesday, February 9, 2016 #

The number of bugs produced by developers are legion but why are advanced debugging skills still rare in the wild? How do you solve problems if you do not have the technical know how to to a full root cause analysis across all used tech stacks?

Simple bugs are always reproducible in your development environment and can easily be found with visual debuggers in your favorite IDE. Things get harder if your application consistently crashes at customer sites. In that case often environmental problems are the root cause which mostly cannot be reproduced in the lab. Either you install a debugger on production machines of your customers or you need to learn how to use memory dumps and analyze them back home.

There are also many other tools for Windows troubleshooting available like Process Explorer, Process Monitor, Process Hacker, VMMap, … which help a lot to diagnose many issues without ever using a debugger. With some effort you can learn to use these tools and you are good to solve many problems you can encounter during development or on customer machines.

Things get interesting if you get fatal sporadic issues in your application which results in data loss or it breaks randomly only on some customer machines. You can narrow it down where the application is crashing but if you have no idea how you did get there then some industry best practice anti patterns are used:

  • You know the module which breaks and you rewrite it.
  • You do not even know that. If the problem is sporadic tinker with the code until it gets sporadic enough to be no longer an urgent problem.

That is the spirit of good enough but certainly not of technical excellence. If you otherwise follow all the good patterns like Clean Code and the Refactoring you still will collect over the years more and more subtle race conditions and memory corruptions in central modules which need a rewrite not because the code is bad but because no one is able to understand why it fails and is able to fix it.

I am surprised that so many, especially small companies can get away with dealing technical debt that way without going out of business. Since most software projects are tight on budget some error margin is expected by the customers they can live pretty well with worked around errors. I am not complaining that this is the wrong approach. It may be more economical to bring a green banana to market to see what the customers are actually using and then polish the biggest user surfacing features fast enough before the users will step away from the product. The cloud business brings in some fascinating opportunities to quickly roll out software updates to all of your customers with new features or fixes. But you need to be sure that the new version does not break in a bad way or all of your customers will notice it immediately.

Did you ever encounter bugs which you were not able to solve? What creative solutions did you come up with?


Saturday, January 30, 2016 #

When you are tracking down handle or other leaks you usually need to resort to Windbg which is a great tool for unmanaged code but not so much for managed code because the usual stuff like !htrace only get the unmanaged call stack.  For managed code this approach breaks because you will not see which managed callers did allocate or free a handle. Other Windbg approaches like putting a breakpoint and then print out the handle and the call stack at the allocate/free methods work, but for heavily called methods they slow down the application too much. A much easier and faster approach is to hook the methods in question and write out an ETW event for each acquire/release call where the ETW infrastructure takes care of walking the stack. To hook specific exported methods I usually use EasyHook which is a great library to hook into pretty much any C style exported method.

Below I present a generic way to use hooking of resource acquire and release methods and tracing them with ETW which enables a generic way to track down any resource leakage as long as the necessary functions are within the reach of an API interception framework. To my knowledge that was not done before in any public library and I take therefore credit to make this approach public. Perhaps I have invented this approach but one never knows for sure.

EasyHook Introduction

EasyHook lets you define managed callbacks to intercept publicly exported methods of any processes. In the managed callback you can extend the functionality of an existing API or trace method arguments which is more often the case. If you have e.g. an executable which exports a method named Func which takes 5 arguments and returns an int like

extern "C" int WINAPI Func(int *a, int *b, int *c, int *d, int *e)
{
    *a = 1;
    *b = 2;
    *c = 3;
    *d = 4;
    *e = 5;
    return 1234;
}

you can write in managed code an interceptor for it which additionally needs a matching delegate and a PInvoke signature of the original method

        [UnmanagedFunctionPointer(CallingConvention.StdCall, CharSet = CharSet.Unicode, SetLastError = true)]
        unsafe delegate int FuncDelegate(int* a, int* b, int* c, int* d, int* e);

        [DllImport("UnmanagedWithExports.exe", CharSet = CharSet.Unicode, SetLastError = true, CallingConvention = CallingConvention.StdCall)]
        unsafe static extern int Func(int* a, int* b, int* c, int* d, int* e);
 
        unsafe static int Func_Hooked(int* a, int* b, int* c, int* d, int* e)
        {
            int lret = Func(a, b, c, d, e);
            return lret;
        }

To override in the target process you need to execute

        FuncHook = LocalHook.Create(LocalHook.GetProcAddress("UnmanagedWithExports.exe", "Func"), new FuncDelegate(Func_Hooked), this);
        FuncHook.ThreadACL.SetExclusiveACL(null);

The Hook creation is self explanatory but the hook is  disabled for all threads by default. You can either specify a list of threads for which the hook should be enabled ( LocalHook.ThreadACL.SetInclusiveACL(...) ) or you can configure a list of threads for which it should be disabled ( LocalHook.ThreadACL.SetExclusiveACL(…) ). By calling SetExclusiveACL with a null list or new int [] { 0 } which is not a valid thread number you are effectively enabling the hook for all threads.

That was the whirlwind tour into EasyHook which makes it really easy to tap into otherwise unreachable parts of code execution. Most Windows APIs are prepared to be hooked so you can intercept pretty much anything. In practice all C style exported methods are usually within your reach. With that new capability you can do previously impossible things like

  • Write a generic resource leak detection tool which tracks every resource acquire and release call.
  • Export a dummy method in your C++ dll which is called from code where previously sporadic bugs did happen. E.g. trace each invocation of the dummy method which passes internal state variables from the sporadically failing method. 
    • Now you can hook the empty method and intercept the method calls with full call stacks during your automated builds only on the machines where it did occur with minimal overhead.
  • Record the passed parameters to any exported method without the need to change the source code (which you probably don't have anyway) of this method.
  • Alter method arguments at runtime (e.g. redirect file open calls to other files).
  • ….

If you try it out you will find that the interception part works pretty well but for some reason the call stack for x64 processes looks broken when you inspect it in the debugger. We only see the managed intercepted call but no stack frames from our callers:

0:000> kc
 # Call Site
00 EasyHook64!LhPatchHookReturnAddress
01 EasyHook64!LhBarrierBeginStackTrace
02 0x0
03 0x0
04 0x0
05 0x0
06 clr!UM2MThunk_WrapperHelper
07 clr!UM2MThunk_Wrapper
08 clr!Thread::DoADCallBack
09 clr!UM2MDoADCallBack
0a clr!UMThunkStub
0b 0x0
0c 0x0
0d 0x0
0e 0x0
0f 0x0
10 0x0
11 0x0
12 0x0
13 0x0
14 0x0
15 0x0
16 0x0

but it should show our callers as well which should look like
0:000> kc
 # Call Site
00 EasyHook64!LhPatchHookReturnAddress
01 EasyHook64!LhBarrierBeginStackTrace
02 0x0
03 0x0
04 0x0
05 0x0
06 clr!UM2MThunk_WrapperHelper
07 clr!UM2MThunk_Wrapper
08 clr!Thread::DoADCallBack
09 clr!UM2MDoADCallBack
0a clr!UMThunkStub
0b UnmanagedWithExports!main
0c UnmanagedWithExports!__tmainCRTStartup
0d KERNEL32!BaseThreadInitThunk
0e ntdll!RtlUserThreadStart

The problem comes from the fact that under 64 bit stack walking works differently than under 32 bit. When the Windows x64 calling convention was defined the engineers wanted to get rid of the exception handling overhead in the happy code path unlike in the Windows x86 execution model. With x86 every method which contains an exception handler needs to set their exception handler  callback even if no exception ever happens.

In contrast to that x64 code has zero overhead in terms of code execution speed no matter how many exception handlers are present in any method! To make that possible in x64 all code must additionally register exception unwind information structures which describe the stack layout and their exception handlers for every part of the function until function exit to enable proper stack unwinding by parsing the additional metadata if an exception is about to be thrown.

x64 Stack Walking and Unwind Information

The exception unwind information is also used by the stackwalker which is part of the reason why our stack is broken. To intercept method calls EasyHook copies assembly code to a memory location which acts as trampoline code for our managed hook callback. This assembler trampoline code has no associated exception unwind information which is a problem if you try to step with a debugger through a hooked method. I tried to add the missing exception unwind information to the dynamically emitted assembler code by patching EasyHook. This led down into the dark corners of assembler ABI in function Prologue and Epilogue.

Here are some links with further information if you are interested:

It took a while to wrap my head around it since it was not clear how to verify that the unwind info registration was correct. Luckily Windbg has a nice command to show you for any address the associated Unwind Information. The command is call .fnent xxx. xxxx is the method address. Below is the output of Windbg for the trampoline assembler code with the now dynamically added Unwind Information:

0:000> .fnent 7ff6`79fa0318
Debugger function entry 0000003a`d75be570 for:

BeginAddress      = 00000000`00000318
EndAddress        = 00000000`000003c9
UnwindInfoAddress = 00000000`00000274

Unwind info at 00007ff6`79fa0274, 22 bytes
  version 1, flags 0, prolog 2d, codes f
  frame reg 5 (rbp), frame offs 00h
  00: offs 26, unwind op 8, op info 3    UWOP_SAVE_XMM128 FrameOffset: 50 reg: xmm3.
  02: offs 21, unwind op 8, op info 2    UWOP_SAVE_XMM128 FrameOffset: 40 reg: xmm2.
  04: offs 1c, unwind op 8, op info 1    UWOP_SAVE_XMM128 FrameOffset: 30 reg: xmm1.
  06: offs 17, unwind op 8, op info 0    UWOP_SAVE_XMM128 FrameOffset: 20 reg: xmm0.
  08: offs 12, unwind op 3, op info 0    UWOP_SET_FPREG.
  09: offs e, unwind op 2, op info c    UWOP_ALLOC_SMALL.
  0a: offs a, unwind op 0, op info 5    UWOP_PUSH_NONVOL reg: rbp.
  0b: offs 9, unwind op 0, op info 9    UWOP_PUSH_NONVOL reg: r9.
  0c: offs 7, unwind op 0, op info 8    UWOP_PUSH_NONVOL reg: r8.
  0d: offs 5, unwind op 0, op info 2    UWOP_PUSH_NONVOL reg: rdx.
  0e: offs 4, unwind op 0, op info 1    UWOP_PUSH_NONVOL reg: rcx.

If you need to do it by yourself you can look at https://github.com/Alois-xx/EasyHook/blob/develop/DriverShared/LocalHook/install.c in method AddUnwindInfos

By using the Windbg stack commands (k and related) I could immediately see if I was on the right track. If you want to see correct x64 method examples you can check out any C/C++ x64 compiler output to see what is possible. The dumpbin tool can also display the exception unwind infos with

dumpbin /unwindinfo dddd.exe/.dll

for any executable or dll.

The version 1 unwind codes are only necessary to describe the function Prologue which is sufficient to reconstruct the stack layout of all saved registers at any time in a method. If you have more exotic Epilogues you need to describe them with specific Unwind codes as well which were introduced with Windows 10 (I guess)? The Unwind Info version 2 is not documented at MSDN anywhere yet but you can have a look at a small snippet of the Windows Kernel stack unwinder which is copied into CoreCLR at: https://github.com/dotnet/coreclr/blob/master/src/unwinder/amd64/unwinder_amd64.cpp. See the methods OOPStackUnwinderAMD64::UnwindEpilogue, OOPStackUnwinderAMD64::VirtualUnwind and related.

After adding correct unwind codes stack walking did work only up to certain point where it still got corrupted because EasyHook repeatedly replaces the return address of the current method with the CLR stub and the backpatch routine to reset the old return address after our managed hook returns. During this time the stack is by definition not walkable because the return address of our caller is not even present on the stack while we are executing our managed callback. To get out of this EasyHook already provides a solution by replacing as direct return address the original hook entry point temporarily to make the stack walkable again. But that works only while we are in a hook handler in  terms of EasyHook speak. That means that we can restore the stack only before we call the hooked method. After that we cannot restore the return location anymore. That is a problem if we want to log the returned arguments of a hooked method (e.g. the actual handle value of CreateWindow, CreateFile, …). Since I know that our current return address must be the hook specific assembler code to restore the original caller address I did extend  NativeAPI.LhBarrierBeginStackTrace and NativeAPI.LhBarrierEndStackTrace of EasyHook to locate the hook stub code and to replace it with the original entry point of the first hooked function of the current thread. After that change EasyHook is able to get you into a state where the stack is fully walkable before and after you call the hooked method for x64 and x86.

Unfortunately the x64 ETW Stackwalker of Windows 7 stops on any method which has no matching loaded module (e.g. trampoline hook code and JITed code) so this is still not working with ETW events in Windows 7. But custom ETW events with proper Unwind Infos will work on Windows 8 or later. I still wish that MS would fix the Windows 7 ETW stackwalker because this also breaks profiling JITed x64 code on Windows 7 machines. 32 bit code works though since Windows 7.

Generic ETW Resource Tracing

At the time of writing you can see the changes on my EasyHook fork at https://github.com/Alois-xx/EasyHook/commits/develop. I hope that the changes will get into the main line of EasyHook soon. I still have left out an important aspect of a generic ETW leak tracer. How should we trace it so it is easy to diagnose with WPA? WPA for Windows 10 can graph custom event data if they are numeric and non hexadecimal Sum columns. We need to trace every resource acquire and release. Usually you get back some handle so this should be included as well. The resource could be something big like a bitmap which has  a specific memory size so it makes sense to trace the allocation size as well. The basic ETW trace calls should therefore have the form

    AcquireResource(ulong Handle, long AllocSize, string Allocator)
    ReleaseResource(ulong Handle, long AllocSize, string Allocator)

With such an API we can trace for CreateFile, CreateWindow, … and such with a simple ETW trace call all relevant information. For simple handle based resources where no metadata is available how big they are it makes sense to use 1 in the Aquire call as allocation size and -1 for the Release call. If all Aquire and Release calls for the Allocator e.g. CreateWindow are balanced out the WPA Sum column for the total allocation size (Field 2) will balance out to zero. If there is a leak we will see a positive excess in WPA for which we can filter.

 

image

Here we see e.g. that our free method which seems to do the wrong thing seems to release sometimes with an allocation size of zero 68 times which is exactly the number of outstanding allocation size which is graphed in Field 2.

As an example for CreateWindow we would issue

   AcquireResource(windowHandle, 1, "CreateWindow")

and for each hooked DestroyWindow call we would trace

   ReleaseResource(handle, -1, "DestroyWindow")

You can use the tool ETWstackwalk as a simple window handle leak detector which is now part of my EasyHook fork. You need to compile ETWStackwalk.exe as Administrator under x64 to register the ETW manifest. You get then something like this as output.

Window Handle Leak Tracing

ETWStackwalk is part of EasyHook which serves as demonstration what can be done with it. Here is its help output

ETWStackWalk.exe v1.0.0.0 by Alois Kraus 2016
Injects an ETW tracer to external processes to find resource leaks. Works for x64 since Win 8 or later. Works for 32 bit processes since Windows 7.
ETWStackWalk [-hook [createfile] [-outfile xxx.etl] [createwindow]] [-debug] [-help] [-pid ddd] [-start executable with optionalargs]
   -help                 This message
   -hook                 Hook specified methods to trace handle leaks.
                         createwindow Intercept Create/DestroyWindow calls.
                         createfile   Intercept CreateFile calls.
   -outfile xxxx.etl     Output file name to which the hook tracing data will be written to. Default is %temp%\ResourceLeaks.etl
   -pid ddd              Hook into specified process
   -start executable     Start executable and then hook after 5s to give it time to initialize.
   -noetw                Disable ETW recording (used for unit testing).
   -debug                Used to debug EasyHook trampoline code. Hook notepad.exe and start Windbg of Windows 10 SDK.
Examples:
Monitor all CreateWindow/DestroyWindow calls to detect window handle leaks in an already running application with process id ddd
 ETWStackWalk.exe -hook createwindow -pid dddd
Monitor all CreateFile calls. There you can log e.g. return code, passed file name, flags and other flags of interest.
 ETWStackWalk.exe -hook createfile -pid dddd
Test hooking a method which writes if hooked an ETW event for every allocation and deallcation event to simulate a handle leak.
 ETWStackWalk.exe -hook funcetw -start UnmanagedWithExports.exe 500000
If something goes wrong you can directly debug it with Windbg if you have the Win10 SDK installed by adding the -debug switch
 ETWStackWalk.exe -hook funcetw -debug -start UnmanagedWithExports.exe 500000

To start window handle leak tracing you need to get the pid of an already running process because EasyHook is not able to inject its code while the process is still starting the CLR which can lead to all sorts of weird errors. Instead you can attach to an already running process and tell ETWStackWalk to enable createwindow/destroywindow tracing. You can also configure and output etl file if you wish our you stick to the default.

D:\Source\git\easyhook\Build\netfx4-Release\x64>ETWStackWalk.exe -hook createwindow -outfile c:\temp\resourceLeak.etl  -pid 14960
Start ETW Recording ...
Hooking into proces 14960 for createwindow 
Press Enter to stop recording
ETWStackwalk has been installed in target 14960.


Stopping ETW Recording. This can take some minutes ...
        Stop WPR Output: To analyze this trace run: wpa "c:\temp\resourceleak.etl"
Start Analysis with wpa c:\temp\resourceleak.etl -profile easyhook.wpaprofile

I have crated a suitable easyhook.wpaprofile which is dedicated to easily analyze resource leaks.  There you get a nice overview per handle and all associated aquire and release operations. The graph shows the duration how long a handle was open until it was closed again. The Start/Stop ETW Opcode Name corresponds to acquire/release calls for this handle. If handles are reused you can still see any excess (too many aquire or release calls) in the AllocSize column which is a sum of all calls  where each CreateWindow counts with +1 and all calls to DestroyWindow count as -1. If all calls cancel out you see for that handle an summed AllocSize (Field 2) of zero. The allocation size shows you that in my test while opening a solution in VS2015 where 6 additional windows were created and not yet closed. This makes it easy to filter for the outstanding window handles which have an AllocSize > 0. Then you can open drill into the call stacks and you have found the root cause of pretty much any window handle leak with practically zero speed impact of your application while this approach is generic and can be applied to any resource leak. This tester serves only as demonstration that is based on the combination of ETW and EasyHook but with this concept we can tackle a wide range of sporadic resource leaks which are now solvable with a systematic approach.

 

image

If we hook other generic methods like CloseHandle we are in a bad position since it is a generic close method which closes many different handle types. But we can (literally) sort things out if we group in WPA by handle (Field 1) and filter away all handles which have a negative allocation size (Field 2). This removes all handles for which CloseHandle was called but no corresponding e.g. CreateFile call was traced. From that on we can continue to search for outstanding file handles by looking at the totals as usual.

Now you know why the handle is always the first argument to our ETW trace calls. If we position the handle at the same method parameter location in our ETW manifest we can group the acquire and release calls by handle which is exactly what we want in WPA got generate the graph and table above.

Hooking and ETW Performance

I was telling you that this approach is fast. So how fast can it get? The y-axis are not ms but us where the average function execution time of a hooked basically empty int Func(int *a, int *b, int *c, int *d, int *e) was measured. It is interesting that the raw function execution speed from a few nanoseconds jumps to ca. 1us with ETW tracing and goes up to 4us (measured on my Haswell i7-4770K CPU @ 3.50GHz under Windows 8.1) if ETW stackwalking is enabled.  That means we can get full ETW events and stack walking at a rate of 250.000 calls/s by using that approach. That is more than most tracing frameworks offer and these do not capture the full call stack. If that is still too slow you can use EasyHook as a pure unmanaged Interceptor which could give you perhaps a factor 2-4 but then you need to take care of a lot more things.

image

What is odd is that x86 ETW tracing seems to be slower for x86 and under x64 but I did not look into that more deeply. In general x86 code should still be faster if we do the same thing because the CPU caches do not grow but the pointer and instruction size of code and data increases under x64 which simply needs more cycles as Rico Mariani explains in A little 64 bit follow-up. As long as you do not need the big address space you should not simply buy the argument that x64 is the future because it is more modern. If you want to do your users a favor you should favor the small is beautiful approach. With small I mean small code, small and efficient data structures and small code complexity. 

WPA Performance Issue (Windows 10 SDK 10.0.10586.15)

When parsing custom ETW events WPA parses the strings in a very inefficient way which can lock up your machine easily for minutes. On my 8 core machine the custom view takes minutes to build up and uses 4 cores to parse the strings from my custom ETW events. The problem is that WPA also uses code from a static helper class which is grossly inefficient. Have a look at this profiling data:

image

WPA uses 4/8 cores where 3 cores are spending CPU cycles spinning to get a lock while only one core is doing actual work (12,5% CPU is one core on a 8 core machine). This effectively single threaded operation is also very inefficient implemented. The problem comes from TDHHelper.GetStringForPropertyAtIndex which parses a string from an ETW event. This is done in a static class under a global lock:

image

When I drill down into one thread I see that EnsureCapacity and EnsureRoomForTokensAndClear are responsible for 43% of the execution time which would go away if the List<IntPtr> instances would be reused. It could also help that the parsed strings are cached which would reduce the time quite a bit if I play around with groupings in WPA. That is a major bottleneck and should be solved as my top performance issue in the current WPA implementation. I have not noticed such long delays with WPA 8.1. A quick check reveals that the code is nearly the same in previous versions. So it was always slow but with more and more custom ETW events this display is becoming much more important than before. If you look at custom CLR events (e.g. Exceptions grouped by type) you will notice the very slow reaction of WPA due to this issue.

Outlook

If you have read this far congrats. You had to digest quite many technical details from many different areas. If you are searching for a way to get rid of Windbg for resource leak analysis and use a simpler tool ETWStackwalk could be it. It is easy to extend for any other API you are interested in and it works for x86 and x64 (Win 8 and later). If you want to intercept another API you only need to extend the method injector which is declared in

easyhook\Test\ETWStackWalk\Injector.cs

In the Injector you see some examples for CreateFile and CreateWindow which contains mostly boilerplate code. The hook generation code could certainly be simplified. A nice addition would it be to compile the interception code on the fly and pass the actual Interceptor class as configuration file which would make hooking new methods truly dynamic without the need to recompile ETWStackwalk every time a new API needs to be hooked. I had used a variation of this approach for my GDI bug in Windows which actually lead to a Windows Hotfix. But this was always bound to x86 code only. Since I have found some free time to greatly extend the usefulness of the hooking approach by making EasyHook playing nice with x64 stack walking. So far every resource leak (GDI handles, File, Semaphores, Desktop Heap, … ) needed a different approach or tool. It is time to come up with an approach which is powerful and generic enough to tackle all of such resource leaks with one tool and approach. I expect that at least the idea should become state of the art how resource problems are tackled in the next years. If you have used this approach already or you had success with the idea please share it here so we all can benefit by learning from each other. It was fun to get my hands dirty with x64 assembly and to find out how the basics of exception handling really work together. Going deep is my first nature. That was the main reason why I have studied particle physics to really understand how the universe works. The discovery of the Higgs boson was really a great thing and I can at least say that I have tested some small parts of the huge ATLAS detector (which was part of the discovery of the Higgs boson) during my time at the university 15 years ago. It is kind of cool to understand how my computer works from software to hardware down to subatomic level.

If you are wondering what on earth is that guy doing when he is not in front of a computer? Apparently this:

image

With my two kids, not my boat and my wife we were enjoying two fantastic weeks at the Caribbean sea on St Thomas this January. The kids had great fun in the warm water and this was my first vacation where I did never freeze. Even the nights during winter time were above 25 Celsius. The water is so full of live I have not seen anywhere else. The Mediterranean sea looks like a desert compared to this. If you go to Coral World at St. Thomas you see many different fishes and corrals in aquariums. But if you go snorkeling you will see all animals from the zoo again which is really amazing. During that time home in Germany the people had quite a lot of snow but we were sweating the whole time which really felt odd. That's all for today. Happy bug hunting!


Thursday, November 12, 2015 #

I see a lot of different code and issues. One interesting bug was where someone did remove a few lines of code but the regression test suite did consistently report a 100ms slowdown. Luckily the regression test suite was using ETW by default so I could compare the good baseline with the bad one and I could also take a look at the code change. The profiling diff did not make much sense since there was a slowdown but for no apparent reason in the CultureInfo.CurrentCulture.DisplayName property did become ca. 100ms slower.

How can that be? To make things even more mysterious when they changed some other unrelated code the numbers did return back to normal. After looking into it more deeply I found that the basic application logic did not slow down. But instead some unrelated methods did just become much more CPU hungry. The methods were internal CLR methods named COMInterlocked::CompareExchange64 and COMInterlocked::CompareExchange64. The interesting thing is that it happened only under 32 bit and under 64 bit the error did go away. If you are totally confused by now you are in good company. But there is hope. I had a similar problem encountered already over a year ago. I knew therefore that it has something to do with the interlocked intrinsics for 64 bit operands in 32 bit code. The most prominent on 32 bit is

lock cmpxchg8b qword ptr [some register pointing to a memory location] 

which is heavily used by the clr interlocked methods. To reproduce the problem cleanly I have written a little C program where I played a bit around to see what the real issue is. It turns out it is ……

                                                                                Memory Alignment

A picture will tell more than 1000 words:

image

The CPU cache is organized in cache lines which are usually 64 wide. You can find out the cache line size of your CPU with the nice Sysinternals tools coreinfo. On my Haswell home machine it prints something like this

Logical Processor to Cache Map:
**------  Data Cache          0, Level 1,   32 KB, Assoc   8, LineSize  64
**------  Instruction Cache   0, Level 1,   32 KB, Assoc   8, LineSize  64
**------  Unified Cache       0, Level 2,  256 KB, Assoc   8, LineSize  64
********  Unified Cache       1, Level 3,    8 MB, Assoc  16, LineSize  64

The most important number for the following is the LineSize of 64 which tells us how big the smallest memory unit is which is managed by the CPU cache controller. Now back to our slow lock cmpxchg8b instruction. The effect of the lock prefix is that one core gets exclusive access to a memory location. This is usually implemented on the CPU by locking one cache line which is quite fast. But what happens if the variable spans two cache lines? In that case the CPU seems to lock all cache lines which is much more expensive. The effect is that it is at least 10-20 times slower than before. It seems that our .NET application in x86 did allocate a 64 bit variable on a 4 byte (int32) boundary at an address that crossed two cache lines (see picture above). If by bad luck we are using variable 7 for a 64 bit interlocked operation we will cause an expensive global cache lock.

Since under 64 bit the class layout is usually 8 byte aligned we are practically never experiencing variables which are spanning two cache lines which makes all cache line related errors go away and our application was working as expected under 64 bit. The issue is still there but the class layout makes it much harder to get into this situation. But under 32 bit we can frequently find data structures with 4 byte alignment which can cause sudden slowdowns if the memory location we are hitting is sitting on a cache line boundary. Now it is easy to write a repro for the issue:

using System;
using System.Diagnostics;
using System.Globalization;

namespace InterlockedFun
{
    class Program
    {
        static void Main(string[] args)
        {
            int len = 1;
            if (args.Length == 1)
            {
                len = int.Parse(args[0]);
            }
            var b = new byte[len];

            var sw = Stopwatch.StartNew();

            var name = CultureInfo.CurrentCulture.DisplayName;

            sw.Stop();
            Console.WriteLine("DisplayName property did take {0:F0}ms", sw.Elapsed.TotalMilliseconds);
        }
    }
}

That is all. You only need to allocate on the managed heap enough data so the other data structures will at some point hit a cache line boundary. To force this you can try different byte counts with a simple for loop on the command line:

for /L %I in (1,1,64) do InterlockedFun.exe %i

At some point the measured times will change quite a lot:

InterlockedFun.exe 29
DisplayName property did take 17ms

InterlockedFun.exe 30
DisplayName property did take 17ms

InterlockedFun.exe 31
DisplayName property did take 17ms

InterlockedFun.exe 32
DisplayName property did take 17ms

InterlockedFun.exe 33
DisplayName property did take 128ms

InterlockedFun.exe 34
DisplayName property did take 93ms

InterlockedFun.exe 35
DisplayName property did take 77ms

You can play with the little sample for yourself to find the worst performing version on your machine. If you now look at WPA with a differential view you will find that CompareExchange64 is responsible for the measured difference:

image

Since that was a such a nice problem here is the actual C Code I did use to verify that the issue only pops up only at cache line boundaries:

#include "stdafx.h"
#include <windows.h>
#include <chrono>

const int Iterations = 1000 * 1000;  // yeah heavy locking
size_t  Alignment = 4; // simulate DWORD alignment

int main(int argc, char **argv)
{
    if (argc == 2)
    {
        Alignment = atoi(argv[1]);
        _tprintf(L"Alignment: %I64d", Alignment);
    }

    auto pAligned = (LONG64 *)_aligned_malloc(10 * sizeof(LONG64), 64);

    auto v1 = (LONG64 *) (((byte *)pAligned) + Alignment)+7; // Now shift our 64 byte cache line aligned variable by 4 bytes and then go 7 
                                                                // int64 to the right to land on the border of two cache lines

    auto start = std::chrono::high_resolution_clock::now();

    for (int k = 0; k < Iterations; k++)  // simulate many interlocked operations on a variable which crosses two cache lines
    {
        _InterlockedCompareExchange64(v1, 100, 100);
    }
    auto stop = std::chrono::high_resolution_clock::now();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
    _tprintf(L"\nInterlocked %d iterations did take %I64dms. Average duration interlocked operation: %f us", Iterations,  ms, (ms*1000.0f)/Iterations);

    _aligned_free(pAligned);
    return 0;
}

This will print with bad 4 byte alignment

Interlocked 1000000 iterations did take 1104ms. Average duration interlocked operation: 1.104000 us

but with 8 byte alignment

Interlocked 1000000 iterations did take 26ms. Average duration interlocked operation: 0.026000 us

That is a whooping factor of 42 faster. No wonder that the intel manual recommends to align the variables on page 258 of the 64 ia 32 architectures software developer system programing manual:

… The integrity of a bus lock is not affected by the alignment of the memory field. The LOCK semantics are followed
for as many bus cycles as necessary to update the entire operand. However, it is recommend that locked accesses
be aligned on their natural boundaries for better system performance:
• Any boundary for an 8-bit access (locked or otherwise).
• 16-bit boundary for locked word accesses.
• 32-bit boundary for locked doubleword accesses.
64-bit boundary for locked quadword accesses.  …

The word better should be written in big red letters. Unfortunately it seems that 32 bit code has a much high probability to cause random performance issues in real world applications than 64 bit code due to the memory layout of some data structures. This is not an issue which makes only your application slower. If you execute the C version concurrently

start  cmpxchg.exe && cmpxchg.exe

Then you will get not 1s but 1,5s of runtime because of the processor bus locking. In reality it is not as bad as this test suggests because if the other application uses correctly aligned variables they will operate at normal speed. But if two applications exhibit the same error they will slow each other down.

If you use an allocator which does not care about natural variable alignment rules such as the GC allocator you can run into issues which can be pretty hard to find. 64 bit code can also be plagued by such issues because we have also 128 bit interlocked intrinsics. With the AVX2 SIMD extensions memory alignment is becoming mainstream again. If people tell you that today memory alignment and CPU caches play no role in todays high level programing languages you can prove them wrong with a simple 8 line C# application. To come to an end and to answer the question of the headline: No it is not a CPU bug but an important detail how the CPU performance is affected if you use interlocked intrinsics on variables which span more than one cache line. Performance is an implementation detail. To find out how bad it gets you need to measure for yourself in your scenario.


Saturday, September 26, 2015 #

Last time I have shown how to have a look GC issues with my simplified WPA profile. But there is much more in this profile which will make it a lot easier to analyze standard issues. In general when you are dealing with performance you need to search for

  • CPU
  • Disk
  • Wait
  • Memory

issues. A special case is network which falls into the wait issue category where we wait for some external resource.

When you download my simplified profile and apply it to the the provided sample ETL file you get can analyze any of the above issues within much less time. Here is a screen shot of the default tab you will see when you open the ETL file.

Stack Tags are Important

image

The first and most important graph is CPU Usage Sampled with "Utilization by Process And StackTags" which is a customized view. It is usable for C++ as well as for .NET applications. If you ever did wonder what stack tags are good for you can see it for yourself. I have added a stack tag named Regular Expression which is set for all all calls to the .NET Regex class like this:

<Tag Name=".NET">
    <Tag Name="Regular Expression">
        <Entrypoint Module="System.ni.dll" Method="*System.Text.RegularExpressions.Regex*"/>
        <Entrypoint Module="System.dll" Method="*System.Text.RegularExpressions.Regex*"/>
    </Tag>
</Tag>

If more than one tag can match the deepest method with the first stack tag is used. This is the reason why the default stack tag file is pretty much useless. If you add tags for your application they will never match because the low level tags will match long before the tags in your own code could ever match. You have to uncomment all predefined stuff in it. If you use my stack tag file you need to remove WPA provided default.stacktags under Trace - Trace Properties - Stack Tags Definitions. In practice I overwrite the default file to get rid of it. If you leave it you will get e.g. very cheap CPU times for a GC heavy application because all call stacks of your application which will trigger GC stuff are added to the GC stack tag. This makes your application specific stack tags look much slimmer than they really are because your application tags come later in the call stack which are not used to tag your stacks.  

Why would I need a custom stack tag file? It makes it much easier to tag high level expensive operations of your application so you can see how much time you were spending with e.g. loading game data, decompressing things, rendering, … This makes it easy to detect patterns what your application was doing at a high level. Besides if you find a CPU bottleneck in your application you can add it under a e.g. a "Problems" node so you can document already known issues which are now easy to spot.

For our problematic application PerformanceIssuesGenerator.exe we see that it was doing for 17,5s CPU Regex stuff (Weight in View is ms in time units). To see how long the actual running time was we need to add the Thread ID column since currently we sum the CPU time of all threads which is not the actual clock time we spent waiting for completion.

image

 

The context menu is actually customized. It is much shorter and contains the most relevant columns I find useful. If you want more of the old columns then you can simply drag and drop columns from the View Editor menu which is part of all WPA tables. If you want to remove additional columns you can also drag and drop columns back to the left again. This way you can streamline all of your column selection context menus which is especially useful for the CPU Usage Precise context menu which is huge.

Select A Time Range to Analyze

Now we see that we have two large Regex CPU consumers with a large time gap in between. But what was the application actually doing? This is where marker events from your own application come in handy so you know what high level operation e.g. the user did trigger and how long it did take. This can be achieved with a custom ETW event provider or the special ETW marker events which WPA can display in the Marks graph if any of them are present in your ETL file. To be able to use them to navigate in the ETL file your application must of write them at interesting high level time points which indicate for example the start and stop of a user initiated action. For .NET applications the EventSource class is perfectly suited for this task. Marker events can be written with a custom PInvoke call to

Here is the code to write a marker event in C# which shows up in all kernel sessions. The TraceSession.GetKernelSessionHandles is adapted from the TraceEvent library. If you have "NT Kernel Logger" sessions only (e.g. if you use xperf) then you can use 0 as session handle to write to it. 

    /// <summary> Write to Windows Event Tracing a Marker string which does show up in the Marker tab when 
    /// you view the trace. This makes navigation in ETL files much easier. 
    /// </summary>
    public static class EtwMarker
    {
        #region Fields
        const int FixedBufferSize = 512;

        [StructLayout(LayoutKind.Sequential, Pack = 1, CharSet = CharSet.Ansi)]
        unsafe struct ETWString
        {
            public Int32 Prefix;
            public unsafe fixed byte String[FixedBufferSize];
        }
        #endregion

        /// <summary>Write a marker string to ETW to enable easier navigation in ETL files in the viewer.</summary>
        /// <param name="str">String to write</param>
        /// <exception cref="Win32Exception">When string could not be written.</exception>
        public static unsafe void WriteMark(string fmt, params object[] args)
        {
            String str = String.Format(fmt, args);
            ETWString etwStr = new ETWString();
            int len = CreateFromString(str, ref etwStr);
            var sessionHandles = TraceSession.GetKernelSessionHandles();

            // Write to all enabled kernel sessions so this marker shows up everywhere.
            foreach (var handle in sessionHandles)
            {
                int ret = EtwSetMark(handle, &etwStr, len);
                if (ret != 0)
                {
                    throw new Win32Exception(ret);
                }
            }
        }

        /// <summary>
        /// Create an ETWString from a .NET string. The struct is fixed size.
        /// Only the first 511 bytes are copied from the ASCII string.
        /// </summary>
        /// <param name="str">String to convert</param>
        /// <param name="data">Reference to ETWString struct</param>
        /// <returns>Length of ETWString</returns>
        static unsafe int CreateFromString(string str, ref ETWString data)
        {
            var bytes = Encoding.ASCII.GetBytes(str);
            int copyLen = bytes.Length < FixedBufferSize ? bytes.Length : FixedBufferSize - 1;

            fixed (void* p = &data)
            {
                IntPtr dest = IntPtr.Add(new IntPtr(p), 4); // pointer to String buffer

                // copy at most Length-1 bytes of string to keep the terminating null char
                Marshal.Copy(bytes, 0, dest, copyLen);
                int length = 4 + copyLen + 1; // Prefix DWORD + string length + closing \0 char
                return length;
            }
        }

        /// <summary>
        /// Undocumented method of the kernel to write an ETW marker which does show
        /// up in the WPA viewer under the Marks tab. This makes navigation in trace files
        /// much easier to find important time marks.
        /// This is the same as xperf -m "string" which writes only to the "NT Kernel Logger" session.
        /// </summary>
        /// <param name="sessionHandle">if 0 the NT Kernel Logger Session is written to. Otherwise you need to use supply a handle to a kernel ETW session.</param>
        /// <param name="pStr">Ascii encoded string with a DWORD prefix</param>
        /// <param name="length">Length of complete string including the prefix and terminating \0 char</param>
        /// <returns>0 on success, otherwise a WIN32 error code</returns>
        [DllImport("ntdll.dll")]
        extern unsafe static int EtwSetMark(ulong sessionHandle, ETWString* pStr, int length);
    }

Now that we have our marks we can use them to navigate to key time points in our trace session:

image

We see that the first CPU spike comes comes from RegexProcessing which did take 3,1s. The second regex block was active between the Hang_Start/Stop events which did take 2,7s. This looks like we have some real problems in our PerformanceIssueGenerator code. Since we have according to our ETW marks Regex processing, many small objects, many large objects and a hang with simultaneous regex processing we need to select one problem after the other so we can look at each issue in isolation. That is the power which custom ETW providers or ETW Marks can give you. Normally you are lost if you know that you have several issues to follow up. But with application specific context marker events you can navigate to the first regex processing issue. To do that select the first event by and then hold down the Ctrl key while clicking at the stop event to multi select events. Then you can right click on the graph to zoom into the region defined by the first and last event.

image

Analyze CPU Issue

When you now look at the CPU consumption of the busiest threads we find 2,7s of CPU time. At the bottom WPA displays the selected duration which is 3,158s which matches quite well the reported timing of 3,178s. But the reported thread time of 2,7s is not quite the observed duration. In the graph you see some drops of the CPU graph which indicates that for some short time the thread was not active possibly waiting for something else.

image

Wait Chain Analysis

That calls for a Wait Chain Analysis. If you scroll down you will find a second CPU graph with the name CPU Usage (Precise) Waits. This customized graph is perfectly suited to find not only how much CPU was consumed but also how long any thread was waiting for something. Please note that this graph does not replace the CPU Usage Sampled graph. I have explained the difference between both CPU graphs already earlier. The column selection context menu of this graph has been massively thinned out to keep only the most relevant columns. Otherwise you would have to choose from over 50 items and the context menu even has scroll bars! Now we have only process, thread id, thread stack as groupings. Next comes a list of fixed columns which are always visible because they are so important. There we see how long each thread did wait for something (WaitForSingleObject …) as total and maximum wait time and the CPU usage in milliseconds. If we sum up for the most expensive thread the wait time of 0,385s and 2,726s of CPU time we get 3,111s which is within a small error margin exactly the time we did get by measuring the start and stop of our regex processing operation.

image

Is this conclusion correct? Not so fast since a thread can be only running or waiting (it can also wait in the ready queue which is only relevant if you have more threads running than cores) the sum of CPU and Wait times for each thread will always add up to 3,1s because this is the time range you did zoom into. To actually prove that this time is really waited while we were regex processing we have to sort by the wait column and then drill down for the thread 10764.

image

When we do this we see that all waiting did occur while the DoRegexProcessing delegate was called. It was waiting on the regular expression JIT compilation to complete. Now we have proven that that wait time is really spent while executing the regex processing stuff. If we would like to optimize the total running time we have two options: Either we could use more threads to parallelize even further or we need to tune our Regular expression or replace it by something else. Before going down that route you should always check if this string processing is necessary at all. Perhaps strings are not the best data structure and you should think about your data structures. If you need to sue strings you still should verify that the regex processing was really necessary at this point of time. Eventually you do not need the results of this regex processing stuff right now.

A Garbage Collection Issue?

In the list of marker events we see that the first regex issue overlaps with a GenerateManySmallObjects operation. Lets zoom into that one and check out the what we see under CPU usage. There we see that we are consuming a significant amount of CPU in the Other stack tag which categorizes unnamed stacks into its own node. If we drill into it we find out allocating method PerformanceIssueGenerator.exe!PerformanceIssueGenerator.IssueGenerator::GenerateManySmallObjects. That is interesting. Was it called on more than one thread?

image

To answer that question it is beneficial to select the method in question and open from the context menu View Callers- By Function

image

This will let you start to drill up to all callers of that method which is useful to find the total cost of a commonly used method (e.g. garbage_collect …). This is still the total sum for all threads. Now we need to bring back out Thread Ids to see on which threads this methods was called.

image

If the Thread ID column has more than one node beneath it it will be expandable like I have shown above. This proves that only one thread was calling this method and it did use 1,3s of CPU time. But there are still 0,4s of wait time missing. Where was the rest spent? We could use the Wait graph again to see where the wait operation was hanging around but since we know that it was allocating things we can open the Region for Garbage collections

image

And we just have found that during that time the garbage collector did block our application of 0,4s which explains the missing time in our CPU graph. We also know that we had 75 GCs during that time which is a rather high number. We can even see how many GCs of which generation we did have during that time by using my (also hand rolled) Generic Events GC Generations graph:

image

That's all about the small object allocation issue. Now lets go back by doing "Undo Zoom" in the graph to get our previous time range back where we can do the same analysis for the Generate Many Large objects issue. This is not terribly interesting which I leave to you as exercise.

UI Hang (2,727s) Analysis

Now lets investigate why our UI did hang. The hang operation has nice markers which we can use to zoom into the region of interest.

image

We are having a lot of CPU intensive threads here. If the UI was hung it must either using much CPU on the UI thread or the UI thread was blocked for some reason. If you know that the UI thread was started on your Main method you can search for CorExeMain which is the CLR method which calls into your main method. Or you search for a typical window method like user32.dll!DispatchMessageWorker. With that we quickly find thread 11972 as the thread which was hung in:

image

Obviously it was waiting for a task to complete. If you call Task.Wait on your UI thread you will block further UI message processing and your UI will not be able to redraw anymore. At least that was the visible observation. The Hang did take 2,727s which exactly matches up the total summation of all wait operations with 2,724s. If that is true we should see as maximum wait time the same the same time but there we have only 1,383s. When using WPA you can look deep into the inner workings of Windows and .NET. Lets be a little curious and check out why there were not one Wait operation but 30 waits while we were blocked in Taks.Wait. We observere that Task.Wait calls into the current Dispatcher to delegate the wait to its SynchronizationContext. This in turn calls into the current CLR thread to DoAppropriateWait which in turn calls on the UI thread MsgWaitForMultipleObjectEx. This method can block a UI thread but leave e.g. mouse events or COM messages through. Depending on which which parameters you call it it can even pump your UI thread with message which is sometimes necessary. This can lead to unintended side effects that you can execute code while you were waiting for a blocking operation. I have found such issues in Can You Execute Code While Waiting For A Lock? and Unexpected Window Message Pumping Part 2.

Digging Deeper

We know that MsgWaitForMultipleObjects can pump messages with the right flags. So what was happening in these 30 wait operations? If you open the column context menu you can add Ready Thread Stack which gives you all call stacks which did cause MsgWaitForMultpleObjects or its child methods to wake up.

image

Poor other tools. They will never be able to tell you that MsgWaitForMultipleObjects was woken up 28 times to by dwm.exe which is the Desktop Window Manager to tell our wait method: Here you have 28 mouse events which you can process or not depending on how your wait flags were configured. The last event which did finally end our Task.Wait operation was the last thread that did finish processing. This is the call where our Wait operation did wake up our blocked message loop.

 

image

You can ask: How do you know that? Well the Switch-In Time column of our event in the table which is to the right of the yellow bar with 14,425 tells me this. This is were our wait operation did end and our thread was woken up again. This is also the time where the CPU consumption has a sharp drop which is no coincidence. If you have sharp eyes you will notice in the graph at the right a blue line. This is when our thread was woken up. Every thread wake up is displayed in the graph which makes it much easier to spot regular patterns of activity. It is this wealth of information which makes WPA so powerful. But the default profile is mainly aimed at kernel developers which is a pity because it is an excellent system wide profiling tool which gives you deep insights into how Windows really works and sometimes how Windows or your application breaks.

If you have counted you will still miss one ready event. This one was Visual Studio which did send us a window message from vslog.dll!VSResponsiveness::Detours::DetourPeekMessageW. I am not sure why Visual Studio does this but if you profile your system more often you will notice that the Visual Studio responsiveness stuff gets from time to time into a loop and all Visual Studio instances will send each other messages (still the case with VS2015). This does not show up as a surge of CPU consumption. But it will increase your ETL file size considerably because you will get millions of context switches. This is not a big problem except for battery powered devices which have a higher power consumption than they could have and for people profiling their machines getting huge files.

 

.NET Exception Analysis

The previous analysis was a bit complex. Now for something nice and simple. You can easily check if your .NET application throws exceptions if you enable the .NET provider. With WPRUI you need to check ".NET Activity" to get these events and many others. For a short list of interesting .NET events you can check out the ETW manifest of .NET for yourself or you read Which Keywords Has An ETW Provider Enabled? which covers the most interesting .NET events as well. If you zoom into the Do Failing Things region which is visible as ETW Marker event in the Marks graph. Then you need to add a Generic Events Graph and change the graph type from "Activty by Provider, Task, Opcode" to .NET Exceptions which is only part of my WPA profile.

image

This gives you a nice view of all thrown exceptions in all .NET applications. You can play with the groupings and group e.g. by exception type and message to see how many different ones you have. To do that you only need to Drag the "Exception Message (Field 2)" and drop it to the right of ExceptionType. Now you have Process - Exception Type - ExceptionMessage in a nice cohesive view. You just have to remember that all columns to the left of the yellow bar are the columns which are grouped together by their values.

Starting with .NET 4.6 you can also use the .NET Exceptions Caught  view which shows you all thrown exceptions and in which method these exceptions were caught.

image

This can be interesting for automated regression tests if something breaks because an exception which was always happening is now caught in a different method which might be too late in the call chain. Having an exception is nice but the most interesting part of it is its call stack. From this graph we know the Thread on which it was happening. We do not need the process id because each thread Id is unique across all processes. Then you can zoom to a specific exception an investigate in CPU Usage (Precise or Sampled) the call stacks if something related to exception handling shows up. Searching for the method KernelBase.dll!RaiseException is usually a good way to find it. If that did not help you need to enable stack walking for the .NET provider. I blogged about how to do that a long time ago for a custom provider with xperf in Semantic Tracing for .NET 4.0. If you did record the provider "Microsoft-Windows-DotNETRuntime" with stack walking enabled you can add in the View Editor from the column context menu

image

the Stack column to display the call stack of all thrown exceptions. An alternative is to enable the StackKeyword+ExceptionKeyword  of the "Microsoft-Windows-DotNETRuntime" ETW provider. This will cause .NET to write an additional ETW event for every .NET event with a managed stackwalk event. This event is not human readable but PerfView can decode it. This is one of the things where PerfView is still the best tool to investigate managed application issues.

Conclusions

Although this is already a rather long post you still have only seen a small fraction of what you can do with ETW profiling and the WPA tool. I hope that I have convinced a few more people out there to try it out because it can help you to find the problems you never knew that they existed. But these problems might be ones your customers are always complaining about.


Tuesday, August 25, 2015 #

All of us have experienced it but very few of us have gone into the details to really find the reasons for it. First of all why should I care? After the first time the application starts much faster. Well you should care if you want to give your customer not only a good second startup time but also a decent first startup experience. To do that you need to do repeated tests. The first startup is commonly referred to cold startup and all subsequent runs are warm startup scenarios. To test a cold startup scenario you need to reboot your computer every time and start your application. Not very practical and I have no one ever seen going down that road. The difference between warm and cold startup time is that the first time the dlls and data of the application need to be loaded from disc which can take quite some time. All subsequent runs can read the data not from disk but from the in memory file system cache which is managed by the operating system.

What Is The File System Cache?

Windows uses the memory of your system in some clever way. All memory that is used by your applications is shown in the Task Manager. The rest of the memory is free and not used by anything. Really? Of course not. The free memory consists of data of previously exited applications which was in use some time a ago. This "free" memory is called Standby List which is where the file system cache of Windows is really living. It is therefore not a good idea to max out your memory usage in task manager because if you do this you are flushing the file system cache which will result in a slow system. This cache is the reason why your second and all further startup times are much better. No disc accesses at all compared to the first run if all goes well.

How can you flush it?

If you want to measure cold startup times you need to clear the file system cache somehow. The easiest way is to use RAMMap from SysInternals. It has an option to flush the Standby List.

image

That is nice but unfortunately there is no command line tool available to do this in a test script to create repeatable tests. But not all is lost. Process Hacker is a better clone of Process Explorer which comes with full source code. There you can find the undocumented Windows API calls to flush the OS file system cache directly. It is basically

SYSTEM_MEMORY_LIST_COMMAND command = MemoryPurgeStandbyList;
NtSetSystemInformation(SystemMemoryListInformation, &command, sizeof(SYSTEM_MEMORY_LIST_COMMAND));

and you are done. There is also another trick I want to share: You want to clear the file system cache for a specific file? Windows will flush the file system cache if a file was opened with the FILE_FLAG_NO_BUFFERING flag. Here is little C# class that employs this trick:

    class FileSystemTools
    {
        const int FILE_FLAG_NO_BUFFERING = 0x20000000;
        const FileOptions Unbuffered = (FileOptions)FILE_FLAG_NO_BUFFERING;

        /// <summary>
        /// Flush the file system cache for this file. This ensures that all subsequent operations on the file really cause disc
        /// access and you can measure the real disc access time.
        /// </summary>
        /// <param name="file">full path to file.</param>
        public static void FlushFSCacheForFile(string file)
        {
            using (FileStream fileStream = new FileStream(file, FileMode.Open, FileAccess.Read, FileShare.ReadWrite, 4096, FileOptions.None | Unbuffered))
            {

            }
        }
    }

Hidden Dragons

If the first test run is slow and all other are fast we know that the file system cache is to blame. If you see a pattern like this

image

you know that the first run was cold and all others were warm. If we now flush the file system cache we should get rock solid cold times with no other influencers except virus scanner, indexing service and other pesky applications making our test unpredictable. If we measure again - this time with flushing the file system cache - we get this:

image

 

This is not warm and not cold. What is it then? The answer is it was mostly a warm startup because another instance of the application was still running. If the a process is alive and its working set was not trimmed then the dll data is NOT in the cache since it still belongs to the working set of a running process. You need therefore make sure that you close all applications which keep your dlls in memory (e..g Visual Studio with its designers ….) to test a real cold startup scenario. There are other pitfalls as well. If you kill a process without waiting until it has terminated you can get random good runs because a process kill is an asynchronous operation. You have to wait on the process handle until it becomes signaled to be sure that it is really gone. Or if you are using .NET you need to call Process.WaitForExit() to be sure.

After we have fixed that we get this graph:

image

What the hell is going on here? This was quite hard to find out since there is very little information on the net how to reliably test cold startup scenarios. The hidden thing is Superfetch.

Doesn´t ring a bell? Besides speeding up the boot it also tries to prefetch often used application dlls into memory before the application is started based on intelligent heuristics. This gets in our way if we flush the file system cache. After that the Superfetch system service named SysMain  which is hosted in a generic svchost.exe process tries to prefetch often used dlls at low IO priority. This is additional IO collides with our test where we want to load our dlls as well. To get consistent IO numbers it is therefore best to stop the Superfetch service.

sc stop SysMain

After doing that the graph still looks the same. What is interesting that if our application reads about 30MB more data it starts nearly 10s faster. This is a mystery which only xperf can solve. Here is a Disc IO graph of a 23s slow run

image

And here is a fast 14s run.

image

We read more data but the Disc Service Time is less! The average Disc Queue Depth is very high which should indicate an overwhelmed disc. What stands out is that we have about 70% less disc read operations on the assembly folder where the native images of .NET applications are stored. Sometimes we read the disc in a slow and sometimes in a fast way. To answer the question why we read from the disc in such different ways we need to check out the Disc IO Init call stacks. These will tell us who did initiate the read operations and most important when. First comes the slow run

image

It is no surprise that our slow cold startup causes many hard page faults which triggers the kernel to read the missing pages from disc. This is known to be slow. We can do little about it except to execute less code during startup. The graph shows a nice scattered pattern across the complete startup time. Now lets have a look at the fast run

image

That looks radically different. We load the same dlls but we have no hard faults? All dll data is read before the application main thread starts running. In our process the Prefetcher reads its prefetch database which is 1,3MB in size (see PfSnPrefetchMetaData in 687ms) which allows it to read our dlls in PfSnPrefetchSections (151MB in 3,1s) to read the dll data mostly with sequential reads in about half of the time. That is the real reason why we have so wildly varying startup times. The prefetcher will look at ca. the first 12s of application startup and will then update the prefetch database for your executable. Next time the startup is a little better and it will update it again causing it to prefetch a little more and so on. This is a nice self tuning system which not many people are aware of. From time to time the prefetcher seems to do nothing to check (that is my guess) if the prefetch database is helping at all. That explains the mostly good and sometimes very slow numbers whre no prefetching is active.

If we aim at stable cold repeatable tests we need to delete disable the Superfetch service because this process is responsible to write and update the prefetch database. The prefetch database file is located at

C:\Windows\Prefetch\ExeName-FileHashPrefetchFlag.pf

We need to delete that file as well and now we have real control over our IO during application startup. If you want to calculate the file name of your application prefetch database you need to take the full path of your executable and run it through a hash function which is described at http://forensicswiki.org/wiki/Prefetch. There are still some sites out there which tell you to delete the prefetch folder to prevent caching of dlls to make your system faster. If these guys would have really measured their startup time in a controlled and systematic manner they would not have written such misleading guidelines. As you can see you need to take care about a lot of details to really understand your system. But it is worth it. If you have looked closely at the .pf file name there is a PrefetchFlag at the end. That is a number which you can pass to generic container processes to generate for the same executable different prefetch databases. This can be useful if you have generic container processes with wildly varying dll sets inside them. How do you tell the OS loader that you want to use a different prefetch database to speed up cold startup times?

Simple: Pass it as command line parameter! If you create e.g. an executable with the name prefetchTest.exe and pass /prefetch:dd as command line argument to it you get different prefetch files for each scenario:

prefetchTest.exe
    C:\Windows\Prefetch\PREFETCHTEST.EXE-47E1C1B1.pf 
prefetchTest.exe /prefetch:1
    C:\Windows\Prefetch\PREFETCHTEST.EXE-47E1C1B2.pf
prefetchTest.exe /prefetch:2
    C:\Windows\Prefetch\PREFETCHTEST.EXE-47E1C1B3.pf
That is a nice find. Perhaps you can make your own customers happier by including a cold startup test in your test suite to make the pain public. Most test reports I have seen include only warm numbers because it is quite difficult to create repeatable and stable cold startup tests. Here are my tips as short bullets for repeatable cold startup measurements:

Test Machine Preparation

  • Stop/Disable the Superfetch (SysMain) service.
  • Disable Windows Search (WSearch) service.
  • Disable Virus scanners.
  • Disable scheduled tasks and other system management software.
  • Delete the prefetch database for your executable.

Test Execution

  • Kill all applications which keep your dlls loaded.
  • Flush the file system cache before the test.

When you follow these tips you will get easily repeatable and stable cold startup numbers. If for some reason during startup much more code is executed which causes many page faults you can easily spot it. The warm startup time might not be affected at all because the actual run time of the additional code might be low but the new disc access pattern could be pricy in terms of cold startup time.


Monday, August 17, 2015 #

When you open an etl file you normally want to see which methods were executed most often. To decode the recorded stack pointers WPA needs your symbols. By default WPA adds the Microsoft symbol server to the list. If you click on Load Symbols

image

then you will experience some slowness which is very nicely explained by Sasha here: http://blogs.microsoft.co.il/sasha/2014/12/02/diagnosing-native-memory-leaks-etw-wpa/. It is ok for the first time to get all relevant symbols from Microsoft but afterwards make sure you disable the Microsoft server from the symbol search list!

image

The symbol download folder is by default C:\Symbols. The directory after the srv* is the pdb download folder and after the second * the symbol server address is appended.

If you wonder why you need a SymCache folder:

image

The pdb format generated by the compilers is so arcane, bloated and parser unfriendly that the guys doing the pdb parsing within dbghelp.dll (part of Windbg) decided to convert the bloated pdb format into something much less bloated and much easier to parse. The SymCache folder contains therefore not the pdbs but the converted pdb files.

Here is the comparison:

image

And here the original pdb

image

Now you know why the SymCache folder is a good thing. You can safely delete it if it gets to big as long as you have the original pdbs still around nothing is lost. The vigilant reader has certainly noticed that I have added a path to C:\ProgramData\WindowsPerformanceRecorder\NGenPdbs_Cache. This is the path were WPR and xperf stores the generated managed pdbs while saving an ETL file to disk. By default WPR will create besides your xxx.etl file an extra folder named xxx.etl.NGENPDB which contains all managed pdbs from NGenned binaries. Since WPA of Windows 10 follows not its own convention to expand the symbol search path of the loaded etl file to its accompanying pdb directory I go directly to the source where WPR copies the generated pdbs from.

If you follow this advice and disable the MS symbol servers you will load the pdbs not within the next 20 minutes but in 3,6s for my sample ETL file. Kudos to the Symbol Cache and its inventors. While speaking of performance. PerfView has a much better (read faster) managed pdb generation strategy. WPA and xperf will call ngen.exe createpdb  xxxx.ni.dll for each and every native image in the trace recording. This is not necessary since you can read the pdb age guid directly from the dll and check then in the pdb cache folder if the pdb already exists. No need to call ngen.exe all the time! Finally it is not necessary to copy the generated pdbs to the xxx.etl.NGENPDB directory. My own trace collection tool simply creates a directory link to the NGenPdbs folder and I have all pdbs I could need without any file copies.

If you find that the default paths of WPA, xperf, … on your crowded C drive leave no room for pdbs you can configure WPA to a different storage folder. Or you leave all as is and create with mklink /D directory links from your C drive to a drive which has enough space. I prefer the latter approach since in my experience tools tend to rely on the defaults and at some point they will come back to it (e.g. Windbg …).


Sunday, August 16, 2015 #

Did you ever see a nice tool with a fancy UI and thought: Hey that thing is powerful. I will try it! But later you were left back scratching your head why other can use this tool and get great results but you just don't get it?

Then I have news for you I have created a WPA profile which aims at user and not kernel mode developers. Many columns were removed from the context menus to give you all the power you need to find issues in your application

where no one else has found the root cause.

The ETW profile can be downloaded from here as a zip file. Unpack it to a directory and you are ready to go. So what is inside it?

 

File Description
Simple.wpaProfile The main WPA profile you can use now.
JIT.xml WPA region file referenced by Simple.wpaProfile to get JIT times of your application like PerfView.
GC.xml WPA region file referenced by Simple.wpaProfile to get Garbage Collection metrics like in PerfView only better!
HookRegions.xml WPA region file referenced by Simple.wpaProfile to see mouse clicks in your ETL file when ETWControler is running and capturing your keyboard and mouse events.
default.stacktags Stacktag file which serves as base for your application to find common .NET issues like event handle leaks, too much regex usage, monitor contention and much more.

 

Besides this there are also some WPR profiles added to enable recording of specific .NET events like exceptions together with GC events into a larger 1 GB buffer.

WPR of Windows 10 has under Scenario Analysis .NET Activity which has a very small 160MB ring buffer which is way to small for my needs. It might be good

for you but I have added extra WPR profiles for GC and JIT as well.

To show you the difference I have created a small PerformanceIssueGenerator.exe application. This generates various issues which you can analyze with WPA.

I have recorded the data already and put the 7z file here. If you want to extract it you need to download the 7z from the official site. To view the ETL file

you need a >= Windows 8 machine and Windows Performance Toolkit from the Windows 10 SDK.

 

When you have downloaded the profile and the sample etl file you can apply the profile under Profiles - Apply…

image

Then you get two tabs. The first one contains .NET specific analysis stuff like GC and JIT. The second tab can be used for the usual bottleneck analysis regarding

CPU, disk and memory usage as well as wait chain analysis. You do not need to load the profile every time. You can save it as your default profile

by clicking on Save Startup Profile to open all of your ETL files in this view now.

image

Normally I use PerfView for GC issues to check out the GC Stats of an application to see if anything significant is going on. If you have ever used PerfView then

you will have noticed that it is a great tool combined with a crappy UI. At least for some key scenarios we can now use WPA with hand crafted region files instead of PerfView.

 

GC Views

You now get a GC view like this

image

You can visualize each GC generation type and its time it did take to execute. The Count column also tells you how many GCs you did have. This can help a lot if you

want to get GC metrics only for a specific time region in your application. Now it is easy to see how much time of your use case was available to your application

and how much time your threads did have to wait for the garbage collector. This was my main issue with PerfView and its GCStats view that it is

calculated for the complete duration of the profiling run. Most of the time I want GC metrics only for specific parts of my application because

I am optimizing mostly only at a single place at one time.

Here is how you get the GC statistics in PerfView:

 

image

 

The numbers in GC Rollup By Generation match pretty well to the region file. Also the GC pause time correlate quite well with the distribution of generations although the timings

are not 100% the same but the ratios are a good fit.

image

Since PerfView and ETW use the same data you can rightfully ask the question why there is a difference at all? The answer is that WPA sums not up all GC regions

by their duration. A WPA region is defined by a start and a stop event which is the displayed in the UI like above. But if the regions happen in parallel in several threads WPA will use as sum

only the time where at least one region was active at a time. This diagram illustrates how WPA region summation works:

image

This detail is for GC things not so relevant but it will be very important when we look at JIT statistics. Since a foreground GC is blocking your complete application there is little

concurrency going on. We can also visualize when GCs of which types are happening. The view Generic Events GC Generations will show you how many GCs of which type were happening.

Initially it is collapsed so you have to open it. This view takes a little while to build up since it needs to parse all ETW events for GC Start events which contain as data point the generation number which is

visualized here.

image

This is the greatest thing about WPA of Windows 10. Now you can visualize the data of your custom traces. With a little WPA graph configuration metric you can create for your own application key graphs

which show e.g. Frames per seconds, # of db transactions, … Who needs performance counters when you can visualize your trace data much better now?

Marker Events

A key point to know where you want to start looking for an issue are Marks. In WPRUI you can press Ctrl+Win+x to open a window which allows you to write a marker event to your trace file

while you are recording data. Unfortunately this functionality is not exposed to the outside world to set marks programmatically. If you revert to xperf you can set with xperf -m marks programmatically

if you wish to. But since marks are so useful to navigate in an ETL file I really wonder why the method which xperf calls is not documented at all. If you want to know how real power users

are using WPA then you need to call wpa /?

image

I guess at Microsoft they let their tests run with enabled profiling while saving screenshots as ETW events. If something happens the ETL file is downloaded

from a central repository and the ETL file is opened with a test specific profile. The file is opened and zoomed into the relevant test part which are identified by markers

or regions from a region file. At least some people at Microsoft use this tool so often that it makes sense to automate it even further since the controls of WPA are UI automatable to

script nearly your complete analysis. Now THAT is what I call good performance regression testing.

A little reverse engineering which APIs xperf -m calls did finally show up the method EtwSetMark which is exported by ntdll.dll. I am sure the readers of my blog can figure out the rest as well.

But I really wish it would become an officially supported API since it is so useful. Sure you can define your own marker events as well but since the support in WPA is already built in

it would really help. It would also be nice if xperf would emit the mark event no only to the "NT Kernel Logger" ETW session but all active kernel session so you could mark also the WPA kernel session

which is currently not possible.

 

JIT View

PerfView can also give you the timings how much time each method did take to compile. This is useful if your are compiling too much code on the fly for your WCF proxies or your serializers. Sure it is done only once

but if this time plays a significant part of your application startup you should rethink how your application works. Since .NET 4.5 will NGen all assemblies on the fly if they are used often enough you do not need to

consider using NGen explicitly for your application. But if you have much dynamic code generation going on you can still suffer from long JIT times.

 

image

You can get a similar view by switching from Garbage Collection to JIT Time per Thread

image

As I have shown you in the GC section. The summation of regions is not a simple sum of the duration of all JIT events. The JIT compiler can compile code in many different threads concurrently.

The Duration sum of all threads of JIT time reported by WPA is therefore largely irrelevant if you want to compare two different runs of your application. Instead you need to look at the JIT times

of each thread. You can copy the duration column with the context menu "Copy Column Selection" into Excel

image

image

which gives us the exact same JIT time as Perfview. Now I do not need PerfView for JIT analysis anymore. I am more and more getting away from programming to configuring WPA to give me just the view I need.

If you expand JITs you get at least the namespace of each JIT event. This is the best I could come up with since WPA does not support concatenating  strings of different fields into a region name.

image

But you still can open the also customized view _Activity by Provider Task Process to view the "raw" data if you need more information. If you would drop e.g. Field 5 to the left of the yellow bar you would get a nice summary how

many methods in this namespace were compiled.

image

As you can guess there are lots of more goodies inside the now published profiles to make WPA really usable for non kernel developers. I have streamlined nearly every context menu where I removed all useless or kernel only columns

from the context menus. You have a much easier time now to concentrate on the important things. With this profile WPA is nearly a new application and central analysis station for managed and unmanaged performance analysis.

In the next posts I will walk you through the problems of PerformanceIssuesGenerator and how you can visualize them in WPA efficiently.


Friday, July 31, 2015 #

It has been a long wait to finally get the new version of the new Windows Performance Toolkit. I had not much time to test the betas since until now managed call stack resolution was not working.

WPT for Windows 10 makes some nice progress which you can download here. Click on the Download Standalone SDK and run it. Then uncheck everything except Windows Performance Toolkit and

it will install on your machine within minutes. If you download the complete Windows 10 SDK you are not downloading 170MB but several GB.

 

Management Summary

  • Graphs are more readable.
  • Graphs with short spikes are much better rendered without the need to zoom into to see them.
  • Colors for graphs are changeable (finally).
  • Filtering can be undone now in the context menu.
  • Quick Search in all tables.
  • Symbol load dialog was improved.
  • You can configure your own stack tag files in Trace Properties.
  • No new Graph Types (at least with the default providers I enable).
  • 5 Added columns to CPU Usage Sampled
    • Two could be extremely useful: Source File Name and Source Line Number
  • 12 Added columns to CPU Usage Precise.
    • Two are to user mode developers useful: ProcessOutOfMemory and CPU Usage (in view).
  • Occasional crashes are also featured now.

CPU Usage Sampled

image

The biggest improvement is certainly the possibility to see source file and line numbers. You can find out this way where most CPU is spent on which line if the CPU consumption happens in your code.

A quick recap for the newbies: The CPU Usage Sampled graph is generated by taking the stack trace of all running threads 1000 times per second (this is the default). These call stacks are then added together.

A count in your Main method of 1000 means that the method has used one second of CPU time.

If the call stack ends not in your e.g. Main method when the stack trace is taken then your source file will not be displayed there. As usual you have to take a sharp look at the numbers. Your method

may be causing a high CPU consumption but it might never show up with a source and line number because the stack always ends in an external library call (printf in my case) for which I did not have the pdb loaded.

It would be nice to have file and line numbers for managed code as well but this feature is of limited use as it is now.

Ideally I want to see this stuff while I am drilling into a call stack all my way down and not only if the call stack ends in a method which I did compile.

 

For Reference here is the complete list of columns you can choose from. Green are unchanged columns, Bold are new ones. I never have found out how to give a thread a name in WPT.

If anyone knows more about the mystical thread names and how they can be set I would be very interested in it.

WPT 8.1 WPT 10
% Weight % Weight
Address Address
All Count All Count
Annotation Annotation
Compiler Optimization Compiler Optimization
Count Count
CPU CPU
Display Name Display Name
DPC/ISR DPC/ISR
Function Function
  Image RVA
Inlined Functions Inlined Functions
Is PGO'ed Is PGO'ed
Module Module
PGO Counts PGO Counts
PGO Dynamic Instruction Count PGO Dynamic Instruction Count
Priority Priority
Process Process
Process Name Process Name
Rank Rank
  Section Name
  Section RVA
  Source File Name
  Source Line Number
Stack Stack
Table Table
Thread Activity Tag Thread Activity Tag
Thread ID Thread ID
Thread Name Thread Name
Thread Start Function Thread Start Function
Thread Start Module Thread Start Module
TimeStamp TimeStamp
Trace # Trace #
Weight Weight
Weight (in view) Weight (in view)

 

CPU Usage Precise

This is by far the most complex table. With WPT 10 we have 66 columns to choose from to find out how our threads interact with each other.

I am not sure how WPA detects that a process was out of memory but I think the kernel knows quite well if something did go wrong. The ProcessOutOfMemory column could be handy

to check stress tests when the machine was under heavy load which could explain subsequent application failures.

CPU Usage (in view) seems to be an attempt to display a more true thread running time. I am not sure how this value is calculated but it seems that if all cores are in use and you

get significant Ready times then it differs quite a lot. Otherwise the differences are marginal.

 

There have been quite some renaming of columns which might render your custom filter and enablement conditions useless and WPA will show you an error.

image

If your old preset did enable by default all processes except the Idle process which clutters up the view it was set to:

([Series Name]:="NewProcess" AND NOT ([NewProcess]:="Idle (0)"))

With WPT 10 you need to change it to:

([Series Name]:="New Process" AND NOT ([New Process]:="Idle (0)"))

This happens to all users of WPT 8.1 which have saved a custom default profile which contains this enablement rule.

With an empty enablement rule your view will look like

image

and with the rule

image

you will see no difference when you apply it. Why? Because this rule only enables matching stuff but it does not disable the non matching

processes! I am with you that this is very counter intuitive but it is the way it is. If you save this as new default profile and then open a new file

you will no longer get a view where the Idle process is enabled by default.

For reference I list here all available columns with the old and new names so you have an easier time to update your custom filters, enablement

and expansion rules.

Green lines are unchanged, yellow are renamed columns and bold are new columns in WPT 10.

WPT 8.1 WPT 10
% CPU Usage % CPU Usage
  AdjustIncrement
  AdjustReason
Annotation Annotation
Count Count
Count:Waits Count:Waits
Cpu Cpu
CPU Usage (ms) CPU Usage
  CPU Usage (in view)
  DirectSwitchAttempt
IdealCpu Ideal Cpu
  KernelStackNotResident
LastSwitchOutTime (s) Last Switch-Out Time
NewInPri New In Pri
NewOutPri New Out Pri
NewPrevOutPri New Prev Out Pri
NewPrevState New Prev State
NewPrevWaitMode New Prev Wait Mode
NewPrevWaitReason New Prev Wait Reason
NewPriDecr New Pri Decr
NewProcess New Process
NewProcess Name New Process Name
NewQnt New Qnt
NewState New State
NewInSwitchTime (us) New Switch-In Time
NewThreadId New Thread Id
NewThreadStack New Thread Stack
NewThreadStartFunction New Thread Start Function
NewThreadStartModule New Thread Start Module
NewWaitMode New Wait Mode
NewWaitReason New Wait Reason
  NewThreadPrevRemainingQuantum
  NewThreadRank
  NewThreadRemainingQuantum
NextSwitchOutTime (s) Next Switch-Out Time
OldOutPri Old Out Pri
OldProcess Old Process
OldProcess Name Old Process Name
OldQnt Old Qnt
OldState Old State
OldInSwitchTime (us) Old Switch-In Time
OldThreadId Old Thread Id
OldThreadStartFunction Old Thread Start Function
OldThreadStartModule Old Thread Start Module
OldWaitMode Old Wait Mode
OldWaitReason Old Wait Reason
  OldThreadRank
  OldThreadRemainingQuantum
PrevCState Prev CState
  ProcessOutOfMemory
Ready (us) Ready
ReadyThreadStack Ready Thread Stack
ReadyTime (s) Ready Time
ReadyingProcess Readying Process
ReadyingProcess Name Readying Process Name
ReadyingThreadId Readying Thread Id
ReadyingThreadStartFunction Readying Thread Start Function
ReadyingThreadStartModule Readying Thread Start Module
  ReadyThreadInDPC
SwitchInTime (s) Switch-In Time
Table Table
Thread Activity Tag Thread Activity Tag
Thread Name Thread Name
TimeSinceLast (us) Time Since Last
Trace # Trace #
Waits (us) Waits

File IO View

The File IO view has now an additional column named File Path which acts like in Path Tree in Disc IO which is extremely useful if you do differential IO analysis. It seems that my

complaints about the not complete feature parity between Disc and File view have helped a bit.

image

This view has still a bug. If I zoom into it all graphs are disabled and I have to reenable them after every zoom operation. If I zoom out all is ok.

Generic Events

Besides some minor changes the biggest change is that finally it has got a DateTime (local) column. It shows the clock time and not the time since trace start.

This can be very useful for advanced scenarios when someone did report an issue and he did care to write down the time when it did happen.

image

Process View

I did not know it until it was removed. Process Tree is no longer with us. It was not really useful anyway.

Process Tree was added to WPT. That was the reason I never saw it before. After using it for some time I find it is easier to navigate in the tree if

you have a parent process which creates many childs.

 

Disc IO View

The Disc IO view has got only one additional column with the name Boosted which is for my disc IO always false. Not sure if this is a Windows 10 only feature. If someone knows more about it please comment.

 

Windows Performance Recorder

The recorder has not changed much. It has got some additional profiles like ones for .NET, Edge Browser, … The buffer size for .NET Activity is hard coded

to a 160 MB ring buffer which might not be enough for machines were many .NET apps are running. I still do not buy into this easy to use UI since the xml configs

need hard coding the values into it. It can be a hard value or a percentage of the system memory. Both choices are bad if I want to automate performance regression testing with

very specific workloads and tracing requirements. This GUI and the command line tool (wpr.exe) are just too inflexible.

Yes I know the complete xml has inheritance built into it so I could create derived profiles which override only the specific settings I care about. But this is still not dynamic enough

if I want to create a configuration for each problem and set them depending on the use case at hand.

image

  • xperf is still a good tool to stop a profiling session because it does a full CLR rundown.
  • But PerfView is much faster at creating the managed pdbs by not trying to recreate them over and over again with NGen.
  • Tracelog from the Windows SDK is still the only tool to start a trace session where you can configure the Win 8.1 enhancement to the ETW infrastructure.
    • E.g. filter events from a specific process, or disable events or stackwalking for user mode providers.

Summary

The new WPA tool has got some nice new features but they still need some polishing to become truly awesome. In my daily work I miss the automatic addition

of the etl.NGenPdb folder when present for the generated pdbs of an etl file. This is a useful convention and the old WPA was so nice to automatically include it to my symbol path.

WPA is very sophisticated. To use it right is an art which needs nearly daily practice. That could explain why this tool although it has a great UI has still got not so

much traction. It is like Windbg in disguise which will quickly cause information overload to the untrained brain.

But with some basic tricks which I will show in the near future you can crack otherwise unsolvable performance nuts with a few mouse clicks.


Thursday, June 18, 2015 #

Some bugs are easy some are hard. If you are dealing with a massive multithreaded distributed application consisting of many different processes talking to each other performance issues are pretty much impossible to detect with standard profilers. But with WPT the word impossible can be replaced with possible although much work is still needed. Recently I had an interesting regression issue where a use case did become over one second slower which was a quite big difference. So it should be easy to spot right? Two things were easy but one fraction ( 60% ) was still hiding in the data. Since the title of this post is already very clear here is a small test application which shows the issue:

    class Program
    {
        static Stopwatch StartTime = Stopwatch.StartNew();

        static TimeSpan WaitTime = TimeSpan.FromMilliseconds(2);

        static bool ReturnTrueAfter2ms()
        {
            if( StartTime.Elapsed.Ticks > WaitTime.Ticks )
            {
                StartTime.Restart();
                return true;
            }

            return false;
        }
        
        static void Main(string[] args)
        {
            while (true)
            {
                SpinWait.SpinUntil(ReturnTrueAfter2ms);
            }
        }
    }
In the main method we are continuously Spinwaiting for an operation which usually takes some ms (e.g. network IO) to complete. The goal of SpinWait is to spare you context switches in favor of burning CPU cycles. We are talking about some microseconds to spare instead of a context switch. Spinwait will therefore by default spin a few microseconds until it gives up and tries to yield to other threads with increasing persistence. If you look at the code of SpinWait

 

image

you find that for every 20th run it will sleep for 1ms. That in itself is not a big deal if your method takes longer to execute. Only by looking at the code all looks sound and reasonable. What could possibly go wrong? Let have a look at the costs for a context switch with Thread.Yield, Thread.Sleep ( 0 ) and Thread.Sleep( 1 ). Thread.Yield is quite cheap if the other thread has not much work to do. The cost looks like 6 microseconds for doing only a context switch and up to 70 microseconds if the other thread is doing some work until it waits again for something.

image

Thread.Sleep( 0 ) on the other hand lets the thread run forever if no other threads want to run on this CPU. The scheduler looks after the thread for ca. every 250ms where a non switching context switch event is logged but apart from that nothing is stopping my thread.

Things become interesting for Thread.Sleep(1) where we forcefully give up our thread quantum.

 

image

Remember the goal of SpinWait is to spare you an expensive context switch. It spins for about 20-80 microseconds. Then it tries from time to time a Thread.Yield, Thread.Sleep ( 0 ) and every 20th time a Thread.Sleep(1). In the comments of SpinWait it is noted that a thread can wait for a long time. In fact a Thread.Sleep(1) can sleep up to 15ms because you give up your current thread quantum and the scheduler looks after your thread only during the next clock tick which is 15,6ms later again to check if your thread is ready to run again. SpinWait is therefore only useful if your operation completes within a few microseconds in 99.9% of all cases. But if your operation frequently or nearly always takes much longer than a typical context switch then this class will not make you any faster. Instead of sparing a few microseconds you are potentially wasting up to 15 milliseconds every time you call SpingWait.SpinUntil!

If you have many call stacks in your application which cause SpinWait to sleep for a thread quantum of 15ms then you are using SpinWait wrongly!

image

This one was very hard to find because I usually search for long pauses when I am after a one second delay. But this time 600ms were scattered across dozens of calls almost evenly distributed throughout the use case because SpinWait was used in a central low level component which was frequently called.

If you looked at the Excel chart above and wondered where difference between with and without Visual Studio for Thread.Sleep(1) comes from I can tell you now. By default the windows timer ticks every 15,6ms which then triggers the thread scheduler to check out its running threads. The strange 15,6ms value comes from 64 ticks/s = 15,6 ms/tick. If someone uses a multimedia resolution timer or changes the system wide timer tick to increase the tick rate your threads will more often be watched after the thread scheduler. That is the reason why Thread.Sleep(1) can wait for "only" 2ms or 15 ms depending on which applications are running which may alter the system wide clock rate. Another great source of many more details can be found at https://randomascii.wordpress.com/2013/07/08/windows-timer-resolution-megawatts-wasted/.

You can use the clockres tool from SysInternals to check out your current timer resolution. The dynamic windows timer tick rate makes it especially tricky to find performance degradations with unit tests on machines where Visual Studio or another WPF GUI is running where the tick rate is 1kHz. But on an installed server the clock tick rate is 64 which will make a big difference. Luckily the performance regression tests did show the issue long before any customer would get this version. It really pays off to performance regression tests. It is even better if you not only execute performance regression tests but you use xperf to record profiling data for all performance tests and store profiling data somewhere. Later when a test breaks you don't need to go to the testing guys and ask them to install the current baseline which has the issue and an old baseline which was good. Then you need to profile both use cases and then you can start analyzing. What a waste of time!

It is much easier to let your central performance tests run with ETW profiling which will affect even heavy use cases < 10%. But you get all the data you could wish for to find the root cause much faster. If you train the people how to use WPA to analyze issues and how to use PerfView, Windbg, VMMap and memory dumps you can create software which has significantly less issues than your competitors because you really understand your system. The usual thinking to treat system libraries and the operating system as black boxes where miracles happen has its place to not be distracted by the many things you need to consider before you write one line of software. But you should not stop at the black boxes. Instead you should develop the ability to look into the black boxes when necessary and write a nice blog post about the found issues so other people can learn how things (not) work.


Tuesday, April 21, 2015 #

All are talking about Windows 10 but what about the developer Tools? Since 4/20/2015 you can also download the beta of the upcoming Visual Studio 2015 and Windows / Phone SDKs. This includes also a new version of the (at least for me) long awaited Windows Performance Analyzer. You can get the ISO image here:

http://dev.windows.com/en-us/windows-10-developer-preview-tools?ocid=WIP_r_Mar_Body_SDK

From there you can download the iso file for the tools. You do not need to install Visual Studio CTP for this.  Now mount the iso files and install for x64 Windows versions (x86 is also there)

\packages\OneSDK\Windows App Certification Kit x64-x86_en-us.msi

This will put the new Windows Performance Toolkit under

C:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit

You do not need Windows 10 to be able to install the toolkit beta. You can install the kit at least on Win 8.1 and (not tested) also on Win7 machines.

Update 1

You can download the released pieces with the Windows 10 SDK installer here: https://dev.windows.com/en-us/downloads/windows-10-sdk

To download just the SDK run the installer and uncheck everything except the Windows Performance Toolkit. If you want to install it offline you can choose a download directory from where you can copy  the msi files on machines with e.g. no internet connection. From the download directory you need only the files

WPTx64-x86_en-us.msi (64 bit Windows versions since Windows 7)

WPTx86-x86_en-us.msi (32 bit Windows versions since Windows 7)

So what's inside? As far as I can tell the changes are not too dramatic but some nice improvements are there.

 

Xperf

Added option to -injectonly -merge.

    -injectonly                  Do not copy (merge) events, only inject 
                                 image identification information. Cannot be 
                                 used with -mergeonly.

What is the meaning of this? You can extract from a kernel trace file the events into a separate file and then merge a user etl file which will then become readable with WPA (albeit the process names do not show up).

This will generate the extract of the kernel etl file:

xperf -merge  c:\temp\LowLatencyGCEvents.kernel.etl c:\temp\lowlatnetinjected.etl -injectonly

Now you  can merge it with a user mode session etl file without ending up with a multi GB etl file if the user mode session contains only a few interesting events in which you are interested in.

xperf -merge  c:\temp\LowLatencyGCEvents.etl c:\temp\lowlatnetinjected.etl c:\temp\mergedWithImage.etl

Besides this there are no new features to xperf that are worth noting.

WPA

Now you can open the enriched user mode etl file with WPA and finally you can read the events:

image

Which new features do you spot in the picture above?

  • Colors! You can select the colors you want for your graphs.
  • You can graph numerical data from custom events now!
    • This works currently only to some extent. I hope it will get better in the released version.
    • The main issue is that it cannot graph hex values which have defined in their event template as outType hex values:

<template tid="GCHeapStats_V1">
    <data name="GenerationSize0" inType="win:UInt64" outType="win:HexInt64"/>

    • Another issue that I would like the possibility to graph a running value as line chart (e.g. memory usage, durations, …). Currently it is displayed only as a spike in the line diagram. Ok I can use bar charts but the line diagram type looks not too useful to me since I never know how many events are hiding behind this bar. Yes I can hover over the bar and press Alt+Space (also a new feature) to see more information about the graphed values.
  • Finally you can add at least for generic events in the View Editor a DateTime (Local) column. This is a first step into the right direction. But why on earth is it still not possible to switch on local time for the x axis? This would really help when I spot an issue and I can look at the clock, write down the time, save the trace and continue with the tests. Later I can look at the gathered data and the timepoints from my notes to see if something suspicious was going on. This is even more annoying for long term recordings (hours or even days) where the customer tells you that around 2 o`clock something bad did happen. Currently I have to look into Trace - System Configuration - Traces -  Start/End (UTC) Time to calculate from the seconds since trace start and the reported time the actual time window I should take a look at.

Other Features

Symbol Loading was significantly improved. Now you can deselect symbol locations you do not want via a checkbox instead of manually editing a long string. That is good thing although I miss the automatic heuristic to add from an etl file to add xxx.etl.NGenPDB to the symbol path if the folder exists.

The symbol settings are remembered which is good. What I do not like is that although I have disabled the MS symbol server it tries to load missing symbols still from MS when I select to load the symbols automatically after trace load.

image

When this is enabled there is no way to cancel symbol load after it has started and reconfigure it. This is not the best solution for me because it can take a long time until all symbols are downloaded. I have not tried to check out what will happen on a machine with no internet connection. The old WPA had a cancel button which did work after a long time but at least it was possible to cancel the symbol download.

image

Custom stack tag files can now be added. Before that you needed to overwrite the default.stacktags file in the folder where WPA was installed.  It was easy but I still felt dirty when I did overwrite the file.

If you have not used stack tags yet you should definitely have a look at them because if you see the same issue over and over again you can add this method to your stack tag file and add it under a problem node. When the problem node gets much CPU you can spot the "old" issue within seconds without the need to drill deep into some call stacks.

image

 

What is still on my Wish List?

  • Path Tree as in File IO like in the Disc IO table view.
  • Add a process command line as column in CPU Usage graphs so I can see which generic process caused which CPU consumption.
  • Some help with Wait Chain analysis. I have seen in seen in some WPA video this option but it seems not to be published.
  • WPA still knows nothing about managed code. Basic GC metrics as in PerfView should be part of it!
  • Region files where the SelfNest feature works as expected. Ideally I would like to create a "stack trace" like structure from my custom traces as regions.
    • Currently WPA seems to ignore the TID option and uses for method enter/leave traces simply the nearest trace in the timeline which gives garbage call stacks.
  • Improved Stack Tagging
    • I want to see for expensive methods how expensive they really are. This includes GC overhead. Currently this is grouped extra under CLR - GC because the stack tags "steal" from other methods if they have a better matching call stack. The exact logic is not obvious but it could be something like a match counter. The rule with most matches wins. It would be nice to add some nesting to stacktags so I can visualize the GC overhead of specific known expensive methods.

image

My Stacktag file looks like this one:

  <Tag Name="MyApplication">
    <Tag Name="Main">
        <Entrypoint Module="LowLatencyGC.exe" Method="*Program::Main*"/>
        <Tag Name="GC">
            <Entrypoint Module="clr.dll" Method="*GC*"/>
        </Tag>

    </Tag>
  </Tag>


  <Tag Name="CLR"> 
    <Tag Name="GC">
      <Entrypoint Module="clr.dll" Method="*GC*"/>
    </Tag>
 
  </Tag>

      

I want to group all methods which call Main under MyApplication and then factor out the GC overhead. For all other methods the generic CLR GC node should match so I can see for specific methods and the "other" GC overhead nicely lined up. This is currently not possible. Perhaps a IsChild="true" attribute which then alters the tagging logic would do the trick?

That's all for now. I am sure I have missed some new features like the Analysis Assistant which currently looks like some scratch pad for my notes. Perhaps there will show up in a later build wait chain traversal? Lets keep the fingers crossed that the existing bugs of the Beta get fixed and we get an even better tool which is already awesome.


Wednesday, March 25, 2015 #

Todays little program has a simple task. It should decode a hex bitmask into its original bits as hex values from which it was composed from. As bonus we can also parse an ETW manifest and print the corresponding keywords. Since most low level tools use hex numbers which have specific bits turned on I often want to know which things were actually enabled. This is useful to e.g. see which providers PerfView enables for .NET specific stuff.

If you check in PerfView the .NET Alloc checkbox I want to see what is enabled.

xperf -Loggers | findstr /i Microsoft-Windows-DotNETRuntime

".NET Common Language Runtime":0xc14fccbd

BitmaskDecoder.exe 0xc14fccbd C:\Windows\Microsoft.NET\Framework64\v4.0.30319\CLR-ETW.man

0x1             GCKeyword
0x4             FusionKeyword
0x8             LoaderKeyword
0x10            JitKeyword
0x20            NGenKeyword
0x80            EndEnumerationKeyword
0x400           SecurityKeyword
0x800           AppDomainResourceManagementKeyword
0x4000          ContentionKeyword
0x8000          ExceptionKeyword
0x10000         ThreadingKeyword
0x20000         JittedMethodILToNativeMapKeyword
0x40000         OverrideAndSuppressNGenEventsKeyword
0x80000         TypeKeyword
0x400000        GCHeapSurvivalAndMovementKeyword
0x1000000       GCHeapAndTypeNamesKeyword
0x40000000      StackKeyword
0x80000000      ThreadTransferKeyword

"Microsoft-Windows-DotNETRuntimePrivate":0x4002000b

BitmaskDecoder.exe 0x4002000b d:\privclr\ClrEtwAll.man


0x1             undocumented
0x2             undocumented
0x8             undocumented
0x20000         MulticoreJitPrivateKeyword
0x40000000      StackKeyword

 

If you on the other hand enable ETW .NET Alloc you get

BitmaskDecoder.exe 0xc16fccbd C:\Windows\Microsoft.NET\Framework64\v4.0.30319\CLR-ETW.man

0x1             GCKeyword
0x4             FusionKeyword
0x8             LoaderKeyword
0x10            JitKeyword
0x20            NGenKeyword
0x80            EndEnumerationKeyword
0x400           SecurityKeyword
0x800           AppDomainResourceManagementKeyword
0x4000          ContentionKeyword
0x8000          ExceptionKeyword
0x10000         ThreadingKeyword
0x20000         JittedMethodILToNativeMapKeyword
0x40000         OverrideAndSuppressNGenEventsKeyword
0x80000         TypeKeyword
0x200000        GCSampledObjectAllocationHighKeyword
0x400000        GCHeapSurvivalAndMovementKeyword
0x1000000       GCHeapAndTypeNamesKeyword
0x40000000      StackKeyword
0x80000000      ThreadTransferKeyword

The only difference between .NET Alloc and ETW .NET Alloc is that PerfView addtionally enables the high sampling rate (max 100 alloc events/s) with GCSampledObjectAllocationHighKeyword. If you want to record longer with less events you can use this mask and replace GCSampledObjectAllocationHighKeyword with GCSampledObjectAllocationLowKeyword to get 20 times less allocation events (max 5 alloc events/s).

This makes it easy to find out what other people think about good ETW settings if you have the manifest for the provider. If you have found the list of hex values you want to combine you can call

BitmaskDecoder.exe -encode 0x1 0x2 0x4 0x80000

Encoded value: 0x80007

to get the corresponding hex mask. You can use it of course also for other bit masks if you omit the ETW manifest file name.

Here is the code for this simple tool:

Update: Now it also supports keywords. Thanks for the suggestion Andre.

You can now also write

BitmaskDecoder.exe -encode C:\Windows\Microsoft.NET\Framework64\v4.0.30319\CLR-ETW.man  JittedMethodILToNativeMapKeyword GCHeapAndTypeNamesKeyword

The executable can also be found here.

using System;
using System.Globalization;
using System.IO;
using System.Linq;
 
namespace BitMaskDecoder
{
    class Program
    {
        static void Main(string[] args)
        {
            try
            {
                if (args.Length == 0)
                {
                    Help();
                    return;
                }
 
                string[] lines = null;
 
                if (args[0] == "-encode")
                {
                    if( args.Length < 2 )
                    {
                        Help();
                        return;
                    }
 
                    int skip = 1;
                    if( File.Exists(args[1]) )
                    {
                        lines = File.ReadAllLines(args[1]);
                        skip++;
                    }
                    ulong res = args.Skip(skip).Select(x => x.TrimStart(LeadingHex))
                                               .Select(str => DecodeKeyword(str,lines))
                                               .Aggregate((x, y) => x | y);
                    Console.WriteLine("Encoded value: 0x{0:X}", res);
                    return;
                }
 
                if (args.Length > 1)
                {
                    lines = File.ReadAllLines(args[1]);
                }
 
                string hex = args[0].TrimStart(LeadingHex);
 
                ulong hexNum = ulong.Parse(hex, System.Globalization.NumberStyles.AllowHexSpecifier);
                for (int i = 0; i < 64; i++)
                {
                    ulong mask = 1ul << i;
                    ulong hexMask = mask & hexNum;
                    if (hexMask != 0)
                    {
                        string value = String.Format("0x{0:X}", hexMask);
                        string searchValue = value + "\"";
                        bool bFound = false;
                        if (lines != null)
                        {
                            string curProvider = null;
                            for (int line = 0; line < lines.Length; line++)
                            {
 
                                string lineStr = lines[line];
                                if( lineStr.Contains("<provider") )
                                {
                                    curProvider = ExtractName(lineStr);
                                }
 
                                // we are after lines like <keyword name="GCKeyword" mask="0x1" message="$(string.RuntimePublisher.GCKeywordMessage)" symbol="CLR_GC_KEYWORD"/>
                                if (lineStr.Contains(searchValue) && lineStr.Contains("keyword"))
                                {
                                    Console.WriteLine("{0,-8}\t{1,-50}\t{2}", value, ExtractName(lineStr), curProvider);
                                    bFound = true;
                                }
                            }
                        }
 
                        if (!bFound)
                        {
                            Console.WriteLine("{0,-8}\t{1}", value, "undocumented");
                        }
                    }
                }
            }
            catch (Exception ex)
            {
                Console.WriteLine("Error: {0} {1}", ex.Message, ex.StackTrace);
            }
        }
 
        static void Help()
        {
            Console.WriteLine("BitmaskDecoder 0xddddd [ETWManifest] or -encode [manifestfile] 0xddd 0xddddd keyword1 keyword2 ....");
            Console.WriteLine("\tDecode a hex value into its bits and print the bits as hex values.");
            Console.WriteLine("\tWhen ETWManifest file is present the file is parsed and the matching lines of the manifest with the keywords are displayed");
        }
 
        static ulong DecodeKeyword(string hexOrKeyword, string[] manifest)
        {
            ulong lret = 0;
            if (!ulong.TryParse(hexOrKeyword, NumberStyles.AllowHexSpecifier, CultureInfo.InvariantCulture, out lret))
            {
                lret = ulong.Parse(ExtractMaskValue(hexOrKeyword, manifest), NumberStyles.AllowHexSpecifier);
            }
            return lret;
        }
 
        static string ExtractName(string line)
        {
            return ExtractAttribute(line, "name");
        }
 
        static string ExtractMaskValue(string keyword, string[] lines)
        {
            string hex = lines.Where(line => line.Contains("<keyword") && line.Contains(keyword))
                              .Select(line => ExtractAttribute(line, "mask"))
                              .FirstOrDefault();
            if( hex == null )
            {
                throw new NotSupportedException(String.Format("The keyword {0} was not found in the manfifest", keyword));
            }
            return hex.TrimStart(LeadingHex);
        }
 
        static string ExtractAttribute(string line, string attribute)
        {
            return new string(line.Substring(line.IndexOf(attribute))
                                            .SkipWhile(c => c != '"') // search first "
                                            .Skip(1)                  // skip "
                                            .TakeWhile(c => c != '"') // take all until next "
                                            .ToArray());
        }
 
        static char[] LeadingHex = new char[] { '0', 'x' };
 
    }
}