Alois Kraus

blog

Sunday, December 21, 2014

We all make dumb mistakes which are not interesting to share. But sometimes people make interesting mistakes. Below is a simple helper class that makes it very easy to combine several delegates with a logical AND.

 

    static class Constraints<T>
    {
        /// <summary>
        /// Logical and of a list of predicates.
        /// </summary>
        public static Predicate<T> And(params Predicate<T>[] ands)
        {
            // return a new delegate which checks if all delegates in the array did return 
            // true for a given item
            return item => ands.All(x => x(item));
        }
    }

You can use it then like this:

   class Program
    {
        static void Main(string[] args)
        {
            int[] data = new int[20 * 1000 * 1000]; // 20 million ints

            Predicate<int> CheckForZero = Constraints<int>.And(i => i > -1, 
                                                               i => i < 1);

            var sw = Stopwatch.StartNew();
            for (int i = 0; i < data.Length; i++)
            {
                if (CheckForZero(data[i]) == false)
                {
                    Console.WriteLine("Violation found at {0}", i);
                }
            }
            sw.Stop();
            Console.WriteLine("Did take: {0:F2}s", sw.Elapsed.TotalSeconds);
        }
   }
Now you have a nice, reusable way to combine separate checks. The only issue is that with many calls to CheckForZero the running time is quite high.
Did take: 0,90s

If we compare this to the "normal" way

            for (int i = 0; i < data.Length; i++)
            {
                if ( !( data[i] > -1 && data[i] < 1) )
                {
                    Console.WriteLine("Violation found at {0}", i);
                }
            }

Did take: 0,01s

Whoa. The delegate version is 90 times slower! I found this issue by looking at the Mount Everests of CPU consumption. What is causing so much overhead?

Well, you just woke up a sleeping dragon

image

with your LINQ delegate version. It is the Garbage Collector! Here is the allocation report of our slow sample:

image

We have an 80MB int array with 20 million elements. But we allocate 2.5GB of closure classes! Our Constraints.And method needs one closure to capture the list of delegates.

This closure is allocated only once since we cache the returned delegate, so it does not show up in the profiler at all.

image

 

Things get interesting when we invoke the returned delegate. On each call we allocate one additional closure which captures the current item to check, plus a new Func<Predicate<T>, bool> delegate instance that is passed to the All LINQ call (a rough sketch of what the compiler generates is shown after the allocation summary below).

image

To sum it up, on each iteration we allocate:

  • 1 c__DisplayClass4 closure instance - 32 bytes
  • 1 Func<Predicate<T>, bool> delegate instance - 64 bytes
  • 1 SZArrayHelper+SZGenericArrayEnumerator - 32 bytes
    • Where does this come from? It is caused by the All LINQ call, which iterates over an IEnumerable<T>. When an array is enumerated through that interface an enumerator object must be obtained, and since arrays have none out of the box the CLR allocates a helper enumerator, this strange SZArrayHelper class. If you foreach directly over an array this is NOT used; more efficient indexed code is generated instead.
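
To see where these bytes come from, here is a rough, hedged sketch of what the compiler generates for item => ands.All(x => x(item)). The class and member names (ConstraintsLowered, AndsClosure, ItemClosure) are illustrative only; the real compiler-generated names look like <>c__DisplayClass4 (System and System.Linq namespaces assumed):

    static class ConstraintsLowered<T>
    {
        sealed class AndsClosure
        {
            public Predicate<T>[] ands;

            // This is the method behind the cached Predicate<T> returned from And().
            public bool Invoke(T item)
            {
                var itemClosure = new ItemClosure { item = item };   // closure for 'item', ~32 bytes per call
                Func<Predicate<T>, bool> test = itemClosure.Test;    // delegate instance, ~64 bytes per call
                return ands.All(test);                               // All() allocates the array enumerator, ~32 bytes per call
            }
        }

        sealed class ItemClosure
        {
            public T item;
            public bool Test(Predicate<T> x) { return x(item); }
        }

        public static Predicate<T> And(params Predicate<T>[] ands)
        {
            var outer = new AndsClosure { ands = ands }; // allocated only once per And() call
            return outer.Invoke;
        }
    }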

 

In total we allocate 128 bytes for each integer (4 bytes) we want to check, which is quite some overhead. 128 * 20,000,000 = 2,560,000,000 bytes, which is exactly what the profiler reports. Since I used PerfView's ETW .NET Alloc method

image 

I get a correct total sum of the allocated bytes back, but the types responsible for the allocations and especially the counts can be misleading. If you want exact allocation numbers you need to check the .NET Alloc checkbox instead, which injects a profiling dll into all newly started processes and records every single allocation. The downside is that the use case then no longer runs in about 1s but takes more than 2 minutes, which is not something I really like to do. Unfortunately exact allocation recording also does not seem to work with the old and the current PerfView (1.7) version, where I see anything but my new allocations.

image

When in doubt I would always use ETW .NET Alloc to get reliable, although sampled, numbers. This also works for JITed x64 code on Windows 7 machines, because for each allocation event a managed stack walk is performed which gives you the full allocation stack. Plain ETW stack walking does not work for JITed x64 code on Win7 unless you precompile it with NGen as a workaround. On Win 8 and beyond this is no longer an issue, except if you use API hooking: there the x64 stack walker still stops walking when the method interception is done in a form it cannot recognize.

This sounds like some complex stuff. Can we fix it? How about providing an overload to our class?

        public static Predicate<T> And(Predicate<T> p1, Predicate<T> p2)
        {
            return item => p1(item) && p2(item);
        }
 

This little overload spares us pretty much all of the overhead. We do not allocate a single byte while iterating over our array. This time we get

Did take: 0,14s

Which is over six times faster. This is quite good for such a simple overload, which the compiler prefers when it finds it as the best match. Now we only pay for three delegate invocations, which are unavoidable if we want to keep the flexibility here.

Now go and check your LINQ statements for hidden dragons. 2.5GB of allocations from one line are quite easy to achieve with delegates and LINQ. These subtle delegate issues are the main reason why LINQ is banned from the Roslyn project. It is kind of ironic that the inventors of LINQ do not want to use their own invention in their C# compiler backend for performance reasons.


Sunday, December 14, 2014

When you want fast and efficient file IO in Windows you can specify a myriad of flags to CreateFile. But which of these many flags help in which scenario? MSDN says that some read-ahead buffering is enabled when FILE_FLAG_SEQUENTIAL_SCAN is specified. First, a little look into the docs helps:

 

FILE_FLAG_SEQUENTIAL_SCAN  0x08000000

Access is intended to be sequential from beginning to end. The system can use this as a hint to optimize file caching.

This flag should not be used if read-behind (that is, reverse scans) will be used.

This flag has no effect if the file system does not support cached I/O and FILE_FLAG_NO_BUFFERING.

For more information, see the Caching Behavior section of this topic.

Under Caching Behavior is more information:

Specifying the FILE_FLAG_SEQUENTIAL_SCAN flag can increase performance for applications that read large files using sequential access. Performance gains can be even more noticeable for applications that read large files mostly sequentially, but occasionally skip forward over small ranges of bytes. If an application moves the file pointer for random access, optimum caching performance most likely will not occur. However, correct operation is still guaranteed.

FILE_FLAG_RANDOM_ACCESS   0x10000000

Access is intended to be random. The system can use this as a hint to optimize file caching.

This flag has no effect if the file system does not support cached I/O and FILE_FLAG_NO_BUFFERING.

For more information, see the Caching Behavior section of this topic.

 

The Old New Thing clears up the confusion by adding that if you specify neither flag the OS is free to use read-ahead or not depending on the observed usage pattern. An interesting note is that if you specify RANDOM_ACCESS the read pages are evicted from the file system cache with the usual LRU policy. This ensures that random reads do not pollute the file system cache. But if you read the same data again shortly after the first read you still get the cached pages back. Let's try it out by reading the first 10KB of 1000 files with two 5000 byte reads per file:

        static void ReadOnly(string dir)
        {
            string[] files = Directory.GetFiles(dir, "*.bin");
            var sw = Stopwatch.StartNew();
            int bytesRead = 0;
            foreach(var file in files)
            {
                Parser.FlushFileBufferOfFileByOpeningItUnbuffered(file);
                using (FileStream stream = new FileStream(file, FileMode.Open, FileAccess.Read, FileShare.None, 1, FileOptions.RandomAccess))
                {
                    byte[] buffer = new byte[5000];

                    bytesRead += stream.Read(buffer, 0, buffer.Length);
                    bytesRead += stream.Read(buffer, 0, buffer.Length);
                }
            }
            sw.Stop();
            Console.WriteLine("Did read {0:N0} bytes from {1} Files in {2:F2}s", bytesRead, files.Length, sw.Elapsed.TotalSeconds);
        }

To see the actual disc access every time we execute this code we need to evict the file from the file system cache first. It turns out that you can do this quite easily by opening the file once for unbuffered IO like this:

        const FileOptions FileFlagNoBuffering = (FileOptions)0x20000000;

        public static void FlushFileBufferOfFileByOpeningItUnbuffered(string fileName)
        {
            var tmp = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read, 1, FileOptions.SequentialScan | FileFlagNoBuffering);
            tmp.Dispose();
        }

FileOptions.RandomAccess

Now the file data is always read from the physical disc through the File IO layer. Since the file system cluster size is usually 4KiB, the first 5000 byte read needs 8192 bytes from disc and the second one another 4096 bytes to satisfy the 10,000 bytes we requested. After enough data has been read from the device you can see the two actual 5000 byte reads at offsets 0 and 5000, just as you would expect. From the hard disc 12288 bytes are actually read. From an IO perspective it therefore makes no difference whether you read 10000 bytes or 12KiB from a disc with a 4KiB cluster size.

image
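
To make the arithmetic explicit, here is a small helper of my own (illustrative only, not part of the original measurement; requires System.Collections.Generic) that computes how many bytes a disc with 4 KiB clusters has to deliver for a set of reads. For our two 5000 byte reads at offsets 0 and 5000 it returns 12288:

        static long BytesReadFromDisc(long[][] reads, long clusterSize)
        {
            var clusters = new HashSet<long>();
            foreach (long[] read in reads)
            {
                long offset = read[0];
                long length = read[1];
                for (long c = offset / clusterSize; c <= (offset + length - 1) / clusterSize; c++)
                {
                    clusters.Add(c); // every touched cluster is read in full
                }
            }
            return clusters.Count * clusterSize;
        }

        // BytesReadFromDisc(new[] { new[] { 0L, 5000L }, new[] { 5000L, 5000L } }, 4096) == 12288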

If you want to find the true read sizes requested by the actual reads you need to filter for the File IO Init stacks (checked with Win 8.1):

fltmgr.sys!FltpPerformPreCallbacks

fltmgr.sys!FltpPassThrough

whereas the other stacks deal with read-ahead caching and the translated calls to the actual device, where the read size is adjusted to the sector size the device actually supports.

image

You might ask what the IRP column is for. An IRP is an I/O request packet, which is similar to an opaque file handle. Each device driver can create its own IRPs to issue IO requests to other drivers. That way you can follow how different drivers split up their work. So far I have not found a useful grouping for it.

It looks like FileOptions.RandomAccess behaves largely like unbuffered IO, where you read exactly as much data as you need at the granularity of the cluster size of the actual device.

FileOptions.SequentialScan

Now we can try it a second time using FileOptions.SequentialScan. The first read is 8KiB and the next one is 184KiB, which comes from the read-ahead prefetching mechanism. The assumption that we will read much more than the first 10KB is quite optimistic and leads in this case to over 18 times more hard disc reads than we need for this use case.

image

FileOptions.SequentialScan is certainly a bad choice for small buffered reads. The timings confirm this as well.

Scenario: 1K files with 2x5000 byte reads        Time in s
FileOptions.RandomAccess                         3.9
FileOptions.SequentialScan                       6.5

FileOptions.None

Last but not least, what happens when we do not specify any option (FileOptions.None)? At least in this case it behaves exactly like SequentialScan: 196MB of data is fetched from the disc although we only wanted to read 10MB of file headers.

Conclusions

That sounds as if you should never use SequentialScan. But in most scenarios where you read a file, parse a little bit and then read the next chunk, this is exactly what you want. In theory you could use async IO to read the file asynchronously while parsing the data on another thread. But if you are lucky you can do it without extra threading on your side, because the OS is kind enough to read ahead of your application logic while you are still reading and parsing the file single threaded. You get the benefits of multithreading without using extra application threads for it!

The read-ahead logic is not optimal in all cases but helps where otherwise much more complex code would be necessary to read from a hard disc at full speed (60-300MB/s). What is the moral of this story? Never use FileOptions.SequentialScan or FileOptions.None if you want to read only the file headers! Otherwise you consume much more IO bandwidth than necessary and are left scratching your head why switching from unbuffered IO to buffered IO has made reading your files so much slower.
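
If all you need are the file headers, the takeaway boils down to a small sketch (my own helper, not from the measurement above): open the files with FileOptions.RandomAccess so the read-ahead logic does not drag in the large prefetch chunks (184KiB per file in the measurement above).

        static byte[] ReadHeader(string file, int headerSize)
        {
            using (var stream = new FileStream(file, FileMode.Open, FileAccess.Read,
                                               FileShare.Read, 1, FileOptions.RandomAccess))
            {
                byte[] buffer = new byte[headerSize];
                int read = stream.Read(buffer, 0, buffer.Length);
                Array.Resize(ref buffer, read); // trim in case the file is shorter than the header
                return buffer;
            }
        }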


Monday, December 1, 2014

Recently I got an interesting drawing issue which looks like a Windows bug. Let's suppose you want to draw a Chinese string in a Windows Forms or a WPF application. There you have two options:

  1. TextRenderer.DrawText
  2. Graphics.DrawString

The problem is that both methods lead to badly rendered text when the font used is bold. The "old" Graphics.DrawString method renders some Chinese glyphs wrong. If you try a shiny new WPF application you will find that WPF uses the equivalent of Graphics.DrawString, since the text looks the same. Graphics.DrawString and TextRenderer draw the following string with an Arial font like this:

image

Did you notice that the last 3 characters are shifted slightly upwards and rendered in some bold font? This is normally a sign that the font used (Arial in this case) does not contain all Unicode characters of the string to draw. Before drawing empty rectangles Windows falls back to the next suitable font, which looks like Microsoft YaHei in this case.

To fix this you need to tweak the registry and add a Font Link entry for Arial so that a different fallback font is preferred and the complete string is rendered with the same font. With the following setting you make SimSun the first preferred fallback font for Arial (the Arial value did not even exist on my 8.1 machine).

image
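
For reference, here is a hedged sketch of that registry tweak in code, assuming the standard FontLink\SystemLink location; the exact font file and face names may differ on your machine, writing to HKLM requires an elevated process, and you have to log off and on again afterwards:

        using Microsoft.Win32;

        class FontLinkSetup
        {
            static void Main()
            {
                const string key = @"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\FontLink\SystemLink";

                // The first entry wins: prefer SimSun whenever Arial does not contain a glyph.
                Registry.SetValue(key, "Arial",
                                  new[] { "SIMSUN.TTC,SimSun" },
                                  RegistryValueKind.MultiString);
            }
        }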

After this modification you need to log out and log in again for the change to take effect. The only issue is that Graphics.DrawString draws the string just like before. No change here. Either it ignores Font Linking in favor of some awkward switch like if (char > dddd) "Microsoft YaHei" else "Arial", or it simply has a bug. The usual suspects on StackOverflow recommend switching to TextRenderer, which works much better. But it breaks with bold fonts, which look quite bad if you do not explicitly set a background color. Here are some pictures of how the text changes as the font size grows:

FontRendering

In my application I cannot set a background color since I am drawing onto a Bitmap which is overlaid with other graphics that must remain visible behind the rendered string. The other option is to use e.g. Magenta as background and declare it the transparent color. Unfortunately this doesn't work because of the anti-aliasing, which leaves purple shadows around the drawn text and looks really ugly. So far it looks like Windows leaves you with two bad options if you want to build a professional looking, localizable graphical application:

  • Either the Chinese characters are drawn with two different fonts (TextRenderer),
  • or you end up with barely readable text (Graphics.DrawString).

If you have solved the issue without setting an explicit background color I would love to hear about your solution. The code to draw the text in both Picture boxes is like this:

        const string combined = "测试标记好不好This is a aber long";
        private void cFontSize_ValueChanged(object sender, EventArgs e)
        {
            Bitmap bmp = new Bitmap(750, 140);
            Font f = new Font("Arial", (float)cFontSize.Value, IsBold ? FontStyle.Bold : FontStyle.Regular);

            using (var g = Graphics.FromImage(bmp))
            {
                g.CompositingMode = System.Drawing.Drawing2D.CompositingMode.SourceOver;

                // Rendering with a Background Color solves the issue but 
                // this would produce boxes of white or black color in the displayed image which looks even worse
                TextRenderer.DrawText(g, combined, f, new Point(0, 0), FontColor);
            }
            cPictureBox.Image = bmp;

            Bitmap bmp2 = new Bitmap(750, 140);

            using (var g = Graphics.FromImage(bmp2))
            {
                g.CompositingMode = System.Drawing.Drawing2D.CompositingMode.SourceOver;
                g.DrawString(combined, f, FontBrush, 0, 0);
            }

            cPicture2.Image = bmp2;
        }

Pretty straightforward code, I would think. WPF is not the solution either, since it draws the text the same way as Windows Forms does. I wonder how large the Chinese market share has to be before this issue gets fixed; it is still present even in Windows 8.1.

Update 1

After adding Arial Bold as a Font Link with the same settings, it looks like Graphics.DrawString is the winner with the best rendered string again.

image

WPF still does not love my Font Linking settings. Whatever. Graphics.DrawString looks good now in all languages which is just what I need for now.

image

Update 2

It turns out that a German Windows 8.1 behaves differently than an English Windows 8.1, although I have installed the Chinese language pack on it.

  1. You do not need to log out and log in again to see the change take effect, and the Arial Bold setting was not required at all. One Font Linking setting is sufficient on a German Windows 8.1.
  2. On an English Windows 8.1 the bold fonts are still rendered wrong. Windows 7 ditto. So this is not a general solution.

The only real solution I have come up with is to disable bold fonts for Chinese UI languages and live with non-bold fonts. The Old New Thing has an "old" post about it called The danger of using boldface for Chinese characters, which also points in the direction that there is no way to render small bold Chinese glyphs that remain readable.
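
For completeness, here is a minimal sketch of that workaround (my own naming; it simply assumes the current UI culture decides whether bold is allowed, and uses the System.Drawing and System.Globalization namespaces):

        static Font CreateUiFont(float size, bool wantBold)
        {
            // Small bold Chinese glyphs are barely readable, so fall back to regular.
            bool isChineseUi = CultureInfo.CurrentUICulture.TwoLetterISOLanguageName == "zh";
            FontStyle style = (wantBold && !isChineseUi) ? FontStyle.Bold : FontStyle.Regular;
            return new Font("Arial", size, style);
        }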


Sunday, October 12, 2014

Although there was no part one, my other post "Can You Execute Code While Waiting For A Lock?" can be considered part one. I hope this will not become a larger series about code being executed in the wrong order. Suppose you want to execute the commands

  1. Init
  2. Action
  3. OtherAction
  4. Dispose
  5. Shutdown

in this order on the UI thread of an application from another thread. Then you would do something like this:

using System;
using System.Diagnostics;
using System.Threading;
using System.Windows.Forms;

namespace MessagePumping
{
    public partial class Form1 : Form
    {
        public SynchronizationContext WindowsFormsSynchronizationContext;

        public Form1()
        {
            InitializeComponent();
            var tmp = this.Handle;

            WindowsFormsSynchronizationContext = System.Windows.Forms.WindowsFormsSynchronizationContext.Current;
            BlockFor();
        }

        void BlockFor()
        {
            Barrier b = new Barrier(2);
            Thread t = new Thread(() =>
            {
                lock (this)
                {
                    b.SignalAndWait(); // let main UI thread run into lock                    
                    Thread.Sleep(300); // enter the lock on the UI thread

                    Action otherAction = () => Debug.Print("OtherAction");
                    WindowsFormsSynchronizationContext.Post((o) => Debug.Print("Init"), null);
                    WindowsFormsSynchronizationContext.Post((o) => Debug.Print("Action"), null);
                    WindowsFormsSynchronizationContext.Post((o) => otherAction(), null);
                    WindowsFormsSynchronizationContext.Post((o) => Debug.Print("Dispose"), null);
                    WindowsFormsSynchronizationContext.Post((o) => Debug.Print("Shutdown"), null);
                }
            });
            t.Start();

            b.SignalAndWait();
            lock (this) // deadlock which never releases lock
            {
            }
        }

    }
}
This will print the messages in order just as you would expect it:

Init
Action
OtherAction
Dispose
Shutdown

Now let's slightly change the code of BlockFor

 

        void BlockFor()
        {
            Barrier b = new Barrier(2);
            Thread t = new Thread(() =>
            {
                lock (this)
                {
                    b.SignalAndWait(); // let main UI thread run into lock                    
                    Thread.Sleep(300); // enter the lock on the UI thread 
                    Action otherAction = () => Debug.Print("OtherAction");
                    WindowsFormsSynchronizationContext.Post((o) => Debug.Print("Init"), null);
                    WindowsFormsSynchronizationContext.Post((o) => Debug.Print("Action"), null);
                    WindowsFormsSynchronizationContext.Post((o) =>
                    {
                        WindowsFormsSynchronizationContext.Send((o2) => otherAction(), null);
                    },null);
                    WindowsFormsSynchronizationContext.Post((o) => Debug.Print("Dispose"), null);
                    WindowsFormsSynchronizationContext.Post((o) => Debug.Print("Shutdown"), null);
                }
            });
            t.Start();

            b.SignalAndWait();
            lock (this) // deadlock which never releases lock
            {
            }
        }

and check whether a Send issued from within a posted callback works as expected:

Init
Action
OtherAction
Dispose
Shutdown

The output is still the same. So far so good. Now keep the code but change something else, and we get:

Init
Action
Dispose
Shutdown
OtherAction

What has changed?

Still no idea?

 

The SynchronizationContext! You thought you were using a WindowsFormsSynchronizationContext? No you weren't. WindowsFormsSynchronizationContext.Current does not return a WindowsFormsSynchronizationContext but a SynchronizationContext instance. The reason is that the static Current property is inherited from SynchronizationContext, so you were using whatever the current SynchronizationContext happened to be. In the first run it was actually a WPF DispatcherSynchronizationContext instance which worked just as expected. But if we force the usage of a WindowsFormsSynchronizationContext by setting it explicitly

        public Form1()
        {
            InitializeComponent();
            var tmp = this.Handle;

            SynchronizationContext.SetSynchronizationContext(new WindowsFormsSynchronizationContext());
            WindowsFormsSynchronizationContext = System.Windows.Forms.WindowsFormsSynchronizationContext.Current;
            BlockFor();
        }

Then we get the faulty behavior that Shutdown is executed before our OtherAction, which was not our intention. The first question is: why are we getting this? When using window messages you are adding messages to a queue. When you Post a message you add it at the end of the queue. The same happens if you Send a message, with the only exception that the call blocks until all pending messages and your message have been processed. Our OtherAction message is special since it calls Send into the message pump while it is itself being processed. The logical picture you have in mind is

image

 

but what actually happens is that while your wrapper call is executed it enqueues OtherAction at the end of the message queue and waits until OtherAction has been called. But since the Dispose and Shutdown messages come first our OtherAction message will be called far too late.

image

Ok, that explains why WinForms sucks here. But why did it work with the WPF Dispatcher? The WPF Dispatcher uses a little optimization: if you Send a message from the thread that runs the message pump it will NOT be enqueued but executed directly. The WPF Dispatcher model is therefore much saner than Winforms and prevents you from shooting yourself in the foot. If you do not use the WinForms synchronization context but call Control.BeginInvoke/Invoke directly, you can fix this faulty model by checking InvokeRequired

                    Action otherAction = () => Debug.Print("OtherAction");
                    this.BeginInvoke((Action) (() => Debug.Print("Init")));
                    this.BeginInvoke((Action) (() => Debug.Print("Action")));
                    this.BeginInvoke((Action) (() =>
                    {
                        if (this.InvokeRequired) // only enqueue to message queue if we are beeing called from a different thread
                        {
                            this.Invoke((Action)(() => otherAction()));
                        }
                        else
                        {
                            otherAction(); // otherwise execute the callback directly
                        }
                    }));
                    this.BeginInvoke((Action) (() => Debug.Print("Dispose")));
                    this.BeginInvoke((Action) (() => Debug.Print("Shutdown"))); 

The rule to check InvokeRequired before actually enqueuing messages works well and helps to alleviate the problem of unintended message reordering. Unfortunately this does not work with WindowsFormsSynchronizationContext since it exposes no InvokeRequired to do the check yourself. Your best bet is therefore either to abandon the usage of WindowsFormsSynchronizationContext or to simply use the WPF DispatcherSynchronizationContext, which works well. But you should be aware that you then have up to three message queues in parallel. First there is the window message queue of your window. Then comes the Winforms message queue, which uses its own window message to start working on queued messages. And as the last queue you have the WPF queue, which also has its own private message queue and window message.
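
If you have to stay with a SynchronizationContext-shaped API, one possible workaround is a small wrapper of my own (a sketch, not a framework class) that mimics the WPF Dispatcher behavior: callbacks sent from the UI thread are executed inline, so a Send issued from a posted callback cannot be reordered behind later Post messages. It assumes the wrapper is created on the UI thread and uses System.Threading.

        sealed class InlineSendContext
        {
            readonly SynchronizationContext _inner;
            readonly int _uiThreadId;

            // Create this on the UI thread and pass the real WindowsFormsSynchronizationContext.
            public InlineSendContext(SynchronizationContext inner)
            {
                _inner = inner;
                _uiThreadId = Thread.CurrentThread.ManagedThreadId;
            }

            public void Post(SendOrPostCallback d, object state)
            {
                _inner.Post(d, state);
            }

            public void Send(SendOrPostCallback d, object state)
            {
                if (Thread.CurrentThread.ManagedThreadId == _uiThreadId)
                {
                    d(state);              // already on the UI thread: run inline, no reordering
                }
                else
                {
                    _inner.Send(d, state); // cross thread: marshal as usual
                }
            }
        }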

If you interleave WPF and WinForms messages in some calling code like this

        public WindowsFormsSynchronizationContext WinFormsContext;
        public DispatcherSynchronizationContext WPFContext;

        public Form1()
        {
            InitializeComponent();
            var tmp = this.Handle;

            WinFormsContext = new System.Windows.Forms.WindowsFormsSynchronizationContext();
            WPFContext = new DispatcherSynchronizationContext();
            BlockFor();
        }

        void BlockFor()
        {
            Barrier b = new Barrier(2);
            Thread t = new Thread(() =>
            {
                lock (this)
                {
                    b.SignalAndWait(); // let main UI thread run into lock                    
                    Thread.Sleep(300); // enter the lock on the UI thread

                    Action otherAction = () => Debug.Print("OtherAction");
                    WPFContext.Post((o) => Debug.Print("Init"), null);
                    WinFormsContext.Post((o) => Debug.Print("WinForms Init"),null);
                    WPFContext.Post((o) => Debug.Print("Action"), null);
                    WinFormsContext.Post((o) => Debug.Print("WinForms Action"), null);
                    WPFContext.Post((o) =>
                       WPFContext.Send((o2) => otherAction(), null), null);
                    WPFContext.Post((o) => Debug.Print("Dispose"), null);
                    WinFormsContext.Post((o) => Debug.Print("WinForms Dispose"), null);
                    WPFContext.Post((o) => Debug.Print("Shutdown"), null);
                    WinFormsContext.Post((o) => Debug.Print("WinForms Shutdown"), null);
                }
            });
            t.Start();

then you do not get interleaved messages as the calling code suggests but this:

Init
WinForms Init
WinForms Action
WinForms Dispose
WinForms Shutdown

Action
OtherAction
Dispose
Shutdown

This is another subtle difference between WPF and Windows Forms message pumping: WPF executes only one posted message at a time, whereas Winforms executes all pending posted messages at once until none are left. If you have a mixed WPF/Winforms application and want to get rid of Invokes in favor of BeginInvokes because message pumping is evil, you are just about to find the next trap door late during your system integration test. If you can, you should stick to only one message queue and prefer Invoke/Send where possible. If you are using Winforms BeginInvoke you not only do not know when it will run (perhaps too late), but if you need coordination with WPF window messages you are out of luck in this scenario. I am sure you would not design a system this way, but some day someone will add a hidden dependency between both message pumps, leaving you little chance to design your way out of this mess.

I hope that MS will address these (obvious?) deficiencies in Winforms message pumping so we do not have to worry about stuff we never wanted to know in such great detail in the first place.

Update 1

No, I will not make a separate post out of this. The last example has an interesting twist if you are using the WPF Dispatcher: its behavior has changed between .NET Framework 4.0 and 4.5. Depending on the target framework you are compiling for (.NET 4.0 or 4.5+) you get different behavior. The .NET 4.0 Dispatcher WILL pump messages just like WinForms. This means the WPF context code above only prints the expected order when compiled for .NET 4.5:

Init
Action
OtherAction
Dispose
Shutdown

But if you compile for .NET 4.0

image

you get

Init
Action
Dispose
Shutdown
OtherAction

just like in the bad case of Windows Forms. The magic happens in the DispatcherSynchronizationContext class at:

public override void Send(SendOrPostCallback d, object state)
{
    if (BaseCompatibilityPreferences.GetInlineDispatcherSynchronizationContextSend() && this._dispatcher.CheckAccess())
    {
        this._dispatcher.Invoke(DispatcherPriority.Send, d, state);
        return;
    }
    this._dispatcher.Invoke(this._priority, d, state);
}

The GetInlineDispatcherSynchronizationContextSend check returns true or false depending on what is set in the TargetFramework attribute of your currently executing assembly:

[assembly: TargetFramework(".NETFramework,Version=v4.5.2", FrameworkDisplayName = ".NET Framework 4.5.2")]

Interesting. You can find future .NET Framework version numbers (but not their release dates) in the BinaryCompatibilityMap, which contains the promisingly named AddQuirksForFramework method.

 

        if (buildAgainstVersion >= 50000)
        {
            this.TargetsAtLeast_Desktop_V5_0 = true;
        }
        if (buildAgainstVersion >= 40504)
        {
            this.TargetsAtLeast_Desktop_V4_5_4 = true;
        }
        if (buildAgainstVersion >= 40503)
        {
            this.TargetsAtLeast_Desktop_V4_5_3 = true;
        }
        if (buildAgainstVersion >= 40502)
        {
            this.TargetsAtLeast_Desktop_V4_5_2 = true;
        }
        if (buildAgainstVersion >= 40501)
        {
            this.TargetsAtLeast_Desktop_V4_5_1 = true;
        }
        if (buildAgainstVersion >= 40500)
        {
            this.TargetsAtLeast_Desktop_V4_5 = true;
            this.AddQuirksForFramework(TargetFrameworkId.Phone, 70100);
            this.AddQuirksForFramework(TargetFrameworkId.Silverlight, 50000);
            return;
        }
       …

So we can expect a .NET 4.5.3, 4.5.4 and .NET 5.0 in the future with some confidence now.


Tuesday, October 7, 2014

Threading can sometimes lead to impossible bugs where you have a hard time finding out how you got into a given state. Let's suppose you have a nice WPF application which you dutifully test via a UIAutomation provider that can alter some state in your application.

    /// <summary>
    /// Interaction logic for TestWindow.xaml
    /// </summary>
    public partial class TestWindow : Window
    {
        int Data;

        public TestWindow()
        {
            InitializeComponent();
        }

        public void AlterData(int newValue)
        {
            Debug.Print("Executing AlterData {0}", Environment.StackTrace);
            Data = newValue;
        }

        protected override AutomationPeer OnCreateAutomationPeer()
        {
            var lret = new OwnPeer(this);
            return lret;
        }

        class OwnPeer : WindowAutomationPeer
        {
            public OwnPeer(TestWindow owner)
                : base(owner)
            { }

            protected override HostedWindowWrapper GetHostRawElementProviderCore()
            {
                Debug.Print("GetHostRawElementProviderCore called");
                ((TestWindow)base.Owner).AlterData(500);
                return base.GetHostRawElementProviderCore();
            }
        }
}

When you look at the call stacks all is normal:

Executing AlterData    at System.Environment.GetStackTrace(Exception e, Boolean needFileInfo)
   at System.Environment.get_StackTrace()
   at ThreadingIsHard.TestWindow.AlterData(Int32 newValue) in d:\Source\ThreadingIsHard\ThreadingIsHard\TestWindow.xaml.cs:line 52
   at ThreadingIsHard.TestWindow.OwnPeer.GetHostRawElementProviderCore() in d:\Source\ThreadingIsHard\ThreadingIsHard\TestWindow.xaml.cs:line 69
   at MS.Internal.Automation.ElementProxy.InContextGetHostRawElementProvider(Object unused)
   at MS.Internal.Automation.ElementUtil.<>c__DisplayClass1.<Invoke>b__0(Object unused)
   at System.Windows.Threading.ExceptionWrapper.InternalRealCall(Delegate callback, Object args, Int32 numArgs)
   at MS.Internal.Threading.ExceptionFilterHelper.TryCatchWhen(Object source, Delegate method, Object args, Int32 numArgs, Delegate catchHandler)
   at System.Windows.Threading.DispatcherOperation.InvokeImpl()
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Windows.Threading.DispatcherOperation.Invoke()
   at System.Windows.Threading.Dispatcher.ProcessQueue()
   at System.Windows.Threading.Dispatcher.WndProcHook(IntPtr hwnd, Int32 msg, IntPtr wParam, IntPtr lParam, Boolean& handled)
   at MS.Win32.HwndWrapper.WndProc(IntPtr hwnd, Int32 msg, IntPtr wParam, IntPtr lParam, Boolean& handled)
   at MS.Win32.HwndSubclass.DispatcherCallbackOperation(Object o)
   at System.Windows.Threading.ExceptionWrapper.InternalRealCall(Delegate callback, Object args, Int32 numArgs)
   at MS.Internal.Threading.ExceptionFilterHelper.TryCatchWhen(Object source, Delegate method, Object args, Int32 numArgs, Delegate catchHandler)
   at System.Windows.Threading.Dispatcher.LegacyInvokeImpl(DispatcherPriority priority, TimeSpan timeout, Delegate method, Object args, Int32 numArgs)
   at MS.Win32.HwndSubclass.SubclassWndProc(IntPtr hwnd, Int32 msg, IntPtr wParam, IntPtr lParam)
   at MS.Win32.UnsafeNativeMethods.DispatchMessage(MSG& msg)
   at MS.Win32.UnsafeNativeMethods.DispatchMessage(MSG& msg)
   at System.Windows.Threading.Dispatcher.PushFrameImpl(DispatcherFrame frame)
   at System.Windows.Application.RunInternal(Window window)
   at System.Windows.Application.Run()
   at ThreadingIsHard.App.Main() in d:\Source\UIAutomationSample\UIAutomationTest\obj\Debug\App.g.cs:line 0

So far, so expected. Now let's make our well-behaving application a little more unexpected and take a lock for some time. To make it simple I deadlock the main UI thread by trying to enter a lock which is owned by another thread, like this:

        private void Button_Click(object sender, RoutedEventArgs e)
        {
            Deadlock();
        }

        public void Deadlock()
        {
            Barrier b = new Barrier(2);
            Thread t = new Thread(() =>
            {
                lock (this)
                {
                    b.SignalAndWait(); // let main UI thread run into lock
                    Thread.Sleep(Timeout.Infinite); // never leave the lock
                }
            });
            t.Start();

            b.SignalAndWait();
            lock (this) // deadlock which never releases lock
            {
            }
        }

If you have never seen Barrier: it is a class introduced with .NET 4.0 which is very handy to synchronize threads to a specific point in time. Only when both threads have signaled are they released, at the same time. In our case this ensures a nice deadlock.
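
As a minimal standalone illustration of Barrier (a sketch of my own, independent of the form code above):

        using System;
        using System.Threading;

        class BarrierDemo
        {
            static void Main()
            {
                Barrier barrier = new Barrier(2); // two participants

                Thread worker = new Thread(() =>
                {
                    Console.WriteLine("Worker ready");
                    barrier.SignalAndWait();      // blocks until Main has also signaled
                    Console.WriteLine("Worker released");
                });
                worker.Start();

                Console.WriteLine("Main ready");
                barrier.SignalAndWait();          // both threads are released right here
                Console.WriteLine("Main released");
                worker.Join();
            }
        }

Here is the quiz: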

Can something bad happen while the main UI thread is deadlocked?

As it turns out, it can. Let's have a look at this impossible call stack:

Executing AlterData    at System.Environment.GetStackTrace(Exception e, Boolean needFileInfo)
   at System.Environment.get_StackTrace()
   at ThreadingIsHard.TestWindow.AlterData(Int32 newValue) in d:\Source\ThreadingIsHard\ThreadingIsHard\TestWindow.xaml.cs:line 25
   at ThreadingIsHard.TestWindow.OwnPeer.GetHostRawElementProviderCore() in d:\Source\ThreadingIsHard\ThreadingIsHard\TestWindow.xaml.cs:line 44
   at MS.Internal.Automation.ElementProxy.InContextGetHostRawElementProvider(Object unused)
   at MS.Internal.Automation.ElementUtil.<>c__DisplayClass1.<Invoke>b__0(Object unused)
   at System.Windows.Threading.ExceptionWrapper.InternalRealCall(Delegate callback, Object args, Int32 numArgs)
   at MS.Internal.Threading.ExceptionFilterHelper.TryCatchWhen(Object source, Delegate method, Object args, Int32 numArgs, Delegate catchHandler)
   at System.Windows.Threading.Dispatcher.LegacyInvokeImpl(DispatcherPriority priority, TimeSpan timeout, Delegate method, Object args, Int32 numArgs)
   at MS.Internal.Automation.ElementUtil.Invoke(AutomationPeer peer, DispatcherOperationCallback work, Object arg)
   at MS.Internal.Automation.ElementProxy.get_HostRawElementProvider()
   at MS.Internal.Automation.UiaCoreProviderApi.RawUiaReturnRawElementProvider(IntPtr hwnd, IntPtr wParam, IntPtr lParam, IRawElementProviderSimple el)
   at MS.Internal.Automation.UiaCoreProviderApi.RawUiaReturnRawElementProvider(IntPtr hwnd, IntPtr wParam, IntPtr lParam, IRawElementProviderSimple el)
   at System.Windows.Automation.Provider.AutomationInteropProvider.ReturnRawElementProvider(IntPtr hwnd, IntPtr wParam, IntPtr lParam, IRawElementProviderSimple el)
   at System.Windows.Interop.HwndTarget.CriticalHandleWMGetobject(IntPtr wparam, IntPtr lparam, Visual root, IntPtr handle)
   at System.Windows.Interop.HwndTarget.HandleMessage(WindowMessage msg, IntPtr wparam, IntPtr lparam)
   at System.Windows.Interop.HwndSource.HwndTargetFilterMessage(IntPtr hwnd, Int32 msg, IntPtr wParam, IntPtr lParam, Boolean& handled)
   at MS.Win32.HwndWrapper.WndProc(IntPtr hwnd, Int32 msg, IntPtr wParam, IntPtr lParam, Boolean& handled)
   at MS.Win32.HwndSubclass.DispatcherCallbackOperation(Object o)
   at System.Windows.Threading.ExceptionWrapper.InternalRealCall(Delegate callback, Object args, Int32 numArgs)
   at MS.Internal.Threading.ExceptionFilterHelper.TryCatchWhen(Object source, Delegate method, Object args, Int32 numArgs, Delegate catchHandler)
   at System.Windows.Threading.Dispatcher.LegacyInvokeImpl(DispatcherPriority priority, TimeSpan timeout, Delegate method, Object args, Int32 numArgs)
   at MS.Win32.HwndSubclass.SubclassWndProc(IntPtr hwnd, Int32 msg, IntPtr wParam, IntPtr lParam)
   at System.Threading.SynchronizationContext.WaitHelper(IntPtr[] waitHandles, Boolean waitAll, Int32 millisecondsTimeout)
   at System.Windows.Threading.DispatcherSynchronizationContext.Wait(IntPtr[] waitHandles, Boolean waitAll, Int32 millisecondsTimeout)
   at System.Threading.Monitor.Enter(Object obj)
   at ThreadingIsHard.TestWindow.Deadlock() in d:\Source\ThreadingIsHard\ThreadingIsHard\TestWindow.xaml.cs:line 69
   at ThreadingIsHard.TestWindow.Button_Click(Object sender, RoutedEventArgs e) in d:\Source\ThreadingIsHard\ThreadingIsHard\TestWindow.xaml.cs:line 52
   at System.Windows.EventRoute.InvokeHandlersImpl(Object source, RoutedEventArgs args, Boolean reRaised)
   at System.Windows.UIElement.RaiseEventImpl(DependencyObject sender, RoutedEventArgs args)
   at System.Windows.Controls.Button.OnClick()
   at System.Windows.Controls.Primitives.ButtonBase.OnMouseLeftButtonUp(MouseButtonEventArgs e)
   at System.Windows.RoutedEventArgs.InvokeHandler(Delegate handler, Object target)
   at System.Windows.EventRoute.InvokeHandlersImpl(Object source, RoutedEventArgs args, Boolean reRaised)
   at System.Windows.UIElement.ReRaiseEventAs(DependencyObject sender, RoutedEventArgs args, RoutedEvent newEvent)
   at System.Windows.RoutedEventArgs.InvokeHandler(Delegate handler, Object target)
   at System.Windows.EventRoute.InvokeHandlersImpl(Object source, RoutedEventArgs args, Boolean reRaised)
   at System.Windows.UIElement.RaiseEventImpl(DependencyObject sender, RoutedEventArgs args)
   at System.Windows.UIElement.RaiseTrustedEvent(RoutedEventArgs args)
   at System.Windows.Input.InputManager.ProcessStagingArea()
   at System.Windows.Input.InputProviderSite.ReportInput(InputReport inputReport)
   at System.Windows.Interop.HwndMouseInputProvider.ReportInput(IntPtr hwnd, InputMode mode, Int32 timestamp, RawMouseActions actions, Int32 x, Int32 y, Int32 wheel)
   at System.Windows.Interop.HwndMouseInputProvider.FilterMessage(IntPtr hwnd, WindowMessage msg, IntPtr wParam, IntPtr lParam, Boolean& handled)
   at System.Windows.Interop.HwndSource.InputFilterMessage(IntPtr hwnd, Int32 msg, IntPtr wParam, IntPtr lParam, Boolean& handled)
   at MS.Win32.HwndWrapper.WndProc(IntPtr hwnd, Int32 msg, IntPtr wParam, IntPtr lParam, Boolean& handled)
   at MS.Win32.HwndSubclass.DispatcherCallbackOperation(Object o)
   at System.Windows.Threading.ExceptionWrapper.InternalRealCall(Delegate callback, Object args, Int32 numArgs)
   at MS.Internal.Threading.ExceptionFilterHelper.TryCatchWhen(Object source, Delegate method, Object args, Int32 numArgs, Delegate catchHandler)
   at System.Windows.Threading.Dispatcher.LegacyInvokeImpl(DispatcherPriority priority, TimeSpan timeout, Delegate method, Object args, Int32 numArgs)
   at MS.Win32.HwndSubclass.SubclassWndProc(IntPtr hwnd, Int32 msg, IntPtr wParam, IntPtr lParam)
   at MS.Win32.UnsafeNativeMethods.DispatchMessage(MSG& msg)
   at MS.Win32.UnsafeNativeMethods.DispatchMessage(MSG& msg)
   at System.Windows.Threading.Dispatcher.PushFrameImpl(DispatcherFrame frame)
   at System.Windows.Application.RunInternal(Window window)
   at System.Windows.Application.Run()
   at ThreadingIsHard.App.Main() in d:\Source\UIAutomationSample\UIAutomationTest\obj\Debug\App.g.cs:line 0

Hm, the lock is still held by the other thread, but how can AlterData be called while we are trying to enter the lock which was clearly never released? We even see the blocked Monitor.Enter in the call stack (green line). To understand this you need to take a deep look back into the history of STA threads and window message pumping. Monitor.Enter is not a simple WaitForSingleObject on an STA thread. Instead it calls into CoWaitForMultipleHandles which ultimately calls MsgWaitForMultipleObjectsEx. Here is the (simplified) call stack:

USER32!MsgWaitForMultipleObjectsEx
combase!CCliModalLoop::SetPointerInputModeAsAppropriateForQueueAttachmentConditions
combase!CCliModalLoop::BlockFn
combase!ClientCallRetryContext::StartTimer
combase!ClassicSTAThreadWaitForHandles
combase!CoWaitForMultipleHandles
System.Windows.Threading.DispatcherSynchronizationContext.Wait(IntPtr[], Boolean, Int32)
System.Threading.Monitor.Enter(System.Object) 
ThreadingIsHard.TestWindow.Deadlock() [d:\Source\ThreadingIsHard\ThreadingIsHard\TestWindow.xaml.cs @ 69] 
ThreadingIsHard.TestWindow.Button_Click(System.Object, System.Windows.RoutedEventArgs) [d:\Source\ThreadingIsHard\ThreadingIsHard\TestWindow.xaml.cs @ 52] 
System.Windows.EventRoute.InvokeHandlersImpl(System.Object, System.Windows.RoutedEventArgs, Boolean)

MsgWaitForMultipleObjectsEx is just like WaitForMultipleObjects with the important exception that the wait can also return when specific window messages are put into the window message queue. The official docs are silent about which window messages are pumped. I have tried

  • Keyboard press
  • Mouse move
  • Mouse button
  • WM_PAINT
  • BeginInvoke
  • Invoke
  • Dispatcher.Invoke

NONE of these window messages are pumped while we are blocked waiting for the lock. Chris Brumme says about this in his blog:

On lower operating systems, the CLR uses MsgWaitForMultipleHandles / PeekMessage / MsgWaitForMultipleHandlesEx and whatever else is available on that version of the operating system to try to mirror the behavior of CoWaitForMultipleHandles. The net effect is that we will always pump COM calls waiting to get into your STA. And any SendMessages to any windows will be serviced. But most PostMessages will be delayed until you have finished blocking.

Although the blog post is over 10 years old the information is still relevant. From what I have seen so far, all window messages are blocked except for COM calls, as is the case for UIAutomation calls which use COM as the transport vehicle to marshal your UI automation calls across processes. If you have a custom UI Automation provider you should check twice whether you are calling into state-altering methods of your UI objects, because they can be called reentrantly while your application is trying to acquire a lock. While you were waiting for the lock your window could already have been disposed! Bad things can and will happen.

UIAutomation is not only used for integration testing but also by screen readers and tools for people who have problems reading a computer screen. It is therefore possible that these issues show up not only during tests but also on customer machines. If you get mysterious crashes at places where you would not have expected them, check the call stack for AutomationInteropProvider calls. If there are some you most likely have hit a problem due to unexpected "continuations" while taking a lock.

Unfortunately there is no really good solution known to me except to pay vigilant attention to your UI Automation provider objects, which must not call any methods that could alter the state of your objects. If you have pending Begin/Invoke calls in your message queue these messages will still not get pumped; only the direct COM calls and their side effects can tunnel through blocking calls. Perhaps I am telling you things you already knew, but this one was quite surprising to me. Perhaps it would be best to stop pumping COM messages altogether so your calls line up nicely with the other window messages, as they should. But I guess there is still too much legacy code out there which would deadlock if no message pumping occurred anymore. The current Windows 10 build still pumps messages as always. Perhaps we could get a flag to switch over to a "new" behavior to prevent such nasty problems.
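
One possible defensive pattern (my own sketch, not from the sample above) is to keep the UIAutomation callback itself free of side effects and defer any state change through the dispatcher queue, so it only runs once the UI thread is idle again instead of reentrantly inside a blocked Monitor.Enter. TestWindow and AlterData are the sample types from above:

        class SaferPeer : WindowAutomationPeer
        {
            public SaferPeer(TestWindow owner)
                : base(owner)
            { }

            protected override HostedWindowWrapper GetHostRawElementProviderCore()
            {
                TestWindow owner = (TestWindow)base.Owner;

                // Queue the mutation instead of executing it inside the COM callback.
                owner.Dispatcher.BeginInvoke(new Action(() => owner.AlterData(500)));

                return base.GetHostRawElementProviderCore();
            }
        }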


Monday, September 29, 2014

One common thing while debugging is to search for who owns or references a handle. For x86 processes you can simply search the full address space, but in x64 this brute force approach no longer works. While debugging a hang from a memory dump I found out that Process.MainWindowHandle had changed to a different window which was not the real main window handle. Searching the values of NativeWindow objects on the managed heap also led to no trace, so it had to be some unmanaged window handle.

First I look at the Process objects which could contain interesting data

0:033> !DumpHeap -type System.Diagnostics.Process -stat
Statistics:
              MT    Count    TotalSize Class Name
000007fef3c7d090        1           24 System.Diagnostics.ProcessModuleCollection
000007fef3c76470        1           24 System.Diagnostics.ProcessThreadCollection
000007fef3c62c70        1           64 System.Func`2[[System.Diagnostics.Process, System],[System.Boolean, mscorlib]]
000007fef3c47718        8         2240 System.Diagnostics.Process
000007fef3c89140       57         3192 System.Diagnostics.ProcessThread
000007fef5b44458        2         3664 System.Object[]
000007fef3c71b90      291        16296 System.Diagnostics.ProcessModule
000007fef3c871c0      161        20608 System.Diagnostics.ProcessInfo

When I enable .prefer_dml 1 in Windbg I can click on the type in the statistics and get a list of all 8 Process objects so I can look at each one. !DumpHeap with the Method Table (MT) passed to -mt is the only way to get only Process objects and nothing else. This is exactly what the blue underlined links in Windbg execute when you click on them.

   
0:033> !DumpHeap /d -mt 000007fef3c47718
         Address               MT     Size
0000000003b52658 000007fef3c47718      280     
0000000003c29970 000007fef3c47718      280     
0000000003c8b680 000007fef3c47718      280     
0000000003e192e0 000007fef3c47718      280     
0000000003e19d80 000007fef3c47718      280     
0000000003e1c818 000007fef3c47718      280     
0000000003e1ce90 000007fef3c47718      280     
0000000003e520d0 000007fef3c47718      280    

By clicking through the blue links I do not need to enter any command by hand which is quite convenient.

0:033> !DumpObj /d 0000000003b52658
Name:        System.Diagnostics.Process
MethodTable: 000007fef3c47718
EEClass:     000007fef38a2238
Size:        280(0x118) bytes
File:        C:\Windows\Microsoft.Net\assembly\GAC_MSIL\System\v4.0_4.0.0.0__b77a5c561934e089\System.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
000007fef5b97de0  4002c1e       f0       System.Boolean  1 instance                1 haveProcessId
000007fef5b992b8  4002c1f       d8         System.Int32  1 instance             5848 processId
000007fef5b97de0  4002c20       f1       System.Boolean  1 instance                0 haveProcessHandle
000007fef3c4f2c8  4002c21       20 ...SafeProcessHandle  0 instance 0000000000000000 m_processHandle
000007fef5b97de0  4002c22       f2       System.Boolean  1 instance                0 isRemoteMachine
000007fef5b96508  4002c23       28        System.String  0 instance 0000000002b347d0 machineName
000007fef3c871c0  4002c24       30 ...stics.ProcessInfo  0 instance 0000000000000000 processInfo
000007fef5b992b8  4002c25       dc         System.Int32  1 instance          2035711 m_processAccess
000007fef3c76470  4002c26       38 ...sThreadCollection  0 instance 0000000000000000 threads
000007fef3c7d090  4002c27       40 ...sModuleCollection  0 instance 0000000000000000 modules
000007fef5b97de0  4002c28       f3       System.Boolean  1 instance                1 haveMainWindow
000007fef5b9a338  4002c29       b8        System.IntPtr  1 instance          11505b4 mainWindowHandle

The main question was: who owns the window handle? While examining all NativeWindow instances which wrap a window handle I could not find the handle in use by any managed window. In that case I want to search the complete address space, which can be non-trivial in a multi-GB x64 application.

Luckily there is a not so well known Windbg command to help you out: the !address command, which shows you all allocated regions of memory (mapped files, loaded dlls, thread stacks, native heaps, …). It also has scripting support to execute a command for every address region. Now I only need to search every allocated memory region for the window handle value. %1 is the start address and %2 is the length of the memory region.

0:033> !address -c:"s -d %1 %2 11505b4"
00000000`00df6560  011505b4 00000000 00000004 00000000  ................
00000000`03b52710  011505b4 00000000 00000000 00000000  ................
00000000`28b57370  011505b4 00000000 0000200e 00000000  ......... ......
00000000`28ce4568  011505b4 00000000 000100d4 00000000  ................
00000000`4c603798  011505b4 00000000 00010052 00000000  ........R.......

Then I search one by one who references the pointers to the window handle

0:033> !address -c:"s -d %1 %2 28b57370"
00000000`1b6d2220  28b57370 00000000 000030f8 00000000  ps.(.....0......
00000000`1b6d2740  28b57370 00000000 000030f8 00000000  ps.(.....0......
00000000`1b6e5740  28b57370 00000000 00000000 00000000  ps.(............
00000000`1b706530  28b57370 00000000 00000003 00000000  ps.(............
00000000`203387e8  28b57370 00000000 00001fd7 00000000  ps.(............
00000000`20355450  28b57370 00000000 00000010 00000000  ps.(............
00000000`20448af8  28b57370 00000000 b60115d0 ffffffff  ps.(............
00000000`204854b0  28b57370 00000000 c15f5308 000007fe  ps.(.....S_.....
00000000`204d6a90  28b57370 00000000 00000000 00000000  ps.(............
00000000`28b53290  28b57370 00000000 00000000 00000000  ps.(............
00000000`2bb8f7a0  28b57370 00000000 00000000 00000000  ps.(............
00000000`2c0ffc40  28b57370 00000000 00000001 00000000  ps.(............

Knowing that some thread must hold the window handle, I checked which of the found locations lie within a thread stack. The memory region marked red is indeed part of a thread stack:

!address
+        0`2b790000        0`2bb8c000        0`003fc000        Stack      [~33; 16d8.2bbc]
+        0`2bb8c000        0`2bb90000        0`00004000        Stack      [~33; 16d8.2bbc]

Chances are high that this thread is the owner of the window handle. From the call stack it becomes obvious:

0:051> ~33s
ntdll!NtWaitForMultipleObjects+0xa:
00000000`77c8186a c3              ret
0:033> kb
# RetAddr           : Args to Child                                                           : Call Site
00 000007fe`fdb01430 : 00000000`00000000 00000000`004dbe58 00000000`004dc200 00000000`77c8592e : ntdll!NtWaitForMultipleObjects+0xa
01 00000000`77a31723 : 00000000`2bb8f820 00000000`2bb8f810 00000000`00000000 00000000`2bb8f860 : KERNELBASE!WaitForMultipleObjectsEx+0xe8
02 00000000`77b48f7d : 00000000`2bb8f8a0 00000000`000927c0 00000000`00000001 00000000`000010f4 : kernel32!WaitForMultipleObjectsExImplementation+0xb3
03 00000000`77b462b2 : 00000000`2bb8f9f0 00000000`20481fb0 00000000`00000001 00000000`2bb8f9e0 : user32!RealMsgWaitForMultipleObjectsEx+0x12a
04 000007fe`ff67acd6 : 00000000`2bb8f9e0 00000000`00000000 00000000`2bb8f9f0 00000000`20481fb0 : user32!MsgWaitForMultipleObjectsEx+0x46
05 000007fe`ff79a2a2 : 00000000`80010115 00000000`2bb8faf0 00000000`80010115 00000000`203f1c22 : ole32!CCliModalLoop::BlockFn+0xc2
06 000007fe`c11c1e6d : 00000000`250ab220 00000000`20481f50 00000000`00000000 00000000`20481fb0 : ole32!CoWaitForMultipleHandles+0x102
07 000007fe`c11a33ba : 00000000`00000000 00000000`20481f50 00000000`00000000 00000000`00000000 : mshtml!CDwnTaskExec::ThreadExec+0x153
08 00000000`77a259ed : 00000000`00000000 000007fe`c10d0000 00000000`00000000 00000000`00000000 : mshtml!CExecFT::ThreadProc+0x4e
09 00000000`77c5c541 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : kernel32!BaseThreadInitThunk+0xd
0a 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!RtlUserThreadStart+0x1d

It was a hosted Internet Explorer window which had become active for a few tenths of a second, breaking code that assumed MainWindowHandle is always the application's main window and causing sporadic hangs on some automated test agents.

There is one more thing. When you are working in a bigger software shop you have some form of continuous integration and automated testing after a build has succeeded. Usually the tests are automatically deployed on several test agents to run in parallel. This makes it a challenge to debug sporadically failing or hanging tests. What do you do? Try to repro on your machine, add tracing, run the tests in a loop in isolation? There is one thing you can do to get down to the root cause of sporadically failing tests without all of that.

[Test]
public void Test()
{
    try 
    {
         // blahh bad test 

    }
    catch(Exception)
    {
        ProcessDumper.DumpProcess(); // Take a dump of your own process or several more if you need to
        throw;
    } 
}

Too easy? If you have some hung tests which only fail 1 in 50 times and you cannot attach a debugger or add tracing because the error goes away when you change the timing, automatically taking memory dumps when the test fails can do miracles. This only makes sense if you can open the dump in Windbg. Even long-time developers do not want to get into this arcane debugging stuff because "this happens only once a year, I do not want to learn all that strange stuff!". I can assure you that you will take dumps constantly once you know how to read them, because it is the fastest and most efficient way to diagnose random failures without much guessing about the probable root cause.

For your test you can simply add procdump from the SysInternals suite as resource, write the file to %temp% and you are ready to create one or more dump files.
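A minimal sketch of what ProcessDumper.DumpProcess could look like, assuming procdump.exe has already been extracted from an embedded resource to %temp%. The class body, file naming and error handling are mine; only the procdump command line switches (-accepteula, -ma) are the real ones.

using System;
using System.Diagnostics;
using System.IO;

static class ProcessDumper
{
    public static void DumpProcess()
    {
        string procdump = Path.Combine(Path.GetTempPath(), "procdump.exe");
        string dumpFile = Path.Combine(Path.GetTempPath(),
            String.Format("TestFailure_{0:yyyyMMdd_HHmmss}.dmp", DateTime.Now));

        // -accepteula suppresses the EULA dialog, -ma writes a full memory dump
        // of the given process id (our own test process here).
        string args = String.Format("-accepteula -ma {0} \"{1}\"",
            Process.GetCurrentProcess().Id, dumpFile);

        using (var dumper = Process.Start(new ProcessStartInfo(procdump, args)
        {
            UseShellExecute = false,
            CreateNoWindow = true
        }))
        {
            dumper.WaitForExit(); // wait until the dump has been fully written before the test continues
        }
    }
}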

I am a huge fan of as much data as I can get to nail a hard to find issue. Memory dumps are an important part of gathering forensic data of crashes and hangs.

If you still do not like Windbg here is a secret: You can get quite far with Visual Studio 2013 and its latest Service Packs! The only thing you must have are matching pdbs from a symbol server or your drop folder. The parallel stacks window in Visual Studio is a great way to check a memory dump with many threads. Besides this Windbg is not really good at resolving STL containers correctly in release builds.

The following code will cause an unhandled exception in C++ in release builds because of a double free call:

#include "stdafx.h"
#include <comutil.h>

bstr_t *g_globalVar;

DECLSPEC_NOINLINE void Allocate()
{
    g_globalVar = new bstr_t(L"Hello world");
    delete g_globalVar;            // first delete: g_globalVar is now a dangling pointer
    _tprintf(L"Created and deleted");
}

DECLSPEC_NOINLINE void FreeASecondTime()
{
    delete g_globalVar;            // second delete of the same pointer: heap corruption / access violation
    _tprintf(L"Deleted a second time");
}

int _tmain(int argc, _TCHAR* argv[])
{
    ::Sleep(10 * 1000); // Give some time to attach dumper
    Allocate();
    FreeASecondTime();

    return 0;
}

With procdump you can create a dump e.g. on first chance exceptions (you can even filter for specific exception types):

C:\>procdump -accepteula -ma -e 1  DoubleFree c:\temp\DoubleFreeDump.dmp

ProcDump v6.00 - Writes process dump files
Copyright (C) 2009-2013 Mark Russinovich
Sysinternals - www.sysinternals.com
With contributions from Andrew Richards

Process:               DoubleFree.exe (5872)
Number of dumps:       1
Hung window check:     Disabled
Exception monitor:     First Chance+Unhandled
Exception filter:      *
Terminate monitor:     Disabled
Dump file:             c:\temp\DoubleFreeDump_YYMMDD_HHMMSS.dmp


Press Ctrl-C to end monitoring without terminating the process.

[23:35:48] Exception: C0000005.ACCESS_VIOLATION

First-chance Exception.
Writing dump file c:\temp\DoubleFreeDump_140929_233548.dmp ...
Writing 10MB. Estimated time (less than) 1 seconds.
Dump written.

Dump count reached.

Now you can use VS2013 with the File - Open menu to open the dump file:

image

 

Now you need to press Debug with Native Only to start "debugging" and open the Call Stack Window to find the second free call:

image

 

This works with unmanaged, managed and mixed code as well. A very useful VS addition is the Parallel Stacks window which can help you a lot to find hangs in your code. For our single threaded application it does not look terribly useful, but if you have 100 threads it is a huge time saver.

image

What can you do if you do not have the pdbs because the build did run, the test did fail, binaries and pdbs are deleted after a few days and you only have a memory dump left? It is not as bad as it sounds. Since you know the baseline of your software you can rebuild the dlls and pdbs on your local machine.

When you load the dump initially and try to get a stack trace you get:

0:000> kb
ChildEBP RetAddr  Args to Child              
WARNING: Stack unwind information not available. Following frames may be wrong.
00c8f430 003f1d05 00c8f5f0 00c8f7cc 7f82f000 DoubleFree+0x11b9c
00c8f510 003f180b 00c8f6d0 00c8f7cc 7f82f000 DoubleFree+0x11d05
00c8f5f0 003f191b 00c8f7c4 00c8f7cc 7f82f000 DoubleFree+0x1180b
00c8f6d0 003f1b0b 00000001 00c8f898 00c8f7cc DoubleFree+0x1191b
00c8f7c4 003f1d6c 00000000 00000000 7f82f000 DoubleFree+0x11b0b
00c8f898 003f2fe9 00000001 00e6cad8 00e6cb60 DoubleFree+0x11d6c
00c8f8e8 003f31dd 00c8f8fc 7717919f 7f82f000 DoubleFree+0x12fe9
00c8f8f0 7717919f 7f82f000 00c8f940 7744a22b DoubleFree+0x131dd
00c8f8fc 7744a22b 7f82f000 e2df5dd0 00000000 kernel32!BaseThreadInitThunk+0xe
00c8f940 7744a201 ffffffff 7743f20b 00000000 ntdll!__RtlUserThreadStart+0x20
00c8f950 00000000 003f10d2 7f82f000 00000000 ntdll!_RtlUserThreadStart+0x1b

Not terribly useful. But things change if you add the output folder where your newly built pdb resides to your symbol path.

For this test I have converted the project to a managed C++ project for reasons you will see shortly:

.sympath+  D:\Source\DoubleFree\Debug\

and then force Windbg to load it even if the pdb timestamps do not match:

.reload /i DoubleFree.exe

you get full fidelity with line numbers back:

0:000> !ClrStack
OS Thread Id: 0xd74 (0)
Child SP       IP Call Site
00e3f2f0 7743cd7c [InlinedCallFrame: 00e3f2f0] Unknown
00e3f260 00fe1522 <Module>._bstr_t.Data_t.Release(Data_t*) [c:\program files (x86)\microsoft visual studio 12.0\vc\include\comutil.h @ 773]
00e3f270 00fe1791 <Module>.FreeASecondTime() [c:\program files (x86)\microsoft visual studio 12.0\vc\include\comutil.h @ 639]
00e3f27c 00fe1181 <Module>.wmain(Int32, Char**) [d:\source\doublefree\doublefree.cpp @ 28]
00e3f2f0 00f5a135 [InlinedCallFrame: 00e3f2f0] 
00e3f2ec 00fe112e DomainBoundILStubClass.IL_STUB_PInvoke()
00e3f2f0 73dc2552 [InlinedCallFrame: 00e3f2f0] <Module>._wmainCRTStartup()
00e3f498 73dc2552 [GCFrame: 00e3f498] 

Unfortunately Visual Studio has no way to load mismatched pdbs even if you have rebuilt the dll and pdb from the same source code base.

 

image

But all is not lost. You can get Visual Studio to load mismatched pdbs:

image

How? There is a little tool called chkmatch which modifies your "new" pdb to match an "old" binary. When you have a managed application you can save the dll to e.g. c:\temp with the managed debugger extension command !SaveModule in Windbg. That is the reason why I switched from a pure C++ to a managed C++ application: to make it possible to dump a dll from memory to disc. The .writemem command did not work, so I had to resort to a managed host process to debug a pure C++ issue.

Then you can make the "old" exe match your new pdb:

C:\Windows\system32>chkmatch -m c:\temp\DoubleFree.exe c:\temp\DoubleFree.pdb
ChkMatch - version 1.0
Copyright (C) 2004 Oleg Starodumov
http://www.debuginfo.com/

Executable: c:\temp\DoubleFree.exe
Debug info file: c:\temp\DoubleFree.pdb

Executable:
TimeDateStamp: 5429d867
Debug info: 2 ( CodeView )
TimeStamp: 5429d867  Characteristics: 0  MajorVer: 0  MinorVer: 0
Size: 68  RVA: 00007aec  FileOffset: 00006eec
CodeView format: RSDS
Signature: {36317c4a-a90c-482a-a8a7-c96ac8050a2e}  Age: 1
PdbFile: D:\Source\DoubleFree\Release\DoubleFree.pdb
Debug info: 12 ( Unknown )
TimeStamp: 5429d867  Characteristics: 0  MajorVer: 0  MinorVer: 0
Size: 20  RVA: 00007b30  FileOffset: 00006f30

Debug information file:
Format: PDB 7.00
Signature: {7cec4b85-0aca-4680-b797-82ead65f95ad}  Age: 1

Writing to the debug information file...
Result: Success.

Voila. You have matching pdbs and file and line numbers back to debug your dump file with Visual Studio. A common scenario where this is useful is a build which stores only the binaries in a repository but does not upload the pdbs for build performance reasons. When you get dumps from failed test runs on some test machines you perhaps still have the binaries but not the pdbs. In that case Windbg or the chkmatch tool can help you out to get back file and line information from otherwise unreadable call stacks.

I know that the VS product group does not want a "load mismatched pdbs" feature in VS, but it would be very useful if you have a dump but no exactly matching symbols. At least for the people who do not want to learn Windbg for such "rare" cases.


Sunday, September 21, 2014 #

When you have a performance bug you usually look at your code to find out why something takes longer than expected. In conjunction with a debugger this is a great approach for easy bugs. Things get much harder if the issue only surfaces sporadically at customer sites. Application logging which is always on helps you to see that there was a problem, but since logging must not flood the hard disc you cannot log every method call with all parameters like IntelliTrace tries to do. The next level is application tracing, which can generate enormous amounts of output and is a good method to find production issues on customer sites. Now you see in your traces that something takes much longer than normal, but you still have no clue why it is taking so much longer. This is the time where system wide profiling with ETW (yes I keep coming back to it) is a nice way to drill deeper and solve otherwise impossible bugs. Performance counters are also a good approach to check if something strange is going on, but then you still have (usually) no clue what the reason for the slowdown was. It is therefore best to bring your application specific tracing and ETW together in WPA to find the slowdown in your application traces and then drill deeper into the other ETW profiling events.

As a rule of thumb, if you have carefully added tracing to your application you have also written the best trace viewer in the world to visualize your traces. No problem: you can write an extractor tool to store your traces as a text file in your own format which can be viewed by your good old trace viewer as usual (Tx is a nice library to extract events from an etl file). If you want to stay within WPA you certainly want a better viewer than a plain list of your application traces. I use application traces to get an idea what the application was doing at a high level and then drill down to the details once I have found the slowdown in the traces. It would be cool to see your traces in WPA per thread with a duration timeline so you can see which methods take longest to execute. That allows a call stack like view where I can simply see in the graph which operations take most of the time.

image

The only issue with WPA is that in the graph there is no expand menu item (in the list view you can expand, but there seems to be a limit how many items in the graph can be expanded) and there is no possibility to easily filter for regions, e.g. with a method execution time >0,1s. But there are workarounds. You need to bring up the advanced graph settings dialog

image

image

and to force the graph to expand all items you can define in the Expansion tab a query that matches all regions.

image

Then you get in the UI graph a nice per thread expansion of all called methods. Once you have visually found an interesting hot spot it is easy to drill deeper in the CPU Usage Precise graph to find unexpected waits for virtually anything.

You can check it out for yourself when you write your traces to ETW like with this little ETW tracer which depends on the EventSource Nuget package.

[EventSource(Name = "ETWTracer", Guid="950FD22D-086C-4D16-98A0-FCC2B8DE2B07")]
public sealed class Tracer : EventSource
{
    #region Singleton instance
    static public Tracer Log = new Tracer();
    #endregion

    [Event(1, Keywords = Keywords.InOut, Task = Tasks.InOut, Opcode = EventOpcode.Start)]
    public void MethodEnter(string Method)
    { WriteEvent(1, Method); }

    [Event(2, Keywords = Keywords.InOut, Level = EventLevel.Verbose, Task = Tasks.InOut, Opcode=EventOpcode.Stop)]
    public void MethodLeave(string Method, int DurationInus)
    { WriteEvent(2, Method, DurationInus); }

    [Event(3, Keywords = Keywords.Info, Task = Tasks.Info, Opcode = EventOpcode.Info)]
    public void Info(string Method, string Message)
    { WriteEvent(3, Method, Message); }
    #region Keywords / Tasks / Opcodes

    public class Keywords   // This is a bitvector
    {
        public const EventKeywords InOut = (EventKeywords)0x0001;
        public const EventKeywords Info = (EventKeywords)0x0002;
    }

    public class Tasks
    {
        public const EventTask InOut = (EventTask)0x1;
        public const EventTask Info = (EventTask)0x2;
    }

    #endregion
}

With this little class we can define a method enter/leave tracer with just a few lines:

    class SimpleTracer : IDisposable
    {
        string Method;
        Stopwatch Watch = Stopwatch.StartNew();

        public SimpleTracer(string method)
        {
            Method = method;
            Tracer.Log.MethodEnter(method);
        }

        public void Info(string message)
        {
            Tracer.Log.Info(Method, message);
        }

        public void Dispose()
        {
            Watch.Stop();
            // Convert to microseconds before the cast to int to keep sub-millisecond resolution
            Tracer.Log.MethodLeave(Method, (int)(Watch.Elapsed.TotalMilliseconds * 1000));
        }
    }

and use it right away
       void Method1()
        {
            using(new SimpleTracer("Method1"))
            {
                Thread.Sleep(1000);
                using(new SimpleTracer("Method2"))
                {
                    using(new SimpleTracer("Method3"))
                    {
                        Looper();
                        using (new SimpleTracer("Method4"))
                        {
                        }
                    }
                }
            }
        }

With PerfView you can enable collection of your brand new ETW event provider by specifying its Guid. I still prefer to specify the Guid and register my own ETW providers in the normal way with EventRegister, since only PerfView and WPA know how to do the correct rundown based on the provider name by calculating its Guid by convention when the ETW provider class name is used with a * before the name.
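For reference, a collection command along these lines should do (a sketch: the provider is given by its explicit Guid here, since the Tracer class above assigns one and the *Name convention only works when the Guid is derived from the name):

PerfView collect /OnlyProviders=950FD22D-086C-4D16-98A0-FCC2B8DE2B07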

image

Now it is time to load the generated data and load our own region of interest file. Go to Trace - Trace Properties and add your region file to it.

image

Then you get a new graph in the System Activity diagram list that you can add to your current analysis pane as usual.

image

You might be interested in what the regions file looks like. Here it is:

TraceRegions.xml

<?xml version='1.0' encoding='utf-8' standalone='yes'?>
<InstrumentationManifest>
   <Instrumentation>
      <Regions>
         <RegionRoot Guid="{65D8D484-90A9-4BA4-9247-51E7C159F000}" Name="" FriendlyName="">  <!-- This guid is simply a random number -->
            <Region  Guid="{65D8D484-90A9-4BA4-9247-51E7C159F001}" Name="" FriendlyName="">  <!-- This guid is simply a random number -->
               <Start>
                  <Event Provider="{950FD22D-086C-4D16-98A0-FCC2B8DE2B07}" Id="1" Version="0" /> <!-- Method Enter event. The guid is your ETW Provider guid. Event id and version must match! -->
               </Start>
               <Stop>
                  <Event Provider="{950FD22D-086C-4D16-98A0-FCC2B8DE2B07}" Id="2" Version="0" /> <!-- Method Leave event. The guid is your ETW Provider guid. Event id and version must match! -->
               </Stop>
                <Naming>
                   <PayloadBased NameField="Method" />  <!-- The regions get the method names. It is a pity that no concatenation of more than one field is supported. -->
                 </Naming>
                <Match>
                    <Parent TID="true" /> <!-- Parent event must be from the same thread otherwise strange durations and correlations would be displayed as region -->
                    <Event TID="true">    <!-- The start stop events must originate from the same thread -->
                      <Payload FieldName="Method" TargetFieldName="Method" /> <!-- Region start/stop must come from the same method. 
                                                                                  You can use several payload rules to match several conditions here. -->
                    </Event>
                    <!-- Nesting does not really work although it would be nice feature to get a call stack like view 
                         The main problem is that the region matching algo takes not the nearest possible event as parent but the first matching one which 
                         is almost always wrong if you call yourself in a loop the nesting is awkward.
                         It would work if we would give each method enter leave a unique activity guid so we could nest based on matching activity guids
                         but no one is writing traces out in this way.
                     -->
                    <!--
                     <SelfNest TID="true">
                    </SelfNest> 
                    -->
                </Match>
            </Region>
         </RegionRoot>
      </Regions>
   </Instrumentation>
</InstrumentationManifest>

As you can see in the comments there are still quite a few things to be desired to make regions even more user friendly, but they are a nice way to get a visual "call stack" like view with durations as bars for one or all threads. The call stack view works only if you have no recursive or duplicate method names; otherwise you would need the SelfNest feature to work correctly. If more than one parent region is possible, the first one in time and not the nearest one on the current thread is chosen, which is unfortunate if you want a more call stack like view. Zooming into the graph works well for the x axis but there is no way to zoom out on the y axis to see where the time is spent when many threads are working at the same time.

Although many things are still left to be desired, the region based trace view is much better than the plain trace method enter/leave view you get for Generic Events:

image

This can be a killer feature if you have complex application tracing in your application and you want to combine your application know how with the system response to the actions of your application.


Sunday, September 14, 2014 #

Recently I have seen an interesting performance question on SO which deals with Parallel.For in .NET. The code was basically

            Parallel.For(0, 1000 * 1000 * 1000, i => {});

The actual work is simulated by an empty delegate, which is an easy way to measure the overhead of Parallel.For. When you let this run on some multicore machines in x86 Release you get numbers like

image


The running time is not constant as one might expect; sometimes there are slowdowns in the order of 6 times or more. With WPT I was quickly able to find the root cause in the parallel for loop counter increment method:

clr.dll!COMInterlocked::ExchangeAdd64 

It is used to increment the loop counter in a lock free way. With all cores in use we have massive memory contention on it, but not all the time. By measuring the counts per thread I could see that sometimes one thread gets much less work than the others, but there was no clear pattern which related to the slowdowns. Since we are dealing with memory contention it should surface in L2/L3 cache misses, because some other thread constantly manages to overwrite the value while the losing thread is rarely able to update the counter.

The ExchangeAdd64 method uses basically the following approach:

        static long Counter;

        static void InterlockedAdd(long increment)
        {
            long old;
            long @new;
            do
            {
                old = Counter;                // read the current value
                @new = old + increment;       // compute the new value
            }
            // retry if another thread changed Counter in the meantime
            while (old != Interlocked.CompareExchange(ref Counter, @new, old));
        }

It adds some value to the existing counter and then tries to exchange it atomically, retrying if another thread got in between. This seems to work well all the time in x64 but not if you let it run as a 32 bit application.

The 64 bit code uses

lock xadd qword ptr [rdx+8],rax ds:xxxxx

whereas the x86 version uses

lock cmpxchg8b qword ptr [edi] ds:xxx

Both approaches use long (64 bit) values although the Parallel.For counter we did use is only an integer counter. This seems to cause funny side effects on the CPU cache lines. There exist no Windows performance counters to measure cache miss rates, which makes it hard to verify this claim. But Intel has released the free Performance Counter Monitor (pcm.exe) which offers much deeper insights into what your processor is currently doing. Compiling it and getting it to work is not entirely straightforward since you need to compile a kernel driver and install it test signed, or you download a publicly signed version of it from the net. The Intel devs seem to be more Linux guys since the project files can be compiled with VS2013, but for every project the binaries end up in a sub folder of the project in the Debug folder (even for release). There is NO common bin folder where all binaries are compiled together. Anyway. The most interesting tool is pcm.exe which can show you per core statistics how many instructions have been executed and what the cache hit/miss counters are for the L2/L3 caches. The executed instructions per core show a sharp decrease which explains why the running time of the Parallel.For loop is sometimes so slow. But still we do not see the reason for the drop in execution performance.

 

image

But when I look at the L3 cache misses it becomes clear that one core seems to get into a death spiral where it consistently loses the race to update this value for several seconds. This distorts the parallel execution of instructions because the data needs to be read from the far slower main memory many times while the CPU caches are being synchronized.

 

image

I am not sure how to fix it since it should not happen at all, and it will hopefully be fixed by a future CPU version. It makes no sense to emit extra code for such corner conditions. This example makes a nice demo of what memory contention can do to your application performance. If you have no contention any lock free version will be fast. But if you have many concurrent accesses to the same memory locations you can get much longer running times while at the same time wasting enormous amounts of CPU cycles. Does this matter? Well yes it does if you use e.g. the lock keyword in .NET, which internally first spins a bit in the hope that the lock will be released quickly. If not, it takes a real lock and stops wasting CPU cycles. If you have e.g. 5 threads updating some central variable and you use the lock keyword you will find many burned CPU cycles only to increment a counter. The Interlocked methods offer good alternatives to locks if you have simple tasks. The Interlocked methods are nice, but the JIT compiler will not replace your call directly with the corresponding CPU instruction; you pay for an extra method call which can cost you in the worst case a factor 10 in performance because quite a lot of data needs to be moved from registers onto the thread stack.
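To make the comparison concrete, here is a minimal sketch (class and field names are mine, not from the original measurement) contrasting a lock protected counter with an Interlocked based one; under heavy contention both cost cycles, but the lock version additionally pays for spinning and, if that fails, for a kernel wait:

using System.Threading;

static class Counters
{
    static long _counter;
    static readonly object _lock = new object();

    // Takes a monitor: spins briefly, then blocks if the lock stays contended.
    public static void AddWithLock()
    {
        lock (_lock)
        {
            _counter++;
        }
    }

    // Lock free: a single interlocked instruction, but still subject to
    // cache line contention when many threads hammer the same variable.
    public static void AddLockFree()
    {
        Interlocked.Increment(ref _counter);
    }
}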

I really like the pcm tool from Intel because it gives me the ability to measure things. Previously I had to guess why this or that was slowing down my application. Now I can prove it. Here is some sample output of it so you can check if the tool might be useful to you as well.


EXEC  : instructions per nominal CPU cycle
IPC   : instructions per CPU cycle
FREQ  : relation to nominal CPU frequency='unhalted clock ticks'/'invariant timer ticks' (includes Intel Turbo Boost)
AFREQ : relation to nominal CPU frequency while in active state (not in power-saving C state)='unhalted clock ticks'/'invariant timer ticks while in C0-state
L3MISS: L3 cache misses
L2MISS: L2 cache misses (including other core's L2 cache *hits*)
L3HIT : L3 cache hit ratio (0.00-1.00)
L2HIT : L2 cache hit ratio (0.00-1.00)
L3CLK : ratio of CPU cycles lost due to L3 cache misses (0.00-1.00), in some cases could be >1.0 due to a higher memory latency
L2CLK : ratio of CPU cycles lost due to missing L2 cache but still hitting L3 cache (0.00-1.00)
READ  : bytes read from memory controller (in GBytes)
WRITE : bytes written to memory controller (in GBytes)
IO    : bytes read/written due to IO requests to memory controller (in GBytes); this may be an over estimate due to same-cache-line partial requests
TEMP  : Temperature reading in 1 degree Celsius relative to the TjMax temperature (thermal headroom): 0 corresponds to the max temperature


Core (SKT) | EXEC | IPC  | FREQ  | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3CLK | L2CLK  | READ  | WRITE |  IO   | TEMP

   0    0     0.08   1.05   0.08    0.31     355 K   1013 K    0.65    0.32    0.21    0.08     N/A     N/A     N/A     66
   1    0     0.00   0.20   0.01    0.31     111 K    226 K    0.51    0.17    0.53    0.13     N/A     N/A     N/A     66
   2    0     0.01   0.27   0.02    0.30     193 K    517 K    0.63    0.19    0.35    0.15     N/A     N/A     N/A     68
   3    0     0.08   1.00   0.08    0.53     706 K   1838 K    0.62    0.50    0.38    0.17     N/A     N/A     N/A     68
   4    0     0.01   0.38   0.03    0.32     247 K    667 K    0.63    0.22    0.32    0.13     N/A     N/A     N/A     66
   5    0     0.01   0.56   0.02    0.40     177 K    479 K    0.63    0.26    0.37    0.15     N/A     N/A     N/A     66
   6    0     0.01   0.35   0.03    0.32     235 K    500 K    0.53    0.24    0.37    0.10     N/A     N/A     N/A     70
   7    0     0.02   0.43   0.05    0.57     116 K    807 K    0.86    0.41    0.11    0.21     N/A     N/A     N/A     70
-----------------------------------------------------------------------------------------------------------------------------
SKT    0     0.03   0.70   0.04    0.38    2145 K   6051 K    0.65    0.37    0.29    0.14     N/A     N/A    0.00     63
-----------------------------------------------------------------------------------------------------------------------------
TOTAL  *     0.03   0.70   0.04    0.38    2145 K   6051 K    0.65    0.37    0.29    0.14     N/A     N/A    0.00     N/A

Instructions retired:  917 M ; Active cycles: 1318 M ; Time (TSC): 3986 Mticks ; C0 (active,non-halted) core residency: 10.81 %

C1 core residency: 15.87 %; C3 core residency: 0.96 %; C6 core residency: 0.00 %; C7 core residency: 72.36 %;
C2 package residency: 0.00 %; C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7 package residency: 0.00 %;

PHYSICAL CORE IPC                 : 1.39 => corresponds to 34.79 % utilization for cores in active state
Instructions per nominal CPU cycle: 0.06 => corresponds to 1.44 % core utilization over time interval


Monday, June 30, 2014 #

When you want to schedule some not so important work you have the option to lower the thread priority of a thread, or you can change the base priority of the whole process. That can help to keep Windows responsive while CPU intensive tasks (e.g. some processes with Parallel.For instances) use all cores if nothing else is going on. For that purpose you can change the process priority via SetPriorityClass, or if you are using .NET you can use the Process.PriorityClass property to get and set it. Please note that these methods have nothing to do with classes and object oriented programming at all. The xxxx_Class values define the base priority for all threads in a process. This value can be changed at runtime as well, e.g. via the task manager. The thread scheduler of Windows "adds" your thread priority to your thread base priority, which gives the effective thread priority. Thread priorities in Windows can have values from 0-31, but the many APIs that influence the effective thread priorities make things more indirect. Below is a table that shows the effect of setting the process priority class and the thread priority:

image

What is interesting is that if you set a process to Below Normal it can still interfere with threads of a Normal process if its thread priority is set to Highest. A Normal thread in a Normal process gets the thread base priority 8, which is the same as a Below Normal process with a Highest thread priority. Which thread in a managed application runs with Highest priority? Yes, it is the finalizer thread. If you have much work in your finalizers in Below Normal processes, then they compete with your Normal processes for CPU. This might catch you by surprise if you thought that Below Normal processes do not interfere with Normal processes. A small sketch of this combination is shown below.
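This is a minimal sketch using the documented Process.PriorityClass and Thread.Priority properties; the class name and console output are mine:

using System;
using System.Diagnostics;
using System.Threading;

class PriorityDemo
{
    static void Main()
    {
        Process current = Process.GetCurrentProcess();

        // Lower the base priority of all threads in this process.
        current.PriorityClass = ProcessPriorityClass.BelowNormal;
        Console.WriteLine("Process priority class is now {0}", current.PriorityClass);

        // A single thread can still raise its own priority and end up at the
        // same effective priority as a Normal thread in a Normal process.
        Thread.CurrentThread.Priority = ThreadPriority.Highest;
    }
}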

There is an even more severe way how threads of lower priority processes can compete with your foreground tasks. Windows has a nice mechanism to keep your interactive application responsive by boosting its priority, so the window which has just received keyboard input stays responsive. This is called priority boosting. Why should I care about priority boosts in a server application? Well, there are actually three reasons why Windows can decide to boost your thread priority (copied from MSDN):

  • When a process that uses NORMAL_PRIORITY_CLASS is brought to the foreground, the scheduler boosts the priority class of the process associated with the foreground window, so that it is greater than or equal to the priority class of any background processes. The priority class returns to its original setting when the process is no longer in the foreground.
  • When a window receives input, such as timer messages, mouse messages, or keyboard input, the scheduler boosts the priority of the thread that owns the window.
  • When the wait conditions for a blocked thread are satisfied, the scheduler boosts the priority of the thread. For example, when a wait operation associated with disk or keyboard I/O finishes, the thread receives a priority boost.

    You can disable the priority-boosting feature by calling the SetProcessPriorityBoost or SetThreadPriorityBoost function. To determine whether this feature has been disabled, call the GetProcessPriorityBoost or GetThreadPriorityBoost function.

The relevant part is the third bullet: if your server application has threads which are waiting for something, their thread priority is also increased when the wait is satisfied. The priority boost can add up to four (tested on my Win 8.1 x64 box) to your current thread priority, which can make your Lowest priority thread effectively a Highest priority thread!

Now we have all factors together to calculate the effective thread priority:

image

The effective thread priority is calculated from the process base priority, the current thread priority and an optional thread boost value. Although I have used the + operator in the formula it is not a simple addition of some values. Now you know how to read the thread priority and dynamic priority in Process Explorer. Base Priority is the thread priority when the process priority and the thread priority have been "added" without any priority boost.

image

Dynamic priority is then the effective boosted thread priority the OS scheduler works with. Now we can check our knowledge with my little GCAllocations tester from a previous post. As a normal process it should work with threads of priority 8 most of the time:

image

As it turns out this is not the case. Due to the many context switches (waits) the thread priorities were boosted to 9 and 10 most of the time. This effect is also there if the console application is not a foreground window and receives no keyboard input. It happens only because the OS boosted your thread priorities since it thought it would be a good idea to give you some time to work after you were woken up. Can we prevent this somehow? Yes, there is an API SetProcessPriorityBoost: when you call it with the second parameter set to 1 you disable the priority boost feature. Or you use

Process.GetCurrentProcess().PriorityBoostEnabled = false;

image

Now that looks more like what we would expect from a well behaved process. So should we turn off Priority Boosting and do we gain anything from it? To check this I have run GCAllocations

  • Single Instance
  • Two Instances which are bound to four cores where four threads allocate in parallel. One process has Normal prio the other BelowNormal.
    • One time with Priority Boosting enabled
    • Second time with Priority Boosting disabled.

I have measured for each instance how long it needed to execute, how much CPU it used and the ready times. Below are the results:

image

What can we learn from these graphs? First, if no contention on the cores is happening we are fastest with 2,7s for the single instance which could run undisturbed on all four cores. When there is contention on the four cores (two instances running at the same time) things become slower by ca. 20%. The normal priority process and the one with priority boost enabled show no difference in run time there. But the BelowNormal process with priority boosting enabled gains ca. 7% in run time, which is sort of unexpected. The ready times are quite high with 6,5 and 6,3s for these processes, but since 4 threads are running at the same time we have to divide the process ready time by the number of active threads to get an estimation of the increased run time. This is shown in the Projected Duration graph where ReadyTime/4 is added to the single instance duration and graphed for each process. This approximation is not perfect either because it neglects the CPU usage of each process, which differs quite a lot. It is interesting that the BelowNormal process had the lowest total CPU consumption of 5,8s. How can this be? I did check with the sample profiler where the CPU consumption differs. It turns out that with priority boosting enabled the page faults from the Windows memory manager are much less costly. 

image

This is a nice bonus of boosting the thread priority. In this sample application priority boosting had no negative effect on the foreground process, and the background activity became 7% faster. From these experiments I would say it is safe to leave CPU priority boosting enabled. That's enough for thread and process priorities for now. There is another priority which was introduced in Windows Vista: IO priority. There are no direct APIs to influence the IO priority except if you call SetThreadPriority:

    enum ThreadPriority
    {
        THREAD_MODE_BACKGROUND_BEGIN = 0x00010000,
        THREAD_MODE_BACKGROUND_END = 0x00020000,
        THREAD_PRIORITY_ABOVE_NORMAL = 1,
        THREAD_PRIORITY_BELOW_NORMAL = -1,
        THREAD_PRIORITY_HIGHEST = 2,
        THREAD_PRIORITY_IDLE = -15,
        THREAD_PRIORITY_LOWEST = -2,
        THREAD_PRIORITY_NORMAL = 0,
        THREAD_PRIORITY_TIME_CRITICAL = 15
    }


    [DllImport("kernel32.dll", SetLastError = true)]
    extern static int SetThreadPriority(IntPtr hThread, ThreadPriority nPriority);

     // Will set in a normal process the Thread Priority to 4 and the IO Priority to Very Low
     SetThreadPriority(GetCurrentThread(), ThreadPriority.THREAD_MODE_BACKGROUND_BEGIN);
     // Do Low Priority Work and do not forget to reset it when you want 
     // normal scheduling
     SetThreadPriority(GetCurrentThread(), ThreadPriority.THREAD_MODE_BACKGROUND_END);

 

That can help to keep the hard disc responsive for foreground work while you are indexing some data or running some low priority background task. There are actually 5 levels of IO priorities, which you can only set explicitly for your process by calling the undocumented function NtSetInformationProcess. You can set the IO priority from a normal process only to 0-2, i.e. not higher than Normal.

    [DllImport("kernel32.dll", SetLastError = true)]
    extern static IntPtr GetCurrentProcess();

    [DllImport("ntdll.dll", SetLastError = true)]
    public static extern int NtSetInformationProcess(IntPtr processHandle,
       PROCESS_INFORMATION_CLASS processInformationClass, ref IntPtr processInformation, uint processInformationLength);


    enum IOPriority : uint
    {
        VeryLow = 0,
        Low = 1,
        Normal = 2,
    }

    static int SetIOPriority(IntPtr hProcess, IOPriority newPrio)
    {
        IntPtr ioPrio = (IntPtr) newPrio;
        int lret= NtSetInformationProcess(hProcess, PROCESS_INFORMATION_CLASS.ProcessIoPriority, ref ioPrio, 4);
        return lret;
    }
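A short usage sketch under these assumptions (GetCurrentProcess returns a pseudo handle so nothing needs to be closed; treating any non-zero NTSTATUS as a failure is my simplification):

    // Lower the IO priority of the current process to Very Low.
    int status = SetIOPriority(GetCurrentProcess(), IOPriority.VeryLow);
    if (status != 0)
    {
        Console.WriteLine("NtSetInformationProcess failed with NTSTATUS 0x{0:X}", status);
    }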

ProcessIoPriority is an information class constant with the value 21 if you want to try it out. There are not many use cases out there except if you want to penalize some processes which scan your hard disc for no real purpose except to make your IT department happy. In that case it can help to give such processes a lower IO and process priority to teach them not to interfere with foreground processes.

If you design server software which must stay responsive you must ensure that processes which require interactive performance are not held up by background jobs. It makes sense to banish such background jobs to lower process priorities to ensure that all cores can be utilized by your users if necessary. It is of no help if e.g. SQL Server uses all of your cores to index its database again, because the interactive users' processes will suffer from latency issues. Thread ready times can become arbitrarily high if you are in a CPU oversubscription scenario. In that case all Normal priority processes should belong to interactive users. This way you get a scalable server which can support up to N concurrent users with varying background load. If latency issues show up then the only valid reason should be that too many users are logged onto your server. It should never happen that background activities disturb interactive tasks.

It is important to note that all priorities are used by the OS inside priority queues which are worked from the top down to the lowest priorities with some mechanisms to prevent starvation. But these priorities have absolutely no effect if you have e.g. no contention on a CPU. When only your task sits in a priority queue which is empty you will be served immediately. To see these priorities in action you need to load your system with >=100% CPU intensive tasks to watch how things are sorted out. SQL Server will not become slower if the system load never reaches 100% and you have lowered its CPU priority. But it can help to mitigate spikes of SQL server which can influence interactive tasks if the system is heavily loaded.

Today's software is often written in a scalable way by using all cores that are present. This is nice for a single process using all cores with no other activity. But if you have a system where more than one thing needs to run concurrently, such "scalable" software is a recipe for disaster. You can try to run some batch scripts in parallel which use a multi core enabled compression algorithm and watch your system. It will not respond to user input at all anymore. Some time ago a colleague wanted to speed up a batch script by starting a tool 100 times concurrently. I told him before he pressed enter that the script could render the system unresponsive. After letting it run for 10 minutes and seeing no progress he pressed the power button. Had he started the script with Idle process priority he might have had a chance to kill the offending processes.

If you start batch scripts parallel to your normal workload you should try out

start /LOW xxxx.cmd

to start it with Idle process priority. That can make a huge difference when you sit in front of the machine and want to continue to work. You can even restrict it to some cores, e.g. with an /AFFINITY mask of 0xF to run your batch script only on the first four cores.
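Combined, that could look like the line below (a sketch: the affinity mask is passed to start as a hex number, so 0xF is written as F, and xxxx.cmd stands for your own script):

start /LOW /AFFINITY F xxxx.cmd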

If you want to check out thread IO priorities you cannot count on Process Explorer. These internals are only displayed by Process Hacker, which is a Process Explorer clone with more options and fewer bugs. The best feature of it is certainly that it supports Ctrl+C in all dialogs, which is a big plus if you want to get data out of the UI.


Wednesday, April 30, 2014 #

On my home machine VS2013 starts up in 2s from my SSD. But on another Windows 7 machine it takes over 13s. I never bothered really since I thought that some plugins and connecting to TFS simply take their time. It stayed an itch until I installed Resharper 8, where things noticeably got slower. No big deal. Suspend Resharper and it will play dead. It works, except that the startup time of VS still looked bad. Time for another profiling session with WPT. The approach is as simple as starting up WPRUI

image

  1. Press Start.
  2. Start Visual Studio.
  3. Wait until it is has started.
  4. Press "Save" button to stop profiling.
  5. Enter a file name to which the data should be saved
  6. Be patient while data is saved.
  7. Press "Open in WPA" button to analyze the data.

When I look at the CPU consumption until VS 2013 is usable it takes 13,1s to start up while consuming 9,6s of CPU time. The big gap in the middle looks like it needs to connect to the network. Let's ignore that for now.

image

To see what the CPU was doing I need sample profiling (CPU Usage Sampled). Many people are confused by the fact that WPT shows you two CPU usage graphs with nearly the same content. The graphs (should) look the same, but the underlying data used to build them is fundamentally different.

CPU Usage Sampled

This graph comes into existence by parsing CPU sampling ETW events. When sample profiling is enabled the Windows kernel will log the call stacks of all running threads every 1ms. The sample rate is configurable from seconds between samples up to 8kHz (every 0,125ms, see xperf -SetProfInt for more info). Sample profiling gives you an accurate view of what your application was doing with the CPU. When you look into the call stacks you should have at least 10K CPU sample events in your selected view. Then you can check the call stacks in your application down to stacks with at least 100 samples, which represents an error of 1% and is usually enough. If you want to optimize a method which shows up in the sample profiling with only 10 samples you are chasing ghosts.

CPU Usage Precise

Although the graphs look identical you have a lot more columns (54!) to choose from. This time you are not looking at sampled call stacks but at a parsed view of context switch ETW events. A context switch occurs when your thread stops or starts running on a CPU. A thread is switched in (NewThreadxxxx columns) and another thread stops running (Oldxxxxxx columns), while a third thread has enabled your thread to run by releasing a lock, signaling an event, timer expiration or a few other reasons. The thread that allowed your thread to run shows up in the Readying Thread/Process/Stack columns. This very fine granular view allows you to find out e.g. why your thread was sitting in ReadFile for 5s to read a 1KB file.

A more thorough explanation of context switch handling can be found in the WPA documentation for CPU Analysis. To make sense of the many columns you should memorize this picture which shows all important times of a context switch. The picture below is shamelessly copied from the WPA documentation.

Figure 10 CPU Usage Precise Diagram

The most important thing to remember is that with CPU Usage Precise you can find out why and for how long your thread was not running.

Now back to the original question. We are after a CPU consumption issue. What do we see in CPU Usage (Precise)? Only context switches. A context switch occurs when your thread stops running because it is waiting for something (WaitForSingleObject) or its thread quantum was used up. It is 15ms by default but it may be more. On my Win 8.1 x64 box it is actually 32ms. The reason why a context switch event is logged is that your thread is waiting for something (not CPU bound) or your thread has already executed code for 15-32ms and has used up its allotted time slice. In effect you will see in the stacks of CPU Usage Precise all of your calls that did block for some reason, but you will almost never see a CPU consuming operation.

How do I know that? Because I have tested it with a small CPU consuming application:

class Program
{
    [DllImport("kernel32.dll")]
    extern static int GetCurrentThreadId();

    static void Main(string[] args)
    {
        Action acc = () =>
        {
            Process proc = Process.GetCurrentProcess();
            var  currentThread = proc.Threads.Cast<ProcessThread>().Where(t => t.Id == GetCurrentThreadId()).Single();
            currentThread.ProcessorAffinity = (IntPtr)0x02; // Force thread affinity on core 2

            while (true)
            {
            }
        };

        Parallel.Invoke(Enumerable.Repeat(acc, 2).ToArray());  // Execute two concurrent threads and force execution on one CPU
    }
}

This little app does nothing but spawn two threads which run on the same CPU to give the thread scheduler some work. Before that it was a simple while(true); loop in the Main method which did consume CPU but, once it was running, never caused any context switches for seconds. The first observation was that a context switch is only performed if another thread must run on this CPU. If not, there is no reason to bring the thread to a halt and it can run happily forever. Internally the thread dispatcher still checks if the thread quantum has expired, but since no other thread is scheduled for this CPU its quantum is renewed without a context switch. It turns out that you need to cause actual contention on a CPU to make the thread dispatcher do some actual work. The very busy while loops show up in the profiler like this:

image

There we see that on my Win 8.1 machine when two threads compete for the same CPU

  • Every  thread is run again after 32ms (TimeSinceLast).
  • It runs for 32ms (NewInSwitchTime).
  • While it did wait for the CPU for 32ms (Ready).
  • It did spend no time in some WaitForSingleObject synchronization primitive (Wait).
  • The previous thread is put to sleep because it has used up its thread quantum (OldWaitReason with value WrQuantumEnd).
  • During the selected time of 1,62s 54 context switch events did happen which is nearly optimal 1,62/0,032=50 context switches with 32ms time slices (Count).
  • The CPU was maxed out. 12,5% is one core on my 8 core machine (% CPU Usage).

With context switch events you can find out pretty much anything you ever wanted to know about why your thread was not running. Thread scheduling is explained in some free e-books by a Windows kernel developer named David Probert. He has written quite a lot about the internal mechanics of a context switch. If you want to know more about the possible values of OldWaitReason you will find some kernel defines but no explanation. The best explanation for some values I have found is in the .NET documentation for the ThreadWaitReason enum. In reality you will find for OldWaitReason almost always the value Executive, which means that the thread was waiting inside a WaitForSingleObject and has given up its current quantum.

Below is a picture of a normal Visual Studio startup which shows the time gaps in us of the main thread when it had to wait for something. With wait I mean waiting for the CPU to become ready to run the thread (high Ready times), waiting on something to complete, or simply waiting for the next window message. As you can see we have a pretty busy message pump with 3402 context switches where the total time gap was 450ms while no code ran on the main UI thread. Since there are so many of these small gaps it makes not much sense to search for useful work on the UI thread there. Instead you should check out single events where e.g. for 145ms nothing happened. Perhaps there is room for improvement there. I have not seen TimeSinceLast used often in performance analysis sessions so far, but it is definitely useful to check where the largest execution gaps of a supposed-to-be busy thread are lurking, because it is basically the sum of wait times and ready times which catches both thread scheduling issues and wait times for the completion of async operations.

image

After showing off so many details of CPU Usage (Precise) I need to come back (again) to our CPU consumption issue. With CPU Usage Precise we would see at most one call stack every 15ms-30ms, namely when the thread was switched out because its quantum expired. This is usually a lousy sample rate which is polluted by the wait call stacks. In theory it should be possible to filter for WrQuantumEnd switch out reasons to get the same data over a longer period of time like with sampling. In practice this does not work since in the VS startup example not a single thread is ever switched out because its quantum has expired. The table below shows a typical thread switch out distribution. Executive means that the thread entered a WaitForSingleObject, which is nearly the only reason why a thread is switched out.

OldWaitReason      Count
Executive          4377
WrQueue            100
WrDispatchInt      26
UserRequest        10
WrYieldExecution   8
WrLpcReceive       8
WrUserRequest      2
WrLpcReply         1
WrResource         1

After all of the cool context switch stuff we must downgrade to the second CPU view named CPU Usage (Sampled), since this is still the only view where we can reliably track down the call stacks which consume most of the CPU.

image

I ignore threads for now to view an accumulated call stack of CPU usage across all threads in Visual Studio. There it is immediately obvious that Resharper is using a lot of CPU while VS is starting up although it is suspended. I am not sure why in version 8 of their product they still think that startup time is not important and they seem to fully initialize their stuff on the main UI thread. To get a more complete picture I can select an expensive method and group it by function to see who is calling it and how much CPU is consumed down the call chain.

image

Interesting: Resharper has its own brand of MEF embedded which does the initialization. They also seem to use concepts of TPL DataFlow to trigger their components from multiple threads. That certainly looks sophisticated but the CPU graph shows that very little multi threading is happening during startup. On my 8 core machine very little action goes above single core usage. I do not know why they need to initialize so much of their stuff on the UI thread only to add some menus to Visual Studio while being suspended.

Looks like we have a suspect. In case of doubt let's uninstall Resharper 8 and check again. 

image

The CPU consumption has dropped from 9,6s down to 2,8s. Startup time has dropped to 8,2s.

Time in s          With Resharper 8    Without Resharper    Difference
CPU Consumption    9,6                 2,8                  6,8
Startup Time       13,1                8,2                  4,9

We have huge savings in terms of CPU consumption here which translate almost 1:1 to reduced startup time. The Dataflow stuff of Resharper is able to move 14% of the CPU work onto background threads during the startup of VS, which helps a little bit, but still the amount of consumed CPU is enormous for a disabled plugin. To be fair, on my Windows 8.1 box with an SSD the startup time of VS 2013 with Resharper is only 3,1s vs 1,9s without it. On good dev machines this is not a big deal, but on slower machines a disabled Resharper is quite noticeable. That's it for the CPU intensive stuff. Now let's check out the flat lines in CPU consumption.

What is causing such long delays during the startup of VS? It looks like the main thread is blocked for a long time. After reading so far your natural reaction must be: Open CPU Usage Precise and check out the Wait and Ready times! In this case the UI thread was blocked by something not very complex:

image

While Visual Studio is starting up it tries to initialize Direct3D as part of a WPF application. Then Nvidia kicks in (nvd3dum.dll) which calls into some nvSCPAPI.dll which seems to poll for something for over 5,4s. I have no idea why this is happening but this feels very wrong. The machine in question has some NVidia Quadro card installed and only NVidia knows why on some machines (not all machines exhibit this behavior) it tries to enable support for stereoscopic gaming. Since NVidia does not release pdbs for their drivers I cannot dig any deeper here, but this is a major issue which hits all WPF enabled applications running on NVidia boards under still unknown circumstances. To prove that the nvSCPAPI.dll was to blame I simply renamed the folder where the dll is located

C:\Program Files (x86)\NVIDIA Corporation\3D Vision

to something different. Now the startup time is down to 2,6s which feels really good. I kept the renamed folder since I do not need stereoscopic gaming.

image

It is nice to see that WPT can help in these simple cases to find the root issues within minutes. I hope that JetBrains can make their good product even better by further reducing the startup time while it is suspended, and that NVidia fixes their drivers to prevent searching in vain for not installed hardware on some machines. When NVidia fixes this issue it could help to convince some people that it is not WPF and .NET which are slow, but that after their graphics card driver update some apps start much faster than before.


Friday, April 4, 2014 #

Finally, after long years of stagnation, the JIT compiler got significant attention to support SIMD instructions. As an experienced .NET developer you might think: SIMD? What? Never heard of it! Don't worry: you will hear the same from most experienced C/C++ developers. SIMD stands for Single Instruction Multiple Data and is more a geek thing for performance addicts.

SIMD support means that the compiler (Intel, MSVC++ and now the .NET JIT) can emit assembly instructions that take advantage of CPU features which execute one operation (e.g. add) on multiple inputs at once. Why should I care? My Core i7 has 4 physical cores (8 with hyper threading). Now let's add SIMD support to the Mandelbrot sample.

image

Multithreading stays important, but when you can get a speedup of 3,4 with a single core simply by employing SIMD, that is something you cannot ignore. If your server today needs 40 cores and you can cut the number of cores down by a factor of 3,4, you can spend the saved money on other things than hardware (e.g. paying developers to program SIMD enabled code ;-) ). This is all nice and well but how fast can it get? When I look at the generated assembler code

d:\Source\SIMDSample\C#\Mandelbrot\VectorFloatStrict.cs @ 162:
00007ffd`e99d2135 450f58f4        addps   xmm14,xmm12
00007ffd`e99d2139 4183461004      add     dword ptr [r14+10h],4
00007ffd`e99d213e 410f28c6        movaps  xmm0,xmm14
00007ffd`e99d2142 410f28c8        movaps  xmm1,xmm8
00007ffd`e99d2146 660fefd2        pxor    xmm2,xmm2
00007ffd`e99d214a 0f28d0          movaps  xmm2,xmm0
00007ffd`e99d214d 0fc2d102        cmpleps xmm2,xmm1
00007ffd`e99d2151 660f76c0        pcmpeqd xmm0,xmm0
00007ffd`e99d2155 0f28ca          movaps  xmm1,xmm2
00007ffd`e99d2158 660f76c8        pcmpeqd xmm1,xmm0
00007ffd`e99d215c 660f70d94e      pshufd  xmm3,xmm1,4Eh

I see these xmm registers. These are 128 bit wide SSE registers which are the foundation for the current SIMD support of the JIT compiler. But this is still not the end. My Haswell CPU supports AVX2 which can cope with 256 bit wide registers, which should give me a speedup of 8. AVX-512 will arrive in some time giving us 512 bit wide registers. We can then speed up a single thread by up to a factor of 16! In theory we should be able to get this graph with an AVX-512 enabled CPU and a perfect .NET JIT compiler

image

For some problem domains it can happen that SIMD support may be more important than using more threads. You can cope with heavy load on a single core without the need to use many threads. That can help servers which are shared by many users a lot. Here are the facts how to get started with SIMD in .NET:

  • Does .NET use SIMD automatically for my application?

No. You need to use a specialized SIMD library which offers SIMD data types which can be translated by the JIT compiler to SIMD instructions on CPUs supporting it.

  • Can I use the normal .NET Framework?

No. You need to download and install the new RyuJIT CTP3

  • Does it work on Debug Builds?

No. SIMD instructions are only used for Release builds with RyuJIT enabled (COMPLUS_AltJit=*) and SIMD instructions enabled (COMPLUS_FeatureSIMD=1) for this CTP on Windows 8 and above for x64 applications only.
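Translated into commands, enabling the CTP for a test run could look roughly like this (a sketch using the environment variable names quoted above, set in the console from which you start your x64 Release build; MyMandelbrotApp.exe stands for your own executable):

set COMPLUS_AltJit=*
set COMPLUS_FeatureSIMD=1
MyMandelbrotApp.exe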

  • How does SIMD code look compared to non-SIMD code?

Here is the "normal" Modelbrot core method

C#\Mandelbrot\ScalarFloat.cs

        // Render the fractal with no data type abstraction on a single thread with scalar floats
        public void RenderSingleThreadedNoADT(float xmin, float xmax, float ymin, float ymax, float step)
        {
            int yp = 0;
            for (float y = ymin; y < ymax && !Abort; y += step, yp++)
            {
                int xp = 0;
                for (float x = xmin; x < xmax; x += step, xp++)
                {
                    float accumx = x;
                    float accumy = y;
                    int iters = 0;
                    float sqabs = 0f;
                    do
                    {
                        float naccumx = accumx * accumx - accumy * accumy;
                        float naccumy = 2.0f * accumx * accumy;
                        accumx = naccumx + x;
                        accumy = naccumy + y;
                        iters++;
                        sqabs = accumx * accumx + accumy * accumy;
                    } while (sqabs < limit && iters < max_iters);
                    DrawPixel(xp, yp, iters);
                }
            }
        }

And here is the SIMD version. It is not very different, but the performance differs a lot. If you use double types the benefit will be smaller (factor 2) because a double uses 8 bytes of memory whereas single precision (float) values use only 4 bytes. You might ask what this version with the strange name NoADT is. It means No Abstract Data Types. The other version uses its own Complex data type which is a lot slower. This should be improved in future CTPs, either by making the BCL Complex data type SIMD enabled or by optimizing custom data types which contain SIMD types in general.

C#\Mandelbrot\VectorFloat.cs

        // Render the fractal on a single thread using raw Vector<float> data types
        // For a well commented version, go see VectorFloatRenderer.RenderSingleThreadedWithADT in VectorFloat.cs
        public void RenderSingleThreadedNoADT(float xmin, float xmax, float ymin, float ymax, float step)
        {
            Vector<float> vmax_iters = new Vector<float>(max_iters);
            Vector<float> vlimit = new Vector<float>(limit);
            Vector<float> vstep = new Vector<float>(step);
            Vector<float> vxmax = new Vector<float>(xmax);
            Vector<float> vinc = new Vector<float>((float)Vector<float>.Length * step);
            Vector<float> vxmin = VectorHelper.Create(i => xmin + step * i);

            float y = ymin;
            int yp = 0;
            for (Vector<float> vy = new Vector<float>(ymin); y <= ymax && !Abort; vy += vstep, y += step, yp++)
            {
                int xp = 0;
                for (Vector<float> vx = vxmin; Vector.LessThanOrEqualAll(vx, vxmax); vx += vinc, xp += Vector<int>.Length)
                {
                    Vector<float> accumx = vx;
                    Vector<float> accumy = vy;

                    Vector<float> viters = Vector<float>.Zero;
                    Vector<float> increment = Vector<float>.One;
                    do
                    {
                        Vector<float> naccumx = accumx * accumx - accumy * accumy;
                        Vector<float> naccumy = accumx * accumy + accumx * accumy;
                        accumx = naccumx + vx;
                        accumy = naccumy + vy;
                        viters += increment;
                        Vector<float> sqabs = accumx * accumx + accumy * accumy;
                        Vector<float> vCond = Vector.LessThanOrEqual<float>(sqabs, vlimit) &
                            Vector.LessThanOrEqual<float>(viters, vmax_iters);
                        increment = increment & vCond;
                    } while (increment != Vector<float>.Zero);

                    viters.ForEach((iter, elemNum) => DrawPixel(xp + elemNum, yp, (int)iter));
                }
            }
        }
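
The sample above uses two small helpers, VectorHelper.Create and the ForEach extension, which belong to the sample code and not to the SIMD library itself. As a rough orientation, here is a minimal sketch of what they might look like; the exact signatures in the CTP samples may differ, and I only assume that Vector<float> exposes the float[] constructor, CopyTo and the static Length property already used in the code above.

using System;
using System.Numerics;

static class VectorHelper
{
    // Build a Vector<float> whose element i is init(i), e.g. xmin + step * i
    public static Vector<float> Create(Func<int, float> init)
    {
        var values = new float[Vector<float>.Length];
        for (int i = 0; i < values.Length; i++)
        {
            values[i] = init(i);
        }
        return new Vector<float>(values);
    }

    // Invoke action(element, elementIndex) for every lane of the vector
    public static void ForEach(this Vector<float> v, Action<float, int> action)
    {
        var values = new float[Vector<float>.Length];
        v.CopyTo(values);
        for (int i = 0; i < values.Length; i++)
        {
            action(values[i], i);
        }
    }
}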

This is a long awaited step in the right direction for the JIT compiler. Finally you can write performance sensitive code without much hassle in .NET as well. Still, the JIT compiler could become smarter for many common code constructs and emit more efficient code.


Monday, March 24, 2014 #

.NET is, like Java, a very productive environment. The increased convenience comes at the cost of less control over your execution environment. In particular the burden of manual memory management is much less of a problem in managed environments. Garbage collection does not prevent memory leaks (statics, timers, forgotten event unregistrations and others still remain) but it makes them considerably harder to introduce, since no one can forget a delete pData anymore. A garbage collector is a very nice thing, but it can get in your way if you want to provide a responsive application. When your customers complain that the UI is frozen or that the UI redraw causes flickering all over the place, you need to take a deeper look under the .NET covers.

How does Garbage Collection affect application performance?

As the name collection suggests it needs to do something. You can expect increased CPU usage while a garbage collection is running in the background. .NET 4.5 has made significant progress in performing more work during a garbage collection on dedicated garbage collection threads. CPU usage is normally not a problem since it does not eat up all of your cores. The prime reason why people complain about garbage collection is that their application is stopped while a full garbage collection is running. This behavior is called a stop-the-world collector: the garbage collector stops all application threads, looks at all objects, finds the garbage and compacts memory to prevent memory fragmentation. For the last step it is very important that the application does not run, since the addresses of objects can change. It would be a disaster if your application accessed a memory location that now contains data of completely unrelated objects because the address of your object has changed but the garbage collector forgot to update your pointer while your thread was frozen. This applies not only to addresses on the managed heap but also to thread stacks and CPU registers. Every object address your application has access to needs to be updated by the GC somehow. The first question you might ask: is there a way around stopping all threads so the GC can update object addresses? The answer is yes, there are garbage collectors out there that do not stop all application threads, like the Azul C4. It can collect heaps of hundreds of GB in size with maximal delays of less than 10ms. One secret ingredient to make this possible seems to be a kernel driver that locks portions of memory which are being compacted to prevent writes to them. I am not sure how they handle reads, but the docs talk about Loaded Value Barriers to ensure that object references are always valid. This looks like some overhead for every object reference access, which can be substantial. Besides this it has given up the concept of generations in favor of a fully asynchronous garbage collector.

Without having measured it I would conclude from the docs that you pay for low latency with slower object reference access and slower memory reclamation, which increases the working set of your application. Every approach to memory management comes with its own set of drawbacks:

  • Manual Memory Management (e.g. C/C++)
    • High Complexity
    • Low CPU Consumption
    • Low Memory Consumption
    • Low Latency

 

  • Garbage Collected Environment (e.g. Java, .NET)
    • Lower complexity
    • Increased CPU Consumption
    • Increased Memory Consumption
    • Potentially High Latency

Is this conclusion correct? Only under the assumption that you need to allocate memory. What is the most efficient way to allocate memory? No allocation at all. The secret to real time performance in any environment is to make the hot path of your application allocation free. When you need hard real time (e.g. avionics) not only allocations are banned but also exceptions, because you cannot predict how long exception handling will take. Very few systems need to cope with hard real time. For 99% of all software you can live with small millisecond pauses, even if it means that your real time trading system will issue a million dollar order after your competitors. You can do this in C/C++, but there are also ways to become (mostly) allocation free in .NET.

In .NET and Java object pooling is an easy way to prevent future object allocations. This works to some extent, but you cannot pool data types like strings, which are immutable and can therefore not be changed later. Perhaps we will some day get mutable strings and UTF-8 strings to keep the memory footprint low and the throughput high. Besides becoming allocation free you can also try to allocate fewer objects (profile your app) or more value types to reduce the memory footprint, which results in smaller heaps that can be collected faster. In .NET there are also different GC flavors available (server and workstation). The default is the workstation GC. Starting with .NET 4.5 both GC flavors are concurrent and employ a background GC thread that scans for garbage while your application threads are still running, which already reduces latency significantly.
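
To make the pooling idea concrete, here is a minimal sketch of a thread-safe object pool based on ConcurrentBag. The names SimpleObjectPool, Rent and Return are my own and not a BCL API; a real pool would also bound its size and reset the object state.

using System.Collections.Concurrent;

// A minimal thread-safe pool; allocates only when the pool is empty.
class SimpleObjectPool<T> where T : class, new()
{
    readonly ConcurrentBag<T> _items = new ConcurrentBag<T>();

    public T Rent()
    {
        T item;
        return _items.TryTake(out item) ? item : new T(); // reuse if possible, otherwise allocate
    }

    public void Return(T item)
    {
        _items.Add(item); // the caller is responsible for resetting the object state
    }
}

Instead of new ReferenceContainer() in the hot loop you would then call pool.Rent() and hand the object back with pool.Return() once you are done with it.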

I have seen an improvement in run time behavior of larger apps of ca. 15% just by switching from .NET 4.0 to .NET 4.5. A good portion of it can be attributed to the improved GC. Let's check how the GC behaves on my Windows 8.1 box with .NET 4.5.1 with a simple sample application. Here is the code

class Program
{
    static int N = 25 * 1000 * 1000;  // Execute allocations in a loop 25 million times
    static int Clear = 1000 * 1000;   // Clear list after every 1 million allocations to give GC 25 chances to clear things up

    static void Main(string[] args)
    {
        // do some warmup
        AllocateRefContainer();
        GC.Collect();
        GC.Collect();

        var sw = Stopwatch.StartNew();
        ExecuteParallel(() => AllocateRefContainer(), "AllocateRefContainer");
        sw.Stop();
        Console.WriteLine("RefContainer Allocation {0:F2}s, {1:N0} Allocs/s", sw.Elapsed.TotalSeconds, N / sw.Elapsed.TotalSeconds);

        GC.Collect();
        GC.Collect();
        sw = Stopwatch.StartNew();
        ExecuteParallel(() => AllocateValueContainer(), "AllocateValueContainer");
        sw.Stop();
        Console.WriteLine("ValueContainer Allocation {0:F2}s, {1:N0} Allocs/s", sw.Elapsed.TotalSeconds, N / sw.Elapsed.TotalSeconds);

        GC.Collect();
        GC.Collect();
        sw = Stopwatch.StartNew();
        ExecuteParallel(() => AllocateValueTypes(), "AllocateValueTypes");
        sw.Stop();
        Console.WriteLine("ValueType Allocation {0:F2}s, {1:N0} Allocs/s", sw.Elapsed.TotalSeconds, N / sw.Elapsed.TotalSeconds);

        GC.Collect();
        GC.Collect();
        sw = Stopwatch.StartNew();
        ExecuteParallel(() => PooledRefContainer(), "PooledRefContainer");
        sw.Stop();
        Console.WriteLine("PooledRefContainer Allocation {0:F2}s, {1:N0} Allocs/s", sw.Elapsed.TotalSeconds, N / sw.Elapsed.TotalSeconds);
        Console.WriteLine("Private Bytes: {0:N0} MB", Process.GetCurrentProcess().PrivateMemorySize64 / (1024 * 1024));
        Console.ReadLine();
    }

    class ReferenceContainer // Class with one object reference
    {
        public ReferenceContainer Obj;
    }

    static ThreadLocal<List<ReferenceContainer>> RContainer = new ThreadLocal<List<ReferenceContainer>>(() => new List<ReferenceContainer>());
    static void AllocateRefContainer()
    {
        var container = RContainer.Value;
        for(int i=0; i< N;i++)
        {
            container.Add(new ReferenceContainer());
            Calculate();
            if (i % Clear == 0)
            {
                container.Clear();
            }
        }
    }

    class ValueContainer // Class with one pointer sized value type
    {
        IntPtr Obj;
    }
    static ThreadLocal<List<ValueContainer>> VContainer = new ThreadLocal<List<ValueContainer>>(() => new List<ValueContainer>());
    static void AllocateValueContainer()
    {
        var container = VContainer.Value;
        for (int i = 0; i < N; i++)
        {
            container.Add(new ValueContainer());
            Calculate();
            if (i % Clear == 0)
            {
                container.Clear();
            }
        }
    }

    // The object header overhead is 8 bytes on x86 and 16 bytes on x64.
    // We add the payload to stay close in size to the reference types so we get a true comparison when
    // the actual amount of data which needs to be moved is equal.
    struct ValueType
    {
        public IntPtr SyncBlock;
        public IntPtr MT;
        public IntPtr Payload;
    }

    static ThreadLocal<List<ValueType>> ValueTypeContainer = new ThreadLocal<List<ValueType>>(() => new List<ValueType>());
    static void AllocateValueTypes()
    {
        var container = ValueTypeContainer.Value;
        for (int i = 0; i < N; i++)
        {
            container.Add(new ValueType());
            Calculate();
            if (i % Clear == 0)
            {
                container.Clear();
            }
        }
    }

    // Uses Reference Container but allocates the objects only once and does not clear the array but
    // overwrites the contents 
    static void PooledRefContainer()  
    {
        bool bAllocate = true;
        var container = RContainer.Value;
        for (int i = 0; i < N; i++)
        {
            if (bAllocate)
            {
                container.Add(new ReferenceContainer());
            }
            else
            {
                var tmp = container[i % container.Count]; // Grab one item from the list within its bounds
                tmp.Obj = null; // set new value 
            }

            Calculate();
            if (i % Clear == 0)
            {
                bAllocate = false;
            }
        }
    }

    // Simulate some CPU only calculation
    static void Calculate()
    {
        long lret = 0;
        for (int i = 0; i <  100; i++)
        {
            lret++;
        }
    }

    // Execute a scenario on 4 different threads
    // The strange Directory.GetFiles calls are there to mark the start and stop of one scenario when file access ETW tracing is enabled 
    static void ExecuteParallel(Action acc, string scenario)
    {
        Directory.GetFiles("C:\\",String.Format("ETW ExecuteParallel Start {0}", scenario));
        Parallel.Invoke(Enumerable.Repeat(acc, 4).ToArray());
        Directory.GetFiles("C:\\", String.Format("ETW ExecuteParallel Stop {0}", scenario));
    }
}

The tests were performed on this machine:

image

We basically allocate 100 million objects (25 million on each of 4 threads) concurrently with various object designs.

  • Class with one object reference
  • Class with one pointer sized value
  • Struct of the same size like the class
  • Pooled class with one object reference

To enable the server GC you need to add to your app.config

<?xml version="1.0" encoding="utf-8" ?>
<configuration>
  <runtime>
    <gcServer enabled="true"/>
  </runtime>
</configuration>
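
To verify at runtime which GC flavor is actually active you can query GCSettings from System.Runtime. This is just a small sanity-check sketch; the output strings are my own.

using System;
using System.Runtime;

class GcFlavorCheck
{
    static void Main()
    {
        // True when <gcServer enabled="true"/> was picked up from the App.config
        Console.WriteLine("Server GC:    {0}", GCSettings.IsServerGC);
        // Interactive indicates that the concurrent (background) GC is enabled
        Console.WriteLine("Latency mode: {0}", GCSettings.LatencyMode);
    }
}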

GC Behaviors (Server, Desktop in x86/x64)

When you measure the timings you get some interesting graphs.

image

The first obvious difference is that x64 is faster than x86, for value types by up to a factor of 2. In the profiler we can see that the main difference comes from the relocate phase, where 64 bit achieves a much higher throughput. The numbers on the x axis are the number of samples taken by the kernel.

image

A net win of 30% just by switching from x86 to x64 is nice. But it comes at the cost of a higher memory footprint (factor 2). When you look closer you see that the effect of server GC is much bigger than 30%. The same method ran nearly 6 times faster (e.g. check out RefContainer x64 Default GC vs x64 Server GC)! As with every miracle there are also some hidden costs associated with it. In this case it is memory consumption. When server GC is enabled the CLR allocates a managed heap for each logical core. If you have a 40 core machine with many processes running things concurrently, you will quickly run out of memory if you are not careful in your application design.

The main differences between workstation and server GC are

  • One GC Thread per core
  • One GC Heap per core

The allocations which previously happened on one heap are now spread over 4 of them. Each heap is compacted independently by its own GC thread. In theory we should have 4 times fewer GCs, since each heap can become as big as the original one, and since we now have 4 GC threads we should be able to allocate 16 times faster. This is not what happens in reality, since total memory consumption has its limits, which with this workload seems to be a factor of 3. Besides this the GC must also cross check the other heaps because objects can have dependencies across heaps. If no GC heap cross checks were needed we should be 12 times faster, but we are "only" 6 times faster, which shows that the heap cross checks are not free but necessary.

The test suite consumed nearly three times more memory while running nearly 6 times faster. Enabling server GC can be a good thing if you have plenty of free memory and you do not allocate huge amounts of data on many different threads.

 

image

When you look at the CPU consumption with default GC enabled in x64 Release

image

and here the same scenario with server GC

image

The GC suspension times have dropped dramatically and at the same time the summed CPU consumption has dropped by 30%. Except for the memory consumption this is a big win in terms of GC induced pause times and overall application performance. But you need to be aware that your now very scalable app will spike to 100% CPU from time to time to perform full GCs, which can become a problem if other work is waiting which also needs all cores.

A very useful visualization is to select the GC Suspension events and then right click in the graph to enable highlighting of the current selection (the green bars). Now you can search the other graphs for e.g. phases of no CPU activity to find out why your application was doing nothing, and immediately see whether a GC induced latency issue was responsible. This helps a lot to differentiate between GC issues and other problems in your application logic.

I have beefed up my original GC regions file with GC suspension visualization which now looks like this:

GC_SuspendRegions.xml

<?xml version='1.0' encoding='utf-8' standalone='yes'?>
<?Copyright (c) Microsoft Corporation. All rights reserved.?>
<InstrumentationManifest>
  <Instrumentation>
    <Regions>
     <RegionRoot Guid="{d8d639a0-cf4c-45fb-976a-0000DEADBEEF}" Name="GC" FriendlyName="GC Activity">
              <Region Guid="{d8d639a1-cf4c-45fb-976a-100000000101}" Name="GCs" FriendlyName="GCs">
              <Region Guid="{d8d639a1-cf4c-45fb-976a-000000000001}" Name="GCStart" FriendlyName="GC Start">
                <Start>
                  <Event Provider="{e13c0d23-ccbc-4e12-931b-d9cc2eee27e4}" Id="1" Version="1"/>
                </Start>
                <Stop>
                  <Event Provider="{e13c0d23-ccbc-4e12-931b-d9cc2eee27e4}" Id="2" Version="1"/>
                </Stop>
             </Region>
             </Region>
                <Region Guid="{d8d639a2-cf4c-45fb-976a-000000000003}" Name="GCSuspends" FriendlyName="GC Suspensions">
                   <Region Guid="{d8d639a2-cf4c-45fb-976a-000000000002}" Name="GCSuspend" FriendlyName="GC Suspension">
                
                <Start>
                  <Event Provider="{e13c0d23-ccbc-4e12-931b-d9cc2eee27e4}" Id="9" Version="1"/>
                </Start>
                <Stop>
                  <Event Provider="{e13c0d23-ccbc-4e12-931b-d9cc2eee27e4}" Id="3" Version="1"/>
                </Stop>
               </Region>
              </Region>
     </RegionRoot>
  </Regions>
  </Instrumentation>
 </InstrumentationManifest>

To be able to record the GC events with the easy to use WPRUI tool I have created an extra profile

GC.wprp

<?xml version="1.0" encoding="utf-8"?>
<WindowsPerformanceRecorder Version="1.0">
  <Profiles>
    <EventCollector Id="EventCollector_MyEventSource" Name="GC">
      <BufferSize Value="2048" />
      <Buffers Value="1024" />
    </EventCollector>
    <EventProvider Id=".NETCommonLanguageRuntimeGCTracking" Name="e13c0d23-ccbc-4e12-931b-d9cc2eee27e4" >
     <Keywords>
     <!-- GC 0x1 and 0x8000 Exceptions -->
     <Keyword Value="8001"/>
     </Keywords>
    </EventProvider>
    <Profile Id="GC.Verbose.File" Name="GC" Description=".NET GC Tracking"  Base="" LoggingMode="File" DetailLevel="Verbose">
      <Collectors Operation="Add">
        <SystemCollectorId Value="SystemCollector_WPRSystemCollectorInFile">
          <SystemProviderId Value="SystemProvider_Base"/>
        </SystemCollectorId>
        <EventCollectorId Value="EventCollector_MyEventSource">
          <EventProviders>
            <EventProviderId Value=".NETCommonLanguageRuntimeGCTracking" />
            <EventProviderId Value="EventProvider_DotNetProvider" />
          </EventProviders>
        </EventCollectorId>
      </Collectors>
    </Profile>
    <Profile Id="GC.Verbose.Memory" Name="GC" Description=".NET GC Tracking" Base="GC.Verbose.File" LoggingMode="Memory" DetailLevel="Verbose" />    
  </Profiles>
</WindowsPerformanceRecorder>

When you load this file into WPRUI you enable the necessary GC ETW events to track GC pause times with the region file above.

image

Armed with these tools and visualization capabilities it should be easy to see whether enabling server GC helps at the critical points of your own application. You will not measure any difference if your scenario is IO or network bound and very few GCs are happening. But if your scenario is GC heavy you can check whether these spots can be improved by simply switching a configuration flag in your App.config or whether you need to rewrite your allocation strategy.

Did I forget anything? Yes. How do you navigate in an ETL file when your application has no ETW provider enabled? You only need to know that the kernel already has an ETW provider for all file IO operations. If you want to issue an ETW event at the start/stop of a scenario you are interested in, it is sufficient to do a

Directory.GetFiles("C:\\",String.Format("ETW ExecuteParallel Start {0}", scenario));

to get nice ETW events via file IO tracing which you can search for. Searching for a nonexistent file is quite a fast operation which takes about 20us per call on my machine. The ETW collection overhead was within the usual measurement-to-measurement differences and not a visible effect at all. This is of course not always true. If you have a network intensive app running and you use PerfView, you also get network correlation tracing enabled by default, which adds quite some overhead to all socket operations. In the CPU profiling data networking operations then look much more expensive than they really are. You always need to take a sharp look at your data to check whether it looks reasonable.
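
If you use this trick more often it can be handy to wrap it in a tiny helper. A minimal sketch could look like this; the marker text is only a convention I made up, any search string that does not match an existing file will do.

// Drop a cheap marker into the kernel file IO ETW stream. Nothing is actually read because
// no file matches the search pattern; the pattern itself becomes the payload you search for.
static void WriteEtwMarker(string scenario, string state)
{
    System.IO.Directory.GetFiles("C:\\", string.Format("ETW {0} {1}", scenario, state));
}

// Usage: WriteEtwMarker("ExecuteParallel", "Start"); ... WriteEtwMarker("ExecuteParallel", "Stop");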




image

When you are doing sampling profiling you should have a look at the article from Vance to be sure that you have enough samples in your data, so you know you are seeing a real effect and not chasing some ghost. I hope I have convinced you that the CLR emits much more useful data than you were ever able to collect from simple performance counters, which can give you only very limited insights.


Monday, February 17, 2014 #

When dealing with application responsiveness issues you often need to get data from the actual users to be sure what is going on. There are plenty of reasons why your application does not respond. Most prominent examples are

  • Network
    • A bad network connection can totally ruin the user experience. Wireshark is your best friend to check if you have a network issue at all.
  • Paging
    • If your application has a high memory footprint you will experience sudden slowdowns because the OS silently paged out unused memory. When you touch it again the memory needs to be read back in from the page file, which is about 1000 times slower than plain memory access. An SSD can greatly improve the user experience. (Check the performance counter Memory\Pages Input/sec.)
  • Disc IO
    • If your app hangs because it is busy writing 500MB to disc: you should never do this on the UI thread, although it is much easier to program.
  • Garbage Collection

After you have looked at network (we need that data), paging (we need so much memory) and disc IO (we need to read/write so much data) you should also check if garbage collection is an issue. GC times can become quite high if you create a lot of temporary objects. Especially de/serialization is a source of pain for the GC. How do you find out if you have a GC issue? First of all you need to check the usual performance counters, which tell you how often the GC kicks in and how much CPU is burned by cleaning up your mess. If you stay low, e.g. 10-20%, then you usually do not need to worry much about it, except if your application hangs too often during a UI interaction. In that case you need to dig deeper.
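
If you prefer to check this from code rather than from Perfmon, a rough sketch could look like the following. I assume that the standard ".NET CLR Memory" counter category is available and that the counter instance name equals the process name (it can carry a #n suffix when several instances of the same executable are running).

using System;
using System.Diagnostics;
using System.Threading;

class GcTimeCheck
{
    static void Main()
    {
        string instance = Process.GetCurrentProcess().ProcessName;
        using (var gcTime = new PerformanceCounter(".NET CLR Memory", "% Time in GC", instance))
        {
            gcTime.NextValue();   // the first read only primes the counter
            Thread.Sleep(1000);   // let some GC activity happen
            Console.WriteLine("% Time in GC: {0:F1}", gcTime.NextValue());
        }
    }
}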

The best tool to check managed memory issues including GC induced latency is PerfView from Vance Morrison. When you uncheck everything except "GC Collect Only" you can let it run for a long time on any machine to check how many GCs you have.

image

 

When you have pressed Start/Stop Collection you get an etl file that contains a GCStats section:

image

To complete your analysis you need to double click GCStats to get a nice overview for each process how much memory it did allocate and how much GC pause times your application did experience.

image

If you want to drill deeper you can export the data to Excel and check every single GC event if you wish. That is all nice, but what if the customer does not want to install strange tools on their machines? In that case you can still get the data and analyze it on another machine. PerfView relies on the CLR ETW providers, which can also be enabled with the command line tool logman. Logman has been part of Windows since XP. To check which data the CLR ETW provider can give you, you can execute:

logman query providers ".NET Common Language Runtime"   

The most interesting keywords are

  • 0x0000000000000001  GCKeyword: GC
  • 0x0000000000000004  FusionKeyword: Binder (Log assembly loading attempts from various locations)
  • 0x0000000000000008  LoaderKeyword: Loader (Assembly Load events)
  • 0x0000000000008000  ExceptionKeyword: Exception

The most prominent keyword is 0x1, which is GC. The other very interesting keyword is Exception (0x8000), which logs every thrown exception in all managed applications. If you want to check for GC latency issues and all exceptions I have created a little script to automate this task:

gcEvents.cmd

@echo off
REM enable GC Events 0x1
REM Exception tracing is 0x8000
logman start clr -p ".NET Common Language Runtime" 0x1,0x8000 0x5 -ets -o "%temp%\logmanclr.etl" -ct perf -bs 1024 -nb 50 500 -max 500
REM Needed to decode clr events later
logman start "NT Kernel Logger" -p "Windows Kernel Trace"  (process,thread)  -ets -o "%temp%\logmanKernel.etl" -ct perf -bs 1024 -nb 50 500 -max 500
pause
logman -stop clr -ets
logman -stop "NT Kernel Logger" -ets
if "%1" EQU "-Merge" (
xperf -merge "%temp%\logmanclr.etl"  "%temp%\logmanKernel.etl" "%temp%\merged.etl"
)

You can send your customer this script and tell him to start it, execute the use case and then press any key to stop collecting data. The result are two files named logmanclr.etl and logmanKernel.etl which can be zipped and mailed back to you. On your machine you need to merge the two etl files to be able to load the data into PerfView. This is what the script does with the -Merge option if you have the Windows Performance Toolkit installed. The other logman options are there to prevent losing events even if very many of them come in, and to prevent filling up the hard disc by allowing a maximum size of 500MB per file.

For GC relevant data PerfView is the best tool to analyze the captured data. If you are after exceptions you can also use WPA to check which exceptions were thrown while your and all other applications were running on this machine. This will give you only the exception type and exception message but no stacks, because only a very limited set of ETW providers is enabled, which is well suited for long running tests.

image

Since each garbage collection has a start and stop event which roughly correlates with the pause times of your application, you can also create a regions file to visualize the GC events. This makes it extremely simple to check if your app was hanging because some long GC was running which did (probably) suspend your thread while it tried to allocate more memory. When concurrent GC is enabled you can allocate data and even new GC segments (for Gen 0/1) but no large objects or Gen 2 objects. Most of the time the GC time does correlate with app pause times.

image

Here is the regions file I came up with to make the graph above.

gcRegions.xml

<?xml version='1.0' encoding='utf-8' standalone='yes'?>
<?Copyright (c) Microsoft Corporation. All rights reserved.?>
<InstrumentationManifest>
  <Instrumentation>
    <Regions>
     <RegionRoot Guid="{d8d639a0-cf4c-45fb-976a-0000DEADBEEF}" Name="GC" FriendlyName="GC Times">
          <Region Guid="{d8d639a0-cf4c-45fb-976a-000000000001}" Name="GCStart" FriendlyName="GC Start">
            <Start>
              <Event Provider="{e13c0d23-ccbc-4e12-931b-d9cc2eee27e4}" Id="1" Version="0"/>
            </Start>
            <Stop>
              <Event Provider="{e13c0d23-ccbc-4e12-931b-d9cc2eee27e4}" Id="2" Version="0"/>
            </Stop>
          </Region>

          <Region Guid="{d8d639a0-cf4d-45fb-976a-000000000002}" Name="GCStart_V1" FriendlyName="GC">
            <Start>
              <Event Provider="{e13c0d23-ccbc-4e12-931b-d9cc2eee27e4}" Id="1" Version="1"/>
            </Start>
            <Stop>
              <Event Provider="{e13c0d23-ccbc-4e12-931b-d9cc2eee27e4}" Id="2" Version="1"/>
            </Stop>
          </Region>

           <Region Guid="{d8d639a0-cf4d-45fb-976a-000000000003}" Name="GCStart_V2" FriendlyName="GC">
            <Start>
              <Event Provider="{e13c0d23-ccbc-4e12-931b-d9cc2eee27e4}" Id="1" Version="2"/>
            </Start>
            <Stop>
              <Event Provider="{e13c0d23-ccbc-4e12-931b-d9cc2eee27e4}" Id="2" Version="1"/>
            </Stop>
          </Region>
          </RegionRoot>
  </Regions>
  </Instrumentation>
</InstrumentationManifest>

Not bad for such a simple file to make WPA GC aware. To make this file active you need to select Trace - Trace Properties in the menu and add the file there. Then a new Regions of Interest graph is added to your existing graphs which you can add to your analysis pane as usual.

image

To check when a GC did start/stop this overview is really nice. If you need a deeper analysis there is no way around PerfView, which gives you all the numbers pre-analyzed in a per process view. When you have managed memory leaks you should check out the Memory menu of PerfView to take GC heap dumps from running processes (nice) but also from memory dumps (even better). The latter functionality is essential if you only have a dump from a crashed process and it is not obvious who is holding most objects. I know .NET Memory Profiler from SciTech which can also import dump files, but I got lost in the fancy UI. PerfView on the other hand looks old fashioned but it does its job much better than the UI might suggest:

image

Here I do see that I keep 100% of my data (10GB in a static variable in 10003 instances), which is very easy to see. Other profilers can do this also very well. But if you double click on the type you get the shortest roots to all objects holding your data nicely printed in a tree view. Cyclic references are broken up and the relevant nodes are marked bold. Other commercial tools had a hard time telling me who is holding my data when graph cycles are involved. PerfView was able to show it to me directly without much navigation in a complex interdependent object graph. To check who is allocating too much data you can also use PerfView when you enable ".NET SamplAlloc" to sample your allocations with minimum performance impact. Then you get the allocation stacks nicely lined up by object type. If you can afford it performance wise you can also enable ".NET Alloc" to track every managed object allocation, which gives you exact numbers. That's all for today. I am constantly amazed how much useful data you can get out of ETW and shocked at the same time that so few people know about it. If your new year's resolution is already history, how about a new goal? Learn how to get the most out of ETW!


Thursday, February 6, 2014 #

Most people using Workflow Foundation only know TrackingParticipants to enable activity tracing. With WF 4 the SQL TrackingParticipant was deprecated and replaced by the EtwTrackingParticipant. WF has been wasteful with system resources since its very beginning, but it has become better with WF 4. If you think that a SQL tracking participant is a good idea, think again about why MS did deprecate it:

[ObsoleteAttribute("The System.Workflow.* types are deprecated.  Instead, please use the new types from System.Activities.*")]
public sealed class SqlTrackingService : TrackingService, 

You should also not try to create your own version of it because it looks like such a brilliant idea. It never was and never will be a well performing tracing destination. Let's see how far we get with the new ETW support for WF. The easiest way for e.g. a WF console application is to add it directly to your WorkflowInvoker:

static void Main(string[] args)
{
    Activity workflow1 = new Workflow1();
    var invoker = new WorkflowInvoker(workflow1);
    invoker.Extensions.Add(new EtwTrackingParticipant());
    invoker.Invoke();
}

Another option is to enable it indirectly via your App.config

<system.diagnostics>
  <sources>
    <source name="System.Activities"
            switchValue="Information, ActivityTracing"
            propagateActivity="true">
      <listeners>
        <add name="console" type="System.Diagnostics.DefaultTraceListener"/>
      </listeners>
    </source>
  </sources>
</system.diagnostics>

There is no mention of ETW in the used listeners but the ETW listener is added automatically when you have trace listeners for System.Activities configured.

Now you can enable ETW tracing with xperf like this:

xperf -on proc_thread+Loader -start workflow -on c651f5f6-1c0d-492e-8ae1-b4efd7c9d503  

The guid is the ETW provider for WCF and WF tracing. Its name is "Microsoft-Windows-Application Server-Applications". This xperf command will create a kernel.etl and a user.etl file in your root folder to which the events are logged. The default buffer size is 20 MB, which you can increase if the event rate is higher than the buffers can be flushed to disc.

To stop tracing you can use

xperf -stop -stop workflow -d c:\temp\workflow.etl   

Finally you want to watch your traces in WPA. Navigate to Generic Events and open the "Microsoft-Windows-Application Server-Applications" event provider:

image

There you see directly how your activities were scheduled and executed along with their activity arguments, as long as they have a string representation. Since we have ETW we can add our own traces and the system wide profiling events as well to get the full picture in one file with a good analysis tool. Besides being central you have also chosen one of the fastest possible ways to write tracing information to an output device. The slowest thing is the DefaultTraceListener in your App.config, which writes the traces via OutputDebugString. These events show up in the VS output window which is nice, but the output speed of OutputDebugString is not great. If you want to get maximum performance you need to write your own NullTraceListener which simply discards all traced data, so that effectively only ETW is written to.
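
Such a NullTraceListener is trivial to write. Here is a minimal sketch (the class name is only the one suggested above); you would register it instead of the DefaultTraceListener in the listeners section of your App.config.

using System.Diagnostics;

// Discards all trace output so that only the automatically added ETW listener does real work.
class NullTraceListener : TraceListener
{
    public override void Write(string message)
    {
    }

    public override void WriteLine(string message)
    {
    }
}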


MS is investing heavily in its ETW infrastructure now and in the future. You can find very good online trainings at the MS Virtual Academy: http://www.microsoftvirtualacademy.com/training-courses/windows-performance-jump-start#?fbid=g4hL6pwMe94. I have watched some videos today and they have really great content online about rare topics. Personally I found the On/Off Transition Experience the most enlightening one. After this video you can rip your slow booting machine apart and make it boot much faster. With the content presented there I was able to cut my boot time down from over two minutes to one minute, which is really great once you know where to look. The biggest issue on my machine was that DebugSvc had over 90s of service init time, which delayed the whole boot and the initialization of all other services for a very long time. By simply disabling the service I got a really good improvement there. Another thing is to check if .NET 2 applications are still running during your boot. Most of them can also run under .NET 4.0 when you change the app.config file and add

<configuration>
   <startup>
     <supportedRuntime version="v4.0"/>
   </startup>
</configuration>

you can convince them to use .NET 4.0. This gets rid of all hard disc accesses to the .NET 2.0 dlls, which can cost you several seconds of hard disc access time. Another issue is SQL Server. Here is a slide which I have shamelessly copied from Analysis at Scale.

image_thumb5

 

This is real world data from over 50 000 computers. After seeing this graph I immediately changed my SQL Server instance to delayed start so it gets out of the way of the other stuff which needs my hard disc more at that time. If you are working on a laptop with a slow hdd you probably go green with envy when you see how much faster your colleagues can work with their "old" desktop PCs. Here is a graph which shows boot time depending on the used hard disc:

image_thumb7

If you need more proof that your money is well spent on a new SSD for your boot drive you can show this to your boss. It really does make a huge difference, not only during boot time. What I found really fascinating in the On/Off Transition video is that there are companies out there with Windows machines that need two hours to boot, and they were ok with that. Well, not entirely. Finally they called in MS support to check their machines, and a hotfix reduced the boot time from 2h down to 8 minutes. A few tweaks later they were down to two minutes. But the most shocking fact was that their previous workaround was that one member of the staff had to arrive two hours earlier than everyone else to boot up all office machines. Incredible. I wonder if the company is still in business if their other business processes are equally efficient…