Alois Kraus


  Home  |   Contact  |   Syndication    |   Login
  107 Posts | 8 Stories | 295 Comments | 162 Trackbacks


Article Categories


Post Categories

Image Galleries


Sunday, September 14, 2014 #

Recently I have seen an interesting performance question on SO which deals with Parallel.For in .NET. The code was basically

            i => {});

The actual work is simulated by an empty delegate which is an easy measure the overhead of Parallel.For. When you let this run on some multicore machines in x86 Release you get number like


The running time is not constant as one might expect but sometimes there are slowdowns in the order of 6 times or more. With WPT is was quickly able to find as root cause the parallel for loop counter increment method:


It is used to increment the loop counter in a lock free way. With all cores in use we have massive memory contention on it but not all the time. By measuring the counts per thread I could see that sometimes one thread gets much less work than the others but there was no clear pattern which did relate to the slowdowns. Since we deal here with memory contention it should surface in L2/L3 cache misses because some other thread constantly manages it to overwrite the value while he is rarely able to update the counter.

The ExchangeAdd64 method uses basically the following approach:

        static void InterlockedAdd()
                    old = Counter;
                    @new = Counter + Increment;
                while( old != Interlocked.CompareExchange(ref Counter, @new, old));

It adds some value to the existing counter and then tries to atomically exchange it. This seems to work in x64 all the time not if you let it run as 32 bit application.

The 64 bit code uses

lock xadd qword ptr [rdx+8],rax ds:xxxxx

whereas the x86 version uses

lock cmpxchg8b qword ptr [edi] ds:xxx

Both approaches use long (64 bit) values although the Parallel.For counter we did use is an integer counter only. This seems to cause funny side effects on the CPU cache lines. There exist no performance counters to measure cache miss rates which makes it hard to verify this claim. But Intel has released the free Performance Counter Monitor (pcm.exe)  which offers much deeper insights what your processor is currently doing. To compile it and to get it working is not entirely straightforward since you need to compile a kernel driver and install it as test signed or you download a publicly signed version of it from the net. The Intel devs seems to be more Linux guys since the project files can be compiled with VS2013 but for every project the binaries end up in a sub folder of the project in the Debug folder (even for release). There is NO common bin folder were all binaries are compiled together. Anyway. The most interesting tool is pcm.exe which can show you per core statistics how many instruction have been executed and how the cache hit/miss counters are for the L2/L3 caches. The executed instructions per core show a sharp decrease which explains why the running time of the Parallel.For loop is sometimes soo slow. But still we do not see the reason for the drop in Execution performance.



But when I look at L3 Cache misses it becomes clear that one core seems to get into a death spiral where he consistently losses for several seconds to update this value which seems to distort the parallel execution of instructions because the data needs to be read from the far slower main memory many times while synchronizing the CPU caches.



I am not sure how to fix it since it should not happen at all and it will hopefully be fixed by a future CPU version. It makes no sense to emit extra code for such corner conditions. This example makes a nice demo what memory contention can do to your application performance. If you have no contention any lock free version will be fast. But if you have many concurrent accesses to the same memory locations you can get much longer running times while at the same time you waste enormous amounts of CPU cycles. Does this matter? Well yes it does if you use e.g. the lock keyword in .NET which internally first spins a bit in the hope that the lock will be released fast. If not it will take a real lock and stop wasting CPU cycles. If you have e.g. 5 threads updating some central variable and you use the lock keyword you will find many burned CPU cycles only to increment a counter. The Interlocked methods offer good alternatives to using locks if you have simple tasks. The interlocked methods are nice but they JIT compiler will not replace your code directly with the corresponding CPU code but you pay with an extra method call which can cost you in the worst cases a factor 10 in performance because you need to swap out quite a lot of data from registers onto the thread stack.

I really like the pcm tool from Intel because it gives me the ability to measure things. Previously I had to guess why this or that was slowing down my application. Now I can prove it. Here is some sample output of it so you can check if the tool might be useful to you as well.

EXEC  : instructions per nominal CPU cycle
IPC   : instructions per CPU cycle
FREQ  : relation to nominal CPU frequency='unhalted clock ticks'/'invariant timer ticks' (includes Intel Turbo Boost)
AFREQ : relation to nominal CPU frequency while in active state (not in power-saving C state)='unhalted clock ticks'/'invariant timer ticks while in C0-state
L3MISS: L3 cache misses
L2MISS: L2 cache misses (including other core's L2 cache *hits*)
L3HIT : L3 cache hit ratio (0.00-1.00)
L2HIT : L2 cache hit ratio (0.00-1.00)
L3CLK : ratio of CPU cycles lost due to L3 cache misses (0.00-1.00), in some cases could be >1.0 due to a higher memory latency
L2CLK : ratio of CPU cycles lost due to missing L2 cache but still hitting L3 cache (0.00-1.00)
READ  : bytes read from memory controller (in GBytes)
WRITE : bytes written to memory controller (in GBytes)
IO    : bytes read/written due to IO requests to memory controller (in GBytes); this may be an over estimate due to same-cache-line partial requests
TEMP  : Temperature reading in 1 degree Celsius relative to the TjMax temperature (thermal headroom): 0 corresponds to the max temperature

Core (SKT) | EXEC | IPC  | FREQ  | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3CLK | L2CLK  | READ  | WRITE |  IO   | TEMP

   0    0     0.08   1.05   0.08    0.31     355 K   1013 K    0.65    0.32    0.21    0.08     N/A     N/A     N/A     66
   1    0     0.00   0.20   0.01    0.31     111 K    226 K    0.51    0.17    0.53    0.13     N/A     N/A     N/A     66
   2    0     0.01   0.27   0.02    0.30     193 K    517 K    0.63    0.19    0.35    0.15     N/A     N/A     N/A     68
   3    0     0.08   1.00   0.08    0.53     706 K   1838 K    0.62    0.50    0.38    0.17     N/A     N/A     N/A     68
   4    0     0.01   0.38   0.03    0.32     247 K    667 K    0.63    0.22    0.32    0.13     N/A     N/A     N/A     66
   5    0     0.01   0.56   0.02    0.40     177 K    479 K    0.63    0.26    0.37    0.15     N/A     N/A     N/A     66
   6    0     0.01   0.35   0.03    0.32     235 K    500 K    0.53    0.24    0.37    0.10     N/A     N/A     N/A     70
   7    0     0.02   0.43   0.05    0.57     116 K    807 K    0.86    0.41    0.11    0.21     N/A     N/A     N/A     70
SKT    0     0.03   0.70   0.04    0.38    2145 K   6051 K    0.65    0.37    0.29    0.14     N/A     N/A    0.00     63
TOTAL  *     0.03   0.70   0.04    0.38    2145 K   6051 K    0.65    0.37    0.29    0.14     N/A     N/A    0.00     N/A

Instructions retired:  917 M ; Active cycles: 1318 M ; Time (TSC): 3986 Mticks ; C0 (active,non-halted) core residency: 10.81 %

C1 core residency: 15.87 %; C3 core residency: 0.96 %; C6 core residency: 0.00 %; C7 core residency: 72.36 %;
C2 package residency: 0.00 %; C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7 package residency: 0.00 %;

PHYSICAL CORE IPC                 : 1.39 => corresponds to 34.79 % utilization for cores in active state
Instructions per nominal CPU cycle: 0.06 => corresponds to 1.44 % core utilization over time interval

Monday, June 30, 2014 #

When you want to schedule some not so important work you can have the option to lower the thread priority of a thread or you can change the base priority of the whole process. That can help to keep Windows responsive while CPU intensive tasks (e.g. some processes with Parallel.For instances) can use all cores if nothing else is going on. For that purpose you can change the process priority via SetPriorityClass or if you are using .NET you can use the Process.PriorityClass property to get and set it. Please note that these methods have nothing to do with classes and object oriented programming at all. The xxxx_Class values define the base priority for all threads in a process. This value can be changed at runtime as well via e.g. the task manager. The thread scheduler of Windows uses the base priority of your thread and "adds" your thread priority to your thread base priority which is then the effective thread priority. Thread priorities in Windows have can have values from 0-31 but the many APIs to influence the effective thread priorities make things more indirect. Below is a table that shows the effect of setting the process class priority (aka class) and the thread priority:


What is interesting that if you set a process to Below Normal it still can interfere with threads from a Normal process if it has as thread priority Highest set. A Normal thread in a Normal process gets as thread base priority the value 8 which is the same as a Below Normal process with a Highest thread priority set. Which thread in a managed application runs with Highest priority? Yes it is the finalizer. If you have much work in your finalizers in Below Normal processes then they compete with your Normal processes for CPU. This might catch you by surprise if you have thought that Below Normal processes do not interfere with Normal processes.

There is an even more severe way how threads of lower prio processes can compete with your foreground tasks. Windows has a nice service to keep your interactive application running by boosting its priority so your Windows which just receives the keyboard input is responsive. This is called priority boosting. Why should I care about priority boosts in a server application? Well there are actually three reasons why Windows can decide to boost your thread priority (copied from MSDN):

  • When a process that uses NORMAL_PRIORITY_CLASS is brought to the foreground, the scheduler boosts the priority class of the process associated with the foreground window, so that it is greater than or equal to the priority class of any background processes. The priority class returns to its original setting when the process is no longer in the foreground.
  • When a window receives input, such as timer messages, mouse messages, or keyboard input, the scheduler boosts the priority of the thread that owns the window.
  • When the wait conditions for a blocked thread are satisfied, the scheduler boosts the priority of the thread. For example, when a wait operation associated with disk or keyboard I/O finishes, the thread receives a priority boost.

    You can disable the priority-boosting feature by calling the SetProcessPriorityBoost or SetThreadPriorityBoost function. To determine whether this feature has been disabled, call the GetProcessPriorityBoost or GetThreadPriorityBoost function.

I have marked the relevant sentence bold: If your server application has processes which are waiting for something their thread priority is also increased. The priority boost can add up to four (tested on my Win 8.1 x64 Box) to your current thread priority which can make your Lowest priority thread effectively a Highest priority thread!

Now we have all factors together to calculate the effective thread priority:


The effective thread priority is calculated by the process base priority and the current thread priority and an optional thread boost value. Although I have used the + operator in the formula it is not s simple addition of some values. Now you know how to read the thread priority and dynamic priority in process explorer. Base Priority is the thread priority when the process priority and the thread priority have been "added" without any priority boost.


Dynamic priority is then the effective boosted thread priority value the OS scheduler works with. Now we can check our knowledge with a my little GCAllocations tester from a previous post. As a normal process it should work with thread of priority 8 most of the time:


As it turns out this is not the case. Due to the many context switches (waits) the thread priorities were boosted to 9 and 10 mainly. This effect is also there if the console application is not a foreground window or has keyboard input. This is only due to the fact that the OS boosted your thread priorities because it thought it would be a good idea to give you some time to work after you were woken up. Can we prevent this somehow? Yes there is an API SetProcessPriorityBoost when you call it the second parameter as 1 then you disable the priority boost feature. Or you use

Process.GetCurrentProcess().PriorityBoostEnabled = false;


Now that looks more like what we would expect from a well behaved process. So should we turn off Priority Boosting and do we gain anything from it? To check this I have run GCAllocations

  • Single Instance
  • Two Instances which are bound to four cores where four threads allocate in parallel. One process has Normal prio the other BelowNormal.
    • One time with Priority Boosting enabled
    • Second time with Priority Boosting disabled.

I have measured the duration of each instance how long it did need to execute, how much CPU it used and the Ready Times. Below are the results:


What can we learn from these graphs? First if no contention on the Cores is happening we are fastest with 2,7s for the Single Instance which could run undisturbed on all four cores. When there is contention on the four cores (two instances are running at the same time) things become slower by ca. 20%. The normal priority process and the one with Priority Boost enabled show no difference in run time there. But the BelowNormal process with Priority Boosting enabled gains ca. 7% in run time which is sort of unexpected. The ready times are quite high with 6,5 and 6,3s for these processes but since 4 threads are running at the same time we have to divide the process ready time by the number of active thread to get an estimation of the increased run time. This is is shown in the Projected Duration graph where to the Single Instance Duration the ReadyTime/4 is added and graphed for each process. This approximation is not perfect either because it does neglect the CPU usage of each process which differs quite a lot. It is interesting that the process with BelowNormal had the lowest total CPU consumption of 5,8s. How can this be? I did check the Sample Profiler where the CPU consumption does differ. It turns out that with Priority Boosting enabled the page faults from the Windows memory manager are much less costly. 


This is a nice bonus for boosting the thread priority. In this sample application priority boosting had no negative effect on the foreground thread but the background activity did become 7% faster. From these experiments I would say it is safe to leave CPU Priority boosting enabled. That's enough for thread and process priorities for now. There is another priority which was introduced in Windows Vista which is IO priority. There are no direct APIs to influence the IO priority except if you call SetThreadPriority:

    enum ThreadPriority
        THREAD_MODE_BACKGROUND_BEGIN = 0x00010000,
        THREAD_MODE_BACKGROUND_END = 0x00020000,

    [DllImport("kernel32.dll", SetLastError = true)]
    extern static int SetThreadPriority(IntPtr hThread, ThreadPriority nPriority);

     // Will set in a normal process the Thread Priority to 4 and the IO Priority to Very Low
     SetThreadPriority(GetCurrentThread(), ThreadPriority.THREAD_MODE_BACKGROUND_BEGIN);
     // Do Low Priority Work and do not forget to reset it when you want 
     // normal scheduling
     SetThreadPriority(GetCurrentThread(), ThreadPriority.THREAD_MODE_BACKGROUND_END);


That can help to leave the hard disc spinning while you are indexing some data or running some low priority background task. There are actually 5 levels of IO priorities which can only be set for your thread by calling the undocumented function NtSetInformationProcess. You can set the IO priority from a normal process only to 0-2 but not higher than normal.

    [DllImport("kernel32.dll", SetLastError = true)]
    extern static IntPtr GetCurrentProcess();

    [DllImport("ntdll.dll", SetLastError = true)]
    public static extern int NtSetInformationProcess(IntPtr processHandle,
       PROCESS_INFORMATION_CLASS processInformationClass, ref IntPtr processInformation, uint processInformationLength);

    enum IOPriority : uint
        VeryLow = 0,
        Low = 1,
        Normal = 2,

    static int SetIOPriority(IntPtr hProcess, IOPriority newPrio)
        IntPtr ioPrio = (IntPtr) newPrio;
        int lret= NtSetInformationProcess(hProcess, PROCESS_INFORMATION_CLASS.ProcessIoPriority, ref ioPrio, 4);
        return lret;

ProcessIOPriority is a constant with the value 21 if you want to try it out. There are not many use cases out there except if you want to penalize some processes which scan your hard disc for no real purpose except to make your IT department happy. In that case it could help to give such processes a lower IO and Process priority to teach them not to interfere with foreground processes.

If you design a server software which must stay responsive you must ensure that processes which require interactive performance are not held up by background jobs. It makes sense to ban these processes to lower process priorities to ensure that all cores can be utilized by your users if necessary. It is of no help if e.g. SQL Server uses all of your cores to index its database again because the interactive users processes will suffer from latency issues. Thread ready times can become arbitrary high if you are in a CPU oversubscription scenario. In that case all Normal processes should belong to interactive users. This way you get a scalable server which can support up to N concurrent users with varying background load. If latency issues should show up then the only valid reason is that too many users are logged onto your server. But it should never happen that background activities can disturb interactive tasks.

It is important to note that all priorities are used by the OS inside priority queues which are worked from the top down to the lowest priorities with some mechanisms to prevent starvation. But these priorities have absolutely no effect if you have e.g. no contention on a CPU. When only your task sits in a priority queue which is empty you will be served immediately. To see these priorities in action you need to load your system with >=100% CPU intensive tasks to watch how things are sorted out. SQL Server will not become slower if the system load never reaches 100% and you have lowered its CPU priority. But it can help to mitigate spikes of SQL server which can influence interactive tasks if the system is heavily loaded.

Todays software is often written in a scalable way by using all cores that are present. This is nice for a single process using all cores and no other activity. But if you have a system where more than one thing needs to run concurrently such "scalable" software is a recipe for disaster. You can try to run some batch scripts in parallel which use a multi core enabled compression algorithm and watch your system. It will not respond to user input at all anymore. Some time ago a colleague wanted to speed up a batch script by starting a tool 100 times concurrently. I told him before he pressed enter to start it that script could render the system unresponsive. After letting it run for 10 minutes and seeing no progress he pressed the power button. When he would have started the script with Idle process priority he might have had a chance to kill the offending processes.

If you start batch scripts parallel to your normal workload you should try out

start /LOW xxxx.cmd

to start it with Idle process priority. That can make a huge difference when you sit in front of the machine and you want to continue to work. You can even restrict it to run only on some cores with the /AFFINITY mask of 0xF to run your batch script only on the first four cores.

If you want to check out thread IO priorities you cannot count on Process Explorer. These internals are only displayed by Process Hacker which is a Process Explorer clone with more options and less bugs. The best feature of it is certainly that it supports Ctrl+C in all dialogs which is a big plus if you want to get data out the UI.

Wednesday, April 30, 2014 #

On my home machine VS2013 starts up in 2s from my SSD. But on another Windows 7 machine it takes over 13s. I never bothered really since I thought that some plugins and connecting to TFS simply takes its time. It stayed and itch until I installed Resharper 8 were things noticeably got slower. No big deal. Suspend Resharper and it will play dead. It works except that the startup time of VS still looked bad. Time for another profiling session with WPT. The approach is as simple as starting up WPRUI


  1. Press Start.
  2. Start Visual Studio.
  3. Wait until it is has started.
  4. Press "Save" button to stop profiling.
  5. Enter a file name to which the data should be saved
  6. Be patient while data is saved.
  7. Press "Open in WPA" button to analyze the data.

When I look at the CPU consumption until the VS 2013 is usable it takes 13,1s to start up while consuming 9,6s of CPU time. The big gap in the middle looks like it needs to connect to the network. Lets ignore that for now.


To see what the CPU was doing I need sample profiling (CPU Usage Sampled). Many people are confused by the fact that WPT shows you two CPU usage graphs with nearly the same content. The graphs (should) look the same but the underlying data to get the graphs is fundamentally different.

CPU Usage Sampled

This graph comes to existence by parsing CPU sampling ETW events. When sample profiling is enabled the Windows Kernel will log the call stacks of all running threads every 1ms. The sample rate is configurable from seconds between a sample and up to 8kHz (every 0,125ms see xperf -SetProfInt for more infos). Sample profiling gives you an accurate view what your application was doing with the CPU. When you look into the call stacks you should have at least 10K of CPU sample events in your view selected. Then you can check the call stacks in your application down to stacks with at least 100 Samples which represents an error of 1% which is usually enough. If you want to optimize a method which shows up in the sample profiling with only 10 samples you are chasing ghosts.

CPU Usage Precise

Although the graphs look identical you have a lot more columns (54!) to choose from. This time you are not looking at sampled call stacks but at a parsed view of Context Switch ETW events. A context switch occurs when your thread stops or starts running on a CPU. A thread is switched in (NewThreadxxxx columns) and another thread stopped running (Oldxxxxxx columns) while a third thread has enabled your thread to run by releasing a lock, signaling an event, timer expiration and a few other reasons. The thread that allowed your thread to run is called Readying Thread/Process/Stack columns. This very fine granular view allows you to find out e.g. why your thread was sitting in ReadFile for 5s to read a 1KB file.

A more thorough explanation of context switch handling can be found in the WPA documentation for CPU Analysis. To make sense of the many columns you should memorize this picture which shows all important times of a context switch. The picture below is shamelessly copied from the WPA documentation.

Figure 10 CPU Usage Precise Diagram

The most important thing to remember is that with CPU Usage Precise you can find out

Now back to the original question. We are after a CPU consumption issue. What do we see in CPU Usage (Precise)? Only context switches. A context switch occurs when your thread stops running because it is waiting for something (WaitForSingleObject) or the thread quantum was used up. It is 15ms by default but it may be more. On my Win 8.1 x64 box it is actually 32ms. The reason why a context switch event is logged that your thread is waiting (not CPU bound) for something or your thread has already executed code for 15-32ms and has used up its allotted time slice. In effect you will see in the stacks of CPU Usage Precise all of your calls that did block for some reason but you will almost never see a high CPU consuming operation.

How do I know that? Because I have tested it with a small CPU consuming application:

class Program
    extern static int GetCurrentThreadId();

    static void Main(string[] args)
        Action acc = () =>
            Process proc = Process.GetCurrentProcess();
            var  currentThread = proc.Threads.Cast<ProcessThread>().Where(t => t.Id == GetCurrentThreadId()).Single();
            currentThread.ProcessorAffinity = (IntPtr)0x02; // Force thread affinity on core 2

            while (true)

        Parallel.Invoke(Enumerable.Repeat(acc, 2).ToArray());  // Execute two concurrent threads and force execution on one CPU

This little app does nothing but spawning two threads which run on the same CPU to give the thread scheduler some work. Before that it was a simple while(true); loop in the Main method which did consume CPU but after it was running it never caused and context switches for seconds. The first observation was that a context switch is only performed if another thread must run on this CPU. If not there is no reason to bring it to a halt it can run happily forever. Internally the thread dispatcher still checks if the thread quantum did expire but since no other thread is scheduled for this CPU its quantum is renewed without a context switch. It turns out that you need to cause actual contention on a CPU to make the thread dispatcher do some actual work. The very busy while loops show up in the profiler like this:


There we see that on my Win 8.1 machine when two threads compete for the same CPU

  • Every  thread is run again after 32ms (TimeSinceLast).
  • It runs for 32ms (NewInSwitchTime).
  • While it did wait for the CPU for 32ms (Ready).
  • It did spend no time in some WaitForSingleObject synchronization primitive (Wait).
  • The previous thread is put to sleep because it has used up its thread quantum (OldWaitReason with value WrQuantumEnd).
  • During the selected time of 1,62s 54 context switch events did happen which is nearly optimal 1,62/0,032=50 context switches with 32ms time slices (Count).
  • The CPU was maxed out. 12,5% is one core on my 8 core machine (% CPU Usage).

With context switch events you can find out pretty much anything you ever wanted to know why your thread was not running. Thread Scheduling is explained in some free EBooks by a Windows Kernel Developer named David Probert. He was written quite a lot about the internal mechanics of a context switch. If you want to know more about the possible values of OldWaitReason you will find some kernel defines but no explanation. The best explanation for some values I have found in the .NET documentation for the ThreadWaitReason enum. In reality you will find for OldWaitReason almost always the value Executive which means that there thread was waiting inside a WaitForSingleObject and has given up its current quantum.

Below is a picture of a normal Visual Studio startup which prints the time gaps in us of the main thread when it had to wait wait for something. With wait I mean waiting for the CPU to become ready to run the thread (high Ready times) or it was waiting on something to complete or simply the next window message. As you can see we have a pretty busy message pump which has 3402 context switches where the total time gap was 450ms while no code did run on the main UI thread. Since there are so many of these small gaps it makes not much sense to search for useful work on the UI thread there. Instead you should check out single event where e.g. for 145ms nothing had happened. Perhaps there is room for improvement there. I did not see TimeSinceLast used often in in performance analysis sessions so far but it is definitely useful to check out where the largest execution gaps of a supposed to be busy thread are lurking because it is basically the sum of wait times and ready times which catches both thread scheduling issues and wait times for the completion of async operations.


After showing off so many details of CPU Usage (Precise) I need to come back (again) to our CPU consumption issue.  With CPU Usage Precise we would see at most one call stack every 15ms-30ms because the thread was switched out because the thread quantum expired. This is usually a lousy sample rate which is polluted by the wait call stacks. In theory it should be possible to filter for WrQuantumEnd switch out reasons to get the same data over a longer period of time like with sampling. In practice this does not work since the the VS Studio startup example not a single thread ever is switched out because its quantum has expired. The table below shows a typical thread switch out distribution. Executive means that the thread did enter a WaitForSingleObject which is nearly the only reason why a thread is switched out.

OldWaitReason Count
Executive 4377
WrQueue 100
WrDispatchInt 26
UserRequest 10
WrYieldExecution 8
WrLpcReceive 8
WrUserRequest 2
WrLpcReply 1
WrResource 1

After all of the cool context switch stuff we must downgrade to the second CPU view named CPU Usage (Sampled) since this is still the only view where we can reliable track down the call stacks which consume most CPU.


I ignore threads for now to view an accumulated call stack of CPU usage across all threads in Visual Studio. There it is immediately obvious that Resharper is using a lot of CPU while VS is starting up although it is suspended. I am not sure why in version 8 of their product they still think that startup time is not important and they seem to fully initialize their stuff on the main UI thread. To get a more complete picture I can select an expensive method and group it by function to see who is calling it and how much CPU is consumed down the call chain.


Interesting: Resharper has its own brand of MEF embedded which does the initialization. They also seem to use concepts of TPL DataFlow to trigger their components from multiple threads. That certainly looks sophisticated but the CPU graph shows that very little multi threading is happening during startup. On my 8 core machine very little action goes above single core usage. I do not know why they need to initialize so much of their stuff on the UI thread only to add some menus to Visual Studio while being suspended.

Looks like we have a suspect. In case of doubt lets uninstall Resharper 8 and check again. 


The CPU consumption has dropped from 9,6s down to 2,8s. Startup time has dropped to 8,2s.

Time in s With Resharper 8 Without Resharper Difference
CPU Consumption 9,6 2,8 6,8
Startup Time 13,1 8,2 4,9

We have huge savings in terms of CPU consumption here which almost 1:1 translates to reduced startup time. The Dataflow stuff of Resharper is able during the startup of VS to move 14% of the CPU work onto background threads which helps a little bit but still the amount of consumed CPU is enormous for a disabled plugin. To be fair on my Windows 8.1 box with an SSD the startup time of VS 2013 with Resharper is only 3,1s vs 1,9s without it. On good dev machines this is not a big deal but on slower machines a disabled Resharper is quite noticeable. That's it for the CPU intensive stuff. Now why lets check out the flat lines in CPU consumption.

What is causing such long delays during the startup of VS? It looks like the main thread is blocked for a long time. After reading so far your natural reaction must be: Open CPU Usage Precise and check out the Wait and Ready times! In this case the UI thread was blocked by something not very complex:


While the visual studio is starting up it tries to initialize Direct3D as part of a WPF application. Then Nvidia kicks in (nvd3dum.dll) which calls into some nvSCPAPI.dll which seems to poll for something over 5,4s. I have no idea why this is happening but this feels very wrong. The machine in question has some NVidia Quadro card installed and only NVidia knows why on some machines (not all machines exhibit this behavior) support for stereoscopic gaming is tried to enable. Since NVidia does not release pdbs to their drivers I cannot dig any deeper here but this is a major issue which hits all WPF enabled applications running on NVidia boards under still unknown circumstances. To prove that that the nvSCPAPI.dll was to blame I simply renamed the folder where the dll is located

C:\Program Files (x86)\NVIDIA Corporation\3D Vision

to something different. Now the startup time is down to 2,6s which feels really good now. I kept the renamed folder since I do not have the need for stereoscopic gaming.


It is nice to see that WPT can help in these simple cases to find the root issues within minutes. I hope that the Jetbrains can make their good product still better by reducing the startup time more when it is suspended and that NVidia fix their drivers to prevent searching in vain for not installed hardware on some machines. When Nvidia fixes this issue it could help to convince some people that it is not WPF and .NET which is slow but that after their graphics card driver update some apps start much faster than before.

Friday, April 4, 2014 #

Finally after long years of stagnation the JIT compiler got significant attention to support SIMD instructions. As an experienced .NET Developer your might think SIMD? What? Never heard of! Don´t worry: You will hear the same from most experienced C/C++ developers. SIMD is more a geek thing for performance addicts which stands for Single Instruction Multiple Data.

SIMD support means that the compiler (Intel, MSVC++ and now .NET JIT) can emit assembly instructions that take advantage of CPU features which can execute one operation (e.g. add) on multiple inputs at once. Why should I care? My Core I7 has 4 physical cores (8 with hyper threading). Now lets add to the mandelbrot sample SIMD support.


Multithreading stays important but when you can get with a single core a speedup of 3,4 simply by employing SIMD. That is something you cannot ignore. If your server today needs 40 cores and you can cut the number of cores down by a factor 3,4. You can spend the saved money on other things than hardware (e.g. paying developers to program SIMD enabled code Zwinkerndes Smiley). This is all nice and well but how fast can it get? When I look at the generated assembler code

d:\Source\SIMDSample\C#\Mandelbrot\VectorFloatStrict.cs @ 162:
00007ffd`e99d2135 450f58f4        addps   xmm14,xmm12
00007ffd`e99d2139 4183461004      add     dword ptr [r14+10h],4
00007ffd`e99d213e 410f28c6        movaps  xmm0,xmm14
00007ffd`e99d2142 410f28c8        movaps  xmm1,xmm8
00007ffd`e99d2146 660fefd2        pxor    xmm2,xmm2
00007ffd`e99d214a 0f28d0          movaps  xmm2,xmm0
00007ffd`e99d214d 0fc2d102        cmpleps xmm2,xmm1
00007ffd`e99d2151 660f76c0        pcmpeqd xmm0,xmm0
00007ffd`e99d2155 0f28ca          movaps  xmm1,xmm2
00007ffd`e99d2158 660f76c8        pcmpeqd xmm1,xmm0
00007ffd`e99d215c 660f70d94e      pshufd  xmm3,xmm1,4Eh

I see these xmm registers. These are SSE 128 bit wide registers which are the foundations for the current SIMD support of the JIT compiler. But this is still not the end. My Haswell CPU supports AVX2 which can cope with 256 bit wide registers which should give me a speedup of 8. AVX-512 will arrive in some time giving us 512 bit wide registers. We can achieve then speedup a single thread up to factor of 16! In theory we should be able to get this graph with an AVX-512 enabled CPU and a perfect .NET JIT compiler


For some problem domains it can happen that SIMD support may be more important than using more threads. You can cope with heavy load on a single core without the need to use many threads. That can help servers which are shared by many users a lot. Here are the facts how to get started with SIMD in .NET:

  • .NET uses SIMD automatically for my application?

No. You need to use a specialized SIMD library which offers SIMD data types which can be translated by the JIT compiler to SIMD instructions on CPUs supporting it.

  • Can I use the normal .NET Framework?

No. You need to download and install the new RyuJIT CTP3

  • Does it work on Debug Builds?

No. SIMD instructions are only used for Release builds with RyuJIT enabled (COMPLUS_AltJit=*) and SIMD instructions enabled (COMPLUS_FeatureSIMD=1) for this CTP on Windows 8 and above for x64 applications only.

  • How looks SIMD vs non SIMD code?

Here is the "normal" Modelbrot core method


        // Render the fractal with no data type abstraction on a single thread with scalar floats
        public void RenderSingleThreadedNoADT(float xmin, float xmax, float ymin, float ymax, float step)
            int yp = 0;
            for (float y = ymin; y < ymax && !Abort; y += step, yp++)
                int xp = 0;
                for (float x = xmin; x < xmax; x += step, xp++)
                    float accumx = x;
                    float accumy = y;
                    int iters = 0;
                    float sqabs = 0f;
                        float naccumx = accumx * accumx - accumy * accumy;
                        float naccumy = 2.0f * accumx * accumy;
                        accumx = naccumx + x;
                        accumy = naccumy + y;
                        sqabs = accumx * accumx + accumy * accumy;
                    } while (sqabs < limit && iters < max_iters);
                    DrawPixel(xp, yp, iters);

And here is the SIMD version. Not very different but the performance differs a lot. If you use double types the benefit will be smaller (factor 2) because double uses 8 bytes of memory whereas single precision (float) values use up only 4 bytes. You might ask what is this version with the strange name NoADT? It means No Abstract Data Types. The other version uses its own Complex data type which is a lot slower. This should be improved in future CTPs either by getting our BCL enabled Complex data type or by optimizing custom data types which contain SIMD types in general.


        // Render the fractal on a single thread using raw Vector<float> data types
        // For a well commented version, go see VectorFloatRenderer.RenderSingleThreadedWithADT in VectorFloat.cs
        public void RenderSingleThreadedNoADT(float xmin, float xmax, float ymin, float ymax, float step)
            Vector<float> vmax_iters = new Vector<float>(max_iters);
            Vector<float> vlimit = new Vector<float>(limit);
            Vector<float> vstep = new Vector<float>(step);
            Vector<float> vxmax = new Vector<float>(xmax);
            Vector<float> vinc = new Vector<float>((float)Vector<float>.Length * step);
            Vector<float> vxmin = VectorHelper.Create(i => xmin + step * i);

            float y = ymin;
            int yp = 0;
            for (Vector<float> vy = new Vector<float>(ymin); y <= ymax && !Abort; vy += vstep, y += step, yp++)
                int xp = 0;
                for (Vector<float> vx = vxmin; Vector.LessThanOrEqualAll(vx, vxmax); vx += vinc, xp += Vector<int>.Length)
                    Vector<float> accumx = vx;
                    Vector<float> accumy = vy;

                    Vector<float> viters = Vector<float>.Zero;
                    Vector<float> increment = Vector<float>.One;
                        Vector<float> naccumx = accumx * accumx - accumy * accumy;
                        Vector<float> naccumy = accumx * accumy + accumx * accumy;
                        accumx = naccumx + vx;
                        accumy = naccumy + vy;
                        viters += increment;
                        Vector<float> sqabs = accumx * accumx + accumy * accumy;
                        Vector<float> vCond = Vector.LessThanOrEqual<float>(sqabs, vlimit) &
                            Vector.LessThanOrEqual<float>(viters, vmax_iters);
                        increment = increment & vCond;
                    } while (increment != Vector<float>.Zero);

                    viters.ForEach((iter, elemNum) => DrawPixel(xp + elemNum, yp, (int)iter));

This is a long awaited step into the right direction of the JIT compiler. Finally you can write performance sensitive code without much hassle in .NET as well. Still the JIT compiler could become smarter for many common code constructs to emit more efficient code.

Monday, March 24, 2014 #

.NET is like Java a very productive environment. The increased convenience comes at the cost of less control over your execution environment. Especially the burden of manual memory management is much less of a problem in managed environments. Garbage collection does not prevent memory leaks (e.g. delete pData) but makes it considerable harder to introduce ones (statics, timers, forgotten event unregistrations and others still remain). A Garbage collector is a very nice thing but it can get into your way if you want to provide a responsive application. When your customers complain that the UI is frozen or that the UI redraw causes flickering all over it you need to take a deeper look under the .NET covers.

How does Garbage Collection affect application performance?

As the name collection suggests it needs to to something. You can expect increased CPU usage while a garbage collection is running in the background. .NET 4.5 has made significant progress to perform more tasks during a garbage collection on dedicated garbage collection threads. CPU usage is normally not a problem as it does not eat up all of your cores. The prime reason why people complain about garbage collection is that their application is stopped while a full garbage collection is running. This behavior is called stop world collector where the garbage collector stops all application threads, looks at all objects, finds the garbage and compacts memory to prevent memory fragmentation. For the last step it is very important that the application does not run since the address of the objects can change. It would be a disaster if your application would access a memory location that contains data of completely unrelated objects because the address of your object has changed but the garbage collector did forget to update your pointer while your thread was frozen. This process includes not only addresses on the managed heap but also in the thread stacks and CPU registers. Every object address your application has access to needs to be updated by the GC somehow. The first question you might ask: Is there a way around that the GC must stop my threads to be able to update object addresses? The answer is yes there are garbage collectors out there that do not stop all application threads like the Azul C4. It can collect heaps of hundreds of GB in size with maximal delays of less than 10ms. One secret ingredient to make this possible seems to be a kernel driver that locks portions of memory which are compacted to prevent writes to it. I am not sure how they handle reads but the docs talk about Loaded Value Barriers to ensure that object references are always valid. This looks like some overhead for every object reference access which can be substantial. Besides this it has given up the concept of generations for a fully asynchronous garbage collector.

Without having measured it I would conclude from the docs that you pay for low latency with slower object reference access and slower memory reclamation which increases the working set of your application. Every solution to manual memory management comes with its own set of drawbacks:

  • Manual Memory Management (e.g. C/C++)
    • High Complexity
    • Low CPU Consumption
    • Low Memory Consumption
    • Low Latency


  • Garbage Collected Environment (e.g. Java, .NET)
    • Lower complexity
    • Increased CPU Consumption
    • Increased Memory Consumption
    • Potentially High Latency

Is this conclusion correct? Only under the assumption that you need to allocate memory. What is the most efficient way to allocate memory? No allocation at all. The secret to real time performance in any environment is to make the hot path of your application allocation free. When you need hard real time (e.g. avionics) not only allocations are banned but also exceptions because you cannot predict how long exception handling will take. Very few systems need to cope with hard real time. For the 99% part of all software you can live with small millisecond pauses even if it means that your real time trading system will issue a million dollar order after your competitors. You can do this in C/C++ but there are also ways to become (mostly) allocation free in .NET.

In .NET and Java object pooling is an easy way to prevent future object allocations. This works to some extent but you cannot pool data types like strings which are immutable and can therefore not be changed later. Perhaps we will get some day mutable strings and UTF-8 strings to keep the memory footprint low and the throughput high. Besides becoming allocation free you can also try to allocate less objects (profile your app) or more value types to reduce the memory footprint which results in smaller heaps which can be collected faster. In .NET are also different GC flavors available (server and desktop). The default is the desktop GC. Starting with .NET 4.5 the both GC flavors are concurrent and employ a background GC thread that scans for garbage while your application threads are still running which significantly reduces latency already.

I have seen an improvement in run time behavior of larger apps of ca. 15% only by switching from .NET 4.0 to .NET 4.5. A good portion of it can be attributed to the improved GC. Lets check how the GC behaves on my Windows 8.1 box with .NET 4.5.1 with a simple sample application. Here is the code

class Program
    static int N = 25 * 1000 * 1000;  // Execute allocations in a loop 25 million times
    static int Clear = 1000 * 1000;   // Clear list after every 1 million allocations to give GC 25 chances to clear things up

    static void Main(string[] args)
        // do some warmup

        var sw = Stopwatch.StartNew();
        ExecuteParallel(() => AllocateRefContainer(), "AllocateRefContainer");
        Console.WriteLine("RefContainer Allocation {0:F2}s, {1:N0} Allocs/s", sw.Elapsed.TotalSeconds, N / sw.Elapsed.TotalSeconds);

        sw = Stopwatch.StartNew();
        ExecuteParallel(() => AllocateValueContainer(), "AllocateValueContainer");
        Console.WriteLine("ValueContainer Allocation {0:F2}s, {1:N0} Allocs/s", sw.Elapsed.TotalSeconds, N / sw.Elapsed.TotalSeconds);

        sw = Stopwatch.StartNew();
        ExecuteParallel(() => AllocateValueTypes(), "AllocateValueTypes");
        Console.WriteLine("ValueType Allocation {0:F2}s, {1:N0} Allocs/s", sw.Elapsed.TotalSeconds, N / sw.Elapsed.TotalSeconds);

        sw = Stopwatch.StartNew();
        ExecuteParallel(() => PooledRefContainer(), "PooledRefContainer");
        Console.WriteLine("PooledRefContainer Allocation {0:F2}s, {1:N0} Allocs/s", sw.Elapsed.TotalSeconds, N / sw.Elapsed.TotalSeconds);
        Console.WriteLine("Private Bytes: {0:N0} MB", Process.GetCurrentProcess().PrivateMemorySize64 / (1024 * 1024));

    class ReferenceContainer // Class with one object reference
        public ReferenceContainer Obj;

    static ThreadLocal<List<ReferenceContainer>> RContainer = new ThreadLocal<List<ReferenceContainer>>(() => new List<ReferenceContainer>());
    static void AllocateRefContainer()
        var container = RContainer.Value;
        for(int i=0; i< N;i++)
            container.Add(new ReferenceContainer());
            if (i % Clear == 0)

    class ValueContainer // Class with one pointer sized value type
        IntPtr Obj;
    static ThreadLocal<List<ValueContainer>> VContainer = new ThreadLocal<List<ValueContainer>>(() => new List<ValueContainer>());
    static void AllocateValueContainer()
        var container = VContainer.Value;
        for (int i = 0; i < N; i++)
            container.Add(new ValueContainer());
            if (i % Clear == 0)

    // The object overhead is 8 or 16 bytes on x86 x64
    // We add the payload to it to stay close to be equal in size to the others to get a true comparison when 
    // the actual size which needs to be moved is equal.
    struct ValueType
        public IntPtr SyncBlock;
        public IntPtr MT;
        public IntPtr Payload;

    static ThreadLocal<List<ValueType>> ValueTypeContainer = new ThreadLocal<List<ValueType>>(() => new List<ValueType>());
    static void AllocateValueTypes()
        var container = ValueTypeContainer.Value;
        for (int i = 0; i < N; i++)
            container.Add(new ValueType());
            if (i % Clear == 0)

    // Uses Reference Container but allocates the objects only once and does not clear the array but
    // overwrites the contents 
    static void PooledRefContainer()  
        bool bAllocate = true;
        var container = RContainer.Value;
        for (int i = 0; i < N; i++)
            if (bAllocate)
                container.Add(new ReferenceContainer());
                var tmp = container[i % container.Count]; // Grab on item from array inside the array bounds
                tmp.Obj = null; // set new value 

            if (i % Clear == 0)
                bAllocate = false;

    // Simulate some CPU only calculation
    static void Calculate()
        long lret = 0;
        for (int i = 0; i <  100; i++)

    // Execute a scenario on 4 different threads
    // The strange Directory.GetFiles calls are there track to track the start and stop of one scenario when File access ETW tracing is enabled 
    static void ExecuteParallel(Action acc, string scenario)
        Directory.GetFiles("C:\\",String.Format("ETW ExecuteParallel Start {0}", scenario));
        Parallel.Invoke(Enumerable.Repeat(acc, 4).ToArray());
        Directory.GetFiles("C:\\", String.Format("ETW ExecuteParallel Stop {0}", scenario));

The tests were performed on this machine:


We do basically allocate 100 million objects on 4 threads concurrently with various object designs.

  • Class with one object reference
  • Class with one pointer sized value
  • Struct of the same size like the class
  • Pooled class with one object reference

To enable the server GC you need to add to your app.config

<?xml version="1.0" encoding="utf-8" ?>
    <gcServer enabled="true"/>

GC Behaviors (Server, Desktop in x86/x64)

When you measure the timings you get some interesting graphs.


The first obvious difference is that x64 is faster than x86. For value types up to a factor 2. In the profiler we can see that the main differences comes from the relocate phase where 64 bit achieves much higher throughput. The numbers on the x axis are the number of samples taken by the kernel.


A net win of 30% only by switching from x86 to x64 is nice. But it comes at the cost of higher memory footprint (factor 2). When you look closer you see that the effect of server GC is much higher than 30%. The same method did run nearly 6 times faster (e.g. check out RefContainer x64 Default GC vs x64 Server GC)! As with every miracle there are also some hidden costs associated with that. In this case it is memory consumption. When server GC is enabled the CLR allocates a managed heap for each logical core. If you have a 40 core machine with many processes running things concurrently you quickly will run out of memory if you are not careful in your application design.

The main difference between workstation and server GC are

  • One GC Thread per core
  • One GC Heap per core

The allocations which did happen previously on one heap are now happening on 4 of them. Each heap is compacted independently by its own GC thread. In theory we should have 4 times less GCs since each heap could become as big as the original one and since we now have 4 GC threads we should be able to allocate 16 times faster. This is not the reality since the memory consumption has some limits which seems to be with this workload a factor 3. Besides this the GC must also cross check other heaps if the objects have dependencies between different heaps. If no GC Heap dependencies would be checked we should be 12 times faster but we are "only" 6 times faster which shows that the heap cross checks are not free but necessary.

The test suite did consume nearly three times more memory while running nearly 6 times faster. Enabling server GC can be a good thing to do if you have plenty of free memory and you do not allocate huge amounts of data on many different threads.



When you look at the CPU consumption with default GC enabled in x64 Release


and here the same scenario with server GC


The GC suspension times have dropped dramatically and at the same time the summed CPU consumption has dropped by 30%. Except for the memory consumption this is a big win in terms of GC induced pause times and overall application performance. But you need to be aware that your now very scalable app will spike with 100% CPU from time to time perform full GCs which can become a problem if you have other work waiting to be done which also needs all cores.

A very useful visualization is to select the GC Suspension events and then right click in the graph to enable highlighting of the current selection (the green bars). Now you can search in the other graphs for e.g. no CPU activity to find out why your application was doing nothing without ever looking at GC induced latency issues. This helps a lot to differentiate between GC issues and other problems in your application logic.

I have beefed up my original GC regions file with GC suspension visualization which now looks like this:


<?xml version='1.0' encoding='utf-8' standalone='yes'?>
<?Copyright (c) Microsoft Corporation. All rights reserved.?>
     <RegionRoot Guid="{d8d639a0-cf4c-45fb-976a-0000DEADBEEF}" Name="GC" FriendlyName="GC Activity">
              <Region Guid="{d8d639a1-cf4c-45fb-976a-100000000101}" Name="GCs" FriendlyName="GCs">
              <Region Guid="{d8d639a1-cf4c-45fb-976a-000000000001}" Name="GCStart" FriendlyName="GC Start">
                  <Event Provider="{e13c0d23-ccbc-4e12-931b-d9cc2eee27e4}" Id="1" Version="1"/>
                  <Event Provider="{e13c0d23-ccbc-4e12-931b-d9cc2eee27e4}" Id="2" Version="1"/>
                <Region Guid="{d8d639a2-cf4c-45fb-976a-000000000003}" Name="GCSuspends" FriendlyName="GC Suspensions">
                   <Region Guid="{d8d639a2-cf4c-45fb-976a-000000000002}" Name="GCSuspend" FriendlyName="GC Suspension">
                  <Event Provider="{e13c0d23-ccbc-4e12-931b-d9cc2eee27e4}" Id="9" Version="1"/>
                  <Event Provider="{e13c0d23-ccbc-4e12-931b-d9cc2eee27e4}" Id="3" Version="1"/>

To be able to record the GC events with the easy to use WPRUI tool I have created an extra profile


<?xml version="1.0" encoding="utf-8"?>
<WindowsPerformanceRecorder Version="1.0">
    <EventCollector Id="EventCollector_MyEventSource" Name="GC">
      <BufferSize Value="2048" />
      <Buffers Value="1024" />
    <EventProvider Id=".NETCommonLanguageRuntimeGCTracking" Name="e13c0d23-ccbc-4e12-931b-d9cc2eee27e4" >
     <!-- GC 0x1 and 0x8000 Exceptions -->
     <Keyword Value="8001"/>
    <Profile Id="GC.Verbose.File" Name="GC" Description=".NET GC Tracking"  Base="" LoggingMode="File" DetailLevel="Verbose">
      <Collectors Operation="Add">
        <SystemCollectorId Value="SystemCollector_WPRSystemCollectorInFile">
          <SystemProviderId Value="SystemProvider_Base"/>
        <EventCollectorId Value="EventCollector_MyEventSource">
            <EventProviderId Value=".NETCommonLanguageRuntimeGCTracking" />
            <EventProviderId Value="EventProvider_DotNetProvider" />
    <Profile Id="GC.Verbose.Memory" Name="GC" Description=".NET GC Tracking" Base="GC.Verbose.File" LoggingMode="Memory" DetailLevel="Verbose" />    

When you load this file into WPRUI you enable the necessary GC ETW events to track GC pause times with the region file above.


Armed with these tools and visualization capabilities it should be easy to see in your own application if enabling server GC does help at critical points of your application. You will not measure any difference if your scenario is IO or network bound and very few GCs are happening. But if your scenario is a GC heavy operation you can check out if these spots can be improved by simply switching a configuration flag in your App.config or if you need to rewrite your allocation strategy.

Did I forget anything? Yes. How do you navigate in an ETL file when your application has no ETW provider enabled? You only need to know that the kernel has already ETW provider for all file IO operations.  If you want to issue a ETW event at the start/stop of a scenario you are interested in it is sufficient to do a

Directory.GetFiles("C:\\",String.Format("ETW ExecuteParallel Start {0}", scenario));

to get with File IO tracing nice ETW events which you can search for. Searching for a not existing file is a quite fast operation which takes about 20us per call on my machine. The ETW collection overhead was within the usual differences that occur from measurement to measurement and not a visible effect at all. This is of course not always true. If you have a network intensive app running and you use PerfView you get also network correlation tracing enabled by default which adds quite some overhead to all socket operations. In the CPU profiling data networking operations look much more expensive than they really are. You need always to take a sharp look at your data if it looks reasonable.


When you are doing sampling profiling you should have a look at the article from Vance to be sure to have enough samples in your data to be sure that you are seeing a real effect and you are not chasing some ghost. I hope that I have convinced you that the CLR emits so much more useful data you were never be able to collect from simple performance counters which can give you only very limited insights.

Monday, February 17, 2014 #

When dealing with application responsiveness issues you often need to get data from the actual users to be sure what is going on. There are plenty of reasons why your application does not respond. Most prominent examples are

  • Network
    • A bad network connection can totally ruin the user experience. Wireshark is your best friend to check if you have a network issue at all.
  • Paging
    • If your application has a high memory footprint you will experience sudden slowdowns because the OS did silently page out unused memory. When you touch it again the memory needs to be read in from the page file again which is about 1000 times slower than plain memory access. A SSD can greatly improve the user experience. (Check performance counter Memory\Pages Input/sec).
  • Disc IO
    • If your app hangs because it is just busy writing 500MB to disc you should never do this on the UI thread although it is much easier to program.
  • Garbage Collection

After you have looked at network (we need that data), paging (we need so much memory), disc IO (we need to read/write so much data) you should also check if garbage collection is an issue. GC times can become quite high if you create a lot of temporary objects. Especially de/serialization is a source of pain for the GC. How do you find out if you have a GC issue? First of all you need to check the usual performance counters which do tell you how often the GC kicks in and how much CPU is burned by cleaning up your mess. If you stay low e.g. 10-20% then you do usually not need to worry much about it except if your application hangs too often during a UI interaction. In that case you need to dig deeper.

The best tool to check managed memory issues including GC induced latency is PerfView from Vance Morrison. When you uncheck everything except "GC Collect Only" you can let it run for a long time on any machine to check how many GCs you have.



When you have pressed Start/Stop Collection you get an etl file that contains a GCStats section:


To complete your analysis you need to double click GCStats to get a nice overview for each process how much memory it did allocate and how much GC pause times your application did experience.


If you want to drill deeper you can export the data to Excel and check every single GC event if you wish to. That is all nice but what if the customer does not want to install strange tools on their machines? In that case you still can get the data and analyze it on another machine. PerfView relies on the CLR ETW providers which can be enabled with the command line tool logman as well. Logman is part of Windows since XP. To check which data the CLR ETW provider can give you you can execute:

logman query providers ".NET Common Language Runtime"   

The most interesting keywords are

Keyword (Hex) Name Description
0x0000000000000001 GCKeyword GC
0x0000000000000004  FusionKeyword       
Binder (Log assembly loading attempts from various locations)
0x0000000000000008  LoaderKeyword        Loader (Assembly Load events)
0x0000000000008000      ExceptionKeyword Exception

The most prominent keyword is 0x1 which is GC. The other also very interesting keyword is Exception (0x8000) which logs every thrown exceptions in all managed applications. If you want to check for GC latency issues and all exceptions I have created a little script to automate this task:


@echo off
REM enable GC Events 0x1
REM Exception tracing is 0x8000
logman start clr -p ".NET Common Language Runtime" 0x1,0x8000 0x5 -ets -o "%temp%\logmanclr.etl" -ct perf -bs 1024 -nb 50 500 -max 500
REM Needed to decode clr events later
logman start "NT Kernel Logger" -p "Windows Kernel Trace"  (process,thread)  -ets -o "%temp%\logmanKernel.etl" -ct perf -bs 1024 -nb 50 500 -max 500
logman -stop clr -ets
logman -stop "NT Kernel Logger" -ets
if "%1" EQU "-Merge" (
xperf -merge "%temp%\logmanclr.etl"  "%temp%\logmanKernel.etl" "%temp%\merged.etl"

You can send your customer this script and tell him to start it and execute the use and then press any key to stop collecting data. The result are two files named logmanclr.etl and logmanKernel.etl which can be zipped and mailed back to you. On your machine you need to merge the two etl files to be able to load the data into PerfView. This is what the script does with the -Merge option on your machine if you have the Windows Performance Toolkit installed. The other logman options are there to prevent loosing events even if very many of them come in and to prevent filling up the hard disc by allowing for each file a maximum size of 500MB.

For GC relevant data PerfView is the best tool to analyze the captured data. If you are after exceptions you can also use WPA to check which exceptions were thrown while your and all other applications where running on this machine. This will give you only the exception type and exception message but no stacks because a very limited set of ETW providers is enabled which is well suited for long running tests. 


Since each garbage collection has a start and stop event which does roughly correlate to the stop times of your application you can also create a regions file to visualize the GC events. This makes it extremely simple to check if your app was hanging because some long GC was running which did (probably) suspend your thread which tried to allocate just more memory. When concurrent GC is enabled you can allocate data and even new GC segments (for Gen 0,1)  but not large objects or Gen 2. Most of the time GC time does correlate with app pause times.


Here is the regions file I came up with to make the graph above.


<?xml version='1.0' encoding='utf-8' standalone='yes'?>
<?Copyright (c) Microsoft Corporation. All rights reserved.?>
     <RegionRoot Guid="{d8d639a0-cf4c-45fb-976a-0000DEADBEEF}" Name="GC" FriendlyName="GC Times">
          <Region Guid="{d8d639a0-cf4c-45fb-976a-000000000001}" Name="GCStart" FriendlyName="GC Start">
              <Event Provider="{e13c0d23-ccbc-4e12-931b-d9cc2eee27e4}" Id="1" Version="0"/>
              <Event Provider="{e13c0d23-ccbc-4e12-931b-d9cc2eee27e4}" Id="2" Version="0"/>

          <Region Guid="{d8d639a0-cf4d-45fb-976a-000000000002}" Name="GCStart_V1" FriendlyName="GC">
              <Event Provider="{e13c0d23-ccbc-4e12-931b-d9cc2eee27e4}" Id="1" Version="1"/>
              <Event Provider="{e13c0d23-ccbc-4e12-931b-d9cc2eee27e4}" Id="2" Version="1"/>

           <Region Guid="{d8d639a0-cf4d-45fb-976a-000000000003}" Name="GCStart_V2" FriendlyName="GC">
              <Event Provider="{e13c0d23-ccbc-4e12-931b-d9cc2eee27e4}" Id="1" Version="2"/>
              <Event Provider="{e13c0d23-ccbc-4e12-931b-d9cc2eee27e4}" Id="2" Version="1"/>

Not bad for such a simple file to make WPA GC aware. To make this file active you need select in the menu Trace - Trace Properties and add the file to it. Then a new Regions of Interest graph is added to your existing graphs which you can add to your analysis pane as usual. 


To check when a GC did start/stop this overview is really nice. If you need a deeper analysis there is no way around PerfView which gives you all the numbers pre analyzed in a per process view. When you have managed memory leaks you should check out the Memory menu of PerfView to take GC Heap dumps from running processes (nice) but also from memory dumps (even better). The latter functionality is essential if you only have a dump from a crashed process but it is not obvious who is holding most objects. I know .NET Memory Profiler from SciTech which also can import dump files but I got lost in the fancy UI. PerfView on the other hand hand looks old fashioned but it does its job much better than the UI might suggest:


Here I do see that I keep 100% of my data (10GB in a static variable in 10003 instances) which is very easy too see. Other profilers can do this also very well. But if you double click on the type you get the shortest roots to all objects holding your data nicely printed in a tree view. Cyclic references are broken up and the relevant nodes are marked bold. Other commercial tools had a hard time to tell me who is holding my data if graph cycles are involved. PerfView was able to show me it directly without much navigation in a complex interdependent object graph. To check who is allocating too much data you can also use PerfView when you enable ".NET SamplAlloc" to sample your allocations with minimum performance impact. Then you get the allocation stacks nicely lined up by object type. If you can afford it performance wise you can also enable ".NET Alloc" to track every managed object allocation which gives you exact numbers. That's all for today. I am constantly amazed how much useful data you can get out of ETW and shocked at the same time that so few people know about it. If your new years resolution is already history how about a new goal? Learn how to get most out of ETW!

Thursday, February 6, 2014 #

Most people using Workflow Foundations only know TrackingParticipants to enable activity tracing. With WF 4 the Sql TrackingParticipant was deprecated and replaced by EtwTrackingParticipant. WF has been wasteful with system resources since its very beginning but it has become better with WF 4. If you think that a SQL tracking participant is a good idea think again why MS did deprecate it:

[ObsoleteAttribute("The System.Workflow.* types are deprecated.  Instead, please use the new types from System.Activities.*")]
public sealed class SqlTrackingService : TrackingService, 

You should not try to create your own version of it because it is such a brilliant idea. It never was and never will be a well performing tracing destination. Lets see how far we get with the new ETW support for WF. The easiest way for e.g. a WF Console application is to add it directly to your WorkflowInvoker:

static void Main(string[] args)
    Activity workflow1 = new Workflow1();
    var invoker = new WorkflowInvoker(workflow1);
    invoker.Extensions.Add(new EtwTrackingParticipant());

Another option is to enable it indirectly via your App.config

    <source name="System.Activities"
            switchValue="Information, ActivityTracing"
        <add name="console" type="System.Diagnostics.DefaultTraceListener"/>

There is no mention of ETW in the used listeners but the ETW listener is added automatically when you have trace listeners for System.Activities configured.

Now you can enable ETW tracing with xperf like this:

xperf -on proc_thread+Loader -start workflow -on c651f5f6-1c0d-492e-8ae1-b4efd7c9d503  

The guid is the ETW provider for WCF and WF tracing. Its name is "Microsoft-Windows-Application Server-Applications". This xperf command will create in your root folder a kernel.etl and user.etl file to which the events are logged to. The default buffer size is 20 MB which you can alter if the event rate is higher than the buffer can be flushed to disc.

To stop tracing you can use

xperf -stop -stop workflow -d c:\temp\workflow.etl   

Finally you want to watch your traces WPA. Navigate to Generic Events and open the "Microsoft-Windows-Application Server-Applications" event provider:


There you see directly how your activities were scheduled and executed along with their activity arguments as long as they have a string representation. Since we have ETW we can add our own traces to ETW and the system wide profiling events as well to get the full picture in one file with a good analysis tool. Besides being central you have also chosen one of the fasted possible ways to write tracing information to an output device. The slowest thing is the DefaultTraceListener in your App.config which writes them via OutputDebugString. These events show up in VS in the output window which is nice but the output speed of OutputDebugString is not great. If you want to get maximum performance you need to write your own NullTraceListener which simply discards all traced data to enable writing only to ETW.

MS is investing heavily into its ETW infrastructure now and in the future. You can find very good online trainings at MS Virtual Academy: I have watched some videos today and they have really great content online about rare topics. Personally I found On/Off Transition Experience the most enlightening one. After this video you can rip your slow booting machine apart and make it boot much faster. With the content presented there I was able to cut down my boot time from over two minutes down to one minute, which is really great once you know where to look. The biggest issue on my machine was the DebugSvc was having over 90s of service init time which delayed the whole boot and the initialization of all other services a very long time. By simply disabling the service I got a really good improvement there. Another thing is to check if  during your boot .NET 2 applications are still running. Most of them can run also under .NET 4.0 when you change the app.config file and add

     <supportedRuntime version="v4.0"/>

you can convince them to use .NET 4.0. This will get rid of all hard disc accesses to the .NET 2.0 dlls which can cost you several seconds of hard disc access time. Another issue is SQL server. Here is a slide which I have shamelessly copied from Analysis at Scale.



This is real world data from over 50 000 computers. After seeing this graph I immediately changed my SQL Server instance to delayed boot so it gets out of the way of the other stuff which needs my hard disc more at this time. If you are working on a Laptop with a slow hdd and you go green with envy when you see how much faster your colleagues can work with their "old" desktop PCs. Here is a graph which shows boot time depending on the used hard disc:


If you need more proof why your money is well spent on a new SSD for your boot drive you can show this to your boss. It does really make a huge difference no only during boot time. What I really found fascinating in the On/Off Transition video is that there are companies out there which have Windows machines which need two hours to boot, and they were ok with that. Well not entirely. Finally they called in MS support to check their machines and they applied a hotfix to reduce the boot time from 2h down to 8 minutes. A few tweaks later they were down to two minutes. But the most shocking fact was that their workaround was that one member of the staff needed to arrive two hours earlier than all other ones to boot up all office machines. Incredible. I wonder if the company is still in business if they have equivalent efficient business processes…

Tuesday, February 4, 2014 #

Recently I had an interesting issue to diagnose. We all (readers of my blog) know that forgotten unregister calls to events are a common source of managed memory leaks. The weak event pattern is a common cure to this plague and WPF has a nice implementation for it. With .NET 4.5 WeakEventManager has got a new cousin which does not require you to derive from it which is even nicer. Every time when I needed to deal with weak event registrations I used it and it worked fine. Since WeakEventManager is a low level class I have used it also in non ui applications which turned out to be a bad design decision.

What do you think will this fine sample application do?

    class Program
        public static event EventHandler OnSomething;

        static void Main(string[] args)
            var p = new Program();
            for(int i=0;i<1000*1000;i++)
                ThreadStart func = () =>
                        // Register a weak event handler 
                        WeakEventManager<Program,EventArgs>.AddHandler(p, "OnSomething", (o,e)=> {});

                // Do the registration every time on a new thread
                var thread = new Thread(func);

            // lets see how far we get

No not the answer ….

Still no

It will give you this strange looking exception:

System.ComponentModel.Win32Exception: Not enough storage is available to process this command

   at MS.Win32.UnsafeNativeMethods.RegisterClassEx(WNDCLASSEX_D wc_d)
   at MS.Win32.HwndWrapper..ctor(Int32 classStyle, Int32 style, Int32 exStyle, Int32 x, Int32 y, Int32 width, Int32 height, String name, IntPtr parent, HwndWrapperHook[] hooks)
   at System.Windows.Threading.Dispatcher..ctor()
   at System.Windows.Threading.Dispatcher.get_CurrentDispatcher()
   at MS.Internal.WeakEventTable..ctor()
   at MS.Internal.WeakEventTable.get_CurrentWeakEventTable()
   at System.Windows.WeakEventManager`2.CurrentManager(String eventName)
   at System.Windows.WeakEventManager`2.AddHandler(TEventSource source, String eventName, EventHandler`1 handler)
   at WhatsWrongWithThisCode.Program.<>c__DisplayClass3.<Main>b__0() in d:\Media\Blog\WhatsWrongWithThisCode\WhatsWrongWithThisCode\Program.cs:Zeile 23.
   at System.Threading.ThreadHelper.ThreadStart_Context(Object state)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.ThreadHelper.ThreadStart()


What happened here? The WeakEventManager has a hidden dependency to a Dispatcher object which assumes that you live on a thread that has an active window message pump. My newly created threads did never pump messages which causes the scheduled background operations to cleanup garbage collected event subscribers to hang forever. If you execute this logic on only a few threads nothing bad will happen. But if you use this in a server application which runs for weeks on dynamically generated threads then your task manager will look like this:




All looks normal except that we seem to leak many handles. In fact we do leak thread handles. It seems that the RegisterClassEx causes the managed thread objects to get pinned. That is not nice but no big deal. Windows can handle many more handles. But here is the catch:

  • You just have screwed your whole machine!

All of a sudden you cannot open new applications. Windows cannot be created, the C++ compiler will stop with an error that no .NET Framework is installed and many other very odd messages from any application can happen. At this point you usually reboot the server and hope that the error does not show up too often. Did you let this application run? Congratulations. A reboot is required Zwinkerndes Smiley quite soon.

When you google for the error message you get some links back that a non paged pool allocation failed and you fix it by changing some magic registry keys. That's what I did but it did not help. My application did not run out of non paged pool memory but it ran out of Desktop Heap. If you have never heard about Desktop Heap you are in good company. This heap is global for all applications in your session and many Window related methods allocate stuff from it. According to the NT debugging blog it is mainly used by CreateWindow but other functions (which are not documented) can also cause Desktop Heap allocations. Apparently it seems that we have just found another method named RegisterClassEx which also allocates from the Desktop Heap. You can change the desktop heap size for your interactive session by changing the registry key

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\SubSystems\Windows

which has on my Windows 8.1 box this value

%SystemRoot%\system32\csrss.exe ObjectDirectory=\Windows SharedSection=1024,20480,768 Windows=On SubSystemType=Windows ServerDll=basesrv,1 ServerDll=winsrv:UserServerDllInitialization,3 ServerDll=sxssrv,4 ProfileControl=Off MaxRequestThreads=16

The yellow marked value is the one you usually care about which is the size of your interactive desktop heap in KB. 20 MB is not so much but under Windows XP it was only 3MB where this error was much more common. By increasing this value to an even bigger one I was able to make this quite often happening issue to a very sporadic one which made this a not urgent issue. I tried to analyze the desktop heap usage of a kernel dump with !dheap but this Windbg command did never work so I was stuck until I did check the handle leak. When I have tracked down the root cause of the handle leak to the WPF WeakEventManager class it all made sense. The bad thing was that I always searched for high values of User Objects (window handles) which never happened when the machine ran out of desktop heap, leaving not many clues which process was responsible for allocating so much desktop heap.

What is the morale of this story? Never use WeakEventManager in threads which have no Window message pump running or you will run out of desktop heap sooner or later if you call it from many threads!

Reason found. Now what? When you search for Weak event patterns on the net you will find samples which

Although the concept is simple, a working implementation which works in all circumstances seems to be surprisingly hard to create. So far I have found FastSmartWeakEvent from Daniel Grunwald the best implementation which keeps the original code with += and -= to register and unregister a weak delegate and does even throw an exception at you when you try to register a delegate which contains closures which would cause the event never to fire because the compiler generated closure class is not referenced by anyone except the weak (too weak) delegate which is collected as soon as the method where the registration happened is left. Next time you see "Not enough storage is available to process this command" you know now where to look at. If the handle count is high the odds are good that someone did use the WPF WeakEventManager on non ui threads like I did.

Saturday, January 25, 2014 #

During the end of 2013 I had a little free time to create a tool to help diagnose performance problems in distributed systems. If you have a machine connected to a remote server and you want to find out why e.g. your game sometimes has hiccups you can run ETWControler on both boxes to get ETW tracing from both machines at the same time.

That alone of course could also be achieved with a shell script. But there is more to it. It comes with a full keyboard and mouse logger where all keyboard and mouse events are sent to ETW on the client, over the network to the server. There are the the received keyboard/mouse events also written to ETW.

That allows you to correlate in an easy way the local/remote and network traces with full timing information. More info's can be found at its CodePlex site.

The main use cases are covered here


Here are some screenshots:




It was fun to learn WPF and MVVM by creating my first WPF application. WPF although no longer hip is a great way to create good looking desktop apps quite fast. Perhaps I was learning an old thing but old things are likely to stay for the next 20 years. If you read the code you will find that it uses an socket quite directly to send the keyboard events as fast and clean over the wire as possible which was also an interesting exercise to learn how hard it is to use plain sockets. Sockets have way to many options to configure them in funny ways where the connection will never time out and things like that.

If you have found creative uses for this tool or interesting ideas how to extend it drop me note. Currently I am thinking about to add more control about what should be recorded by using TraceEvent instead of WPR to have more ETW collection options.

Wednesday, January 8, 2014 #

Many developers mainly know their application frameworks to rapidly deliver business driven applications. This is a good thing but still there are things like the OS, CPU, memory and the network which are critical to application performance.

The network is often viewed as a bidirectional connection between two computers




This is all well until things start to fail. The network topology then can quickly can change from two arrows to a black hole which tends to suck up all information and

Side Note: Black holes are not entirely black and will eventually fully disclose their stored information as an exploding white hole or a dim radiation originating from quantum effects where e.g. a virtual electron positron pair is separated by the gravitational force. One particle lands in the black hole and the other one has no chance than to become a real particle since there is no anti partner available anymore. If this effect is real no one knows but it is an interesting idea. Search for Hawking radiation if you want to know more.

never gives it back. Sudden hangs in network connections are exceptionally hard to analyze since it involves two computers and the magic between them.



In that case you need to check both computers and the network in one run to fully analyze the issue. To be fully able to analyze network glitches you need to use WPT (Windows Performance Toolkit) on both machines and Wireshark on one machine to capture all in/out packets. It is difficult to correlate both ETW traces with the Wireshark capture. Things become a lot easier if you have a remote interactive application like an online game were the server reacts on user input. In that case you can simply capture all mouse/keyboard events and save them to ETW locally and send them as well over a dedicated port over the wire and save the captured keyboard/mouse events on the remote machine as well. Then it is much easier to correlate network traces with ETW events because you have the keyboard/mouse events as markers in your ETW and network event stream. I had some time to play with this idea and I did try to learn something new during Christmas. In my case it was WPF so I created a small tool to put this idea to work.

Currently it looks like this.


Here I can capture mouse and keyboard events and save them to ETW and over the network on both machines. At the same time I can start on both machines ETW traces.


I think this is a powerful capability to correlate network and ETW events captured from different machines. You did perhaps notice the Slow button which defines a hot key which logs an additional Slow ETW event into the stream to mark an interesting condition e.g. a small hang. That makes it easy to correlate perceived user experience with a good marker event in the Wireshark and ETW traces. You can employ this concept in general to distributed systems as well if you know which action on one machine triggers operations on one or more other remote machines. If you log the start requests into your ETW and network stream (e.g. an diagnostics port unencrypted to make reading easier) there is no room for dark spots. The downside of course is that this type of analysis triples the amount of work since you need to check two machines and one network trace. Reading TCP traces is not the most natural thing to a developer but it is an interesting undertaking.

If you want to know more about the TCP protocol, HTTP, SPDY, …. you should have a look at High Performance Browser Networking which explains things from the bits up to the high level HTTP protocol and where you can tune things. Since the author is a performance engineer at Google he knows what he is talking about.

More details are the topic of a future post when I publish the code on CodePlex.

But back to the original topic of network hangs and the Silly Window Syndrome (SWS). If you are e.g. sending every keystroke over the network and you are fast at typing you can create the SWS since a lot of very small packets are going over the wire. In that case SWS avoidance algorithms do kick in which do delay the communication on purpose to force the receiver or the sender to generate bigger packets so the TCP window size can be increased from 0 bytes to a reasonable value. On Windows there is a Silly Window Syndrome avoidance timer used. The receiver of a small packet needs to ACK the small packet. The server will wait up to 5s for the next packet when the client does not respond to the packet with an ACK and an updated bigger tcp receive window size.

This all deep network stuff. Why should I care on my local machine and the loopback device? Well it happens that there is a bug in Windows 7 which causes your network connection to stall for 5s sometimes. The issue is that the SWS Timer is started AFTER the packet has been sent. If the ACK is received before the timer can start it will wait for the ACK (which will never come) for the full five seconds before sending the next packet.

This is known since 2011 but it took many support request to MS to release a hotfix for Windows 7 and Server 2008 R2 July 2013. Windows is used in many mission critical soft real time systems where inter process communication via sockets can break at any time for 5s. This issue has certainly caused much pain in the wild and very few people will have found the root cause of intermittent loopback device hangs.

If you have software communicating over the loopback device you should really get this hotfix to get rid of these sporadic errors. A workaround is to plug in a real loopback network cable which gives you enough delay so that the timer does start before the ACK is received. I am quite sure that this issue is the source of many failed UI automation tests where the UI did not react to user input sometimes within the usual time frame (e.g. 5s) but sometimes it did take much longer. It could be a good idea to add this fix to your test machines as well.

How does this issue look in reality? Lets have a look at some busy processes communication with each other:


The region should be red but that's the color how WPA shows highlighted region. All processes are busy but the whole machine hangs for 5s. There is one innocent looking event named TcpSWSAvoidanceEnd which signals that it has done its full 5s wait and it has fixed the non existing SWS issue. The TcpSwsAvoidanceBegin events are quite common in this scenario. From 200K packets 40K are TcpSwsAvoidanceBegin events which all spin up a 5s timer which is cancelled when the ACK from the client is received. Due to the heavy network load the race condition has a good chance to happen and it will happen from time to time.

Today is my side tracking day. I have another network hang which is until now not fully understood. If you enable in Internet Explorer Automatic Proxy Detection your http requests are routed by a java script downloaded from a proxy server. The mechanism behind this is the Web Proxy Autodiscovery Protocol. This uses your full qualified computer name as starting point to find the server which hosts the http url routing script. Your computer has e.g. the name

The browser will download from the following urls the proxy redirection java script.

If you have enabled this it can happen that some http requests take a long time for no apparent reason. The fix is to disable automatic proxy detection and configure the proxy manually. This is a good thing anyway since for some reason only known the NSA a server exists. This means if your company has no WPAD server configured you are downloading java script code from a server someone else has control over regardless of the web page you are trying to access …


Friday, January 3, 2014 #

What is wrong with this code?

List<Bitmap> bitmaps = new List<Bitmap>();
private void button1_Click(object sender, EventArgs e)
   for (int i = 0; i < 10 * 1000; i++)
       var bmp = Resources.SmallIcon;

Update: This issue is fixed with KB 2929755 from Windows 7 up to Windows 8.0 and

               with KB 2919355 for Windows 8.1 and Server 2012 R2.

This is my first Windows bug that actually got fixed. The later update was rolled out via Windows update already in April so you should have it by know already.</Update End>

From a correctness point of view nothing. You allocate 10 000 bitmaps which are e.g. a 32x32 icon which has about 1KB size. That would make another 10 MB of memory which is nothing out of the ordinary. You let your application run for a few hours and Task Manager shows you 1 GB of memory consumption and boom you run out of memory. The interesting thing is that you have not used up all available memory but your 32 bit application has run out of address space.

When I do execute this on my home machine with Windows 8.1 x64. I get the following numbers


We allocate 163 MB in private bytes. That is expected. But what about the 1.7 GB of reserved memory? Ok Windows might have a bad day. Lets try it with 5000 bitmaps.


# of Bitmaps Reserved KB KB Reserved per Bitmap KB # 64 KB Blocks/Bitmap










When I do this for zero and one bitmap I get 3 64 KB file mappings for each bitmap which has one KB size. When all address space is used up Windows GDI seems to free some memory to leave place for future allocations (that explains the factor of 2,7).

Windows is reserving memory 192 times the size of our bitmaps!

This is true for 64 bit processes on Windows 8/8.1 as well but there it does not matter. The behavior is the same for all editions of Windows 7 x86 and Windows 8.x x86/x64. The only Windows edition where this does not happen is Windows 7 x64 where the GDI implementation is different. When I look at some applications which make use of bitmaps like Windows Live Writer I do see 80 MB of shareable memory where lots of 64 KB blocks are allocated. This is a real problem in all applications that read many bitmaps out of resources! Interestingly the most prominent example VS2013 on my machine has no such problems. It seems that MS is not using their own infrastructure…

Next time your managed application does run out of memory check your Shareable Reserved bytes count for many 64KB blocks. If you have many of them then you should check out your bitmap usage.

What is this Shareable Memory anyway? Good question. Shareable Memory are file mapping objects which are page file backed. Any file mapping that does point to a real file on disc is shown as Mapped File with the full file name. That is the reason why there is not much to see except the allocation address inside VMMap. That was an interesting bug which involved some time with Windbg and a memory dump of a crashed process.

I can recommend the !address extension of Windbg to dump of a process all pages and their allocation mode and export the data into excel from where it is easy to create pivot charts to find out that the Sharable Memory is all allocated as 64KB blocks. Once that was known I created some Windbg scripts to break into the CreateFileMapping and MapViewOfFile methods to get the allocation stacks. Unfortunately due to the many file mapping objects created by the process it became so slow that the application was unusable. A faster approach than Windbg was needed.

After a little thinking I thought why not intercept the calls with a hooking library like EasyHook? It is a great library which allows you to execute managed code in your hook while you can intercept nearly any Windows API call. Now I can get my hands on the mapping address (MapViewOfFile return the mapped address, UnMapViewOfFile receives the mapping address to unmap) and allocation size besides other things. But one thing is missing: Call Stacks.

Here it is good to know that .NET support ETW events by now. What is easier than to create an ETW provider in managed code which logs the all calls to MapViewOfFile and UnmapViewOfFile as structured events? The kernel knows how to walk the stack very fast. That's what I did and it did work perfectly. When you know how WPA displays custom ETW events you can even use it to analyze unbalanced map/unmap calls. The key insight here is that you need to put the mapping address of Un/MapViewOfFile into the same column. Now you can group by this field and get as grouping the mapping address. Now sort by count and look for all events which have the only one mapping call but no unmap. Select all unbalanced maping calls and filter for them.

Now you can earn the fruits of your grouping by adding the StackTrace column and look in which methods most unbalanced allocations are happening. Perhaps I can get permission to make the code public so you can use it for other purposes as well. The concept is general. You need to log the allocation and deallocation events into ETW and put the handle (e.g. the address or HANDLE) as first property into your ETW event. Then you can record all allocation and deallocation events with full call stacks. WPA will display the handle in the first column for your allocation and deallocation calls which allows further grouping and filtering.

You can skip that part and use wpaexporter to export your events to a CSV file and write your own tool to do more sophisticated analysis. Or you could use the TraceEvent Library and process the ETW events directly. It is up to you.

If you are having a hard time to find leaks related to GDI objects like Bitmap, Brush, Device Context, Font, Metafile, Pen, Region, Palette you can hook the calls and have an easy day. This approach is not new and has been described already 2003 in this MSDN magazine article. There are also the method names listed you might be interested in to hook. The new thing is that it was never so easy to get full managed and unmanaged call stacks with nearly no performance impact. The MSDN article relies on SetWindowsHookEx to inject code into other processes and dbghelp.dll of walk the stack with the debugging Apis. The combined ETW and EasyHook approach is much more performant and general (full mixed mode stack walking).

Without that it would have been much harder to find out the root cause. With ETW tracing you have access to first class information which is not based on guessing but exact measurements. That not only allows you to find bugs but you can now also make quantitative predictions if you fix the bitmap usage in this or that method how much address space and private bytes  memory you will get back. It is kinda cool to go to the devs and tell them: If you fix that method you will free up 300MB of virtual address space. If we fix the top 5 then we can reduce the address space usage of Shareable Memory by 70%. Most of the time they say: "I do not believe that this innocent method is responsible for such a big amount of allocations. Here is the fix and check again." After measuring the gain of the fix I can compare the predicted results and tell them: "Looks like my prediction was correct and we have an exact match."

This is a real game changer. Now you can work focused on the most important methods which cause a leak based on accurate measurements. Thanks to the low impact ETW tracing it is possible to run all your tests with hooks enabled giving you in the case of leaks accurate data to find the leak without the need to repeat the tests! At least in an ideal world. But since setting the hooks up may cause random crashes it is safer to employ this technique only when a leak was found.

There is one drawback to hooking and ETW stack walking: If you are on Windows 7 stack walking will only work in 32 bit processes which (lucky me) was the case for this problem. The ETW stackwalker in Windows 7 stops at the first dynamically generated stack frame (the assembly code for the hooks) which will give you only a stack which ends in your trace call. Not terribly useful. This has been fixed in Windows 8+ where you can hook with full call stacks also in 64 bit processes.

Since Windows 7 will stay for some time I really would like to get a fix for the Windows 7 ETW infrastructure to get full call stacks in 64 bit processes. For managed code you can NGEN your assemblies which fixes the issue if you do not have much JIT generated code running in your process. The stackwalker of Windows 7 will stop when it encounters the first frame which is not inside any loaded module.

I can only speculate how many 32 bit applications are affected by this issue and run out of memory much earlier than they would need to. At least for regular .NET applications this is a real issue nobody did notice so far. The 64KB value is no accident. It is the OS allocation granularity which is the reason why always 64 KB chunks of memory are reserved although you can map smaller regions (e.g. 4KB as well). The OS allocation granularity seems to be used by GDI and put into the Maping calls explicitly. The called GDI method is GdipCreateBitmapFromStream which assumes an IStream interface which is converted to a file mapping region under the hood. Other GDI methods seem not to be affected by this issue but one can never know. If you have a bitmap intensive managed x32 application (many icons) you should better check with VMMap the amount of Shareable Memory when your application has run out of memory.

Saturday, December 28, 2013 #

Yes I know. If a unit test accesses a file it is by definition no longer a unit test. My personal definition of a unit test is: If it is fast I can run many of them which is ok for the main goal of unit tests to provide fast feedback. Personally I am not big fan of repetitive code so I created a small class that takes care of creating a temp folder in the %TEMP%\utest\ directory which creates a directory named after your unit test method name which is quite handy for debugging failed unit tests.

public void Can_Create_Temp_Folder()
    using (var tmp = TempDir.Create())
        var txtfile = Path.Combine(tmp.Name, "test.text");
        File.WriteAllText(txtfile, "hi");

This will give you a fresh directory for every test


which is deleted after the using statement has been left including its contents. You can create variations of that to e.g. leave the temp files in place when the unit test fails to make debugging easier (See here how to check in a finally block if you are in an exception unwind scenario). Or you can postfix every test with a guid to allow 100% separation of concurrently running unit tests and to make absolutely sure that you are not reusing old data of an unit test which had some files opened which could not be deleted last time. The simplest form is a small code snippet like this:

    interface ITempDir : IDisposable
        string Name

    class TempDir : ITempDir
        public string Name
            private set;

        /// <summary>
        /// Create a temp directory named after your test in the %temp%\uTest\xxx directory
        /// which is deleted and all sub directories when the ITempDir object is disposed.
        /// </summary>
        /// <returns></returns>
        public static ITempDir Create()
            var stack = new StackTrace(1);
            var sf = stack.GetFrame(0);
            return new TempDir(sf.GetMethod().Name);

        public TempDir(string dirName)
            if( String.IsNullOrEmpty(dirName))
                throw new ArgumentException("dirName");
            Name = Path.Combine(Path.GetTempPath(), "uTests", dirName);

        public void Dispose()

            if( Name.Length<10)
                throw new InvalidOperationException(String.Format("Directory name seesm to be invalid. Do not delete recursively your hard disc.", Name));

            // delete all files in temp directory
             foreach(var file in Directory.EnumerateFiles(Name, "*.*", SearchOption.AllDirectories))

            // and then the directory

Small, simple but effective.

Saturday, December 14, 2013 #

When you try to make sense of Windows Event Tracing files (.etl) it is often helpful to add some special events into the event stream so you can closer examine near the “Here it was slow” event which helps a lot to look at the right spots. If you are lucky you can do this by routing your tracing data to ETW or you need to set some marker events by hand. The traditional was of doing this was to call xperf –m “Here it was slow” during your repro session. I have observed several times since then that this command did no longer work on some machines.

C:\Program Files (x86)\Windows Kits\8.1\bin\x64> xperf -m "hi"
xperf: error: Set Mark failed                                 

These things did never happen on my Windows 7 machine. I thought it was some oversight in Windows 8. Interestingly if I started the tracing session with xperf all worked well again. But when I did start a tracing session with the new WPR/UI tool from the Windows 8.1 SDK this error was consistently there. After digging a little deeper with xperf –Loggers I found that WPR/UI do not create the usual "NT Kernel Logger" session but instead a WPR_initiated_WprApp_WPR System Collector session for which I cannot see the trace flags anymore.

C:\Windows\system32>xperf -Loggers
Logger Name           : WPR_initiated_WprApp_WPR System Collector
Logger Id             : 11
Logger Thread Id      : 0000000000000000
Buffer Size           : 1024
Maximum Buffers       : 1632
Minimum Buffers       : 1632
Number of Buffers     : 1632
Free Buffers          : 1624
Buffers Written       : 0
Events Lost           : 0
Log Buffers Lost      : 0
Real Time Buffers Lost: 0
Flush Timer           : 0
Age Limit             : 0
Log File Mode         : Secure Buffered PersistOnHybridShutdown SystemLogger
Maximum File Size     : 0
Log Filename          :
Trace Flags           :
PoolTagFilter         : *

That does explain why I cannot add mark  events anymore on Windows 8 machines. There is no more "NT Kernel Logger" session to set the mark event. This is bad news for xperf –m but good news for all people who like to keep Process Explorer and other tools open while they are taking kernel trace events. It seems that xperf was not updated to display the tracing flags for the new type of kernel tracing sessions because another tool from the Windows 8.1 SDK can display them: tracelog

C:\Program Files (x86)\Windows Kits\8.1\bin\x64>tracelog.exe  -l    

Logger Name:            WPR_initiated_WprApp_WPR System Collector
Logger Id:              0x11
Logger Thread Id:       0000000000000000
Guid:                   c7f66f95-6045-11e3-8265-d43d7ef06eb2
Session Security:       D:(A;;0x800;;;WD)(A;;0x120fff;;;SY)(A;;0x120fff;;;LS)(A;;0x120fff;;;NS)(A;;0x120fff;;;BA)(A;;0xee5;;;LU)(A;;LC;;;MU)(A;;
Buffer Size:            1024 Kb
Maximum Buffers:        1632
Minimum Buffers:        1632
Number of Buffers:      1632
Free Buffers:           1624
Buffers Written:        0
Events Lost:            0
Log Buffers Lost:       0
Real Time Buffers Lost: 0
AgeLimit:               0
Real Time Consumers:    0
ClockType:              PerfCounter
Log Mode:               Secure  Buffering-only
Hybrid Shutdown:        Persist
Maximum File Size:      not set
Buffer Flush Timer:     not set
Enabled tracing:        Process Thread ProcCounters ImageLoad DiskIo HardFaults CxtSwap Dpc Isr Profile Power MemInfo MemInfoWs Priority Dispatcher … 
Stack Tracing:         
DiskRead DiskWrite DiskFlush ThreadDCEnd CSwitch ReadyThread KernelQueueEnqueue KernelQueueDequeue Profile
Log Filename:


When you look at the documentation of tracelog you will find many Windows 8 and Windows 8.1 additions which make it a cool tool to record trace sessions. Although I must admit that the sheer number of tools to start traces is getting out of hand. We have

ETW Control Tool Part Of
WPR/UI Windows Performance Toolkit
xperf Windows Performance Toolkit
logman Windows
tracelog Since Windows 8.1 SDK formerly only part of WDK.
PerfView Extra Download from MS

Each tool has another syntax and philosophy behind which makes it a bit cumbersome to use the right tool. But as far as I can tell you should check out tracelog first since it is the only tool which is able to make use of the new cool Windows 8/8.1 enhancements. Below is a list of the new ETW capabilities of Windows 8/8.1

Since Tracelog Feature Description
8.1 -bt n Specifies the number (n) of buffers to fill before starting to flush them.
8.1 -EventIdFilter {-in|-out}nid1 id2 ... Specifies an event id filter with n event ids (maximum 64 event ids allowed).
8.1 -ExeFilter Executable_file [; Executable_file ...] Specifies the names of executable files to filter. You can specify a list of files. Separate the names of the files using semi-colons. Files not listed are excluded.
8 -hybridshutdown {stop|persist} Controls hybrid shutdown logger behavior.
8.1 -independent Enables independent mode on the trace session.
8.1 -PidFilter n pid1 pid2 ... Specifies a Pid filter with n Pids (maximum 8 allowed).
8.1 -PkgIdFilter Package Full Name [ ;Package Full Name...] Specifies a package ID filter. You can specify a list of package files. Separate the names of the files using semi-colons.
8.1 -PkgAppIdFilter PRAID [ ;PRAID...] Specifies a package-relative app identifier (PRAID) filter. The PRAID is the unique identifier of the application within the package. You can specify more than one PRAID. Separate the ids using semi-colons.
8 -Pmc Ctrs:Events Configures the performance monitor counter (PMC) sampling on events.
8 -ProfileSource src Configure profiling source to use. For list of sources, use the command tracelog -ProfileSource Help.
8 -SetProfInt n src Configure the profiling interval (n) for specified source. Where n represents units of 100ns. The default is 10000 (which is equivalent to 1ms.
8.1 -StackWalkFilter {-in|-out}nid1 id2 ... Specifies an event id filter with n event ids (maximum 64 event ids allowed).
8 -systemlogger Logger can receive SystemTraceProvider events. See Configuring and Starting a SystemTraceProvider Session.

By default tracelog will start loggers with kernel mode providers with the systemlogger option so you can start up to 8 additional Kernel Loggers which should be enough for everyone. To log e.g. all Context Switch events with call stacks you can use

tracelog -start MySession -f c:\Kernel1.etl -img -eflag PROC_THREAD+LOADER+CSWITCH+DISPATCHER -stackwalk CSWITCH+ReadyThread         

This will create a new kernel trace session (Windows 8 and later only!) without blocking the default Kernel Logger Session which is the most useful thing of the new capabilities of the tracelogger tool. When I stop it with

tracelog -stop mySession 

then no image identification rundown is performed which means that I have call stacks but WPR cannot identify in which images the stack addresses do reside. It seems also not be able to merge etl files like xperf can do. Not to mention that it does not create the manged pdbs when stopping a trace session like wpr and xperf are able to. Still tracelog is the only tool which exposes the full power of ETW as a command line tool. It has the most interesting options to start new trace sessions but it seems to (correct me I really would like to know) be the wrong tool to stop a trace session by initiating all the necessary image identification and managed pdb creation. I really would like to see additional parameters in xperf to create the managed pdbs later when I have gathered and etl file.

To limit the potentially huge output of ETW can now be filtered at least at process level for user mode providers (-PidFilter or -ExeFilter) and to enable/disable stack walking for specific events and not the complete provider (-StackWalkFilter …) which is a welcome addition for Windows 8.1 users of ETW. Based on my limited experience with the new tracelog tool I would use it to start trace sessions for sophisticated analysis tasks on Windows 8 machines but stop the trace session with xperf to get the usual image identification and ngen pdb creation support.

The command

xperf -stop MySession -d c:\temp\xperfStop.etl   

did the job to stop the system logger and I got readable call stacks with a full .NET support. If you do not want to use a myriad of tools you can also use PerfView which Vance Morrison uses a lot to troubleshoot VS issues. The command line syntax is vastly different than the previous tools but it can be quite handy if you only need to deploy a 10MB exe which can collect and view the ETW data.

perfview start c:\temp\perfview.etl  -nogui

will start profiling with a sample interval of 1ms and also capture all disc related stuff which is a good default to start. If you need to do some wait chain analysis you can use

perfview start c:\temp\perfview.etl  -nogui -ThreadTime

which will enable an additional bunch of additional providers:

Enabled tracing:       Process Thread ProcCounters ImageLoad DiskIo DiskIoInit HardFaults CxtSwap TcpIp Profile Dispatcher
Stack Tracing:          DiskRead DiskWrite DiskFlush 0x0220 ProcessCreate ImageLoad ThreadCreate CSwitch ReadyThread Profile

To stop tracing you need to do

perfview -zip:false stop   

You can view the generated file with

perfview c:\temp\perfview.etl   

or with WPA or your favorite viewing tool. Perfview is for some managed analysis very handy. With handy I mean that it is the only tool to analyze managed processes for things like

  • GC wait times
  • JIT times
  • Managed Allocation Tracing (type and stack)
  • thrown exceptions without using a debugger, profiler or your own tracing

I did refuse to try perfview for a long time because it complained always that nearly all my stacks were broken. The issues is that it handles loading pdbs for call stack resolution and filtering quite different.


The first hurdle is to notice that there are always stack filters enabled (See GroupPats) which will fold your stack to nearly useless many times. After deleting the GroupPats text box I get real stacks but no symbols:


Now I have the stacks but still no symols. To fix that I must use the context menu to load for this call stack the corresponding symbols. The MS symbol servers are configured by default. If you have your own symbol server you need to reconfigure it in the File - Set Symbol Path menu first.


After that I get call stacks with method names as usual:


Once I did realize that I found PerfView quite useful. It is not only for performance work interesting but also as a system wide debugger which captures all thrown managed exceptions in all processes at once. The tool itself has many more options which you can explore with

Perfview /?

It supports some really advanced scenarios like collecting data in a ring buffer until a specific performance counter reaches a threshold then capture x seconds further and then stop. The main drawback of ETW with full profiling and context switch capturing is that the amount of data on a busy system exceeds 1 GB of data within one minute easily. If you do want to capture data for a really long time and stop when the interesting condition has happened you need more sophisticated stop triggers. PerfView has captured this operational know how in its command line parameters.

It can stop on

  • Performance Counter Thresholds
  • Application Event Log Messages
  • ETW Events
  • Gen 2 Collections
  • Long Gen 2 Collections
  • Managed exceptions of a specific type
  • AppFabric request time (Azure)

This makes it a cool tool to diagnose sporadic errors but you cannot afford to let the application to run under a debugger. The biggest pain point of PerfView is that it is great to collect data but it cannot live with the generated etl file and needs to convert it to its own format (etlx) for further processing. This can take some additional minutes to load bigger etl files but I hope it will become faster in future versions.

Below is the full command line of PerfView which has gotten not enough attention (just like tracelog) so far. Tracelog can be excused by living its hidden live in the Windows Driver Development Kit until it was moved to the Windows SDK with 8.1. It seems that device driver developers do not share their tricks online like "normal" .NET developers do.


The PerfView application has a number of commands associated with it, each with its own set of parameters and
qualifiers.  They are listed below.  Options that are common to all commands are listed at the end.

Usage: PerfView [DataFile]

  View profile data.

    [DataFile]                           ETL or ETLX file containing profile data.

Usage: PerfView run CommandAndArgs ...

  Starts data collection, runs a command and stops.

    CommandAndArgs ...                   Command to run and arguments.

Usage: PerfView collect [DataFile]

  Starts data collection, wait for user input, then stops.

    [DataFile]                           ETL file containing profile data.

Usage: PerfView start [DataFile]

  Starts machine wide profile data collection.

    [DataFile]                           ETL file containing profile data.

Usage: PerfView stop

  Stop collecting profile data (machine wide).

Usage: PerfView mark [Message]

  Add a PerfView 'Mark' event to the event stream with a optional string message

    [Message]                            The string message to attach to the PerfView Mark event.

Usage: PerfView abort

  Insures that any active PerfView sessions are stopped.

Usage: PerfView merge [DataFile]

  Combine separate ETL files into a single ETL file (that can be decoded on another machine).

    [DataFile]                           ETL file containing profile data.

Usage: PerfView listSessions

  Lists active ETW sessions.

Usage: PerfView ListCpuCounters

  Lists the ListCpuCounters CPU counters available on the system (win8+ only).

Usage: PerfView EnableKernelStacks

  On X64 machines if you have problems with broken stacks when the code is executing in the kernel, setting this option
  and rebooting may improve things

Usage: PerfView DisableKernelStacks

  Resets the registry keys set by EnableKernelStack.

Usage: PerfView HeapSnapshot Process [DataFile]

  Take a snapshot of the CLR GC heap of a process.

    Process                              The process ID or Process Name (Exe without extension) of the process take a
                                         heap snapshot.
    [DataFile]                           The name of the file to place the heap snapshot.

    [-SaveETL]                           Save an ETL file along with the GCDump file when dumping the JS Heap.
    [-MaxDumpCountK:250]                 The maximum number of objects (in K or thousands) to place int the .gcDump
                                         file.  Sample sufficiently to hit this metric.
    [-Freeze]                            Freeze the dump while data is taken.

Usage: PerfView ForceGC Process

  Forces a GC on the specified process

    Process                              The process ID or Process Name (Exe without extension) of the process to force
                                         a GC.

Usage: PerfView HeapSnapshotFromProcessDump ProcessDumpFile [DataFile]

  Extract the CLR GC heap from a process dump file specified.

    ProcessDumpFile                      The name of the input process dump file.
    [DataFile]                           The name of the file to place the heap snapshot.

Usage: PerfView GuiRun

  Opens the 'Run' dialog box.

Usage: PerfView GuiCollect

  Opens the 'Collect' dialog box.

Usage: PerfView GuiHeapSnapshot

  Opens the 'TakeHeapSnapshot' dialog box.

Usage: PerfView UserCommand CommandAndArgs ...

  Runs a user defined command.  Type 'PerfView UserCommandHelp' to see the help for all the user commands.  See PerfView
  Extensions in the users guide for more on creating user commands.

    CommandAndArgs ...                   User command to run and any arguments.

Usage: PerfView UserCommandHelp

  Displays help for user commands.  Also see Help->User Command Help in the GUI.

Usage: PerfView CreateExtensionProject [ExtensionName]

  Creates a VS project for creates a perfView extension.

    [ExtensionName.Global]               The name of the extension (no .DLL)

Usage: PerfView FetchSymbolsForProcess [DataFile]

  Fetch all the PDBs files needed for viewing locally.

    [DataFile]                           ETL file containing profile data.
Qualifiers global to all commands:

    [-LogFile:STRING]                    Send messages to this file instead launching the GUI.  Intended for batch
                                         scripts and other automation.
    [-BufferSize:64]                     The size the buffers (in MB) the OS should use to store events waiting to be
                                         written to disk.
    [-Circular:0]                        Do Circular logging with a file size in MB.  Zero means non-circular.
    [-BufferSizeMB:64]                   The size the buffers (in MB) the OS should use to store events waiting to be
                                         written to disk.
    [-CircularMB:0]                      Do Circular logging with a file size in MB.  Zero means non-circular.
    [-MaxCollectSec:0]                   Turn off collection (and kill the program if perfView started it) after this
                                         many seconds.  Zero means no timeout.
    [-StopOnPerfCounter:STRING,...]      This is of the form CATEGORY:COUNTERNAME:INSTANCE OP NUM where
                                         CATEGORY:COUNTERNAME:INSTANCE, identify a performance counter (same as
                                         PerfMon), OP is either < or >, and NUM is a number.  When that condition is
                                         true then collection will stop.  You can specify this qualifier more than once.
                                         See 'Stop Trigger' in the users guide for more.
    [-StopOnGen2GC:STRING]               This will stop on any Gen2 GC from the given process (can be a process ID or a
                                         process Name (exe file name without path or extension) or * (any process)
    [-StopOnEtwEvent:STRING,...]         This is of the form PROVIDER:TASK:OPCODE here only PROVIDER and TASK are
                                         required.  (Opcode=Info, Keywords=ALL, Level=INFORMATIONAL are the defaults.
                                         Provider can be a Guid or Name (for EventSources you can use *Name to represent
                                         the Guid.  Task is the task Name or 'Task(NNN)' where NNN is the task number in
                                         decimal.  You can specify this argument more than once
    [-StopOnException:STRING]            Where the text is a regular expression that will be used to match the full name
                                         of the .NET Exception thrown.The empty string represents any exception.
    [-StopOnEventLogMessage:STRING]      Stop when an event log message that matches the given (ignore case) regular
                                         expression is written to the Windows 'Application' event log.  You can specify
                                         a particular event log with the syntax eventLogName@RegExp.
    [-StopOnRequestOverMsec:0]           Trigger a stop of a collect command if there is any IIS request that is longer
                                         than the given number of MSec.
    [-StopOnGCOverMsec:0]                Trigger a stop of a collect command if there is a .NET Garbage Collection (GC)
                                         is longer than the given number of MSec.
    [-StopOnAppFabricOverMsec:0]         Trigger a stop of a collect command if there is a AppFabric request is longer
                                         than the given number of MSec.
    [-DecayToZeroHours:0]                The trigger value used in StopOnPerfCounter or StopOn*OverMSec will decay to
                                         zero in this interval of time.
    [-DelayAfterTriggerSec:10]           Wait this number of seconds after a trigger before actually stopping the trace.
    [-CollectMultiple:0]                 Collect Multiple instance (used in conjunction with StopTrigger).
    [-MonitorPerfCounter:STRING,...]     This is of the form CATEGORY:COUNTERNAME:INSTANCE@NUM where
                                         CATEGORY:COUNTERNAME:INSTANCE, identify a performance counter (same as
                                         PerfMon), and NUM is a number representing seconds.  The @NUM part is optional
                                         and defaults to 2.  The value of the performance counter is logged to the ETL
                                         file as an event ever NUM seconds
    [-CpuSampleMSec:1]                   The interval (MSec) between CPU samples (.125Msec min).
    [-Merge:BOOLEAN]                     Do a merge after stopping collection.
    [-Zip:BOOLEAN]                       Zip the ETL file (implies /Merge).
    [-NoRundown]                         Don't request CLR Rundown events.
    [-NoNGenRundown]                     Don't do rundown of symbolic information in NGEN images (only needed pre V4.5).
    [-NoClrRundown]                      Don't do rundown .NET (CLR) rundown information )(for symbolic name lookup).
    [-RundownTimeout:120]                Maximum number of seconds to wait for CLR rundown to complete.
    [-MinRundownTime:0]                  Minimum number of seconds to wait for CLR rundown to complete.
    [-KeepAllEvents]                     A debug option to keep all events, even symbolic rundown events.
    [-MaxEventCount:0]                   Limits the total number of events.  Useful for trimming large ETL files.  1M
                                         typically yields 300-400Meg.  of data considered.
    [-SkipMSec:0]                        Skips the first N MSec of the trace.  Useful for trimming large ETL files in
                                         conjunction with the /MaxEventCount qualifier.
    [-CpuCoutners:STRING,...]            A comma separated list of hardware CPU counters specifications NAME:COUNT to
                                         turn on.  See Users guide for details.  See ListCpuCounters for available
                                         sources (Win8 only)
    [-Providers:STRING,...]              Additional providers.  This is comma separated list of
                                         ProviderGuid:Keywords:Level:Stack specs.  This qualifier has the same syntax as
                                         the Additional Providers TextBox in the collection window.  See help on that
                                         for more.
    [-OnlyProviders:STRING,...]          Like the Providers qualifier, but also turns off the default Kernel and CLR
    [-ThreadTime]                        Shortcut for turning on context switch and readyThread events
    [-GCOnly]                            Turns on JUST GC collections an allocation sampling.
    [-GCCollectOnly]                     Turns on GC collections (no allocation sampling).
    [-DumpHeap]                          Capture a heap snapshot on profile stop
    [-ClrEventLevel:Verbose]             The verbosity for CLR events Legal values: Always, Critical, Error, Warning,
                                         Informational, Verbose.
    [-ClrEvents:Default]                 A comma separated list of .NET CLR events to turn on.  See Users guide for
                                         details.  Legal values: None, GC, GCHandle, Binder, Loader, Jit, NGen,
                                         StartEnumeration, StopEnumeration, Security, AppDomainResourceManagement,
                                         JitTracing, Interop, Contention, Exception, Threading,
                                         JittedMethodILToNativeMap, OverrideAndSuppressNGenEvents, SupressNGen, Type,
                                         GCHeapDump, GCSampledObjectAllocationHigh, GCHeapSurvivalAndMovement,
                                         GCHeapCollect, GCHeapAndTypeNames, GCHeapSnapshot,
                                         GCSampledObjectAllocationLow, PerfTrack, Stack, Default, All.
    [-KernelEvents:Default]              A comma separated list of windows OS kernel events to turn on.  See Users guide
                                         for details.  Legal values: None, Process, Thread, ImageLoad, ProcessCounters,
                                         ContextSwitch, DeferedProcedureCalls, Interrupt, SystemCall, DiskIO,
                                         DiskFileIO, DiskIOInit, Dispatcher, MemoryPageFaults, MemoryHardFaults,
                                         VirtualAlloc, VAMap, NetworkTCPIP, Registry, AdvancedLocalProcedureCalls,
                                         SplitIO, Driver, OS, Profile, Default, TheadTime, FileIO, FileIOInit, Verbose,
                                         All, ReferenceSet, PMCProfile.
    [-DotNetAlloc]                       Turns on per-allocation .NET profiling.
    [-DotNetAllocSampled]                Turns on per-allocation .NET profiling, sampling types in a smart way to keep
                                         overhead low.
    [-OSHeapProcess:0]                   Turn on per-allocation profiling of allocation from the OS heap for the process
                                         with the given process ID.
    [-OSHeapExe:STRING]                  Turn on per-allocation profiling of allocation from the OS heap for the process
                                         with the given EXE (only filename WITH extension).
    [-ForceNgenRundown]                  By default on a V4.0 runtime NGEN rundown is suppressed, because NGEN PDB are a
                                         less expensive way of getting symbolic information for NGEN images.  This
                                         option forces NGEN rundown, so NGEN PDBs are not needed.  This can be useful in
                                         some scenarios where NGEN PDB are not working properly.
    [-AcceptEULA]                        Accepts the EULA associated with PerfView.
    [-DataFile:STRING]                   FileName of the profile data to generate.
    [-NoView]                            Normally after collecting data the data is viewed.  This suppresses that.
    [-UnsafePDBMatch]                    Allow the use of PDBs even when the trace does not contain PDB signatures.
    [-ShowUnknownAddresses]              Displays the hexadecimal address rather than ? when the address is unknown.
    [-VS:BOOLEAN]                        Turn on Visual Studio Providers.  It also insures that theses providers are
                                         registered and can be decoded properly.
    [-NoGui]                             Use the Command line version of the command (like on ARM).  Brings up a console
                                         window.  For batch scripts use /LogFile instead.
    [-Process:STRING]                    A process name (exe file name without directory or extension).
    [-SafeMode]                          Turn off parallelism and other risky features.
    [-RestartingToElevelate:STRING]      Internal: indicates that perfView is restarting to get Admin privileges.
    [-SessionName:PerfViewSession]       Define the name for the user mode session, if kernel events are off.
    [-MaxNodeCountK:0]                   The maximum number of objects (in K or thousands) that will even be examined
                                         when dumping the heap.  Avoids memory use at collection time.  This is useful
                                         if heap dumping causes out of memory exceptions.

Thursday, November 14, 2013 #

I constantly try out new and old tools to check out if they can be useful in my daily work. Did you ever wonder where Windows does put the file contents of your just read 200MB file in memory and why reading the same file is almost always faster? The answer to both question is the file system cache. When you look at the performance counters you will find quite a lot cache related counters but when you look at the values they are typically only a few hundred MB but the task manager shows very different values.

In theory on my Windows 7 machine I should get the value from Task Manager as “System Cache Resident Bytes” of the Memory section. The description sounds good:

“System Cache Resident Bytes is the size, in bytes, of the pageable operating system code in the file system cache. This value includes only current physical pages and does not include any virtual memory pages not currently resident. It does equal the System Cache value shown in Task Manager. As a result, this value may be smaller than the actual amount of virtual memory in use by the file system cache. This value is a component of Memory\\System Code Resident Bytes which represents all pageable operating system code that is currently in physical memory. This counter displays the last observed value only; it is not an average.”

The bold statement about equality is indeed a very bold one. On my Windows 7 x32 machine Task Manager reports as Cached 504 MB whereas the performance counter only gives 185 MB. There seems to be quite a bit of confusion what should be counted as cache and what not. This is an old description from Windows XP times which did have in the task manager a “System Cache” value.

So far I have not found a really good explanation what exactly the task manager is showing as Cached value.


But there are better tools around like Resource Monitor (perfmon /res) which has more to offer. It shows tooltips at unexpected places like value descriptions in the Memory Tab:


Now we can explain what Task Manager is actually showing there. Standby and Modifed are the names of Windows kernel structures which are used by the Windows Memory Manager.

The tool RamMap gives you all the details about the current state of the Memory Manager and its priority lists for different types of memory. More details can be found on the web what all these different types of memory mean and how the Memory Manager does organize its stuff.





Total Installed Memory in your computer Total = Active List + Standby List + Modified List + Transition List + Zeroed List + Free List
Cached Memory used by File System Cache Cached = Standby List + Modified List
Available Can be allocated as private memory by other applications Available = Standby List + Free List
Free Unused Memory Free = Zeroed List + Free List

Why should I care about all that stuff? Because if you are reading this you are generally interested how things work. Besides of curiosity you can now answer questions like

  1. Why needs a performance regression test two warmup runs up before the numbers become stable?
  2. What files did my colleague have open but deleted them?

Lets start with the most disturbing question. I can have a look at all files you did have open at your computer if I have direct access with Admin rights. Open RamMap and check out the File Details tab. There are all files listed which are currently completely or partially stored in the file system cache in the one or other way.


With livekd from SysInternals I can also view the contents of the cached files

0: kd> !db 7e3ee000
#7e3ee000 3b 20 66 6f 72 20 31 36-2d 62 69 74 20 61 70 70 ; for 16-bit app
#7e3ee010 20 73 75 70 70 6f 72 74-0d 0a 5b 66 6f 6e 74 73  support..[fonts
#7e3ee020 5d 0d 0a 5b 65 78 74 65-6e 73 69 6f 6e 73 5d 0d ]..[extensions].
#7e3ee030 0a 5b 6d 63 69 20 65 78-74 65 6e 73 69 6f 6e 73 .[mci extensions
#7e3ee040 5d 0d 0a 5b 66 69 6c 65-73 5d 0d 0a 5b 4d 43 49 ]..[files]..[MCI
#7e3ee050 20 45 78 74 65 6e 73 69-6f 6e 73 2e 42 41 4b 5d  Extensions.BAK]
#7e3ee060 0d 0a 33 67 32 3d 4d 50-45 47 56 69 64 65 6f 0d ..3g2=MPEGVideo.
#7e3ee070 0a 33 67 70 3d 4d 50 45-47 56 69 64 65 6f 0d 0a .3gp=MPEGVideo..
0: kd> !db
#7e3ee080 33 67 70 32 3d 4d 50 45-47 56 69 64 65 6f 0d 0a 3gp2=MPEGVideo..
#7e3ee090 33 67 70 70 3d 4d 50 45-47 56 69 64 65 6f 0d 0a 3gpp=MPEGVideo..
#7e3ee0a0 61 61 63 3d 4d 50 45 47-56 69 64 65 6f 0d 0a 61 aac=MPEGVideo..a
#7e3ee0b0 64 74 3d 4d 50 45 47 56-69 64 65 6f 0d 0a 61 64
#7e3ee0c0 74 73 3d 4d 50 45 47 56-69 64 65 6f 0d 0a 6d 32 ts=MPEGVideo..m2
#7e3ee0d0 74 3d 4d 50 45 47 56 69-64 65 6f 0d 0a 6d 32 74 t=MPEGVideo..m2t
#7e3ee0e0 73 3d 4d 50 45 47 56 69-64 65 6f 0d 0a 6d 32 76 s=MPEGVideo..m2v
#7e3ee0f0 3d 4d 50 45 47 56 69 64-65 6f 0d 0a 6d 34 61 3d =MPEGVideo..m4a=

C:\Windows\system32>type c:\Windows\win.ini
; for 16-bit app support
[mci extensions]
[MCI Extensions.BAK]

Yes the cache is really working and I can view its contents which is quite interesting and disturbing.

Now to question one. Why do some people claim that they need two warmup runs before the numbers become stable? A cache is a cache. If you have a test which does use nearly all of your physical memory (>75%) the cache manager of Windows needs to throw out the least used cached files to make room for the memory hungry application.

The first run of your test loads the data from disc and into the cache which does work well until the memory footprint of your application becomes so big that the Cache Manager needs to evict pages from the cached files. During a second run of the test the same things happen until the cache manager has learned which usage pattern you do follow and tries to adjust its internal policies. Since I do not want to learn every detail of the cache manager it is much easier to switch to a machine with much more RAM than needed by the application to run mostly everything from the cache during the second run.

The Windows task manager does lure millions of users into the wrong safety that you have plenty of free RAM and you can consume a few more GB of it without negative effects. Especially Notebook users with 64 bit Windows, 4 GB of RAM and a slow hard disc do experience long pauses during their routine work. A colleague of mine did ask me if I could investigate why his notebook was so slow. After I have seen his memory configuration I could tell him that he needed simply more RAM to make his work much smoother. Slow notebook hard discs do show you very clearly that the cache is there for a reason and when you do turn it off by using all of the memory application data you will experience many hard page faults which result in a sluggish user experience.

But also developers are not immune to such traps. You want to make your compile times faster by exchanging your hard disc with a solid state disc? Sure you can but how about getting more memory first? If you do get a few GB more RAM it can make a huge difference to your compile times if most of your stuff is stored in the file system cache. That does also explain why some tests with a RAM disc did show no improvements of compile times when we did copy all of the data onto a RAM disc. If you have enough RAM you do not need a RAM disc (most of the time)! At my work machine with 16 GB of RAM I do see that the file system cache begins shrinking already when I do hit 12 GB private bytes count.

What can you learn of this? Be nice to your cache and keep at least 1 GB of free memory around to be sure that the cache can work as a cache. Today's hard discs are still 100 000 times slower than memory which is a good reason to keep your cache big.

If you want to check for performance regression tests if your cache is filled you could execute

rammap C:\temp\ram.xml

to get a nice xml file which shows which files are being in the cach now. The cached files are stored as xml nodes of the form

<File Key="3114984784" Path="C:\windows\system32\aelupsvc.dll"/>

which can be easily parsed. Then you can filter e.g. for your application path and the Windows directory where most dlls during startup (especially NGen images for managed code) are loaded. Once you have set a cache baseline you can verify during subsequent tests that the cache still looks similar compared to your baseline run. If not you can skip the test run or try another warmup run to get consistent values for your performance regression tests. This can give you some hints why a performance regression test on that machine did run so slow compared to the others. You do not need to debug but can immediately prove that the test did most likely experience much more hard faults during its test execution because of a different cache state.

Recently I have talked to a MS program manager about performance tests and I was surprised that their test strategy was quite the opposite. They flush the file system cache before doing the test which always profiles the worst (cold startup) scenario which also leads to consistent numbers. He claimed that when you optimize cold startup warm startup will also benefit from it. That may be true to a certain extent but since warm startup is about a factor two faster you will experience concurrency especially during startup at places where none has been before and a much higher CPU utilization. This can shift the places where you should invest more time to optimize quite a lot if you are not careful.