Alois Kraus

blog

  Home  |   Contact  |   Syndication    |   Login
  40 Posts | 8 Stories | 149 Comments | 165 Trackbacks

News



Article Categories

Archives

Post Categories

Image Galleries

Programming

Sunday, October 04, 2009 #

Did you ever wonder why your system at random times hangs? Sometimes it comes back after a few seconds (could simply be paging) but at least once a day I wish I would be able to know why the system is responding so slowly. Before going into kernel land I must confess that I have never written a device driver so my knowledge to kernel mode debugging is quite limited but on the other hand if you did not do this either you will have a much easier time to follow me.

Some hangs seem to be Heisenbugs which disappear when you start looking at them. I have found when I let Process Explorer running on my machine it seems to resolve some issue by its pure presence. It could also be that some malware and Trojan software does not even install when Sysinternals tools are running.

Did you know that you can watch with Process Explorer the call stack of all applications in your system? Simply right click on a process and select Properties and select the Threads tab where you can view the stack for each thread with full function names.

ProcessExplorerStacks

image

 

Wrong Symbols

If on your machine the function names do not appear or they are of the form xxxx.dll +0xdddd where dddd is a rather big number (see below mvfs51.sys where we do not have symbols)  then you are missing the symbols. First of all you need to download Windbg. Why Windbg? To resolve the symbol names you need a good version of dbghelp.dll which is part of Windbg. Most SysInternals Tools have the possibility to configure symbols (Options – Configure Symbols …) and the path to the Windbg version of dbghelp.dll. To make it easier to copy and paste here is the one and only

Symbol Server Path: SRV*C:\Windows\Symbols*http://msdl.microsoft.com/download/symbols

The first part is the cache directory to which the symbols will be stored for later retrieval. Armed with this knowledge you should be able to find out the root cause of quite a lot of hangs by simply examining the call stacks.

Kernel Debugging?

On Windows XP you get the full stack including the kernel by simply looking at the process call stack in Process Explorer. With Windows Vista and above you need to run Process Explorer with elevated privileges (File – Run As Administrator in Process Explorer) to get also the kernel stack. The call stack above is typical for user mode only. If your thread stack contains stacks of the form

ntkrnlpa.exe!KiFastCallEntry+0x12a

then you are seeing the full stack including the kernel. When you find in your stack of interest .sys files you just have found a device driver. That is actually very useful to find out why something locks up.

Managed applications will find the native stack view less useful since dbghelp.dll is not able to show the managed call stack. There was one version of Windbg which is able view the mixed mode stack with the kv command in Windbg. But it was withdrawn from MS a few days later. I tried it of course with Process Explorer but it retrieves the call stack in a different way so I was not able see mixed mode stack there (yet).

The deeper reason why this feature has been hold back (at least what this is what the rumors say) has to do with legal reasons. The debugger team used deep CLR know how to walk the managed stack. Because other units within MS are not allowed to use internals of other products they would have to make them public. I am not interested in how illegal this might be but the MS lawyers are very well paid and should be able to sort this out. Seamless call stack tracking with Process Explorer and related tools would be one of my number one feature requests.

A surprisingly simple way to resolve hangs is to check in Google the name of the device drivers in your hang call stack and check for updated device drivers. In my department for example I did see quite a lot of hangs with the following call stack:

 

 

ntkrnlpa.exe!KiSwapContext+0x2f
ntkrnlpa.exe!KiSwapThread+0x8a
ntkrnlpa.exe!KeWaitForSingleObject+0x1c2
TmXPFlt.sys+0xc90d  // Trend Micro Virus Scanner
TmXPFlt.sys+0x306e
ntkrnlpa.exe!ObpCaptureObjectCreateInformation+0x19c
ntkrnlpa.exe!IopfCallDriver+0x31
ntkrnlpa.exe!IopParseDevice+0xa12
ntkrnlpa.exe!ObpLookupObjectName+0x53c
ntkrnlpa.exe!ObOpenObjectByName+0xea
ntkrnlpa.exe!IopCreateFile+0x407
ntkrnlpa.exe!IoCreateFile+0x8e
ntkrnlpa.exe!NtOpenFile+0x27
ntkrnlpa.exe!KiFastCallEntry+0xfc
ntkrnlpa.exe!ZwOpenFile+0x11

mvfs51.sys+0x2df70  // Rational ClearCase Source Control Driver –> Google mvfs51.sys
mvfs51.sys+0x2f850

TmXPFlt.sys+0x1039 // Trend Micro Virus Scanner –> Google TmXPFLT.sys
mvfs51.sys+0x309e6
mvfs51.sys+0x12197


ntdll.dll!NtQueryAttributesFile+0xc
kernel32.dll!GetFileAttributesW+0x79  // User mode call to get the file attributes
csproj.dll!LUtilFileExists+0xe
csproj.dll!CVsProjHostProcInstance::PrepareHostProcExecutable+0x39
csproj.dll!CVsProjHostProcInstance::StartHostingProcessHelper+0xa0
csproj.dll!CVsProjHostProcInstance::StartHostingProcess+0x72

What can we learn from this one? The faulting process was Visual Studio which was about to start its hosting process. It first checks if the executable exists by reading the file attributes from the executable. Since the file is located on a drive with source control system the mvfs51.sys driver from ClearCase does some work. Then the Trend Micro virus scanner hooks in and causes other ClearCase driver calls which go back into the kernel and end up in the virus scanner again which seem to cause the deadlock. In the end the virus scanner did win the hook fight and locked up the process from which it will never recover.

Now you have got a hanging process that cannot be killed by any means. If you try to kill it you will end up with a process with one thread left that is still stuck in the device driver call. If you ever encounter an unkillable process which is still alive after you try to terminate them via the task manager  it is most likely stuck in a device driver call.

High CPU Spikes / Hanging Process

Ok that was the easy part. Now we are getting nearer to Windbg. If you have an application which behaves in strange ways (e.g. has high CPU spikes at some times) I have another SysInternals gem: ProcDump can take memory snapshots of an arbitrary application. It is especially useful if you want to know in which state an application was when it was hung or did eat up all CPU time.

ProcDump v1.1 - Writes process dump files
Copyright (C) 2009 Mark Russinovich
Sysinternals - www.sysinternals.com

Monitors a process and writes a dump file when the process exceeds the
specified CPU usage.

usage: procdump [-64] [-c CPU usage [-u] [-s seconds] [-n exceeds]] [-h] [-e] [-ma] [-r] [-o] [[<process name or PID> [dump file]] | [-x <image file> <dump file> [arguments]]


   -c      CPU threshold at which to create a dump of the process.
   -e      Write a dump when the process encounters an unhandled exception.

   -h      Write dump if process has a hung window.
….

Example: Write up to 3 dumps of a process named 'consume' when it exceeds
         20% CPU usage for three seconds to the directory
         c:\dump\consume with the name consume.dmp:
            C:\>procdump -c 20 -n 3 -o consume c:\dump\consume
Example: Write a dump for a process named 'hang.exe' when one of it's
         windows is unresponsive for more than 5 seconds:
            C:\>prodcump -h hang.exe hungwindow.dmp

The generated .dmp files can be analyzed with Windbg quite easily if you have matching symbols. This is pure user mode debugging but it is easier to start first in user mode and dig only deeper if one needs to.

 

Kernel Debugging / Hanging System

 

When your system has frozen you can not start any new processes so starting a debugger is of little use. Luckily there is a nice trick to force the generation of a kernel dump by pressing a magic key combination: Right Ctrl + Scroll Lock + Scroll Lock will generate a nice looking real blue screen. See instructions below how to enable it. Technically speaking it is a user initiated kernel dump. Please read the phrase again to notice that only the RIGHT Ctrl key in combination with double pressing the Scroll Lock will do the trick.

Before you can generate the blue screen (= kernel dump) you need to set the kernel dump mode to Complete Memory Dump. You can find this menu if you press the Windows Key + Pause and then look in the Advanced System Settings – Advanced – Startup and Recovery

image

 

 

To enable the magic key combination you need to edit some registry settings which are explained deeper on MSDN and a much more elaborate page dedicated to dump file generation and common pitfalls on Windows Server 2008 (especially on computers with much installed memory).

PS/2 Keyboard

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\i8042prt\Parameters

DWORD CrashOnCtrlScroll 1

USB Keyboard

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\kbdhid\Parameters

DWORD CrashOnCtrlScroll 1

 

Ok now we can successfully generate a memory dump of the kernel and examine it. It is actually quite simple to pinpoint common problems like crashing/hanging drivers with a few commands without the need to understand fully how the kernel works. After the reboot you can open the generated dump file (normally located at C:\Windows\Memory.dmp) with Windbg. Then you need to setup the symbol path (see wrong symbols at the beginning of the article) and now you can execute the !analyze –v command to find out the root cause why the blue screen did occur.

kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************
MANUALLY_INITIATED_CRASH (e2)
The user manually initiated this crash dump.
Arguments:
Arg1: 00000000
Arg2: 00000000
Arg3: 00000000
Arg4: 00000000
Debugging Details:
------------------
BUGCHECK_STR:  MANUALLY_INITIATED_CRASH
DEFAULT_BUCKET_ID:  DRIVER_FAULT
PROCESS_NAME:  Idle
LAST_CONTROL_TRANSFER:  from f754e7fa to 804f8925
STACK_TEXT:  
80548d38 f754e7fa 000000e2 00000000 00000000 nt!KeBugCheckEx+0x1b
80548d54 f754e032 00c0f0d8 0190e0c6 00000000 i8042prt!I8xProcessCrashDump+0x237
80548d9c 8054071d 85904b20 85c0f020 00010009 i8042prt!I8042KeyboardInterruptService+0x21c
80548d9c f758dc46 85904b20 85c0f020 00010009 nt!KiInterruptDispatch+0x3d
80548e50 80540cc0 00000000 0000000e 00000000 processr!AcpiC1Idle+0x12
80548e54 00000000 0000000e 00000000 00000000 nt!KiIdleLoop+0x10

 

In our case the keyboard driver did crash. A closer look reveals that the crash provoked by the user. Lets put this dump aside and have at first a look at a “real” blue screen. A blue screen is actually a well defined exit point which can be triggered by drivers intentionally when it is no longer safe to continue. The function is KeBugCheck which causes the blue screen and dump generation when configured. This function can only be called by kernel drivers. No you can´t blue screen Windows from a user mode application. I have not tried to send to Windows the magic Right Ctrl + Scroll Lock + Scroll Lock combination from a user mode application but I do not think that this will work since the keyboard driver won´t get these events.

Lets have a look at a real crash caused by a driver on a 64 bit machine and analyze it.

7: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

Unknown bugcheck code (0)
Unknown bugcheck description
Arguments:
Arg1: 0000000000000000
Arg2: 0000000000000000
Arg3: 0000000000000000
Arg4: 0000000000000000

Debugging Details:
------------------

PROCESS_NAME:  xxxxx

FAULTING_IP:
nt!KeBugCheck+0
fffff800`02261620 4883ec28        sub     rsp,28h

EXCEPTION_RECORD:  ffffffffffffffff -- (.exr 0xffffffffffffffff)
ExceptionAddress: fffff80002261620 (nt!KeBugCheck)
   ExceptionCode: 80000003 (Break instruction exception)
  ExceptionFlags: 00000001
NumberParameters: 0

ERROR_CODE: (NTSTATUS) 0x80000003 - {EXCEPTION}  Breakpoint  A breakpoint has been reached.

EXCEPTION_CODE: (HRESULT) 0x80000003 (2147483651) - One or more arguments are invalid

DEFAULT_BUCKET_ID:  VISTA_DRIVER_FAULT

BUGCHECK_STR:  0x0

CURRENT_IRQL:  0

MANAGED_STACK: !dumpstack -EE
OS Thread Id: 0x0 (7)
Child-SP         RetAddr          Call Site

LAST_CONTROL_TRANSFER:  from fffffa6009a2536b to fffff80002261620

STACK_TEXT: 
fffffa60`0b5e0c68 fffffa60`09a2536b : fffffa60`09a59824 00000000`00000008 00000000`00000001 fffff800`022a27d0 : nt!KeBugCheck
fffffa60`0b5e0c70 fffffa60`09a28da2 : fffffa80`231ffaf0 fffff880`10a9f634 fffffa60`00000007 fffffa80`21ad5b40 : mvfs60x64+0x1936b
fffffa60`0b5e0dc0 fffffa60`09a2e5af : fffffa60`09a65ba8 fffff880`00000017 fffffa60`0b5e0eec fffffa80`4ab8733a : mvfs60x64+0x1cda2
fffffa60`0b5e0e70 fffffa60`09a3edc0 : fffffa80`0f1fb260 fffffa80`2038ef40 fffffa80`00000017 fffffa60`09a296e0 : mvfs60x64+0x225af
fffffa60`0b5e0f40 fffffa60`09a42e34 : fffffa80`00000001 fffffa80`0f1fb260 fffffa80`2038ef40 fffffa60`09a55fc4 : mvfs60x64+0x32dc0
fffffa60`0b5e1000 fffffa60`09a48ba0 : fffffa80`215ff7d0 fffffa80`21ad5b40 fffffa60`0b5e10d0 00000000`5346564d : mvfs60x64+0x36e34
fffffa60`0b5e1090 fffffa60`09a4b52a : fffffa80`215ff7d0 fffffa80`2280cad0 fffffa60`0b5e12e0 fffffa60`0b5e14a0 : mvfs60x64+0x3cba0
fffffa60`0b5e1150 fffffa60`09a4e890 : fffffa80`215ff7d0 fffffa80`2280cad0 fffffa60`0b5e12e0 fffffa60`0b5e14a0 : mvfs60x64+0x3f52a
fffffa60`0b5e11a0 fffffa60`09a2feb3 : fffffa80`215ff7d0 fffffa80`2280cad0 fffffa60`0b5e12e0 fffffa60`0b5e14a0 : mvfs60x64+0x42890
fffffa60`0b5e12a0 fffffa60`09a4cc00 : 00000000`00000000 fffffa60`0b5e14a0 fffffa60`0b5e1400 fffffa60`0b5e13e8 : mvfs60x64+0x23eb3
fffffa60`0b5e13a0 fffffa60`09a4ee4f : fffffa80`1367e7c0 fffffa80`24d75710 fffffa80`24d75710 fffff800`024e58f4 : mvfs60x64+0x40c00
fffffa60`0b5e1550 fffffa60`09a25fc0 : fffffa80`1367e7c0 fffffa80`24d75710 fffffa80`24d759d8 fffffa80`13683010 : mvfs60x64+0x42e4f
fffffa60`0b5e1590 fffffa60`00c08e17 : fffffa80`1367e7c0 fffffa80`24d75710 fffffa80`24d75a20 fffffa80`13895af0 : mvfs60x64+0x19fc0

fffffa60`0b5e15e0 fffffa60`00c2526c : fffffa80`13895af0 fffffa80`13683010 fffffa80`24d75700 fffffa60`0b5e16a0 : fltmgr!FltpLegacyProcessingAfterPreCallbacksCompleted+0x227
fffffa60`0b5e1650 fffff800`024e81f3 : 00000000`00000005 fffffa80`1209f010 00000000`00000040 00000000`00000000 : fltmgr!FltpCreate+0x25d
fffffa60`0b5e1700 fffff800`024e1ec9 : fffffa80`1367e7c0 00000000`00000000 fffffa80`2529a010 00000000`00000001 : nt!IopParseDevice+0x5e3
fffffa60`0b5e18a0 fffff800`024e5db4 : 00000000`00000000 fffffa80`252f7701 fffffa80`00000040 00000000`00000000 : nt!ObpLookupObjectName+0x5eb
fffffa60`0b5e19b0 fffff800`024f2360 : 00000000`80100080 00000000`0047ac08 fffffa60`09a42f01 00000000`00000000 : nt!ObOpenObjectByName+0x2f4
fffffa60`0b5e1a80 fffff800`024f2e98 : 00000000`0047ab98 00000000`80100080 fffffa80`00000000 00000000`0047abb8 : nt!IopCreateFile+0x290
fffffa60`0b5e1b20 fffff800`022610f3 : 00000000`000001e4 00000000`00000000 00000000`00000000 00000000`00000000 : nt!NtCreateFile+0x78
fffffa60`0b5e1bb0 00000000`77515fca : 00000000`773ccb6c 00000000`02818b90 00000000`00000002 00000000`009e0000 : nt!KiSystemServiceCopyEnd+0x13
00000000`0047ab28 00000000`773ccb6c : 00000000`02818b90 00000000`00000002 00000000`009e0000 00000000`00000002 : ntdll!ZwCreateFile+0xa
00000000`0047ab30 00000005`16f47c4e : 00000000`02818b90 00000000`80000000 00000000`00000005 00000005`16f46f93 : KERNEL32!CreateFileW+0x26c
00000000`0047ac80 00000005`16f49f76 : 00000000`02819a80 00000000`00000000 00000000`00000000 00000000`0047ada0 : diasymreader!IStreamCRTFile::Create+0xba
00000000`0047acf0 00000005`16f4a10c : 00000000`00000000 00000000`02819a80 00000000`00000000 00000005`16f8158b : diasymreader!MSF_HB::internalOpen+0x36
00000000`0047ad30 00000005`16f36782 : 00000000`02818b00 00000000`0047be00 00000000`00000000 00000000`00000400 : diasymreader!MSF::Open+0x5c
00000000`0047ad70 00000005`16f36ef0 : 00000000`02818b90 00000005`16f033e0 00000000`00000000 00000000`00000ef0 : diasymreader!PDB1::OpenEx2W+0xda
00000000`0047adf0 00000005`16f3753b : 00000000`0000001f 00000000`00000000 00000000`02818b90 00000000`0000000c : diasymreader!PDB1::OpenValidate4+0x7c
00000000`0047ae90 00000005`16f4b43f : 00000000`02818db0 00000000`00000ee4 00000000`02817cd0 00000000`00000ee4 : diasymreader!PDB::OpenValidate4+0x47
00000000`0047aef0 00000005`16f4c2db : 00000000`02818b90 00000000`0281854c 00000000`02817c00 00000000`02818bac : diasymreader!LOCATOR::FOpenValidate4+0x73
00000000`0047af50 00000005`16f4c62e : 00000000`0047be00 00000000`000002fb 00000000`02817cd0 00000000`00000094 : diasymreader!LOCATOR::FLocatePdbPathHelper+0x133
00000000`0047afb0 00000005`16f4c961 : 00000000`0047be00 00000000`00000003 00000000`0047be00 00000000`028139cc : diasymreader!LOCATOR::FLocatePdbPath+0x11a
00000000`0047ba40 00000005`16f37ae3 : 00000000`0047cc40 00000005`16f80e44 00000000`0047cc40 00000005`16f80e44 : diasymreader!LOCATOR::FLocatePdb+0x1b5
00000000`0047bde0 00000005`16f3417a : 00000000`028139a0 00000000`009e92d0 00000000`0047cc40 00000000`00000000 : diasymreader!PDBCommon::OpenValidate5+0x9f
00000000`0047cbb0 00000005`16f5fef0 : 00000000`028139a0 00000000`03084b14 00000000`03084b14 00000000`00000000 : diasymreader!PDB::OpenValidate5+0x36
00000000`0047cc00 00000005`16f2b747 : 00000000`00000000 00000000`00000000 00000000`02808010 00000000`009e92d0 : diasymreader!CDiaDataSource::loadDataForExe+0x90
00000000`0047cd30 00000005`16f1ce3b : 00000000`00000000 00000000`11401f6e 00000000`03074348 00000000`028139a0 : diasymreader!CDiaWrapper::Create+0xbb
00000000`0047cd90 00000005`16f1a704 : 00000000`00000079 00000000`0000004a 00000000`00800000 00000000`106ee840 : diasymreader!SymReader::Initialize+0xa7
00000000`0047cdf0 00000000`106adbfc : 00000000`00000014 00000000`0049c0e0 00000000`00000000 00000000`00000000 : diasymreader!SymBinder::GetReaderForFile+0x170
00000000`0047ce70 00000000`00000014 : 00000000`0049c0e0 00000000`00000000 00000000`00000000 00000000`036c1fc0 : CPI!CPI_GetCallbacks+0x1a52c
00000000`0047ce78 00000000`0049c0e0 : 00000000`00000000 00000000`00000000 00000000`036c1fc0 00000000`00000800 : 0x14
00000000`0047ce80 00000000`00000000 : 00000000`00000000 00000000`036c1fc0 00000000`00000800 00000000`03074348 : 0x49c0e0

STACK_COMMAND:  kb

FOLLOWUP_IP:
mvfs60x64+1936b
fffffa60`09a2536b cc              int     3

SYMBOL_STACK_INDEX:  1

SYMBOL_NAME:  mvfs60x64+1936b

FOLLOWUP_NAME:  MachineOwner

MODULE_NAME: mvfs60x64

IMAGE_NAME:  mvfs60x64.sys

DEBUG_FLR_IMAGE_TIMESTAMP:  4a37fa38

FAILURE_BUCKET_ID:  X64_0x0_mvfs60x64+1936b

BUCKET_ID:  X64_0x0_mvfs60x64+1936b

Followup: MachineOwner
---------

From this real live crash we get the faulting 64 bit ClearCase driver on a silver tablet. We have the call stack, managed call stack if available, failing module name and much more presented with one command. If you want to find out why your machine blue screens from time to time the information presented here is sufficient to find the buggy driver and either uninstall the damn thing (lets hope it was not important anyway) or look at the driver vendors homepage to get an updated version. If you really care you can file a bug report and send them your dump to analyze further.

You can convert a full kernel dump into a mini dump by using the .dump outputFileName.dmp command inside Windbg.

 

Now lets see how we can debug a real hang scenario. And I won´t get sidetracked by other interesting details.

When you get a user initiated memory dump you need to find out which processes were running and examine the call stack of the interesting ones. The Windbg command !process 0 0 will give you a complete list of all processes running.

7: kd>!process 0 0

PROCESS fffffa801347a9c0
    SessionId: 1  Cid: 3ac8    Peb: 7fffffde000  ParentCid: 03b4
    DirBase: 19486c000  ObjectTable: fffff8800b78b430  HandleCount: 158.
    Image: mobsync.exe

You can select a specific process by giving the process handle to the process command. This will give you a wealth of information about its current state and all call stacks inside it. That should help to find out where the system was hanging.

7: kd> !process fffffa801347a9c0
PROCESS fffffa801347a9c0
    SessionId: 1  Cid: 3ac8    Peb: 7fffffde000  ParentCid: 03b4
    DirBase: 19486c000  ObjectTable: fffff8800b78b430  HandleCount: 158.
    Image: mobsync.exe
    VadRoot fffffa800f0b1650 Vads 80 Clone 0 Private 1129. Modified 2. Locked 0.
    DeviceMap fffff8800cb88ca0
    Token                             fffff88014d16060
    ElapsedTime                       00:00:57.613
    UserTime                          00:00:00.000
    KernelTime                        00:00:00.000
    QuotaPoolUsage[PagedPool]         155336
    QuotaPoolUsage[NonPagedPool]      7584
    Working Set Sizes (now,min,max)  (2435, 50, 345) (9740KB, 200KB, 1380KB)
    PeakWorkingSetSize                2449
    VirtualSize                       79 Mb
    PeakVirtualSize                   80 Mb
    PageFaultCount                    2545
    MemoryPriority                    BACKGROUND
    BasePriority                      8
    CommitCharge                      1343

        THREAD fffffa8016695bb0  Cid 3ac8.3c04  Teb: 000007fffffdc000 Win32Thread: fffff900c2c08450 WAIT: (WrUserRequest) UserMode Non-Alertable
            fffffa801253e4b0  SynchronizationEvent
        Not impersonating
        DeviceMap                 fffff8800cb88ca0
        Owning Process            fffffa801347a9c0       Image:         mobsync.exe
        Attached Process          N/A            Image:         N/A
        Wait Start TickCount      3233090        Ticks: 3690 (0:00:00:57.564)
        Context Switch Count      81                 LargeStack
        UserTime                  00:00:00.000
        KernelTime                00:00:00.015
        Win32 Start Address 0x00000000ff685d38
        Stack Init fffffa601008bdb0 Current fffffa601008b720
        Base fffffa601008c000 Limit fffffa6010083000 Call 0
        Priority 10 BasePriority 8 PriorityDecrement 0 IoPriority 2 PagePriority 5
        Child-SP          RetAddr           Call Site
        fffffa60`1008b760 fffff800`0226728a nt!KiSwapContext+0x7f
        fffffa60`1008b8a0 fffff800`0226868a nt!KiSwapThread+0x2fa
        fffffa60`1008b910 fffff960`001bb817 nt!KeWaitForSingleObject+0x2da
        fffffa60`1008b9a0 fffff960`001bb8ae win32k!xxxRealSleepThread+0x25f
        fffffa60`1008ba40 fffff960`001bb1fa win32k!xxxSleepThread+0x56
        fffffa60`1008ba70 fffff960`001bb4a9 win32k!xxxRealInternalGetMessage+0x72e
        fffffa60`1008bb50 fffff960`001bca15 win32k!xxxInternalGetMessage+0x35
        fffffa60`1008bb90 fffff800`022610f3 win32k!NtUserGetMessage+0x79
        fffffa60`1008bc20 00000000`772ed09a nt!KiSystemServiceCopyEnd+0x13 (TrapFrame @ fffffa60`1008bc20)
        00000000`0029f7a8 00000000`00000000 USER32!ZwUserGetMessage+0xa

A hang can be caused by a shared lock where different processes try to acquire it. This common deadlock scenario can be check with the !locks command to examine locks to which more than one process wants to get access:

7: kd> !locks
**** DUMP OF ALL RESOURCE OBJECTS ****
KD: Scanning for held locks..

Resource @ 0xfffffa800f952218    Exclusively owned
     Threads: fffffa800e585040-01<*>
KD: Scanning for held locks...

Another very useful command is !stacks to show from all processes the last stack frame where the are standing:

7: kd> !stacks


    [fffffa8012d8b7d0 csrss.exe]
2e0.0003e0  fffffa801307b7a0 ffce9c7b Blocked    cdd!PresentWorkerThread+0x479
2e0.0003ec  fffffa80130903d0 ffce9ccd Blocked    nt!AlpcpReceiveMessagePort+0x287
2e0.0003fc  fffffa8012ce8ad0 ffce9c8c Blocked    nt!AlpcpReceiveMessagePort+0x287
2e0.00046c  fffffa80131619c0 ffce9c8c Blocked    nt!AlpcpReceiveMessagePort+0x287
2e0.000f78  fffffa800f052060 ffce9c8c Blocked    nt!AlpcpReceiveMessagePort+0x287
2e0.00133c  fffffa800f1d92e0 ffce9c8c Blocked    nt!AlpcpReceiveMessagePort+0x287
2e0.0011a0  fffffa800f1d6700 ffce9c8c Blocked    nt!AlpcpReceiveMessagePort+0x287
2e0.00101c  fffffa800f141800 ffce9c7c Blocked    nt!AlpcpReceiveMessagePort+0x287
2e0.000dbc  fffffa800f1ffbb0 ffce9c8e Blocked    nt!AlpcpReceiveMessagePort+0x287
2e0.00038c  fffffa801fd14060 ffce9c7c Blocked    nt!AlpcpReceiveMessagePort+0x287
2e0.001408  fffffa80207b0360 ffce9ccd Blocked    nt!AlpcpReceiveMessagePort+0x287
2e0.00140c  fffffa8012efa060 ffce9c8c Blocked    nt!AlpcpReceiveMessagePort+0x287
2e0.001678  fffffa800f1017c0 ffce9c60 Blocked    win32k!xxxRealSleepThread+0x25f

A more thorough list has been created by Dmitry Vostokov at his famous Crash Dump Analysis web site which gives a good overview about the most used Windbg commands. To dig deeper you will need to buy the Windows Internals book by Mark Russinovich and understand how the Windows kernel and drivers do work and visit the NT Debugging blog where Microsoft escalation engineers show some advanced kernel debugging techniques.

If you have read until here you (should) have lost fear of the dreaded blue screen. Its not the end but the beginning of an interesting debugging session. It is a pity that so few people are able to analyze kernel dumps even at the most basic level. In many cases it is possible to find out which device driver is the guilty one. You then have the option to remove the faulting driver entirely or try to get an updated one. At least you know who is to blame and most of the time it is not the OS.

One last note: If you transfer the dump to another machine you not only need the dump file but also the exact same executable binaries on the analyzing machine to load the correct pdbs. You need to set them up under File – Image File Path in Windbg to successfully analyze dumps.


Monday, October 05, 2009 #

Just a quick note to myself to remember the registry key

HKLM\Software\Policies\Microsoft\Windows NT\CurrentVersion\Winlogon

        DWORD SyncForegroundPolicy=0x1

Setting this key will ensure that Windows does not try to log on until the network subsystem has been fully initialized. Domain users with a roaming profile revert back to a local profile because the log on is simply too fast. When no user is sitting in front of the PC to enter his user name and password Windows does not always have enough time to initialize the networking subsystem. During logon this can cause failure to map network drives or to load the roaming profile. The latter one can screw your roaming profile if you e.g. log the same user on different PCs on. When one user does a log off it will store the profile back in the domain potentially screwing up the newer settings which just recently have been configured for this user.


Monday, September 14, 2009 #

If you have ever wondered why your .net application has a slow startup performance you normally start watching with Process Explorer at your processes. But this will not always give you the full picture since it can loose some csc.exe calls because if you have a fast machine it will not get them. A more reliable tool is Procmon where you can look for process starts:

Procmon_SlowApplication

Where do these compiler invocations come from? It turns out that XmlSerializer is to blame which generates a C# file in your TEMP folder, compiles it with the C# compiler csc.exe, loads this assembly into the application and removes all traces from your disc after it has done its work. This operation can cost you 0.3 - 2s of your application startup time so it is most of the time worth investigating. The assembly is needed to generate really fast de/serialization code for the types it must de/serialize. There are ways around this by switching to another serializer like DataContractSerializer from .NET Framework 3.0 or to use sgen to generate this seralization assembly only once and for all so it can be loaded later without the code generation overhead. DataContracts do not suffer from the assembly generation overhead because it does not generate any code at all at the cost of a little slower de/serialization.

Sometimes it can be quite hard to find out why some applications cause csc.exe calls. The first thing to use is the fusion log viewer (fuslogvw.exe) which is part of the .NET Framework SDK and is in your path if you start a Visual Studio 20xx command prompt.

image

Here we see assembly load failures for assemblies which names end with .XmlSerializers which is a good way to identify how many compiler calls where made and from which assemblies the de/serialized types originate from. Most of the time this information is enough but there are times when you want to identify the exact types that caused these calls. E.g. you do not want to generate serialization code for all public types of a rather huge assembly which would bloat the code size of the generated serialization assembly or you wan to embed the serialization code into your own assembly.

Luckily XMLSerializer has some debugging capabilities which can be turned on in your application App.config or machine wide if you really want to be sure that you do not miss anything.

The machine config file is located in your windows directory at

%WINDIR%\Microsoft.NET\Framework\v2.0.50727\CONFIG\machine.config

Here you only need to add

<system.diagnostics>

<switches>

  <add name="XmlSerialization.Compilation" value="true"/>

</switches>

</system.diagnostics>

This will prevent XmlSerializer to clean the %temp% directory up so you have a chance to look at the generated code. Now go to your %TEMP% directory and search for *.cs files.

The the command

findstr /C:"t == typeof" *.cs

will give you the requested types

kzbrxewx.0.cs:                if (t == typeof(global::SlowApplication.Program)) {

in a very easy way. If you are afraid of an additional assembly to load you can also generate with sgen only for the needed type a code file which you can add to your project to stay lightweight. You only need to decorate your serialized type with

[XmlSerializerAssembly("YouMainModule, Version=..., Culture=...,, PublicKeyToken=....")]

That allows your code to stay slim at the expense that you need to regenerate this thing from time to time to stay in sync with the implementation of your serialized types.

 

If you ever wondered if it is possible to use generics within an XMLSerializer serialized class: No you can´t because the XML array type does not know from which collection class it originated from.  If you try to create your custom collection like this

public class FilterItemList : List<FilterItem>

        {

        }


it will not work. But there is a way around it. You only need to name your collection class of the form ArrayOf<ClassName> : List<ClassName> does work.

public class ArrayOfFilterItem : List<FilterItem>

      {

      }

Now we are back in the game and can use fields, properties with our generic type ArrayOfFilterItem  and XmlSerializer will not complain.


Sunday, June 21, 2009 #

The Enterprise Library team is trying to improve the performance of the Logging Application Block (LAB) even further. Since version one we did come a long way. It is interesting that even the guidance projects of Microsoft do have change their mind. Lets recap how the LAB has evolved since it initial release in January 2005.

Released Version Logging Application Block (LAB) Characteristics ca. Logs/s
January 2005    v1.0 Log File is opened/closed after every write. Performance was awful but accepted. 400 Logs/s
January 2006  v2.0 .NET 2.0 Support, Easier configuration, Log File is flushed after every write. Much better perf. 3000 Logs/s
April 2007 3.0 New Application Blocks but no significant changes in the LAB. 3000 Logs/s
May 2008  4.0 Log files which are rolled over are finally supported out of the box. The TextFormatter was reworked to make it over 13 times faster as I suggested. 10 000 Logs/s

 

Since customers are still complaining about the performance of the Logging Application Block a performance feature was added to the product backlog:

...

LAB02: Async logging (text formatting done asynchronously) to cut down on load on primary thread (M)

...

Looks ok. We have many CPUs by now why not put it on another thread? As Grigory has pointed out there are several problems with that approach. If you switch to another thread you have to fetch the call stack of the calling thread first before doing the thread transition. But since the Stacktrace generation is a fairly expensive operation there is little if nothing at all gained to do it on an another thread. Thread affinity strikes back and destroys your nice lazy init design pattern to initialize only the properties of a log messages which are really configured to be formatted by the message template. So what is the problem here? First I think it is the customers expecting too much from a general purpose application block. The title of the post says it all: Logging is not Tracing!

Logging

When you design a big application you need to have good and flexible error reporting perhaps across machines to collect log data in a centralized way. That is a perfect use case for the Logging Application Block where you configure some remote trace listener and send the log data to a central log server which stores its log messages in a database, log file or whatever. If you use out of process communication you are limited by the network performance already which is in the best case several thousand logs/s.

 

Tracing

Besides Error Reporting you also need to trace your program flow to find out where the performance bottlenecks are and even more important when an error occurs you have a chance to find out how you did get there. In an ideal world every function would have some tracing enabled with the function duration, passed parameters and how far you did get in your function. This is what performance profilers actually do by using the Profiling API of .NET. If you have ever used such a tool you will notice that the performance of you application will be degraded by a factor 5-100. Doing automatic instrumentation is therefore not an option and you still need to write some code or use an IL weaver to add some tracing code to your app.

I think there has never been made a real distinction between logging and tracing.

  • Logging is error reporting.
  • Tracing is following your program flow and data.

Logging is always enabled. Tracing on the other hand is normally disabled since it would generate huge amounts of data in very short time. By differentiating these two use cases you can optimize each of them in a different manner. Logging needs to be flexible. Tracing has three main goals: Correctness, performance and performance. This can heavily influence your API design and other design choices to support each scenario as well as possible.

Conclusions

And now back to the original question: Why do customers still complain about the performance of the Logging Application Block (LAB)? I am very certain that these customers use the LAB for tracing and complain about its bad performance. When you try to log enter/leave of some very frequently called functions (millions calls/s) you end up with huge log files and a crawling application. What do you do if your application is slow? Use a profiler and check where most of the time is spent and find that most of the time is spent in the LAB string formatting and disk IO functions. There is little you can do about the disk IO so you leave it and try to optimize the LAB further. That will only push the limit a little and be all in vain until the next customer complains that if he Logs enter/leave of his highly optimized GetHashCode() function his application becomes very slow. How do we get out of this? There is no easy way out of this. It could help to make such performance boundaries more explicit like the WCF team has done it with DataContracts. With explicit I mean one API for logging another one for Tracing. That could help to minimize the pain of users with wrong expectations of the performance of the LAB.

That said I see still quite some potential for the LAB to support tracing better than today:

  • Create some TraceFormatter with some quite fixed message format which includes time, pid, tid, category and message for example. No Call Stack since it is too expensive.
  • In the formatters cache the the last LogEntry object and the formatted message. If you trace to two different destinations with the same formatter one could reuse the already formatted message if the object reference is the same as last time.
  • Format the time in an efficient way like it was shown here. A factor 15 compared to DateTime.ToString should help to make TraceFormatter a really fast one.
  • Do not use multithreading. It will make it slower, harder to understand and less reliable. I expect to have 100% reliable logging and tracing. Always. The following simple code should work with no exceptions:

void Main(string [] args)

{

      Log.Info("Hello world"); // could be lost if done asynchronously

}

If you introduce a time latency between the log call and its processing you can loose the most important log message just before your application has terminated.

Perhaps we will see some nice Tracing Application Block in the future if enough customers complain. Just tell Grigory to consider this as well for future version of the Enterprise Library.


Wednesday, May 27, 2009 #

Managed Performance counters are tricky (or broken, it depends how you look at them) to read when you have more than one process with the same name running managed code. Each performance counter gets as instance name a unique identifier

  1. ManagedApp
  2. ManagedApp#1
  3. ManagedApp#2
  4. ...

If you want to know for a specific process identified by its process id thing become tricky. There is a counter in the .NET Memory category called Process ID which enables us to find out the correct counter instance name without guessing.

To find the correct instance name here is a little helper class which does it in a semi performant way:

 

using System;

using System.Diagnostics;

using System.IO;

using System.Threading;

using System.Globalization;

using System.Runtime.Remoting.Messaging;

 

namespace PerformanceCounterRead

{

    public class PerfCounterReader : IDisposable

    {

        PerformanceCounter myMemoryCounter;

        const string CategoryNetClrMemory = ".NET CLR Memory";

        const string ProcessId = "Process ID";

        const int ProcessesToTry = 40;

 

        public PerfCounterReader(int processId) : this(Process.GetProcessById(processId))

        {

        }

 

        string GetInstanceNameForProcess(int instanceCount, Process p)

        {

            string instanceName = Path.GetFileNameWithoutExtension(p.MainModule.FileName);

 

            if (instanceCount > 0) // Append instance counter

            {

                instanceName += "#" + instanceCount.ToString();

            }

 

            // Reader .NET CLR Memory Process ID for the given instance to check if

            // it does match our target process

            using (PerformanceCounter counter = new PerformanceCounter(CategoryNetClrMemory, ProcessId,

                   instanceName, true))

            {

 

                long id = 0;

 

                try

                {

                    while (true)

                    {

                        var sample = counter.NextSample();

                        id = sample.RawValue;

 

                        // for some reason it takes quite a while until the counter is

                        // updated with the correct data

                        if (id > 0)

                            break;

 

                        Thread.Sleep(15);

                    }

                }

                catch (InvalidOperationException)

                {

                    // swallow exceptions from non existing instances we tried to read

                }

 

                return (id == p.Id) ? instanceName : null;

            }

 

        }

 

        string GetManagedPerformanceCounterInstanceName(Process p)

        {

            Func<int, Process, string> PidReader = GetInstanceNameForProcess;

            string instanceName = null;

            AutoResetEvent ev = new AutoResetEvent(false);

 

            for (int i = 0; i < ProcessesToTry; i++)

            {

                int tmp = i;

                // Since reading the performance counter for every process is

                // very slow we try to speed up our search by reading up to ProcessesToTry

                // in parallel

                PidReader.BeginInvoke(tmp, p, (IAsyncResult res) =>

                    {

                        if (instanceName == null)

                        {

                           string correctInstanceName = PidReader.EndInvoke(res);

 

                           if (correctInstanceName != null)

                            {

                                instanceName = correctInstanceName;

                                ev.Set();

                            }

                        }

 

                    }, null);

            }

 

 

            // wait until we got the correct instance name or give up

            if (!ev.WaitOne(20 * 1000))

            {

                throw new InvalidOperationException("Could not get managed performance counter instance name for process " + p.Id);

            }

 

            return instanceName;

        }

 

        public PerfCounterReader(Process p)

        {

            string processInstanceName = GetManagedPerformanceCounterInstanceName(p);

            myMemoryCounter = new PerformanceCounter(CategoryNetClrMemory, "# Bytes in all Heaps", processInstanceName);

        }

 

        public long BytesInAllHeaps

        {

            get

            {

                return myMemoryCounter.NextSample().RawValue;

            }

        }

 

        #region IDisposable Members

 

        public void Dispose()

        {

            myMemoryCounter.Dispose();

        }

 

        #endregion

    }

}

To use this class you can give it your current process to check how exact the counter behaves:

 

            var p = Process.GetCurrentProcess();

            using(PerfCounterReader reader = new PerfCounterReader(p))

            {

                while (true)

                {

                    Console.WriteLine("Managed Heap Memory[{0}]: {1:N0} {2:N0}", p.Id, reader.BytesInAllHeaps, GC.GetTotalMemory(false));

                    memory.Add(new List<byte>(10000 * 1000));

                    Thread.Sleep(1000);

                }

            }

 

This will produce output similar to this:

Managed Heap Memory[1616]:     868.844   1.129.604
Managed Heap Memory[1616]:     868.844  11.202.852
Managed Heap Memory[1616]:  10.840.884  20.376.384
Managed Heap Memory[1616]:  20.912.640  30.376.376
Managed Heap Memory[1616]:  20.912.640  40.449.624
Managed Heap Memory[1616]:  20.912.640  50.522.872
Managed Heap Memory[1616]:  20.912.640  60.596.120
Managed Heap Memory[1616]:  20.912.640  70.669.368
Managed Heap Memory[1616]:  20.912.640  80.734.424
Managed Heap Memory[1616]:  20.912.640  90.807.672
Managed Heap Memory[1616]:  20.912.640 100.880.920
Managed Heap Memory[1616]: 100.841.252 110.376.732
Managed Heap Memory[1616]: 100.841.252 120.449.980

What is interesting that the GC.GetTotalMemory function gives much more precise results than the performance counter. It seems that the performance counter is updated only once every 5-10 seconds which is quite slow but better than nothing. The .NET Memory Performance Counters are updated after every GC.Collect. If you want to track during unit tests your resource consumption in a timely manner you will need to add quite big sleeps or trigger a GC in the remote process to get decent reliable numbers.

As a rule of the thumb I can only emphasize measure and check your numbers for errors. Coming from nuclear physics I was educated to question the numbers and check for consistency. This art seems to have gotten lost in our fast paced IT industry where the display (excel sheet with fancy macros) seems to be more important than what you actually did measure. If these numbers do help you to track and steer resource consumption, performance, ... then you have produced real business value. Once you have got reliable measurements you can reason about the numbers what they can tell you. With an increasing amount of work you can

  1. Measure something wrong
  2. Measure something right
  3. Measure the relevant things right
  4. Measure the relevant things right and take further actions to improve your software.

If you are stuck in 1-3 then you have gained nothing for your current project because the knowledge gained from your measurements does not flow back into your software.

Measuring for example the available physical memory before and after a test will show you that you have "lost" or "gained" 100-300 MB of memory. But what does it tell you about the resource consumption of your tests? Not much since the OS does manage your physical memory of all processes. Even if you have a big memory leak it does not necessarily show up a lost physical memory since the OS is quite good at paging unused memory out into the pagefile. The machine wide memory consumption is easy to measure but of little value (2). More about the Zen of measuring performance/consumption right is the topic of a future post.


Thursday, May 14, 2009 #

Uhh What? DevPath is an environment variable that allows you specify global directories which are searched just like GAC. If you ever had the urge to load dlls from your application from subdirectories you need a probing element in your app.config which allows exactly that.

The only problem with that is that you cannot escape from your application root directory. When you try to load something from ..\Centralbin it is ignored. In that cases you need to use the GAC if you like it or not. Since DevPath was broken for some time with .NET 2.0 I thought it was no longer supported. But thanks to John Robbins article "PDB Files: What every Developer must know." I did learn a different story. That makes it possibly now to use some Microsoft tools in a standard fashion. The Xml Serialization assembly generator Sgen for example can create the serialization assembly only if all public serializable types do not have dependencies to assemblies in other directories. This is a major PITA since fresh compiled assemblies are located in other directories than the rest (except it you have set Copy To Local to true in Visual Studio but that is a bad idea either).

But now we can alter the sgen.exe.config and add one line

<?xml version ="1.0"?>
<configuration>
    <runtime>       
        <generatePublisherEvidence enabled="false"/>
        <developmentMode developerInstallation="true"/>
    </runtime>
</configuration>

And now behold lets call sgen.exe

C:\Source>sgen.exe

System.Threading.SynchronizationLockException: Object synchronization method was called from an unsynchronized block of code.
at System.Resources.ResourceManager.TryLookingForSatellite(CultureInfo lookForCulture)
at System.Resources.ResourceManager.InternalGetResourceSet(CultureInfo culture, Boolean createIfNotExists, Boolean tryParents)
at System.Resources.ResourceManager.GetString(String name, CultureInfo culture)
at System.Environment.ResourceHelper.GetResourceStringCode(Object userDataIn)
at System.Environment.GetResourceFromDefault(String key)
at System.Environment.GetResourceString(String key)
at System.IO.Path.CheckInvalidPathChars(String path)
at System.IO.Path.NormalizePathFast(String path, Boolean fullCheck)
at System.IO.Path.NormalizePath(String path, Boolean fullCheck)
at System.IO.Path.GetFullPathInternal(String path)
at System.AppDomainSetup.set_DeveloperPath(String value)
at System.AppDomain.SetupFusionStore(AppDomainSetup info)
at System.AppDomain.SetupDomain(Boolean allowRedirects, String path, String configFile)

Ups. .NET 3.5 SP1 did not do the trick? Some investigation shows that the .NET Framework is still subject to bad error handling. If DEVPATH is empty or DEVPATH contains invalid path characters such as > < | or " then it will try to report the issue so far so good. But it seems that Microsoft seems to be a lover of fast in process tests where each methods works perfectly but the whole thing blows apart when used in a true product scenario. This is not the first time that did happen with DEVPATH but I thought that since the release of .NET 2.0 in 2005 these things would have been fixed and some regression tests had been added. Apparently I was wrong.

In my specific case I did try set devpath="c:\Source\EntLib3Src\App Blocks\bin" which did fail because of the parenthesis. Once I removed the "invalid" characters all did work out fine.

During my investigation with Reflector I stumbled upon another undocumented environment variable RELPATH which does set the private probing path.

info.PrivateBinPath = Environment.nativeGetEnvironmentVariable(AppDomainSetup.PrivateBinPathEnvironmentVariable);

When I set it to e.g. subDir then I do no longer need to set the private probing path in my App.config. Nice that could come in handy in some scenarios.


Tuesday, December 16, 2008 #

I do performance measurements quite regularly which involves to call a piece of code several times to measure how long it did take. I am sure nearly everybody has done this already. But as a physicist I know that (nearly) every measurement has fundamental problems which never go away. Key to a successful measurement is that you exactly know what you are measuring and not what you think you measure. The easiest way to get out of this dilemma without too much knowledge is to simply ignore the fact that you don´t know enough and restrict yourself to pure differential measurements. With differential I do mean that you measure it once, change the code in a way that makes it better and measure it again. When you measure different times then you can assume that your code change was the cause for the different timing. That does work to some extent quite well but you should always be vigilant when the results change dramatically from one build to the next. There is no error I did not make with performance measurements so I think it is safe to give away my top 101:

  1. DateTime.Now has a resolution of ~16ms. If you measure anything faster you must use Stopwatch which has a resolution of about 100ns.
  2. With Windows Vista you need to set your Power options to maximum performance to prevent the OS from change the CPU clock frequency at random times.
  3. Warm and cold start times are way different. Usually a factor of 2-6 is quite normal. With cold startup I mean that you have a fresh booted system which did have never run your test case. This cold startup time does mainly contribute to disk IO which you can watch with XPerf very nicely.
  4. Be sure that you measure the right thing. I cannot emphasize this enough. Did you really measure the right thing?
  5. Know your influencers. You need to get a good understanding how much e.g. number of iterations, input data size, concurrency, other applications can affect the outcome. I had more than once the case that I did execute a test in the evening and then next morning. But the results did differ by a factor two for no apparent reason.
  6. Debug and Release builds still differ in the managed world. Although the difference is much smaller than it was with C++.
  7. First time initialization effects should be measured separately. When you initialize a class and call some method 10 million times it makes not much sense to mix the ctor call time with pure function calling time.
  8. Do measure long enough to get reliable results. The mean value is a powerful ally to flatten sporadic effects.
  9. If you get strange disk IO reports check if a virus scanner does interfere.
  10. Other processes might be running as well. These might influence your test if they consume significant resources.
  11. Stupid but happens: Turn off tracing before you measure.

You still want to make some quick throw away measurements? You have been warned. It is a tricky thing. To get at least the timing calculation right I did create some extension methods I want to share there. With these extensions you can create an Action delegate and call the new Profile method on it to execute it n-times. To give you full control over the string formatting you can supply a format string which expands {runs}, {time} and {frequency} in a human readable way for you.

            Action func = () =>

            {

                using (File.OpenRead(@"C:\config.sys"))

                {

                }

            };

 

            func.Profile(1000, "Did open the file {runs} times in {time}s. Can do {frequency} File.Open/s");

That little snippet produces on the console

Did open the file 1 000 times in 0.06s. Can do 16 667 File.Open/s

Since we know that the first call to File.Open will involve actual disk accesses which are much slower we want to measure the first call independently to see the "cold" startup performance as well:

 

func.ProfileWithFirst(1000, "First File Open did take {time}s", "Did open the file {runs} times in {time}s. Can do {frequency} File.Open/s");

That will give us

First File Open did take 0.00s
Did open the file 999 times in 0.05s. Can do 20 388 File.Open/s

There are also times when you want to control the number of iterations by yourself. Nothing easier than that. Simply change the delegate type from Action to Func<int> and you are ready to go.

 

            Func<int> myFunc = () =>

                {

                    const int Runs = 1000;

                    for (int i = 0; i < Runs; i++)

                    {

                        using (File.OpenRead(@"C:\config.sys")) ;

                        using (File.OpenRead(@"C:\autoexec.bat")) ;

                    }

 

                    return Runs * 2;

                };

 

            myFunc.Profile("Did open {runs} files in {time}s");

That was almost too easy. But what if I do not like the formatting? Well it turns out you can customize it still. The format string placeholders {0} {1} and {2} are the number of runs, elapsed time in seconds  as float and call frequency (runs/s)  as float. They can be used as usual to customize the output format to your specific needs.

myFunc.Profile("Did open {runs} files in {1:F5}s");

Did open 2 000 files in 0,10000s

The actual class is written in C# 3.0 with lambda expressions. The usage of this rather low level primitives gives you a rather surprising amount of flexibility and composability to build higher level functions.

Here is the source code for the Action and Func<int> delegate extension methods:

 

using System;

using System.Globalization;

using System.Diagnostics;

 

namespace PerformanceTester

{

    /// <summary>

    /// Helper class to print out performance related data like number of runs, elapsed time and frequency

    /// </summary>

    public static class Extension

    {

        static NumberFormatInfo myNumberFormat;

 

        static NumberFormatInfo NumberFormat

        {

            get

            {

                if (myNumberFormat == null)

                {

                    var local = new CultureInfo("en-us", false).NumberFormat;

                    local.NumberGroupSeparator = " "; // set space as thousand separator

                    myNumberFormat = local; // make a thread safe assignment with a fully initialized variable

                }

                return myNumberFormat;

            }

        }

 

        /// <summary>

        /// Execute the given function and print the elapsed time to the console.

        /// </summary>

        /// <param name="func">Function that returns the number of iterations.</param>

        /// <param name="format">Format string which can contain {runs} or {0},{time} or {1} and {frequency} or {2}.</param>

        public static void Profile(this Func<int> func, string format)

        {

 

            Stopwatch watch = Stopwatch.StartNew();

            int runs = func();  // Execute function and get number of iterations back

            watch.Stop();

 

            string replacedFormat = format.Replace("{runs}", "{3}")

                                          .Replace("{time}", "{4}")

                                          .Replace("{frequency}", "{5}");

 

            // get elapsed time back

            float sec = watch.ElapsedMilliseconds / 1000.0f;

            float frequency = runs / sec; // calculate frequency of the operation in question

 

            try

            {

                Console.WriteLine(replacedFormat,

                                    runs,  // {0} is the number of runs

                                    sec,   // {1} is the elapsed time as float

                                    frequency, // {2} is the call frequency as float

                                    runs.ToString("N0", NumberFormat),  // Expanded token {runs} is formatted with thousand separators

                                    sec.ToString("F2", NumberFormat),   // expanded token {time} is formatted as float in seconds with two digits precision

                                    frequency.ToString("N0", NumberFormat)); // expanded token {frequency} is formatted as float with thousands separators

            }

            catch (FormatException ex)

            {

                throw new FormatException(

                    String.Format("The input string format string did contain not an expected token like "+

                                  "{{runs}}/{{0}}, {{time}}/{{1}} or {{frequency}}/{{2}} or the format string " +

                                  "itself was invalid: \"{0}\"", format), ex);

            }

        }

 

        /// <summary>

        /// Execute the given function n-times and print the timing values (number of runs, elapsed time, call frequency)

        /// to the console window.

        /// </summary>

        /// <param name="func">Function to call in a for loop.</param>

        /// <param name="runs">Number of iterations.</param>

        /// <param name="format">Format string which can contain {runs} or {0},{time} or {1} and {frequency} or {2}.</param>

        public static void Profile(this Action func, int runs, string format)

        {

            Func<int> f = () =>

            {

                for (int i = 0; i < runs; i++)

                {

                    func();

                }

                return runs;

            };

            f.Profile(format);

        }

 

        /// <summary>

        /// Call a function in a for loop n-times. The first function call will be measured independently to measure

        /// first call effects.

        /// </summary>

        /// <param name="func">Function to call in a loop.</param>

        /// <param name="runs">Number of iterations.</param>

        /// <param name="formatFirst">Format string for first function call performance.</param>

        /// <param name="formatOther">Format string for subsequent function call performance.</param>

        /// <remarks>

        /// The format string can contain {runs} or {0},{time} or {1} and {frequency} or {2}.

        /// </remarks>

        public static void ProfileWithFirst(this Action func, int runs, string formatFirst, string formatOther)

        {

            func.Profile(1, formatFirst);

            func.Profile(runs - 1, formatOther);

        }

    }

 

}


Friday, December 12, 2008 #

How can you use the word efficient memory usage and mention in the same headline .NET? We all know that C++ is much more efficient with regards to memory consumption. Yes I agree that if you really love your memory you should think twice if .NET is the right choice for you. There have been reasons why  Windows Vista has not a single managed executable executed while starting up. Ok the Event Viewer is managed which explains why it is starting so slow. First of all you need to know what things cost. The following table shows you how much memory is allocated for some common object types:

 

Type Size in Bytes (32-bit)
new object() 12
new string('\0'); 20
new DummyStruct(); 4

What perhaps is surprising that each managed class object  consumes at least 12 bytes of memory. If you want to allocate a huge number of objects you are perhaps better off with a struct value type.

The program used to get the numbers was this one:

using System;

using System.Collections.Generic;

using System.Linq;

using System.Text;

using System.Threading;

using System.Runtime.InteropServices;

 

namespace MemoryAllocation

{

 

    [StructLayout(LayoutKind.Sequential,Pack=1)]

    struct DummyStruct

    {

        public int a;

        public int b;

    }

 

    class Program

    {

        static List<T> Allocate<T>(int allocations, Func<T> allocator)

        {

            List<T> memory = new List<T>();

            for (int i = 0; i < allocations; i++)

            {

                memory.Add( allocator() ); // call function that creates a new object

            }

 

            return memory;

        }

 

        static char [] empty = new char[] { '\0' };  // input for string ctor to create an empty string

 

        static void Main(string[] args)

        {

            var before = GC.GetTotalMemory(true);

 

            const int Allocations = 1000 * 1000;

 

            // allocate memory and do not release it

            var mem = Allocate(Allocations,

                //() => new object()

                //() => new DummyStruct()

                () => new string(empty)

                );

 

            GC.Collect();

            GC.Collect();

            var after = GC.GetTotalMemory(true);

 

            GC.KeepAlive(mem);

 

            // get memory allocated by one object excluding the 4 bytes which are used for the object

            // reference in the array

            Console.WriteLine("One object consumes about {0} bytes", (after - before) / Allocations);

        }

    }

}

One thing to note is that you need to subtract from the output 4 bytes for reference types (object and string) because they are stored in an array and we do not want to count the array reference as memory consumption also. As I said in my previous post "Where Did My Memory Go" every small (< 85000 bytes) managed object allocation will eat up your physical memory because the garbage collector will traverse the managed heaps from time to time to remove dead objects and to compact the heaps. That has the effect that your objects although you will never use them will stay always hot in the memory which prevents them to go into the page file. You can of course force the OS to swap all your memory out to the page file by calling SetProcessWorkingSetSize(GetCurrentProcess(),-1,-1) but that has severe performance drawbacks when you access the swapped out memory which causes hard page faults. Windows Forms application actually do this to save memory. That is the reason why the Working Set drop to some MB when you minimize a managed application. The golden rule is to use efficient data structure to consume as little memory as possible. If you want to optimize your memory consumption you need a managed memory profiler. The ones I found most useful are

 

.NET Memory Profiler from SCITech

  • It is cheap 179€/license (taken at 13.12.2008)
  • The fastest profiler I have used so far.
  • Full 64-bit support.
  • It does support allocation stacks (which functions did lead to the object allocation).
  • View content of allocated objects.
  • It can take Snapshots of your process and compare it to another snapshot. This way you can find memory leaks quite easily.
  • Object tagging to find out which objects are new since the last snapshot.
  • Nice and fast filtering capabilities.

Of course there are also some gotchas

  • The extended profiling mode is not as stable as I would have wished. It works for most applications but can crash on bigger applications.
  • The object list is not very easy to navigate to find your biggest memory consumers.

 

The .NET Memory Profiler is easy to use and definitely worth its money.

 

YourKit Profiler for .NET

  • It can be enabled during application runtime which is a quite unique feature I have found nowhere else.
  • It is both a performance and memory profiler.
  • Full 64-bit support.
  • Class List view is easy to navigate.
  • Class Tree view is very cool to find out more about which objects contain all the others on a namespace level.
  • Memory Analysis can find duplicate strings and other waste memory anti patterns which can otherwise only be found by watching each string ...

That sounds very impressive and it is. But it has also some quite severe limitations

  • It has no allocation stack support. That makes it very hard to find out who has allocated your object.
  • Static class members cannot be tracked down to the class that holds the reference to them. They show up as object roots with no connection to anywhere.
  • Stack local instances are flagged but if you e.g. allocate 200 MB inside a function you will not be able to navigate to the class that allocated the object.
  • The fast profiling mode is much less stable and crashes quite often
  • Opening a snapshot is very slow.
  • The performance profiler (sampling, tracing) does not show me the bottlenecks where thread sleeps are involved in the way I would expect them. That can lead to the wrong direction. The new Ants Profiler 4.0  or Intels VTune Performance profiler are much better suited for that job.
  • Not so cheap 389 €/license (taken at 13.12.2008).

YourKit is the leading Java profiler company which also has a .NET profiler in their portfolio. No profiler is perfect and in fact both complement each other and I would recommend to use both (SciTech and YourKit)  to get the best possible overview how your memory in your application is distributed. All profilers can be downloaded from the software vendors for free with a 14-day trial license to try them out. I recommend to do so to find out which profiler suits your needs best. There are some other profilers also out there like AQTime and the Ants Memory Profiler. AQTime seems to be able to profile .NET and C++ for performance and memory which makes it very interesting. But so far I found not enough time check because it is not easy to use. The current Ants Memory Profiler is not usable and I cannot recommend it at all. But they have a very good performance profiler which is really worth its money. The only thing I really do not understand about the Ants profilers is that it is NOT possible to launch an application with command line arguments from a batch file. That is ok for GUI applications but if your application under test spawns child processes you need to be able to call the profiler from the command line.

Equipped with a profiler we can chase our memory now.

I recommend to look at first for

  • Duplicate strings - The objects that allocated them will most likely have several instances around. Consider to make them static to save memory.
  • *Cache* in the type name. It is surprising how much memory with caches is lost. Bigger applications seem to have their own cache in each architectural layer which should be questioned. If the cache itself is ok look how many instances of your cache exist. A sane rule is that a cache should exist only once. If you find more than one cache instance it is very likely that the cached data may be the same but it is not shared.
  • XmlDocuments have their own XML DOM tree representation which consumes quite a lot of memory (x3-x5 times more than the plain xml file). The profilers have a hard time to resolve from an XmlNode to the actual object that holds a reference to them so it can be a bit tricky to find out.
  • XmlReaderSettings are a fine source of memory leaks if you store them as member inside your class. When you choose to validate your XML document you will attach the reader settings class with the validation error callbacks to the just read XmlDocument. In effect you reference from the XmlReaderSettings instance your XML DOM tree even when you do not need it ever again!
  • Huge number of objects of the same type and check who has allocated them.

 

When you have found an inefficiency it is time to fix it. Then you need to measure memory consumption again. Here comes the hard part. It is very difficult to check if a memory optimization did actually save memory. A simple look at Working Set, Private Bytes, GC Heap does not work since the GC heaps are allocated in chunks (16 MB if I remember correctly). These numbers tell you only the peak memory consumption during startup of your application. But since then half of your heaps might be empty and you can get the impression that nothing has changed after your patch. The easiest way to check if an optimization did actually work is to look at the GC.GetTotalMemory value from time to time or you can use the memory profilers overview pages which are also helpful.


Monday, December 08, 2008 #

I have installed Windows Vista Ultimate since some month now and I still wonder what system service I have missed to get a well behaved system. To troubleshoot heavy disk access I use Process Monitor from SysInternals. It is an invaluable tool to find out who did access what on the system. It can monitor every module load, process start, thread creation, registry or disk access and network activity you can imagine. If you are developing unmanaged code you can even view the call stacks who did access this file or registry key not only from which application but it captures also (when the symbols are loaded) the full call stack. To make the call stack work you need to download the symbols which is quite easy and I recommend to do it. Below is a typical screen shot from Process Monitor when the system is doing seemingly nothing:

As you can see there is still quite a lot activity going on behind the scenes. Process Monitor makes the Windows black box a white box and you can look directly at the system while it is working. If you ask with what black art this information is obtained: Process Monitor installs a filter driver (at run time only) which allows it to redirect nearly all IO related activities to the filter driver. To check if the driver is runnung you can use the fltmc command:

c:\>fltmc.exe

Filtername                     Number of Instances    Height    Frame
-----------------------------  --------------------    ----    -----
PROCMON11                           0       385001         0
luafv                                   1         135000          0
FileInfo                                8        45000            0

If you wan to find out who did who did open the files simply go to Tools - Stack Trace

and you will see  even the call stack who did try to access a specific file or registry key. I did try this with a small C++ application to get a full call stack but the latest version (2.02) seems have problems to pick up the correct events. I did not see my file IO at all (Windows Vista 32-bit). Ok that means that I can finish my post. But wait. My initial goal was to tell you a little known secret: XPerf is your secret weapon when you want to see what the Kernel really does.

Some introduction links are here:

Starting with Windows 2000 the ETW (Event Tracing for Windows) was introduced but it was a little known feature which did not get much attention. The main reason is that it was quite complicated to implement and use. Xperf has been used within MS since years but it was so useful they have published it. The tool can be used to capture traces on Server 2003 and XP systems but you must analyze them on a Vista system to decode them.

Its main uses are (where I have used it so far):

  • Find out the difference why the first start of your application takes so much longer than the second start
  • Who is accessing the file x zillion times?
  • Are my disk access patterns inefficient?

To get started simply install the tool and start a trace session with some default event providers (xperf -providers lists all of them).

  • xperf -on base

Now you can execute your use case, start your application, ... Then you can stop the trace session and save the traces to disk

  • xperf -d c:\savedTraces.etl

To view the saved traces simply type

  • xperf c:\savedTraces.etl or double click on the file in explorer if you have a file association with .etl files.

You will see some output like this:

As you can see there is CPU usage, Disk IO, Disk IO per process, .... Today application was scraping over my hard disk again and I never figured out what it was. Process Monitor shows you very much details but sometimes the amount of data can be overwhelming and I was not able to spot the issue. But here we see a nice 100% spike in the Disk Utilization by Process. Lets select the region of interest and click on Detail Graph (this is a context menu):

This graph shows you the time on the horizontal axis and the logical cluster number of you hard disk on the vertical axis. To make it even more interesting it shows you only true disk accesses. If all data is in the file system cache this graph will contain only a few dots which tells you that the hard disk had nothing to do with your slow system. From this graph you can conclude

  • Big vertical changes means long disk seeks thus slow performance. An optimal disk access pattern (linear read) would be a straight line.
  • Big horizontal gaps means that the disk or the application was busy. You just did wait for something to complete (random access) quite long.

Additionally the Summary table shows you also a wealth of information

In my case I did find that FireFox has its own database (urlclassifer3.sqlite) which does update its index from time to time. The database itself contains a list of URLs which are considered harmful. Firefox will warn you when you try to navigate to potential fraud urls. You can disable this thing under

Options - Security and uncheck

Tell me if the site I´m visiting is a suspected attack site.

Tell me if the site I´m visiting is a suspected forgery.

Shut down Firefox and delete the database. Voila no more hard disk scraping in the background and one reason less to blame Windows Vista. Finally this case can be closed. With this tool I managed also to find out what my little cpptest.exe application was doing:

 

When the _NT_SYMBOL_PATH is  set to c:\windows\symbols;SRV*C:\Windows\Symbols*http://msdl.microsoft.com/download/symbols and you add the cpptest pdb directory you will get this output. From this stack trace we can conclude that cpptest.exe does call from its main method some long lasting functions such as WriteFile which are responsible for the disk IO we where seeing in the System process. One important thing about file IO with Windows Vista is that the OS seems to be veeeery lazy with file writes. If I write e.g. a 100 MB file it is has 0 bytes until I close the file! To force the Vista to write the contents of a file to disk you need to call FlushFileBuffers or use unbuffered IO.


Sunday, November 30, 2008 #

When you program in a high level language like .NET where the Garbage collector takes care of your memory you do not have to think about memory as often as it is the case in C++. Memory leaks tend to show up much more often in C++ and other non garbage collected languages because nobody is cleaning after you. Garbage collection is a good thing but somehow your application consumes much more memory than you thought it should. What should you do now? First of all you need to understand how your memory is organized by Windows itself. Mark Russinovich has a very eloquent article about that. If you do not understand what the whole article is all about I give you a quick start.

  1. Download Process Explorer from TechNet.

Now you see some output like this.

 

What do the numbers of Working Set, Working Set Private, Private Bytes, ... mean to a "normal" programmer? I do allocate memory, use it and let the garbage collector free it when I no longer use it. The usage of memory is easy but Windows does a lot behind the scenes to make it work in an efficient way. The most crucial part is that Windows does is to share memory between processes when it is read only. Your code and read only data in a dll is a good example of a read only data structure that can be used in many processes. If you use the same dll in more than one process it will be shared between all processes that use it.

Shared Memory

If you look at the columns in the screen shot of Process Explorer you will notice that different counters for Working Set and Private Bytes have been selected. The deeper reason for this is that these numbers are incredibly useful to tune your application. To get more data you can select the Properties (right click on a process) of a process.

 

  •  Working set is the actually used physical memory which cannot be more than the amount of your RAM chips installed on your computer.
  • Private Bytes is the memory that cannot be shared between processes.
  • Working Set Private tells you how much private bytes attribute to your working set (allocated physical memory) .

When you add the total working set of all processes you can get a much bigger number than the installed memory on your machine. The reason behind this is that much of your process data (e.g. code) can be shared. You can calculate your working set out of

Working Set = Working Set Private + Working Set Shareable

If you want to create a well behaved .NET application you would aim for a low working set and low private bytes. Private bytes are for example all your allocated objects which live on the CLR heap either in Generation 0,1,2 or the Large Object Heap. More about that comes later.

Code Sharing - NGen And Precompiled Assemblies

In the .NET environment things are complicated a bit by the JIT compiler which does compile your IL code into each process separately. To achieve full code sharing in .NET processes you need to precompile your assembly with the NGEN tool to enable cross process code sharing. If you look with process explorer at your loaded dlls (press Ctrl-D in Process Explorer) you will find that all .NET assemblies from Microsoft are precompiled to minimize the memory footprint if more than one .NET process (which is very likely) is runnig. To validate that you are you using the precompiled images look into the Fusion Log (Fuslogvw and check the Native Image checkbox). An even easier way is to look at the path of you loaded dlls. If it does contain C:\Windows\NativeImages_v2.xxxx then you have loaded the precompiled assembly successfully. If not your NGen image did not match the loaded assembly and must be updated or you are using multiple AppDomains. In that case you need to decorate your Main method with LoaderOptimization.MultiDomain value to tell the JIT compiler to share the code between AppDomains.

        [LoaderOptimization(LoaderOptimization.MultiDomain)]

         static void Main(string[] args)

 

Code size in enterprise applications can easily reach several hundred MB which would become a major headache if no code sharing between processes is possible.

 

Data Sharing

Another way to share data between processes are Memory Mapped Files which will be supported by the .NET Framework 4.0 without any PInvokes finally.

 

Page File Allocated Memory

An even trickier thing is to allocate memory in the page file directly by calling VirtualAllocEx. Since the page file is shared between all processes it is not really possible to attribute this allocation to a specific process (yet). This is the reason why Page File backed memory does not show up as private byte memory at all although your application might consume GBs of it.

 

Working Set and Allocation Size

There is a very direct relation between Working Set size and allocation size in .NET applications. Try to run the following code snippet

    class Program

    {

        static void Main(string[] args)

        {

            List<byte[]> memory = new List<byte[]>();

            const int Factor = 85; // Allocate 85000 bytes with each loop run

            while (true)

            {

                var bytes = new byte[Factor * 1000];

                memory.Add(bytes); // prevent the GC from reclaiming the memory

                Thread.Sleep(Factor / 2);  // throttle the  allocation to make it visible

                Console.WriteLine("Next run");

            }

        }

    }

That code snippet will allocate memory in blocks each the size of 85000 bytes and sleep a little to watch the memory allocations more easily. If you wonder why on Earth I did use 85000 bytes as block size: That is the size when the .NET Framework (2.0, 3.0, 3.5, 3.5 SP1) will allocate your object on the Large Object Heap. All objects on this heap are never moved by the Garbage Collector. You can observe this directly when you watch the Working Set Size. It remains constant while you allocate hundreds of MBs of private bytes memory! Windows allocates the memory and finds that since you did not touch the memory it can be moved to the page file where your application will happily allocate more and more page file but not physical memory until you reach the 2 GB limit for a 32 bit process or the page file becomes full.

The effect changes drastically when you change the Factor from 85 to 84. This will change the allocation size below the threshold and you will allocate the memory on the normal CLR heap. That heaps are compacted from time to time by the GC which means that the GC will force Windows to move our memory from the page file into the physical memory. Although our application does not access the allocated bytes the GC will which binds our memory allocation directly to physical memory!

 

When you allocate memory in smaller chunks than 85000 bytes it will be allocated in your physical RAM due to the GCs nature to traverse the whole heap from time to time.

 

That is important since it severely limits our ability to run an application with many small objects on machines with not so much RAM. It is therefor vitally important for all .NET developers to track their memory consumption and have a sharp eye on many small (<85000 bytes) object allocations which directly add to the process working set. How and which memory profilers can be used to track typical .NET applications is a topic for a future post.


Tuesday, May 20, 2008 #

In case you have not noticed the Patterns & Practices Team has released the greatest and latest version of the Enterprise Library. More info's can be found at Grigory Melniks blog.

What’s New in v4.0?

This release of Enterprise Library includes the following:

– Integration with the Unity Application Block

– Windows Management Instrumentation (WMI) 2.0 support and improved instrumentation

– Performance improvements (particularly, in the Logging Application Block)

– Pluggable Cache Managers

– Visual Studio 2008 support

– Bug fixes

Unity looks cool but I was not able to see any noticeable startup performance gain. The usage of dynamic code generation (Build Plans) to invoke ctors and properties seems not to pay off when we create from each type only one or two instances.  WMI looks nice but you need this stuff only if your architecture does not have central configuration mechanism. Alternatively you could use a custom configuration source which gets its configuration from somewhere else (WCF?) quite easily. The Visual Studio 2008 integration is (of course) seamless as expected. One really cool thing is buried in the performance improvements section where the Logging Application Block has got some attention. As noted in my previous article Really Fast Formatting With Enterprise Library there are massive performance gains possible by doing some intelligent preprocessing of the TextFormatter templates and getting into a more functional style of programming. The formatting performance of the new Enterprise Library is >10 times better. It seems that these improvements were not on the original road map (the March CTP did not contain any improvements in this area) but my proposal was taken and implemented.

I did try to extend the improved TextFormatter but did give up soon because this portion of the Enterprise Library plays a little closed source here and does not let me override the formatting of any existing token. Fortunately a little copy & paste did solve this issue quite nicely (not the best pattern of code reuse but it works). By applying the new formatting method via delegates I was able to get performance gains over a factor 2 in logging throughput which helps scalability a lot. Come and get it and see it for yourself.


Wednesday, April 09, 2008 #

The Enterprise Library 4 CTP and the feature list can be found at the Codeplex Home page.

General Features of Entlib 4 (Excerpt from Release Notes)

· Enterprise Library 4.0 takes advantage of the improved features in the .NET WMI 2.0 API to provide update capability for configuration exposed through the Manageable Configuration Source. Objects defined in the Enterprise Library configuration, such as databases, trace listeners, and cache backing stores, are exposed through WMI as a set of classes that you can query and update. Providing that you have the relevant Windows permission to update WMI values, any changes you make to the properties of these classes are reflected back through the Manageable Configuration Source and update the configuration information exposed to your application. Changes do not affect the original configuration file contents and may be overridden by Group Policy.

· In version 4.0, Enterprise Library has the Allow Partially-Trusted Caller attribute (APTCA) on all assemblies. This means that you can call the methods of Enterprise Library and the application blocks from an application running in a partial trust environment. You can do this with the signed assemblies provided with Enterprise Library. There is no longer any requirement, as there was in version 3.x, to recompile the source code and the source code for Object Builder then either use the unsigned binaries or strong-name them yourself.

That is good news for all medium trust ASP.NET web developers and Active Directory administrators that want to have an easy way to configure the application blocks in a central way.

New Logging Application Features (Excerpt from Release Notes)

· The application block now allows you to specify a value for the Filter property of each of the Trace Listeners. This property applies a filter that selects the level of message that it will detect. The valid values are All, Off, Critical, Error, Warning, Information, Verbose, and Activity Tracing. The setting effectively means "the specified level and everything more important." For example, the Warning setting will detect warnings, errors, and critical events. The default is All, which means that the behavior of the listener is the same as in previous versions of Enterprise Library if a value is not specified in the configuration.

· The application block now allows you to specify that the RollingFlatFileTraceListener will start a new file at midnight every day. To use this behavior, set the value of the RollInterval property of the RollingFlatFileTraceListener to Midnight.

· The application block contains performance improvements that include the following:

◦ Delayed gathering of context information until actually requested (except in distributed scenarios where context information is gathered even if not used).

◦ Automatic flushing is now configurable. However, the default is to flush always, which means that the behavior of the listener is the same as in previous versions of Enterprise Library if a value is not specified in the configuration. To disable automatic flushing, set the AutoFlush property of the Log Source to True. It is then your responsibility to ensure that all entries are flushed to the target, especially if an exception or failure occurs in the application. Otherwise, you will lose any cached logging information not yet written to the target.

· Log file names now support the use of environment variables. You can include environment variables such as %WINDIR%, %TEMP%, and %USERPROFILE% in the Filename property of the Flat File Trace Listener, Rolling Flat File Trace Listener, and XML Trace Listener.

The application block implements several new performance counters that you can use to monitor performance and operations. The new counters are Total Logging Events Raised, Total Trace Listener Entries Written, and Total Trace Operations Started.

First Impressions

The startup performance improvements with Unity (which is already released) are not done yet but I hope that this will help to make the usage of the application blocks nearly invisible. The new dependency injection mechanism which uses reflection like ObjectBuilder (slow) but uses dynamically generated IL code to call the actual constructors/getter/setter methods. That increases the object creation performance dramatically. I hope that this will make a noticeable difference in a real world application.

After doing a quick diff I found some improvements in the Logging Application Block. They do mainly concern the configuration via WMI and group policies with ADM templates. That enables to query and deploy logging configuration via an Active Directory without changing a physical file on the target machines. That is a nice feature but I have to yet to see if that does work in all scenarios. Low privileged users do not always have full WMI access which could hinder the usage of the new feature in low trust environments.

Writing to a log file seems to be very easy but it did take over three years (Jan. 2005 was the first release) to support environment variables in file names. The delivered code quality was always very good but why on earth did it take so long? Working agile with a very strict timeline does enable the P&P team to deliver top features on time but low priority features such as this one have been dropped. That is no problem since the community did deliver very quickly the missing features such as the Rolling File Trace Listener from Erwyn van der Meer which is in my opinion still the one of the cleanest implementations of a Rolling File Trace Listener. What I really miss is a backflow of great community extensions to the Enterprise Library main code base.

A really nifty feature of the new Rolling File Trace Listener is the roll interval Midnight to enable to start a new log file at every day. Although one should not expect a roll over exactly at midnight since the roll over is only performed when the next log call arrives after the time span has elapsed.

 

Logging Application Block Performance

The creation of LogEntry objects has become dramatically faster since the current values are only read when needed. That is good but not perfect (yet). The main observations I did make

 

  • LogEntry: Process Id, Process Name and AppDomain Name will never change during the lifetime of an application. They could be read only once and copied from static variables to the actual instance variables on demand.
  • LogEntry: The native and managed thread id is fetched on demand which could be too late if the output destination does asynchronous operations (queuing?).
  • Formatters: There is no progress in the text formatter performance. That is surprising to say the least since delayed LogEntry variable expansion does not help much if you use a TextFormatter with the default template to format your LogEntry. I whish so much I could check out the sources for TextFormatter ....

 

The Good/The Whishes/The Summary for Entlib 4

Good:

  • Many new features (central configuration, unity integration, ...)
  • Performance Improvements

Wishes

  • High quality community extensions should be included in the main code base.
  • Do NOT Authenticode sign the strong named binaries again. I really hate the 15s startup delays of non Internet connected machines! Yes I know with .NET 3.5 there is a way to turn Authenticode verification off in the App.config file but .NET 2.0 users will suffer from this.
  • Do include the pdb's for Release and Debug for all shipped binaries (ObjectBuilder too).

The next release of the Enterprise Library does deliver cool features such as Unity and support for group policy based configuration. Finally the Patterns & Practices team has found the time to fix some minor quirks which have been annoying but not mission critical. Since the new release seems to be API compatible with the previous versions you should definitely consider to upgrade as soon as possible. Have a look at the blog of Grigori Melnik for latest announcements about the upcoming release of Entlib 4.


Tuesday, April 08, 2008 #

Error handling is a difficult beast. And there is always one more way to do it. Thottam Sriram did write some nice examples what main cases you usually have to deal with. The most ugly case is the one where you cannot handle an exception but you need to do different cleanup logic in a finally block.

 

               public void Func()

        {

            bool cleanup = true;

 

            try

            {

                OtherFunc();

                cleanup = false;

            }

            catch (Exception)

            {

                // perform Cleanup on error

            }

            finally

            {

                if (cleanup)

                {

                    // perform cleanup on error

                }

            }

        }

 

The necessity to keep a state variable for a corner case does complicate the code quite a lot. You can get around that by knowing that .NET exceptions just wrap SEH (Structured Exception Handling) of Windows. The CLR does make extensive usage of SEH which you can read in the wonderful article from Chris Brumme. To sum it up in a few words if an exception is thrown an Exception Pointers structure is allocated. The actual exception processing is done two in two steps.

  1. Ask all exception handlers what about the current error can be done. If one says it can fix it the exception processing is stopped. No exception will ever happen.
  2. If no handler could fix it in the first pass they are actually invoked one by one and the stack unwinding does start.

C++ has special keywords for SEH named __try/__except/__leave which cannot be mixed with the regular try/finally C++ keywords within one function because only one exception handling chain within a function does make sense. Now you could ask how that complicated stuff can make by error handling easier? Well since we know that during an exception unwind scenario for the executing thread an Exception Pointers structure must exist we can check for its existence and we can distinguish the non exception case from the other. The BCL gives us direct access to this precious information via Marshal.GetExceptionPointers. Armed with that knowledge we can create a simple helper class that returns true if we are in an exception unwind case.

 

    public static class ExceptionHelper

    {

        /// <summary>

        /// Check if we are in a exception unwind scenario or not.

        /// </summary>

        public static bool InException

        {

            get

            {   // Errata: The red marked code seems to be necessary. Since unit tests with .NET 2.0

                // have shown that only checking for the Exception Pointers structure does not always work.

                return Marshal.GetExceptionPointers() == IntPtr.Zero &&

                       Marshal.GetExceptionCode() == 0 ? false : true;

            }

        }

    }

 

The code from above becomes now

 

        public void ImprovedFunc()

        {

            try

            {

                OtherFunc();

            }

            finally

            {

                if (ExceptionHelper.InException)

                {

                    // perform cleanup on error

                }

                else

                {

                    // Do normal cleanup in non exception case

                }

            }

        }

 

We did get rid of the catch clause and the state variable. That little trick can make your life easier in some situations. You can call the check property not only in a finally clause but also from a helper function because the Exception Pointers structure is thread local and visible as long exception processing is going on.

 

Remarks

 

At CLR IL (Intermediate Language) level there is the fault keyword which is similar. It acts like a finally block that is only executed when an exception has occurred. Besides the fact that C# does not support it is not really something that I would want. When an error happens I must execute nearly the same cleanup actions as in the non exceptional case. For that reason code sharing within one code block is essential to prevent code duplication both handlers. If you try to be smart and use the goto keyword (yes I won't dare to use it if it would have worked) to jump from a from a catch handler into the finally handler to skip a portion of the cleanup code you will get a compile error.

Since we have got our hands on the Exception Pointers structure we should be able to find there some reference to the managed exception and perhaps even return it. I tried that and failed. By looking at the Rotor source code I could not find evidence that this structure does contain any hints to the managed exception object. It looks like that the managed Thread implementation has it stored somewhere but since CLR objects are a mixture of IL code and C++ classes it is not feasible to use a simple fixed offset and cast it to the managed exception object.

If somebody has more info about that I am always interested to learn more ...


Monday, January 14, 2008 #

One of the action points for Enterprise Library 4 is to improve the performance of the TextFormatter. I like performance problems and found some interesting results I would like to share here. TextFormatter does take a template where the placeholders are expanded for every log message. The template does look like this

Timestamp: {timestamp}
Message: {message}
Category: {category}
Priority: {priority}
EventId: {eventid}
Severity: {severity}
Title:{title}
Machine: {machine}
Application Domain: {appDomain}
Process Id: {processId}
Process Name: {processName}
Win32 Thread Id: {win32ThreadId}
Thread Name: {threadName}
Extended Properties: {dictionary({key} - {value}
)}

TextFormatter does use StringBuilder to create from the template a mutable string and does then call StringBuilder.Replace for each token it could contain. It does basically a String.IndexOf to find if something has to be replaced and then allocates a new string which contains enough space for the token to be inserted. This process is repeated until all gaps are filled. This results in searches for tokens which are not present in the current template and some memory allocation/copy overhead because the string to replace will never have the same size.

But the most expensive things are the TokenFunctions which do expand e.g. the Extended Properties collection at the end of the example template. A quick look into the source shows where the problem is:

public virtual void Format(StringBuilder messageBuilder, LogEntry log)

    {

       ...

            string messageString = messageBuilder.ToString();

       ...

 

The performance optimization to use StringBuilder in the Formatter has cornered the TokenFunction formatter because it cannot to a simple replace of a fixed string and needs to parse the template. But StringBuilder gives you no access to the not yet constructed string and you need to use therefore this solution.

Armed with this knowledge I came up with an alternate approach to format a text template which does not suffer from the deficiencies of StringBuilder.Replace. The trick is to simply use StringBuilder.Append. To be able to do this I split the template apart in its constant (marked in orange) parts and use for each dynamic content a retriever function. The tuple (string, callback) is put into an object (MicroFormatter) from which we can create a list that allows us to build up the expanded template step by step.  Registering template fill callbacks does eliminate all string searches and (nearly) all buffer reallocations. The cool thing is that this is also possible for the TokenFunctions which can now directly format the parsed template.

public class MicroFormatter

    {

        /// <summary>

        /// Constant prefix which is part of the message template

        /// </summary>

        public string Prefix { get; set; }

 

        /// <summary>

        /// Gets the dynamic content of the LogEntry

        /// </summary>

        public Func<LogEntry, string> Retriever { get; set; }

 

        /// <summary>

        /// TokenFunction formatter wrapper function which does need a string template as parameter which is then formatted.

        /// The formatted string is then returned

        /// </summary>

        public Func<LogEntry, string, string> TokenFuncRetriever { get; set; }

 

        /// <summary>

        /// Template which is expanded by the token e.g. "dictionary({key} - {value}" for the DictionaryFormatter

        /// </summary>

        public string TokenFuncTemplate { get; set; }

 

        public MicroFormatter()

        {

        }

 

        public MicroFormatter(string prefix, string tokenFuncTemplate, Func<LogEntry, string, string> tokenFuncRetriver)

        {

            Prefix = prefix;

            TokenFuncTemplate = tokenFuncTemplate;

            TokenFuncRetriever = tokenFuncRetriver;

        }

 

        public MicroFormatter(string prefix, Func<LogEntry, string> retriever)

        {

            Prefix = prefix;

            Retriever = retriever;

        }

 

        /// <summary>

        /// Appends to a StringBuilder instance the static and dynamic contents of the parsed template

        /// </summary>

 

           public void Format(LogEntry entry, StringBuilder template)

 

        {

            if (Prefix != null) // Append constant template part

            {

                template.Append(Prefix);

            }

 

            if (Retriever != null) // Add dynamic content of LogEntry

            {

                template.Append(Retriever(entry));

            }

            else if (TokenFuncRetriever != null) // Or add dynamic content of TokenFunction

            {

                template.Append(TokenFuncRetriever(entry, TokenFuncTemplate));

            }

        }

    }

 

This allows us to generate a easy Format Method:

    // Combined list of microformatters which allow an easy StringBuilder.Append instead of

    // a much slower StringBuilder.Replace which does first search (many times pointless) and

    // then replaces the string which needs to reallocate and copy the buffer

    List<MicroFormatter> myFormatters = new List<MicroFormatter>();

 

    public string Format(LogEntry entry)

    {

        // allocate string builder can be dynamically sized

        StringBuilder sb = new StringBuilder(500);

 

        // Call formatters in sequence they were entered to build the spitted

        // template piece by piece together again.

        foreach (MicroFormatter fmt in myFormatters)

        {

           fmt.Format(entry, sb);

        }

 

        return sb.ToString();

    }

 

The red lines are the important changes which give us much better performance. This simple solution does improve the formatting speed over 14 times! That should be enough for everyone ;-) Another costly part is the creation of the LogEntry itself which does call a private method in its default ctor (CollectIntrinsicProperties) which can cost quite some time. It would be nice if the P&P team would make this function virtual so one has a chance to optimize things a bit. This would be beneficial if somebody does want to filter LogEntry derived objects which can be filtered out without needing any of the other properties. That would allow to defer the costly data collection when it is really ensured that the entry will be logged.

 

Performance

All Performance Tests were performed with an Intel Core Duo 6600 @ 2.40 GHz with 3 GB RAM, EntLib 3.1, VS2008 RTM and the message template which is shown at the start of the article.

 

Test Case (executed 100K times) Execution Frequency in 1/s Speed Gain (x factor)
Create LogEntry    22 000 -
Format LogEntry TextFormatter     5 300 1
Format LogEntry SmartFormatter   75 000 14

 

More Code

The complete code of my SmartFormatter is shown below:

namespace Logging

{

    /// <summary>

    /// Really fast formatter with equivalent features as TextFormatter of the Enterprise Library

    /// </summary>

    public class SmartFormatter : ILogFormatter

    {

        protected const char TokenStartChar = '{';

 

        protected const string timeStampToken = "{timestamp}";

        protected const string messageToken = "{message}";

        protected const string categoryToken = "{category}";

        protected const string priorityToken = "{priority}";

        protected const string eventIdToken = "{eventid}";

        protected const string severityToken = "{severity}";

        protected const string titleToken = "{title}";

        protected const string errorMessagesToke = "{errorMessages}";

 

        protected const string machineToken = "{machine}";

        protected const string appDomainNameToken = "{appDomain}";

        protected const string processIdToken = "{processId}";

        protected const string processNameToken = "{processName}";

        protected const string threadNameToken = "{threadName}";

        protected const string win32ThreadIdToken = "{win32ThreadId}";

        protected const string activityidToken = "{activity}";

 

        protected const string NewLineToken = "{newline}";

        protected const string TabToken = "{tab}";

 

        protected const string DictionaryStartToken = "{dictionary(";

        protected const string KeyValueStartToken = "{keyvalue(";

        protected const string TimeStampStartToken = "{timestamp(";

        protected const string ReflectedPropertyStartToken = "{property(";

        protected const string EndToken = ")}";

 

        /// <summary>

        /// Key is the Token template e.g. "{timestamp}" and value is a callback function that retrieves the requested string from

        /// a LogEntry object.

        /// </summary>

        protected internal Dictionary<string, Func<LogEntry, string>> myTokenToFunc = new Dictionary<string, Func<LogEntry, string>>();

 

        /// <summary>

        /// Key is the Token Function start e.g. "{dictionary(" and value is a callback function that formats from a

        /// configured template the contents of a LogEntry object.

        /// </summary>

        protected internal Dictionary<string, Func<LogEntry, string, string>> myTokenFormatterToFunc = new Dictionary<string, Func<LogEntry, string, string>>();

 

        // Combined list of MicroFormatters which allow an easy StringBuilder.Append instead of

        // a much slower StringBuilder.Replace which does first search (many times pointless) and

        // then replaces the string which needs to reallocate and copy the buffer

        List<MicroFormatter> myFormatters = new List<MicroFormatter>();

 

        /// <summary>

        /// For brevity we use the C# 3.0 lambda syntax

        /// </summary>

        protected internal virtual void AddWellknownTokens()

        {

            myTokenToFunc.Add(timeStampToken, (LogEntry ent) => { return ent.TimeStampString; });

            myTokenToFunc.Add(messageToken, (LogEntry ent) => { return ent.Message; });

            myTokenToFunc.Add(categoryToken, FormatCategoriesCollection);

            myTokenToFunc.Add(priorityToken, (LogEntry ent) => { return ent.Priority.ToString(); });

            myTokenToFunc.Add(eventIdToken, (LogEntry ent) => { return ent.EventId.ToString(); });

            myTokenToFunc.Add(severityToken, (LogEntry ent) => { return ent.Severity.ToString(); });

            myTokenToFunc.Add(titleToken, (LogEntry ent) => { return ent.Title; });

            myTokenToFunc.Add(errorMessagesToke, (LogEntry ent) => { return ent.ErrorMessages; });

            myTokenToFunc.Add(machineToken, (LogEntry ent) => { return ent.MachineName; });

            myTokenToFunc.Add(appDomainNameToken, (LogEntry ent) => { return ent.AppDomainName; });

            myTokenToFunc.Add(processIdToken, (LogEntry ent) => { return ent.ProcessId; });

            myTokenToFunc.Add(processNameToken, (LogEntry ent) => { return ent.ProcessName; });

            myTokenToFunc.Add(threadNameToken, (LogEntry ent) => { return ent.ManagedThreadName; });

            myTokenToFunc.Add(win32ThreadIdToken, (LogEntry ent) => { return ent.Win32ThreadId; });

            myTokenToFunc.Add(activityidToken, (LogEntry ent) => { return ent.ActivityId.ToString(); });

            myTokenToFunc.Add(NewLineToken, (LogEntry ent) => { return Environment.NewLine; });

            myTokenToFunc.Add(TabToken, (LogEntry ent) => {return "\t"; });

 

            // register token functions

            myTokenFormatterToFunc.Add(DictionaryStartToken, (LogEntry ent, string template) =>

            {

                DictionaryToken token = new DictionaryToken();

                return token.FormatToken(template, ent);

            });

 

            myTokenFormatterToFunc.Add(KeyValueStartToken, (LogEntry ent, string template) =>

                {

                    KeyValueToken token = new KeyValueToken();

                    return token.FormatToken(template, ent);

                });

 

            myTokenFormatterToFunc.Add(TimeStampStartToken, (LogEntry ent, string template) =>

                {

                    TimeStampToken token = new TimeStampToken();

                    return token.FormatToken(template, ent);

                });

 

            myTokenFormatterToFunc.Add(ReflectedPropertyStartToken, (LogEntry ent, string template) =>

                {

                    ReflectedPropertyToken token = new ReflectedPropertyToken();

                    return token.FormatToken(template, ent);

                });

        }

 

        public SmartFormatter()

        {}

 

        public SmartFormatter(string template)

        {

            AddWellknownTokens();

            BuildFormatterList(template);

        }

 

        public string Format(LogEntry entry)

        {

            // allocate string builder can be dynamically sized

            StringBuilder sb = new StringBuilder(500);

 

            // Call formatters in sequence they were entered to build the splitted

            // template piece by piece together again.

            foreach (MicroFormatter fmt in myFormatters)

            {

                fmt.Format(entry, sb);

            }

 

            return sb.ToString();

        }

 

        /// <summary>

        /// Returns the template in between the parentheses for a token function.

        /// Expecting tokens in this format: {keyvalue(myKey1)}.

        /// </summary>

        /// <param name="startPos">Start index to search for the next token function.</param>

        /// <param name="message">Message template containing tokens.</param>

        /// <returns>Inner template of the function.</returns>

        protected virtual string GetInnerTemplate(string tokenFuncStart, int startPos, string message)

        {

            int tokenStartPos = message.IndexOf(tokenFuncStart, startPos) + tokenFuncStart.Length;

            int endPos = message.IndexOf(EndToken, tokenStartPos);

            return message.Substring(tokenStartPos, endPos - tokenStartPos);

        }

 

        /// <summary>

        /// Parse the template and create the list of microformatters. This way we get an ordered list

        /// of things we really have to do instead of trying to blindly format tokens in a template which

        /// may not be present

        /// </summary>

        /// <param name="template"></param>

        internal virtual protected void BuildFormatterList(string template)

        {

            MicroFormatter fmt = new MicroFormatter();

            List<char> curNonTokenString = new List<char>();

            bool isTokenChar = false;

 

            // Go through template and build list of Microformatters

            for(int i = 0; i<template.Length; i++)

            {

                isTokenChar = false;

                if (template[i] == TokenStartChar)

                {

                    // Transform all template variables

                    foreach (string token in myTokenToFunc.Keys)

                    {

                        if (template.IndexOf(token, i) == i)

                        {

                            i += token.Length-1;

                            MicroFormatter formatter = new MicroFormatter();

                            formatter.Prefix = new string(curNonTokenString.ToArray());

                            // Get for this token the callback function which retrieves

                            // from a LogEntry object the resulting dynamic string

                            formatter.Retriever = myTokenToFunc[token];

                            myFormatters.Add(formatter);

                            isTokenChar = true;

                            curNonTokenString.Clear();

                            break;

                        }

                    }

 

                    // Transform all Token functions in the message template

                    foreach (string tokenStart in myTokenFormatterToFunc.Keys)

                    {

                        if (template.IndexOf(tokenStart, i) == i)

                        {

                            string tokenTemplate = GetInnerTemplate(tokenStart,i, template);

                            i += tokenStart.Length + tokenTemplate.Length + EndToken.Length - 1;

 

                            // Create formatter for this token function and store the token function

                            // template inside the formatter

                            MicroFormatter formatter = new MicroFormatter(new string(curNonTokenString.ToArray()),

                                                                          tokenTemplate,

                                                                          myTokenFormatterToFunc[tokenStart]);

                            myFormatters.Add(formatter);

                            isTokenChar = true;

                            curNonTokenString.Clear();

                            break;

                        }

                    }

                }

 

                // if the current template character does not belong to a token variable or function

                // store it later as constant template part in the next MicroFormatter instance

                if (!isTokenChar)

                {

                    curNonTokenString.Add(template[i]);

                }

            }

 

            // the rest of the template is constant

            if (curNonTokenString.Count != 0)

            {

                myFormatters.Add( new MicroFormatter(new string(curNonTokenString.ToArray()), null) );

            }

        }

 

        public static string FormatCategoriesCollection(LogEntry ent)

        {

            StringBuilder categoriesListBuilder = new StringBuilder();

            int i = 0;

            foreach (String category in ent.Categories)

            {

                categoriesListBuilder.Append(category);

                if (++i < ent.Categories.Count)

                {

                    categoriesListBuilder.Append(", ");

                }

            }

            return categoriesListBuilder.ToString();

        }

   }

}


Tuesday, November 20, 2007 #

This is my first test post with Windows Live Writer 2008. I hope it will not mess up when I paste source code ....

But back to business. VS2008 formerly known as Orcas is out! It is just great and has many cool features inside it.

You can download various editions of Visual Studio 2008 at the following locations.

One of the best features are Language Integrated Queries (LINQ). The first question that does arise is if LINQ to Objects is able to replace common for loops in terms of performance in every case. To check it out I did write a very simple test where we filter for specific items in an array and return the filtered list.

List<string> Test = new List<string> { "This", "is", "a", "list", "that", "will", "be", "filtered", "with",
"a", "for", "loop", "and", "LINQ"};

 

List<string> FilterFor(string substr)
{
     List<string> ret = new List<string>();

     foreach (string str in Test)
     {
         if (str.Contains(substr))
             ret.Add(str);
     }

     return ret;
}

 

List<string> FilterForLinq(string substr)
{
     var ret = from str in Test where str.Contains(substr) select str;
     return ret.ToList<string>();
}

 

On my Intel Core Duo 6600 @ 2.40 GHz with 3 GB RAM I get

  First Call in ms Tight Loop in Calls/ms
FilterFor 1 584
FilterForLinq 2 454
FilterForLinq with IEnumerable<string> as return value 1 17543

 

 

The values were measured with the .NET Framework 3.5 RTM in release configuration. It seems that for simple filters we see a small (20%) speed loss if we use LINQ in very tight loops. But when I used a profiler (ANTS 3.0) I did get no speed difference. Since the profiler uses callbacks to get the exact timing we loose data locality and CPU cache benefits which slows the program nearly 100 times down such small effects are not visible anymore.

When we do only create the query but do not execute it we are over 30 times faster than the traditional array based approach. Lazy evaluation is the key to defer the actual work if we do not need all values or do not use them always. Huge speed gains are possible by doing only as much work as need but no more. I am looking forward to PLINQ which could take advantage of my two CPUs.

If you want to get started with more complex queries I recommend the 101 LINQ Samples site to learn what you can do with such queries or MSDN.