Posts
72
Comments
234
Trackbacks
162
June 2010 Entries
Application Verifier – First Steps

Microsoft publishes way too many tools to check all of them out. One pretty cool debugging tool is Application Verifier which you can download here. The official docs mention that it is only meant for unmanaged code. But it does work in mixed mode applications as well so you can try to hunt down your managed heap corruption, double closed handles, … with this tool.

When you start it the first time you select the executable you want to check and then enable from the right menu the checks you want to enable:

image

For most checks you need to start (or attach) the application under a debugger. When a fishy condition is found by Application Verifier it will either cause a debugger break or an exception. Then you can find out with the !avrf (ups did I mention that there is only one true debugger: Windbg out there) debugger extension command what check was triggered.

The only thing that bugged me for quite some time that I could enable checks but when a check did assert I could not use the debugger command !avrf. The documentation (Help – Help) does mention problems very briefly without being helpful:

If symbols for ntdll.dll and verifier.dll are missing, the !avrf extension will generate an error message.

When I tried to use it

windbg FastRegex.exe

I was greeted with the message

0:000> !avrf
Verifier package version >= 3.00
*** ERROR: Module load completed but symbols could not be loaded for C:\Windows\SYSTEM32\vfbasics.dll
*************************************************************************
***                                                                   ***
***                                                                   ***
***    Your debugger is not using the correct symbols                 ***
***                                                                   ***
***    In order for this command to work properly, your symbol path   ***
***    must point to .pdb files that have full type information.      ***
***                                                                   ***
***    Certain .pdb files (such as the public OS symbols) do not      ***
***    contain the required information.  Contact the group that      ***
***    provided you with these symbols if you need this command to    ***
***    work.                                                          ***
***                                                                   ***
***    Type referenced: vfbasics!_AVRF_EXCEPTION_LOG_ENTRY                ***
***                                                                   ***
*************************************************************************
No type information found for `_AVRF_EXCEPTION_LOG_ENTRY'.
Please fix the symbols for `vfbasics.dll'.

How nice. Ok no problem I thought I only have to check my symbol path (correct) and reload the pdbs with .reload –f for all loaded modules and try again. But the result was that the message did not go away although I did have the pdbs loaded. What the heck was going on? Google did lead no results (yes I am not working for MS so I can still use Google ;-)). But luckily I know people who work for MS and know what was wrong.

The solution was to prepend the symbol path with C:\Windows\system32 and then the rest. Why? Application Verifier does install private symbols into the system32 directory. When I did reload the pdbs for the missing symbols due to my malformed symbol path I did get from the MS symbol server the stripped pdbs with only public symbols which are useless for the !avrf extension since it does use internal structures.

I could not believe it but this was the issue. It would be nice to update the docs to mention this pitfall. On my machine with AppVerfier 4.1.1078 I do get the following pdbs installed into the %WINDIR%\system32 directory.

08.02.2010  07:32         3.296.256 appverif.pdb
08.02.2010  07:32         2.419.712 vfbasics.pdb
08.02.2010  07:32         5.172.224 vfcompat.pdb
08.02.2010  07:32         4.344.832 vfLuaPriv.pdb
08.02.2010  07:32           240.640 vfntlmless.pdb
08.02.2010  07:32         3.927.040 vfprint.pdb
08.02.2010  07:32         4.697.088 vfprintpthelper.pdb
08.02.2010  07:32         1.690.624 vrfcore.pdb

After that I could use Application Verifier as the docs state.

Posted On Friday, June 25, 2010 8:52 AM | Feedback (1)
Slow – Faster - Really Fast

Innocent looking things can make a huge difference. One of the dark arts a good programmer should master are regular expressions. You can parse complex data with them very quickly but the syntax of Regular Expressions looks like black magic. At MSDN there are a lot of well documented examples online which is useful when you have forgotten how this quantifier or that grouping expression does work. There is also some advice on performance which can be very helpful in specific occasions. When it comes to optimizing regular expressions there is very little material online. One of the better posts deals with Java Regular Expressions which can help to tune .NET Regular expressions as well.

The vivid readers do already know what the topic for today will be: How can I optimize Regular Expressions?

Suppose you have a log file and you want to parse log messages out of it. You create a regular expression read the file in parse it and display the parsed messages. But if the file size exceed a few MB the viewer takes quite some time to load the messages. You investigate and check how fast the file reading is. At my machine I can read a 20 MB file in 0.2s so reading was obviously not the problem. But the parsing did take over 16s! My first thought was to remove the slow regular expressions and rewrite the parser. But after reading some of the material from the above links I thought that my regular expression could cause an awful amount of backtracking. The log file looks like this:

Message: This is a test message
Id: Id of Message
Severity: Warning
Process Name: C:\Test.exe
Process Id: 4093
Properties: xxxxx

And the Regular expression used to parse the messages

 

        static Regex slow = new Regex("Message: (?<message>.*)\r\n" +
                               "Id: (?<id>.*)\r\n" +
                               "Severity: (?<severity>.*)\r\n" +
                               "Process Name: (?<processname>.*)\r\n" +
                               "Process Id: (?<pid>.*)\r\n" +
                               "Properties: (?<props>.*)\r\n", RegexOptions.Singleline | RegexOptions.Compiled);

The well known trick to compile the regular expression did not help much. But the expression did work fine and I could extract the parts of the message with some code like this:

            string pMsg;
            string id;
            string severity;
            string processName;
            string processId;
            string properties;
 
            Match m = rex.Match(msg);
            if( m.Success)
            {
                    pMsg = m.Groups["message"].Value;
                    id = m.Groups["id"].Value;
                    severity = m.Groups["severity"].Value;
                    processName = m.Groups["processname"].Value;
                    processId = m.Groups["pid"].Value;
                    properties = m.Groups["props"].Value;
 
            }

To try out new regular expressions there is an indispensable tool called Regulator where you can paste regular expressions and test strings into it and it will display the matches directly. Another nice feature is that it can translate any cryptic regular expression into plain English. After optimizing the regular expression I got these numbers:

Did parse 20000 messages in 0,59s 33.898 messages/s
Did parse 20000 messages in 0,09s 235.294 messages/s
Fast version is 7 times faster

If you expect that the regular expression does look dramatically different I must disappoint you. It is nearly the same. But as I said the beginning its the innocent looking things that can make a huge difference.

        static Regex fast = new Regex("Message: (?<message>[^\r\n]*)\r\n" +
                               "Id: (?<id>[^\r\n]*)\r\n" +
                               "Severity: (?<severity>[^\r\n]*)\r\n" +
                               "Process Name: (?<processname>[^\r\n]*)\r\n" +
                               "Process Id: (?<pid>[^\r\n]*)\r\n" +
                               "Properties: (?<props>[^\r\n]*)\r\n", RegexOptions.Singleline | RegexOptions.Compiled);

Did you spot the difference? I did only replace the (?<name>.*) with (?<name>[^\r\n]*) to get a 7 times speedup. Not bad. You might ask why this is the case? The deeper reason is that the .* operator is greedy and does try to match as much as possible. If you would apply the regular expression to a file with many messages it would match only one giant message. To do this it has to scan the entire string while trying to match the other .* expressions as well. By adding [^\r\n]* instead of .* I did limit the expression to one line only. In the worst case scenario the the expression will match at most one line which is exactly what I wanted.

After parsing the messages we can create a message object and stuff the parsed message parts into a object. I did check with my trusty YourKit memory profiler the memory consumption after reading a 20 MB log file and was surprised that I did consume over 50 MB of memory! How can this be? The reason is simple: The log file ASCII but .NET and Windows use UTF-16 as storage mechanism for strings. That does mean that every ASCII character which does take 1 byte on the disk will take 2 bytes in memory! That is bad but we can improve it quite a lot by playing dirty tricks. Why else would you read this stuff? It is obvious that if we look at the data we process very much redundant information is kept in memory. Especially log messages will contain many identical fields like process name, process id, host name, …

This is the first time I can use the well known but never used String.Intern method with a real reason to do so:

            string myHostName;
            public string HostName
            {
                get { return myHostName; }
                set
                {
                    if (value != null)
                    {
                        // Save space by using the space for identical strings only once
                        myHostName = String.Intern(value);
                    }
                    else
                    {
                        myHostName = null;
                    }
                }
            }

 

When we fill our message objects we can play smart and intern the fields which tend to hold a lot of identical data. This little trick did shave off about 10 MB of additional memory. Not bad. But there is more. You remember that .NET strings are stored in the wasteful UTF-16 Encoding. But other encodings do exist. Why not use the UTF-8 encoding which uses only half of the memory?

            byte [] myCompleteMessage;
            public string CompleteMesage
            {
                get { return Encoding.UTF8.GetString(myCompleteMessage); }
                set
                {
                    if (value != null)
                        myCompleteMessage = Encoding.UTF8.GetBytes(value);
                    else
                        myCompleteMessage = null;
                }
            }

This makes sense if the field is not accessed very often. I did keep the complete unparsed message around for a special display mode. By using this trick I could squeeze out another 20 MB. After applying these optimization the application did consume after parsing a 20 MB log file a little less than 20 MB additional CLR memory. That is much better and it did cost me only changing a few property getter and setters. Making the parsing multi threaded is left as an exercise for the reader. Besides that I am running out of good ideas how to make it much faster.

Let me ask you dear reader what was your best performance optimization you are still proud of?

Posted On Tuesday, June 22, 2010 10:24 AM | Feedback (2)
Slow ListView? – Virtual ListView!

Did you ever ask yourself during coding how far you can get with simple constructs and when it is time to use the more advanced but also more complex constructs? When you take the next step it can happen that you discover that the more “complex” solution is even more simple! I had such an aha moment with the Windows Forms ListView. I had one in detailed mode with several thousand rows which did show up very slowly because of the huge number of ListViewItem objects inside it.

image

Nothing fancy but for 20K rows it did take about 5s until something was displayed. The code to display items in list view is straightforward:

        class AccountInfo

    {
        public string AccountType
        {
            get;
            set;
        }
 
        public int AccountBalance
        {
            get;
            set;
        }
 
        public int BuyRating
        {
            get;
            set;
        }
    }

    public partial class cAccounting : Form
    {
        List<AccountInfo> myInfos = new List<AccountInfo>();
        public cAccounting()
        {
            InitializeComponent();
            for (int i = 0; i < 20000; i++)
            {
                myInfos.Add(new AccountInfo
                {
                    AccountType = i % 3 == 0 ? "Good" : "Bad",
                    AccountBalance = i,
                    BuyRating = i*2
                });
            }
 
            ShowItems(myInfos);
        }
 
        void ShowItems(List<AccountInfo> infos)
        {
            cList.SuspendLayout();
            foreach (var acc in infos)
            {
                cList.Items.Add( new ListViewItem(new string [] { acc.AccountType, acc.AccountBalance.ToString(), acc.BuyRating.ToString() }));
            }
            cList.ResumeLayout();
        }

So far so good. Now I wanted to make it faster. Let´s see how things do change when we make the ListView virtual. This little change does cause the list view to fetch only the visible items. Only visible ListViewItems are fetched on demand. If you have 10 million rows but display only 100 of them then only 100 ListViewItem objects will be used.

image

We no longer have the Items collection which would throw exceptions when we would try to use it. The SelectedItems collection is also gone. Instead the SelectedIndices is your friend. Well nearly I found that my items were deselected without beeing notified by the ItemSelectionChanged event so I stored the last selected item index by myself when the SelectedIndicies collection was not empty.

But back to our virtual list view: We only need to register to the RetrieveVirtualItem event and set the VirtualListSize property to the number of items our ListView will contain. That´s all.

image

The contract is that the event tells you for which ItemIndex in the list view wants to draw a ListViewItem and you set the Item property correctly.

 
        private void ShowItemsVirtual(List<AccountInfo> infos)
        {            
            cList.VirtualListSize = infos.Count; // Set number of items in list view
        }
 
        private void cList_RetrieveVirtualItem(object sender, RetrieveVirtualItemEventArgs e)
        {
            var acc = myInfos[e.ItemIndex];
            // hand out new item
            e.Item = new ListViewItem(
                new string [] 
                { acc.AccountType, acc.AccountBalance.ToString(), acc.BuyRating.ToString() })
                { Tag = acc }; // Set Tag object property to our actual AccountInfo object
        }

 

That was not hard. And we get a display speed increase of over a factor 5 with little effort. The startup time has dropped from 5s to below 1s. But the best thing is that we no longer have set the state of our ListViewItems when we want to e.g. color them. Previously we needed to traverse the whole Items collection and set the e.g. BackColor property for the required ListViewItems by hand. Now we can store the state in one variable and every time the list view is redrawn we hand out ListViewItem objects with the updated state. Bye bye list traversing and setting properties.

To update the color of every second row we need to insert only one additonal line in our RetrieveVirtualItem callback:

 

image

 

        private void cList_RetrieveVirtualItem(object sender, RetrieveVirtualItemEventArgs e)
        {
            var acc = myInfos[e.ItemIndex];
            e.Item = new ListViewItem(
                new string [] 
                { acc.AccountType, acc.AccountBalance.ToString(), acc.BuyRating.ToString() })
                { Tag = acc,
                  BackColor = e.ItemIndex % 2 == 0 ? Color.Aqua : Color.White
                }; 
        }

To display the colored items in the current view we need to call Refresh to force the list view to fetch our new ListViewItems. We have full control which item is on which position rendered. Now it is time for sorting by different columns. In virtual mode the usual ListViewItem sorter is not supported so we have to roll our own. I find the code below actually easier to write because I do know that the data is stored in List<AccountInfo> myInfos and we only need to sort the list by the current sort column ascending or descending order.

       // hold state for each colum which sort order should be used
        Dictionary<int,SortOrder> mySortOrderMap = new Dictionary<int,SortOrder>
        {
            {0, SortOrder.None },
            {1, SortOrder.None },
            {2, SortOrder.None }
        };
 
        // define comparer for each column
        Dictionary<int, Comparison<AccountInfo>> myComparers = new Dictionary<int, Comparison<AccountInfo>>
        {
            {0, (a,b) => a.AccountType.CompareTo(b.AccountType)},
            {1, (a,b) => a.AccountBalance.CompareTo(b.AccountBalance)},
            {2, (a,b) => a.BuyRating.CompareTo(b.BuyRating)}
        };
 
        // State transitions from one sort order to another
        Dictionary<SortOrder, SortOrder> myToggle = new Dictionary<SortOrder, SortOrder>
        {
            {SortOrder.None, SortOrder.Ascending },
            {SortOrder.Ascending, SortOrder.Descending},
            {SortOrder.Descending, SortOrder.Ascending}
        };
 
        private void cList_ColumnClick(object sender, ColumnClickEventArgs e)
        {
            var newSortOrder = myToggle[mySortOrderMap[e.Column]];
            mySortOrderMap[e.Column] = newSortOrder;     // Store sort order for current column
            SortBy(newSortOrder, myComparers[e.Column]); // Do the actual sorting
        }
 
        void SortBy(SortOrder order, Comparison<AccountInfo> comparer)
        {
            myInfos.Sort((a, b) =>
                {
                    int lret = comparer(a, b); // Do the actual comparison
                    if (order == SortOrder.Descending) // reverse when necessary
                    {
                        lret *= -1;
                    }
                    return lret;
                });
 
            cList.Refresh();
        }

This was much easier than I thought and what I especially like about virtual list views is that I do not have to worry about the state of our rendered ListViewItems. Instead the “factory” event will provide items with the newest state directly to the view once we invalidated the list view by calling Refresh.

Posted On Monday, June 21, 2010 9:56 AM | Feedback (4)
ApiChange Corporate Edition

In my inital announcement I could only cover a small subset what ApiChange can do for you. Lets look at how ApiChange can help you to fix bugs due to wrong usage of an Api within a fraction of time than it would take normally. It happens that software is tested and some bugs show up. One bug could be …. : We get way too man log messages during our test run.

Now you have the task to find the most frequent messages and eliminate the Log calls from the source code. But what about the myriads other log calls? How can we check that the distribution of log calls is nearly equal across all developers? And if not how can we contact the developer to check his code?

ApiChange can help you too connect these loose ends. It combines several information silos into one cohesive view. The picture below shows how it is able to fill the gaps.

image

The public version does currently “only” parse the binaries and pdbs to give you for a –whousesmethod query the following colums:

image

If it happens that you have Rational ClearCase (a source control system) in your development shop and an Active Directory in place then ApiChange will try to determine from the source file which was determined from the pdb the last check in user which should be present in your Active Directory. From there it is only a small hop to an LDAP query to your AD domain or the GC (Global Catalog) to get from the user name his

  • Full name
  • Email
  • Phone number
  • Department
  • ….

ApiChange will append this additional data all of your query results which contain source files if you add the –fileinfo option. As I said this is currently not enabled by default since the AD domain needs to be configured which are currently only some hard coded values in the SiteConstants.cs source file of ApiChange.Api.dll.

Once you got this data you can generate metrics based on source file, developer, assembly, … and add additional data by drag and drop directly into the pivot tables inside Excel. This allows you to e.g. to generate a report which lists the source files with most log calls in descending order along with the developer name and email in the pivot table. Armed with this knowledge you can take meaningful measures e.g. to ask the developer if the huge number of log calls in this source file can be optimized.

I am aware that this is a very specific scenario but it is a huge time saver when you are able to fill the missing gaps of information. ApiChange does this in an extensible way.

namespace ApiChange.ExternalData
{
    public interface IFileInformationProvider
    {
        UserInfo GetInformationFromFile(string fileName);
    }
}

It defines an interface where you can implement your custom information provider to close the gap between source control system and the real person I have to send an email to ask if his code needs a closer inspection.

Posted On Wednesday, June 09, 2010 10:23 AM | Feedback (1)
ApiChange Is Released!

I have been working on little tool to simplify my life and perhaps yours as developer as well. It is basically a command line tool that allows you to execute queries on your compiled .NET code base. The main purpose is to find out how big the impact of an api change would be if you changed this or that.  Now you can do high level operations like

  • Diff public types for breaking changes.
  • Who uses a method?
  • Who uses a type?
  • Who uses implements an interface?
  • Who references me?
  • What format has the binary  (32/64, Managed C++, Pure IL, Unmanaged)?
  • Search for all event subscribers and unsubscribers.
    • A unique feature is to check for event subscription imbalances. Forgotten event subscriptions are the 90% cause of managed memory leaks. It is done at a per class level. If one class does subscribe to one event more often than it does unsubscribe it is treated as possible event subscription imbalance.
  • Another unique ability is to search for users of string literals which allows you to track users of a string constant which is not possible otherwise.
  • For incremental builds the ShowRebuildTargets command can be used to identify the dependant targets that need a rebuild after you did compile one assembly. It has some heuristics in place to determine the impact of breaking changes and finds out which targets need to be recompiled as well.
  • It has a ton of other features and a an API to access these things programmatically so you can build upon these simple queries create even better tools. Perhaps we get a Visual Studio plug in?

You can download it from CodePlex here. It works via XCopy deployment. Simply let it run and check the command line help out.

The best feature in my opinion is that the output of nearly all commands can be piped to Excel for further analysis. Since it does read also the pdbs it can show you the source file name and line number as well for all matches.

The following picture shows the output of a –WhousesType query. The following command checks where type from BaseLibraryV1.dll are used inside DependantLibV1.dll. All matches are printed out with the reason and matching item along with file and line number. There is even a hyper link to the match which will open Visual Studio.

ApiChange -whousestype "*" BaseLibraryV1.dll -in DependantLibV1.dll –excel

The "*” is the actual query which means all types. The syntax is the same like in C# just that placeholders are allowed ;-). More info's can be found at the Codeplex Documentation.

 

image

 

The tool was developed in a TDD style manner which means that it is heavily tested and already used by a quite large user base inside the company I do work for. Luckily for you I got the permission to make it public so you take advantage of it.

It is fully instrumented with tracing. If you find bugs simply add the –trace command line switch to find out what is failing and send me the output.

How is it done?

Your first guess might be that it uses reflection. Wrong. It is based on Mono Cecil a free IL parser with a fantastic API to access all internals of a managed assembly. The speed is awesome and to make it even faster I did make the tool heavily multi threaded. The query above did execute in 1.8s with the Excel output. On a rather slow machine I can analyze over 1500 assemblies in less than 40s with a very low memory consumption. The true power of Mono Cecil is that I can load an assembly like any other data file. I have no problems unloading a file but if I would have used reflection I would need to unload a whole AppDomain just to get rid of one assembly in my memory.

Just to give you a glimpse how ApiChange.Api.dll can be used I show you one of the unit tests:

 

        public void Can_Find_GenericMethodInvocations_With_Type_Parameters()
        {
            // 1. Create an aggregator to collect our matches
            UsageQueryAggregator agg = new UsageQueryAggregator();
 
            // 2. This is the type we want to search for. Load it via the type query
            var decimalType = TypeQuery.GetTypeByName(TestConstants.MscorlibAssembly, "System.Decimal");
 
            // 3. register the type query which searches for uses of the Decimal type
            new WhoUsesType(agg, decimalType);
 
            // 4. Search for all users of the Decimal type in the DependandLibV1Assembly
            agg.Analyze(TestConstants.DependandLibV1Assembly);
 
            // Extract matches and assert
            Assert.AreEqual(2, agg.MethodMatches.Count, "Method match count");
            Assert.AreEqual("UseGenericMethod", agg.MethodMatches[0].Match.Name);
            Assert.AreEqual("UseGenericMethod", agg.MethodMatches[1].Match.Name);
        }

Many thanks go from here to Jb Evian for the creation of Mono.Cecil. Without this fantastic piece of code it would have been much much harder. There are other options around like the Common Compiler Infrastructure  Metadata Api which should do the same thing but it was not a real option since the Microsoft reader did fail on even simple assemblies (at least in September 2009 this was the case). Besides this I found the CCI Apis much harder to use. The only real competitor was Reflector which does support many things but does not let me access his cool high level analyze commands. So I decided to dig into the IL specs and as a result you can query your compiled binaries from the command line or programmatically.

The best thing is you try it out for yourself and give me some feedback what you miss. If you want to contribute or have a cool idea what should be added drop me a mail at A Kraus1@___No Spam____@gmx.de. There is much more inside the tool I did not talk about it (yet).

Posted On Thursday, June 03, 2010 1:17 AM | Feedback (1)