George Mamaladze

.NET C# tips, tricks, tweaks. Effective use of data structures and algorithms. Clean code.
posts - 11 , comments - 18 , trackbacks - 0

Thursday, May 3, 2012

Tiny UML Sequence Diagram Markup Language for Agile Developer Documentation

I hate overdocumented software components and frameworks. There are top 3 reasons for that:

  1. Only the code tells the truth(programmer.97things.oreilly.com/wiki/index.php/Only_the_Code_Tells_the_Truth) The good component expresses it’s intended way of usage through good structure, expressive naming. Ideally it is designed the way which does not allow wrong usage.
  2. Any documentation tends to became outdated. Reading such a documentation can be misleading and frustrating.
  3. And finally, to be honest, who reads the documentation first? No one does. Usually we come back to documentation after we get stuck. While reading it we understand that all the facts that are described we’v already discovered our self, but the point we are stumbled over is not covered at all.

What is a GOOD documentation?
That’s why it’s important having a good documentation. A good documentation must:

  1. … cover only aspects which can not be easily derived from code, interface or intellisense. Duplication is waste.
  2. … be easy to maintain. Wiki like public knowledge repositories or Q&A platforms like Stackoverflow are good examples. They are easy to edit – if you see and outdated or wrong post, go ahead and fix it!
  3. … be about problem solving, not restating the facts.

As an example there are over 20 Stream classes in .NET with at least 3 levels of inheritance depth. Let’s say a component I’m going to use requires a Stream as input. What I need is a kind of guidebook to help me finding the right one based on context of my application. All I want to know is that somewhere there is a class named MemoryMappedViewStream derived from UnmanagedMemoryStream (which is derived from Stream) that meets my requirements. But what I get as a first hit in MSDN search is a detailed method descrition of string. Basically the same thing I see when I press dot key in Visual Studio and intellisense box comes up.

Diagrams – pro and contra

+++++++++++++++++++++++++++++++
To get to the point – the central message of this post is that one diagram tells you more than 100 auto generated documentation pages.

For instance a class diagram showing inheritance hierarchy and relationships between classes could be a great help to get an overview about available functionality and make right choice. (See an example with Stream above). To win the same information from code you will probably need hours of jumping and navigating.

A sequence diagram is another brilliant sample. Due to many indirections, wrapping, tiny classes and methods (which are very good!) it is often not too easy to read a call stack. Sometimes you are looking at a huge call stack trying to understand what’s really going on. What you wish is a high level view on major players in some particular scenario and the way they interact, talk to each other.  Having a sequence in such a moment is a great help.

Diagrams are also documentation. Well we’v discovered that they are necessary, because they cover aspects which can not be really easy read out of code. So point 1 from our 3 is satisfied. But what about maintainability?

- – - – - – - – - – - – - – - – - – - – - – - – 
To create a diagram we usually use some modeling tool. I used Enterprise Architect for a long time but now I am a fan of Visual Studio modeling capabilities (unfortunatelly available only with Ultimate edition).  So:

  • You need a modeling tool to create a diagram.
  • If a community should maintain the documentation (the Wiki approach), ALL of contributors must have the SAME modeling tool.
  • If a contributor wants to update or adjust the diagram slightly he needs a SOURCE modeling project for that.
  • You have an export / import publishing overhead with diagrams. You need to export a diagram you designed to some supported format and include it in documentation, typically by uploading an image. Well … and you have to do it every time you make an adjustment.

Maintainability overhead is probably the reason why open source projects having excellent wiki-s are using diagrams so rarely.

The Solution

The solution is as simple as elegant. It comes initially from authors of http://www.websequencediagrams.com/. The main idea is to generate the diagram from a lightweight markup textual diagram description language. As an example following code

title Hello 
participant Alice
Alice->Bob : Hi!
activate Bob
Bob-->Alice : Hey!
deactivate Bob

would generate this diagram

We are gone even one step further. Our open source project KangaModeling at kangamodeling.codeplex.com provides a web server application which generates diagrams on the fly. So you can include sequence diagrams in your Wiki articles, blog posts or web sites with just several lines of markup.

.. you can include sequence diagrams in your Wiki articles, blog posts or web sites with just several lines of markup.

Placing the code

Place your markup code on the page and surround it with <pre> tag. Set class attribute to kanga.sd.

<pre class="kanga.sd">
    title Hello World
    Alice->Bob: Hi!
    activate Bob
    Bob-->Alice: Hey!
    deactivate Bob
</pre>
   

Making it work

Finally, to get the whole thing to render properly on the page, you have to add JavaScript to the page.

 <script type="text/javascript" src="http://kangamodeling.org/js/kangasd.js"></script>

For optimal result, place this code at the very end of your page. A sequence deiagram will appair in place of your markup code.

For internet posts you can use our publicly available web service at kangamodeling.org.
For intranet solutions you are welcome do download ASP.NET web service code from kangamodeling.codeplex.com. Installing it on any IIS will prevent sensibly information to leave your intranet.

We are going to keep on improving and providing additional features. As a next big step we plan to provide you with language and service to generate also class diagrams.

Go on and try it out. Join us if you like KangaModeling. Help us by contributing or reporting issues and wishes.

Posted On Thursday, May 3, 2012 6:34 PM | Comments (0) | Filed Under [ C# Language .NET Framework UML Sequence Diagram Developer Documentation Wiki ]

Thursday, September 22, 2011

Just launched another open source project QrCode.Net at qrcodenet.codeplex.com

 Download binaries and application – 32K
 Source code – 99K

 

Background

Recently I was looking for a .NET implementation of QR code generator. Most of components either use online services to generate and/or recognize QR code or the implementation was not “good enough” for my purposes. The most popular and very powerful Java implementation comes from google’s open source project code.google.com/p/zxing called ZXing (Zebra Crossing => Z=Zebra + X=Cross + ing).

There I found a one-to-one c# port of earlier version. The project is focusing on additional features and further development in Java so it seems that no one is taking care of c# branch.

Thus I have decided to set-up this project QRCode.Net  qrcodenet.codeplex.com.

As a start point I took a straight forward c# port of google’s QR code implementation from ZXing project.
I wrote a wrapper around and a demo application which is able to generate QR code from text as you type and save it to file.

In addition it contains a very naive and simple implementation of Artistic QR code generation. At the highest error correction level it is possible to create artistic QR codes that still scan correctly, but contain intentional errors to make them more readable or attractive to the human eye, as well as to incorporate colors, logos and other features into the QR code block.

Project TODOs and Plans

I think this project would be a good playground for:

  1. Understanding QR code form inside
  2. Designing and implementing efficient and elegant data structures and algorithms. I have looked at the ZXing codebase, especially encoding. The code is good, but there are still many things ti improve. Furthermore the automated Java to c# converter produced a lot of mess.
  3. Refactoring and rewriting code in a TDD manner – existing implementation can be used as an excellent refactoring initial point. In some cases I think I would prefer to rewrite and use old classes a reference implementations and run my test again both the old and the new code.

Posted On Thursday, September 22, 2011 6:13 AM | Comments (2) | Filed Under [ C# Language ]

Tuesday, August 9, 2011

Parallel Programing, PLINQ and Globalization

I’m going to start with a simple code snippet which sorts an array of strings using LINQ.

 

1 IEnumerable<string> line = new[] {"Z","A","Ä"};
2 var result = line.OrderBy(letter => letter);
3 Console.WriteLine("{0}", string.Join(" ", result));

 

The result might look like this:

 

A Ä Z

 

… or not. It depends on the thread culture the sorting is running in. The string order is culture aware (unlike char order which is culture invariant), so if we switch for instance on one of the Norwegian cultures by adding this line Thread.CurrentThread.CurrentCulture = CultureInfo.GetCultureInfo("nn-NO"); before calling sort, we will get following output instead:

 

A Z Ä

 

As next I extended my code snippet to create 4 arrays and sort each of them parallely.

 

01 Thread.CurrentThread.CurrentCulture = CultureInfo.GetCultureInfo("nn-NO");
02 Console.WriteLine("Main thread-{0} \t Culture-'{1}'", Thread.CurrentThread.ManagedThreadId, Thread.CurrentThread.CurrentCulture);
03 Console.WriteLine(new string('-', 80));
04  
05 List<string[]> list = new List<string[]>();
06 for (int i = 0; i < 3; i++)
07 {
08     list.Add(new[] { "Ä", "A", "Z" });
09 }
10  
11 var result =
12     list
13         .Select(
14             line => line
15                 .OrderBy(letter => letter));
16  
17 Parallel.ForEach(result,
18     line =>
19         Console.WriteLine(
20             "Thread-{0} \t Culture-'{1}' \t {2}",
21             Thread.CurrentThread.ManagedThreadId,
22             Thread.CurrentThread.CurrentCulture,
23             string.Join(" ", line)));
24  
25 Console.WriteLine();
26 Console.WriteLine("Press any key to quit");
27 Console.ReadKey();

 

The result looks like this:

 

Main thread-1    Culture-'nn-NO'
------------------------------------------------
 
Thread-1         Culture-'nn-NO'         A Z Ä
Thread-5         Culture-'de-DE'         A Ä Z
Thread-3         Culture-'de-DE'         A Ä Z
Thread-4         Culture-'de-DE'         A Ä Z
 
Press any key to quit

 

Line 4 sorting order differs from line 5. The sorting was splited up into 4 threads one main and 3 new threads.

All three newly created threads got the default culture of my system – not the culture of the main thread which was set manually.

The culture is a property of the executing thread. When a thread is started, its culture is initially determined by using GetUserDefaultLCID from the Windows API. There is no way that I know manipulate this.

The same result if you use PLINQ syntax:

 

01 list
02     .AsParallel()
03     .Select(
04         line => line
05             .OrderBy(letter => letter))
06     .ForAll(
07         line =>
08             Console.WriteLine(
09                 "Thread-{0} \t Culture-'{1}' \t {2}",
10                 Thread.CurrentThread.ManagedThreadId,
11                 Thread.CurrentThread.CurrentCulture,
12                 string.Join(" ", line)));

 

The same query without parallel execution delivers consistent output, all four sequences are sorted in the same order.

The solution is to pass a specific culture aware comparer across into the OrderBy method.

 

01 var norvegianIgnoreCaseComparer = StringComparer.Create(CultureInfo.GetCultureInfo("nn-NO"),false);
02 list
03     .AsParallel()
04     .Select(
05         line => line
06             .OrderBy(letter => letter, norvegianIgnoreCaseComparer))
07     .ForAll(
08         line =>
09             Console.WriteLine(
10                 "Thread-{0} \t Culture-'{1}' \t {2}",
11                 Thread.CurrentThread.ManagedThreadId,
12                 Thread.CurrentThread.CurrentCulture,
13                 string.Join(" ", line)));

 

Well, but what about foreach and LINQ legacy code which can be paralelized with simple replacement of a single line by Parallel.ForEach() or adding AsParallel(). The result might be unpredictable and difficult to figure out. So if I would be the author of .NET or PLINQ I would take over the culture of the main thread into the child threads, thus the data come from the main thread, the split-up takes place implicitly and in most cases results are merged back into the main thread back and used there.

Similar issues might occur in queries using any of culture aware calculations, for instance DateTime formatting and parsing.

Posted On Tuesday, August 9, 2011 6:15 AM | Comments (0) | Filed Under [ C# Language ]

Monday, August 8, 2011

Edulinq an excellent LINQ+TDD guide

Recently I stumbled across a question on stackowerflow.com about Edulinq. Its a series of blog posts turned at the end into a free eBook. Downloaded to my eBook reader and had a lot of fun in subway.

There where not to much to discover inside of LINQ for me, but it could help a novice to understand the LINQ “magic” and use it efficiently.

Another very important point for me was the test driven approach. You could take this book as a guide for test driven design and development.

Reading this book chapter after chapter you will see how TDD helps you to come to a best API decision when working on a library. It demonstrates that on sometimes you should compromise functionality or performance to increase usability and maintain necessary level of protection.

Every chapter follows the same structure contains a lot of test and productive code, builds on top of the previous and makes a basement for the next one – remembers me the double linked list.

A must to read book for TDD + LINQ fans.

 

Posted On Monday, August 8, 2011 10:03 PM | Comments (0) | Filed Under [ C# Language ]

Tuesday, August 2, 2011

"Cancel me if you can" or PLINQ cancelability & responsiveness in WinForms

Download Source Code

Challenge

Just had the first chance to apply PLINQ (Parallel LINQ) to the real task. When I am playing around with some new technology everything works fine, the real understanding comes along with challenges in a real use case. Currently I am working on application which processes large amount of text data gathering statistics on word occurrences (see: Source Code Word Cloud). Here what the simplified core of my code is doing.

  • Enumerate through all files with *.txt extension.
  • Enumerate through words in each text files.
  • Group by word and count occurrences.
  • Sort by occurrences.
  • Output top 20.

Everything worked fine with LINQ. I wrote a prototype console application with a single LINQ query. ... and ta-daa! Just adding one line turned it into PLINQ.

public static void Main(string[] args) {
    var result = Directory
        .EnumerateFiles(@"c:\temp", "*.txt", SearchOption.AllDirectories)
        .AsParallel()
        //.WithDegreeOfParallelism(1..63)
        .SelectMany(File.ReadLines)
        .SelectMany(ReadWords)
        .GroupBy(word =>; word, (word, words) =>; new Tuple(words.Count(), word))
        .OrderByDescending(occurrencesWordPair =>; occurrencesWordPair.Item1)
        .Take(20);

    foreach (Tuple tuple in result) {
        Console.WriteLine(tuple);
    }

    Console.Write("Press any key to quit."); Console.ReadKey();
}

private static IEnumerable ReadWords(string line) {
    StringBuilder word = new StringBuilder();
    foreach (char ch in line) {
        if (char.IsLetter(ch)) { word.Append(ch); }
        else {
            if (word.Length == 0) continue;
            yield return word.ToString();
            word.Clear();
        }
    }
}


Finally all four cores of my PC and the fan where busy processing dozens of megabytes of free eBooks in my temp folder I have prepared for testing. Moving to PLINQ brought me significant performance boost. But adopting this one line into the WinForms application was a different challenge.

Frustration

... cancelability and responsiveness during long running queries were lost. It seems that the OrderBy query was synchronizing data back into main thread and windows messages are not processed. In the examle below I am demonstrating my implementation of cancelation according to MSDN How to: Cancel a PLINQ Query which does not worked :(

public Form1()
{
    InitializeComponent();
    m_CancellationTokenSource = new CancellationTokenSource();
}

private readonly CancellationTokenSource m_CancellationTokenSource;

private void buttonStart_Click(object sender, EventArgs e)
{
    var result = Directory
        .EnumerateFiles(@"c:\temp", "*.txt", SearchOption.AllDirectories)
        .AsParallel()
        .WithCancellation(m_CancellationTokenSource.Token)
        .SelectMany(File.ReadLines)
        .SelectMany(ReadWords)
        .GroupBy(word => word, (word, words) => new Tuple(words.Count(), word))
        .OrderByDescending(occurrencesWordPair => occurrencesWordPair.Item1)
        .Take(20);

    try {
       foreach (Tuple tuple in result) {
            Console.WriteLine(tuple);
        }
    }
    catch (OperationCanceledException ex) {
        Console.WriteLine(ex.Message);
    }
}

private void buttonCancel_Click(object sender, EventArgs e)
{
   m_CancellationTokenSource.Cancel();
}

private static IEnumerable ReadWords(string line) { ... }

I was not able to click Cancel button, move the form around, do anything at all until the whole query where finished. That's was not the perfect cancelation behavior - "Yes you can cancel it but you can not click button which does that". I have asked the question colleagues and posted at stackowerflow. Everyone gave me the same answer - call the query in another thread. That was not an option form me, if I wanted to clutter my code with threading staff I would do it without PLINQ. PLINQ allows you to keep your logic free from parallelism / synchronization overhead keeping your code readable and closer to business not to techniques. (see proposed solutions at stackowerflow)

The Ahaa Moment

While waiting for THE expert's opinion from somewhere in cyberspace I came up with my own which I like pretty much: I wrote two extension methods:

public static class CallbackExtensions
{
    public static ParallelQuery ProcessWindowsMessages(this ParallelQuery source)
    {
        return source.Select(
            item =>
                {
                    Application.DoEvents();
                    Thread.Yield();
                    return item;
                });
    }

    public static ParallelQuery WithCallback(this ParallelQuery source, ISynchronizeInvoke target, Action callback)
    {
        return source.Select(
            item =>
                {
                    Application.DoEvents();
                    Thread.Yield();
                    target.Invoke(callback, new object[] {item});
                    return item;
                });
    }
}

The first one just passes control to the main theread allowing standard COM and SendMessage pumping. The second one allows to invoke any action into the main therad with item argument which is currently processed. You can put them anywhere in your query depending how responsive your query should be. The example below passes control back to main after processing each file.

var result = Directory
    .EnumerateFiles(@"c:\temp", "*.txt", SearchOption.AllDirectories)
    .AsParallel()
    .ProcessWindowsMessages()
    .WithCancellation(m_CancellationTokenSource.Token)
    .SelectMany(File.ReadLines)
    .SelectMany(ReadWords)
    .GroupBy(word => word, (word, words) => new Tuple(words.Count(), word))
    .OrderByDescending(occurrencesWordPair => occurrencesWordPair.Item1)
    .Take(20);

You can even use the second extension method to perform progress indication and staff like that. In my example I am showing every word processed in caption. Yes I know - this is highly inefficient but for demo purposes is ok. You can download my solution and play with position of the line. For instance putting it beside file query will show files in caption and will have moderate responsiveness.

var result = Directory
    .EnumerateFiles(@"c:\temp", "*.txt", SearchOption.AllDirectories)
    .AsParallel()
    .WithCancellation(m_CancellationTokenSource.Token)
    .SelectMany(File.ReadLines)
    .SelectMany(ReadWords)
    .WithCallback(this, word => { this.Text = word; })
    .GroupBy(word => word, (word, words) => new Tuple(words.Count(), word))
    .OrderByDescending(occurrencesWordPair => occurrencesWordPair.Item1)
    .Take(20);

Negative effects

Using one of my extension methods you are creating a bottleneck bringing all of your parallel treads back into main just to paint one word into form caption. So the best solution would be to slowdown callbacks as needed. Let's say you need to update a progress bar. It would be enough to do it 10 times in a second. All other calls will be trivially return. Or you may decide to pass every 1000-Th word to display on the screen.

Posted On Tuesday, August 2, 2011 10:37 AM | Comments (0) | Filed Under [ C# Language ]

Wednesday, July 20, 2011

The story of QuadTree perforamnce optimization

Download QuadTreeOptimization.zip c# project

Background

Today I'd like to share with you interesting experiences I made last several evenings when working on one of my open source hobby projects - Source Code Cloud Generator

http://sourcecodecloud.codeplex.com/

One of the tasks was arranging of nonintersecting rectangles on a 2D surface in certain pattern. For quick 2D collision detection (rectangle intersection) I decided to use QuadTree data structure [see: http://en.wikipedia.org/wiki/Quadtree]. I found one good (clean) open source c# implementation from Michael Coyle [many thanks!] at http://www.codeproject.com/KB/recipes/QuadTree.aspx. I started using it without major modifications.

After the first version of my app was done, I have started optimizing different aspects. One of performance highlights was drawing speed. About 50% of the overall drawing & layouting time was consumed by QuadTree. So I decided to OPTIMIZE this particular component.

Optimization

Optimization is very similar to refactoring - you are modifying code to make it better (in this case faster) preserving exactly the same behaviour. Very good differentiation between refactoring, bug-fix and optimization, was made by Robert C Martin in his book "Working Effectively with Legacy Code". All these three hold your functionality invariant - each of them trying to improve different aspects of code. In this case I wanted to improve the speed of the algorithm.

Approach

To be precise my goal was to optimize the QuadTree data structure in my particular use case - not making it efficient in general for all sorts of use cases. The overall scenario was quite long running and very UI and file system dependant. My application was gathering data by processing large amount of files and drawing visualization on screen. The QuadTree was somewhere in the middle of this chain.

The challenge was to isolate the QuadTree component while still working with REAL DATA in order to make optimization progress visible and measurable.

Step 1

So I built in simple logging functionality into two main methods of my component. It was recording all method calls with parameters and return values in common semicolon delimited text file.

 
HasContent;766,7109;469;312,5781;53,99999;False
Insert;766,7109;469;312,5781;53,99999;True
HasContent;798,7172;474,5293;248,5657;42,94143;True
HasContent;796,9323;474,7823;248,5657;42,94143;True
...

Then I have generated a set of log files which where result of several different runs of the real application. Logs very in size from 4 to 16MB and contained many thousand records.

Step 2

I wrote a set of regression tests, conservation tests, preserving basic behaviour. Like inserting a rectangle covering whole surface, inserting rectangle of zero size, etc.

Step 3

I have created a unit test which was fetching its test data from test case generator. The test case generator was accessing CSV file line by line generating test records out of it. Thus I was able to playback my log files inside NUnit.

This is the test itself:

        [Test]
        [TestCaseSource(typeof(TestDataReader), "TestCases")]
        public bool Test(string methodName, RectangleF rectangle)
        {
            switch (methodName)
            {
                case "Insert":
                    m_InsertStopwatch.Start();
                    QuadTree.Insert(new LayoutItem(rectangle));
                    m_InsertStopwatch.Stop();
                    return true;

                case "HasContent":
                    m_QueryStopwatch.Start();
                    var result = QuadTree.HasContent(rectangle);
                    m_QueryStopwatch.Stop();
                    return result;
            }

            Assert.Fail("Unknown method {0}", methodName);
            return false;
        }

and appropriate test case generator:

    public static class TestDataReader
    {
        public static IEnumerable TestCases
        {
            get
            {
                using (var logFile = File.OpenText("TestData.log"))
                {
                    while (!logFile.EndOfStream)
                    {
                        string line = logFile.ReadLine();
                        var parseResult = ParseTestCase(line);
                        yield return parseResult;
                    }
                }
            }
        }

        private static TestCaseData ParseTestCase(string line)
        {
         ...
         }
    }

Step 4

I have created two copies of QuadTree - the original one and one to be optimized. Also I made two test fixtures for each of them deriving from the base test fixture. Only difference was that they where working with instances of two different QuadTree implementations using the same interface.

    [TestFixture]
    public class ExhaustiveTestOriginal : ExhaustiveTestBase
    {
       
        [TestFixtureSetUp]
        public override void SetUp()
        {
            base.SetUp();
            this.QuadTree = new QuadTree<LayoutItem>(new RectangleF(new PointF(0, 0), new SizeF(2000, 2000)));
        }

        [TestFixtureTearDown]
        public override void TearDown()
        {
            base.TearDown();
        }
    }

Step 5

I was making optimizations I supposed to improve performance one by one, only one at a time and was able to verify the impact immediately. After several experiments I ended up with a solution which was almost two times faster then the original one.

Console output:

C:\>nunit-console.exe QuadTreeOptimization.dll /run=QuadTreeOptimization.Optimized /nodots
NUnit-Console version 2.6.0.11089

Selected test(s): QuadTreeOptimization.Optimized

Insert: 00:00:00.0049159
Query: 00:00:00.7418825
Total: 00:00:
00.7467984

Tests run: 123022, Errors: 0, Failures: 0, Inconclusive: 0, Time: 50,5165949 seconds
  Not run: 0, Invalid: 0, Ignored: 0, Skipped: 0


C:\>nunit-console.exe QuadTreeOptimization.dll /run=QuadTreeOptimization.Original /nodots
NUnit-Console version 2.6.0.11089

Selected test(s): QuadTreeOptimization.Original

Insert: 00:00:00.0087474
Query: 00:00:01.2876519
Total: 00:00:
01.2963993

Tests run: 123022, Errors: 0, Failures: 0, Inconclusive: 0, Time: 50,1728383 seconds
  Not run: 0, Invalid: 0, Ignored: 0, Skipped: 0

Here my optimization "changelog":
* Eliminating almost duplicate code.
* Replacing result accumulation in a List<> by yield return iterator.
* Replacing a List<> representing the node list with fixed length of 4 with an array.
* Replacing a List<> representing the node list with unlimited growing length with Stack<> - which is basically singly linked list.

It was fun having immediate verification after each refinement attempt. Several times it helped me to figure out very quickly that I was barking up the wrong tree.

Examples:
My first idea was to improve performance by balancing the tree. I started reading about sophisticated balancing strategies, until I found out that my QuadTree filled with REAL data is nearly perfectly balanced.
Then I used LinkedLists<> instead of lists everywhere which seemed to be intuitively more suitable. And again MEASURE-MODIFY-MEASURE-ROLLBACK. 


Interesting Findings

At the end I figured out that using of [TestCaseSource(typeof(TestDataReader), "TestCases")] attribute and organizing my series of assertions like each one as an independent test case was not the optimal solution.
1. Calling test method goes over reflection causing incredible performance penalty in overal execution time.
2. Test cases can only be executed all together, thus it's a recording and every next assertion assumes certain state left by all previous stimulus.
3. NUnit GUI was not able to handle more than several thousand test cases, because it was trying to generate the tree containing list of all test cases ending up with out of memory exception. That's why I primarily used NUnit console.
4. Time measured by NUnit is the overall time and not the "clean" test time. It is the sum of time needed for test case construction (in my case reading from file) + time needed for reflection invoke + test time itself. I was interested on performance measurement of method calls on my class, so I had to implement my own StopWatch around them.

At the end I put back optimized implementation in my original project and it worked perfectly at the first try.
The whole solution became if fact a throw away code and data. It was made to accomplish one very specific optimization mission - like Russian Soyuz spacecraft expendable launch system - less elegant, but "low risk of mission failure, a short time to launch and a relatively low cost." (Wikipeia about expendable launch systems.)

Posted On Wednesday, July 20, 2011 10:05 AM | Comments (1) | Filed Under [ C# Language ]

Monday, July 18, 2011

Why .NET LinkedList does not support Concat and Split operations?

Why .NET LinkedList does not support Concat and Split operations?

 

Download SimpleLinkedList source code

Concat O(1) or O(n) ?

The .NET LinkedList is a circular doubly linked list, where each node holds a reference to its previous and next nodes. The last node’s next is the head node and the head node’s previous is the last one.

 

Doubly linked list

 

Linked lists are attractive because of O(1) insertion and removal operations. Instead of shifting elements in array you just chain nodes in appropriate order and that’s it.

 

Following this logic concatenation of two lists should be similar O(1) operation. You just bind the end of the first list with the head of the second and let the both ends of the resulting list point to each other.

 

Nevertheless if you look at .NET LinkedList implementation you will find out that the only way to join two linked lists is to add elements of the second list one by one to the first one. This is O(n) operation.

 

Implementing Concat

So if you look at the implementation you will find out that the LinkedListNode has one more field except prev and next. This is a reference to the owner list. So with this implementation even if you implement concat in O(1) as described above, you will still need at least reparent all nodes which is again O(n) operation.

 

Can we avoid referencing list in every node? Yes we can. We can rewrite Previous and Next properties like this.

 

Before:

        public LinkedListNode<T> Next

        {

            get

            {

                if ((this.next != null) && (this.next != this.list.head))

                {

                    return this.next;

                }

                return null;

            }

        }

 

After:

        public SimpleLinkedListNode<T> GetNext(SimpleLinkedList<T> list)

        {

            if ((this.Next != null) && (this.Next != list.First))

            {

                return this.Next;

            }

            return null;

        }

 

After removing parent checking exception handling code from LinkeList implementation you can implement contamination like this:

 

        public void Concat(SimpleLinkedList<T> secondList)

        {

            if (secondList.m_Head == null)

            {

                return;

            }

 

            if (this.m_Head == null)

            {

                this.m_Head = secondList.m_Head;

            }

            else

            {

                var seamLeft = this.Last;

                var seamRight = secondList.m_Head;

                var newEnd = secondList.Last;

 

                seamLeft.Next = seamRight;

                seamRight.Prev = seamLeft;

                newEnd.Next = this.m_Head;

                this.m_Head.Prev = newEnd;

            }

        }

 

You are even able to maintain Count property correctly by adding counts of both lists.
 

 

this.m_Count += secondList.m_Count;
 

Drawbacks

The major drawback of this solution is that the second list is will be left in an inconsistent state after this operation. Its last.prev does not point to its head anymore. Another point is that any active operation on one of the lists will modify both, thus they have shared nodes. These side effects violate fundamental principles of OO design. So I can imagine that Microsoft won’t compromise at that point and decided not to implement concatenation.

 

What about Split?

Implementing Split is very similar, with the difference that you are not able to maintain Count property in O(1) time. You are splitting at certain node without knowing its index inside the list. So If you want to have a Split operation you should compromise Count property.

 

Concat extension method

There is a Concat extension method on IEnumarable<T> interface. You might think it’s as stupid as Count extension method on IEnumarable<T> and spends O(n) on bringing all together.

 

In fact it executes in O(1) returning a new enumerator which enumerates first list and jumps to the second when the first list reaches its end.

This might help if you are not going to continue working with concatenation result like with a LinkedList. Another point is that several subsequent concatenation operations cause nested enumerators. So you get a tree of enumerators over enumerators etc.

 

My implementation

The actual goal of my implementation of double linked list supporting concat and split was to try out and demonstrate test driven refactoring approach.

 

The first implementation was just derived from common .NET LinkedList<T>.

 

    public class SimpleLinkedList<T> : LinkedList<T>

    {

 

        public SimpleLinkedList()

        {

 

        }

 

        public SimpleLinkedList(IEnumerable<T> enumerable)

            : base(enumerable)

        {

 

        }

 

        internal SimpleLinkedList<T> Split(LinkedListNode<string> splitNode)

        {

            throw new NotImplementedException();

        }

 

        public LinkedListNode<T> Find(T search, IEqualityComparer<T> comparer)

        {

            return Find(search);

        }

 

        public void FindLast(T search, IEqualityComparer<T> comparer)

        {

            return Find(search);

        }

    }

 

The second step was creating appropriate test set ensuring common LinkedList behaviour.

Example:

        [TestCase(new string[] { }, "A", new[] { "A" })]

        [TestCase(new string[] { }, null, new[] { (string)null })]

        [TestCase(new[] { "A" }, "B", new[] { "A", "B" })]

        [TestCase(new[] { "A", "B", "C" }, "D", new[] { "A", "B", "C", "D" })]

        public void AddLast_one_element_occururs_at_the_end_of_the_list(string[] original, string element, string[] expected )

        {

            var target = new SimpleLinkedList<string>(original);

            target.AddLast(element);

            CollectionAssert.AreEquivalent(expected, target);

        }

 

After I had a reasonable set I started implementing functionality successively.

Here is the result - SimpleLinkedList

Advantages:

·          Less memory consumption, thus every node SimpleLinkedListNode<T> has three pointers (prev, next, value) instead of four (prev, next, list, value) unlike original .NET implementation.

·          Supports Concat and Split operations in O(1)

·          Supports IEnumarable<T> Reverse() enumerator in O(1) – by the way I do not see any reason why it’s not provided natively on the .NET LinkedList. Appropriate extension method requires O(n).

Disadvantages:

·          Does not support Count.

·          Concat operation leaves second list in an inconsistent state.

·          Split operation leaves original list in an inconsistent state.

·          You are able to share nodes between lists.

Other:

·          I have chosen an alternative strategy for implementing enumeration and find operations, rather than more verbose and purely readable original implementation. I hope the negative performance impact remains insignificant.

 

So be careful using this list and use it only if you are aware of consequences.

Posted On Monday, July 18, 2011 12:55 AM | Comments (2) | Filed Under [ C# Language ]

Saturday, February 5, 2011

yield – Just yet another sexy c# keyword?

 

 

yield (see NSDN c# reference) operator came I guess with .NET 2.0 and I my feeling is that it’s not as wide used as it could (or should) be.

 

I am not going to talk here about necessarity and advantages of using iterator pattern when accessing custom sequences (just google it).

 

Let’s look at it from the clean code point of view. Let's see if it really helps us to keep our code understandable, reusable and testable.

 

Let’s say we want to iterate a tree and do something with it’s nodes, for instance calculate a sum of their values. So the most elegant way would be to build a recursive method performing a classic depth traversal returning the sum.

 

        private int CalculateTreeSum(Node top)

        {

            int sumOfChildNodes = 0;

            foreach (Node childNode in top.ChildNodes)

            {

                sumOfChildNodes += CalculateTreeSum(childNode);

            }

            return top.Value + sumOfChildNodes;

        }

 

 

“Do One Thing”

Nevertheless it violates one of the most important rules “Do One Thing”. Our  method CalculateTreeSum does two things at the same time. It travels inside the tree and performs some computation – in this case calculates sum. Doing two things in one method is definitely a bad thing because of several reasons:

·          Understandability: Readability / refactoring

·          Reuseability: when overriding - no chance to override computation without copying iteration code and vice versa.

·          Testability: you are not able to test computation without constructing the tree and you are not able to test correctness of tree iteration.

 

I want to spend some more words on this last issue. How do you test the method CalculateTreeSum when it contains two in one: computation & iteration? The only chance is to construct a test tree and assert the result of the method call, in our case the sum against our expectation. And if the test fails you do not know wether was the computation algorithm wrong or was that the iteration? At the end to top it all off I tell you: according to Murphy’s Law the iteration will have a bug as well as the calculation. Both bugs in a combination will cause the sum to be accidentally exactly the same you expect and the test will PASS. J

 

Ok let’s use yield!

That’s why it is generally a very good idea not to mix but isolate “things”. Ok let’s use yield!

 

        private int CalculateTreeSumClean(Node top)

        {

            IEnumerable<Node> treeNodes = GetTreeNodes(top);

            return CalculateSum(treeNodes);

        }

 

 

        private int CalculateSum(IEnumerable<Node> nodes)

        {

            int sumOfNodes = 0;

            foreach (Node node in nodes)

            {

                sumOfNodes += node.Value;

            }

            return sumOfNodes;

        }

 

        private IEnumerable<Node> GetTreeNodes(Node top)

        {

            yield return top;

            foreach (Node childNode in top.ChildNodes)

            {

                foreach (Node currentNode in GetTreeNodes(childNode))

                {

                    yield return currentNode;

                }

            }

        }

 

Two methods does not know anything about each other. One contains calculation logic another jut the iteration logic. You can relpace the tree iteration algorithm from depth traversal to breath trevaersal or use stack or visitor pattern instead of recursion. This will not influence your calculation logic. And vice versa you can relace the sum with product or do whatever you want with node values, the calculateion algorithm is not aware of beeng working on some tree or graph. 

How about not using yield?

Now let’s ask the question – what if we do not have yield operator?

The brief look at the generated code gives us an answer. The compiler generates a 150 lines long class to implement the iteration logic.  

    [CompilerGenerated]

    private sealed class <GetTreeNodes>d__0 : IEnumerable<Node>, IEnumerable, IEnumerator<Node>, IEnumerator, IDisposable

    {

        ...

       150 Lines of generated code
       ...

    }

 

Often we compromise code readability, cleanness, testability, etc. – to reduce number of classes, code lines, keystrokes and mouse clicks. This is the human nature - we are lazy. Knowing and using such a sexy construct like yield, allows us to be lazy, write very few lines of code and at the same time stay clean and do one thing in a method. That's why I generally welcome using staff like that.

 

Note: The above used recursive depth traversal algorithm is possibly the compact one but not the best one from the performance and memory utilization point of view. It was taken to emphasize on other primary aspects of this post.

Posted On Saturday, February 5, 2011 3:04 AM | Comments (4) | Filed Under [ C# Language ]

Saturday, November 27, 2010

Implementing the Equals Method in .NET and … Linear Algebra

As an informatics student you learn hundreds of mathematical definitions and theorems and all of them seem to be useless programming. Here the evidence from the “real programmer life” to demonstrate you that’s not true.

 

I was asked to analyze some very strange behaving class. The Contains(object obj) method was not working correctly on after populating the list with instances of this class. The problem was a mess implementation of bool Equals(Object obj) overloaded methods. There was no logical or some kind of conditional mistake in code. So what was wrong?

 

There is an article on MSDN named Guidelines for Implementing Equals Method describing the way you should implement Equals method when overriding it. Sounds like a set of non mandatory recommendations. Let’s look at them in deep. Most of them are almost intuitive, but these three are essential ones:

 

Follow the contract defined on the Object.Equals Method as follows:

1. x.Equals(x) returns true.

2. x.Equals(y) returns the same value as y.Equals(x).

3. (x.Equals(y) && y.Equals(z)) returns true if and only if x.Equals(z) returns true.

 

And now some algebra; the definition of equivalence relation says (see: Wikipedia – Equivalence relation):

 

A given binary relation ~ on a set A is said to be an equivalence relation if and only if it is reflexive, symmetric and transitive.

Equivalently, for all a, b and c in A:

1. a ~ a. (Reflexivity)

2. if a ~ b then b ~ a. (Symmetry)

3. if a ~ b and b ~ c then a ~ c. (Transitivity)

 

So the Equals method must be an equivalence relation, thus only equivalence relation partitions a set into several disjoint subsets, called equivalence classes. All the elements in a given equivalence class are equivalent among themselves, and no element is equivalent with any element from a different class.

 

This theorem states that if we define a function which satisfies this tree simple rules, the result we get will correspond our “common sense” and what we intuitively understand under classifying objects and (IMPRTANT!) if one of the rules is violated the result will definitively lead to “inconsistent” behavior.

 

Examples

Let’s try to check following sample relation against this tree rules and see weather they can be used for equality or not.

    class Foo

    {

        public string Text { get; set; }

 

        public override bool Equals(object obj)

        {

            // Check for null values and compare run-time types.

            if (obj == null || GetType() != obj.GetType()) { return false; }

            this.Equals((Foo)obj);

        }

 

        private bool Equals(Foo obj)

        {

            return this.Text.Contains(obj.Text);

        }

    }

Here we try to use String.Contains(string substring) as equality relation.

1.       String contains called on itself returns always true.

3.       If string “ABC” contains “BC” and “BC” contains “C” than “ABC” contains “C” is also truth.

However the second rule will not pass. For instance “ABC” contains “BC” but “BC” does not contain “ABC”.

 

Another interesting sample would be do let two strings be equal if they are reverse to each other. In this case the rules 2. and 3. are satisfied but the rule 1. is not.

 

Testing

Testing several combinations is generally not enough to prove that a specific implementation of equality really satisfies these 3 requirements. You need either check all combinations or mathematically prove every of this three facts. So you still need quite a bit of mathematics.

 

Soon I am going to write about how implementations of Equals() and GetGashCode() are related to each other, what is the mathematical background of this relation and how many truth is in guidelines and recommendations on this topic.

Posted On Saturday, November 27, 2010 11:21 PM | Comments (0) | Filed Under [ C# Language ]

Tuesday, February 2, 2010

Back to the roots: .NET binary search and the meaning of the negative number of the Array.BinarySearch() return value

Back to the roots: .NET binary search and the meaning of the negative number of the Array.BinarySearch() return value

 

Recently I gave a group of developers a task witch can be simplified to following simple problem: you have a sorted array of elements; find the index of a given element in this array. They came up with following solution:

//given array

int[] sortedArray = new[] { 1, 5, 8, 12, 18, 20 };

//Create a list from the array

List<int> list = new List<int>(sortedArray);

//use IndexOf method to find the element

int result = list.IndexOf(8);

Of course it works find but didn’t we do too much work? Can we probably use the fact that the given array is sorted? Do we need an additional data structure List?

If we read the Performance Considerations section at MSDN class documentation we see that to perform IndexOf() operation the list first sorts the content and then performs the binary search on it.

For more information see: http://msdn.microsoft.com/en-us/library/6sh2ey19.aspx

 

So we can simply use the BinarySearch() method of the Array static class.

//perform binary search

int result = Array.BinarySearch(sortedArray, 8);

So the BinarySearch() does basically the same as the IndexOf() does, but quicker, sometimes it’s better to remember basic algorithms instead of using complex data structures to accomplish the very narrow task efficiently.

 

For more information on binary search algorithm see: http://en.wikipedia.org/wiki/Binary_search_algorithm

 

Why does the BinarySeacrh() do more then IndexOf()

Let’s search a non existing element using both methods and compare results.

//given array

int[] sortedArray = new[] { 1, 5, 8, 12, 18, 20 };

//Create a list from the array

List<int> list = new List<int>(sortedArray);

//use IndexOf to find element

int result1 = list.IndexOf(17);

//Use Array.BinarySearch

int result2 = Array.BinarySearch(sortedArray, 17);

The result looks strange:

 

IndexOf(17)=-1 BinarySearch(17)=-5

 

Both of them deliver a negative value which should be interpreted as element not found. However in opposite to List.IndexOf() which always delivers -1 the Array.BinarySearch() returns some “magic” negative value. Following explanation can be found in class documentation:

Return Value

The index of the specified value in the specified array, if value is found. If value is not found and value is less than one or more elements in array, a negative number which is the bitwise complement of the index of the first element that is larger than value. If value is not found and value is greater than any of the elements in array, a negative number which is the bitwise complement of (the index of the last element plus 1).

For more information see: http://msdn.microsoft.com/en-us/library/y15ef976.aspx

Sounds complicated but means following: the binary search travels the binary tree, at some point it realizes that element can not be found and stops on a certain node. If you negate the result of the binary search and subtract 1 you will get the index of this element.

 

 

Index

0

1

2

3

4

5

Element

1

5

8

12

18

20

 

So if you search for 7 the result will be the negated index of element 8 minus 1, equals 2.

So if you search for 17 the result will be the negated index of element 18 minus 1, equals 4.

The formula is:

- Array.BinarySearch(sortedArray, 22) - 1

Important: if you are searching for a value which is greater then the last element this formula gives you array length, which is a non existing index. So you should not use the result of this calculation to access an element of the array without checking if it’s less then length.

So if you search for 22 the result will be -7. -(-7) -1 = 6 = array length.

 

The return value of BinarySearch() can be used in wide range of algorithms. For instance when solving the problem of vertex collision detection. It might be useful also when working with text segments.

Posted On Tuesday, February 2, 2010 12:57 AM | Comments (0) |

Powered by: