Alois Kraus

blog

  Home  |   Contact  |   Syndication    |   Login
  133 Posts | 8 Stories | 368 Comments | 162 Trackbacks

News



Archives

Post Categories

Programming

At my day job I do look at a lot of code written by other people. Most of the code is quite good and some is even a masterpiece. And there is also code which makes you think WTF… oh it was written by me. Hm not so bad after all. There are many excuses reasons for bad code. Most often it is time pressure followed by not enough ambition (who cares) or insufficient training. Normally I do care about code quality quite a lot which makes me a (perceived) slow worker who does write many tests and refines the code quite a lot because of the design deficiencies. Most of the deficiencies I do find by putting my design under stress while checking for invariants. It does also help a lot to step into the code with a debugger (sometimes also Windbg). I do this much more often when my tests are red. That way I do get a much better understanding what my code really does and not what I think it should be doing. This time I do want to show you how code can evolve over the years with different .NET Framework versions.

Once there was  time where .NET 1.1 was new and many C++ programmers did switch over to get rid of not initialized pointers and memory leaks. There were also nice new data structures available such as the Hashtable which is fast lookup table with O(1) time complexity. All was good and much code was written since then. At 2005 a new version of the .NET Framework did arrive which did bring many new things like generics and new data structures. The “old” fashioned way of Hashtable were coming to an end and everyone used the new Dictionary<xx,xx> type instead which was type safe and faster because the object to type conversion (aka boxing) was no longer necessary. I think 95% of all Hashtables and dictionaries use string as key. Often it is convenient to ignore casing to make it easy to look up values which the user did enter. An often followed route is to convert the string to upper case before putting it into the Hashtable.

        Hashtable Table = new Hashtable();

        void Add(string key, string value)
        {
            Table.Add(key.ToUpper(), value);
        }

This is valid and working code but it has problems. First we can pass to the Hashtable a custom IEqualityComparer to do the string matching case insensitive. Second we can switch over to the now also old Dictionary type to become a little faster and we can keep the the original keys (not upper cased) in the dictionary.

        Dictionary<string, string> DictTable = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

        void AddDict(string key, string value)
        {
            DictTable.Add(key, value);
        }

Many people do not user the other ctors of Dictionary because they do shy away from the overhead of writing their own comparer. They do not know that .NET has for strings already predefined comparers at hand which you can directly use.

Today in the many core area we do use threads all over the place. Sometimes things break in subtle ways but most of the time it is sufficient to place a lock around the offender. Threading has become so mainstream that it may sound weird that in the year 2000 some guy got a huge incentive for the idea to reduce the time to process calibration data from 12 hours to 6 hours by using two threads on a dual core machine. Threading does make it easy to become faster at the expense of correctness. Correct and scalable multithreading can be arbitrarily hard to achieve depending on the problem you are trying to solve.

Lets suppose we want to process millions of items with two threads and count the processed items processed by all threads. A typical beginners code might look like this:

        
        int Counter;
        void IJustLearnedToUseThreads()
        {
            var t1 = new Thread(ThreadWorkMethod);
            t1.Start();
            var t2 = new Thread(ThreadWorkMethod);
            t2.Start();

            t1.Join();
            t2.Join();

            if (Counter != 2 * Increments)
                throw new Exception("Hmm " + Counter + " != " + 2 * Increments);

        }
        const int Increments = 10 * 1000 * 1000;

        void ThreadWorkMethod()
        {
            for (int i = 0; i < Increments; i++)
            {
                Counter++;
            }
        }

It does throw an exception with the message e.g. “Hmm 10.222.287 != 20.000.000” and does never finish. The code does fail because the assumption that Counter++ is an atomic operation is wrong. The ++ operator is just a shortcut for

Counter = Counter + 1

This does involve reading the counter from a memory location into the CPU, incrementing value on the CPU and writing the new value back to the memory location. When we do look at the generated assembly code we will see only

inc         dword ptr [ecx+10h] 

which is only one instruction. Yes it is one instruction but it is not atomic. All modern CPUs have several layers of caches (L1,L2,L3) which try to hide the fact how slow actual main memory accesses are. Since cache is just another word for redundant copy it can happen that one CPU does read a value from main memory into the cache, modifies it and write it back to the main memory. The problem is that at least the L1 cache is not shared between CPUs so it can happen that one CPU does make changes to values which did change in meantime in the main memory. From the exception you can see we did increment the value 20 million times but half of the changes were lost because we did overwrite the already changed value from the other thread.

This is a very common case and people do learn to protect their  data with proper locking.

 

        void Intermediate()
        {
            var time = Stopwatch.StartNew();
            Action acc = ThreadWorkMethod_Intermediate;
            var ar1 = acc.BeginInvoke(null, null);
            var ar2 = acc.BeginInvoke(null, null);
            ar1.AsyncWaitHandle.WaitOne();
            ar2.AsyncWaitHandle.WaitOne();

            if (Counter != 2 * Increments)
                throw new Exception(String.Format("Hmm {0:N0} != {1:N0}", Counter, 2 * Increments));

            Console.WriteLine("Intermediate did take: {0:F1}s", time.Elapsed.TotalSeconds);
        }
        void ThreadWorkMethod_Intermediate()
        {
            for (int i = 0; i < Increments; i++)
            {
                lock (this)
                {
                    Counter++;
                }
            }
        }

This is better and does use the .NET Threadpool to get rid of manual thread management. It does give the expected result but it can result in deadlocks because you do lock on this. This is in general a bad idea since it can lead to deadlocks when other threads use your class instance as lock object. It is therefore recommended to create a private object as lock object to ensure that nobody else can lock your lock object.

When you read more about threading you will read about lock free algorithms. They are nice and can improve performance quite a lot but you need to pay close attention to the CLR memory model. It does make quite weak guarantees in general but it can still work because your CPU architecture does give you more invariants than the CLR memory model. For a simple counter there is an easy lock free alternative present with the Interlocked class in .NET. As a general rule you should not try to write lock free algos since most likely you will fail to get it right on all CPU architectures.

        void Experienced()
        {
            var time = Stopwatch.StartNew();
            Task t1 = Task.Factory.StartNew(ThreadWorkMethod_Experienced);
            Task t2 = Task.Factory.StartNew(ThreadWorkMethod_Experienced);
            t1.Wait();
            t2.Wait();

            if (Counter != 2 * Increments)
                throw new Exception(String.Format("Hmm {0:N0} != {1:N0}", Counter, 2 * Increments));

            Console.WriteLine("Experienced did take: {0:F1}s", time.Elapsed.TotalSeconds);
        }
        void ThreadWorkMethod_Experienced()
        {
            for (int i = 0; i < Increments; i++)
            {
                Interlocked.Increment(ref Counter);
            }
        }

Since time does move forward we do not use threads explicitly anymore but the much nicer Task abstraction which was introduced with .NET 4 at 2010. It is educational to look at the generated assembly code. The Interlocked.Increment method must be called which does wondrous things right? Lets see:

lock inc    dword ptr [eax] 

The first thing to note that there is no method call at all. Why? Because the JIT compiler does know very well about CPU intrinsic functions. Atomic operations which do lock the memory bus to prevent other processors to read stale values are such things. Second: This is the same increment call prefixed with a lock instruction. The only reason for the existence of the Interlocked class is that the JIT compiler can compile it to the matching CPU intrinsic functions which can not only increment by one but can also do an add, exchange and a combined compare and exchange operation. But be warned that the correct usage of its methods can be tricky.

If you try to be clever and look a the generated IL code and try to reason about its efficiency you will fail. Only the generated machine code counts.

Is this the best code we can write? Perhaps. It is nice and clean. But can we make it any faster? Lets see how good we are doing currently.

Level Time in s
IJustLearnedToUseThreads Flawed Code

Intermediate

1,5 (lock)

Experienced

0,3 (Interlocked.Increment)
Master 0,1 (1,0 for int[2])

That lock free thing is really a nice thing. But if you read more about CPU cache, cache coherency, false sharing you can do even better.

        int[] Counters = new int[12]; // Cache line size is 64 bytes on my machine with an 8 way associative cache try for yourself e.g. 64 on more modern CPUs


        void Master()
        {
            var time = Stopwatch.StartNew();
            Task t1 = Task.Factory.StartNew(ThreadWorkMethod_Master, 0);
            Task t2 = Task.Factory.StartNew(ThreadWorkMethod_Master, Counters.Length - 1);
            t1.Wait();
            t2.Wait();

            Counter = Counters[0] + Counters[Counters.Length - 1];

            if (Counter != 2 * Increments)
                throw new Exception(String.Format("Hmm {0:N0} != {1:N0}", Counter, 2 * Increments));

            Console.WriteLine("Master did take: {0:F1}s", time.Elapsed.TotalSeconds);
        }
        void ThreadWorkMethod_Master(object number)
        {
            int index = (int) number;
            for (int i = 0; i < Increments; i++)
            {
                Counters[index]++;
            }
        }

The key insight here is to use for each core its own value. But if you simply use simply an integer array of two items, one for each core and add the items at the end you will be much slower than the lock free version (factor 3). Each CPU core has its own cache line size which is something in the range of 16-256 bytes. When you do access a value from one location the CPU does not only fetch one value from main memory but a complete cache line (e.g. 16 bytes). This means that you do not pay for the next 15 bytes when you access them. This can lead to dramatic performance improvements and non obvious code which is faster although it does have many more memory reads than another algorithm.

So what have we done here? We have started with correct code but it was lacking knowledge how to use the .NET Base Class Libraries optimally. Then we did try to get fancy and used threads for the first time and failed. Our next try was better but it still had non obvious issues (lock object exposed to the outside). Knowledge has increased further and we have found a lock free version of our counter which is a nice and clean way which is a perfectly valid solution. The last example is only here to show you how you can get most out of threading by paying close attention to your used data structures and CPU cache coherency.

Although we are working in a virtual execution environment in a high level language with automatic memory management it does pay off to know the details down to the assembly level. Only if you continue to learn and to dig deeper you can come up with solutions no one else was even considering. I have studied particle physics which does help at the digging deeper part. Have you ever tried to solve Quantum Chromodynamics equations? Compared to that the rest must be easy ;-). Although I am no longer working in the Science field I take pride in discovering non obvious things. This can be a very hard to find bug or a new way to restructure data to make something 10 times faster. Now I need to get some sleep ….

posted on Wednesday, July 24, 2013 10:35 AM