
C#/.NET Little Wonders: TPL Wonders – The Parallel.For() Method

Once again, in this series of posts I look at the parts of the .NET Framework that may seem trivial, but can help improve your code by making it easier to write and maintain. The index of all my past little wonders posts can be found here.

Previously, I had created some posts about the Task Parallel Library’s Concurrent Collections – which are very efficient collections designed for concurrent use with minimal contention – but in this next series of Little Wonders posts, I’d like to explore the parts of the TPL that can simplify and improve the way you perform parallel programming.

For this week, we are going to start by looking at the Parallel static class's For() method, which is a quick and easy way to perform parallel processing over a range of numeric values.

The Task Parallel Library – a brief summary before we begin…

Starting with .NET 4.0, the Task Parallel Library is considered by Microsoft to be the preferred way to write parallel code.  This library makes it much simpler than previous threading models to create work to be performed in parallel.  This makes it much easier to take advantage of the multi-core machines that are becoming so commonplace in the computing environments in which we work and live.

The TPL comes with several items to make parallelization easier:

  • Tasks – These classes represent “work” that can be done in parallel, and that need to be scheduled and coordinated.  Any two tasks may end up using different threads in a pool, or the same thread depending on scheduling.  Tasks can also be designated “long running” in which case they typically get their own dedicated thread and do not consume a thread from the pool.
  • Task Schedulers – These classes represent scheduling algorithms that take tasks and assign them to a pool of worker threads for execution.  The default scheduler chooses the number of threads based on the architecture (number of cores) on your machine to attempt to take advantage of as much parallelism as possible without overloading the processors with task switching.
  • TaskFactory – This static helper class makes it easy to construct and start a Task with a multitude of options.
  • Concurrent Collections – These data structures are collection classes optimized for parallel access.  They were written in such a way so as to reduce contention as much as possible.  They should be your first choice for data that needs to be accessed by multiple threads/tasks (some of my past Little Wonders posts explore these in more detail). 
  • Parallel static class – This is a static helper class that makes it easy to perform parallel loops and method invocations without having to create tasks or threads directly. 
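To make the first two items concrete, here is a minimal sketch (not from the original post; the lambda bodies and class name are illustrative only) showing a pooled task started through the factory alongside a "long running" task:

```csharp
using System;
using System.Threading.Tasks;

public static class TaskSketch
{
    public static void Main()
    {
        // A normal task runs on a thread-pool thread chosen by the scheduler.
        Task<int> pooled = Task.Factory.StartNew(() => 21 * 2);

        // The LongRunning hint tells the scheduler this task may deserve
        // its own dedicated thread rather than consuming a pool thread.
        Task longRunning = Task.Factory.StartNew(
            () => Console.WriteLine("long-running work"),
            TaskCreationOptions.LongRunning);

        Task.WaitAll(pooled, longRunning);
        Console.WriteLine(pooled.Result);  // prints 42
    }
}
```

Note that TaskCreationOptions.LongRunning is only a hint to the default scheduler, not a guarantee of a dedicated thread.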

Some people think of tasks as just a way to schedule units of work on a thread pool, and thus they may think that tasks are a poor substitute for long-running threads.  This is really an oversimplification: even though tasks do get much of their power and flexibility from handling units of work scheduled in a thread pool, they can also be used to simplify the programming of long-running threads.  So don't let the name fool you – nearly anything you can do with threads, you can do with tasks, and the programming and management of them tends to be much simpler.

Note: Not all work benefits from parallelism.  If each work item to be performed takes a relatively insignificant amount of time to execute, parallelism can actually degrade performance.  When in doubt, stay serial and only parallelize when a need is identified.  That said, if you do find a need to parallelize work, the TPL is a great set of tools for the job and VS2010 has a great concurrency profiler to help identify hot spots.

Parallel.For() – A very simple example…

So, on to the meat of the post!  Today we will look at the Parallel.For() method family.  This family contains a set of methods that perform like a traditional, sequential for loop – except that the body of the “loop” can execute in parallel.

You can think of this (very roughly) as if you had a for loop where every pass creates and schedules a parallel task, and then waits for all the tasks to complete before moving on.  As with explicit tasks, there is no guarantee that every task will have its own thread.  The scheduler will determine how many of them to perform in parallel at a time.  In truth, this is really a simplification of what’s going on, but it’s the essential idea.

So let’s look at a very simple example of Parallel.For() that iterates over int values (long values are also an option):

// stopwatch for timing
var timer = Stopwatch.StartNew();

// this is a parallel loop over range 0..9
Parallel.For(0, 10, i => Console.WriteLine(i));

// print how long it took...
timer.Stop();
Console.WriteLine("Parallel execution took: {0} ms", timer.ElapsedMilliseconds);

Ignore the Stopwatch for the moment (a handy class and one of the first Little Wonders I posted on) since it’s just there to time the body of the code.  Notice the Parallel.For() call, which was given three arguments: the starting value (inclusive), the ending value (exclusive), and the Action<> to perform on each value.  In this case, we defined tasks on the range of values from 0 up to but not including 10, and for each of those values we called Console.WriteLine() to display the value.

Now, notice that the starting value is inclusive, and the ending value is exclusive.  That means if you choose a starting value of 0 and an ending value of 100, you will actually be processing the values from 0..99.  This is very similar to what we'd get with a traditional, sequential for loop.

So let’s look at the output as run on my quad-core machine:

0
8
6
2
4
1
9
5
3
7
Parallel execution took: 3 ms

Looking at this output, a few things stick out.  First, note that the values are not processed in order.  The order these items will be processed is somewhat indeterminate.  I say somewhat because the order we see will depend partially on how the Parallel.For() partitions the range and assigns them to tasks, and partially on how long each task takes to execute before moving on to the next task.

However, that said, note that each value did get processed.  Also notice that the time for executing this Parallel.For() was 3 ms.  That feels a bit high for a simple for loop, so let’s look at what an “equivalent”, traditional for loop would do:

// stopwatch for timing
var timer = Stopwatch.StartNew();

// this is a sequential loop over range 0..9
for (int i = 0; i < 10; i++)
{
    Console.WriteLine(i);
}

// print how long it took...
timer.Stop();
Console.WriteLine("Sequential execution took: {0} ms", timer.ElapsedMilliseconds);

Pretty standard, right?  Notice that, like our friend the Parallel.For(), the way we typically write a for loop is from the starting value up to but not including the ending value as well.  This is probably why they made the design decision to have the final value be exclusive, in keeping with the usage habits of the traditional for loop.  So let's run this one:

0
1
2
3
4
5
6
7
8
9
Sequential execution took: 0 ms

Obviously, the output is sequential as we'd expect, but also notice that the sequential for loop took 0 ms (more precisely, somewhere less than 1 ms).  If you thought parallelism was always better, this may stun you quite a bit.  However, you must keep in mind everything that Parallel.For() does under the covers.

Instead of looping through 10 items and printing them, in the Parallel.For() we've – behind the scenes – created 10 iterations of work (tasks) and scheduled them.  This is considerably more work than simply putting data in the Console.Out buffer.  In addition, since the Console.Out buffer is synchronized, we didn't buy much of anything by making this parallel, as the synchronization would prevent those actions from executing in parallel anyway.

This illustrates the main point of parallel computing: just because you can doesn’t mean you should.  Parallelizing trivial operations is actually less efficient due to task creation and scheduling.  You will really see the payoff where each unit of work is larger.

Parallel.For() – Okay, now for a better example…

So what kind of stuff is better for parallelism?  Well, typically (relatively) significant work items that can be done mostly in parallel.  Like perhaps calling a series of distinct web methods and getting their results, or perhaps handling several files in parallel, or breaking a calculation into several independent chunks.  The main thing is to try to make the tasks as independent as possible to avoid contention on synchronization constructs, which limit parallelism.

Now, obviously, there are several ways to perform parallelism, so which situations is Parallel.For() best suited for?  Typically, I find Parallel.For() is best where you want to perform a series of tasks over a range of values.  This sounds trivial, right?  Why would we want to parallelize over a range of values?  The key is to think of these values not just as values, but as indexes into arrays, lists, or more complex constructs.

For example, what if we have a List&lt;string&gt; of file names whose files we want to search for a word, and then store in an int[] the number of times the term appeared in each file?  This makes great use of the Parallel.For() method, since the files are independent of each other, and both the file names and the resulting counts are uniquely identified by the value in the range specified in the Parallel.For() (as indexes into the list and array, respectively).

So let’s look at how we’d do this.  First, we need a method to take in a file name and search the file:

public static int FindTermInFile(string fileName, string term)
{
    int foundCount = 0;

    // read the text of the file into one long string
    string text = File.ReadAllText(fileName);

    // find first occurrence of the term
    int pos = text.IndexOf(term);

    // while we found another one
    while (pos >= 0)
    {
        // increment and find next one until not found (-1)
        ++foundCount;

        pos = text.IndexOf(term, pos + 1);
    }

    return foundCount;
}

Then, we need our program to process all these files using this method.  Let’s start first with handling all files sequentially:

// create a list of files from some classic book texts
var files = new List<string>
{
    "wealth.txt",    // Adam Smith's Wealth of Nations
    "war.txt",       // Tolstoy's War and Peace
    "origins.txt",   // Darwin's Origin of the Species
    "lewis.txt",     // Thomas Jefferson's Report on Lewis and Clark Expedition
    "reason.txt",    // Thomas Paine's Age of Reason
    "sherlock.txt",  // Sir Arthur Conan Doyle's Sherlock Holmes
    "cities.txt"     // Dickens' Tale of Two Cities
};

// make an array of results
var results = new int[files.Count];

var timer = Stopwatch.StartNew();

// find all occurrences of the string "the" in each book.
for (int i = 0; i < files.Count; i++)
{
    results[i] = FindTermInFile(files[i], "the");

    Console.WriteLine("Found \"the\" {0} times in book {1}", results[i], files[i]);
}

timer.Stop();

Console.WriteLine(
    "The string \"the\" was found {0} times in {1} books, operation took {2} ms",
    results.Sum(), files.Count, timer.ElapsedMilliseconds);

Seems pretty straightforward, right?  We're not doing anything too fancy here; there are probably other things we could do to handle this better, but as a simplistic example it will do fine.  So let's run this on my quad-core box and see the results:

Found "the" 41348 times in book wealth.txt
Found "the" 21278 times in book war.txt
Found "the" 19455 times in book origins.txt
Found "the" 20271 times in book lewis.txt
Found "the" 7791 times in book reason.txt
Found "the" 7218 times in book sherlock.txt
Found "the" 10694 times in book cities.txt

The string "the" was found 128055 times in 7 books, operation took 12734 ms

So, it searched 7 books in 12.7 seconds, not too shabby for a quad-core.  But really, the number of cores on my machine is irrelevant in this case because this is all sequential code, so it’s really just utilizing one of my processors.

So now let’s look at what happens if we parallelize this with our new friend, the Parallel.For().  We’ll leave the search method the same, but instead use the Parallel.For() loop for handling each file and storing the result in each array element (since the array elements are unique, we don’t need to synchronize here since no two actions should be hitting the same list/array elements).

The body of our program will remain largely the same, except we will replace the sequential for loop:

// find all occurrences of the string "the" in each book.
for (int i = 0; i < files.Count; i++)
{
    results[i] = FindTermInFile(files[i], "the");

    Console.WriteLine("Found \"the\" {0} times in book {1}", results[i], files[i]);
}

With this new Parallel.For():

// find all occurrences of the string "the" in each book.
Parallel.For(0, files.Count, i =>
    {
        results[i] = FindTermInFile(files[i], "the");

        Console.WriteLine("Found \"the\" {0} times in book {1}", results[i], files[i]);
    });

Notice the structure of the “loop” remains largely the same – it still has the same start value of 0, and the same (exclusive) end value of files.Count.  The main difference is that instead of specifying a body of the loop, we specify a delegate (using a lambda in our case, but you could use a method as well) that takes the value to process (i in our case) and then specifies what to do with that value (process the file at position i in the list and store the result at position i in the array).

So how does the parallel form work out?  Let’s see:

Found "the" 20271 times in book lewis.txt
Found "the" 7791 times in book reason.txt
Found "the" 7218 times in book sherlock.txt
Found "the" 41348 times in book wealth.txt
Found "the" 10694 times in book cities.txt
Found "the" 19455 times in book origins.txt
Found "the" 21278 times in book war.txt

The string "the" was found 128055 times in 7 books, operation took 6838 ms

We’ve cut the time in about half, and are making better use of the multiple cores on my machine.

Note: You may be wondering why it didn't reduce the time by a factor of 4, since I have a quad core.  Well, just because you have four times the processors doesn't mean you'll always get four times the throughput.  This is due to the fact that the tasks have to be created and scheduled – which is overhead – and that even though these tasks look independent, they still share a common bottleneck: file IO.  But, for now, it's a great example of how a for loop can be parallelized to get some performance gains.

Sidebar: Parallel.For() – What happens if exceptions are thrown?

So what happens if an exception is thrown in an iteration of a Parallel.For()?

Well, if this were a traditional thread, the thread would terminate with the exception and the program would run on none the wiser.  The TPL handles this quite a bit better, and thus any Task (and the Parallel.For(), since it uses tasks) that throws an exception will hold onto it until the task is finalized through a Wait() or other finalizing method.  Once this happens, it will re-throw its exception as an AggregateException, which may include one or more exceptions from any tasks that faulted.

Note that for the Parallel.For() this happens as soon as it schedules all the iterations.  That is, when the Parallel.For() partitions the values for the iterations and assigns them to a series of tasks, it performs a WaitAll() to wait for all the tasks to complete.  At that point if any task(s) threw an exception before or during the WaitAll(), it will abort the remaining tasks not yet running and throw the AggregateException.

This is why you have to be very careful with the Action&lt;&gt; throwing an exception in the Parallel.For() loop.  Thus, if you want to ensure every value gets processed, your Action&lt;&gt; should catch and handle its own exceptions.  If you do want an unhandled exception to halt the other non-running iterations, then you would just catch the AggregateException and “handle” each exception to mark it as “observed”:

try
{
    Parallel.For(0, 44, i =>
        {
            if ((i % 10) == 0)
            {
                throw new ApplicationException("Could not process item: " + i);
            }
            Console.WriteLine(i);
        });
}
catch (AggregateException agEx)
{
    Console.Out.Flush();
    agEx.Handle(ex =>
        {
            // do whatever we want to "handle" the exception
            Console.WriteLine(ex);

            // return true if handled, otherwise it will raise if we dispose or wait again
            return true;
        });
}

Notice that iterations that are a multiple of ten will throw an exception if they run.  Thus, we have surrounded our Parallel.For() in a try/catch block and are attempting to catch an AggregateException.  If we do catch one, we can call the Handle() method to process each exception and mark it as having been handled (by returning true in our delegate).

I’ll discuss more on Handle() and the other methods of AggregateException in a future Little Wonders post.

Sidebar: Parallel.For() with optional Action<int, ParallelLoopState>

One of the options when using Parallel.For() is to replace your Action<int> with an Action<int, ParallelLoopState>.

This gives each iteration control over other iterations that have not yet been executed (or that are watching for changes using the properties below):

  • Break()
    • No items higher than the current iteration value will be run (any previously run or currently running iterations are not affected).  All items lower than this value will continue to be scheduled and run.
  • Stop()
    • No further items will be run, any items previously run or currently running are unaffected.

Or, each iteration can also query the current loop state to see if they should abort before they are finished:

  • IsExceptional
    • Gets whether any iteration of the loop threw an unhandled exception.
  • IsStopped
    • Gets whether any iteration has called Stop().
  • LowestBreakIteration
    • Gets the lowest iteration value that called Break().
  • ShouldExitCurrentIteration
    • Gets whether the current iteration should exit based on requests made by this or other iterations.

The methods let you alter the behavior of other iterations, and the properties let you determine if you should stop your work early based on other iteration requests.  This is typically handy if the work in each iteration lasts pretty long and has points you could check along the way to see if it should abort early.
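As a quick sketch of Break() in action (this example is mine, not from the original post – the numbers are arbitrary), an iteration can cut off all higher values while letting lower values finish:

```csharp
using System;
using System.Threading.Tasks;

public static class BreakSketch
{
    public static void Main()
    {
        Parallel.For(0, 100, (i, state) =>
        {
            if (i == 10)
            {
                // Break() guarantees no iterations above 10 will be started;
                // iterations below 10 still run to completion.
                state.Break();
                return;
            }

            // Higher iterations already in flight can notice the break
            // request and bail out early.
            if (state.LowestBreakIteration.HasValue &&
                i > state.LowestBreakIteration.Value)
            {
                return;
            }

            Console.WriteLine(i);
        });
    }
}
```

Stop() would look the same, except no guarantee is made about lower values – everything not yet running is simply skipped.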

For example, we could rework our word finder to take the ParallelLoopState so that it passes the loop state into our work method.

// find all occurrences of the string "the" in each book.
Parallel.For(0, files.Count, (i, state) =>
    {
        results[i] = FindTermInFile(files[i], "the", state);

        Console.WriteLine("Found \"the\" {0} times in book {1}", results[i], files[i]);
    });

And pass it to the FindTermInFile() method to tell us whether to exit after reading the file or between each find.

public static int FindTermInFile(string fileName, string term, ParallelLoopState state)
{
    int foundCount = 0;

    // read the text of the file into one long string
    string text = File.ReadAllText(fileName);

    if (!state.ShouldExitCurrentIteration)
    {
        // find first occurrence of the term
        int pos = text.IndexOf(term);

        // while we found another one
        while (pos >= 0 && !state.ShouldExitCurrentIteration)
        {
            // increment and find next one until not found (-1)
            ++foundCount;

            pos = text.IndexOf(term, pos + 1);
        }
    }

    return foundCount;
}

Now, if any other iteration breaks or stops, any iterations that are currently running and should halt will stop at the next opportunity instead of running to completion.

Sidebar: Parallel.For() with optional ParallelOptions parameter

Finally, one other option you have when creating a Parallel.For() is to specify a ParallelOptions argument which contains options to tweak how the parallelism is performed:

  • CancellationToken
    • Allows you to specify a token to cancel execution of the Parallel.For() loop.  If triggered, this will cancel all non-started tasks.  In addition, the Action<> delegate may examine this token to determine if it should halt its work early.
  • MaxDegreeOfParallelism
    • This allows you to override what .NET has chosen to be the maximum degree of parallelism it will use in processing your request.  In general, you should let .NET do its thing and not override this.
  • TaskScheduler
    • Lets you specify an alternate task scheduler.

Typically, we don’t mess too much with the TaskScheduler or MaxDegreeOfParallelism and instead rely on .NET to perform the appropriate behaviors.  That said, you can specify alternate schedulers or a different degree of parallelism if you have reason to believe this would improve the performance of your parallel application.  But once again, let me stress: before you alter the .NET defaults for these, measure and profile and make sure there’s a need.
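A minimal sketch of passing ParallelOptions (my own illustration, not from the original post – the loop body and limits are arbitrary) might look like this:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class OptionsSketch
{
    public static void Main()
    {
        var cts = new CancellationTokenSource();
        var options = new ParallelOptions
        {
            CancellationToken = cts.Token,
            MaxDegreeOfParallelism = 2  // usually best left at the default
        };

        try
        {
            Parallel.For(0, 1000, options, i =>
            {
                if (i == 5)
                {
                    cts.Cancel();  // non-started iterations will be cancelled
                }
            });
        }
        catch (OperationCanceledException)
        {
            // Parallel.For() surfaces a triggered token as an
            // OperationCanceledException once running iterations finish.
            Console.WriteLine("Loop was cancelled.");
        }
    }
}
```

Note that cancellation through the token stops scheduling new iterations; iterations already running are allowed to complete (though they may also observe the token themselves and exit early).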

Summary

The Parallel.For() loop is an easy and powerful way to parallelize a sequential for loop where each of the iterations can be processed independently of each other.  This construct uses TPL tasks to execute each iteration’s work, and has many options for controlling how the parallelism is performed.

Like any parallel processing construct, it is best used in situations that can truly benefit from parallel execution.  When in doubt, keep it sequential until you identify a need.

This article is part of the GWB Archives. Original Author: James Michael Hare
