I apologize for the code not being indented , I am using word 2007 and no matter what I do it does not seem to want to indent properly (I have tried copy/pasting to notepad, you name it), if you know how to get this working please contact me. See picture of word version http://geekswithblogs.net/images/geekswithblogs_net/gyoung/4418/r_word.JPG
update I manually indented all of it but I am still wanting to know how to avoid this
Everyone seems to enjoy performance posts so …. I saw this question in the advanced .net group and found it fairly interesting.
I have a comma separated list of integer values what's the quickest way to turn this into a integer array.
Being the geek that I am I set off immediately to find out just how quick I could make an algorithm fly.
That said let’s start out with the most obvious answer, we will split the string and call int.Parse() on the string elements. We can use this code to benchmark our other results against.
|
static int[] Split(string _Numbers) { string[] pieces = _Numbers.Split(','); int[] ret = new int[pieces.Length]; for (int i = 0; i < pieces.Length; i++) { ret[i] = int.Parse(pieces[i]); } return ret; } |
Next we will need some data to feed into the routine. I chose "1,2,3,4,5,6,7,8,9,10,121,1000,10000" as my testing data. Running through this data 1,000,000 times takes a total of 4.39 seconds on my system (in release mode). I think that we can do better than this!!
Using an old trick I figured that I would just make the string a char [] and iterate through it taking my current number and subtracting its char code from ‘0’ (it just so happens that ‘9’ – ‘0’ = 9 how convenient! Applying this methodology leaves us with the following code.
|
static int[] Iterate(string _Numbers, ref int _Count) { int[] buffer = new int[4]; char[] chars = _Numbers.ToCharArray(); _Count = 0; int holder = 0; for (int i = 0; i < chars.Length; i++) { if (chars[i] == ',') { buffer[_Count] = holder; holder = 0; _Count++; if (_Count == buffer.Length) { int[] tmp = buffer; buffer = new int[tmp.Length * 2]; Buffer.BlockCopy(tmp, 0, buffer, 0, tmp.Length * 4); } } else { holder = holder * 10 + chars[i] - '0'; } } return buffer; } |
This code also has some oddities involved with it since it does not know initially the size of the int [] that it needs to pass back. In order to support this, it grows it’s int [] as it needs to (by doubling). This can be an expensive operation so avoiding it is best. Also since it is doubling its array, it has a new parameter Count which it uses to return the total number of elements in the array it returns (it may return a 32 element array that only uses 18 elements).
As for performance, the exact code above will handle the same data as our first test in <> .5 seconds on my machine with a buffer size of 4. Not bad 10% of our first try! To show the importance of the buffer though, if we make the initial size 16 the code finishes in .4 seconds!
There are still some areas we can optimize though. Keep in mind that this code is creating 1,000,000 char []. This is a pretty expensive operation, by using unsafe code we can avoid doing this. Here is the code
|
static unsafe int[] Unsafe1(string _Numbers, ref int _Count) { int[] buffer = new int[64]; _Count = 0; int total = _Numbers.Length; int holder = 0; fixed (char* a = _Numbers) { char* c = a; while (total > 0) { if (*c == ',') { buffer[_Count] = holder; holder = 0; _Count++; if (_Count == buffer.Length) { int[] tmp = buffer; buffer = new int[tmp.Length * 2]; Buffer.BlockCopy(tmp, 0, buffer, 0, tmp.Length * 4); } } else { holder = holder * 10 + *c - '0'; } c++; total--; } } return buffer; } |
Again we use the same buffering mechanism that we used on the last entry. The main difference here is that instead of creating a char [] we use unsafe code to iterate through the string. With a buffer size of 16, this code runs through the 1,000,000 iterations in .37.
In running these, many of you may notice that for small bits of data that the iterate function may actually run faster than the unsafe code. This can be the case but as you add more data, the gap will grow larger in the favor of the unsafe code (it also uses less memory as it does not duplicate the memory from the original string).
Use the framework Luke!
We have had one problem thus far with our code, its ugly. Having to get back an array then use a separate counter from the length of the array to loop is .. well ugly. As an example consider the following.
|
int foo = 0; int [] values = Iterate(numbers, ref foo);
for (int i = 0; i < foo; i++) {
}
//CANT DO
for (int i = 0; i < values.Length; i++) {
} |
Personally I prefer the second way of doing this was it makes more sense to programmers. Luckily generics in the framework can do just this for us. By changing our return type from an int [] to a generic list we can keep most of the speed and offer a better interface.
|
static unsafe List<Int32> UseGeneric(string _Numbers) { List<int> ret = new List<int>(64); int total = _Numbers.Length; int tmp = 0;
fixed (char* a = _Numbers) { char* c = a; while (total > 0) { if (*c == ',') { ret.Add(tmp); tmp = 0; } else { tmp = tmp * 10 + *c - '0'; } total--; c++; } } return ret; } |
This is a much nicer interface than what we had before and is still very performant. 1,000,000 iterations with this takes approximately the same time as out unsafe example! The reason this works about the same as our first example, is because it does pretty much the same thing internally with it’s buffer as what we were implementing on our own (of course, it does a much nicer job encapsulating it than we were so this is probably the best overall solution). Basically as you add items, it also doubles it's internal array size then copies over the old data in order to allow you to continue adding data.
Another solution which was offered up in the discussion by Ernst Kuschke is syntactical sugar (although not very performant) so I figured it was definitely worth including simply for its elegance (and it should probably be a good general method for doing this. Sugar like this is always more maintainable, converting this to support doubles or another type would be a lot easier than our other examples.
|
public static int[] doTheThing(string commadelimitedInts) { return Array.ConvertAll<string, int>(commadelimitedInts.Split(new string[] { "," }, StringSplitOptions.RemoveEmptyEntries), new Converter<string, int>(intToString)); }
public static int intToString(string strInt) { int i; int.TryParse(strInt, out i); return i; } |
Ernst takes advantage of the ConvertAll method to produce a very short bit of code that does the task. It has performance characteristics similar to our baseline as well (while being significantly shorter). If performance is not you main goal (it is quite rare that we are doing millions of theses transforms in an application), this is definitely the way to go as it is quite maintainable.
Update: Ernst's friend Piers came up with an even more elegant method of calling this
|
public static int[] doTheThing(string commadelimitedInts) { return Array.ConvertAll<string, int>(commadelimitedInts.Split(new string[] { "," }, StringSplitOptions.RemoveEmptyEntries), new Converter<string, int>(System.Convert.ToInt32)); } |
The elegance of this solution is definitely apparent!
Now to put our various solutions to the real test I have made a big string [1..1000] that we will run through 1,000,000 times. All algorithms that use array growing will be seeded with an initial buffer size of 64 to make things fair between them while also forcing their weakness of having to grow the buffer a few times.
|
Algorithm |
Execution Time |
|
Split (Base Line) |
00:05:42:703 |
|
Syntactic Sugar with Delegate |
00:06:36:703 |
|
Iteration |
00:00:22.312 |
|
Unsafe |
00:00:16.90625 |
|
Unsafe with generic return |
00:00:18.937 |
Analysis
As you can see performance for our base line (simple split) and the delegate method completely degrade under larger values as one would expect. As I mentioned earlier, the unsafe methods squeaked ahead of the iteration due to not having to create the char array. Of course, the real winner here is the generic return coming in second who makes our code readable, and extremely fast.
Can you come up with an algorithm for this problem? Post it here!