Geeks With Blogs
Greg Young Greg.ToString()

I apologize for the code not being indented , I am using word 2007 and no matter what I do it does not seem to want to indent properly (I have tried copy/pasting to notepad, you name it), if you know how to get this working please contact me. See picture of word version http://gwb.blob.core.windows.net/gyoung/4418/r_word.JPG

 

update I manually indented all of it but I am still wanting to know how to avoid this

 

Everyone seems to enjoy performance posts so …. I saw this question in the advanced .net group and found it fairly interesting.

I have a comma separated list of integer values what's the quickest way to turn this into a integer array.

 

Being the geek that I am I set off immediately to find out just how quick I could make an algorithm fly.

That said let’s start out with the most obvious answer, we will split the string and call int.Parse() on the string elements. We can use this code to benchmark our other results against.

 

static int[] Split(string _Numbers) {
  string[] pieces = _Numbers.Split(',');
  int[] ret = new int[pieces.Length];
  for (int i = 0; i < pieces.Length; i++) {
    ret[i] = int.Parse(pieces[i]); 
  } 
  return ret;
}

 

Next we will need some data to feed into the routine. I chose "1,2,3,4,5,6,7,8,9,10,121,1000,10000" as my testing data. Running through this data 1,000,000 times takes a total of 4.39 seconds on my system (in release mode). I think that we can do better than this!!

Using an old trick I figured that I would just make the string a char [] and iterate through it taking my current number and subtracting its char code from ‘0’ (it just so happens that ‘9’ – ‘0’ = 9 how convenient! Applying this methodology leaves us with the following code.

static int[] Iterate(string _Numbers, ref int _Count) {
  int[] buffer = new int[4];
  char[] chars = _Numbers.ToCharArray();
  _Count = 0; 
  int holder = 0; 
  for (int i = 0; i < chars.Length; i++) {
    if (chars[i] == ',') {
      buffer[_Count] = holder;
      holder = 0; 
      _Count++;
      if (_Count == buffer.Length) {
        int[] tmp = buffer; 
        buffer = new int[tmp.Length * 2]; 
        Buffer.BlockCopy(tmp, 0, buffer, 0, tmp.Length * 4);
      }
    } else {
      holder = holder * 10 + chars[i] - '0'
    }
  } 
  return buffer;
}

 

This code also has some oddities involved with it since it does not know initially the size of the int [] that it needs to pass back. In order to support this, it grows it’s int [] as it needs to (by doubling). This can be an expensive operation so avoiding it is best. Also since it is doubling its array, it has a new parameter Count which it uses to return the total number of elements in the array it returns (it may return a 32 element array that only uses 18 elements).

As for performance, the exact code above will handle the same data as our first test in <> .5 seconds on my machine with a buffer size of 4. Not bad 10% of our first try! To show the importance of the buffer though, if we make the initial size 16 the code finishes in .4 seconds!

There are still some areas we can optimize though. Keep in mind that this code is creating 1,000,000 char []. This is a pretty expensive operation, by using unsafe code we can avoid doing this. Here is the code

static unsafe int[] Unsafe1(string _Numbers, ref int _Count) {
  int[] buffer = new int[64];
  _Count = 0;
  int total = _Numbers.Length;
  int holder = 0;
  fixed (char* a = _Numbers) {
    char* c = a;
    while (total > 0) { 
      if (*c == ',') {
        buffer[_Count] = holder;
        holder = 0;
        _Count++;
        if (_Count == buffer.Length) {
          int[] tmp = buffer;
          buffer = new int[tmp.Length * 2];
          Buffer.BlockCopy(tmp, 0, buffer, 0, tmp.Length * 4);
        }
      } else
        holder = holder * 10 + *c - '0';
      } 
      c++;
      total--;
    }
  }
  return buffer;
}

 

Again we use the same buffering mechanism that we used on the last entry. The main difference here is that instead of creating a char [] we use unsafe code to iterate through the string. With a buffer size of 16, this code runs through the 1,000,000 iterations in .37.

In running these, many of you may notice that for small bits of data that the iterate function may actually run faster than the unsafe code. This can be the case but as you add more data, the gap will grow larger in the favor of the unsafe code (it also uses less memory as it does not duplicate the memory from the original string).

Use the framework Luke!

We have had one problem thus far with our code, its ugly. Having to get back an array then use a separate counter from the length of the array to loop is .. well ugly. As an example consider the following.

int foo = 0;
int [] values = Iterate(numbers, ref foo);

for (int i = 0; i < foo; i++) {

}

//CANT DO

for (int i = 0; i < values.Length; i++) {

}

 

Personally I prefer the second way of doing this was it makes more sense to programmers. Luckily generics in the framework can do just this for us. By changing our return type from an int [] to a generic list we can keep most of the speed and offer a better interface.

static unsafe List<Int32> UseGeneric(string _Numbers) {
  List<int> ret = new List<int>(64);
  int total = _Numbers.Length;
  int tmp = 0;

  fixed (char* a = _Numbers) {
    char* c = a;
    while (total > 0) {
      if (*c == ',') {
        ret.Add(tmp);
        tmp = 0; 
      } else {
        tmp = tmp * 10 + *c - '0';
      }
      total--;
      c++; 
    } 
  }
  return ret;
}

 

This is a much nicer interface than what we had before and is still very performant. 1,000,000 iterations with this takes approximately the same time as out unsafe example! The reason this works about the same as our first example, is because it does pretty much the same thing internally with it’s buffer as what we were implementing on our own (of course, it does a much nicer job encapsulating it than we were so this is probably the best overall solution). Basically as you add items, it also doubles it's internal array size then copies over the old data in order to allow you to continue adding data.

 

Another solution which was offered up in the discussion by Ernst Kuschke is syntactical sugar (although not very performant) so I figured it was definitely worth including simply for its elegance (and it should probably be a good general method for doing this. Sugar like this is always more maintainable, converting this to support doubles or another type would be a lot easier than our other examples.

public static int[] doTheThing(string commadelimitedInts) {
  return Array.ConvertAll<string, int>(commadelimitedInts.Split(new string[] { "," }, StringSplitOptions.RemoveEmptyEntries), new Converter<string, int>(intToString));
}

public static int intToString(string strInt) {
  int i;
  int.TryParse(strInt, out i);
  return i;
}

Ernst takes advantage of the ConvertAll method to produce a very short bit of code that does the task. It has performance characteristics similar to our baseline as well (while being significantly shorter). If performance is not you main goal (it is quite rare that we are doing millions of theses transforms in an application), this is definitely the way to go as it is quite maintainable.

Update: Ernst's friend Piers came up with an even more elegant method of calling this

public static int[] doTheThing(string commadelimitedInts) {
  return Array.ConvertAll<string, int>(commadelimitedInts.Split(new string[] { "," }, StringSplitOptions.RemoveEmptyEntries), new Converter<string, int>(System.Convert.ToInt32));
}

The elegance of this solution is definitely apparent!

Now to put our various solutions to the real test I have made a big string [1..1000] that we will run through 1,000,000 times. All algorithms that use array growing will be seeded with an initial buffer size of 64 to make things fair between them while also forcing their weakness of having to grow the buffer a few times.

Algorithm

Execution Time

Split (Base Line)

00:05:42:703

Syntactic Sugar with Delegate

00:06:36:703

Iteration

00:00:22.312

Unsafe

00:00:16.90625

Unsafe with generic return

00:00:18.937

 

Analysis

As you can see performance for our base line (simple split) and the delegate method completely degrade under larger values as one would expect. As I mentioned earlier, the unsafe methods squeaked ahead of the iteration due to not having to create the char array. Of course, the real winner here is the generic return coming in second who makes our code readable, and extremely fast.

Can you come up with an algorithm for this problem? Post it here!

 

 

Posted on Monday, May 29, 2006 9:56 PM Under the covers | Back to top


Comments on this post: String to Int []

# re: String to Int []
Requesting Gravatar...
My parser does only need about 0,11s to iterate in my debugger over your input string 1 million times. The magic word here is: Caching ;-)

class Parser
{
Dictionary<int, List<int>> myParsedList = new Dictionary<int, List<int>>();
char[] Seps = { ',' };

public List<int> Parse(string line)
{
List<int> ret;
myParsedList.TryGetValue(line.GetHashCode(),out ret);

if( ret == null )
{
ret = new List<int>();
foreach (string number in line.Split(Seps, StringSplitOptions.RemoveEmptyEntries))
{
ret.Add(Convert.ToInt32(number));
}
myParsedList[line.GetHashCode()] = ret;
}

return ret;
}
}
Left by Alois Kraus on May 30, 2006 2:26 AM

# re: String to Int []
Requesting Gravatar...
heh you're right I should habe made it random strings ;)
Left by Greg Young on May 30, 2006 7:19 AM

# String to int[]
Requesting Gravatar...
There's been a cool (I found it interesting) discussion on converting a comma-delimited string of int...
Left by Ernst Kuschke on May 31, 2006 11:20 AM

# re: String to Int []
Requesting Gravatar...
This function is about 35% faster than your fastest function without caching. By the way your implementation does miss the last number of the array. You should fix that. I guess then I am 40 % faster ;-))))

Yours,

Alois Kraus

public unsafe int [] Parse(string line)
{
int numCount = 1; // the last number does also deservere some memory
fixed (char* p = line)
{
// calculate buffer size by counting all , in array
for (int i = line.Length-1; i >= 0; i--)
{
if (*(p + i) == ',')
numCount++;
}

int [] ret = new int[numCount]; // allocate return array

char *pCurChar = p + line.Length - 1;
fixed(int *Numbers = &ret[0]) // access the return array directly
{
int tmp=0;
int digit=1;
for (int i = line.Length-1; i >= 0; i--)
{
if (*pCurChar == ',') // copy number to return array
{
*(Numbers+numCount-1) = tmp;
numCount--;
tmp = 0;
digit = 1;
}
else // calculate number a bit further
{
tmp = tmp + (*pCurChar - '0') * digit;
digit = digit * 10;
}

pCurChar--; // get next character
}

*Numbers = tmp; // first digit is still missing
}

return ret;

}
Left by Alois Kraus on Jun 02, 2006 5:35 PM

# re: String to Int []
Requesting Gravatar...
Nice catch on the missing 2 lines of code. Nice improvments too, I had considerred doing this but figured the initial iteration would be too expensive.

I am not getting the same results on you for speed though .. did you do something like set the initial buffer size really small (like 1)? I am getting 16ish seconds for the unsafe above and 14.3ish for yours for the test I mentioned above.

Cheers,

Greg
Left by Greg Young on Jun 03, 2006 12:19 AM

# re: String to Int []
Requesting Gravatar...
Although now that I think about it we had some differences in speed on the last as well .. perhaps differring JIT optimizations?
Left by Greg Young on Jun 03, 2006 12:21 AM

# re: String to Int []
Requesting Gravatar...
I did use for my test your input string "1,2,3,4,5,6,7,8,9,10,121,1000,10000" and iterated over it 1 million times. I am not doing any buffer reallocations since I calculate the correct buffer size in adavance by counting the "," chars + 1 which I then alloacte at once. No reallocations necessary. Does your test input string consist of only one number? And yes if we both are very close then we can this JITer issue again where it depends on your PC.

Yours,
Alois Kraus
Left by Alois Kraus on Jun 03, 2006 1:13 AM

# re: String to Int []
Requesting Gravatar...
The really shorted solution to do it is the following one:

public static int [] ConvertToIntList(string line)
{
return Array.ConvertAll<string, int>(line.Split(",".ToCharArray()), int.Parse);
}

Yours,
Alois Kraus
Left by Alois Kraus on Jun 08, 2006 7:30 AM

# Indenting code
Requesting Gravatar...
Regarding your code not indenting, it is probably because it is using <tab> characters instead of <space> characters.

In order to rectify it, you could do a search/replace in either word or notepad, change <tab> to any number of spaces (you've used 2).

To get the <tab> character into the search/replace box, you will have to copy and paste it in.

If you are using spaces and it just isn't working then I'm as puzzled as you.
Left by Vincent on Dec 05, 2007 1:49 AM

# re: String to Int []
Requesting Gravatar...
You should always explicitly specify the radix to be safe:

var yourInt;

yourInt = parseInt(str,10);

In this case you won’t have problems with strings that can be interpreted as octal numbers.

From the javascript specification:

If the radix parameter is omitted, JavaScript assumes the following:

If the string begins with “0x”, the radix is 16 (hexadecimal)

If the string begins with “0″, the radix is 8 (octal). This feature is deprecated

If the string begins with any other value, the radix is 10 (decimal)

Not all browsers implement the specification in the same way, so if you want your code to work consistently everywhere you shouldn’t omit the radix.

Regards,
Mcx Gold Tips
Left by Gold Tips on Jan 19, 2011 3:11 AM

Your comment:
 (will show your gravatar)


Copyright © Greg Young | Powered by: GeeksWithBlogs.net