Paulo Reichert

  Home  |   Contact  |   Syndication    |   Login
  5 Posts | 0 Stories | 0 Comments | 0 Trackbacks

News

Archives

Post Categories

Thursday, December 17, 2009 #

This blog is supposed to be technical only, but as a life-long dream has come true and the festival of the century has been announced I couldn’t ignore.

Sonisphere is great. I’ve been there this year and loved it. Metallica played a kick-ass concert. Next year, however, they’ve decided to raise the bar and got probably the most monstrous  line-up I have seen in a single concert together.

This is what they’ve got confirmed so far for the UK:

On top of that they are giving free tickets to the opening concert in Warsaw, Poland, on the 26/06. Check that out:

That’s definitely going to be the highlight of my events calendar for the next year for sure.

Rock On!!! :)


Thursday, December 10, 2009 #

If you have read the code I have posted previously in Covariance Matrix Calculation in .NET, the method to transpose a matrix MxN was quite simple. It basically created a new double[,] to represent a NxM matrix and then copied the data from the first to the second matrix in the corresponding positions.

That method is straightforward to implement and is necessary when you’re storing a matrix as a double[,] rather than a double[] (single dimension).

If you are, however, transposing a square matrix, then you can store the matrix as a one-dimensional array and transpose it in place by moving the values in the array around. This improves memory locality and for square matrices can speed up the transposing quite considerably.

Recap – Not in place

I have slightly changed the code that transposes the matrix in a new copy to use Ints. It looks like the following:

   1: public static int[,] Transpose(int[,] matrix)
   2: {
   3:     var x = matrix.GetLength(0);
   4:     var y = matrix.GetLength(1);
   5:     var result = new int[y, x];
   6:  
   7:     for (int i = 0; i < x; i++)
   8:         for (int j = 0; j < y; j++)
   9:             result[j, i] = matrix[i, j];
  10:  
  11:     return result;
  12: }

In-Place Square Matrix

For square matrices you can use the following method:

   1: public static void InSituSquare(int[] matrix, int m, int n)
   2: {
   3:     for (var i = 0; i < n - 1; i++)
   4:         for (var j = i + 1; j < n; j++)
   5:         {
   6:             int temp = matrix[i * n + j];
   7:             matrix[i * n + j] = matrix[i + n * j];
   8:             matrix[i + n * j] = temp;
   9:         }
  10: }

 

Conclusion

Using my not very scientific methods to test performance, for matrices of 200x200 the in-situ method for square matrices is about 85% faster than the one copying the array. By 800x800 matrices it is about 100% faster and starts to go over 100% faster as we increase the matrix sizes. Also the memory consumption of the application is a lot smaller.

Some of the performance difference may be due to indexing performance in C#. I haven’t tested using pointers for the two above methods to skip the C# indexer code. That said, even in languages like C++, depending on the hardware in use the in-situ transpose can prove quite a bit faster.

For non-square matrices the situation is quite different. I have seen a number of different implementations and most of them were slower than just copying the data into a second array due to the complexity of the operation. The number of value swaps tend to be far bigger than when using a second array. I have found some implementations that are plain ridiculous in the number of cycles and swaps they need to accomplish the work, making them fairly pointless unless you really need to save memory.

I’m still researching some cache-optimized non-square in place matrix transpose and will post more about it at a later date.

PS: Note code for transposing not in place will work with both square and non-square matrices, but the second one will work only with square ones.


Monday, December 07, 2009 #

I have been doing some research recently in estimation methods for time series and related data and have come across the K – nearest neighbours method that uses the distance between the variable we want to estimate and the other variables available and works out the K closest ones. Then it uses a weighted average of the values for the K nearest variables to infer the value of the variable we are missing data.

This method has been used in medical research and has proven to have a good performance so I wanted to implement it in a sample application to test with some other data samples and evaluate how efficient it is.

Here I outline how to implement it in C# and provide a sample application that you can use to test it.

Theory

The theory behind KNN is that if you get the closest neighbours to a variable X and use their values, weighted by their distance to X, you can estimate a reasonable value for X.

The first thing we need to do is to decide on a distance measure. We can use several options, but I decided to use the Euclidean distance that is defined as follows:

eucl

As the distance is dependant on the intervals between the values of the variables so it’s recommendable that they are scaled to reduce the bias on value spikes. You can use the log(x) for that, where x is the unscaled value.

Now that we have the distance, we can rank them and pick the K closest variables. We can then use their values weighted by their distance to estimate the value missing using the method below:

knn

Where Wi is obtained by:

WiDef

Delta is:

DeltaDef  

And Psii is the Euclidian distance.

Implementation

First we define a class to represent the data, just to make some of the data manipulation easier:

public class Study
{
    public int SpeciesNumber { get; set; }
    public float PetalWidth { get; set; }
    public float PetalLength { get; set; }
    public float SepalWidth { get; set; }
    public float SepalLength { get; set; }
    public string SpeciesName { get; set; }
    public float[] GetValues()
    {
        return new[] { PetalWidth, PetalLength, SepalWidth, SepalLength };
    }
    public override string ToString()
    {
        return
            string.Format(
                "Study:\n\tSpecies Number: {0}\n\tPetal Width: {1}\n\tPetal Length: {2}\n\tSepal Width: {3}\n\tSepal Length: {4}\n\tSpecies Name: {5}\n",
                SpeciesNumber, PetalWidth, PetalLength, SepalWidth, SepalLength, SpeciesName);
    }
}

Note that I have implemented a utility method in that class to get all the values as an array of floats and overridden ToString() to get the string representation of the entity.

The following method calculates the Euclidean distance between the various variables and returns the K nearest:

private static List<KeyValuePair<float[], double>> GetNearestStudies(float[] y, int k)
{
    var dists = new Dictionary<float[], double>();
    var studies = GetStudies();
    for (int i = 0; i < studies.Count; i++)
    {
        var x = studies[i].GetValues();
        var temp = 0d;
        for (int j = 0; j < 4; j++)
        {
            if (float.IsNaN(x[j]) || float.IsNaN(y[j]))
                continue;
            temp += Math.Pow(Math.Log(y[j]) - Math.Log(x[j]), 2);
        }
        dists.Add(studies[i].GetValues(), Math.Sqrt(temp));
    }
    var sorted = dists.OrderBy(kp => kp.Value);
    return sorted.Take(k).ToList();
}

The following method estimates the value of a variable based on the samples and distances specified:

public static float EstimateValue(float[] samples, double[] distances)
{
    var delta = distances.Sum();
    var y = 0d;
    for (int i = 0; i < samples.Length; i++)
    {
        var wi = distances[i] * delta;
        y += wi * samples[i];
    }
    return (float)y;
}

The following method puts it all together to create a new study and fill any gaps:

public static Study CreateStudy(float petalWidth, float petalLength, float sepalWidth, float sepalLength)
{
    var values = new[] { petalWidth, petalLength, sepalWidth, sepalLength };
    var nearest = GetNearestStudies(values, 3);
    for (int i = 0; i < values.Length; i++)
        if (float.IsNaN(values[i]))
        {
            if (nearest.Select(kp => kp.Value).Contains(0))
            {
                values[i] = nearest.Where(kp => kp.Value == 0).Select(kp => kp.Key[i]).FirstOrDefault();
                continue;
            }
            values[i] = EstimateValue(nearest.Select(kp => kp.Key[i]).ToArray(),
                                      nearest.Select(kp => kp.Value).ToArray());
        }
    var newStudy = new Study
    {
        SpeciesNumber = 1,
        PetalWidth = values[0],
        PetalLength = values[1],
        SepalWidth = values[2],
        SepalLength = values[3],
        SpeciesName = "Sestosa"
    };
    return newStudy;
}

Disclaimer

Please note that the code above does not represent production quality code nor best practices. I would not recommend you to use the code above in a production system as is, but you should use it as a reference to implement a more robust and scalable version.

Conclusion

During my research in estimation methods I came across a few and they all suffer from certain deficiencies. In this method’s case what I don’t like is the fact that it does not account for direction of correlation, therefore variables that change inversely to the variable being completed will still play a positive role in the estimation.

This is very useful though on estimating values close to other inputs. An example of that is similar searches where by using the euclidean distance one can work out what words would be close enough to the user input and therefore increase the accuracy of the results. It’s also widely used in several statistical studies.

There is another method based on covariance of the variables that accounts for the direction of the relationship, but the method itself seems to be a bit rough and I still haven’t managed to validate its efficiency. When I get something conclusive for that I will post it here.

Other versions

I have also implemented this code in Matlab. If anyone wants to see that please drop a comment and I will post it in here as well.

Acknowledgements

I have re-used concepts and ideas from various sources in this articles and I would like to acknowledge them here:


Saturday, December 05, 2009 #

Well... Hello.

People that know me know that back in 2005-2006 when I used to work for a consultancy company I used to have a technical blog and it was useful not only to be in touch with other developers but also to post code that was useful to me even over and over. You can still find that old blog in http://consultingblogs.emc.com/pauloreichert/.

After I left that company I spent some crazy 18 months developing one of the coolest systems I have ever worked on for a big American futures broker and there I used WCF, WF, WPF and Linq to their limits and then had a hiatus in a managerial position in another smaller financial company, but now I'm back in the technical arena doing some cool quantitative finance work in .NET and have lots of things I think I could contribute to the community.

That's why I started this blog here. No place better than GeeksWithBlogs.net to host my new blog, considering that many blogs that I read on a regular basis are in this community.

As part of my work I'm doing lots of maths with .NET, C++ and Matlab so I will post about all three languages.

Well, thanks for reading and hopefully I will see your comments on some of my posts.

- Paulo


UPDATE: Please note that this has got nothing to do with type covariance/contravariance new in .NET 4.0.

This is one of the basics but a quick search on the web for code to do this didn't produce any results. The only thing I found was a paid library for matrix operations that although being very good, was a bit too expensive if all I needed was a covariance calculation. I decided then to write my own algorithm for that and publish it here so that I can have it available next time I need it and anybody else wanting something similar can get this and customise to their needs.

From Wikipedia:
"In statistics and probability theory, the covariance matrix or dispersion matrix is a matrix of covariances between elements of a random vector. It is the natural generalization to higher dimensions of the concept of the variance of a scalar-valued random variable."

A covariance matrix is calculated for a matrix NxP, where P is the number of variables and N is the number of observations for each variable. The definition of covariance is given by the following equation:

 cov

The best technique I found to calculate the covariance matrix for a matrix A is:

  1. Subtract the means for each variable in A
  2. Multiply: AT*A
  3. Normalize A/N-1

Breaking down the code in the steps outlined above, we get first the subtraction, which is quite simple:

private unsafe static double[,] SubtractMeans(double[,] matrix) 
{ 
  var x = matrix.GetLength(0); 
  var y = matrix.GetLength(1); 
  var result = new double[x, y]; 
  fixed (double* mp = matrix) 
  fixed (double* rp = result) 
  { 
    for (int i = 0; i < y; i++) 
    { 
      var tmp = 0d; 
      for (int j = 0; j < x; j++) 
        tmp += mp[j * y + i]; 
      var mean = tmp / x; 
      for (int j = 0; j < x; j++) 
        rp[j * y + i] = mp[j * y + i] - mean; 
    } 
  } 
  
  return result; 
}

Then we transpose the matrix. A(T)(i,j) == A(j,i) so we can just create a new matrix of PxN and swap the indexers around:

public static double[,] Transpose(double[,] matrix) 
{ 
  var x = matrix.GetLength(0); 
  var y = matrix.GetLength(1); 
  var result = new double[y, x]; 
  for (int i = 0; i < x; i++) 
    for (int j = 0; j < y; j++) 
      result[j, i] = matrix[i, j]; 
  return result; 
} 

Now the multiplication:

public unsafe static double[,] Multiply(double[,] matrix1, double[,] matrix2)
{
    var x = matrix1.GetLength(0);
    var y = matrix2.GetLength(1);
    var z = matrix1.GetLength(1);
    if (matrix1.GetLength(1) != matrix2.GetLength(0))
        throw new InvalidOperationException("Can't multiply");
    var result = new double[x, y];
    fixed (double* p1 = matrix1)
    fixed (double* p2 = matrix2)
    fixed (double* p3 = result)
    {
        for (int i = 0; i < x; i++)
            for (int j = 0; j < y; j++)
            {
                double tmp = p3[i * y + j];
                for (int k = 0; k < x; k++)
                    tmp += p1[i * z + k] * p2[k * y + j];
                p3[i * y + j] = tmp;
            }
    }
    return result;
} 

And finally we put it all together with a Covariance method that call the previous 3 methods and normalizes the resulting matrix.

public unsafe static double[,] Covariance(double[,] matrix)
{
    var n = SubtractMeans(matrix);
    var t = Transpose(n);
    var m = Multiply(t, n);
    var x = m.GetLength(0);
    var y = m.GetLength(1);
    fixed (double* mp = m)
    {
        for (int i = 0; i < x; i++)
            for (int j = 0; j < y; j++)
            {
                var tmp = mp[i * x + j];
                tmp /= y - 1;
                mp[i * x + j] = tmp;
            }
    }
    return m;
} 

Presto.

 

Please note that the code presented in this post uses unsafe pointers to iterate through the matrices because it gives better performance compared to using fully managed code.