I have been doing some research recently in estimation methods for time series and related data and have come across the K – nearest neighbours method that uses the distance between the variable we want to estimate and the other variables available and works out the K closest ones. Then it uses a weighted average of the values for the K nearest variables to infer the value of the variable we are missing data.
This method has been used in medical research and has proven to have a good performance so I wanted to implement it in a sample application to test with some other data samples and evaluate how efficient it is.
Here I outline how to implement it in C# and provide a sample application that you can use to test it.
Theory
The theory behind KNN is that if you get the closest neighbours to a variable X and use their values, weighted by their distance to X, you can estimate a reasonable value for X.
The first thing we need to do is to decide on a distance measure. We can use several options, but I decided to use the Euclidean distance that is defined as follows:
As the distance is dependant on the intervals between the values of the variables so it’s recommendable that they are scaled to reduce the bias on value spikes. You can use the log(x) for that, where x is the unscaled value.
Now that we have the distance, we can rank them and pick the K closest variables. We can then use their values weighted by their distance to estimate the value missing using the method below:
Where
is obtained by:

is:
And
is the Euclidian distance.
Implementation
First we define a class to represent the data, just to make some of the data manipulation easier:
public class Study
{
public int SpeciesNumber { get; set; }
public float PetalWidth { get; set; }
public float PetalLength { get; set; }
public float SepalWidth { get; set; }
public float SepalLength { get; set; }
public string SpeciesName { get; set; }
public float[] GetValues()
{
return new[] { PetalWidth, PetalLength, SepalWidth, SepalLength };
}
public override string ToString()
{
return
string.Format(
"Study:\n\tSpecies Number: {0}\n\tPetal Width: {1}\n\tPetal Length: {2}\n\tSepal Width: {3}\n\tSepal Length: {4}\n\tSpecies Name: {5}\n",
SpeciesNumber, PetalWidth, PetalLength, SepalWidth, SepalLength, SpeciesName);
}
}
Note that I have implemented a utility method in that class to get all the values as an array of floats and overridden ToString() to get the string representation of the entity.
The following method calculates the Euclidean distance between the various variables and returns the K nearest:
private static List<KeyValuePair<float[], double>> GetNearestStudies(float[] y, int k)
{
var dists = new Dictionary<float[], double>();
var studies = GetStudies();
for (int i = 0; i < studies.Count; i++)
{
var x = studies[i].GetValues();
var temp = 0d;
for (int j = 0; j < 4; j++)
{
if (float.IsNaN(x[j]) || float.IsNaN(y[j]))
continue;
temp += Math.Pow(Math.Log(y[j]) - Math.Log(x[j]), 2);
}
dists.Add(studies[i].GetValues(), Math.Sqrt(temp));
}
var sorted = dists.OrderBy(kp => kp.Value);
return sorted.Take(k).ToList();
}
The following method estimates the value of a variable based on the samples and distances specified:
public static float EstimateValue(float[] samples, double[] distances)
{
var delta = distances.Sum();
var y = 0d;
for (int i = 0; i < samples.Length; i++)
{
var wi = distances[i] * delta;
y += wi * samples[i];
}
return (float)y;
}
The following method puts it all together to create a new study and fill any gaps:
public static Study CreateStudy(float petalWidth, float petalLength, float sepalWidth, float sepalLength)
{
var values = new[] { petalWidth, petalLength, sepalWidth, sepalLength };
var nearest = GetNearestStudies(values, 3);
for (int i = 0; i < values.Length; i++)
if (float.IsNaN(values[i]))
{
if (nearest.Select(kp => kp.Value).Contains(0))
{
values[i] = nearest.Where(kp => kp.Value == 0).Select(kp => kp.Key[i]).FirstOrDefault();
continue;
}
values[i] = EstimateValue(nearest.Select(kp => kp.Key[i]).ToArray(),
nearest.Select(kp => kp.Value).ToArray());
}
var newStudy = new Study
{
SpeciesNumber = 1,
PetalWidth = values[0],
PetalLength = values[1],
SepalWidth = values[2],
SepalLength = values[3],
SpeciesName = "Sestosa"
};
return newStudy;
}
Disclaimer
Please note that the code above does not represent production quality code nor best practices. I would not recommend you to use the code above in a production system as is, but you should use it as a reference to implement a more robust and scalable version.
Conclusion
During my research in estimation methods I came across a few and they all suffer from certain deficiencies. In this method’s case what I don’t like is the fact that it does not account for direction of correlation, therefore variables that change inversely to the variable being completed will still play a positive role in the estimation.
This is very useful though on estimating values close to other inputs. An example of that is similar searches where by using the euclidean distance one can work out what words would be close enough to the user input and therefore increase the accuracy of the results. It’s also widely used in several statistical studies.
There is another method based on covariance of the variables that accounts for the direction of the relationship, but the method itself seems to be a bit rough and I still haven’t managed to validate its efficiency. When I get something conclusive for that I will post it here.
Other versions
I have also implemented this code in Matlab. If anyone wants to see that please drop a comment and I will post it in here as well.
Acknowledgements
I have re-used concepts and ideas from various sources in this articles and I would like to acknowledge them here: