Come on that is easy:
- bool Equals(object other) compares all member variables of another object instance to the current instance.
- bool Equals(T other) does the same thing but expects an object of the same type than our self. This method is defined by the IEquatable<T> interface.
- int GetHashCode() calculates of the internal state of the current object a mostly unique identifier. Mostly because if you have more than 2^32 object states some values must collide.
Most of the time you do not need to bother about these things except if you want to use your object as key in a hash table or if you want to use e. g. the List<T>.Contains(T value) method and you want to check for more than reference equality.
Although it looks like a very easy task to implement object equality and hash methods there are some pitfalls. The most important rule for both methods is that they most not throw any exceptions. Otherwise you make your type unusable in object collections (e.g. Hashtable, ArrayList, List<object>, Dictionary<object, …>, …) where objects of different types are compared against each other.
Nice rule lets violate it:
public override bool Equals(object obj)
{
SimpleType other = (SimpleType)obj;
....
Ups this will cause an InvalidCastException if we ever try to search in an List<object> with SimpleTypes for e.g. some string. Ok many people can live with this limitation but it is so easy to make it work with the as operator:
class SimpleType : IEquatable<SimpleType>
{
public override bool Equals(object obj)
{
return Equals(obj as SimpleType);
}
public bool Equals(SimpleType other)
{
// Reference Equality
if (object.ReferenceEquals(this, other))
return true;
// this cannot be null if other is null we must return false
if (object.ReferenceEquals(other,null))
return false;
return myb == other.myb && mya == other.mya;
}
The comment “// this cannot be null if other is null we must return false” is not really true. If you look at the string.Equals method
public bool Equals(string value)
{
if ((value == null) && (this != null))
…
you find a null check for the this pointer. The reason is that you can call on any class member methods without getting a NullReferenceException until you try to dereference the this pointer the first time. A C# specific thing is that member methods are not called with the call IL instruction but the callvirt instruction which will throw a NullReferenceException when you try to call a method on a null reference. That is interesting for languages like F# which prefers direct calls whenever possible. If speed is a concern it costs you only one instruction
cmp dword ptr [ecx],ecx
This little guy is responsible for you nice NullReferenceException which does nothing else than to compare the this pointer (= address) of your object which is always in the ecx register (at least for .NET) against the memory location it is supposed to point to. For a null pointer we end up trying to read from memory location 0 which is certainly not a valid object address. The memory page at address 0 is not readable which will cause a Windows SEHException (0xC0005 try with Windbg) which is translated by the CLR into our well known NullReferencException.
I had to tell you this because I tend to forget these little details so I can later look it up here. But back to our main topic. Common sources of errors in Equals and GetHashcode. I found a very interesting pattern to implement GetHashCode.
public override int GetHashCode()
{
StringBuilder sb = new StringBuilder();
sb.Append(mya);
sb.Append(myb);
return sb.ToString().GetHashCode();
Ok this does work and it produces a meaningful hash value for your object. It has only the little problem that this method does allocate strings like crazy which will be garbage collected very soon. If you have a Memory Profiler attached to your application and you see % Time in GC nearly 100% then it could be that somebody tried to use our SimpleType within a Dictionary<SimpleType,xxx> which can cause billions of GetHashCode calls within seconds. Your application will then be a nice unit test how fast the .NET garbage collector can be but there will be nearly no CPU cycles left for your application logic.
Another pitfall is to mix up operators for custom types. The == operator is not automatically routed to your instance once you override your Equals method. Instead the default behavior for reference equality does kick in.
class OtherType
{ }
class SimpleType : IEquatable<SimpleType>
{
int mya;
string myb;
OtherType myOther;
public bool Equals(SimpleType other)
{
// Reference Equality
if (object.ReferenceEquals(this, other))
return true;
// this cannot be null if other is null we must return false
if (object.ReferenceEquals(other,null))
return false;
return myb == other.myb &&
mya == other.mya &&
myOther == other.myOther; // Wrong this tests for reference equality only!!
}
The code looks like it does what you meant but it does not. Before using an operator you need always to check if there are any custom operators defined. For strings for example you can compare two null references without getting any exception. But things are different when you try to call the member method Equals. It is also easy to add recursion to your operators by e.g. using the == operator for the null check. If your == operator calls Equals again you need an infinite stack. But these things are not subtle you will find them with unit testing.
So how does working sample look like? Below you find the SimpleType reference implementation which shows how you can add a meaningful hash function for more than one member (Hash codes of other types are combined by the xor operaton ^ which is ok for most cases but there are other approaches possible which make it possible to get a different hash code when different fields are null.
class SimpleType : IEquatable<SimpleType>
{
int mya;
string myb;
public bool Equals(SimpleType other)
{
// Reference Equality
if (object.ReferenceEquals(this, other))
return true;
// this cannot be null if other is null we must return false
if (object.ReferenceEquals(other,null))
return false;
return myb == other.myb &&
mya == other.mya;
}
public override bool Equals(object obj)
{
return Equals(obj as SimpleType);
}
public override int GetHashCode()
{
int ret = mya;
if (myb != null)
ret ^= myb.GetHashCode();
return ret;
}
public static bool operator ==(SimpleType a, SimpleType b)
{
// Enable a == b for null references to return the right value
if (Object.ReferenceEquals(a, b))
return true;
// If one is null and the other not. Remember a==null will lead to Stackoverflow!
if (Object.ReferenceEquals(a,null))
return false;
return a.Equals((object) b);
}
public static bool operator !=(SimpleType a, SimpleType b)
{
return !(a==b);
}
public SimpleType(int a, string b)
{
mya = a;
myb = b;
}
}
Most programmers understand that GetHashCode and hash tables are somehow related. But only few do really understand how they achieve the ultra fast lookup times of O(1). That means that I can check for the existence of a key in constant time regardless how many objects I have stored in my Hashtable. The precondition for this magic is that GetHashCode is fast and produces a uniform distribution of hash values across the full value range of 0 – 2^32. A typical hash table implementation uses an array to to store its objects. When an object needs to be inserted it calculates the hash code and transforms this value to an index to the array. A widely used formula is
idx = hash % length
Where idx is the array index where our object is stored, hash is the hash value of the object to be stored and length is the length of the internal storage array of our hash table. The modulo operation will magically give you an array index which is between 0 and length-1. Then you can store the object at the given index. The picture below illustrates this.

Now it is easy to answer the question why the size of the array of a hash table should be a prime number. Lets suppose it is not a prime number e.g. 16 then the remainder of the modulo operation would return 0 for all hash values which are divisible by 16 hence causing collisions which would force the hash table to switch on the collision resolution logic which is more complex and involves calling Equals to all previously stored objects with the same hash value to find out if the same (=Equals) object was already stored.
The lookup of a hash table takes basically 5 steps:
- Calculate hash of input key e.g. hash “cd” = 5
- Calculate from hash the corresponding array index 5 % 10 = 5
- Calculate the hash value from object at array Index 5: hash “cd” = 5
- If the value does match call Equals to check if it really the same object
- If not we have a hash collision. That can mean we check all further objects in the array with the same hash value until we find a match or stop when we have found an object with a different hash value.
Step 1-3 are the application of our magic formula to calculate the storage index inside our array. The other steps are there to confirm that we did not get only an object with a matching hash code (which you cannot trust to be unique) but we need to compare the key values one by one which have the same hash code until we found it.
There is still much more to say about hash tables such as custom IEqualityComparers to get more speed or a different behavior of your dictionary. It is e.g. quite easy to create a file name dictionary which will throw when you try to add the same file more than once to it. This results in a somewhat imperformant hashing function but for small numbers it can be acceptable.
With a good hashing function you can create many interesting data structures. With .NET 3.5 we did get e.g. the HashSet<T> class which enables a fast check if something was already added to the set or not. It is possible to squeeze more memory out by using a probabilistic data structure called Bloom Filter which has the unique property of constant check time regardless how many items are stored inside it. That coolness comes at the cost that false positives are possible (the bloom filter tells you it has the value stored but it did not) which can be configured during creation of the filter. An implementation in C# can be found at CodePlex.
Bloom Filters are useful for cache managers which need a first fast check if the element is in the cache. If yes we can lookup the value in the cache if no it goes directly to disk. When a false positive is reported we look in the cache in vain but that does not matter since we still can get our data from disk. Googles Big Table for example uses bloom filters for this very purpose very successfully.