Once again, in this series of posts I look at the parts of the .NET Framework that may seem trivial, but can help improve your code by making it easier to write and maintain.
Today we are going to examine the LINQ set operations that are part of the IEnumerable<T> extension methods. Now, most of the time when people think of set operations they think of the math or logic classes they are usually taught in, but really these LINQ methods have a much larger appeal and applicability than just math exercises!
Set Operations
When we think of set operations in the realm of set logic, there are three primary ones that come to mind:
- Intersection:
- The intersection of two sets is a set that contains all the common elements of both sets.
- { A, B, C } ∩ { B, C, D } = { B, C }
- Union:
- The union of two sets is a set that contains all the unique elements of both sets.
- { A, B, C } ∪ { B, C, D } = { A, B, C, D }
- Difference:
- The difference of two sets is the first set minus all common elements with the second set.
- { A, B, C } – { B, C, D } = { A }
Notice that in set theory, sets are collections of unique elements with no duplicates. Thus the set union, as we saw, removes all duplicates between the two sets. This becomes important as we discuss how these set operations apply to collections in .NET.
I’ve blogged before on the often overlooked HashSet and SortedSet classes (here) in .NET, which are collections that implement sets. Though these are useful, sometimes you want to be able to apply these useful operations to other types of collections as well. This is where the LINQ set operation extension methods come in handy, they are implementations of these set operations that can be applied to any sequence that implements IEnumerable<T> including favorites like arrays, List<T>, iterators, etc.
A quick preface on equality
For the set operations to properly work on your sequences of a type, the type must have a valid notion of equality (both Equals() and GetHashCode() must be meaningful). Now, for nearly all the primitive types (int, double, char, etc) the string class, structs and some BCL reference types, equality is well defined and implemented and they will work fine as is.
However, custom classes you write can present a problem because the default implementation of equality most likely will not meet our needs. Just keep that in mind for now and we’ll come back to that in the end with examples of how this can bite you and how to mitigate it…
Intersect() – gathering the common elements
The Intersect() method gets the common elements from two different enumerable sequences, just like the set logic’s intersection operation dictates. Intersections are very useful in determining where two sets overlap (that is, what elements two sets have in common).
The two forms of Intersect(), assuming extension method syntax where the first argument is the target, are:
- Intersect(IEnumerable<TSource> second)
- Returns a sequence of elements that are common between the first sequence and the second sequence using the default equality comparer for TSource.
- Intersect(IEnumerable<TSource> second, IEqualityComparer<TSource> comparer)
- Returns a sequence of elements that are common between the first sequence and the second sequence using the definition of equality specified by the passed-in equality comparer.
Let’s play with string since it already has a good default equality comparer. So say we have two enumerable sequences of string:
1: // lets say we have a list of healthy stuff to consume
2: var healthyStuff = new List<string> { "fruits", "vegetables", "proteins", "simple carbs", "fiber" };
3:
4: // and we have a list of what i consume
5: var myStuff = new List<string> { "soda", "chips", "proteins", "fat", "sugar" };
So now that we have these two lists: one of healthy things we can consume, and one that is things i typically consume. We can perform an intersection on them and see what things I consume that are healthy by saying:
1: var results = myStuff.Intersect(healthyStuff);
2:
3: foreach (var item in results)
4: {
5: Console.WriteLine(item);
6: }
This will output the item “proteins” which is the only item common between what I eat and healthy stuff (I eat better than that, really, but just for illustrative purposes).
It should be noted that the Intersect of two sets follows the commutative property. This means that A ∩ B = B ∩ A. So in our example that would mean that the following two statements are logically identical:
1: // these two are identical
2: results = myStuff.Intersect(healthyStuff);
3: results = healthyStuff.Intersect(myStuff);
This makes sense because asking what healthy foods exist in the list that I eat is the same as asking what foods do I eat that exist in the healthy foods list.
Union() – combining the unique elements
The Union() method combines the unique elements from two different enumerable sequences, just like the set union operation dictates. Unions are very useful for combining two sets without duplicates. Thus if in our example we wanted to get a list of all the healthy foods and all foods I eat, we could union the two sets.
The two forms of Union(), assuming extension method syntax where the first argument is the target, are:
- Union(IEnumerable<TSource> second)
- Returns a sequence of unique elements from the first and second sequence combined using the default equality comparer for TSource.
- Union(IEnumerable<TSource> second, IEqualityComparer<TSource> comparer)
- Returns a sequence of unique elements from the first and second sequence combined using the definition of equality specified by the passed-in equality comparer.
By unique elements, I don’t mean to imply that only items with no duplicates are in the resulting set, but that the resulting set eliminates any duplicates. So that if you had { 1, 1, 3, 5 } ∪ { 1, 3, 7 } the result would be { 1, 3, 5, 7 }.
For example:
1: var results = myStuff.Union(healthyStuff);
2:
3: // this will output soda, chips, proteins, fat, sugar,
4: // fruits, vegetables, simple carbs, and fiber
5: foreach (var item in results)
6: {
7: Console.WriteLine(item);
8: }
Notice that the duplicate of protein is gone. Union() is also commutative, so A ∪ B = B ∪ A. However that said the ordering will be different since the elements from the first sequence appear first in the resulting sequence, followed by any elements in the result that came from the second sequence.
The nice thing about Union() is it gives you a nice and easy way to join together two sequences and eliminate duplicates. Note that this is very different from the Concat() extension method in LINQ that just concatenates one sequence to the end of the other, but this makes no attempt to remove duplicates.
1: // these are logically identical
2: results = myStuff.Union(healthyStuff);
3: results = myStuff.Concat(healthyStuff).Distinct();
Except() – subtracting the common elements
The Except() method performs a set difference between two sets. That is, A – B yields the items in A minus any items in B that happen to be in A. Any items that were unique to B alone are ignored. Thus, if we wanted to get a list of the food I eat that is NOT healthy food, I could do the set difference between what I eat and the healthy things to eat.
The two forms of Except(), assuming extension method syntax where the first argument is the target, are:
- Except(IEnumerable<TSource> second)
- Returns the unique items in the first set, minus any common items from the second sequence using the default equality comparer for TSource.
- Except(IEnumerable<TSource> second, IEqualityComparer<TSource> comparer)
- Returns the unique items in the first set, minus any common items from the second sequence using the definition of equality specified by the passed-in equality comparer.
Once again this is a simple set difference operation. { 1, 3, 5, 7 } – { 1, 5, 8 } = { 3, 7 }. The 1 and 5 are removed since they were in both the first and second set, and the 8 is removed since it didn’t exist in the first set. So the resulting sequence are only the unique items from the first set that are not also in the second set.
As you can probably tell, difference is not commutative because if you reverse the order of the sets in the difference you get two different things:
- { 1, 3, 5, 7 } – { 1, 5, 8 } = { 3, 7 }
- { 1, 5, 8 } – { 1, 3, 5, 7 } = { 8 }
This means that you have to be careful with Except() that you are subtracting the sets correctly. Once again if we look at the food example:
1: // this is a list of the things I eat that are not healthy
2: // soda, chips, fat, sugar
3: var results = myStuff.Except(healthyStuff);
4:
5: // this is a list of healthy things that I do not eat
6: // fruits, vegetables, simple carbs, fiber
7: results = healthyStuff.Except(myStuff);
So as you can see, Except() is a handy way to get a list of elements in a sequence that do not match the items from a second sequence.
A quick note on deferred execution…
Please note, that like many of the System.Linq extension methods, Except(), Union(), and Intersect() are deferred execution which means they will not be invoked until an enumerator is requested from them. Thus if you did something like this:
1: results = healthyStuff.Except(myStuff);
2:
3: // because results is an iterator (deferred execution) this clears
4: // myStuff before it is actually used, which alters our results
5: myStuff.Clear();
6:
7: foreach (var item in results)
8: {
9: Console.WriteLine(item);
10: }
The resulting set will be all of healthyStuff (instead of the difference) since the results variable holds an iterator to the resulting sequence, but that sequence will not be calculated until we begin enumerating through the results. Thus by calling Clear() on one of the sets before we actually use the results, we’ve altered the operands before the operator is actually applied. In this example, that means we try to subtract an empty set myStuff from the list of healthyStuff, which of course results in the full list.
To avoid this either iterate over the results immediately, or use extension methods like ToArray() or ToList() to immediately process the query and put the results in a collection.
A quick note on covariance in IEnumerable<T>
In IEnumerable<T> the type T is covariant, this means you can use the set operations to manipulate two sets of different types related through inheritance. For example, if you had class Employee and class SalariedEmployee which inherits from Employee, then you can perform set operations between the two sets and the resulting set type is the wider of the two types (that is, the higher up the inheritance chain – Employee in this case).
A quick note on mixed containers
Also notice that the only thing required for these set operations in System.Linq to work is that both sequences must implement IEnumerable<T>, this means they can be an array, a List<T>, a HashSet<T>, or an iterator from another query of type T (and so on). Essentially, this is just to say that you can intersect a HashSet<string> with a List<string> and so on, the only thing that is important is that their element types are the same (or covariant as stated above).
A final note on equality in complex classes
I hinted before that these operations will work exactly as you expect for primitives, strings, structs, and any reference types that correctly implement the concept of equality. And I hinted that custom classes you write may be in danger of not working. But why?
Well, you may think that the first problem is that with class the default concept of Equals() is a reference comparison. While this is true, it is only half the issue. Let’s say we define an Employee class and override Equals() on it:
1: public class Employee
2: {
3: public int Id { get; set; }
4: public string Name { get; set; }
5: public double Salary { get; set; }
6:
7: public override bool Equals(object obj)
8: {
9: var other = obj as Employee;
10:
11: return other == null
12: ? false
13: : Equals(Id, other.Id) &&
14: Equals(Name, other.Name) &&
15: Equals(Salary, other.Salary);
16: }
17: }
So now if we attempt to perform set operations on these two lists, what do we get?
1: var listOne = new []
2: {
3: new Employee { Id = 1, Name = "John Smith", Salary = 12342 },
4: new Employee { Id = 12, Name = "Lucy Doe", Salary = 99243 }
5: };
6:
7: var listTwo = new []
8: {
9: new Employee { Id = 2, Name = "Jane Doe", Salary = 3241 },
10: new Employee { Id = 1, Name = "John Smith", Salary = 12342 }
11: };
Now, if we try to do an intersection, we’d expect to see John Smith, right?
1: var results = listOne.Intersect(listTwo);
But we don’t, we get an empty list! Why? We can get a hint in that the second forms of Union(), Intersect(), and Except() that take an IEqualityComparer<TSource>. Why IEqualityComparer, why not just IComparer?
The answer is that IEqualityComparer requires both an Equals() and a GetHashCode() method to be defined. So this should be a good hint to us that we need to provide not only a meaningful Equals() overload but a GetHashCode() overload in our custom classes (or provide a separate custom IEqualityComparer of course). Remember that two items that are equal should always return the same hash code, but the opposite is not necessarily true. It’s okay for two non-equal items to have the same hash code, but it’s never okay for two equals items to not have equal hash codes. Typically this means that the GetHashCode() method should be based on the same fields used in the Equals() check (or a subset).
So, assuming we add an override for GetHashCode():
1: public class Employee
2: {
3: // ... all the other stuff from before
4:
5: public override int GetHashCode()
6: {
7: unchecked
8: {
9: // using the pretty standard method of primes and combinging field hash codes
10: int hash = 11;
11:
12: hash = hash * 31 + Id.GetHashCode();
13: hash = hash * 31 + Name != null ? Name.GetHashCode() : 0;
14: hash = hash * 31 + Salary.GetHashCode();
15:
16: return hash;
17: }
18: }
19: }
Now we get the expected result of John Smith! Many of the LINQ extension methods use the hash codes of the items in the sequences to quickly and efficiently work their way through the lists. We don’t have this issue with primitives and classes such as string which already override Equals() and GetHashCode() appropriately, and struct doesn’t have this issue because struct by default already does a member-wise Equals() and GetHashCode() construction.
Thus, another way we could have corrected this would be to make Employee a struct, though this has larger ramifications to consider and shouldn’t be done lightly (for more info on class vs struct and all the differences see here). So the general recommendation is to either provide a meaningful Equals() and GetHashCode(), or create a separate IEqualityComparer that will define these for your custom class.
Summary
System.Linq includes some great set operation extension methods that can be used to operate on two sequences as if they were sets. These can come in handy when combining sequences with no duplicates (Union()), seeing if two sequences have elements in common (Intersect()), or seeing what elements in a sequence are not part of another sequence (Except()).
While set operations are typically thought of as math operations, these can be applied to many computer science problems and should be considered when needing to check membership between two sequences of items.
Technorati Tags: C#, .NET, Little Wonders, LINQ, Set, Except, Union, Intersect