2007.05.11 11:56 AM

UniqueList<T> for .NET 2.0

In a fine example of missing the forest for the trees, I was so focused on capturing unique strings in my earlier UniqueStringList post that it never dawned on me how simple it would be to write the same thing as a generic collection and capture unique sets of any value or reference type. Luckily, a sharp-eyed reader named Simon Hewitt (more on him later) reported a small bug ("buglette") in my original code (see the comments there for details), giving me an excuse to look at it again, and while doing so I realized that my first pass was needlessly narrow. Thus was born UniqueList<T>. UniqueList<T> retains the core hash table logic and basic occurrence counting behavior of UniqueStringList, but does so for any reference or value type, not just strings.

While investigating IList<T>, which UniqueList<T> implements, as opposed to UniqueStringList's use of IList, I noticed that the generic .NET collection interfaces distribute their members a little differently than the non-generic collection interfaces. The division of collection-specific functionality into dictionaries and lists continues, as does the stacking of interface members from IEnumerable through ICollection, IDictionary, and IList. However, the generic ICollection<T> interface now takes over the IsReadOnly property, along with the Add, Clear, Contains, and Remove methods, whereas in the non-generic ICollection interface these are defined by the IDictionary and IList interfaces. This difference didn't require any changes on my end (I still feel IList<T> is the proper interface for this class), but I did find it interesting. Here's a class diagram that illustrates the difference (click to enlarge):

The earlier UniqueStringList post includes a rough diagram of the .NET collection classes derived from the non-generic interfaces and an explanation for my use of IList.

I took a clue from the new generic List<T> class, which has replaced the venerable ArrayList for many of us, and added some additional helper methods to UniqueList<T>, including AddRange and ToArray, and a couple of additional CopyTo overloads. The addition of AddRange made it a no-brainer to add a new constructor overload that takes an IEnumerable<T> with which to populate the new UniqueList<T> instance (List<T> provides a similar constructor). For consistency, I also added CopyTo- and ToArray-like methods for retrieving the unique item counts (see CopyItemCountsTo and ItemCountsToArray), although I can't imagine these ever being useful.

A more useful change I think was the addition of a method that returns an IEnumerable<UniqueList<T>.ListItem> with which to enumerate the list's items and counts in one step, similar to the KeyValuePair<TKey, TValue> enumerator returned by Dictionary<TKey, TValue>. So, in addition to doing this:

UniqueList<string> strings = new UniqueList<string>(new string[] { "a", "b", "c", "a" });
      
foreach (string s in strings) {
  Console.WriteLine("{0}: {1}", s, strings.GetItemCount(s));
}

You can now also do this:

UniqueList<string> strings = new UniqueList<string>(new string[] { "a", "b", "c", "a" });

foreach (UniqueList<string>.ListItem sc in strings.GetItemsAndCounts()) {
  Console.WriteLine("{0}: {1}", sc.Value, sc.Count);
}

The new inner ListItem struct that this relies on exposes a Value property of type T, as defined by the outer class, and an int Count property. By returning an IEnumerable<T> and using the new iterator features of C# 2.0, GetItemsAndCounts is a fairly fast operation. I considered adding another method to return an array of all the items and counts combined as ListItems, but it would require an O(n) operation just to assemble the array, so I didn't bother. I also considered making the counts first class properties of the list's items by storing them as ListItems internally. However, as I described in the earlier UniqueStringList post, this would make retrieval of just the list's items via CopyTo and ToArray a slower O(n) operation than .NET's List<T>.CopyTo and would have added overhead to everything and (perhaps) made serialization a little trickier. So, for now, I continue to treat the occurrence-counting aspects of UniqueList<T> as secondary features, even though this means the count retrieval methods must be separate and, honestly, kind of ugly.

Speaking of serialization, it's still supported. However, UniqueList<T> doesn't check that its generic type parameter and the IEqualityComparer<T> optionally passed to it during construction are both serializable. If they aren't and you try to serialize the UniqueList<T> instance the runtime will raise an error. If you'd like to implement your own serializable constraints (note that it can't be done with a simple generic type parameter constraint, because the serializable nature of a type is established with an attribute), check out the techniques described in Juval Lowy's MSDN article from October 2004 Format Your Way to Success with the .NET Framework Versions 1.1 and 2.0.

Here's another interesting bit of generics behavior I discovered while writing UniqueList<T>. In the code you'll see that there's a GetItemCount method that takes a parameter of the generic type (T) to return an occurrence count for a given list item, and a separate GetItemCountByIndex method that takes an int to return an occurrence count for a list item based on its position in the list. I had originally implemented these as two GetItemCount methods having these same (overloaded) parameters, but this proved problematic when UniqueList<T> was initialized with a generic type parameter of int. The problem was that UniqueList<int> essentially had two methods with the same signature, and as far as I could tell there was no way to distinguish between them. In my testing, the non-generic method always got called. So, to avoid this, I renamed the index-based method. I'm a little surprised that the compiler doesn't at least raise a warning when it sees that more than one method differ only by the type of one parameter, one of which relies on the generic type parameter. Maybe this behavior is defined somewhere, but I can't find it.

UniqueList<string> behaves slightly different than UniqueStringList when faced with a null value. In UniqueStringList a null is treated like an empty string (""). I could have replicated this in UniqueList<T> by checking whether the generic type parameter was a string, but I figured I would avoid the overhead and just treat all nulls the same, which is to say they're ignored.

Implementing this null-ignoring logic meant checking the value parameters of many methods for nulls, but I was worried this would be problematic when the generic type parameter used to initialize UniqueList<T> was a value type. Turns out this isn't an issue. As Wintellect co-founder Jeffrey Richter explains in his CLR via C#, Second Edition, comparing a generic type variable to null using the == or != operators is legal regardless of whether the generic type is constrained. He explains that when the JIT compiler sees an if statement comparing a value typed variable with null that "...the JIT compiler sees that the if statement can never be true, and the JIT compiler will not emit the native code for the if test or the code in the braces." That's good news, but based on my testing I think it's only half right. I think what actually happens is that the JIT compiler simply replaces the comparison itself with false. Then, as is normal for the JIT compiler, if this constant false prevents the execution of some branch of code, it simply leaves the code out. I say this because the value type-to-null comparison can be used in ways other than in if statements (or even in odd ways in if statements) that don't result in the elimination of code. Here are some examples that work just fine:

public void test(T item) {
  bool b = (item == null);
  Console.WriteLine(b);  // prints "false" if T is value type
  test2(item == null);
  if ((item == null) == false) {
    Console.WriteLine(false);  // prints "false" if T is value type
  }
}

private void test2(bool test) {
  Console.WriteLine(test);  // prints "false" if T is value type
}

Speaking of Wintellect, I discovered after writing UniqueList<T> that they host a community project devoted to developing a Power Collections library of generic container classes. Wish I'd seen this before, as there's a lot of good stuff there. However, after looking over all of their collections, I didn't find one that offers all of the advantages of UniqueList<T>. Their Set<ElementType> comes close, in that it provides unique key storage without a separate value type, but it doesn't preserve the order of addition, nor does it provide occurrence counts. What I really like about their collections, though, are the set operation methods: Difference, Intersection, IsDisjointFrom, IsProperSubsetOf, IsSubsetOf, SymmetricDifference, and Union. I'll have to look into adding similar methods to UniqueList<T>.

Another thing I recently discovered is that there is already a .NET-based UniqueStringList on Code Project from November 2002. I found it while using Google to find my own post one day. I haven't looked at the code yet, but the list of features described in the article are very similar to those for my UniqueStringList, including optional preservation of addition order (via the Insert method), support for serialization, and the ignoring of nulls and duplicates (no raised errors). It includes some features not found in my implementation, such as special empty string handling and synchronization, and doesn't include others, like occurrence counting. There's a good chance that if I'd seen this before starting my own UniqueStringList implementation last year, we wouldn't be here today. The really interesting thing, though, is that the article and code were written by Simon Hewitt, the same considerate reader who noted the "buglette" in my UniqueStringList implementation. So, to Mr. Hewitt: Thanks again for pointing out my bug, and I'm terribly sorry I inadvertently hijacked your project's name. I promise to do a search next time before conjuring up a name for my own project.

Finally, a quick word regarding performance. I didn't do nearly as much performance testing of UniqueList<T> as I did for UniqueStringList, and I didn't capture any of the results for graphing this time, but I did do enough to see that it performs as well as UniqueStringList. The slight difference was just barely measurable.

Unlike with UniqueStringList, I'm not going to reproduce the code for UniqueList<T> in this post. If you'd like a copy, just grab it here. Note that I added a disclaimer and MIT-style attribution license to the code this time. This is mostly for liability reasons, because as Mr. Hewitt made clear with UniqueStringList, my code isn't perfect. Use it any way you want, but know that it hasn't been used in a production environment and hasn't been covered completely with tests. It might lose an item, miss a count, burn your coffee, or run away with your girlfriend. Like the sidebar says: Proceed at your own risk.


Comments

Performance could be better. Dictionary is faster.

Nobody | 2007.08.23 10:35 AM

Thanks, Nobody. Your insight is breathtaking, your analysis unimpeachable.

ewbi.develops | 2007.08.23 10:41 AM

I like this implementation.
I stumbled on it via Google, when I searched for a StringDictionary or generic dictionary which does not automatically convert the key to lower case like the .NET dictionary does.
Do you know of any such implementation?

regards, Robert

bobfox | 2008.04.10 02:58 PM

Hi bobfox,

No time to look up all the details right now, but I don't believe the generic Dictionary converts keys to lowercase. The key comparison logic is driven by the IEqualityComparer object, an instance of which you can supply in the constructor. If your keys are strings, then you'll likely provide an IEqualityComparer derived from StringComparer, which are easily accessed using StringComparer's static (shared) properties: CurrentCulture, CurrentCultureIgnoreCase, Ordinal, OrdinalIgnoreCase, etc. These return singleton instances of specific StringComparer-derived objects like OrdinalComparer, which when constructed with a True for ignoreCase, relies on special case-based string comparisons in System.String's EqualsHelper. The only base collection I can think of that does a conversion of keys to lowercase is StringDictionary, which is just a badly wrapped HashTable.

Or maybe I've misunderstood your question? Let me know.

ewbi.develops | 2008.04.10 03:26 PM

Pls, could you provide the code for the methods Remove and RemoveAt, I want to recommend your class in my blog, but it's important to have these methods implemented.

Miguel | 2008.06.06 12:29 AM

Sorry for my insistence, do you have the code for Remove and RemoveAt? I wanted to recommend the class but it's a big inconvenience not to have methods to delete.

Miguel | 2008.06.23 03:09 AM

Hi Miguel,

So sorry to make you wait for a reply. For some reason I didn't get notified by TypePad after your first comment.

I'm afraid I don't have Remove/RemoveAt implementations for this, as it was never intended to be a general-purpose in-and-out collection. Instead, it was meant to be a container for quickly collecting and enumerating unique objects. The container is very lightweight and grows efficiently now, but these characteristics would change with the addition of any reasonable deletion and contraction support. I suppose one might add an inefficient brute-force deletion/compactor, or maybe just a full reindexer (skipping the deleted object), but I'm afraid it can't be me, as I don't have any need for it and I'm too busy now to give it any attention anyway.

If someone needs a more general purpose hashed collection, including deletion support, my recommendation would be the generic Dictionary. Looking back at my original UniqueStringList post you'll find some performance graphs that show it performing almost as quickly as UniqueStringList (and, therefore, UniqueList).

Good luck!

ewbi.develops | 2008.06.23 09:08 AM



Post a Comment

 
  (optional)
  (no html)
 
   


TrackBack

TrackBack URL:  http://www.typepad.com/services/trackback/6a00d8341c7bd453ef00d83548e04069e2

Listed below are links to weblogs that reference UniqueList<T> for .NET 2.0: