2006.10.14 03:24 AM

UniqueStringList for .NET 2.0

Update 5/11/2007

Fixed an off-by-one bug in IndexOf, as kindly reported by Simon Hewitt in the comments. Also added a null to empty string check and pulled the unnecessary check for an indexed bucket value of 0 from the same routine. More importantly, though, I've posted a new generic version of this logic named UniqueList<T>. I recommend checking it out.


Among the many collection types of the .NET 2.0 Framework there is no type that satisfies the following set of requirements:

  • Stores only unique strings.
  • Preserves the order of string additions.
  • Provides ordinal and culture-sensitive string comparisons.
  • Provides case-sensitive and case-insensitive string comparisons.
  • Allows pre-sizing of the string collection for efficiency.
  • Provides indexed lookups.
  • Provides enumeration of strings in the order of their addition.
  • Provides quick retrieval of the string collection as an array of strings.
  • Performs at near O(1) time for additions and lookups.
  • Supports serialization.

Some come close, but ultimately they all come up a little short.

Here's a simple inheritance diagram showing the interfaces and relationships of the various contenders.

StringCollection
If names counted for something, this type would fit the bill. Unfortunately, names don't count. Internally, StringCollection is nothing but a thinly wrapped ArrayList, the wrapper's sole purpose being to enforce a String-based interface over the Object-based ArrayList. Being essentially an ArrayList, it does succeed in preserving the order of additions, and so provides indexed lookups and ordered enumerations. It also supports serialization and, thanks to the wrapper, is restricted to strings. However, being an ArrayList, it also means that ensuring the uniqueness of added strings, which must be done with a check of Contains() before each addition, will likely result in as many linear traversals of the entire collection as there are items in the collection with every item addition. In other words, each addition takes O(n) time - ouch. And if that's not reason enough to move on, StringCollection also doesn't allow pre-sizing and it only provides case-sensitive ordinal comparisons via the default String.Equals() method. Next.

Hashtable
Unlike StringCollection, this venerable dictionary type easily satisfies the O(1) performance requirement. After all, it's a hash table. It also supports various culture-aware and case-in/sensitive string comparisons (much improved in .NET 2.0), and even supports serialization. Unfortunately, because it doesn't retain knowledge of the order of item additions, it can't provide indexed lookups or addition-ordered enumerations - a significant requirement. Less significantly, there's also the issue of it being Object-based, instead of String-based, for both its keys and values. And, as far as that goes, its need for both keys and values (the hallmark of all dictionary types) makes it somewhat awkward to use and a little wasteful of space when, as in this case, there are only keys, no values.

Dictionary<String, whatever>
This new generic dictionary type behaves essentially like Hashtable, though it does perform a little better due to improved collision resolution through chaining. It solves Hashtable's less significant issue, which is that it can be made to restrict keys to strings. However, as with Hashtable, its storage of values, in addition to keys, remains unnecessary overhead in this case. And it suffers from the same fatal shortcoming as Hashtable, which is that it retains no knowledge of the order of additions.

NameValueCollection
This type probably comes closest to satisfying all of the requirements. Internally, it combines an ArrayList with a Hashtable to provide the best of both worlds: ordered, indexed key storage and near O(1) addition and lookup times (for some reason the documentation incorrectly describes the Item indexer property as exhibiting O(n) time). It's restricted to strings, supports pre-sizing, culture-aware and case-in/sensitive string comparisons, quick key retrieval, and serialization. And its support for adding multiple values to each key means it isn't necessary to explicitly check whether a key already exists before adding it (in fact, you can't, as NameValueCollection doesn't implement IList or IDictionary, only ICollection, and so doesn't expose a Contains() method). Unfortunately, this type still comes up short.

First, as indicated by its name, and like Hashtable and Dictionary<String, whatever>, NameValueCollection stores both names (keys) and values, the latter serving no purpose in this case except to waste space and to make string additions more awkward than necessary. In fact, owing to its support for storing multiple values per key, NameValueCollection actually wastes more space than either Hashtable or Dictionary<String, whatever>. Secondly, NameValueCollection isn't a great performer. While the documentation claims it provides near O(1) addition times, its internal resizing operations are pitifully slow. Even when amortized over lots of additions (as is standard when measuring the performance of hash tables, save linear hash), the slow resizes measurably reduce the speed of additions as the number of items grows into the thousands. Finally, the NameValueCollection's CopyTo() method doesn't return an array of keys, only values. The collection's keys are addressed via its base type's Keys property, which returns a NameObjectCollectionBase.KeysCollection object, which unfortunately hides the implementation of CopyTo() from its ICollection interface. However, the keys can be enumerated to fill an array, though in O(n) time. Alternatively, one might consider using the Add() method to write strings as both keys and values, making the NameValueCollection's CopyTo() useful again. Unfortunately, this would require enumeration and examination of every key's values during addition to avoid duplicates and additional wasted space.

ListDictionary
This singly linked list type provides painfully slow O(n) addition and CopyTo() times, and so is intended for use with only very few items. On the up-side, it does preserve the order of addition.

HybridDictionary
This odd type uses both ListDictionary and Hashtable, the former for small sets and the latter for larger sets, in an effort to improve performance. Unfortunately, the order of additions is lost when the switch is made to Hashtable.

StringDictionary
This type is to Hashtable what StringCollection is to ArrayList, basically a thin String-based wrapper. Besides failing to maintain the order of additions, StringDictionary also prevents use of Hashtable's culture-aware and case-in/sensitive comparison options, choosing instead to force all keys to lowercase using the InvariantCulture.

Microsoft.VisualBasic.Collection
If you don't mind adding a reference to the Microsoft.VisualBasic assembly, then the Collection type, like NameValueCollection, comes close to meeting the requirements. Internally, it combines a custom linked list with a strongly typed Dictionary<String, Collection.Node> to offer ordered, indexed key storage and near O(1) addition and lookup times. Unfortunately, it still falls short of the mark.

The first issue is with Collection's key comparisons: they are always case-insensitive and culture-aware (using CurrentCulture). This creates an extreme performance disadvantage, particularly in those cases where a case-sensitive ordinal comparison would suffice. Also, its use of a reference type for storage of nodes (Collection.Node), combined with the repetition of each node in its Dictionary<String, Collection.Node> (Dictionary.Entry), and the lack of direct access to its keys, thus forcing strings to be stored as both keys and values, result in a lot of wasted space and unnecessary object instances. This negatively impacts performance and memory utilization. Finally, Collection provides no CopyTo() support, hiding the implementation from its ICollection interface, making the copying of strings to an array an O(n) operation.


The UniqueStringList class below overcomes these issues and satisfies all of the requirements.

using System;
using System.Collections;
using System.Collections.Generic;
using System.Runtime.Serialization;
using System.Text;

namespace Ewbi {

  /*
    UniqueStringList
    
    Stores unique strings in one Add step.  Uses an open addressing hash table with prime modulo 
    bucket determination and linear probing for collision resolution.  Uses any IEqualityComparer<string> 
    based type for hash codes and equality checks (case-sensitive System.Ordinal by default).  Strings 
    kept in an array to preserve order for enumeration and indexed lookups.  Sensitive to clustering, 
    needs good hash distribution and load factor management.  Supports serialization.  Not thread safe.
  */

  [Serializable]
  public sealed class UniqueStringList : IList<string> {
  
    private const double loadFactor = .58;

    private IEqualityComparer<string> comparer;
    private List<string> items;
    private List<int> counts;
    private int capacity;
    
    [NonSerialized()]
    private int[] buckets;
    [NonSerialized()]
    private int maxLoad;

    [OnDeserialized()]
    private void OnDeserialized(StreamingContext context) {
      buckets = new int[capacity];
      ReindexItems();
    }

    public UniqueStringList() : this(0) {}
    
    public UniqueStringList(int capacity) : this(capacity, null) {}

    public UniqueStringList(IEqualityComparer<string> comparer) : this(0, comparer) { }

    public UniqueStringList(int capacity, IEqualityComparer<string> comparer) {
      if (capacity < 0) throw new ArgumentOutOfRangeException("capacity");
      this.comparer = comparer;
      if (this.comparer == null) this.comparer = (IEqualityComparer<string>) StringComparer.Ordinal;
      Initialize(capacity);
    }
    
    public int IndexOf(string item) {
      if (item == null) item = "";
      int index = GetIndex(item);
      return buckets[index] - 1;
    }

    public void Insert(int index, string item) {
      throw new NotSupportedException();
    }

    public void RemoveAt(int index) {
      throw new NotSupportedException();
    }

    public bool Remove(string item) {
      throw new NotSupportedException();
    }

    public string this[int index] {
      get {
        return items[index];
      }
      set {
        throw new NotSupportedException();
      }
    }
    
    public void Add(string item) {
      if (item == null) item = "";
      int index = GetIndex(item);
      if (buckets[index] == 0) {
        items.Add(item);
        counts.Add(1);
        buckets[index] = items.Count;
        if (items.Count > maxLoad) Expand();
      } else {
        counts[buckets[index] - 1]++;
      }
    }

    public void Clear() {
      Initialize(0);
    }

    public bool Contains(string item) {
      return (IndexOf(item) != -1);
    }

    public void CopyTo(string[] array, int arrayIndex) {
      items.CopyTo(array, arrayIndex);
    }
    
    public void CopyCountsTo(int[] array, int arrayIndex) {
      counts.CopyTo(array, arrayIndex);
    }

    public int Count {
      get { return items.Count; }
    }

    public bool IsReadOnly {
      get { return false; }
    }

    IEnumerator IEnumerable.GetEnumerator() {
      return GetEnumerator();
    }

    public IEnumerator<string> GetEnumerator() {
      return items.GetEnumerator();
    }
    
    public int GetItemCount(string item) {
      if (item == null) item = "";
      int index = GetIndex(item);
      if (buckets[index] == 0) return 0;
      return GetItemCount(buckets[index] - 1);
    }
    
    public int GetItemCount(int index) {
      return counts[index];
    }

    private static int GetPrime(int min) {
      for (int i = min | 1; i < 0x7fffffff; i += 2) {
        if (IsPrime(i)) return i;
      }
      return min;
    }

    private static bool IsPrime(int candidate) {
      if ((candidate & 1) == 0) return (candidate == 2);
      int halfway = (int) Math.Sqrt((double) candidate);
      for (int i = 3; i <= halfway; i += 2) {
        if ((candidate % i) == 0) return false;
      }
      return true;
    }
    
    private void Initialize(int size) {
      capacity = GetPrime(size);
      items = new List<string>(capacity);
      counts = new List<int>(capacity);
      buckets = new int[capacity];
      maxLoad = (int) (capacity * loadFactor);
    }

    private void Expand() {    
      capacity = GetPrime(capacity * 2);
      buckets = new int[capacity];
      ReindexItems();
    }
    
    private void ReindexItems() {
      maxLoad = (int) (capacity * loadFactor);
      for (int itemIndex = 0; itemIndex < items.Count; itemIndex++) {
        int index = GetIndex(items[itemIndex]);
        buckets[index] = itemIndex + 1;
      }      
    }
    
    private int GetIndex(string item) {
      int hashCode = comparer.GetHashCode(item) & 0x7fffffff;
      int index = hashCode % capacity;
      int increment = (index > 1) ? index : 1;
      int i = capacity;
      while (0 < i--) {
        int itemIndex = buckets[index];
        if (itemIndex == 0) return index;  // Empty bucket.
        if (comparer.Equals(items[itemIndex - 1], item)) return index;  // Bucket item match.
        index = (index + increment) % capacity;  // Probe.
      }
      throw new InvalidOperationException("Failed to locate a bucket for item.");
    }

  }
  
}

There were some decisions made while writing this class that are worth pointing out.

  • As its name implies, UniqueStringList is a list collection, not a dictionary (there are no key-value pairs). This meant IList<String> was the logical interface to implement. However, some of its index-based members, such as Insert() and Remove(), were not appropriate here, and so were not implemented.
  • I choose simple probing for collision detection because it was easier to implement and performed as well as chaining, as long as the IEqualityComparer<String> hash function used provides reasonably good distribution.
  • Likewise, the 58% load factor was chosen after extensive testing to balance memory waste with maximum efficiency. The tests involved strings ranging from 4 to 40 characters long. Depending on your string lengths and hash function, a different load factor might be warranted.
  • Though it's not listed as a requirement, occurrence counts are tracked during string item additions. Instead of combining the string items and counts in a simple value type in one array, the counts were kept in a separate array. This was done to preserve the O(1) time of the CopyTo(), as the counts are only a secondary feature. I agree that the GetItemCount() and CopyCountsTo() methods are ugly, but oh well.
  • It isn't necessary to check whether a string item exists via Contains() before trying to add it. Just add them. If they already exist, their occurrence counts will be incremented. Be aware, though, that only the first instance of any added string will be retained. So, for instance, when using a case-insensitive IEqualityComparer<String>, such as StringComparer.OrdinalIgnoreCase, adding "A" and then "a" will result in only "A" being retained (and, thus, only "A" will be returned via indexed lookups and during enumerations).

Some quick performance tests against some of the framework's collection types show UniqueStringList performing pretty well.

Here's the legend used by each of following graphs.


Total Add() Time: StringComparer.Ordinal
This is a measure of the time (in milliseconds) required to add the specified number of string items using the case-sensitive StringComparer.Ordinal IEqualityComparer<String> object. Note how StringCollection's O(n) time quickly drifts into space. Likewise, NameValueCollection's costly resizes send it soaring at 5,000 items.

Total Add() Time: StringComparer.Ordinal, Pre-sized
This is a measure of the time (in milliseconds) required to add the specified number of string items to collections that are pre-sized to the number of items being added, eliminating the cost of resizes, using the case-sensitive StringComparer.Ordinal IEqualityComparer<String> object. StringCollection isn't included because it doesn't support pre-sizing. Note the improvement in NameValueCollection's performance, though it remains the least efficient. Also, Dictionary<String, whatever> catches UniqueStringList when pre-sized.

Total Add() Time: StringComparer.OrdinalIgnoreCase
This is a measure of the time (in milliseconds) required to add the specified number of string items using the case-insensitive StringComparer.OrdinalIgnoreCase IEqualityComparer<String> object. StringCollection isn't included because it only supports case-sensitive ordinal comparisions. The times correspond well with the first StringComparer.Ordinal test. Basically, the insensitivity of the string comparisons makes everything slower. Culture-aware comparisons take even longer. Note that when ignoring case, the 50,000 item string set actually produced only 43,285 unique strings.

Average Add() Time: StringComparer.Ordinal
This is a measure of the average time (in milliseconds) required to add each string item using the case-sensitive StringComparer.Ordinal IEqualityComparer<String> object. The times include an amortization of the collection resizes. Note that it doesn't take long for StringCollection to disappear.

Average Add() Time: StringComparer.Ordinal, Pre-sized
This is a measure of the average time (in milliseconds) required to add each string item to collections that are pre-sized to the number of items added, using the case-sensitive StringComparer.Ordinal IEqualityComparer<String> object. This eliminates the resizes, isolating the actual cost of adding each item. StringCollection isn't included because it doesn't support pre-sizing. Note how little time the resizes, which are nasty O(n) operations, actually impact the average times.

Allocated Bytes
This is a measure of the total bytes consumed (in millions of bytes) by each collection following the addition of the specified number of string items. In its native Unicode form, the 50,000 item string set totaled 937,606 bytes. As expected, StringCollection is the most efficient consumer of memory, being just an ArrayList. The waste in NameValueCollection's storage is immense, requiring nearly seven times as much space as its content. By comparison, UniqueStringList performs pretty well.


Feel free to use this anyway you want. Or not. As with everything else posted here, there are no warranties implied and no liability assumed. If you use this code you assume all the risk. That being said, I apologize for any bugs. If you find any, I'd appreciate you letting me know.


Comments

Some musings by Wesner Moise back in February 2004 on weaknesses in the framework's collection classes (old but still a good read):
http://wesnerm.blogs.com/net_undocumented/2004/02/whats_missing_i.html

Justin Rogers also from February 2004 talking about StringCollection vs. just using ArrayList:
http://weblogs.asp.net/justin_rogers/archive/2004/02/29/81680.aspx

Google's SparseHash project page:
http://goog-sparsehash.sourceforge.net/

Wikipedia article discussing open addressing versus chaining in hash tables:
http://en.wikipedia.org/wiki/Hash_table#Open_addressing_versus_chaining

In case you're interested, here's some info using the Windows APIs QueryPerformanceCounter and QueryPerformanceFrequency to time managed code:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnpag/html/scalenethowto09.asp

ewbi.develops | 2006.10.14 04:29 AM

I believe the code segment above is actually
(C) ******* M. *****, Ltd.?

Keith | 2006.10.21 03:37 PM

You wish, you numpty. :)

Btw, I never tie posts back to my clients by name - too embarrassing (sometimes for me, sometimes for them), so I censored your comment.

The project that actually led to me writing this code was for another client (unfortunately, I didn't get around to writing this class until I'd already made due with something else in their project - never enough time when on the clock). Anyhow, I just got started with you guys again last week after a multi-month hiatus. With luck, though, I'll get a chance to use this in your stuff real soon now.

Thanks again for the dinner and drinks!

ewbi.develops | 2006.10.21 06:22 PM

Hi

Love this code!

Is there an off-by-one buglette in IndexOf?

Cheers
Simon

Simon Hewitt | 2006.11.27 11:54 PM

Simon,

Sorry for the slow reply, it's been a crazy December. You are correct, there is an off-by-one issue in IndexOf. And, having fixed that, there was also no reason to first check whether the buckets[index] value is 0. Thanks so much for catching and reporting this! Btw, I'm about to post a new generic version of the same code good for use with any type, not just strings. I'll update this code when I get the new post up. Thanks again!

ewbi.develops | 2007.01.02 05:19 PM

Oh you are so right about the not retaining index information on the .NET collections...!

Owen | 2007.03.03 01:37 PM


TrackBack

TrackBack URL:  http://www.typepad.com/services/trackback/6a00d8341c7bd453ef00d8342c5df453ef

Listed below are links to weblogs that reference UniqueStringList for .NET 2.0:

» UniqueStringList for .NET 2.0 from Developing For .NET
Ive been a fan of custom Enums since I started coding for .Net: like most software packages, we have tons of selection lists. Many of these are table driven, code-based lists like so: Code : Description 10 : Asparagus 12 : Carrot 73 : Tomato et... [Read More]

Tracked on Jan 5, 2007 1:56:05 PM