2006.01.23 01:41 AM

Hashing in .NET, Part I

While implementing a persistent linear hash table in .NET, based on the MaVerick open source multivalue database from a few years ago, I got pretty severely side-tracked investigating various string hashing algorithms and exploring .NET's take on hashing. Now, in an effort to feel better about all the time I wasted, I'm going to dump some of what I learned here.

Just to be clear, I'm not really qualified to evaluate the merits of different hashing algorithms, at least not from some rigorous theoretical, mathematical, or statistical perspective. I'm just fascinated by them, particularly as they relate to storing things in and later retrieving them from hash tables.

Dirichlet's Box Principle (aka the pigeonhole principle) is an endless source of distraction for me:

Given n boxes and m>n objects, at least one box must contain more than one object.

What I really enjoy is watching others, clearly smarter than me, figure out how to spread this inevitable multiple-occupancy condition out across as many of the available boxes as efficiently as possible.

The box principle affects hash tables in a couple of ways. First there's the conversion of a nearly infinite number of thing-identifying keys (m) into a finite number of well distributed values (n) using nothing more than a deterministic algorithm and the keys themselves. Then there's the subsequent storage of those values (m) and their associated things into buckets (n), which necessarily number fewer than the finite (but still generally large) number of all possible values, in a way that allows them to be accessed in something approaching O(1) time.

Historically, at least in my experience, these two tasks have been the combined responsibility of the hash table itself. To add something to a hash table, you usually provided two things: 1) the thing to store, and 2) the thing's unique key, generally in the form of a string (if the key was a number or date or whatever, you would first convert it to a suitable string). The hash table then did two things: 1) hash the key using some algorithm to generate a well distributed numeric value (a hash code) within some range, and 2) use this value and some modular arithmetic to establish (and, if necessary, create) the bucket in which to store (or from which to retrieve) the thing and its key. Owing to the box principle, consideration had to be given to the possibility of a collision, where different keys result in the same hash code (and therefore the same bucket), so storage and retrieval always included a comparison of the requested key with the key found in any matching hash code-based buckets to be sure the correct object was located. If the keys didn't match, the hash table would provide some sort of collision resolution (for instance linear or quadratic probing, or rehashing, also referred to as double hashing) in order to establish additional buckets for the colliding keys.

Like Java before it, .NET adopted a somewhat different approach. Specifically, it took the responsibility for turning keys into hash codes away from the hash table and gave it to, well, everything else. It did this by endowing every .NET type with a System.Int32 hash code accessed via the GetHashCode method, which is inherited by every type from the public virtual implementation in System.Object. Then, having guaranteed the availability of a hash code for every object, and similarly having guaranteed the availability of an Equals method for comparing the equality of every object, it implemented its System.Collections.Hashtable in a way that allows any type of object to be used as a key for objects added to the table. After all, at least in principle, if every object has a hash code and can be compared to every other object of the same type, then any object can be a hash table key.

In practice, though, what this means is that when adding objects to a Hashtable, you aren't just providing unique keys for your objects. You are also providing objects that provide hash codes and key object equality comparisons for your objects. And this leaves the .NET developer facing a number of potential issues.

The purpose of a hash table, as opposed to say an array, is to provide both sparse storage and fast key-indexed retrieval of things. In order to do this, a hash table requires that each thing be identified by a key that is unique relative to the set of all things that might be stored in the hash table. The relationship between a thing and its key is established when the thing is added to the hash table. That relationship is expressed only indirectly: the key maps to a bucket (eventually), and the bucket holds the thing and key (the latter in order to resolve collisions). Due to this indirection, it is generally possible to change the thing in a bucket mapped to by a particular key (in the same way that you might replace the contents of a particular array element). But, and this is the important part, the same key should always map to the same bucket and it should never be possible for two or more keys to map to the same (eventual) bucket (in the same way that you would always expect a particular number to point to the same element of an array, and likewise would never expect two numbers to point to the same element of an array).

By allowing any type of object to serve as a Hashtable key, .NET makes it our responsibility to ensure the latter is done correctly.

The objects we supply as keys to Hashtable are used in two ways:

  • their hash codes are used to establish buckets, and
  • their relative equality is used to ascertain whether key objects found in those buckets match.

Correctly supporting these Hashtable activities means that our key objects should provide:

  • well distributed, constant, equivalence-consistent hash codes, either from GetHashCode methods in the key objects themselves (whether overridden or from System.Object) or via custom IHashCodeProvider objects, and
  • type-safe, reflexive, symmetric, transitive, consistent, non-null, equivalence-based object comparisons, either from Equals methods in the key objects themselves (overridden or from System.Object) or via custom IComparer objects.

For anything but the simplest key object types (think strings), it's very easy to get this wrong. In part II, I'll try to illustrate some of the ways this can happen.


Comments



Post a Comment

 
  (optional)
  (no html)
 
   


TrackBack

TrackBack URL:  https://www.typepad.com/services/trackback/6a00d8341c7bd453ef00d8355ae0a669e2

Listed below are links to weblogs that reference Hashing in .NET, Part I: