Evaluation of Hash Functions

In [uzgalis91] hash functions were evaluated using the method described by McKenzie, but a much stronger key test set was created for testing the functions. Unordered sets which contain the consecutive binary integers are the most closely packed possible keys. A compact set of this form contain all the possible 1-, 2-, 3-, ..., n-bit differences between keys. Consecutive integer keys distribute evenly using the modulo function; however it is not obvious how to transform them into a uniform random distribution. Because of the closeness of keys (they can be no closer without being the same) an unordered consecutive binary integer set provides a difficult key set for hash functions. A clean hash function must introduce enough noise to distribute consecutive binary integer hashes in such a way they are indistinguishable from a uniform random distribution.

This consecutive binary key set testing procedure was reapplied to the same hash functions that McKinzie, et al, evaluated in 1989. This experiment showed that all previously tested hash functions did not perform well, including cbuhash, McKinzie's choice for a good hash function. However both pkphash and buzhash performed well. The buzhash function had the advantage of being developed with the aid of the test.

The binary key test provides an efficient measure of how well a hash function performs. This test is statistical. But, it provides a reasonably accurate empirical test for clean hashing, and it gives a measure of the success comparable to those used for pseudo-random number generators.

Two criteria weed out most candidates for a satisfactory Java language library hash function. In a language library function nothing can be known about the key set. Assume that the library function must at least hash 2**20 (about a million) compact binary keys, as a mark of minimum reasonable coverage for the hash function. As requirements the following criteria can be justified by arguments of reasonableness and coverage necessary to be useful. They are certainly not the only requirements that could be imposed, but these two criteria are sufficient too weed out all but a few hash functions, others fall far short of sufficient mixing or coverage to be useful.

Criteria for a Library Hash Function

The first criterion allows one to take some of the magic out of hash function creation. The second one covers the engineering aspects of most programs so that most programmers do not need to worry about the hash function. These criteria allow a library hash function to be used with reasonable confidence that it will in fact work acceptably.

The current situation is that there are only a few known functions that run fast enough and approximate a library hash function well. Among them are: pkphash and buzhash. Perhaps less desirable on performance grounds are rndhash and the cryptographic hash functions.

All the hash functions expressed earlier in SAL work on keys that are ASCII strings. However most of these string parameters are really an alias for bytes of data, independent of type. Thus having a single algorithm for computing a library hash function is efficient because one only needs one hash function for all types... one that ignores the type of the argument. In strongly typed languages, like Java, this can be done only at the lowest language level.

To ignore type works for hash functions because they expand and scramble the bits in the key and then reduce the resulting large noise to a smaller noise, usually by a modulo reduction. The actions of the hash function don't depend on the meaning of the bits in the key. They can operate in general because it is reasonably easy to expand and scramble bits independent of any particular type.

It is not easy to see how to collapse the large noise created by long keys. The most obvious way to collapse this information is to perform a modulus operation. This will work sometimes for relatively short segments of noise, but it has a few inherent problems. First, multi-precision arithmetic is slow. Second, taking the modulus for a power of two never produces the desired result because it just strips off the upper bits. Both pkphash and buzhash create will create many bits of noise for each 8-bits in the key and then fold it or build 64-bits. In doing so they stay consistent with the first criteria for a library hash function by conceptually different but efficient methods.