In [uzgalis91] hash functions were evaluated using the method described by McKenzie, but a much stronger key test set was created for testing the functions. Unordered sets which contain the consecutive binary integers are the most closely packed possible keys. A compact set of this form contain all the possible 1-, 2-, 3-, ..., n-bit differences between keys. Consecutive integer keys distribute evenly using the modulo function; however it is not obvious how to transform them into a uniform random distribution. Because of the closeness of keys (they can be no closer without being the same) an unordered consecutive binary integer set provides a difficult key set for hash functions. A clean hash function must introduce enough noise to distribute consecutive binary integer hashes in such a way they are indistinguishable from a uniform random distribution.

This consecutive binary key set testing procedure was
reapplied to the same hash functions that McKinzie, *et al*, evaluated in 1989. This experiment showed
that all previously tested hash functions did not perform well,
including *cbuhash*, McKinzie's choice for a good hash
function. However both *pkphash* and *buzhash* performed well. The *buzhash* function had the advantage of being
developed with the aid of the test.

The binary key test provides an efficient measure of how well a hash function performs. This test is statistical. But, it provides a reasonably accurate empirical test for clean hashing, and it gives a measure of the success comparable to those used for pseudo-random number generators.

Two criteria weed out most candidates for a satisfactory Java
language library hash function. In a language library function
nothing can be known about the key set. Assume that the library
function must at least hash 2**20 (about a million) compact
binary keys, as a mark of minimum reasonable coverage for the
hash function. As requirements the following criteria can be
justified by arguments of reasonableness and coverage necessary
to be useful. They are certainly not the only requirements that
could be imposed, but these two criteria are sufficient too weed
out all but a few hash functions, others fall far short of
sufficient mixing or coverage to be useful.

Criteria for a Library Hash Function |

The first criterion allows one to take some of the magic out of hash function creation. The second one covers the engineering aspects of most programs so that most programmers do not need to worry about the hash function. These criteria allow a library hash function to be used with reasonable confidence that it will in fact work acceptably.

The current situation is that there are only a few known
functions that run fast enough and approximate a library hash
function well. Among them are: *pkphash* and *buzhash*. Perhaps less desirable on performance
grounds are *rndhash* and the cryptographic hash
functions.

All the hash functions expressed earlier in SAL work on keys that are ASCII strings. However most of these string parameters are really an alias for bytes of data, independent of type. Thus having a single algorithm for computing a library hash function is efficient because one only needs one hash function for all types... one that ignores the type of the argument. In strongly typed languages, like Java, this can be done only at the lowest language level.

To ignore type works for hash functions because they expand and scramble the bits in the key and then reduce the resulting large noise to a smaller noise, usually by a modulo reduction. The actions of the hash function don't depend on the meaning of the bits in the key. They can operate in general because it is reasonably easy to expand and scramble bits independent of any particular type.

It is not easy to see how to collapse the large noise created
by long keys. The most obvious way to collapse this information
is to perform a modulus operation. This will work sometimes for
relatively short segments of noise, but it has a few inherent
problems. First, multi-precision arithmetic is slow. Second,
taking the modulus for a power of two never produces the desired
result because it just strips off the upper bits. Both *pkphash* and *buzhash* create will create many bits of noise for
each 8-bits in the key and then fold it or build 64-bits. In
doing so they stay consistent with the first criteria for a
library hash function by conceptually different but efficient
methods.

Copyright © 1996 Robert Uzgalis,
All Rights Reserved.

Contact: Robert Uzgalis <buz@zis.com>

home

This paper was automatically translated from troff into html by
t2h.