1. Introduction

This paper evaluates the hashing services provided by Java and offers practical suggestions for improving them. It begins with a quick summary of the use of hashing and the background of recent hashing research. Then it introduces Java and its implementation and concentrates on practical improvements that could be made to the way it provides hashing.

Hashing is a refinement of the simple idea that searching can be made more efficient if one can quickly narrow one's search by locating a subset of items to search. Each item is characterized by a key, a part of the item which uniquely identifies it; such as an employee-number. Hashing creates a subset of items to search by computing from the key a characteristic value, called the hash, and then groups items with the same hash; the group is called a bin.

There are several methods for grouping items; the most common makes a linked list of the items with the same hash. To retrieve an item the program computes the hash from the key that you want to search for, and it then searches for the item in the bin associated with the hash.

If item keys are the names of people and one uses the first character of the last name as the hash, then a hash search will work well for last names starting with X, because not many last names start with X. But the search works poorly for finding Mr. John Smith, because there are many names that begin with S.

A clean hash function, when given a series of keys, gives back a reasonable approximation of a uniform random distribution of hashes. This means that all bins will probably be about the same size, so that, on the average, searching will take about the same amount of time to find a record in any bin.

Hash searching is the most powerful method of retrieving data known. It works well even with rather poor hash functions, that is those that do not distribute the input cleanly. It is so powerful that most automated retrieval mechanisms today are based on hashing.

History of Hashing

The idea of hashing started in the earliest days of computing. The first true electronic computers began to run in 1949 and 1950. A proposal for hash search was described by Hans Peter Luhn in an IBM technical memorandum in 1953. What he wanted was a function that would deliberately abuse keys producing practically the equivalent of the mathematical concept of uniformly distributed random variables.[bashe86]

Luhn's goal for producing uniform randoms is one approach, but often in computer science the goal of getting a completely even distribution has been substituted for it. The change in goal is significant and leads to two completely different lines of research both of which are commonly called `hashing'. Creating even distributions can only be done by considering the structure of the keys, so the method can never be `general'. However creating random uniform distributions can be done without respect to the structure of the key, and so it can be provided as a standard part of a language library. The word `hashing', as used in this paper, refers only to the goal of producing a uniform random distribution of a key set.

In some sense it is commonly believed that it is impossible to create a general hash function that would work for any set of keys. This is because all hash function have a worst case, in which the hash function fails to hash at all, and all records fall into a single bin. But it is silly to define a general hash function in a way that makes it not a hash function, which is what this argument effectively does.

It is better to define a general hash function as one which distributes hash values from any independent domain in a way the approximates a random uniform variable. Then it is only the probability of a bad hash that comes into question. Gonnet studied this and came to the conclusion that the number of items in the biggest bin grew as inverse factorial of the number of items hashed[gonnet8x]. This growth rate of the worst case of the average table is so slow above thirty items that essentially it never gets any worse. So a general hash function which approximates a random uniform variable for all inputs will have a very low probability of producing a bad search table.

It is also impossible to create a general hash function that works better that uniform random for any arbitrary set of keys. However if one knows the keys in advance one can tailor hash functions to perform better than a random uniform table. For a language library function which, in general, can't know keys in advance, a random uniform hash is the best that can be done.

Much work has been done on specialized hash functions trying to minimize the distribution for give key sets or for keys derived from a given probability distribution. As a consequence little serious work has been done on general hash functions themselves. Recently, however, there have been a few papers which provide notable exceptions to the general trend [mckenzie90, pearson90, uzgalis91]. From these three papers have come a growing recognition of principles which govern the evaluation of and requirements for general hash functions.