Hashing

Steven J. Zeil

Last modified: Oct 26, 2023

Hashing is an important approach to set/map construction.

We’ve seen sets and maps with $O(N)$ and $O(\log N)$ search and insert operations.

Hash tables trade off space for speed, sometimes achieving an average case of $O(1)$ search and insert times.

1 Hashing 101: the Fundamentals

 

Hash tables use a hashing function to compute an element’s position within the array that holds the table.

If we had a really good hashing function, we could implement set insertion this way:

template <class T>
class set {
  ⋮
private:
  static const unsigned hSize = ...;
  T table[hSize];
};

template <class T>
void set<T>::insert (const T& key)
{
  unsigned h = hash(key);
  table[h] = key;
}

and searching through the table would not be much harder:

template <class T>
size_type set<T>::count (const T& key) const
{
  unsigned h = hash(key);
  if (table[h] == key)
    return 1;
  else
    return 0;
}

1.1 The Ideal: Perfect Hash Functions

 

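For this overly-simple form of hashing to work, the hash function must

  1. return values in the range 0 … hSize-1,
  2. be fast and easy to compute, and
  3. never return the same value for two different keys.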

A function that satisfies these requirements is called a perfect hash function.


Suppose, for example, that we were writing an application to work with calendar dates and wanted to quickly be able to translate the names of days of the work week (excluding the weekend) into numbers indicating how far into the week the day is:

Key Value
Monday 1
Tuesday 2
Wednesday 3
Thursday 4
Friday 5

If we are willing to use a table with a little bit (or a lot) of extra space, we could use a function

unsigned hash(const std::string& dayName)
{
    return (unsigned)dayName[1] - 'a';
}

because each of those five strings has a distinct second character.

So we can set up the table:

std::array<string, 5> days = {"Monday", "Tuesday",
    "Wednesday", "Thursday", "Friday"};
int table[96];
for (int i = 0; i < 5; ++i)
    table[hash(days[i])] = i+1;

and then afterwards, we can look up those day names in $O(1)$ time:

 int dayOfWeek (const string& dayName)
 {
     return table[hash(dayName)];
 }
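
The fragments above can be assembled into one compilable sketch. The names dayHash and buildDayTable are ours, chosen to avoid clashing with std::hash, and the table is cut down to 26 slots on the assumption that the second character is always a lowercase letter. Unused slots hold 0.

```cpp
#include <array>
#include <cassert>
#include <string>

// Perfect hash for the five work-week day names: the second character
// of each name is distinct, so no two names share a slot.
unsigned dayHash(const std::string& dayName)
{
    assert(dayName.size() >= 2);   // we need a second character
    return static_cast<unsigned>(dayName[1] - 'a');
}

// Builds the lookup table; unused slots stay 0.
std::array<int, 26> buildDayTable()
{
    const std::array<std::string, 5> days = {"Monday", "Tuesday",
        "Wednesday", "Thursday", "Friday"};
    std::array<int, 26> table{};   // value-initialized to all zeros
    for (int i = 0; i < 5; ++i)
        table[dayHash(days[i])] = i + 1;
    return table;
}

// O(1) lookup. Assumes dayName is one of the five names above.
int dayOfWeek(const std::string& dayName)
{
    static const std::array<int, 26> table = buildDayTable();
    return table[dayHash(dayName)];
}
```

With this sketch, dayOfWeek("Wednesday") returns 3. A name outside the five known keys gives a meaningless answer, which is exactly the limitation of a perfect hash function built for a fixed key set.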

Perfect hash functions are usually only possible if we know all the keys in advance. That rules out their use in most practical circumstances.

There are some applications where perfect hash functions are possible. For example, most programming languages have a large number of reserved words such as “if” or “while”, but for any given language the set of reserved words is fixed. Programmers who are writing a compiler for that language may use a perfect hash function over the language’s keywords to quickly recognize when a word read from the source code file is really a reserved word.

1.2 The Reality: Collisions

For the most part, though, we can’t really expect to have perfect hash functions. This means that some keys will hash to the same table location.

Two keys collide if they have the same hash function value.

For example, if we were to expand our days of the week code to include the weekend, then Sunday and Tuesday would collide under our chosen hash function because both have the same second letter. We could compensate with a more complicated hash function, perhaps one involving a pair of letters, but this could also increase the number of unused/wasted slots in the table.
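
A small sketch makes the collision concrete. The helper names secondLetterHash and collide are ours, not part of any library, and the hash is the same second-letter function used for the work-week table.

```cpp
#include <cassert>
#include <string>

// The second-letter hash from the days-of-the-week example.
unsigned secondLetterHash(const std::string& dayName)
{
    assert(dayName.size() >= 2);
    return static_cast<unsigned>(dayName[1] - 'a');
}

// Two distinct keys collide if they have the same hash value.
bool collide(const std::string& a, const std::string& b)
{
    return a != b && secondLetterHash(a) == secondLetterHash(b);
}
```

Here collide("Sunday", "Tuesday") is true because both names have 'u' as their second letter, while collide("Monday", "Tuesday") is false.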

Collisions are, in most cases, unavoidable, simply because we do not know, in advance, what all of the keys will be.

Consequently, we say that a good hash function will

  1. return values in the range 0 … hSize-1,
  2. be fast and easy to compute, and
  3. minimize the number of collisions.

2 Hash Functions

A good hash function will

  1. return values in the range 0 … hSize-1,
  2. be fast and easy to compute, and
  3. minimize the number of collisions.

Actually, the first of these three requirements is usually enforced inside the hash table code by the simple technique of taking hash() modulo hSize:

template <class T>
void set<T>::insert (const T& key)
{
  unsigned h = hash(key) % hSize;
  table[h] = key;
}

template <class T>
size_type set<T>::count (const T& key) const
{
  unsigned h = hash(key) % hSize;
  if (table[h] == key)
    return 1;
  else
    return 0;
}

And, unless we have special knowledge about the keys, the best we can say about “minimizing the number of collisions” is that we hope that our hashing function will distribute the keys uniformly, i.e., if I am drawing keys at random, the probability of the next key’s going into any particular position in the hash table should be the same as for any other position.


So the characteristics that we’ll look for in a good hash function are that it be fast and easy to compute and that it distribute the keys uniformly over the table.

The possibility of collisions also forces us to revise those simple table retrieval algorithms to include collision handling, which we will discuss a little later.

3 Hash Functions: Examples

The proper choice of hash functions depends upon the structure and distribution of the keys.

Don’t get hung up on trying to find hash functions that “mean something”. Most hash functions don’t compute anything useful or “natural”. They are simply functions chosen to satisfy our requirements that they be fast and distribute the keys uniformly over the range 0 … hSize-1.

3.1 Hashing Integers

This is the easiest possible case.

If we have a set of integer keys that are already in the range 0 … hSize-1, we don’t need to do anything:

int hash(int i) {return i;}

If the keys are in a wider range, we could employ the modulus trick:

int hash(int i) {return i % hSize;}

but usually we don’t bother because, as discussed earlier, this modulus transformation is usually done inside the hash table code anyway.

Later, when we look at variable hashing and at the std:: containers that use hashing, we won’t even know the value of hSize, so we will have to trust that the table takes care of that modulus internally.


So integer keys are easy, but you can’t always take them for granted. I once worked for a company that assigned an integer ID number to every employee. When the computerized system for the company payroll was first instituted, the way the IDs were assigned went like this: the existing employees were listed alphabetically and given ID numbers spaced 5 apart (00005, 00010, 00015, and so on).

This left “gaps” in the ID number sequence that could be used in subsequent years for new employees.

When a new person was hired, someone would compare the new person’s name to the alphabetical list of employee names and would assign the new person a number lying somewhere in the gap between the people whose names came just before and after the new person’s.

Because of this scheme, more than 3/4 of the ID numbers in the company were evenly divisible by 5.

Now, suppose we took those numbers and hashed them into a table of size hSize==100.

int hash(int i) {return i % 100;}

There are 20 numbers divisible by 5 in the range from 0 to 99. So 3/4 of the ID numbers would hash into only 1/5 of the table positions. These numbers are not being distributed uniformly.

There is, as it happens, a very easy fix for this: add one more element to the table. The same set of IDs will do very well with an hSize of 101:

int hash(int i) {return i % 101;}

Check it out:

keys                     hash to
00005, 00010, … 00100    5, 10, … , 100
00105, 00110, … 00200    4, 9, … , 99
00205, 00210, … 00300    3, 8, … , 98
00305, 00310, … 00400    2, 7, … , 97

The lesson here: the distribution of the original key values is important.
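
We can check the claim by counting how many distinct table slots the IDs actually reach. This is only a sketch: the function name slotsUsed is ours, and it assumes every ID is a multiple of 5, which is a simplification of the "more than 3/4" situation described above.

```cpp
#include <cassert>
#include <set>

// Counts how many distinct table slots are hit when the IDs
// 5, 10, 15, ..., maxId are hashed with "id % hSize".
int slotsUsed(int hSize, int maxId)
{
    assert(hSize > 0);
    std::set<int> slots;
    for (int id = 5; id <= maxId; id += 5)
        slots.insert(id % hSize);
    return static_cast<int>(slots.size());
}
```

With IDs up to 10000, slotsUsed(100, 10000) is 20 while slotsUsed(101, 10000) is 101, so the slightly larger table uses every one of its slots.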

3.1.1 The Curious Power of the Prime Modulus

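The fact that 101 worked so well for hSize is no accident. The trick of taking the integer key modulo hSize spreads keys poorly when many keys share a common factor with hSize, because such keys can only land in slots that are multiples of that factor. A prime hSize shares no factor with any key that is not itself a multiple of hSize.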

Consequently, it’s a standard part of hashing “lore” to try and use prime (or nearly prime) numbers for the hash table size. Shortly, we’ll see that, for some collision handling schemes, the use of prime table sizes is particularly important.
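
The arithmetic behind this lore can be sketched directly. If the keys are the multiples of some step value, the number of slots they can ever reach is hSize divided by the greatest common divisor of step and hSize. The function names below are ours, and the sketch assumes there are enough keys to reach every slot that is reachable at all.

```cpp
#include <cassert>

// Euclid's algorithm.
unsigned greatestCommonDivisor(unsigned a, unsigned b)
{
    while (b != 0) {
        unsigned t = a % b;
        a = b;
        b = t;
    }
    return a;
}

// Number of distinct slots reachable by keys 0, step, 2*step, ...
// when each key is reduced modulo hSize.
unsigned reachableSlots(unsigned step, unsigned hSize)
{
    assert(step > 0 && hSize > 0);
    return hSize / greatestCommonDivisor(step, hSize);
}
```

For a step of 5, a table of size 100 gives 100/5 = 20 reachable slots, but a table of size 101 gives 101/1 = 101. A prime table size keeps the divisor at 1 for any step that is not a multiple of the table size.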

3.2 Hashing Character Strings

Hash functions for strings generally work by adding up some expression applied to each character in the string (remember that a char is just another integer type in C++).

We need to be a little bit careful to get an appropriate distribution. For one thing, although an 8-bit char could be any of 256 different values, most strings actually contain only the 95 printable ASCII characters, codes 32 (blank) through 126.

In addition, we often want to make sure that similar strings, likely to occur together, don’t hash to the same location. For example, many words differ from one another only in having two adjacent characters transposed (and, if we were programming a spelling checker, you might want to consider that character transposition is a very common spelling error). So a simple hash function like this:

unsigned hash (const string& s)
{
   unsigned h = 0;
   for (int i = 0; i < s.length(); i++)
      h += s[i];
   return h;
}

doesn’t work very well. Words that differ only by transposition of characters would have the same hash value.
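
To see the problem concretely, here is the same summing function as a self-contained sketch, renamed additiveHash (our name) so that it does not clash with std::hash.

```cpp
#include <string>

// Adds up the character codes; the order of the characters is ignored.
unsigned additiveHash(const std::string& s)
{
    unsigned h = 0;
    for (unsigned char c : s)
        h += c;
    return h;
}
```

Both "form" and "from" hash to 102 + 111 + 114 + 109 = 436, because addition does not care which order the characters arrive in.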

A better approach is to use multipliers to make every character position “count” differently in the final sum.

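unsigned hash (const string& s)
{
   unsigned long long h = 0;   // wide enough that C*h + s[i] cannot overflow for int-sized C and M
   for (std::size_t i = 0; i < s.length(); i++)
      h = (C*h + (unsigned char)s[i]) % M;   // unsigned char keeps each term non-negative
   return (unsigned)h;
}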

where C is an integer multiplier and M is a modulus, used to keep the whole sum from overflowing. These are usually chosen as prime numbers. C can be a small prime, but M needs to be large since this function will return hash values in the range 0 … M-1, which has to be at least as large as the hash table size.
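
As a minimal sketch with concrete numbers, we might pick C = 31 and M = 1000000007. Both values are our own illustrative choices, not ones prescribed here, and polyHash is a made-up name. The accumulator is 64 bits wide so that C*h plus a character code cannot overflow before the modulus is taken.

```cpp
#include <cstdint>
#include <string>

// Illustrative constants (assumptions): a small prime multiplier
// and a large prime modulus.
const std::uint64_t C = 31;
const std::uint64_t M = 1000000007;

std::uint64_t polyHash(const std::string& s)
{
    std::uint64_t h = 0;
    for (unsigned char c : s)     // unsigned char avoids negative codes
        h = (C * h + c) % M;      // h < M, so C*h + c fits easily in 64 bits
    return h;
}
```

Because each position is now weighted by a different power of C, "form" and "from" no longer produce the same value.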


C++ now provides a standard hashing function for std::string, called std::hash. However, std::hash is actually a template for a functor class, so we typically use it like this:

std::hash<string> hash_s;
std::size_t hashValue = hash_s("abcdef");
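
In a hash table we would then reduce that value to a table position with the usual modulus. The helper below is a sketch of that step. The name slotFor is ours, and the actual numbers std::hash produces depend on the library implementation, so we make no claim about which slot a given string lands in.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Maps a string to a slot in a table of hSize entries using std::hash.
std::size_t slotFor(const std::string& key, std::size_t hSize)
{
    assert(hSize > 0);
    std::hash<std::string> hash_s;
    return hash_s(key) % hSize;
}
```

All we can rely on is that the result is less than hSize and that equal strings always get the same slot within one run of the program.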

3.3 Hashing Compound Structures

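When you need a hash function for a more elaborate data type, you generally try to reduce the problem to hashing its simpler components: choose the fields that distinguish one value from another, hash each of them, and combine the results.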

class Book {
public:
   Book (const Author& author,
         const string& title,
         const string& isbn,
         const Publisher& publisher,
         int editionNumber,
         int yearPublished);
     ⋮
   int hash() const;
private:
  Author theAuthor;
  string theTitle;
  string theIsbn;
  Publisher thePublisher;
  int theEdition;
  int theYear;
};

For example, if we wanted a hash table of books, we would probably take advantage of the fact that each book has a unique ISBN number:

int Book::hash() const
{
  std::hash<string> hash_s;
  return hash_s(theIsbn);
}

so that hashing a book turns into a simple problem of hashing a single string.

But what if we didn’t have that nice convenient ISBN field? Then we would need to use a combination of the other fields that, combined, would uniquely identify the book:

int Book::hash() const
{
  std::hash<string> hash_s;
  return theAuthor.hash() + 73*hash_s(theTitle)
    + 557*thePublisher.hash() + 677*theEdition;
}

(Why not theYear? Because once we have determined the publisher and the edition, the year is already fixed, so adding that in as well won’t help distinguish one book from another.) Notice how this hash function breaks down into a series of hashes on other data types, including strings and integers.
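
Here is a compilable sketch of the same idea. Since the Author and Publisher classes are not shown, this stand-in (SimpleBook, our name) replaces them with plain strings. The multipliers are the ones used above, and the result type is std::size_t so that the unsigned arithmetic simply wraps around on overflow.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Simplified stand-in for Book: Author and Publisher become strings.
struct SimpleBook {
    std::string author;
    std::string title;
    std::string publisher;
    int edition;
    int year;   // deliberately left out of the hash

    std::size_t hash() const
    {
        std::hash<std::string> hash_s;
        // Each field gets its own multiplier so that the fields
        // contribute differently to the final sum.
        return hash_s(author) + 73 * hash_s(title)
             + 557 * hash_s(publisher)
             + 677 * static_cast<std::size_t>(edition);
    }
};
```

Two SimpleBook values that agree on author, title, publisher, and edition always get the same hash, whatever their year fields say.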

3.4 Hashing and Equality

A key requirement if we are going to use hashing is that comparing hash codes can be treated as a way to see if two values are likely to be equal to one another.

For a good hash function,

  • If x == y, then hash(x) == hash(y).

  • If hash(x) == hash(y), then there is a good chance that x == y.

  • If hash(x) != hash(y), then x != y.


As we noted some time ago when discussing relational operators, we often have choices as to what we want operator== to mean for a newly created ADT.

If we have a Book ADT that implements its operator== by comparing ISBNs, then we should hash on the ISBN. But if that Book ADT tests for equality by comparing titles and author names, then we should hash books on their titles and author names.
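
The sketch below shows the first of those two choices using std::unordered_set, one of the std:: containers that use hashing. IsbnBook, IsbnBookHash, and BookSet are hypothetical names introduced for illustration. Equality looks only at the ISBN, so the hash functor looks only at the ISBN as well.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>

struct IsbnBook {
    std::string isbn;
    std::string title;
};

// Equality is defined by the ISBN alone ...
bool operator==(const IsbnBook& a, const IsbnBook& b)
{
    return a.isbn == b.isbn;
}

// ... so the hash must depend on the ISBN alone.
struct IsbnBookHash {
    std::size_t operator()(const IsbnBook& b) const
    {
        return std::hash<std::string>()(b.isbn);
    }
};

using BookSet = std::unordered_set<IsbnBook, IsbnBookHash>;
```

With this pairing, a lookup for a book with the same ISBN but a different title still finds the stored entry. Had we hashed on the title instead, two books that compare equal could land in different positions, and the search would wrongly report the book as missing.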