Hash-Based Sets and Maps

Steven J. Zeil

Last modified: Oct 26, 2023

Contents:

1 Examples: The Unordered Set

1.1 An Unordered Set of Strings

2 Supplying a Hash Function

2.1 std::hash

2.2 Programmer-defined types

2.3 Passing A Hash Function to an Unordered Container

The original associative containers (sets and maps) in the C++ std library are based, as we have seen, on balanced binary trees. There are times when even the $O(\log \mbox{size()})$ performance of these containers is considered too slow.

The C++11 standard added hashing-based versions of these containers to serve in such circumstances.

The hash-based versions of set and map (and of their “multi-” cousins) offer an average of nearly $O(1)$ time for insertion and searching. As always with hashing, we pay for this increase in speed with an increase in the memory required. To preserve that average $O(1)$ time, these classes rehash when the tables get full enough to degrade performance.
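
The rehashing behavior described above can be observed through the bucket interface of the standard containers. The following is only a sketch: it uses the standard members load_factor, max_load_factor, bucket_count, and reserve, while the helper names loadFactorStaysBounded and reserveAvoidsRehash are invented for this illustration.

```cpp
#include <cstddef>
#include <unordered_set>

// Insert n values and check, after every insertion, that the table has
// kept its load factor (size / bucket_count) at or below the maximum.
bool loadFactorStaysBounded(std::size_t n)
{
  std::unordered_set<std::size_t> table;
  for (std::size_t i = 0; i < n; ++i)
  {
    table.insert(i);
    if (table.load_factor() > table.max_load_factor())
      return false;   // would mean the table did not rehash in time
  }
  return true;
}

// Reserve room for n elements up front, then report whether the bucket
// count stayed the same (no rehash took place) while inserting them.
bool reserveAvoidsRehash(std::size_t n)
{
  std::unordered_set<std::size_t> table;
  table.reserve(n);
  std::size_t bucketsBefore = table.bucket_count();
  for (std::size_t i = 0; i < n; ++i)
    table.insert(i);
  return table.bucket_count() == bucketsBefore;
}
```

Calling reserve before a large batch of insertions is one way to trade memory up front for fewer rehashes later.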

The tree-based set and map containers have the property that they keep their keys in order. When we use iterators to look at the contents of a std::set, for example, we get the data in ascending order.

Hash tables, on the other hand, by their very nature try to distribute their keys as randomly as possible. So one of the things that we give up when using hash-based storage is that ordering. We can still use iterators to get at all the keys, but there’s no telling in what order we will see those data values appear. Because of this, the new hash-based containers have been dubbed unordered associative containers.

1 Examples: The Unordered Set

When we use ordered sets, based on binary search trees, we are required to supply an ordering relational function such as a less-than operator. With hash-based containers, we have to supply instead an equality comparison function and a hashing function. The first will usually default to operator==. The hashing function has default values for strings and other “built-in” types, but we will usually have to supply our own when building unordered sets or maps on our own ADTs.

1.1 An Unordered Set of Strings

Here is an example of the use of an unordered set to remove all duplicate strings in a vector.

#include <unordered_set>
#include <string>
#include <vector>

using namespace std;

void discardDuplicates (vector<string>& v)
{
  unordered_set<string> unique;
  for (vector<string>::size_type i = 0; i < v.size(); ++i)
  {
    unique.insert (v[i]);   // duplicates are silently ignored
  }
  v.clear();
  for (unordered_set<string>::iterator p = unique.begin();
       p != unique.end(); ++p)
    v.push_back(*p);
}

The algorithm takes advantage of the fact that sets ignore attempts to add duplicate elements.

So the first loop simply adds every string in the vector into a set, which has the side effect of only keeping a single copy of any duplicated values. Then the second loop copies the resulting collection of unique strings from the set back into the vector.

There’s not a whole lot special about the unordered set we are using. In fact, this code would work, though possibly a bit more slowly, with an ordinary tree-based std::set. That’s not an accident. The unordered associative containers have an interface essentially identical to that of their tree-based ordered counterparts. The only differences that you will find are some new operations to force rehashing, and some changes to the constructors to allow explicit passing of hash functions and equality tests. (We did not need to do so for this example, because the default values for strings would work perfectly well for us.)

This code, by the way, could easily be simplified, but I used a slightly wordy approach so that you could see how the insertion operation looks just like its ordered counterpart. Here’s another possibility:

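#include <unordered_set>
#include <string>
#include <vector>
#include <algorithm>

using namespace std;

void discardDuplicates (vector<string>& v)
{
  // The range constructor inserts every string, dropping duplicates.
  unordered_set<string> unique (v.begin(), v.end());
  // Replace the vector's contents with the unique strings.
  v.assign(unique.begin(), unique.end());
}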

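What’s the complexity of this? Writing the assign call out as roughly equivalent clear, reserve, and copy steps makes the individual costs easier to see. In the worst case, there are no actual duplicates, and the first line adds v.size() strings at a worst-case cost of O(unique.size()) for each addition: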

void discardDuplicates (vector<string>& v)
{
  unordered_set<string> unique (v.begin(), v.end());     // O(v.size()^2)
  v.clear();                                             // O(v.size()): destroys each string
  v.reserve (unique.size());                             // O(1)
  copy (unique.begin(), unique.end(), back_inserter(v)); // O(v.size())
}

for an overall $O(\mbox{v.size()}^2)$.

But, on average, those insertions into unique will be $O(1)$, so we get an average complexity:

void discardDuplicates (vector<string>& v)
{
  unordered_set<string> unique (v.begin(), v.end());     // O(v.size())
  v.clear();                                             // O(v.size()): destroys each string
  v.reserve (unique.size());                             // O(1)
  copy (unique.begin(), unique.end(), back_inserter(v)); // O(v.size())
}

for an overall $O(\mbox{v.size()})$.

It’s worth noting that the algorithm would work with the tree-based ordered set as well, but then each insertion is $O(\log \mbox{unique.size()})$ on average, which leads to

void discardDuplicates (vector<string>& v)
{
  set<string> unique (v.begin(), v.end());               // O(v.size() log(v.size()))
  v.clear();                                             // O(v.size()): destroys each string
  v.reserve (unique.size());                             // O(1)
  copy (unique.begin(), unique.end(), back_inserter(v)); // O(v.size())
}

for an overall average $O(\mbox{v.size()} \log(\mbox{v.size()}))$.

Neither the set nor the unordered_set version of this algorithm will preserve the original ordering of the strings in v. However, the set version will result in the output v containing the strings in sorted order, while the unordered_set version leaves them in an arbitrary, largely unpredictable order.
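
If the original ordering does matter, one possible variation (not part of the examples above, and offered only as a sketch) uses the unordered set purely as a record of the strings already seen. The function name discardDuplicatesKeepingOrder is hypothetical.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Remove duplicates from v, keeping the first occurrence of each string
// in its original relative position.
void discardDuplicatesKeepingOrder(std::vector<std::string>& v)
{
  std::unordered_set<std::string> seen;
  std::vector<std::string> result;
  result.reserve(v.size());
  for (const std::string& s : v)
  {
    // insert() returns a pair whose .second member is true only when
    // s was not already in the set.
    if (seen.insert(s).second)
      result.push_back(s);
  }
  v.swap(result);
}
```

Each string still costs an average of $O(1)$ to check and insert, so the average complexity remains $O(\mbox{v.size()})$.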

2 Supplying a Hash Function

2.1 std::hash

The C++ std library provides hash functions for most of the types you would expect. Most of these hash functions are in the <functional> header. Others are in the same headers where you find the types in the first place. For example, the string hash function is provided in <string>.

The hash functions are provided by the template std::hash, but std::hash is not a function - it’s a template class that can be used to “generate” hash functions and save them in variables.

#include <functional>
#include <string>
   ⋮
std::hash<int> int_hash; // int_hash(...) is a hash function for integers
std::hash<double> dbl_hash; // dbl_hash(...) is a hash function for doubles
std::hash<std::string> str_hash; // str_hash(...) is a hash function for strings
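
As a brief sketch of how such hash objects are called, the two hypothetical helpers below hash a string in two equivalent ways: through a named std::hash object, and through a temporary created on the spot.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Hash a string using a named function object.
std::size_t hashOfString(const std::string& s)
{
  std::hash<std::string> str_hash;   // an object that can be called like a function
  return str_hash(s);
}

// Hash a string using a temporary function object: the first pair of
// parentheses constructs the object, the second pair calls it.
std::size_t hashOfStringTemporary(const std::string& s)
{
  return std::hash<std::string>()(s);
}
```

Both helpers return the same value for the same string within a single run of a program. The actual numbers are implementation-specific and should not be relied upon.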

2.2 Programmer-defined types

For more complex, programmer-defined types, we build our own hash functions by combining hash values for the type’s data members.

For example, given

class Book {
public:
   Book (const Author& author,
         const string& title,
         const Publisher& publisher,
         int editionNumber,
         int yearPublished);
     ⋮
   std::size_t hash() const;
private:
  Author theAuthor;
  string theTitle;
  Publisher thePublisher;
  int theEdition;
  int theYear;
};

bool operator== (const Book& left, const Book& right);

We might use this as a possible hash function:


std::hash<string> str_hash;

std::size_t Book::hash() const
{
  return theAuthor.hash() + 31*str_hash(theTitle)
    + 59*thePublisher.hash() + 701*theEdition + 131*theYear;
}

This assumes that our Author and Publisher classes provide their own hash() functions. So we combine those values with hashes for integers and the title string. (Note again the pervasive use of prime numbers in hash-related calculations.)

For consistency, we will assume that the equality operator works on the same fields (it’s crucial that any two objects that are “equal” must have the same hash value):

bool operator== (const Book& left, const Book& right)
{
  // Compares private members, so Book must declare this operator a friend.
  return left.theAuthor == right.theAuthor  &&
    left.theTitle == right.theTitle &&
    left.thePublisher == right.thePublisher &&
    left.theEdition == right.theEdition &&
    left.theYear == right.theYear;
}
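
The requirement that equal objects hash equally can be sketched on a cut-down type. SimpleBook below is a hypothetical stand-in for Book (it omits the Author and Publisher members so that the example is self-contained), and hashIsConsistentWithEquality is an illustrative helper, not part of any library.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical, simplified stand-in for the Book class.
struct SimpleBook {
  std::string title;
  int edition;
  int year;

  // Combine the hashes of exactly the fields that operator== compares.
  std::size_t hash() const
  {
    return 31 * std::hash<std::string>()(title)
         + 701 * std::hash<int>()(edition)
         + 131 * std::hash<int>()(year);
  }
};

bool operator==(const SimpleBook& left, const SimpleBook& right)
{
  return left.title == right.title
      && left.edition == right.edition
      && left.year == right.year;
}

// The rule to maintain: if a == b, then a.hash() == b.hash().
// (Unequal objects are allowed to collide, so nothing is required of them.)
bool hashIsConsistentWithEquality(const SimpleBook& a, const SimpleBook& b)
{
  return !(a == b) || a.hash() == b.hash();
}
```

If the hash function used a field that the equality operator ignores, two “equal” books could land in different buckets and a search for one would fail to find the other.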

2.3 Passing A Hash Function to an Unordered Container

In the previous section, we saw how to design a hash function for a class, such as our Book::hash member function. That still leaves the challenge of how to pass that hash function to an unordered set or map.

We’re going to enhance our Book example by adding a second hash function and a matching equality test:

class Book {
public:
   Book (const Author& author,
         const string& title,
         const Publisher& publisher,
         int editionNumber,
         int yearPublished,
         const string& isbn);
     ⋮
   std::size_t hash() const;
   std::size_t hashByISBN() const {return std::hash<string>()(theISBN);}
private:
  Author theAuthor;
  string theTitle;
  Publisher thePublisher;
  int theEdition;
  int theYear;
  string theISBN;
};

bool operator== (const Book& left, const Book& right);
bool compareByISBN (const Book& left, const Book& right) {return left.getISBN() == right.getISBN();}

The hash and operator== functions are implemented, as in the prior section, to work with the author, title, publisher, edition, and year.
The hashByISBN and compareByISBN functions, by contrast, provide an alternate way to find books by searching for the ISBN instead.

There are three ways to tell an unordered container what hash function you want to use.

2.3.1 As a template parameter

You can make the hash function a part of the set/map data type when you instantiate the template to declare the data type. This is useful if you want all containers of this type to use the same hash function. It’s possible to have many different types of unordered containers on the same key data type, each using a different hash function and equality operator. For example, we might have one unordered_set<Book> that compares books by title, author, publisher, and edition, as we did in the last section, and another that compares books by ISBN.

You do this by declaring a functor class to provide the hash function and passing that class as a template parameter:

```c++
class BookHash {
public:

  std::size_t operator() (const Book& b) const
  {
    return (std::size_t)b.hash();
  }
};
   ⋮
  typedef unordered_map<Book,Price,BookHash > HashTable;

  HashTable recommendedPrices;
   ⋮
void purchase(Book b, Price p)
{
  if (p < recommendedPrices[b])
    cout << "You got a bargain!" << endl;
}
```

It’s important to note here that BookHash is a data type, not a function. The third template parameter in

typedef unordered_map<Book,Price,BookHash > HashTable;

must be a data type. That’s why we had to declare the BookHash functor class rather than passing the Book::hash function directly.

If we want to implement a different table to search by ISBN, we supply a different hash functor and a different, matching, comparison operator:

```c++
class ISBNHash {
public:

  std::size_t operator() (const Book& b) const
  {
    return (std::size_t)b.hashByISBN();
  }
};
   ⋮
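// The equality test must also be a type, so wrap compareByISBN in a functor.
class ISBNEqual {
public:

  bool operator() (const Book& left, const Book& right) const
  {
    return compareByISBN(left, right);
  }
};
   ⋮
  typedef unordered_map<Book,Price,ISBNHash,ISBNEqual > HashTable2;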

  HashTable2 recommendedPricesByISBN;
   ⋮
```
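
To see how the choice of hash and equality types changes a container's behavior, here is a self-contained sketch. MiniBook, TitleHash, TitleEqual, MiniISBNHash, MiniISBNEqual, and countDistinct are all hypothetical names introduced for this illustration; only std::unordered_set and std::hash come from the standard library.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>

struct MiniBook {              // hypothetical, simplified stand-in for Book
  std::string title;
  std::string isbn;
};

// Hash and equality functors that look only at the title.
struct TitleHash {
  std::size_t operator()(const MiniBook& b) const
  { return std::hash<std::string>()(b.title); }
};
struct TitleEqual {
  bool operator()(const MiniBook& l, const MiniBook& r) const
  { return l.title == r.title; }
};

// Hash and equality functors that look only at the ISBN.
struct MiniISBNHash {
  std::size_t operator()(const MiniBook& b) const
  { return std::hash<std::string>()(b.isbn); }
};
struct MiniISBNEqual {
  bool operator()(const MiniBook& l, const MiniBook& r) const
  { return l.isbn == r.isbn; }
};

// Two distinct container types over the same key type.
typedef std::unordered_set<MiniBook, TitleHash, TitleEqual> BooksByTitle;
typedef std::unordered_set<MiniBook, MiniISBNHash, MiniISBNEqual> BooksByISBN;

// Insert two books into a fresh set of the given type and report how
// many that set considers distinct.
template <typename SetType>
std::size_t countDistinct(const MiniBook& a, const MiniBook& b)
{
  SetType s;
  s.insert(a);
  s.insert(b);
  return s.size();
}
```

A hardcover and a paperback with the same title but different ISBNs count as one entry in BooksByTitle and as two entries in BooksByISBN.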

2.3.2 Specialize std::hash

Alternatively, sometimes you would like to specify a “default” hash function for a type such as Book so that anyone can create unordered sets or maps without creating their own hash functions.

This gets a little bit messier. By default, an unordered container will use the functor std::hash<Key> as the (type of the) hash function for type Key. In <functional>, std::hash provides hash functions for the C++ primitive types. Other headers, such as <string>, extend that same template to provide hash functions on non-primitive types.

```c++
std::hash<string> hash_s;
std::hash<double> hash_d;

string s = "abc";
double d = 3.14;
cout << "s hashes to " << hash_s(s) << " and d hashes to " << hash_d(d) << endl;
```

The problem is, hash<Key> does not exist for general types. It is already defined for the C++ basic types such as float and double and for selected standard classes such as string. But if you want to use it with your own types, you have to specialize the template std::hash for your type:

namespace std {

  template <>
  struct hash<Book> // denotes a specialization of hash<...>
  {
    std::size_t operator() (const Book& book) const
    {
      return (std::size_t)book.hash();
    }
  };
}

unordered_map<Book,Price> recommendedPrices; // uses hash<Book>
   ⋮
void purchase(Book b, Price p)
{
  if (p < recommendedPrices[b])
    cout << "You got a bargain!" << endl;
}

Specialization is one of those powerful but ugly features of C++ that most programmers rarely dabble in. It’s a way of giving an existing template a specific instantiation for selected data types.

Note, however, that specialization of std::hash is a one-function-only technique. We cannot specialize std::hash to use Book::hash some of the time and Book::hashByISBN at other times.
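
A complete, compilable sketch of the specialization technique follows. Edition is a hypothetical key type invented for this example, and lookupPrice is an illustrative helper; the pattern of specializing std::hash inside namespace std is the same one described above.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

struct Edition {               // hypothetical key type
  std::string title;
  int number;
};

// The default equality test for an unordered container is operator==.
bool operator==(const Edition& left, const Edition& right)
{
  return left.title == right.title && left.number == right.number;
}

namespace std {
  template <>
  struct hash<Edition>         // specialization of hash<...> for Edition
  {
    std::size_t operator()(const Edition& e) const
    {
      return std::hash<std::string>()(e.title)
           + 31 * std::hash<int>()(e.number);
    }
  };
}

// No hash type is named here: the map picks up std::hash<Edition>.
double lookupPrice(const std::unordered_map<Edition, double>& prices,
                   const Edition& e, double fallback)
{
  auto pos = prices.find(e);
  return (pos == prices.end()) ? fallback : pos->second;
}
```

Using find here, rather than operator[], avoids inserting a default-valued entry when the key is absent.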

2.3.3 As a parameter to the constructor call

If you have already supplied a hash function for the type of the unordered container using either of the two previous techniques, you can override that hash function and supply a different hash function when constructing specific variables of that type. You do this by passing the hash function as a parameter to the constructor.

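This only works if you have already used one of the other techniques to establish a default hash function, either for the container or for the key type. The objects that you pass to the constructor must also be convertible to the hash and equality types named in the container’s template parameters.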

Suppose, for example, that we have already used one of the two prior techniques to declare

typedef unordered_map<Book,Price, $\ldots$ > HashTable;
HashTable byAuthorTitlePublEdn;

We can still create specific HashTable variables that use our “by ISBN” functions by supplying those in the constructor:

```c++
class ISBNHash {
public:

  std::size_t operator() (const Book& b) const
  {
    return (std::size_t)b.hashByISBN();
  }
};
   ⋮

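const std::size_t initialSize = 2048;
// ISBNHash() creates a functor object; the constructor takes objects, not types.
// This compiles only if HashTable's hash and equality template parameters
// can hold these objects (for example, std::function types).
HashTable byISBN (initialSize, ISBNHash(), compareByISBN);
```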

Unfortunately, because the hash and comparison functions are not the first two parameters to the constructor, we also have to supply an initial size, even though we would usually be content to allow that to be set to its default value.

Notice that byAuthorTitlePublEdn and byISBN actually have the same data type, even though the items within each of those maps will be arranged in very different orders. Operations like

byISBN = byAuthorTitlePublEdn;

are perfectly legal, even if it’s hard to imagine a good reason for doing this.
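
One way to make constructor-supplied hash functions work is to give the container hash and equality types that can hold different functions at run time. The sketch below assumes std::function for that purpose, which is a choice made for this illustration rather than something the examples above prescribe. Volume and all of the function names are hypothetical.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>

struct Volume {                // hypothetical, simplified stand-in for Book
  std::string title;
  std::string isbn;
};

// Hash and equality types able to wrap any function with these signatures.
typedef std::function<std::size_t(const Volume&)> VolumeHash;
typedef std::function<bool(const Volume&, const Volume&)> VolumeEqual;
typedef std::unordered_set<Volume, VolumeHash, VolumeEqual> VolumeSet;

std::size_t volumeHashByTitle(const Volume& v)
{ return std::hash<std::string>()(v.title); }
bool volumeEqualByTitle(const Volume& l, const Volume& r)
{ return l.title == r.title; }

std::size_t volumeHashByISBN(const Volume& v)
{ return std::hash<std::string>()(v.isbn); }
bool volumeEqualByISBN(const Volume& l, const Volume& r)
{ return l.isbn == r.isbn; }

// The initial bucket count must be given because the hash and equality
// objects are the second and third constructor parameters.
VolumeSet makeSetByTitle()
{ return VolumeSet(16, volumeHashByTitle, volumeEqualByTitle); }

VolumeSet makeSetByISBN()
{ return VolumeSet(16, volumeHashByISBN, volumeEqualByISBN); }
```

Both factory functions return the same type, so sets built either way can be assigned to one another; the assignment copies the hash and equality objects along with the elements. Note that with this design a default-constructed VolumeSet holds empty std::function objects and would throw std::bad_function_call on first use, so every set must be built with explicit functions.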