Sets and MultiSets
Steven J. Zeil
The std:: ADTs we have looked at so far in this course have been what are sometimes called sequential containers.
- They maintain elements in a known sequence.
- We access the elements by indicating a position in the sequence from which we wish to obtain an element.
- We can indicate the position numerically, e.g.,
a[23]
- Or symbolically, e.g.,
myVector.front()
But always, in order to get an element, we have to know where it is in relation to the other elements.
Now we turn our attention to associative containers.
- These maintain the elements in a structure that is intended to allow rapid access to elements based upon their value.
1 Overview of Sets and Maps
The major associative classes in the standard library are
- set
- map
- multiset
- multimap
Sets are containers to which we can add elements (called “keys”) and later check to see if certain key values are present in the set.
Maps, also known in other contexts as “lookup tables” or “dictionaries”, allow us
- to store pairs consisting of a key value and associated data value and
- to later look up the data value, if any, associated with a given key.
(In some contexts, especially more mathematical ones, the set of keys is called the domain of the map and the set of associated data values is called the range of the map.)
In a set or a map, a given key value may appear only once.
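- Attempting to add a key K to a set that already contains a key equal to K leaves the set unchanged.
- Storing a key-data pair (K,D1) in a map that already has (K,D2), for example by assigning through the map’s operator[], replaces (K,D2) by (K,D1). (The map’s insert function, by contrast, leaves the existing pair in place.)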
But in a multiset or multimap, the same key can occur any number of times.
- For a multiset, this means that instead of just asking “is K in this set?” we can now ask “how many K’s are in this set?”.
- For a multimap, adding a key-data pair (K,D1) to a multimap that already has (K,D2) results in a multimap that has both (K,D1) and (K,D2).
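As a quick illustration of the difference in how duplicates are treated, here is a minimal sketch. The function names copiesInSet and copiesInMultiset are invented for this example and are not part of the standard library.

```cpp
#include <cstddef>
#include <set>
#include <string>

// Inserts the same key twice into a set and reports how many copies are held.
std::size_t copiesInSet(const std::string& key)
{
    std::set<std::string> s;
    s.insert(key);
    s.insert(key);          // an equal key is already present
    return s.count(key);    // never more than 1 for a set
}

// Same experiment with a multiset, which keeps every copy.
std::size_t copiesInMultiset(const std::string& key)
{
    std::multiset<std::string> ms;
    ms.insert(key);
    ms.insert(key);
    return ms.count(key);
}
```

Calling copiesInSet("abc") and copiesInMultiset("abc") shows the set holding one copy and the multiset holding two.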
We’ll look at sets and multisets first, and then turn to maps and multimaps in a later lecture.
2 The Set ADT
A set collects elements and lets you check to see if an element is already in the collection.
- A given element can occur at most once in a set.
- Provides iterators that allow you to list the elements in sorted order.
#ifndef SET_H
#define SET_H
#include <cstddef>
template <class Key, class Compare=less<Key> >
class set {
private:
Compare comparator;
public:
// typedefs:
typedef Key key_type;
typedef Key value_type;
typedef Compare key_compare;
typedef Compare value_compare;
typedef const Key& reference;
typedef const Key& const_reference;
typedef ... const_iterator;
typedef const_iterator iterator;
typedef ... const_reverse_iterator;
typedef const_reverse_iterator reverse_iterator;
typedef ... size_type;
typedef ... difference_type;
// allocation/deallocation
explicit set(const Compare& comp = Compare());
set(const set<Key, Compare>& x);
template <class InputIterator>
set(InputIterator first, InputIterator last,
const Compare& comp = Compare());
set<Key, Compare>& operator=(const set<Key, Compare>& x);
// accessors:
key_compare key_comp() const { return comparator; }
value_compare value_comp() const { return comparator; }
iterator begin() const;
iterator end() const;
reverse_iterator rbegin() const { return t.rbegin(); }
reverse_iterator rend() const { return t.rend(); }
bool empty() const { return (size() == 0); }
size_type size() const;
size_type max_size() const;
void swap(set<Key, Compare>& x);
// insert/erase
pair<iterator, bool> insert(const value_type& x);
iterator insert(iterator position, const value_type& x);
void clear();
void erase(iterator position);
size_type erase(const key_type& x);
void erase(iterator first, iterator last);
// set operations:
iterator find(const key_type& x) const;
size_type count(const key_type& x) const;
iterator lower_bound(const key_type& x) const;
iterator upper_bound(const key_type& x) const;
pair<iterator, iterator> equal_range(const key_type& x) const;
private:
⋮ // declaration of implementing data structure
};
template <class Key, class Compare>
bool operator==(const set<Key, Compare>& x,
const set<Key, Compare>& y);
template <class Key, class Compare>
bool operator< (const set<Key, Compare>& x,
const set<Key, Compare>& y);
#endif
The interface to multiset is essentially identical to that of the set (aside from replacing the name “set” by “multiset”); it’s only the behavior of a few of the operations that differs. So, as we look at the set interface, keep in mind that the same things apply to multisets as well.
2.1 The template header
template <class Key, class Compare=less<Key> >
class set {
Typically, you would declare a set by instantiating the template on the data type that you want to use for the key:
set<string> myStringSet;
But as you can see from the template header, this is an oversimplification. There is a second parameter, Compare, that supplies the comparison function to be used in comparing elements of the set. This parameter is given a default value, less<Key>.
So the above instantiation is actually equivalent to
set<string, less<string> > myStringSet;
What does less< ... > do? It simply uses the Key’s own operator< to do the comparisons, and that’s good enough most of the time.
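To see the effect of the Compare parameter, the following sketch builds the same three names into a set that uses the default less<string> and into one instantiated on greater<string>. The helper names contentsOf, ascendingNames, and descendingNames are made up for this illustration.

```cpp
#include <functional>
#include <set>
#include <string>
#include <vector>

// Copies the contents of any ordered set of strings, in the set's own order.
template <class SetType>
std::vector<std::string> contentsOf(const SetType& s)
{
    return std::vector<std::string>(s.begin(), s.end());
}

std::vector<std::string> ascendingNames()
{
    std::set<std::string> s;   // same as set<string, less<string> >
    s.insert("bob");
    s.insert("alice");
    s.insert("carol");
    return contentsOf(s);
}

std::vector<std::string> descendingNames()
{
    std::set<std::string, std::greater<std::string> > s;  // compares with >
    s.insert("bob");
    s.insert("alice");
    s.insert("carol");
    return contentsOf(s);
}
```

Iterating over the first set visits the names in increasing order, and iterating over the second visits them in decreasing order.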
Actually, even this template header is somewhat oversimplified. The “real” header has yet another parameter:
template <class Key, class Compare=less<Key>,
class Allocator = allocator<Key> >
class set {
but the Allocator parameter is only used in very rare situations where the application needs specialized control over how memory for the set will be allocated. We won’t worry about that in this course.
2.2 Internal type names
// typedefs:
typedef Key key_type;
typedef Key value_type;
typedef Compare key_compare;
typedef Compare value_compare;
typedef const Key& reference;
typedef const Key& const_reference;
typedef ... const_iterator;
typedef const_iterator iterator;
Many standard containers provide a set of type names for programming convenience and to provide a “standard” way of declaring things without making direct reference to the data structures used to implement them. We’ve seen this with type names like vector::const_iterator and list::size_type.
The associative containers add a couple of useful type names.
- The name key_type gives the data type of the keys in the container.
- The name value_type gives the data type that describes what we insert into the container and what is returned whenever we dereference (apply operator* to) an iterator.
For set and multiset, the value_type is the same as the key_type, but that won’t be true when we get to map and multimap.
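These relationships can be checked at compile time. The sketch below assumes a C++11 compiler (for static_assert and <type_traits>); the typedef names NameSet and AgeMap and the function firstName are invented for the example.

```cpp
#include <map>
#include <set>
#include <string>
#include <type_traits>
#include <utility>

typedef std::set<std::string> NameSet;
typedef std::map<std::string, int> AgeMap;

// For a set, the key type and the value type are the same type.
static_assert(std::is_same<NameSet::key_type, std::string>::value,
              "set keys are strings");
static_assert(std::is_same<NameSet::value_type, NameSet::key_type>::value,
              "set values are just the keys");

// For a map, the value type is a pair of (const key, data).
static_assert(std::is_same<AgeMap::value_type,
                           std::pair<const std::string, int> >::value,
              "map values are key-data pairs");

// Dereferencing a set iterator yields a value_type (here, the smallest key).
// Assumes names is not empty.
NameSet::value_type firstName(const NameSet& names)
{
    return *names.begin();
}
```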
2.3 Constructors & Assignment
explicit set(const Compare& comp = Compare());
set(const set<Key, Compare>& x);
template <class InputIterator>
set(InputIterator first, InputIterator last,
const Compare& comp = Compare());
set<Key, Compare>& operator=(const set<Key, Compare>& x);
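Looking at the constructors for set, the most interesting things are
- The first constructor takes a single parameter, a comparator. But we already supply a comparator when we instantiate the class! It turns out that when we instantiate the class, e.g.,
typedef set<string, less<string> > MySetType;
we’re only telling the compiler the type of the comparator. The constructor parameter supplies the actual comparator object, which must be of that type.
We can use the default:
MySetType ascendingOrder; // uses < to compare strings
or, to get a different ordering, we can instantiate the template on a different comparator type:
set<string, greater<string> > descendingOrder; // uses > to compare strings
The constructor parameter becomes useful when the comparator type is a function pointer or a class with data members, because then different objects of the same comparator type can compare in different ways.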
Note that, because this constructor has a default value for its only parameter, we can use it, as we did above, as the “default constructor” for the class. (If you don’t remember what a default constructor is, go back to the ADTs lectures.)
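One case where the constructor argument matters is a comparator type that is a function pointer, so that the ordering is chosen at run time. This is only a sketch: the functions ascending, descending, and firstOf and the typedef names are hypothetical, introduced for this example.

```cpp
#include <set>
#include <string>

// Hypothetical comparison functions, introduced only for this sketch.
bool ascending(const std::string& a, const std::string& b)  { return a < b; }
bool descending(const std::string& a, const std::string& b) { return a > b; }

// The comparator TYPE is "pointer to function"; which function is used
// is decided by the object passed to the constructor.
typedef bool (*StringOrder)(const std::string&, const std::string&);
typedef std::set<std::string, StringOrder> OrderedNames;

// Builds a small set with the given ordering and returns its first element.
std::string firstOf(StringOrder order)
{
    // A default-constructed OrderedNames would hold a null function pointer,
    // so with this comparator type we must always pass a comparator.
    OrderedNames names(order);
    names.insert("bob");
    names.insert("alice");
    names.insert("carol");
    return *names.begin();
}
```

Both calls, firstOf(ascending) and firstOf(descending), use the same set type but produce differently ordered sets.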
The third constructor lets us build a new set from any range specified by a pair of iterators. For example,
vector<string> v;
⋮
multiset<string> ms (v.begin(), v.end());
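A slightly fuller sketch of the range constructor follows, showing that building a set from a range drops duplicates while building a multiset keeps them. The function names sampleWords, distinctCount, and totalCount are invented for the example, which assumes C++11 for the initializer list.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// A small unsorted sample containing one duplicate ("pear").
std::vector<std::string> sampleWords()
{
    return std::vector<std::string>{"pear", "apple", "pear", "fig"};
}

// Number of distinct strings in v: the set keeps one copy of each.
std::size_t distinctCount(const std::vector<std::string>& v)
{
    std::set<std::string> s(v.begin(), v.end());
    return s.size();
}

// Number of strings in v, duplicates included: the multiset keeps them all.
std::size_t totalCount(const std::vector<std::string>& v)
{
    std::multiset<std::string> ms(v.begin(), v.end());
    return ms.size();
}
```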
2.4 Status
bool empty() const { return (size() == 0); }
size_type size() const;
size_type max_size() const;
No surprises here …
2.5 Insert & Erase
// insert/erase
pair<iterator, bool> insert(const value_type& x);
iterator insert(iterator position, const value_type& x);
void clear();
void erase(iterator position);
size_type erase(const key_type& x);
void erase(iterator first, iterator last);
As described in your text,
s.insert("def");
adds a new element, “def”, into s.
- If something equal to “def” is already in the set, nothing happens.
- But in a multiset, we would go ahead and add a second copy of “def”.
The return type from this operation is a bit interesting. There are two things you might want to know after trying to add something to a set:
- Did it go in? (i.e., was this key not already in the set?), or
- Where is it in the set?
Rather than choose between these two equally valuable pieces of information, the insert operation actually returns both of them in a pair.
- (The std::pair template was one of our first examples of a class template.)
It returns a pair (p,b) where p is the position of the key in the set and b is a bool indicating whether an insertion occurred. If b is false, p is the position of the equal key that was already in the set. (For a multiset, the insertion always succeeds, so its insert returns only the position.)
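Here is a small sketch of examining that returned pair. The typedef NameSet and the function addName are names made up for this illustration.

```cpp
#include <set>
#include <string>
#include <utility>

typedef std::set<std::string> NameSet;

// Tries to add name to the set; reports whether it was newly inserted.
bool addName(NameSet& names, const std::string& name)
{
    std::pair<NameSet::iterator, bool> result = names.insert(name);
    // result.first  is the position of name in the set, whether it was
    //               just added or was already there;
    // result.second is true only if a new element was added.
    return result.second;
}
```

Calling addName twice with the same string returns true the first time and false the second, and the set's size grows only once.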
The second of the two insert functions is a bit odd:
s.insert(position, "def");
Since this is an associative container, it’s supposed to decide where to put things. Why then might we supply a position?
- The position is taken as a “hint” of where to begin searching for the position to insert.
- The set implementation may ignore the hint if a quick check shows that the new key really does not belong at that position.
- But the C++ standard says that we are guaranteed that this hint version of insert will be amortized O(1) if we have data that is already in order and (as of C++11) the hint is the position just after the previously inserted, next smaller key, e.g.:
vector<int> v;
⋮
sort (v.begin(), v.end(), less<int>());
// we know the elements of v are in order
set<int> mySet;
set<int>::iterator hint = mySet.begin();
for (int i = 0; i < v.size(); ++i)
{
hint = mySet.insert (hint, v[i]); // amortized O(1) inserts
++hint;
}
- Many generic algorithms expect an insert operation of this form. In particular, suppose we wanted to copy a list into a set. You may remember that we can’t do this:
list<Foo> fooList;
⋮
set<Foo> fooSet;
copy (fooList.begin(), fooList.end(), fooSet.begin());
Question: Why doesn’t this copy work?
When we faced this problem copying into vectors and lists, we got around it by using back_inserter:
list<Foo> fooList;
⋮
vector<Foo> fooVect;
copy (fooList.begin(), fooList.end(), back_inserter(fooVect));
back_inserter is a function that returns a special iterator that uses push_back whenever we try to store something at the iterator.
Now that won’t help for copying into sets, because sets don’t have a push_back function. But another iterator-returning function is inserter, which returns an iterator that uses insert(position,value) whenever we try to store to it. For example, we can copy into the middle of a vector this way:
list<Foo> fooList;
vector<Foo> fooVect;
⋮
copy (fooList.begin(), fooList.end(),
inserter(fooVect, fooVect.begin()+fooVect.size()/2));
And, to finally get to the point of all this, because set and multiset have a member function that looks like insert(position,value), we can use inserter with them as well:
list<Foo> fooList;
⋮
set<Foo> fooSet;
copy (fooList.begin(), fooList.end(),
inserter(fooSet, fooSet.end()));
If the list is unsorted, then the position supplied to inserter is only a hint. But if the list were sorted already, then this copy would get the amortized O(1) hint benefit, and the entire copy would be accomplished in O(fooList.size()) time.
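A complete version of this idea, using int in place of the unspecified Foo type, might look like the following sketch. The function name toSet is an assumption made for the example.

```cpp
#include <algorithm>
#include <iterator>
#include <list>
#include <set>

// Copies every element of a list into a new set, discarding duplicates.
std::set<int> toSet(const std::list<int>& items)
{
    std::set<int> result;
    // inserter(result, result.end()) yields an output iterator that calls
    // result.insert(position, value) each time copy stores through it.
    std::copy(items.begin(), items.end(),
              std::inserter(result, result.end()));
    return result;
}
```

Given a list holding 3, 1, 3, 2, the resulting set holds 1, 2, 3.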
2.6 Access
iterator find(const key_type& x) const;
size_type count(const key_type& x) const;
There are two ways to see if an element is present in a set:
set<string, less<string> >::iterator i;
i = s.find("abc");
We can search the set for "abc". If the key is found, find returns the position where that element resides. If not found, it returns s.end(). So it’s not unusual to see code that looks like:
if (mySet.find(yourName) != mySet.end())
cout << "I found " << yourName << " in my set." << endl;
Alternatively, we can count the number of times we find an element.
if (mySet.count(yourName) > 0)
{
cout << "I found " << yourName << " in my set." << endl;
cout << "I found it there " << mySet.count(yourName) << " times." << endl;
}
Of course, for sets, count( … ) will always return 0 or 1, because either the key isn’t in there at all, or there’s only one copy of it in the set. But for multiset, count may return any non-negative number.
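The two membership tests can be wrapped up as small functions, sketched below. The names contains and occurrences are invented for the example.

```cpp
#include <cstddef>
#include <set>
#include <string>

// Membership test via find: the key is present unless find returns end().
bool contains(const std::set<std::string>& s, const std::string& key)
{
    return s.find(key) != s.end();
}

// In a multiset, count reports how many copies of the key are stored.
std::size_t occurrences(const std::multiset<std::string>& ms,
                        const std::string& key)
{
    return ms.count(key);
}
```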
2.6.1 Searching for Ranges of Equal Items
iterator lower_bound(const key_type& x) const;
iterator upper_bound(const key_type& x) const;
pair<iterator, iterator> equal_range(const key_type& x) const;
Suppose that we’re interested, not so much in whether or not an element is present, but in where it might be.
We can, as we have seen, use find to get the position of an element in a set. But if we have a multiset, there may be many instances of the same key. How do we know which instance will be pointed out by find?
The functions shown here allow us to get the positions of all keys equal to some specified value. lower_bound returns the position of the first element in the set/multiset that is not less than the indicated value, which is the first element equal to it if any such element exists. upper_bound returns the position just after the last element equal to a given value. equal_range returns a pair consisting of both the lower_bound and the upper_bound. If no element has that value, these positions coincide and the range is empty.
For example, a library system, having many copies of the same book, might keep track of its books by ISBN number:
class Book {
public:
⋮
string author() const;
string title() const;
string isbn() const;
int copyNumber() const;
bool isCheckedOut() const;
};
bool operator < (const Book& left, const Book& right)
{
return left.isbn() < right.isbn();
}
Then, given a multiset of books:
typedef multiset<Book, less<Book> > Holdings;
Holdings library;
we could list all the copies of a given book that are checked out this way:
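// Note: passing a string to equal_range relies on an implicit conversion from
// string to Book, assumed here to be among the elided members of Book.
const string textISBN = "0-201-30879-7";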
pair<Holdings::iterator, Holdings::iterator> rng
= library.equal_range (textISBN);
for (Holdings::iterator i = rng.first; i != rng.second; ++i)
{
if (i->isCheckedOut())
cout << "Copy " << i->copyNumber()
<< " is checked out." << endl;
}
Both set and multiset support these operations, though equal_range is less useful for sets, where the range can hold at most one element.
By the way, this is a good example of a place where the new C++11 auto type declaration can simplify the code:
const string textISBN = "0-201-30879-7";
auto rng = library.equal_range (textISBN);
for (auto i = rng.first; i != rng.second; ++i)
{
if (i->isCheckedOut())
cout << "Copy " << i->copyNumber()
<< " is checked out." << endl;
}
The first auto matches the pair type and the second matches the iterator type. (A range-based for loop cannot be applied directly to rng, because a std::pair of iterators does not provide begin() and end().)
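For experimenting with equal_range, here is a self-contained sketch. The struct Copy is a hypothetical stand-in for the Book class (whose full declaration is not shown), and checkedOutCopies is a name invented for the example.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for Book: one physical copy of a title.
struct Copy {
    std::string isbn;
    int copyNumber;
    bool checkedOut;
};

// Order copies by ISBN only, so all copies of one title are equivalent keys.
bool operator<(const Copy& left, const Copy& right)
{
    return left.isbn < right.isbn;
}

typedef std::multiset<Copy> Holdings;

// Returns the copy numbers of all checked-out copies with the given ISBN.
std::vector<int> checkedOutCopies(const Holdings& library,
                                  const std::string& isbn)
{
    Copy probe = {isbn, 0, false};   // only the isbn field affects the search
    std::vector<int> result;
    auto rng = library.equal_range(probe);
    for (auto i = rng.first; i != rng.second; ++i)
        if (i->checkedOut)
            result.push_back(i->copyNumber);
    return result;
}
```

Building an explicit probe object avoids depending on any conversion from string to the key type.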
3 A Simple Example of Using Set
As part of a program to be presented in a later example, we need to read an English language document in plain text form and collect all the words that are used to begin sentences. For example, if we read
Hello! How are you? I haven't seen you in a long time. How is
your family?
we would want a set with the words “Hello!”, “How”, and “I”. (This particular application will need us to preserve all upper/lower case distinctions and to treat punctuation as part of the preceding word.)
This is a pretty straightforward application for set, and here is the code to collect the sentence-starting words:
void readDocument (const char* docFileName,
set<string>& startingWords,
...)
{
ifstream docIn (docFileName);
char lastChar = '.';
⋮
string word;
while (docIn >> word)
{
if (lastChar == '.' || lastChar == '?'
|| lastChar == '!') {
startingWords.insert(word);
}
lastChar = word[word.length()-1];
⋮
}
}
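To try the word-collecting loop without a file, the same logic can be run on a string. This is a sketch only: the function name sentenceStarters is invented here, and reading from an istringstream replaces the ifstream used above.

```cpp
#include <set>
#include <sstream>
#include <string>

// Collects every word that begins a sentence in text. Punctuation stays
// attached to the preceding word, and case is preserved.
std::set<std::string> sentenceStarters(const std::string& text)
{
    std::istringstream in(text);
    std::set<std::string> startingWords;
    char lastChar = '.';   // treat the start of the text as a sentence start
    std::string word;
    while (in >> word)
    {
        if (lastChar == '.' || lastChar == '?' || lastChar == '!')
            startingWords.insert(word);   // repeats are ignored by the set
        lastChar = word[word.length() - 1];
    }
    return startingWords;
}
```

Applied to the sample text above, the result contains exactly “Hello!”, “How”, and “I”, with the second “How” absorbed as a duplicate.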
4 Sets versus Sequences
Most of our data structures to this point have been devoted to storing sequences of data. Arrays, std::arrays, vectors, lists and deques are all variations of sequences. Some of these sequences have been ordered and some have been unordered. But the defining characteristic of sequences is that we access data by position within the sequence.
So, we might ask, when should we use a set and when should we use a sequence?
The important characteristics of sets to think about are:
- Sets do not allow duplicate elements (though multisets do).
- Sets allow fast random access to data. (By “random access”, I mean that we can access the data in any order, in a time that does not depend on where in the container that data happens to be stored.)
- Sets allow fast random insertions and removal of data.
- The ordered sets (which we are currently considering) allow easy access to the data in sorted order.
Now, we can get some of these with sequences, but not all. If we store data in an ordered, non-list sequence, we can use binary search to get quick random access to data. But insertion and removal are still O(seq.size()). Sets will let us do better.
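The trade-off can be seen by writing the same “add if absent” operation both ways. In this sketch, the function names addToSortedVector and addToSet are invented for the illustration, and the complexity comments assume the usual balanced-tree implementation of std::set.

```cpp
#include <algorithm>
#include <set>
#include <vector>

// Adds x to a sorted vector unless it is already there.
bool addToSortedVector(std::vector<int>& seq, int x)
{
    // Binary search for the position: O(log n).
    std::vector<int>::iterator pos =
        std::lower_bound(seq.begin(), seq.end(), x);
    if (pos != seq.end() && *pos == x)
        return false;       // already present
    seq.insert(pos, x);     // O(n): every later element must shift over
    return true;
}

// The same operation on a set: search and insertion are both O(log n).
bool addToSet(std::set<int>& s, int x)
{
    return s.insert(x).second;
}
```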
Let’s look at an example of replacing a sequence with a set. In our book publishing world example, we had a class Publisher:
/*
* publisher.h
*
* Created on: May 23, 2018
* Author: zeil
*/
#ifndef PUBLISHER_H_
#define PUBLISHER_H_
#include <string>
#include <vector>
#include "author.h"
#include "book.h"
class Publisher
{
public:
typedef std::vector<Author>::iterator author_iterator;
typedef std::vector<Author>::const_iterator const_author_iterator;
typedef std::vector<Book>::iterator book_iterator;
typedef std::vector<Book>::const_iterator const_book_iterator;
Publisher (std::string theName = std::string());
std::string getName() const {return name;}
void setName (std::string theName) {name = theName;}
int numberOfBooks() const;
book_iterator begin() {return books.begin();}
const_book_iterator begin() const {return books.begin();}
book_iterator end() {return books.end();}
const_book_iterator end() const {return books.end();}
void addBook (Book& b);
int numberOfAuthors() const;
author_iterator begin_authors() {return authors.begin();}
author_iterator end_authors() {return authors.end();}
const_author_iterator begin_authors() const {return authors.begin();}
const_author_iterator end_authors() const {return authors.end();}
author_iterator getAuthor(std::string name);
const_author_iterator getAuthor(std::string name) const;
void addAuthor (const Author& au);
bool operator== (const Publisher& right) const;
bool operator< (const Publisher& right) const;
private:
std::string name;
std::vector<Book> books;
std::vector<Author> authors;
};
std::ostream& operator<< (std::ostream& out, const Publisher& publ);
#endif /* PUBLISHER_H_ */
We will focus for now on the functions that provide the ability to add and retrieve books for a publisher: addBook and the begin() and end() book iterators.
Let’s look, for a moment, at the code to add a book:
void Publisher::addBook (Book& b)
{
auto pos = find(books.begin(), books.end(), b); // ➀
if (pos == books.end()) // ➁
{
b.setPublisher(*this);
books.push_back(b); // ➂
}
}
- ➀ First we check to see if the book is already in the vector used to hold the publisher’s catalog.
- Only if it is not (➁) do we then add the book (➂) to the vector.
Clearly, we are going to some trouble to make sure that no book gets added twice to the same publisher. But avoiding duplicates is one of the hallmarks of set-like behavior. So we might consider replacing the vector with a set:
class Publisher
{
public:
typedef std::vector<Author>::iterator author_iterator;
typedef std::vector<Author>::const_iterator const_author_iterator;
typedef std::set<Book>::iterator book_iterator;
typedef std::set<Book>::const_iterator const_book_iterator;
⋮
private:
std::string name;
std::set<Book> books;
std::vector<Author> authors;
};
Now, this requires that the Book class provide a less-than operator, but you might recall that operator< is one of the functions that I had recommended that nearly every class should provide.
This change allows us to simplify the addBook function:
void Publisher::addBook (Book& b)
{
b.setPublisher(*this);
books.insert(b);
}
and this is faster as well. The original, vector-based addBook was O(numberOfBooks()). But because std::set is implemented via balanced binary search trees, the new version is actually O(log(numberOfBooks())).
5 Implementing std::set with Binary Search Trees
We can use our earlier binary search tree with iterators to implement a set. We’ll need two main data structures: the search tree and an integer counter to count the number of elements in the tree. For the iterator types, we will use the same iterators already provided by the binary search tree.
template <typename Key, typename Compare=less<Key> >
class Set {
private:
Compare comparator;
public:
// typedefs:
typedef Key key_type;
typedef Key value_type;
typedef Compare key_compare;
typedef Compare value_compare;
typedef const Key& reference;
typedef const Key& const_reference;
typedef typename BinarySearchTree<Key>::const_iterator const_iterator;
typedef const_iterator iterator;
typedef unsigned int size_type;
typedef ptrdiff_t difference_type;
⋮
private:
size_type treeSize;
BinarySearchTree<key_type> bst;
};
5.1 Inserting Data
When inserting into a set, we basically insert into the tree, but increment our counter if the data gets inserted:
// insert/erase
template <typename Key, typename Compare>
std::pair<typename Set<Key,Compare>::iterator, bool>
Set<Key,Compare>::insert(const Set<Key,Compare>::value_type& x)
{
const_iterator it = bst.insert(x);
if (it == end())
return std::make_pair(it, false);
else
{
++treeSize;
return std::make_pair(it, true);
}
}
This takes advantage of a search tree insert function that returns the location where a data value was inserted, or end() if the data was not inserted (because it is a duplicate of a data value already in the tree, and we don’t store duplicates in a set).
Erasing data is similar, but we would decrement the treeSize counter if we successfully remove anything.
5.2 Searching and Iterating
Searching the tree is even simpler:
template <typename Key, typename Compare>
inline
typename Set<Key,Compare>::iterator Set<Key,Compare>::find(const Set<Key,Compare>::key_type& x) const
{
return bst.find(x);
}
as are the functions to provide the beginning and ending iterators:
template <typename Key, typename Compare>
inline
typename Set<Key,Compare>::iterator Set<Key,Compare>::begin() const
{
return bst.begin();
}
template <typename Key, typename Compare>
inline
typename Set<Key,Compare>::iterator Set<Key,Compare>::end() const
{
return bst.end();
}
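Because the BinarySearchTree class itself is not shown here, the following self-contained sketch may help in seeing the whole idea at once. It is not the course implementation: TinySet is a hypothetical class holding ints in an unbalanced tree, with no iterators and no erase, and it needs C++11 for nullptr and = delete. It does show the same pairing of a tree with a treeSize counter.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical minimal set of ints built on an unbalanced binary search tree.
class TinySet {
    struct Node {
        int key;
        Node* left;
        Node* right;
        explicit Node(int k) : key(k), left(nullptr), right(nullptr) {}
    };
    Node* root;
    std::size_t treeSize;

    static void destroy(Node* n)
    {
        if (n == nullptr) return;
        destroy(n->left);
        destroy(n->right);
        delete n;
    }
    static void inorder(const Node* n, std::vector<int>& out)
    {
        if (n == nullptr) return;
        inorder(n->left, out);
        out.push_back(n->key);
        inorder(n->right, out);
    }
public:
    TinySet() : root(nullptr), treeSize(0) {}
    ~TinySet() { destroy(root); }
    TinySet(const TinySet&) = delete;             // copying omitted in this sketch
    TinySet& operator=(const TinySet&) = delete;

    // Returns true if x was added, false if an equal key was already present.
    bool insert(int x)
    {
        Node** link = &root;
        while (*link != nullptr) {
            if (x < (*link)->key)      link = &(*link)->left;
            else if ((*link)->key < x) link = &(*link)->right;
            else return false;         // duplicate: tree and counter unchanged
        }
        *link = new Node(x);
        ++treeSize;                    // count only successful insertions
        return true;
    }
    bool contains(int x) const
    {
        const Node* n = root;
        while (n != nullptr) {
            if (x < n->key)      n = n->left;
            else if (n->key < x) n = n->right;
            else return true;
        }
        return false;
    }
    std::size_t size() const { return treeSize; }

    // Lists the keys in sorted order via an in-order traversal.
    std::vector<int> sorted() const
    {
        std::vector<int> out;
        inorder(root, out);
        return out;
    }
};
```

Since the tree is unbalanced, this sketch gives O(log n) behavior only on average; a balanced tree is needed for the worst-case guarantee.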
5.3 Copying
Finally, we note that our binary search tree already implemented its own versions of the Big 3, and none of the remaining data members in our set class are pointers, so we can rely on the compiler-generated versions of the Big 3 for our set.
The full version of the set implementation is available here.