Case Study: Schemes to Improve the Average Case of Sequential Search


1 A Rough Mock-up of a Spell Checker

We have previously looked at a sequential search operation, specifically the std::find operation.

template <typename Iterator, typename T>
Iterator find (Iterator start, Iterator stop, T key)
{
    while (start != stop && !(*start == key))
        ++start;
    return start;
}

This function performs O(distance(start,stop)) comparisons in the worst case.

Let’s consider this function within a scenario of spell checking a book by loading $N$ words from a dictionary file and then performing $M$ searches for words obtained from the book:

void checkWords (int N, int M)
{
    list<string> dictionary;
    ifstream dictIn ("dictionary.txt");
    for (int i = 0; i < N; ++i)
    {
        string word;
        dictIn >> word;
        dictionary.push_back(word);
    }
    dictIn.close();

    ifstream bookIn ("book.txt");
    for (int i = 0; i < M; ++i)
    {
        string word;
        bookIn >> word;
        auto pos = find(dictionary.begin(), dictionary.end(), word);
        if (pos == dictionary.end())
            cout << word << " appears to be misspelled." << endl;
    }
    bookIn.close();
}

In the worst case, that find call in the second loop is $O(N)$, making the second loop, and the entire function, $O(N*M)$.

Question: Let’s assume that all of the words in the book are correctly spelled, so that each occurs somewhere in the dictionary, and that the words of the dictionary occur in random order.

What is the average case complexity of this function?

Answer

2 Trying to Improve the Sequential Search

2.1 Self-Organizing Lists

Now, in the real world, none of us are likely to know all of the words in a good spell-check dictionary. And even if we knew them, in the sense of recognizing what they mean when we read them, a typical person uses far fewer words in their speech and writing than they actually understand. For example, I know the words “prolixity” and “Satyricon”, but I doubt that I have ever used either in a sentence. And then, of course, there’s subject matter. Words like “flour” and “sugar” may occur commonly if we are spell checking a cookbook, but rarely if we are spell checking a programming textbook.

So one strategy that algorithm designers have played with is to try to move the words most likely to occur in the book closer to the beginning of the list. One way to do this is called a self-organizing list: when we search for and find a word, we move it to the front of the list, so that if we see it again, we find it very quickly:

void checkWords (int N, int M)
{
    list<string> dictionary;
    ifstream dictIn ("dictionary.txt");
    for (int i = 0; i < N; ++i)
    {
        string word;
        dictIn >> word;
        dictionary.push_back(word);
    }
    dictIn.close();

    ifstream bookIn ("book.txt");
    for (int i = 0; i < M; ++i)
    {
        string word;
        bookIn >> word;
        auto pos = find(dictionary.begin(), dictionary.end(), word);
        if (pos == dictionary.end())
            cout << word << " appears to be misspelled." << endl;
        else
        {
            dictionary.erase(pos);       // O(1)
            dictionary.push_front(word); // O(1)
        }
    }
    bookIn.close();
}

Of course, that word won’t stay right at the front as we continue. How far it moves back as we encounter other words depends on the vocabulary of the book’s author.

A common variation of the self-organizing list does not move the just-searched node all the way to the front of the list, but simply moves it forward one position instead:

void checkWords (int N, int M)
{
    list<string> dictionary;
    ifstream dictIn ("dictionary.txt");
    for (int i = 0; i < N; ++i)
    {
        string word;
        dictIn >> word;
        dictionary.push_back(word);
    }
    dictIn.close();

    ifstream bookIn ("book.txt");
    for (int i = 0; i < M; ++i)
    {
        string word;
        bookIn >> word;
        auto pos = find(dictionary.begin(), dictionary.end(), word);
        if (pos == dictionary.end())
            cout << word << " appears to be misspelled." << endl;
        else if (pos != dictionary.begin())
        {
            auto prev = pos;    // O(1)
            --prev;             // O(1)
            swap (*prev, *pos); // O(1)
        }
    }
    bookIn.close();
}

This is a bit more “stable” than the first version.

Over time, both versions should lead to a list becoming an approximation of an “ideal ordering” in which the most commonly searched word will be at the front of the list, followed by the second-most commonly searched word, followed by the third-most commonly searched word, and so on.
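
To get a feel for how quickly that settling happens, here is a small simulation sketch. It is not part of the spell checker itself; the artificial word list, the skewed lookup weights, and all of the constants are invented purely for the demonstration. It drives a move-to-front list with repeated lookups and reports the average search depth over successive blocks of searches, which should drift downward as the list organizes itself.

#include <algorithm>
#include <iostream>
#include <list>
#include <random>
#include <string>
#include <vector>
using namespace std;

// Sketch only: measure how deep into a move-to-front list we must search,
// on average, when lookups are drawn from a skewed distribution in which
// word i is requested with weight proportional to 1/(i+2).
int main()
{
    const int N = 1000;           // dictionary size (arbitrary)
    const int searches = 50000;   // number of simulated lookups
    const int blockSize = 10000;  // report the average depth per block

    // Artificial "words" w0, w1, ..., placed in the list in random order.
    vector<string> words;
    for (int i = 0; i < N; ++i)
        words.push_back("w" + to_string(i));
    mt19937 rng(42);
    shuffle(words.begin(), words.end(), rng);
    list<string> dictionary(words.begin(), words.end());

    // Skewed popularity: word i is searched for with weight 1/(i+2).
    vector<double> weights;
    for (int i = 0; i < N; ++i)
        weights.push_back(1.0 / (i + 2));
    discrete_distribution<int> pick(weights.begin(), weights.end());

    long long blockDepth = 0;
    for (int s = 1; s <= searches; ++s)
    {
        string key = "w" + to_string(pick(rng));
        int depth = 0;
        auto pos = dictionary.begin();
        while (pos != dictionary.end() && *pos != key)
        {
            ++pos;
            ++depth;
        }
        blockDepth += depth;
        if (pos != dictionary.end())
        {
            dictionary.erase(pos);       // O(1)
            dictionary.push_front(key);  // O(1)
        }
        if (s % blockSize == 0)
        {
            cout << "searches " << (s - blockSize + 1) << "-" << s
                 << ": average depth "
                 << static_cast<double>(blockDepth) / blockSize << endl;
            blockDepth = 0;
        }
    }
    return 0;
}

The later blocks should average noticeably fewer comparisons than the roughly $N/2$ a static, randomly ordered list would require.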

2.2 Ideally Ordered Lists

It would be hard to analyze the behavior of a self-organizing list until we have used it enough for it to settle down into an approximation of ideal ordering. But what about after that? Is the ideal ordering significant enough to actually reduce the complexity of the spell checker?

Let’s suppose that we are spellchecking the work of a new English speaker who only uses the one hundred most common words in the English language.

Question: What is the average case complexity of this function if all of the words in the book are among the 100 most common?

Answer

Now let’s suppose that our new English speaker has been growing his or her vocabulary.

Question: Suppose that 90% of the author’s words are among the 100 most common words in English, with the remaining 10% being words appearing anywhere in the list. What is the average complexity of checkWords?

Answer

Neither of these scenarios is particularly realistic. But what we have done here is actually closely related to the practice of caching, in which recently accessed data is kept in a handy, fast store in order to speed up anticipated repeat requests for the same data. Your web browser caches pages that you have visited recently. Your operating system caches pages of virtual memory. Your CPU caches blocks of memory values that have been fetched from RAM.

Still, let’s try for a more realistic scenario. There is some evidence that the overall pattern of English word frequency in written works follows a Zipf distribution, where the probability that the $k^{th}$ most commonly used word in English (counting from $k = 0$) will be the next one seen in a document is $\frac{c}{k+2}$. That is, the most commonly used word has probability $c/2$, the next most commonly used word has probability $c/3$, and so on.

The constant $c$ can be determined by remembering that the sum of all the probabilities must be 1.0, so we can say that, for a dictionary of $N$ words,

\[ \sum_{i=0}^{N-1} \frac{c}{i+2} = 1 \]

\[ c \sum_{i=0}^{N-1} \frac{1}{i+2} = 1 \]

For large $n$, the sum $\sum_{i=1}^{n} \frac{1}{i}$ is approximately equal to $\log_e(n)$.[1] So $\sum_{i=0}^{N-1} \frac{1}{i+2} = \sum_{j=2}^{N+1} \frac{1}{j}$, which is approximately $\log_e(N+1) - 1$. So

\[ c (\log_e(N+1) - 1) = 1 \]

\[ c = \frac{1}{\log_e(N+1) - 1} \]
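
As a quick sanity check, here is a small sketch (the value of $N$ is arbitrary, chosen only for illustration) that computes $c$ from this formula and confirms that the assumed probabilities $c/(k+2)$ do sum to approximately 1:

#include <cmath>
#include <iostream>
using namespace std;

// Sketch: verify numerically that c = 1/(ln(N+1) - 1) makes the
// probabilities c/(k+2), k = 0..N-1, sum to roughly 1.
int main()
{
    const int N = 100000;                   // dictionary size (arbitrary)
    double c = 1.0 / (log(N + 1.0) - 1.0);

    double total = 0.0;
    for (int k = 0; k < N; ++k)
        total += c / (k + 2);

    cout << "c = " << c << ", sum of probabilities = " << total << endl;
    // The sum comes out close to (but not exactly) 1, because ln(N+1) - 1
    // only approximates the underlying harmonic sum.
    return 0;
}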

Now, returning to our code, the two loops within the checkWords function are fixed to repeat $N$ and $M$ times, respectively.

However, each of the $M$ times around the second loop, the complexity of the find call will depend on the average position of the words we are searching for. We compute that average as the expected value of the position $k$ of the word:

\[ E(k) = \sum_{k=0}^{N-1} k p_k \]

where $p_k = c/(k+2)$.

\[ E(k) = \sum_{k=0}^{N-1} \frac{c*k}{k+2} = c \sum_{k=0}^{N-1} \frac{k}{k+2} \]

Now look at that fraction $\frac{k}{k+2}$. Every value of this fraction is a number less than $1$, and as $k$ gets larger, $\frac{k}{k+2}$ gets closer and closer to $1$. So we can say that $E(k)$ is approximately

\[ E(k) = c \sum_{k=0}^{N-1} 1 = c*N = \frac{N}{\log_e(N+1)-1}\]

The second loop is therefore $O\left(\frac{M*N}{\log{N}} \right)$.

That makes the average complexity of the entire function $O\left(N + \frac{M*N}{\log{N}} \right)$.

That’s definitely better than a straight sequential search, but perhaps not so dramatically better as we might hope.
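
To put some rough numbers on that, here is another sketch (again with an arbitrary $N$) comparing the exact expected search position under this distribution with the $c*N$ approximation, and with the roughly $N/2$ comparisons a randomly ordered list would average:

#include <cmath>
#include <iostream>
using namespace std;

// Sketch: compare the exact expected search position under the assumed
// Zipf-like distribution with the c*N approximation derived above.
int main()
{
    const int N = 100000;                   // dictionary size (arbitrary)
    double c = 1.0 / (log(N + 1.0) - 1.0);  // c = 1/(ln(N+1) - 1)

    // Exact expected position: E(k) = c * sum_{k=0}^{N-1} k/(k+2)
    double exact = 0.0;
    for (int k = 0; k < N; ++k)
        exact += c * static_cast<double>(k) / (k + 2);

    double approx = c * N;                  // the E(k) = c*N approximation

    cout << "exact E(k)      = " << exact << endl;
    cout << "approximate c*N = " << approx << endl;
    cout << "random ordering = " << N / 2.0 << endl;
    return 0;
}

For this $N$, the ideal ordering cuts the average search position to roughly a fifth of the $N/2$ a randomly ordered list would average: definitely better, but not dramatically so.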

3 Better than Binary?

Of course, all of the above has been predicated upon the idea that we wanted to work with sequential search in the first place. We know that a binary search would be faster than sequential search in most cases. However, to use binary search, we would need to know that the input is already in sorted order, or we would have to sort it.

Is the extra cost of sorting the data worth it?

Suppose we change our code to

void checkWords (int N, int M)
{
    vector<string> dictionary;
    ifstream dictIn ("dictionary.txt");
    for (int i = 0; i < N; ++i)
    {
        string word;
        dictIn >> word;
        dictionary.push_back(word);
    }
    dictIn.close();
    sort(dictionary.begin(), dictionary.end());

    ifstream bookIn ("book.txt");
    for (int i = 0; i < M; ++i)
    {
        string word;
        bookIn >> word;
        auto pos = lower_bound(dictionary.begin(), dictionary.end(), word);
        if (pos == dictionary.end() || *pos != word)
            cout << word << " appears to be misspelled." << endl;
    }
    bookIn.close();
}

Question: What is the worst-case (and average case) complexity of checkWords? (The std::sort function is $O(n \log(n))$, where $n$ = distance(start,stop).)

Answer

How does $O((M+N) \log(N))$ for the sort + binary search compare against $O\left(N + \frac{M*N}{\log{N}} \right)$ for the self-organized list?

For $M \le N$, the dominant term of the sort-and-binary-search cost is $N \log{N}$, while the self-organizing list's cost is dominated by the larger of $N$ and $M*N/\log{N}$. The self-organizing list comes out ahead only when $M*N/\log{N}$ is smaller than $N \log{N}$, that is, roughly when $M < \log^2{N}$.

For larger values of $M$, the sorted approach is better.
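
As a back-of-the-envelope illustration (a sketch only; the dictionary size and the values of $M$ are made up), we can plug a few values into the two cost estimates and watch the crossover appear near $M \approx \log^2{N}$:

#include <cmath>
#include <iostream>
using namespace std;

// Sketch: compare the two cost estimates, (M+N)*log(N) for sorting plus
// binary searches versus N + M*N/log(N) for the ideally ordered list.
int main()
{
    const double N = 100000.0;                // dictionary size (illustrative)
    cout << "log^2(N) is about " << log(N) * log(N) << endl;

    double Ms[] = {10, 100, 1000, 100000};
    for (double M : Ms)
    {
        double sorted  = (M + N) * log(N);    // sort + binary searches
        double selfOrg = N + M * N / log(N);  // ideally ordered sequential search
        cout << "M = " << M
             << ":  sorted ~ " << sorted
             << ",  self-organizing ~ " << selfOrg << endl;
    }
    return 0;
}

With these illustrative numbers, only the smallest values of $M$ (those below $\log^2{N}$, which is a bit over 130 here) favor the self-organizing list; once $M$ reaches even a modest fraction of $N$, the sorted approach wins comfortably.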

So it is possible for a self-organizing list to be faster than the combination of sorting and binary search, but only for rather particular workloads (a skewed distribution of search keys and a relatively small number of searches), and even then it won't be a whole lot faster (only a factor of $\log N$). Self-organizing lists therefore aren't seen in practice very often, however much they might seem like a good idea.


[1]: You might remember from calculus that $\frac{d \log_e{x}}{dx} = 1/x$ or, alternatively, that $\int_1^y x^{-1} dx = \log_e{y}$ (leading to one of the world’s worst math jokes: what is the integral of “d cabin over cabin”? Ans: “log cabin”).

The sum $\sum_{i=1}^{n} \frac{1}{i}$ can be seen as a discrete approximation of $\int_1^n x^{-1} dx$ and, for large enough $n$, the approximation is pretty good.