Analysis of Algorithms: Motivation
Steven J. Zeil
An important theme throughout this semester will be viewing the process of developing software as an engineering process. Engineers in traditional engineering disciplines, civil engineers, electrical engineers, and the like, face trade-offs in developing a new product: trade-offs in cost, performance, and quality. One of the jobs of an engineer is to look at competing proposals for the design of a new product and to estimate, ahead of time, the cost, the speed, the strength, the quality, etc., of the products that would result from those competing designs.
Software developers face the same kinds of choices. Early on, you may have several alternative designs and need to make a decision about which of those designs to actually pursue. It’s no good waiting until the program has already been implemented and written down in code. By then you’ve already committed to one design and invested significant resources into it.
How do we make these kinds of choices? In this course, we’ll be looking at mathematical techniques for analyzing algorithms to determine what their speed will be. It will be important that we do this both with real algorithms already written into code and with proposed algorithms that have been given a much sketchier description, probably written in “pseudocode”.
1 Case study: A Spell Checker
Suppose that we work for a company that produces word processors and other text-manipulation programs. The company has decided to add an automatic spell-check feature to the product line. Our designers have considered the process of checking a document for spelling errors (i.e., finding any words not in a "dictionary" of known words). They have proposed two different algorithms for finding the set of misspelled words within a target file.
1.1 Version 1: Check every word from the document
collectMisspelledWords (/* inputs */ targetFile, dictionaryFile,
                        /* outputs */ misspellings)
{
    read dictionaryFile into dictionary;
    open targetFile;
    misspellings = empty;
    while more words in targetFile {
        read word w from targetFile;
        if w is not in dictionary {
            add w to misspellings;
        }
    }
    close targetFile;
}
In the first alternative, we read words, one at a time, from the target file. Each word that is not in the dictionary gets added to the set of misspellings.
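To make Version 1 concrete, here is a minimal C++ sketch. It is an illustrative reading of the pseudocode, not the designers' actual code: it assumes whitespace-delimited, already-normalized words, and it uses a std::set as the in-memory dictionary (the pseudocode leaves the dictionary data structure, and hence the cost of "is not in dictionary", unspecified).

#include <fstream>
#include <set>
#include <string>

std::set<std::string> collectMisspelledWords(const std::string& targetFileName,
                                             const std::string& dictionaryFileName)
{
    // Read the dictionary file into a searchable in-memory structure.
    std::set<std::string> dictionary;
    std::ifstream dictionaryFile(dictionaryFileName);
    std::string w;
    while (dictionaryFile >> w)
        dictionary.insert(w);

    // Check every word from the document against the dictionary.
    std::set<std::string> misspellings;
    std::ifstream targetFile(targetFileName);
    while (targetFile >> w)
        if (dictionary.count(w) == 0)
            misspellings.insert(w);
    return misspellings;
}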
Some of the designers, however, have objected that the first algorithm will waste time by repeatedly looking up common words like “a”, “the”, etc., in the dictionary.
They suggest an alternative algorithm.
1.2 Version 2: Build a Concordance
collectMisspelledWords (/* inputs */ targetFile, dictionaryFile,
                        /* outputs */ misspellings)
{
    concordance = empty;
    open targetFile;
    while more words in targetFile {
        read word w from targetFile;
        add w to concordance;
    }
    close targetFile;
    misspellings = empty;
    open dictionaryFile;
    for each word w in the concordance {
        while (last word read from dictionaryFile < w) {
            read another word from dictionaryFile;
        }
        if (w != last word read from dictionaryFile) {
            add all occurrences of w to misspellings;
        }
    }
    close dictionaryFile;
}
This works by first collecting all words from the document to form a concordance, an index of all the words taken from a document together with the locations where they were found. Then each word is checked just once against the dictionary, no matter how many times that word actually occurs within the target document.
The check of each word is also faster. Because the dictionary and the concordance will (presumably) be sorted, we can compare them in a single pass through both sets of words.
For example, the concordance words for the paragraph:
This works by first collecting all words from the document
to form a concordance, an index of all the words taken from a
document together with the locations where they were found.
Then each word is checked just once against the dictionary,
no matter how many times that word actually occurs within the
target document.
would be:
a actually against all an by checked collecting concordance dictionary document each first form found from how index is just locations many matter no occurs of once taken target that the then they this times to together were where with within word words works
So we can check the concordance against the dictionary in a single pass through both.
- This does not work with Version 1, because the words in the original document do not come in sorted order. There, we would need to search each document word against the entire dictionary.
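Here, likewise, is a C++ sketch of Version 2 under the same assumptions (whitespace-delimited, normalized words, plus a dictionary file whose words are already sorted). A std::map keeps the concordance in sorted order, mapping each distinct word to the positions where it occurs, so the second loop is the single coordinated pass described above.

#include <fstream>
#include <map>
#include <string>
#include <vector>

std::map<std::string, std::vector<long>>
collectMisspelledWords(const std::string& targetFileName,
                       const std::string& dictionaryFileName)
{
    // Pass 1: build the concordance. std::map keeps the words sorted.
    std::map<std::string, std::vector<long>> concordance;
    std::ifstream targetFile(targetFileName);
    std::string w;
    long position = 0;
    while (targetFile >> w)
        concordance[w].push_back(position++);

    // Pass 2: walk the sorted dictionary and the sorted concordance together.
    std::map<std::string, std::vector<long>> misspellings;
    std::ifstream dictionaryFile(dictionaryFileName);
    std::string dictWord;
    bool haveDictWord = static_cast<bool>(dictionaryFile >> dictWord);
    for (const auto& entry : concordance) {
        // Advance through the dictionary until we reach or pass this word.
        while (haveDictWord && dictWord < entry.first)
            haveDictWord = static_cast<bool>(dictionaryFile >> dictWord);
        if (!haveDictWord || dictWord != entry.first)
            misspellings.insert(entry);   // all occurrences of w at once
    }
    return misspellings;
}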
2 Comparison of Spellcheck Solutions
So, which of these algorithms is likely to run faster overall?
We can make plausible arguments in either direction:
- Searching the dictionary for "random" words in solution 1 may cause many words to be examined multiple times.
- In solution 2, each word is checked against the dictionary at most once.
- Solution 2 has the extra cost of building the concordance, but the concordance is probably much smaller than the full document.

Overall, it's not obvious which is faster. Thinking more deeply about the question, we might ask:
- Does the choice of the faster solution depend upon the relative sizes of the dictionary and the document?
- Or upon the size of the concordance?
- Or upon the number of misspelled words?
In the lessons that follow, we will develop the mathematical tools for answering these kinds of questions.
3 Why Not Just Time the Programs?
So, why the fuss? Why don’t we simply sit down with a stopwatch, run both programs on some test data, time them, and adopt the one that runs faster?
Of course, if we’re still at the design level, we can’t time the programs because we haven’t written them yet. But even if we actually had the code for both programs in hand, a simple timing experiment might yield different results depending on who ran it and how.
Why should there be such a big difference? Well, it turns out that the results we get from timing experiments like that will vary considerably because of
- differences in code quality:

  Different programmers typically produce code that runs at very different speeds. That's hardly surprising: different programmers have different skill levels. Interestingly enough, even if the same programmer were to code both algorithms, that programmer would probably wind up producing more efficient code for whichever algorithm was more familiar to him or her.

- differences in the machines on which we are running the programs:

  It is obvious that if you take an algorithm and run it on, say, a 33 MHz machine and then run it on a 350 MHz machine, it's going to run faster on the second machine.

  What may be less obvious is that if you take two different processors rated at roughly the same speed and run both algorithms on them, you may find that one algorithm runs faster on one processor and the other runs faster on the other processor. Why? Because even processors rated at roughly the same speed will differ in speed when we get down to the details. One may be faster at addition and the other at multiplication, so the actual times you measure will depend upon how many addition instructions versus how many multiplication instructions each algorithm executes.

- differences in the compiler and compiler settings that we are using:

  For similar reasons, you see different possibilities when you switch compilers. It is entirely possible that one algorithm would behave better under compiler A and the other under compiler B. The reason: different compilers take different approaches to compiling certain kinds of instructions. A simple instruction like

      a = 2*b

  might be encoded by one compiler as a multiplication by 2, by another compiler as an addition of b to itself, and by still another compiler as a shift-the-bits-left operation. Consequently, the same algorithms compiled by different compilers (or with different settings on the same compiler) may wind up giving very different results when you try to compare them (see the small illustration following this list).

- differences in the choice of input data that we actually use for an experiment:

  Finally, we have the problem of choosing input data. The choice of input data may very well be biased (even if unintentionally so) toward one algorithm or the other. Consequently, different people choosing different input data are likely to come up with very different results from this experiment.
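As a source-level stand-in for the compiler's choice (real compilers make this decision at the machine-instruction level, so this is only an illustration), here are the three encodings of a = 2*b written out by hand in C++. All three compute the same value for an integer b:

int doubleByMultiply(int b) { return 2 * b; }  // multiplication by 2
int doubleByAdd(int b)      { return b + b; }  // addition of b to itself
int doubleByShift(int b)    { return b << 1; } // shift the bits left one place

Which of these runs fastest depends on the processor, which is exactly why a compiler's choice can favor the timing of one algorithm over another's.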
3.1 Better than Timing
Now, how are we going to overcome these problems?
- differences in code quality:

  To minimize the impact of differences in code quality, we will learn to use order-of-magnitude analysis to ask the question: "How quickly does the running time of each algorithm increase as we increase the problem size?"

  The answer to this question is typically determined more by the choice of algorithm than by the incidental details of coding it up, and the analysis is mathematical rather than experimental, which further focuses attention on the algorithm rather than on the details of the code.

- differences in the machines on which we are running the programs:

  This same analysis helps overcome differences among machines, because it is applied to the algorithm rather than to specific code running on a specific machine.

- differences in the compiler and compiler settings that we are using:

  It also helps with differences among compilers, because we never deal with any particular compiler's encoding of the algorithm.

- differences in the choice of input data that we actually use for an experiment:

  The choice of input data remains a problem. Depending upon what kinds of inputs we choose to analyze, we might wind up with very different results.
We will deal with this in two ways:
- We can ask the question: "For all possible inputs to this program, what would be the average behavior?" (That is, which algorithm, on average, would run faster?)

- More often, the question we will ask is: "For all possible inputs to this program, which input gives us the worst behavior?" (That is, which input might make us wait the longest before the processing completes?)
In making our choices, we tend to use worst-case analysis more often than average-case analysis because

- it is often easier to do, and

- in many circumstances, it better expresses users' frustration at waiting and waiting for an actual answer to come out.

The sketch following this list illustrates the worst-case/average-case distinction for a simple dictionary lookup.
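As a small illustration (not from the original notes): suppose Version 1 searched its dictionary linearly. A misspelled word is then the worst case, since it forces a scan of the entire dictionary, while a correctly spelled word stops partway through. The dictionary contents and names below are made up for the example.

#include <iostream>
#include <string>
#include <vector>

// Linear search that also counts the comparisons it performs.
bool contains(const std::vector<std::string>& dictionary,
              const std::string& w, long& comparisons)
{
    for (const auto& entry : dictionary) {
        ++comparisons;
        if (entry == w)
            return true;      // found: we stop partway through
    }
    return false;             // absent: we scanned the whole dictionary
}

int main()
{
    std::vector<std::string> dictionary =
        {"apple", "banana", "cherry", "date", "elderberry"};
    long comparisons = 0;
    contains(dictionary, "cherry", comparisons);  // a correctly spelled word
    std::cout << "present word: " << comparisons << " comparisons\n";
    comparisons = 0;
    contains(dictionary, "zzyzx", comparisons);   // worst case: a misspelling
    std::cout << "absent word:  " << comparisons << " comparisons\n";
    return 0;
}

Here the absent word costs five comparisons (the full dictionary), while the present word costs only three; averaged over all words actually in the dictionary, a successful search examines about half of it.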