Review of Machine Precision
Thomas J. Kennedy
In Chapter 1, we will cover four main topics:
- Finite Precision
- Arithmetic Error
- Cancellation Error
- Condition of a Problem
1 A Familiar Problem
The first topic, Finite Precision, is one you have encountered before. Consider a short C++ code snippet:
#include <iostream>

int main(int argc, char** argv)
{
    double one_third = 1.0 / 3.0;
    double one = one_third + one_third + one_third;
    std::cout << one << "\n";
    return 0;
}
In a math course, with infinite precision, we would end up with $1$ as our output. However, from previous coursework we know that `one_third` cannot store $\frac{1}{3}$ exactly, so the computed result is only guaranteed to be close to $1$, not exactly $1$.
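How visible the rounding is depends on the values involved. As a minimal sketch, assuming IEEE 754 `double` arithmetic, the functions below add `0.1` ten times and measure how far the result lands from 1. The names `sum_of_ten_tenths` and `ten_tenths_error` are illustrative and not part of the original snippet.

```cpp
#include <cmath>

// Adds 0.1 to an accumulator ten times. In exact arithmetic the result is 1.
double sum_of_ten_tenths()
{
    double sum = 0.0;
    for (int i = 0; i < 10; ++i) {
        sum += 0.1;  // 0.1 has no finite binary expansion, so it is stored rounded
    }
    return sum;
}

// Distance between the computed sum and the exact answer, 1.
double ten_tenths_error()
{
    return std::abs(1.0 - sum_of_ten_tenths());
}
```

On a typical platform the sum falls just short of 1, so comparing it to `1.0` with `==` fails even though the error is tiny.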
Let us imagine that `one_third` and `one` store values in base-10 with user-specified precision. Consider this table:
Precision | ||||||
---|---|---|---|---|---|---|
Variable | 1 | 2 | 3 | 4 | 8 | $\infty$ |
one_third | 0.3 | 0.33 | 0.333 | 0.3333 | 0.33333333 | 0.33333333$\ldots$ |
one | 0.9 | 0.99 | 0.999 | 0.9999 | 0.99999999 | 1.0 |
This table should be reminiscent of a math course, especially if the phrase “Round to n places.” still reverberates in your mind.
2 Estimating Machine Precision
When writing code, you have used `float` and `double` in C++ and, possibly, `f32` and `f64` in Rust. There is a well-known algorithm attributed to Cleve Moler for estimating machine precision.
Example 1: Precision Estimation Pseudocode

    let a ← 4.0 / 3.0
    let b ← a - 1
    let c ← b + b + b
    return |1 - c|
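The pseudocode translates almost line for line into C++. The sketch below is one possible rendering, not a canonical implementation. The names `estimate_precision` and `matches_numeric_limits` are hypothetical, and the comparison against `std::numeric_limits<T>::epsilon()` assumes IEEE 754 arithmetic with no extended-precision intermediates.

```cpp
#include <cmath>
#include <limits>

// Estimates machine precision for a floating-point type T
// by following the four pseudocode steps.
template <typename T>
T estimate_precision()
{
    const T a = T(4) / T(3);    // 4/3 cannot be stored exactly in binary
    const T b = a - T(1);       // roughly 1/3, carrying the rounding error of a
    const T c = b + b + b;      // roughly 1, off by the accumulated error
    return std::abs(T(1) - c);  // the leftover error is the estimate
}

// Checks the estimate against the value reported by the standard library.
// Assumption: IEEE 754 hardware; this is not guaranteed on every platform.
template <typename T>
bool matches_numeric_limits()
{
    return estimate_precision<T>() == std::numeric_limits<T>::epsilon();
}
```

Calling `estimate_precision<float>()` and `estimate_precision<double>()` side by side shows how the estimate shrinks as the mantissa grows.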
Let us stick with our example base-10 computer (i.e., pencil and paper). Let us select a few precisions (i.e., number of mantissa digits).
Mantissa Digits | |||||||
---|---|---|---|---|---|---|---|
Step | 1 | 2 | 4 | $\ldots$ | 128 | $\ldots$ | $\infty$ |
let a ← 4.0 / 3.0 | 1.3 | 1.33 | 1.3333 | $\ldots$ | $\ldots$ | $\ldots$ | 1.3333$\ldots$ |
let b ← a - 1 | 0.3 | 0.33 | 0.3333 | $\ldots$ | $\ldots$ | $\ldots$ | 0.3333$\ldots$ |
let c ← b + b + b | 0.9 | 0.99 | 0.9999 | $\ldots$ | $1 - 1\times 10^{-128}$ | $\ldots$ | 1 |
return $\lvert 1 - c \rvert$ | 0.1 | 0.01 | 0.0001 | $\ldots$ | $1\times 10^{-128}$ | $\ldots$ | 0 |
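The pencil-and-paper computation can be mimicked with scaled integers. The sketch below assumes that a value with $n$ decimal places is stored as an integer multiple of $10^{-n}$ and that division truncates rather than rounds. The names `scale_for`, `DecimalTrace`, and `moler_decimal` are hypothetical.

```cpp
#include <cstdint>

// Returns 10^digits. Assumption: 0 <= digits <= 17, so that 4 * 10^digits
// still fits in a 64-bit signed integer.
std::int64_t scale_for(int digits)
{
    std::int64_t scale = 1;
    for (int i = 0; i < digits; ++i) {
        scale *= 10;
    }
    return scale;
}

// Every field is expressed in units of 10^-digits.
struct DecimalTrace {
    std::int64_t scale;  // represents 1
    std::int64_t a;      // 4/3 truncated to `digits` decimal places
    std::int64_t b;      // a - 1
    std::int64_t c;      // b + b + b
    std::int64_t error;  // |1 - c|
};

DecimalTrace moler_decimal(int digits)
{
    const std::int64_t scale = scale_for(digits);
    const std::int64_t a = (4 * scale) / 3;  // integer division truncates
    const std::int64_t b = a - scale;
    const std::int64_t c = b + b + b;
    const std::int64_t error = scale - c;    // c never exceeds scale here
    return {scale, a, b, c, error};
}
```

For any digit count in range, `error` comes out as one unit in the last place, which corresponds to $1 \times 10^{-n}$ for $n$ digits.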
Using a paper-and-pencil approach, we would not write out an infinite number of decimal places; we would use fractions. If we wanted to use fractions for this algorithm, the C++ `std::ratio` library (which works at compile time) or the Python `fractions` module are the most readily available. However, I would end up playing with the Rust `fraction` crate.
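As a rough illustration of the fraction-based approach, the sketch below runs the same four steps with `std::ratio`, whose arithmetic is exact and evaluated entirely at compile time. The namespace `exact_moler` and the alias names inside it are hypothetical.

```cpp
#include <ratio>

namespace exact_moler {

using one = std::ratio<1>;
using a = std::ratio<4, 3>;                         // exactly 4/3
using b = std::ratio_subtract<a, one>;              // exactly 1/3
using c = std::ratio_add<std::ratio_add<b, b>, b>;  // exactly 1
using error = std::ratio_subtract<one, c>;          // exactly 0

// std::ratio results are reduced to lowest terms, so these checks are exact.
static_assert(b::num == 1 && b::den == 3, "b should be 1/3");
static_assert(c::num == 1 && c::den == 1, "c should be 1");
static_assert(error::num == 0, "no rounding error remains");

}  // namespace exact_moler
```

With exact fractions there is nothing left to measure: the estimate is 0, matching the $\infty$ column of the table.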
3 Scientific Notation
In this discussion, we have used base-10 scientific notation (i.e., scientific notation as used in math and science). We need not be so restrictive. We can generalize scientific notation to any base, $\beta$, where $ \beta > 1$. Let us define some notation:
$$ x \times \beta^e $$
In this expression:
- $x$ represents some real number.
- $\beta$ represents the base (e.g., 2 or 10).
- $e$ represents an integer exponent (positive, negative, or zero).
If we were working in base 10, we could have:
$$ 0.5 \times 10^{0} = 0.5 = \frac{1}{2} $$
In base-2, this would become:
$$ 1_{2} \times 2^{-1} = 0.1_{2} $$
Note the subscript $2$. When working with bases other than $10$, such subscripts specify the base.
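To see how the same value reads in different bases, the sketch below writes out the leading fractional digits of a number in base $\beta$ by repeatedly multiplying by $\beta$ and peeling off the integer part. The function name `fraction_digits` is hypothetical, and the sketch assumes $0 \le x < 1$ and $2 \le \beta \le 10$ so that each digit is a single character.

```cpp
#include <string>

// Returns "0." followed by the first `digits` fractional digits of x in `base`.
// Assumptions: 0 <= x < 1 and 2 <= base <= 10.
std::string fraction_digits(double x, int base, int digits)
{
    std::string out = "0.";
    for (int i = 0; i < digits; ++i) {
        x *= base;                              // shift one digit left of the point
        const int digit = static_cast<int>(x);  // the integer part is the next digit
        out += static_cast<char>('0' + digit);
        x -= digit;                             // keep only the remaining fraction
    }
    return out;
}
```

For example, `fraction_digits(0.5, 10, 1)` and `fraction_digits(0.5, 2, 1)` give the base-10 and base-2 spellings of $\frac{1}{2}$ used above.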